GitHub recently released Copilot, an artificial intelligence powered assistant intended to help software developers with coding suggestions. Since its release as a technical preview in June, software developers and cyber security professionals have been quick to express concerns about shoddy code, open source licensing issues, disclosed secrets and code ripe for exploitation. A recently released academic paper adds to those concerns, finding that roughly 40% of the Copilot outputs it tested contained security vulnerabilities. Let's dive in.
What is Copilot?
Copilot is a collaboration between Microsoft’s GitHub and OpenAI. OpenAI developed Codex, a neural network trained on natural-language text as well as public code repositories on GitHub. Codex is the engine behind Copilot and provides the AI assistance for developers.
How does Copilot work?
Developers sign up for Copilot and install the extension in Microsoft’s VS Code application. Once it is installed, a developer writes a comment describing what they want from Copilot. This comment is referred to as a prompt. Within seconds, Copilot analyzes the prompt and produces a suggested snippet of code for the developer to use or modify.
Here is an example of a prompt from Copilot’s website:
// Determine whether sentiment of text is positive
//use a web service
The Academics' “Empirical Cyber Security Evaluation” of Copilot
A team from New York University's Tandon School of Engineering set out to evaluate the quality of Copilot's outputs in the context of application security. They created 89 test scenarios to assess three things: Copilot’s ability to avoid producing vulnerable code, measured against MITRE’s Common Weakness Enumeration (CWE) Top 25; whether the wording of a prompt would lead Copilot to introduce SQL injection vulnerabilities; and Copilot’s capabilities across a variety of mainstream and niche languages.
Across the 89 test scenarios, the team found that 40% of tests resulted in code with vulnerabilities. They also found that Copilot struggled with less popular programming languages and made trivial mistakes that even a junior developer would not make.
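To make the SQL injection risk concrete, here is a minimal sketch in Python. It is not taken from the paper or from Copilot's output. The table, the function names and the attack string are all hypothetical and chosen only for illustration. It contrasts a query built by pasting user input into the SQL text with a parameterized query.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "alice@example.com"), ("bob", "bob@example.com")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input is pasted directly into the SQL text,
    # so a crafted name can change the meaning of the query.
    query = f"SELECT email FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Safer: the ? placeholder passes the input as data, never as SQL.
    query = "SELECT email FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    attack = "nobody' OR '1'='1"  # hypothetical malicious input
    print("unsafe:", find_user_unsafe(conn, attack))
    print("safe:  ", find_user_safe(conn, attack))
```

With the hypothetical attack string, the unsafe version returns every row in the table while the parameterized version returns nothing. A suggested snippet that looks like the first function is the kind of output a developer should catch in review.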
Copilot’s website shows several examples of this working well. In the real world, however, there have been some interesting examples of developers’ tests going sideways. Both the developer community and the academic team have provided several examples where Copilot was asked to produce something fairly simple and went way out into left field. In one example, the academic team asked Copilot to generate “3 random floats”. Instead of producing something elegant, Copilot produced an overly complex snippet that would result in a buffer overflow and unstable software.
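The generated snippet is not reproduced here, so the following is only a sketch of one common way float-handling code in C ends up overflowing: formatting a value with %f into a fixed-size character buffer. Whether this matches what Copilot actually produced is an assumption. The 32-byte buffer size and the helper names are hypothetical. Python's "%f" follows the same formatting rules as C's printf, so it can be used to measure how long such a string can get.

```python
import sys

# Hypothetical buffer size, chosen only for illustration.
ASSUMED_BUFFER_SIZE = 32


def formatted_length(value: float) -> int:
    """Return how many characters "%f" produces for value."""
    return len("%f" % value)


def fits_in_buffer(value: float, buffer_size: int = ASSUMED_BUFFER_SIZE) -> bool:
    # A C string also needs one extra byte for the terminating NUL.
    return formatted_length(value) + 1 <= buffer_size


if __name__ == "__main__":
    for value in (0.5, 12345.678, sys.float_info.max):
        print(formatted_length(value), fits_in_buffer(value))
```

A small value such as 0.5 formats to 8 characters and fits easily. The largest double formats to more than 300 characters and would overrun a 32-byte buffer many times over. That is why C code that formats floats needs either a bounded call such as snprintf or a buffer sized for the worst case.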
One of the main reasons Copilot has had such a lackluster start is that all of its intelligence is based on a non-curated training set from GitHub’s public repos. As you can imagine, there are 13 years of GitHub projects, many of them left to rot and chock-full of legacy methods, bad habits and vulnerabilities.
When I first read about the training set Copilot was based on, I was surprised. More data is usually better for machine learning, but the data needs to be good, or at least tagged as bad, so the algorithms can learn what good looks like. Instead, it seems Copilot has been learning unsupervised, which in my opinion means it has become an AI system built to proliferate, not eliminate, insecure and bad coding practices.
GitHub fully admits Copilot is not ready for prime time and will not be replacing developers anytime soon. They see it as a maturing tool that helps developers generate boilerplate code and suggests third-party libraries or frameworks to reduce time spent, not as something companies will use to produce code without heavy supervision from developers.
AI Sucks… For Now
There are some nifty technologies out there leveraging ML/AI to expedite the production of work. In fact, in researching this subject, I tried an AI copywriting tool called Jarvis.AI, based on the GPT-3 engine, to write this article. However, after a couple of hours of learning how to interact with the engine and wrestling with it to produce something somewhat coherent, I gave up and went back to writing this article on my own.
My experience and frustrations with Jarvis.AI are in line with many developers’ sentiments about Copilot. Jarvis.AI was very good at producing the unnuanced drivel you’d find online in clickbait, lifestyle and consumer tech blogs. However, when asked to produce content on a complex or nuanced subject, Jarvis.AI really struggled and acted more like a jargon generator than a tool I could use to make my life easier.
These are early days for consumer-accessible AI technologies. As training sets become better curated and learning algorithms improve, we will see these tools become more useful. For now though, I’ll stick with coffee.
Here is some light to heavy reading on GitHub’s Copilot:
The Register Article: https://www.theregister.com/2021/08/25/github_copilot_study/
Academic paper for a deeper dive: https://arxiv.org/pdf/2108.09293.pdf