GitHub Copilot slammed with the first class-action against "unfair" AI

Alfonso Maruccia

A hot potato: The first class-action lawsuit against a machine-learning system has been filed in San Francisco federal court. The plaintiffs are calling on potentially millions of GitHub users to assert their rights against Copilot, an AI that allegedly suggests new code in violation of open-source licenses and other copyright protections.

Lawyers have filed a class-action lawsuit over the AI system behind GitHub Copilot, a feature designed by Microsoft and OpenAI to help programmers write better code faster. According to the complaint, Copilot tramples on the rights of "possibly millions" of GitHub users, (ab)using countless lines of code without proper permission and against the terms of many open-source licenses.

Copilot is a machine learning algorithm developed by OpenAI – the same company behind the image-generation system DALL-E – that suggests code and even entire functions in real time, right from the user's preferred code editor or IDE. GitHub Copilot has been trained "on billions of lines of code" and is seemingly capable of turning natural-language prompts into coding suggestions across dozens of programming languages.
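
For a sense of what that looks like in practice, here is a hypothetical sketch of the kind of completion Copilot produces: the developer types a plain-English comment, and the assistant proposes a full function. The prompt and the suggested code below are illustrative assumptions, not actual Copilot output.

```python
# Prompt the developer types (a natural-language comment):
# "return the n most common words in a text file"

from collections import Counter


def most_common_words(path: str, n: int = 10) -> list[tuple[str, int]]:
    """Suggested completion: count word frequencies and return the top n."""
    with open(path, encoding="utf-8") as f:
        words = f.read().lower().split()
    return Counter(words).most_common(n)
```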

The Copilot algorithm has been a success so far, with nearly 30 percent of new code hosted on GitHub written with the AI's assistance. Behind Copilot's public approval, though, is what the lawyers describe as the systematic violation of the "legal rights of a vast number of creators who posted code or other work under certain open-source licenses on GitHub."

The class-action lawsuit says that a set of 11 popular open-source licenses (MIT, GPL, Apache, and others) requires attribution of the author's name and copyright notice, and that Copilot provides neither. Beyond authorship attribution, the suit states, GitHub's AI has violated the service's own terms of service and privacy policies, DMCA § 1202 (which forbids the removal of copyright management information), the California Consumer Privacy Act, and "other laws giving rise to related legal claims."
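
To make the attribution complaint concrete, here is a minimal sketch of the notice those licenses require to travel with the code. Everything below is hypothetical (the author name, year, and function are placeholders); the suit's core claim is that Copilot's suggestions reproduce code while dropping exactly this kind of header.

```python
# Copyright (c) 2021 Jane Developer (placeholder name and year)
# SPDX-License-Identifier: MIT
#
# The MIT license requires that this copyright notice and permission
# notice be included in all copies or substantial portions of the
# software. Notices like these are the "copyright management
# information" that DMCA § 1202 forbids removing.

def quicksort(items):
    """Ordinary licensed code that could end up in a training set."""
    if len(items) <= 1:
        return items
    pivot, *rest = items
    return (quicksort([x for x in rest if x < pivot])
            + [pivot]
            + quicksort([x for x in rest if x >= pivot]))
```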

The plaintiffs present their lawsuit as the first step in a long journey and, by their account, the first attempt to legally challenge the training and output of AI systems in the US. It will not be the last, the lawyers say, because "AI systems are not exempt from the law" and the creators of these technologies must remain accountable.

"If companies like Microsoft, GitHub, and OpenAI choose to disregard the law," the class-action continues, "they should not expect that we the public will sit still." AI needs to be fair and ethical for everyone, otherwise it will just become "another way for the privileged few to profit from the work of the many."


 
First of all, the reason copyright exists is "to promote the progress of science and the useful arts," so it cannot impede education (the process of adopting and copying ideas of proven value from others), because education promotes the progress of science and the useful arts as well.

And how will they prove that the neural network saw and learned from their own code? If they can't prove it, they have no legitimate interest in the case.
How do they know, and how will they prove, that it wasn't trained on internal or Boost-licensed code that doesn't require the creator's name to be reproduced?
 
The Copilot algorithm has been a success so far, with nearly 30 percent of new code hosted on GitHub written with the AI's assistance
So the class-action lawyers are basically hoping GitHub users will sue themselves?

Also, I had no idea this was seeing so much use; it makes me feel like I should check it out...

 
Anyone want to sue me? I've written plenty of code after reading code that others have written. Since I can only learn from code that's available to be read, I too am systematically "ripping off" open-source code. As a bonus, I'm not attributing every source I ever learned from.
 
AI is here to help, and lawyers are here to make money, especially if this drags on. AI is here to stay, and if it proves successful, then maybe license agreements will change in due time.
 
The Streisand effect is real. This lawsuit got me to install Copilot for the first time. I'm glad I did.

For fun, maybe GitHub could launch a follow-on AI that can automatically prepare class-action complaints against class-action lawyers for wasting our time and money.
 
I don't know if the plaintiffs have standing to sue, but I understand the concerns here. GitHub Copilot is not open source, so the FOSS licenses could have been violated here. You can't take copyleft software, tweak a few things, and then relicense your work as proprietary. Can you get away with it just because you used ML and blended millions of projects together at once? I'm not so sure you can, but then again, copyright law doesn't prohibit transformative uses, which this most certainly is. However, copyright does give almost unreasonable power to copyright holders: if infringement is alleged, the dispute follows a "guilty until you show fair use" paradigm.

Here's an excerpt from the FSF's call for whitepapers on this very subject (https://www.fsf.org/licensing/copilot/):

"Perhaps most importantly, in the USA, “fair use” is an affirmative defense to answer copyright infringement. In concrete terms, that means — particularly in cases where the circumstances are novel — a copyright holder brings an infringement lawsuit and then the alleged infringer shows in court that their actions met the relevant factors for “fair use” sufficiently. Frankly, we refuse to do these companies’ job for them. Copyleft activists need not tell Microsoft and GitHub why this isn’t “fair use”, rather, they need to tell us why training the model with copylefted code is “fair use” and prove that the trained model itself is not a “work based on” the GPL’d software."

On the other hand, a different paper states that it likely is fair use:

"There is a strong argument that GitHub’s use of the code repositories as training data is a transformative use. The original code was written to accomplish a particular purpose of the developer — say, to sort a list of elements, or to perform a mathematical calculation. GitHub uses the code for an entirely different purpose: to teach its AI how to generate new code based on a natural language description. An analogy may be found in a case in which students were required to submit their papers to an online plagiarism-detection service. After comparing a submitted paper to papers available on the Internet and those in its own database, the company would add the student’s paper to its database and use it when analyzing future submissions. The court found that the company’s use of the papers was highly transformative: the purpose of a paper is to convey its expressive content, while the purpose of the database is to detect plagiarism. A.V. ex rel. Vanderhye v. iParadigms, LLC, 562 F.3d 630 (4th Cir. 2009). Likewise, GitHub is not using the copied code to sort lists, but rather to train its AI to create code that will accomplish a particular purpose."

In the end, this lawsuit is a very interesting one, and many of us in the data science and software development communities will be extremely interested in how it plays out, assuming it gets past the boring legal question: do the plaintiffs have standing?
 
The Streisand effect is real. This lawsuit got me to install Copilot for the first time. I'm glad I did.

For fun, maybe GitHub could launch a follow-on AI that can automatically prepare class-action complaints against class-action lawyers for wasting our time and money.
On the bright side, those lawyers are going to be reading lines of code they have no clue about.
 
AI needs to be fair and ethical for everyone, otherwise it will just become "another way for the privileged few to profit from the work of the many."

It seems to be the opposite in this case.
 
New code "violations" will appear faster than they can be litigated, cascading into an era of super-code with no owners and no understanding of where it originated.
 
Typical... totally ignoring all the proprietary art they probably used to train DALL-E. How would you feel about someone using your work to make money without giving you a cut or even asking you if it was OK? If these machines were independent and sentient, then I might feel differently, but right now they are tools, just like YouTube/Instagram/etc. that slurp up original content and keep a foot on the neck of creators due to the power imbalance, meting out only what "they" feel is fair compensation. And apparently OpenAI thinks fair compensation is $0.00.
 