Meta admits using pirated books to train AI, but won't pay for it

Alfonso Maruccia

A hot potato: Training advanced AI models on copyrighted material has become a controversial issue, and many companies now face legal challenges from authors and media organizations. Meta has admitted to using the well-known "pirate" dataset Books3, yet the company is unwilling to compensate the affected writers.

A group of authors filed a lawsuit against Meta, alleging the unlawful use of copyrighted material in the development of its Llama 1 and Llama 2 large language models. In its response to writer and comedian Sarah Silverman, author Richard Kadrey, and the other rights holders spearheading the legal action, Meta acknowledged that its LLMs were trained on copyrighted books.

Meta has admitted to using the Books3 dataset, among many other materials, to train the Llama 1 and Llama 2 LLMs. Books3 is a well-known plaintext collection of more than 195,000 books totaling nearly 37GB. AI researcher Shawn Presser assembled the archive in 2020 to provide a better data source for improving machine learning models.

The widespread availability of the Books3 dataset has led to its extensive use in AI training by many researchers. Big Tech companies, including Meta, have used Books3 and other contentious datasets to build their commercial AI products. In a related dispute, The New York Times has sued OpenAI and Microsoft for allegedly using millions of copyrighted articles to develop the ChatGPT chatbot.

OpenAI has openly declared that training AI models without using copyrighted material is "impossible," arguing that judges and courts should dismiss compensation lawsuits brought by rights holders. Echoing this stance, Meta admitted to using Books3 but denied any intentional misconduct.

Meta has acknowledged using parts of the Books3 dataset but argued that its use of copyrighted works to train LLMs did not require "consent, credit, or compensation." The company rejects claims that it infringed the plaintiffs' "alleged" copyrights, contending that any unauthorized copying of copyrighted works in Books3 should be considered fair use.

Furthermore, Meta disputes whether the legal action can proceed as a class-action lawsuit and refuses to provide any monetary "relief" to the suing authors or others affected by the Books3 controversy. The dataset, which includes copyrighted material sourced from the pirate site Bibliotik, was targeted in 2023 by the Danish anti-piracy group Rights Alliance, which demanded a ban on digital archiving of Books3 and has used DMCA notices to enforce takedowns.


 
"OpenAI has openly declared that training AI models without using copyrighted material is impossible,"
That's not a defense but an admission of guilt. It should be treated as such.

Also, it's time to hold the leaders of large companies criminally liable for such willful disregard of others' rights, especially when it's done at such a massive scale.
 
"OpenAI has openly declared that training AI models without using copyrighted material is impossible,"
That's not a defense, but the admission of guilt. It should be treated as such.

Also it's time to hold leaders of large companies criminally liable for such willful disregard of others' rights, especially when done at such a massive scale.
Well, I do agree that it is impossible to train AI without the use of copyrighted works, but whether or not that constitutes fair use is up for debate. Also, I believe the DMCA gives copyright holders too much power, and modern copyright law is excessive and outdated.
 
"It's impossible to do this without breaking the law" - therefore we should be allowed to do this.

It's like saying "it's impossible to drink and drive without breaking the law, therefore we should be allowed to drink and drive".

Also, it isn't impossible to do - you just have to pay all the rights holders for the material you use to build these models rather than stealing it. That's expensive and difficult, I expect, but too bad. I think these LLMs should, by law, have to fully disclose all the data they were built using.
 
"It's impossible to do this without breaking the law" - therefore we should be allowed to do this.

It's like saying "its impossible to drink and drive without breaking the law, therefore we should be allowed to drink and drive".

Also it isn't impossible to do, you just have to pay all the rightsholders for the material you use to build these models rather than stealing it. That's expensive and difficult I expect, but too bad. I think these LLM's should, by law, have to fully disclose all the data they were built using.
Not to mention that they are possibly in the process of greatly reducing or eliminating many jobs held by the very people they steal from. This is hilarious on so many levels.
And above all, I've started to doubt that a lot of people have any idea what they are actually doing.
This stuff could do something amazing for humanity. It still needs and must have limits.
 
Unfortunately, the out-of-control mega-corporations that now rule America, and which the broken US legal and political system seems completely unable to police, have snapped it all up and are now going to use it for all the wrong purposes. MS is bad, but Google and Facebook - that's some next-level bad news for us all.
 
And yet people will still say we have a capitalism-based economy, when in fact we have a corporatist economy, and corporatism always leads to fascism - and that is exactly what we have with the corporations and the government. We are not dealing with a fair market, and laws are not created equal.
 
It's bad when someone does it to Meta (much of that data isn't actually Meta's, but that's a different argument); when Meta does it to someone else, it's A-OK.

OpenAI's justification is even more laughable. Take those billions that are getting sunk into the company and pay the people who make it possible for you to make said money.
 
I openly declare that, in my opinion, "living the lifestyle I would like to live is impossible without pirating money," so I presume on that basis Meta won't mind if I hack their systems and pilfer their accounts?
 
How is Meta (Facebook) stealing a surprise to anyone? They've been doing it for over a decade with everyone's personal info and data.
 