Meta admits using pirated books to train AI, but won't pay for it

Alfonso Maruccia · Jan 13, 2024

A hot potato: Training advanced AI models on proprietary material has become a controversial issue, and many companies now face legal challenges from authors and media organizations. Meta admitted to using the well-known "pirate" dataset Books3, yet the company is unwilling to compensate the writers.

A group of authors filed a lawsuit against Meta, alleging the unlawful use of copyrighted material in developing its Llama 1 and Llama 2 large language models. In its response to writer and comedian Sarah Silverman, author Richard Kadrey, and the other rights holders spearheading the legal action, Meta acknowledged that its LLMs were trained using copyrighted books.

Meta has admitted to using the Books3 dataset, among many other materials, to train its Llama 1 and Llama 2 LLMs. Books3 is a well-known plaintext collection of more than 195,000 books totaling nearly 37GB. The archive was created by AI researcher Shawn Presser in 2020 to provide a better data source for improving machine learning algorithms.

The widespread availability of the Books3 dataset has led to its extensive use in AI training by many researchers. Big Tech companies, including Meta, have used Books3 and other contentious datasets for their commercial AI products. In a related dispute, the New York Times has sued OpenAI and Microsoft for allegedly using millions of copyrighted articles to develop the ChatGPT chatbot.

OpenAI has openly declared that training AI models without using copyrighted material is "impossible," arguing that judges and courts should dismiss compensation lawsuits brought by rights holders. Echoing this stance, Meta admitted to using Books3 but denied any intentional misconduct.

Meta has acknowledged using parts of the Books3 dataset but argued that its use of copyrighted works to train LLMs did not require "consent, credit, or compensation." The company denies infringing the plaintiffs' "alleged" copyrights, contending that any unauthorized copies of copyrighted works in Books3 should be considered fair use.

Furthermore, Meta is disputing whether the legal action can be maintained as a class action lawsuit, and it refuses to provide any monetary "relief" to the suing authors or others involved in the Books3 controversy. The dataset, which includes copyrighted material sourced from the pirate site Bibliotik, was targeted in 2023 by the Danish anti-piracy group Rights Alliance, which demanded that digital archiving of the Books3 dataset be banned and has been using DMCA notices to enforce takedowns.

Permalink to story.

https://www.techspot.com/news/101507-meta-admits-using-pirated-books-train-ai-but.html

FF222 · Jan 13, 2024

"OpenAI has openly declared that training AI models without using copyrighted material is impossible,"
That's not a defense, but the admission of guilt. It should be treated as such.

Also it's time to hold leaders of large companies criminally liable for such willful disregard of others' rights, especially when done at such a massive scale.

yRaz · Jan 13, 2024

FF222 said:
"OpenAI has openly declared that training AI models without using copyrighted material is impossible,"
That's not a defense, but the admission of guilt. It should be treated as such.

Also it's time to hold leaders of large companies criminally liable for such willful disregard of others' rights, especially when done at such a massive scale.

Well, I do agree that it is impossible to train AI without the use of copyrighted works, but whether or not that constitutes fair use is up for debate. Also, I believe that the DMCA gives copyright holders too much power and that modern copyright law is excessive and outdated.

Thanthan · Jan 13, 2024

Ah yes. The classic type of fair use in which I take your stuff, make money on it, and don’t compensate you or even acknowledge you.

I’ve heard a lot of this very normal case of fair use.

Furriest · Jan 13, 2024

Rules for thee, but not for me.

loki1944 · Jan 13, 2024

Great to have the people programming "AI" lack any moral code

OortCloud · Jan 14, 2024

"It's impossible to do this without breaking the law" - therefore we should be allowed to do this.

It's like saying "it's impossible to drink and drive without breaking the law, therefore we should be allowed to drink and drive".

Also it isn't impossible to do, you just have to pay all the rightsholders for the material you use to build these models rather than stealing it. That's expensive and difficult I expect, but too bad. I think these LLMs should, by law, have to fully disclose all the data they were built using.

Uncle Al · Jan 14, 2024

And they say nobody is above the law ........ horse hockey!!!

toooooot · Jan 14, 2024

OortCloud said:
"It's impossible to do this without breaking the law" - therefore we should be allowed to do this.

It's like saying "it's impossible to drink and drive without breaking the law, therefore we should be allowed to drink and drive".

Also it isn't impossible to do, you just have to pay all the rightsholders for the material you use to build these models rather than stealing it. That's expensive and difficult I expect, but too bad. I think these LLMs should, by law, have to fully disclose all the data they were built using.

Not to mention that they are possibly in the process of greatly reducing or removing many jobs from the very people they steal from. This is hilarious on so many levels.
And above all, I'm starting to doubt that a lot of people have any idea what they are actually doing.
This stuff could do something amazing for humanity. It still needs and must have limits.

OortCloud · Jan 14, 2024

toooooot said:
This stuff could do something amazing for humanity. It still needs and must have limits.

Unfortunately the out-of-control mega-corporations that now rule America, and which the broken US legal and political system seems completely unable to police, have snapped it all up and are now going to use it for all the wrong purposes. MS are bad, but Google and Facebook - that's some next-level bad news for us all.

PEnnn · Jan 15, 2024

Considering how morally bankrupt some of those AI companies are, I fear their AI products will reflect their thuggish behavior.

saladbarsmash · Jan 15, 2024

OortCloud said:
Unfortunately the out-of-control mega-corporations that now rule America, and which the broken US legal and political system seems completely unable to police, have snapped it all up and are now going to use it for all the wrong purposes. MS are bad, but Google and Facebook - that's some next-level bad news for us all.

And yet people will still say we have a capitalist economy, when in fact we have a corporatist economy, and corporatism always leads to fascism, and that is exactly what we have with the corporations and the government. We are not dealing with a fair market, and laws are not created equal.

RandomWAN · Jan 15, 2024

It's bad when someone does it to Meta (much of that data isn't actually Meta's, but that's a different argument); when Meta does it to someone else, it's A-ok.

Taking Action Against Scraping for Hire | Meta

An update on what we're doing to protect people from scraping data on Facebook and Instagram.

about.fb.com

Meta sues for scraping Facebook and Instagram data

Meta announced it's suing the U.S. subsidiary of a Chinese tech company, accusing it of offering data-scraping services for Facebook and Instagram.

techcrunch.com

OpenAI's justification is even more laughable. Take those billions that are getting sunk into the company and pay the people who make it possible for you to make said money.

Glubernate · Jan 15, 2024

I think the term we are looking for here, is absolute p*ss take.

ChrisH1 · Jan 15, 2024

I openly declare that in my opinion 'living the lifestyle I would like to live is impossible without pirating money' so I presume on that basis Meta won't mind if I hack their systems and pilfer their accounts?

ZedRM · Jan 16, 2024

How is Meta (Facebook) stealing a surprise to anyone? They've been doing it for over a decade with everyone's personal info and data.
