Anthropic destroyed millions of physical books to train its AI, court documents reveal

Daniel Sims

WTF?! Generative AI has already faced sharp criticism for its well-known issues with reliability, its massive energy consumption, and the unauthorized use of copyrighted material. Now, a recent court case reveals that training these AI models has also involved the large-scale destruction of physical books.

Buried in the details of a recent split ruling against Anthropic is a surprising revelation: the generative AI company destroyed millions of physical books by cutting off their bindings and discarding the remains, all to train its AI assistant. Notably, this destruction was cited as a factor that tipped the court's decision in Anthropic's favor.

To build Claude, its language model and ChatGPT competitor, Anthropic trained on as many books as it could acquire. The company purchased millions of physical volumes and digitized them by tearing out and scanning the pages, permanently destroying the books in the process.

Furthermore, Anthropic has no plans to make the resulting digital copies publicly available. This detail helped convince the judge that digitizing and scraping the books constituted sufficient transformation to qualify under fair use. While Claude presumably uses the digitized library to generate unique content, critics have shown that large language models can sometimes reproduce verbatim material from their training data.

Anthropic's partial legal victory now allows it to train AI models on copyrighted books without notifying the original publishers or authors, potentially removing one of the biggest hurdles facing the generative AI industry. A former Meta executive recently admitted that AI would die overnight if required to comply with copyright law, likely because developers wouldn't have access to the vast data troves needed to train large language models.

Still, ongoing copyright battles continue to pose a major threat to the technology. Earlier this month, the CEO of Getty Images acknowledged the company couldn't afford to fight every AI-related copyright violation. Meanwhile, Disney's lawsuit against Midjourney – where the company demonstrated the image generator's ability to replicate copyrighted content – could have significant consequences for the broader generative AI ecosystem.

That said, the judge in the Anthropic case did rule against the company for partially relying on libraries of pirated books to train Claude. Anthropic must still face a copyright trial in December, where it could be ordered to pay up to $150,000 per pirated work.

Not sure about that, as the book was converted from a physical format to a digital one during the training process. In theory, a book can be used by one person at a time, while an AI trained on a book can be used by however many users the server can handle (millions of users). I'm not sure fair use applies, or is in fact fair to the copyright holder. It sounds like there should be some licensing-type agreement, like artists have with Spotify.

One book sold and destroyed doesn't seem to equal potentially millions upon millions of requests for information on a book that provide zero revenue for the copyright holder.
 
You're looking at it wrong.
If a person was to read and learn from a book (and that book was only read by them), said person could theoretically talk about that book to millions of people.
It's the same concept here. Relaying the information learned from a source is fair use for humans (as long as it's "transformative"), and would be the same for AI.
 
It’s “new” tech, and old laws don’t apply without reinterpretation.

That reinterpretation simply depends on the agendas of those making the rulings. If we value AI over copyright owners, then the AI companies win… if we decide that AI is no longer useful or desirable, you’ll see the copyright holders win.

I’d be betting on AI…
 
It's never going to be illegal for AI to train the same way humans can just because it does so faster and better. You wouldn't see a person put on trial for training themselves on too many books.


They bought the books. They can do literally whatever they want with them.
 