OpenAI to regulators: Training AI models without copyrighted material is "impossible"

Cal Jeffrey

A hot potato: Artificial intelligence researchers used to work in relative peace. Now that companies like OpenAI, Microsoft, and Google are commercializing generative AI, however, the use of copyrighted training material has come under fire. Lawmakers in the UK are asking for information on the issue, and OpenAI recently responded.

OpenAI recently told members of the House of Lords that it is "impossible" to train large language models (LLMs) without using copyrighted material. The claim was in response to the UK's Communications and Digital Select Committee, which is looking into the legal issues involving current AI systems.

Consumer applications like ChatGPT and Dall-E are built on OpenAI's GPT family of models. Since 2018, OpenAI has trained these models on billions of samples of writing, art, and photographs, mostly scraped from the internet. The filtered text dataset behind GPT-3 measured about 570GB; OpenAI has not disclosed the size of the dataset behind GPT-4, released in March 2023. Some of the training material, such as websites and books, is without question protected work. However, copyright law goes far beyond books and websites.

"Because copyright today covers virtually every sort of human expression – including blogposts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today's leading AI models without using copyrighted materials," OpenAI's submission to the House of Lords reads.

Indeed, under current law, a copyright does not even have to be registered to be protected. A work is copyrighted the instant its creator fixes it in a tangible medium, whether that medium is a digital file, a video, a book, a blog post, or a forum comment. All the same copyright protections apply.

This issue wasn't much of a problem in years past because machine learning research was strictly academic. Training was widely considered fair use, and nobody bothered researchers. Now that LLMs have gone commercial, however, they have entered a gray area of the fair use doctrine.

On rare occasions, ChatGPT "regurgitates" copyrighted snippets verbatim, which is cut-and-dried infringement and a problem OpenAI says it is working hard to eliminate. However, that issue is not directly related to what happens when researchers train an LLM on protected material. During training, the system uses the works, copyrighted or otherwise, to learn how language is structured and used, so that it can produce original content that humans can understand.
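
To make that distinction concrete, here is a toy sketch of next-token prediction (my own illustration, nowhere near OpenAI's actual pipeline): a bigram model that "trains" on text by counting which word tends to follow which. What it retains is statistics about the language, not a copy of the source. A real LLM is vastly more capable and, unlike this toy, can still memorize rare passages, which is where the regurgitation problem comes from.

    # Toy illustration only: the model stores word-pair statistics, not the text.
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat . the dog sat on the rug ."  # stand-in for scraped text
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1  # tally how often nxt follows prev

    def predict(prev_word):
        # Most likely next word given the previous one.
        return counts[prev_word].most_common(1)[0][0]

    print(predict("the"))  # -> "cat": a statistical guess, not a retrieved passage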

Unfortunately, AI training is a new frontier, and copyright law has no provisions that address it. So allegedly infringed parties have begun bringing cases to the courts, while companies like OpenAI and Microsoft respond, in effect: "No. Training falls under fair use, like it always has."

"Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents," OpenAI related in a blog post this week. "We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness."

Despite believing that the fair use doctrine covers LLM training, OpenAI provides a simple opt-out that blocks its web crawler, which The New York Times used in August last year. OpenAI's tools can no longer access the NYT website, yet the newspaper filed a lawsuit in December.
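
For reference, the opt-out works at the crawler level: OpenAI documents that its GPTBot crawler honors the robots.txt convention, so a publisher can refuse it with two lines in the robots.txt file at the site's root. This standard directive is reportedly how the NYT blocked it:

    User-agent: GPTBot
    Disallow: /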

"We support journalism, partner with news organizations, [but] believe The New York Times lawsuit is without merit," it said.

OpenAI faces similar lawsuits from several published authors, including high-profile comedian Sarah Silverman. It's an issue that the courts cannot handle alone. The US Copyright Office, along with lawmakers, needs to clearly define the role AI training plays in copyright rules.


 
As much as I use ChatGPT, I don't agree with OpenAI. I am not even really a fan of how many things are copyrighted. However, these AI companies need to suck it up and pay for the data they use. If they are going to sell AI services built on the work of others, then those original authors deserve some form of payment. I don't know the best way to do it, but they need to at least put some effort into it. Just saying it's "impossible" is trying to have your cake and eat it too.
 
These same scumbags want the data for free, then turn around and demand payment for using their AI services. For example, Adobe only gives Photoshop subscribers 50 generative AI credits per month and refuses to allow local generation. I saw one new company the other day that wants a minimum of $99 per month for 100 image generations.
 
At least they admit what they do. I wonder what companies outside the Western world will do. I mean, yes, they will keep using EVERYTHING they want to train their AI, and thus get ahead. So, what's next? Will their AI tools be banned in Western countries for copyright infringement, or what?
 
Text and data mining has already been held to fall under fair use, and this is basically a form of text and data mining.
 
OpenAI recently told members of the House of Lords that it is "impossible" to train large language models (LLMs) without using copyrighted material.

What they are really saying:

"We don't want to pay people for their work because we are cheap as ****. We want our search engine AI to be able to parrot as much information back as possible to make it sound like it knows what it is doing. After that, we want to charge people to use it by incorporating it into everything else we sell and call it 'AI' so we can get on this money train!"
 
This is not so hard. Anybody wanting to use copyrighted materials for AI training needs to negotiate payment, just like for any other use. It does not matter that bots scrape copyrighted data from behind paywalls for AI to use.
 