OpenAI to regulators: Training AI models without copyrighted material is "impossible"

Cal Jeffrey

A hot potato: Artificial intelligence researchers used to work in relative peace. Now that companies like OpenAI, Microsoft, and Google are commercializing generative AI, however, the use of copyrighted training material has come under fire. Lawmakers in the UK are asking for information on the issue, and OpenAI recently responded.

OpenAI recently told members of the House of Lords that it is "impossible" to train large language models (LLMs) without using copyrighted material. The claim was in response to the UK's Communications and Digital Select Committee, which is looking into the legal issues involving current AI systems.

Consumer applications like ChatGPT and Dall-E are built on OpenAI's GPT family of models. Since 2018, OpenAI has trained these models on billions of samples of writing, art, and photographs, mostly scraped from the internet. The filtered text dataset behind GPT-3 measured about 570GB; OpenAI has not disclosed the size of the dataset behind GPT-4, released in March 2023. Some of the training material, such as websites and books, is without question protected work. However, copyright law goes far beyond books and websites.

"Because copyright today covers virtually every sort of human expression – including blogposts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today's leading AI models without using copyrighted materials," OpenAI's submission to the House of Lords reads.

Indeed, under current law, a copyright does not even have to be registered to be protected. A work is copyrighted the instant its creator fixes it in a tangible medium, whether that medium is a digital file, a video, a book, a blog post, or a forum comment. All the same copyright protections apply.

This issue wasn't much of a problem in years past because machine learning research was strictly academic. Training was widely considered fair use, and nobody bothered researchers. Now that LLMs have gone commercial, however, they have entered a gray area of the fair use doctrine.

On rare occasions, ChatGPT "regurgitates" copyrighted snippets verbatim, which is cut-and-dried infringement and a problem OpenAI says it is working hard to eliminate. However, that issue is not directly related to what happens when researchers train an LLM on protected material. During training, the system uses the works, copyrighted or otherwise, to learn how language is structured and used, so that it can produce original content that humans can understand.
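
To make that distinction concrete, here is a toy sketch of next-token prediction (my own illustration, nowhere near OpenAI's actual pipeline): a bigram model that "trains" on text by counting which word tends to follow which. What it retains is statistics about the language, not a copy of the source. A real LLM is vastly more capable and, unlike this toy, can still memorize rare passages, which is where the regurgitation problem comes from.

    # Toy illustration only: the model stores word-pair statistics, not the text.
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat . the dog sat on the rug ."  # stand-in for scraped text
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1  # tally how often nxt follows prev

    def predict(prev_word):
        # Most likely next word given the previous one.
        return counts[prev_word].most_common(1)[0][0]

    print(predict("the"))  # -> "cat": a statistical guess, not a retrieved passage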

Unfortunately, AI training is a new frontier, and copyright law has no provisions that address it. So allegedly infringed parties have begun bringing cases to the courts, while companies like OpenAI and Microsoft respond, in effect: "No. Training falls under fair use, like it always has."

"Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents," OpenAI related in a blog post this week. "We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness."

Despite believing that the fair use doctrine covers LLM training, OpenAI provides a simple opt-out that blocks its web crawler, which The New York Times used in August last year. OpenAI's tools can no longer access the NYT website, yet the newspaper filed a lawsuit in December.
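
For reference, the opt-out works at the crawler level: OpenAI documents that its GPTBot crawler honors the robots.txt convention, so a publisher can refuse it with two lines in the robots.txt file at the site's root. This standard directive is reportedly how the NYT blocked it:

    User-agent: GPTBot
    Disallow: /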

"We support journalism, partner with news organizations, [but] believe The New York Times lawsuit is without merit," it said.

OpenAI faces similar lawsuits from several published authors, including high-profile comedian Sarah Silverman. It's an issue that the courts cannot handle alone. The US Copyright Office, along with lawmakers, needs to clearly define the role AI training plays in copyright rules.


 
As much as I use ChatGPT, I don't agree with OpenAI. I am not even really a fan of how many things are copyrighted. However, these AI companies need to suck it up and pay for the data they use. If they are going to sell AI services built on the work of others, then those original authors deserve some form of payment. I don't know the best way to do it, but they need to at least put some effort into it. Just saying it's "impossible" is trying to have your cake and eat it too.
 
These same scumbags want the data for free, then turn around and demand payment for using their AI services. For example, Adobe only gives Photoshop subscribers 50 generative AI credits per month and refuses to allow local generation. I saw one new company the other day that wants a minimum of $99 per month for 100 image generations.
 
At least they admit what they do. I wonder what companies outside the Western world will do. I mean, yes, they will keep using EVERYTHING they want to train their AI, and thus get ahead. So, what's next? Will their AI tools be banned in Western countries for copyright infringement, or what?
 
Text and data mining has already been held to fall under fair use, and this is basically a form of text and data mining.
 
OpenAI recently told members of the House of Lords that it is "impossible" to train large language models (LLMs) without using copyrighted material.

What they are really saying:

"We don't want to pay people for their work because we are cheap as ****. We want our search engine AI to be able to parrot as much information back as possible to make it sound like it knows what it is doing. After that, we want to charge people to use it by incorporating it into everything else we sell and call it 'AI' so we can get on this money train!"
 
This is not so hard. Anybody wanting to use copyrighted materials for AI training needs to negotiate payment, just like for any other use. It does not matter that bots scrape copyrighted data from behind paywalls for AI to use.
 