OpenAI transcribed over a million hours of YouTube videos to train its LLMs, Google engaged in same practice

midian182

A hot potato: One of the many controversial elements surrounding generative AIs and the training data used to build large language models (LLMs) is the potential for copyright infringement. It's a topic under the spotlight once again following a report that OpenAI transcribed over a million hours of YouTube videos to train GPT-4. Why didn't YouTube owner Google object? Because it did the same thing.

In 2021, seeking more reputable English-language text to train on, OpenAI researchers created a speech recognition tool called Whisper, reports The New York Times. It was designed to transcribe audio from YouTube videos, giving the company a trove of data to train its LLMs.

OpenAI reportedly knew that scraping YouTube data was legally questionable but did it anyway, assuming such action could be considered fair use. The Times writes that OpenAI president Greg Brockman was personally involved in collecting videos that were transcribed.

One would imagine Google being less than happy about OpenAI's actions, but objecting would have been hypocritical given that Google also transcribed YouTube videos for its AI models, potentially violating creators' copyrights.

YouTube CEO Neal Mohan said during an interview with Bloomberg last week that the platform's terms of service do not permit unauthorized transcription or downloading of video content. When asked about OpenAI's transcribing, he said, "I have seen reports that it may or may not have been used. I have no information myself."

Google spokesperson Matt Bryant repeated the ToS rules, adding that the company takes "technical and legal measures" to prevent this sort of unauthorized practice "when we have a clear legal or technical basis to do so." Google said that its AI models are trained "on some YouTube content" that is allowed under agreements with creators.

The NY Times states that Google has since expanded its terms of service, giving it more rights to use consumer data such as publicly available Google Docs and restaurant reviews on Google Maps for the company's AI models.

The revised policy was reportedly released on July 1 in the hope that the Independence Day weekend would act as a distraction.

Meta was also said to be weighing shady methods of obtaining more data for its LLM training. The NY Times writes that the Facebook parent considered collecting copyrighted data from the internet, even if that meant facing lawsuits, as negotiations with license holders would take too long.

Thousands of organizations and individuals are complaining and filing lawsuits against large AI companies over the use of their content without payment or acknowledgment. The New York Times is suing OpenAI and Microsoft for using its copyrighted news articles. In February, OpenAI accused the publication of paying someone to "hack" its famous chatbot and other products to generate misleading evidence supporting these claims.

Masthead: Souvik Banerjee

 
It’s obvious that it’s been trained on YouTube data because you can ask it to write a transcript in the style of a given YouTuber and it works.
 
So basically Meta is saying that we can take whatever we want,
because it would take too much time to work for it...
 
Google complaining about someone stealing data to train AI is the ultimate in hypocrisy. Even Trump would be impressed.
 