Apple, Nvidia, and others trained their AI on YouTube content without user consent or knowledge

Cal Jeffrey

Here we go again: Giant corporations, including Apple and Nvidia, have used video transcripts from thousands of YouTube creators for AI training without consent or compensation. The news is not that surprising as it seems par for the course. They are simply joining the ranks of Microsoft, Google, Meta, and OpenAI in the unethical use of copyrighted material.

An investigation by Proof News has uncovered that some of the wealthiest AI companies, including Anthropic, Nvidia, Apple, and Salesforce, have used material from thousands of YouTube videos to train their AI models. This practice directly contradicts YouTube's terms of service, which prohibit harvesting data from the platform without permission, but it follows a trend set by Google, OpenAI, and others.

The data, called "YouTube Subtitles," is a subset of a larger dataset called "The Pile." It includes transcripts from 173,536 YouTube videos from over 48,000 channels spanning educational content providers like Khan Academy, MIT, and Harvard, as well as popular media outlets like The Wall Street Journal, NPR, and the BBC. The cache even includes entertainment shows like "The Late Show With Stephen Colbert." Even YouTube megastars like MrBeast, Jacksepticeye, and PewDiePie have content in the cache.

Proof News Contributor Alex Reisner uncovered The Pile last year. It contains scraps of everything, from copyrighted books and academic papers to online conversations and YouTube Closed Caption transcripts. In response to the find, Reisner created a searchable database of the content because he felt that IP owners should know whether AI companies are using their work to train their systems.

"I think it's hard for us as a society to have a conversation about AI if we don't know how it's being built," Reisner said. "I thought YouTube creators might want to know that their work is being used. It's also relevant for anyone who's posting videos, photos, or writing anywhere on the internet because right now AI companies are abusing whatever they can get their hands on."

David Pakman, host of "The David Pakman Show," expressed his frustration, revealing that he found nearly 160 of his videos in the dataset. These transcripts were taken from his channel, stored, and used without his knowledge. Pakman, whose channel supports four full-time employees, argued that he deserves compensation if AI companies benefit financially from his work. He highlighted the substantial effort and resources invested in creating his content, describing the unauthorized use as theft.

"No one came to me and said, 'We would like to use this,'" said Pakman. "This is my livelihood, and I put time, resources, money, and staff time into creating this content. There's really no shortage of work."

Dave Wiskus, CEO of the creator-owned streaming service Nebula, echoed this sentiment, calling the practice disrespectful and exploitative. He warned that generative AI could potentially replace artists and harm the creative industry. Compounding the problem is that some large content producers like the Associated Press are penning lucrative deals with AI creators while smaller ones are having their work stolen without notice.

The investigation revealed that EleutherAI is the company behind The Pile dataset. Its stated goal is to make cutting-edge AI technologies available to everyone. However, its methods raise ethical concerns – primarily those of the hush-hush deals made with big AI players. Various AI developers, including multitrillion-dollar tech giants like Apple and Nvidia, have used The Pile dataset to train their models. None of the companies involved have responded to requests for comment.

Lawmakers have been slow to respond to the various threats that AI brings. After years of deepfake technology advances and abuses, the US Senate finally introduced a bill to curb deepfake and AI abuse dubbed the "Content Origin Protection and Integrity from Edited and Deepfaked Media Act" or COPIED Act. The bill aims to create a framework for the legal and ethical gray area of AI development. It promises transparency and an end to the rampant theft of intellectual property via internet scraping, among other things.

 
The problem with these claims is that "watched by an AI algorithm" is not the same as "copied."

I know I'm clear that if the publisher of my 4th grade math textbook tried to sue me now because I went on to use the math I learned in part from that textbook throughout my career, I wouldn't feel they have a valid claim. Furthermore, even if that was a valid legal theory, there'd be an evidence problem as to what derived from that one textbook vs. everything else I've ever learned from any other source. Of course, everyone was clear that's what I'd be doing at the time the textbook was sold so it's not an exact analogy.

My general feeling is that when technology introduces new questions about who has what rights, the answers should fall towards not retroactively creating new restrictions. If new technology suddenly enables Ford to control what roads my car can drive on, they should not retroactively be able to demand a yearly Highway Driving license fee on the car I already purchased three years ago.

So my tentative feeling is that if a video was published to the public and accordingly ingested by an AI prior to any new discussion / license terms, there shouldn't be a retroactive claim. Of course if the AI ends up copying that content in its output to other users, it's subject to the same copyright laws we already had in effect.


 
I'm afraid I've seen these companies continue to abuse the system and their customers. I'm all for locking them up and throwing away the key!
 
I used to wonder if pirating content would ever be done en masse by corporations, not just individuals. I don't wonder about that anymore.

I'm not claiming that this is or isn't piracy; the jury is still out on these subjects. But legal technicalities aside, training AI models has given corporations the largest reason to bust through paywalls.

It's interesting that a totally different fad that fizzled out, the metaverse, ultimately relies on an open ecosystem to support interoperability if it is to be the successful dream that Zuckerberg and others were touting. Before AI, it was hard to believe that there would be any incentive for companies to do that. Now, with AI and its hunger for data, it makes one wonder if the outcome will be a set of cross-licensing or other kinds of interoperability agreements that someday lay the foundation for a true metaverse.
 
This AI stuff, in its current iteration, is not that 'intelligent' in my opinion. It's basically just copying what's available and mixing it with other material to make it look different. There is a stark difference between us human beings, who learn, think, and have real intelligence, and the AI that many talk about and promote nowadays. For example, when we were young and were taught that 2+2 = 4, we, as intelligent, learning human beings, could apply that to other similar problems, like 3+4, 5+5, etc. But with today's AI, if the result of 3+4 or 5+5 was never published on the internet, it will never be able to answer that question. The AI of today is just another search engine that sometimes has the ability to mix what it scraped so it looks like an intelligent being...
 
Said it months ago: the AI push is basically another Ponzi scheme that works exactly like crypto and NFTs except the names of the companies involved are bigger, the biggest actually.

This should pretty much confirm it: even the champion of 'User privacy' is not above stealing anything they want to train AI and sell it back to you.

Bottom line is there is no way to keep AI 'Ethical' because there is no way to be an ethical company under capitalism, and especially no way to be one of the biggest corporations in the world while being ethical: they're all scum.
 
Apple stealing data, ignoring consumer rights, say it isn't so. Next thing you'll tell me they don't support right to repair.
 
The problem with these claims is that "watched by an AI algorithm" is not the same as "copied."
Indeed, because an algorithm cannot "watch" it; it can only copy it, and it needs to copy it multiple times over to "learn" from it. Especially if it's using a training set that was "created" (aggregated) and distributed completely illegally. Which most AI models do with most of their content, because they don't download their training content from the original sources one by one, but mostly use training sets/crates that are illegally distributed to begin with.

Also, copyright law was not created with AI in mind (because, well, people can't see the future), and trying to abuse its fair use doctrine (which builds on an understanding of the limited capacity of individuals for use and abuse, completely unlike that of AI models) clearly flies in the face of the intention of copyright law, which is to allow creators to retain control over the use of their content so that they don't lose the (financial) incentive to create newer and newer content.

Which is exactly what AIs pretty much prevent, because they're essentially offering a competing product (or a service that allows easy creation of competing products), based on the very creative work that they compete against, but without the permission of the original authors to use their creative works for that purpose, and without paying or trying to compensate them by any means.

The question isn't whether AI companies are violating copyright law. The question is when will they be prosecuted for that?

I know I'm clear that if the publisher of my 4th grade math textbook tried to sue me now because I went on to use the math I learned in part from that textbook throughout my career, I wouldn't feel they have a valid claim.
Which does not compare to theft for AI training by any means, because, for one, you used the book in accordance with the original intention of its author, which was to allow human students to learn from it by reading it. That is not the case with AI training, as authors have never expressed an explicit intention for their works to be used to train AI models, and some have even explicitly expressed an intention NOT to allow that. Also, textbook authors get paid for their work when those learning from their books buy them, while AI companies don't pay a dime to all the authors whose content they've been abusing to train their models.

Furthermore, even if that was a valid legal theory, there'd be an evidence problem as to what derived from that one textbook
That's an irrelevant question, and is a burden of proof fallacy anyway. Merely downloading and using the content to train an AI is a copyright violation in itself, especially if it flies against the explicit expressed intention of the authors, which AI companies didn't and don't check one by one. Also it's AI companies that have to prove that their training data was either generated by them or licenced properly from the authors, not the authors that have to prove that their content was stolen, because of the obvious black box nature of how AI models work.


So my tentative feeling is that if a video was published to the public and accordingly ingested by an AI prior to any new discussion / license terms, there shouldn't be a retroactive claim.
The videos were published with the understanding that they would be watched by humans, not that they would be used to train AI models. Hence, claiming that it was all done legally is completely unsubstantiated, as authors have never expressed an intention to allow that. They also didn't get paid for their content, as all these training sets are illegally distributed to begin with.
 
I thought it was pretty clear that if you used whatever 'platform,' what you posted was their property. F... er, Zuckerberg made that pretty clear, if you hadn't noticed, and no matter what the laws said, has stood by that stance. So, everybody else takes that same viewpoint. Now, along comes AI... The most expensive gamble that the tech business has made, with uncertain outcomes. Well, one outcome is certain, but it only has to do with the expensive bit.... In order to recover THEIR costs, YOU will pay. And pay. And pay some more, for your OWN content regurgitated back at you. Never forget.... YOU ARE THE PRODUCT.
 
As expected, AI is up for grabs. Nothing big will change, as big tech is multinationally owned by people with deep pockets.
 
"watched by an AI algorithm" is not the same as "copied." - In this case it is the same. Content is taken (sometimes copied word for word, or organized into specific data sets) and used to develop a new product that uses parts of what was copied.

"if the AI ends up copying that content in its output" - It's been proven time and time again that it does that. And even if it doesn't always replicate things 100%, it only needs a portion of the copied content to show up in the output. For example, you can make a new song that is 80% completely original, but if the other 20% is taken from somewhere else, you can still be sued (we have plenty of such court cases).

FYI, the people who made that maths book are within their rights to sue you if you put portions of their book in another book (portions that are copyrightable; equations are not) or use images from it in another commercial product.
 