Microsoft AI CEO: Content on the open web is "freeware" for AI training

zohaibahd

What just happened? The use of copyrighted material to train AI has become a hot-button issue, with experts divided on whether it constitutes theft or a legitimate form of study akin to artistic training. Microsoft's top AI executive thought it would be a good idea to add fuel to the fire by making some bold claims about what companies can legally do with online content when training their AI systems.

Mustafa Suleyman, who's been heading Microsoft's AI efforts since March, told CNBC in an interview that material published openly on the web essentially becomes "freeware" that anyone can copy and use as they please.

"I think that with respect to content that's already on the open web, the social contract of that content since the '90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it," he stated. "That has been 'freeware,' if you like, that's been the understanding."

That's certainly a spicy take, and an inaccurate one: you only need to look at the FAQ page from the US Copyright Office. One answer therein states that "your work is under copyright protection the moment it is created and fixed in a tangible form that it is perceptible either directly or with the aid of a machine or device."

The same FAQ adds that you do not even need to register "to be protected." The only time registration is needed is when you wish to file a lawsuit for infringement. So it's safe to say that neither copyright protection nor fair use stems from any "social contract," as Suleyman suggests.

Suleyman did seemingly acknowledge the importance of the robots.txt file, stating that a website declaring "do not scrape or crawl" might make scraping a "grey area." But adhering to this basic protocol for blocking web crawlers is a courtesy, not something that needs to "work its way through the courts," as he suggested.

Not surprisingly, even robots.txt is being ignored by various AI companies including Anthropic, Perplexity, and OpenAI.
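To illustrate why robots.txt is only a courtesy: a well-behaved crawler fetches the file, checks whether its user agent is allowed to visit a given path, and voluntarily backs off; nothing technically prevents a crawler that skips the check. Below is a minimal Python sketch using the standard library's urllib.robotparser; the domain and bot name are placeholders, not taken from the article.

```python
# Minimal sketch: robots.txt is purely advisory. A compliant crawler checks
# the file before fetching a page; nothing enforces the answer it gets back.
# "example.com" and "ExampleAIBot" are placeholders for illustration only.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # download and parse the site's robots.txt

# Suppose the site lists:
#   User-agent: ExampleAIBot
#   Disallow: /
# can_fetch() then reports False for that bot, but obeying it is up to the crawler.
for agent in ("ExampleAIBot", "*"):
    allowed = rp.can_fetch(agent, "https://example.com/articles/some-story")
    print(f"{agent}: allowed to fetch? {allowed}")
```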

This isn't the first time an executive working on AI advancement has made controversial claims. A big reason such statements keep surfacing is likely that, more than a year after ChatGPT's launch, the legal ground around training data and copyright is still being mapped out.

Microsoft and partner OpenAI are indeed facing multiple lawsuits from publishers over allegations of using copyrighted online articles to train their powerful language models without permission. However, these cases have yet to reach final resolutions that could provide more legal clarity.

Suleyman's statements reflect a view that AI scraping the internet is much like artists studying great works while learning their craft. "What are we, collectively, as an organism of humans, other than a knowledge and intellectual production engine?" he mused in the same interview.

However, the difference between AI and artists is that only one is capable of ingesting and regurgitating the world's content into profitable AI products and services on an unprecedented scale.


 
Eventually the laws will change - there is no way to enforce protection of the net, so they’ll just give up.

 
Surely, any website that wants to protect its content, Netflix, Disney+, AppleTV, they all force you to login with an account right? So they aren't "open web"?

Even YouTube, you have to login to see stuff that's marked as more adult content or "not kid friendly" content.

Soo... if you don't want your content crawled, why not lock it behind a login of some kind? What am I missing here?

Let's say you're walking down the public high street: female models in windows for clothing brands, giant Warhammer 40k figures at the Games Workshop, shoes all over the walls in the shoe shops. If I took photos of any of these publicly accessible places, am I breaking copyright laws in some way?

If these shops weren't showing stuff in their front windows, the doors were closed, and the only way to get in was a PIN (enter your email address, okay, here's the PIN to get in, as an example), then once I was inside and started taking photos, I'd get it: I'm actively exploiting something they don't want the world to see.

I absolutely might not have read this correctly though and I've got the wrong end of the stick.
 
Surely, any website that wants to protect its content, Netflix, Disney+, AppleTV, they all force you to login with an account right? So they aren't "open web"?

Even YouTube, you have to login to see stuff that's marked as more adult content or "not kid friendly" content.

Soo... if you don't want your content crawled, why not lock it behind a login of some kind? What am I missing here?

Let's say you're walking down the public high street: female models in windows for clothing brands, giant Warhammer 40k figures at the Games Workshop, shoes all over the walls in the shoe shops. If I took photos of any of these publicly accessible places, am I breaking copyright laws in some way?
If you start selling those photos you are…
If these shops weren't showing stuff in their front windows, the doors were closed, and the only way to get in was a PIN (enter your email address, okay, here's the PIN to get in, as an example), then once I was inside and started taking photos, I'd get it: I'm actively exploiting something they don't want the world to see.
And now they get less visits…
I absolutely might not have read this correctly though and I've got the wrong end of the stick.
 
I am glad you pointed out in the article his claims are complete bullshit. Surprised even he would stoop to such obvious barefaced lies.

 
I am glad you pointed out in the article his claims are complete bullshit. Surprised even he would stoop to such obvious barefaced lies.
While, legally, they are certainly lies… in reality, they are the truth.

MS, OpenAI and every other AI company have used, currently use, and will always use the internet for training, and there is virtually nothing law enforcement can or will be able to do about it. At most, they will get a fine - which will represent a tiny percentage of their ENORMOUS profits - and will proceed to train their AIs as they please.
 
Going by his way of thinking, if a copy of Windows 11 or Microsoft Office was in the wild, I guess we could download it, install it, and use it for free? I could use the same excuse and say that if I am using it for training purposes I shouldn't have to pay for it. I guess now we can go to Microsoft and let them know we are using their programs and not paying for them, because their CEO of AI said we are allowed.
 
More such statements please. I expect they will be very handy in court to prove the companies either intentionally violated copyright law or willfully neglected copyright protection.
 
Mustafa Suleyman is obviously wrong. The articles on the "open web" are absolutely copyrighted. Plagiarizing them is illegal. However, the question for AI should not be whether the "open web" is copyrighted. The question is whether an AI reading in copyrighted material to learn from is OK. I don't think there is any law saying it's not. A good AI never exactly reproduces the original material. It doesn't store the original material, so it can't cut and paste it back into a new article. It's no different from a person reading an article, learning something from it, and then telling a friend about it in their own words. That is not a copyright violation. But the government could change copyright law to make it illegal to train an AI without paying. I don't think there is any law requiring that right now.
 
While, legally, they are certainly lies… in reality, they are the truth.

MS, OpenAI and every other AI company have used, currently use, and will always use the internet for training, and there is virtually nothing law enforcement can or will be able to do about it. At most, they will get a fine - which will represent a tiny percentage of their ENORMOUS profits - and will proceed to train their AIs as they please.

true
 
All M$ products are now free to use. Pirate them, I say. If they don't want to pay for the "data" they train on, then we don't have to pay for their software. I think it is that simple.
 
Who hired this 'leader'? Ah, Microsoft :)

However, another difference between AI and humans is that only one is capable of voluntarily destroying his own shareholder value by driving future settlements sky-high.
 
Surely, any website that wants to protect its content, Netflix, Disney+, AppleTV, they all force you to login with an account right? So they aren't "open web"?

Even YouTube, you have to login to see stuff that's marked as more adult content or "not kid friendly" content.

Soo... if you don't want your content crawled, why not lock it behind a login of some kind? What am I missing here?

Let's say you're walking down the public high street: female models in windows for clothing brands, giant Warhammer 40k figures at the Games Workshop, shoes all over the walls in the shoe shops. If I took photos of any of these publicly accessible places, am I breaking copyright laws in some way?

If these shops weren't showing stuff in their front windows, the doors were closed, and the only way to get in was a PIN (enter your email address, okay, here's the PIN to get in, as an example), then once I was inside and started taking photos, I'd get it: I'm actively exploiting something they don't want the world to see.

I absolutely might not have read this correctly though and I've got the wrong end of the stick.

Yea no. The obvious comparison point here would be reprinting TechSpot articles verbatim on my own website 'tech hub' and making money off of that. That is not 'taking a photo of the merchandise'; that is 'stealing it and selling it'.

Fair use under copyright law is more akin to me having read a bunch of articles and making my own ‘review mashup video/essay’ where I summarise the findings of a bunch of reviewers in my own words on YouTube or a blog.

But even that gets complicated with AI. Because if I as a human do that, it requires effort, and unless I have something interesting to say, people will most likely just read the reviews I am aggregating for better content. The AI, however, auto-scrapes the web and posts review findings verbatim to anyone who searches on the topic with Google / Bing / ChatGPT. This essentially means that large tech companies with a massive advantage when it comes to people interfacing with their tools (how do you find reviews? You search for them) are serving up the findings of people who've done the work before you're even directed to those people's sites. Rather, they're presented as coming from a magical 'AI' made by the tech companies. So suddenly everyone is given answers (occasionally non-factual due to hallucinations) before ever interfacing with anyone who has done 'work' to answer their query, stealing man-hours directly from people who receive no compensation. That… is not fair use. That's abuse, and also a doomed business model, although the gen AI sellers haven't noticed yet…

Because if those doing the work aren’t compensated… well they go bankrupt… and now your bot has nothing to scrape any more. And the bot doesn’t have any answers. Because it’s not intelligent, nor can it do any work.
 
:D so you declare so, therefore it is like so? Just throw IP and copyright out the window, right?

That's not how it works. That's not how any of this works :D
 
Yea no. The obvious comparison point here would be reprinting TechSpot articles verbatim on my own website 'tech hub' and making money off of that. That is not 'taking a photo of the merchandise'; that is 'stealing it and selling it'.
Correction: COPYING and selling it. Before you try and argue this, stealing implies the original owner no longer has said object or ownership, which in the digital world, they still would.
Fair use under copyright law is more akin to me having read a bunch of articles and making my own ‘review mashup video/essay’ where I summarise the findings of a bunch of reviewers in my own words on YouTube or a blog.

But even that gets complicated with AI. Because if I as a human do that, it requires effort, and unless I have something interesting to say, people will most likely just read the reviews I am aggregating for better content. The AI, however, auto-scrapes the web and posts review findings verbatim to anyone who searches on the topic with Google / Bing / ChatGPT. This essentially means that large tech companies with a massive advantage when it comes to people interfacing with their tools (how do you find reviews? You search for them) are serving up the findings of people who've done the work before you're even directed to those people's sites. Rather, they're presented as coming from a magical 'AI' made by the tech companies. So suddenly everyone is given answers (occasionally non-factual due to hallucinations) before ever interfacing with anyone who has done 'work' to answer their query, stealing man-hours directly from people who receive no compensation. That… is not fair use. That's abuse, and also a doomed business model, although the gen AI sellers haven't noticed yet…

Because if those doing the work aren’t compensated… well they go bankrupt… and now your bot has nothing to scrape any more. And the bot doesn’t have any answers. Because it’s not intelligent, nor can it do any work.
I actually totally agree with all your points, I'm just playing devil's advocate here (I've been encouraged to use AI at work and even got given a Copilot license for a while), and I'm just seeing where AI could fit in.

Hear me out: if I manually go trawling through the web and make my own list, that's fine, but if an automated system does the same thing, that's bad? What money does TechSpot and all the other websites I visit make if I'm using an ad blocker?

And Google's basically been doing this forever; it just was written on a page rather than some AI attempting to use natural language.
 
So... I guess if I scrape all the code behind every M$ site on the web and then publish it for my own purposes, I'm in the clear right? Talk about making a misleading and obviously self serving BS statement. Then again it is M$, once referred to as the "evil empire", so I'm not too surprised...
 
While, legally, they are certainly lies… in reality, they are the truth.

MS, OpenAI and every other AI company have used, currently use, and will always use the internet for training, and there is virtually nothing law enforcement can or will be able to do about it. At most, they will get a fine - which will represent a tiny percentage of their ENORMOUS profits - and will proceed to train their AIs as they please.
That's not how it happens. They get a "cease and desist", then the plaintiff sues. Each lawsuit is worth millions of dollars, and scraping the Internet creates millions of plaintiffs.
 
That's not how it happens. They get a "cease and desist", then the plaintiff sues. Each lawsuit is worth millions of dollars, and scraping the Internet creates millions of plaintiffs.
Precisely - and how often do you think the tiny company who sues MS or Google wins their lawsuit? And even if they do, MS / Google just pays some money and continues doing business…

And if the government ever DOES get involved, it will simply be a fine…
 