Reddit, Yahoo, and other publishers roll out licensing protocol, pressuring AI companies to pay for scraped data

The days of free web scraping for AI training may be over

By Kishalaya Kundu September 10, 2025, 13:36

In a nutshell: Several major online platforms and publishers, including Reddit, Yahoo, Medium, Ziff Davis, and Quora, have announced support for a new licensing standard that allows web publishers and creators to set terms for how AI systems use their content. Called Really Simple Licensing, or RSL, the framework could enable publishers to earn compensation when AI systems scrape data from their websites to train models.

Announced earlier today, Really Simple Licensing, or RSL, is an open, decentralized protocol developed by the non-profit RSL Collective. Built on the widely used RSS (Really Simple Syndication) standard, it can handle any digital content – web pages, books, videos, and datasets – across millions of websites. Unlike traditional licensing systems, RSL works at scale, allowing automated tools and web crawlers to read and interpret licensing terms without manual intervention.

Although the protocol is technically ready, no AI company has yet agreed to abide by RSL licensing terms. The RSL Collective's co-founders, Eckart Walther and Doug Leeds, are hopeful of striking deals with big names in the near future. Leeds notes that the new licensing protocol meets AI companies' need for a streamlined licensing structure that allows web crawlers to scrape data for AI training without risking lawsuits.

The RSL protocol supports multiple licensing, usage, and royalty models, including free, attribution, subscription, pay-per-crawl, and pay-per-inference. It requires AI companies to either obtain a custom license or follow Creative Commons terms to use data from the open web.

Web publishers who adopt the new licensing standard will include the terms in their "robots.txt" files, allowing artificial intelligence companies to easily identify the conditions for using data from that source. Crawlers that honor RSL terms will negotiate royalties and other issues with the RSL Collective, which acts as the intermediary between web publishers and AI companies.
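To illustrate how a crawler might pick up such terms, here is a minimal Python sketch. It assumes the publisher advertises its licensing terms through a `License:` line in robots.txt that points to a terms file. The directive name, the `find_license_urls` helper, and the example URL are illustrative assumptions rather than details confirmed in this article, so check the RSL specification for the exact syntax.

```python
from typing import List


def find_license_urls(robots_txt: str, directive: str = "License") -> List[str]:
    """Return the value of every `<directive>: <value>` line in a robots.txt body.

    The directive name defaults to "License", which is an assumption made
    for this sketch.
    """
    urls = []
    for raw_line in robots_txt.splitlines():
        # robots.txt comments start with '#'; drop them before parsing.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        # Split only on the first colon so URLs keep their "https://" part.
        key, value = line.split(":", 1)
        if key.strip().lower() == directive.lower() and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content for demonstration purposes only.
SAMPLE = """\
User-agent: *
Allow: /
License: https://example.com/license.xml  # hypothetical terms file
"""

print(find_license_urls(SAMPLE))
```

A crawler that honors the terms would then fetch the referenced file and decide whether to proceed before scraping. That step is left out here because the format of the terms file is not described in this article.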

The artificial intelligence industry has recently faced several lawsuits alleging that companies trained their models on unlicensed data. Late last month, Anthropic, the company behind the Claude large language model, agreed to pay $1.5 billion to settle a class-action lawsuit brought by authors who said the company used pirated copies of their books to train its chatbot.

Several dozen copyright cases are currently pending in US courts, all seeking damages from tech companies for allegedly scraping and using unlicensed data to train their AI models. Just last week, Warner Bros. sued Midjourney for generating AI images of Superman, Bugs Bunny, and other copyrighted characters without a licensing agreement.

Google is also facing scrutiny over its "AI Overview" feature, which displays AI-generated summaries of key information at the top of search results for some queries. Many publishers and blog owners say they have lost significant click-through traffic from Google Search since the feature launched and are demanding that the company disclose traffic statistics from AI Overview.
