Tumblr and WordPress blogs will be exploited for AI model training

Automattic is striking an AI-focused deal, but may have already abused some private content

By Alfonso Maruccia February 28, 2024, 10:26

Facepalm: Generative AI gobbles massive amounts of data, and companies always need fresh content to develop their LLMs and other machine learning models. WordPress owner Automattic is seemingly ready to provide that content for a fee. The company vows to respect users' privacy, but it may have already fed some private data to AI partners.

Automattic is working on a business deal with Midjourney and OpenAI and has already prepared an initial batch of content to feed their models. An unnamed internal source told 404 Media that the deals are imminent, and internal documentation provides proof of a "messy" data-sharing process at one of Automattic's main blogging products.

The company, founded by Matt Mullenweg, owns the platforms Tumblr and WordPress.com, the for-profit blogging site built on top of the open-source WordPress CMS software. User data is paramount for AI development, as large language models are prone to spouting nonsensical gibberish when left to themselves due to the so-called feedback loop effect.

The insider said that Automattic plans to provide full opt-out rights to users interested in protecting their public data, including posts and pictures. However, internal posts indicate that Tumblr has already provided Midjourney and OpenAI an "initial data dump" of all publicly posted content between 2014 and 2023. Furthermore, a "mistake" caused Automattic to share private data of Tumblr users with the two AI companies as well.

After 404 Media went public with its report, Automattic released a statement about "protecting user choice" in the rapidly evolving AI world. The data broker is "closely following" the recent advancements in AI tech and is diligently looking at "how to work" with AI companies while respecting users' privacy and data control.

Automattic currently blocks AI platform crawlers "by default," including spiders from the world's largest tech companies. WordPress.com and Tumblr now have settings to "discourage" data crawling by AI companies, which are on by default if a user had previously disabled search engine indexing.
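Automattic's statement does not spell out how the default block or the "discourage" setting is implemented. One widely used mechanism for this kind of opt-out is a robots.txt file that names specific crawlers, and the minimal Python sketch below assumes that approach. The robots.txt content, the example.com URL, and the `is_allowed` helper are hypothetical and not taken from WordPress.com or Tumblr; `GPTBot` is the user agent OpenAI documents for its crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not copied from any
# Automattic property. It turns away one AI crawler and allows the rest.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical blog post URL.
URL = "https://example.com/2024/02/some-post/"

gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", URL)
other_allowed = is_allowed(ROBOTS_TXT, "SomeSearchBot", URL)
print(gptbot_allowed, other_allowed)
```

A check like this only has an effect when the crawler chooses to run it. robots.txt is a published request, so a crawler that ignores the file is not technically stopped by it.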

Automattic admits that no laws currently exist to force AI crawlers to comply with those no-indexing preferences. However, this could soon change with new pending legislation in the European Union. The company also confirms that it's working directly with "select" AI companies, as long as their plans align with Automattic's principles about user choice.
