Tumblr and WordPress blogs will be exploited for AI model training

Alfonso Maruccia · Feb 28, 2024

Facepalm: Generative AI gobbles massive amounts of data, and companies always need fresh content to develop their LLMs and other machine learning models. WordPress owner Automattic is seemingly ready to provide that content for a fee. The company vows to respect users' privacy, but it may have already fed some private data to AI partners.

Automattic is working on a business deal with Midjourney and OpenAI and has already prepared an initial batch of content to feed their models. An unnamed internal source told 404 Media that the deals are imminent, and internal documentation provides proof of a "messy" data-sharing process at one of Automattic's main blogging products.

The company, founded by Matt Mullenweg, owns platforms Tumblr and WordPress.com, the for-profit blogging site developed on top of the open-source WordPress CMS software. User data is paramount for AI development, as large language models are prone to spouting nonsensical gibberish when trained on their own output, the so-called feedback loop effect.

The insider said that Automattic plans to provide full opt-out rights to users interested in protecting their public data, including posts and pictures. However, internal posts indicate that Tumblr has already provided Midjourney and OpenAI an "initial data dump" of all publicly posted content between 2014 and 2023. Furthermore, a "mistake" caused Automattic to share private data of Tumblr users with the two AI companies as well.

After 404 Media went public with its report, Automattic released a statement about "protecting user choice" in the rapidly evolving AI world. The data broker is "closely following" the recent advancements in AI tech and is diligently looking at "how to work" with AI companies while respecting users' privacy and data control.

Automattic currently blocks AI platform crawlers "by default," including spiders from the world's largest tech companies. WordPress.com and Tumblr now have settings to "discourage" data crawling by AI companies, which are on by default if a user has previously disabled search engine indexing.

Automattic admits that no laws currently exist to force AI crawlers to comply with those no-indexing preferences. However, this could soon change with new pending legislation in the European Union. The company also confirms that it's working directly with "select" AI companies, as long as their working plans align with Automattic's principles about user choice.

Permalink to story.

https://www.techspot.com/news/102057-tumblr-wordpress-data-exploited-ai-model-training.html

Theinsanegamer · Feb 28, 2024

I’m sure joining AI to the cesspit that is Tumblr and the ramblings that are WordPress blogs will absolutely improve its gibberish issue!

yRaz · Feb 28, 2024

Ohman, and people thought Google Gemini was bad, I can't wait to see results from this. Are we going to see the first trans AI that identifies as smart toaster firmware or something?

toooooot · Feb 28, 2024

AI taught on Tumblr must come with a huge warning sign.
But then, there is already Google and its Gemini.

Vanderlinde · Feb 28, 2024

If this is turned on by default in the next update or release of WordPress, I'd strongly suggest that people stop using WordPress, or block it, in general.

The package itself is bloated - even with updates you're getting pushed through these twentyX themes that nobody wants and is likely to be hacked.

Roughly 60% of your blog traffic is pure garbage, bots seeking exploits on your website. As a consumer you're "forced" to grandnanny your website to keep it updated.
