Websites can now block OpenAI's web crawling bot

Alfonso Maruccia

Something to look forward to: ChatGPT's large language model was unveiled in November 2022, and in the months since, the technology has drawn criticism and accusations from many corners of the internet. OpenAI, the company that developed it, is now taking its first, cautious steps to address that criticism.

ChatGPT's LLM was developed by scraping vast amounts of freely available internet content, a fact OpenAI readily acknowledges. The company is now providing instructions on how webmasters, server administrators, and internet companies can prevent its crawler from accessing their websites.

In an official post, OpenAI explains that GPTBot is the company's web crawler, used to gather freely available internet content for training ChatGPT. Pages crawled with the "GPTBot" user agent may be used to improve future models, the company says. The crawler applies filters to exclude paywalled sources, sites known to collect personally identifiable information, and text that violates the company's policies.

OpenAI states that allowing GPTBot to access a site can help make AI models more accurate, improving ChatGPT's overall capabilities and "safety." Individuals and companies that would rather not contribute to ChatGPT's improvement for free, however, can block the crawler by adding a rule to their "robots.txt" file that disallows GPTBot from accessing their website or domain.
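As a minimal sketch of what that looks like, a robots.txt file placed at the root of a site can turn the crawler away entirely by targeting the GPTBot user agent:

    # Block OpenAI's GPTBot crawler from the entire site
    User-agent: GPTBot
    Disallow: /

Other crawlers are unaffected by this rule, since it applies only to the GPTBot user-agent token.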

The robots.txt text file implements the Robots Exclusion Protocol, which websites commonly use to allow or disallow web crawlers from scanning some or all of their content. The protocol relies on voluntary compliance, and not every web robot honors custom disallow rules. OpenAI appears committed to following robots.txt directives, going so far as to publish the IP address ranges used by its crawler to make blocking easier.
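For sites that only want to wall off part of their content, the same mechanism supports finer-grained rules. A hedged sketch, with placeholder directory names standing in for a site's own paths:

    # Let GPTBot crawl one section while keeping it out of another
    # (directory names below are illustrative placeholders)
    User-agent: GPTBot
    Allow: /public-articles/
    Disallow: /members-only/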

Before OpenAI published these instructions, DeviantArt introduced its own "NoAI" tag for artists who wanted to keep their work out of unpaid LLM training. Using robots.txt, however, gives webmasters and third-party companies considerably more control, assuming OpenAI abides by the rules it has laid out.

Notably, the company recently signed on to a set of voluntary commitments brokered by the White House to pursue safe, secure, and trustworthy AI development.
