Sites scramble to block ChatGPT web crawler after instructions appear

Without announcement, OpenAI recently added details about its web crawler, GPTBot, to its online documentation site. GPTBot is the name of the user agent that the company uses to retrieve web pages to train the artificial intelligence models behind ChatGPT, such as GPT-4. Earlier this week, some sites quickly announced their intention to block GPTBot from accessing their content.

In the new documentation, OpenAI says that web pages crawled with GPTBot “may potentially be used to improve future models,” and that allowing GPTBot to access your site “can help AI models become more accurate and improve their general capabilities and safety.”

OpenAI claims to have implemented filters that ensure GPTBot does not access sources behind paywalls, sources that collect personally identifiable information, or content that violates OpenAI’s policies.

News of the ability to block OpenAI’s training crawls (assuming the crawler honors robots.txt) comes too late to affect the existing training data for ChatGPT or GPT-4, which was scraped without announcement years ago. OpenAI collected data ending in September 2021, which is the current “knowledge” cutoff for OpenAI’s language models.

It is worth noting that the new instructions may not prevent web-browsing versions of ChatGPT or ChatGPT plugins from accessing current websites to relay up-to-date information to the user. That point is not spelled out in the documentation, and we have contacted OpenAI for clarification.

The answer lies in the robots.txt file

According to OpenAI’s documentation, GPTBot can be identified by the user agent token “GPTBot,” with its full string being “Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)”.
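
As a practical illustration (our own sketch, not something from OpenAI’s documentation), matching that token on the server side takes little more than a substring check against the User-Agent header; the helper name below is hypothetical:

def is_gptbot(user_agent: str) -> bool:
    # True if the User-Agent header carries OpenAI's "GPTBot" token.
    return "GPTBot" in user_agent

full_string = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
               "compatible; GPTBot/1.0; +https://openai.com/gptbot)")
print(is_gptbot(full_string))                                  # True
print(is_gptbot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False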

The OpenAI docs also provide guidance on how to prevent GPTBot from crawling websites using the industry-standard robots.txt file, a text file located in the root directory of a website that instructs web crawlers (such as those used by search engines) not to index the site.

It’s as easy as adding these two lines to your site’s robots.txt file:

User-agent: GPTBot
Disallow: /

OpenAI also says that admins can restrict GPTBot from certain parts of a site in the robots.txt file with different directives:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
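
To check that these rules behave as intended before deploying them, Python’s standard library includes a robots.txt parser; this sketch (our illustration, using a hypothetical example domain) feeds it the directives above:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Allow: /directory-1/",
    "Disallow: /directory-2/",
])
print(rp.can_fetch("GPTBot", "https://example.com/directory-1/page.html"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/directory-2/page.html"))  # False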

In addition, OpenAI has published the specific IP address blocks from which GPTBot will operate, which can be blocked by firewalls as well.
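
For sites that prefer to filter at the network level, checking an incoming connection against those published blocks is a simple containment test. Below is a minimal sketch using Python’s ipaddress module; the CIDR range shown is a documentation placeholder, not an actual GPTBot address block:

import ipaddress

# Placeholder range for illustration; substitute the blocks OpenAI publishes.
GPTBOT_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

def from_gptbot_range(client_ip: str) -> bool:
    # True if client_ip falls inside any listed GPTBot block.
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in GPTBOT_RANGES)

print(from_gptbot_range("192.0.2.15"))   # True (inside the placeholder range)
print(from_gptbot_range("203.0.113.9"))  # False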

Despite this option, blocking GPTBot will not guarantee that a site’s data does not end up training all future AI models. Aside from the issue of scrapers that ignore robots.txt files, there are other large datasets of scraped websites (such as The Pile) that are not affiliated with OpenAI. These datasets are commonly used to train open source (or source-available) LLMs such as Meta’s Llama 2.

Some websites react quickly

While ChatGPT has been a huge success from a technical standpoint, it has also been controversial for the way it scraped copyrighted data without permission and concentrated that value into a commercial product that circumvents the typical online publishing model. OpenAI has been accused of (and sued for) plagiarism along those lines.

Accordingly, it’s not surprising to see some people react to the news of the ability to block their content from future GPT models with a kind of pent-up relish. For example, on Tuesday, VentureBeat noted that The Verge, Substack writer Casey Newton, and Neil Clarke of Clarkesworld all said they would block GPTBot shortly after news of the bot broke.

But for operators of large websites, the choice to block LLM crawlers is not as easy as it might seem. Making some LLMs blind to certain website data will leave knowledge gaps that could serve some sites very well (such as sites that don’t want to lose visitors if ChatGPT supplies their information for them), but it could also hurt others. For example, blocking content from future AI models could reduce the cultural footprint of a site or brand if AI chatbots become a primary user interface in the future. As a thought experiment, imagine an online business announcing in 2002 that it did not want its website indexed by Google: a self-defeating move when that was the most popular way to find information online.

It’s still very early in the generative AI game, and no matter which way the technology goes (or how individual sites try to opt out of AI model training), at least OpenAI is offering the option.
