AI and your content. How to opt out.
TLDR: This post shows you how to opt your content out of the large language model (LLM) behind OpenAI’s ChatGPT by updating your robots.txt file. We’ll also discuss ways to opt out of other LLMs and web scrapers. The article starts by discussing how the introduction of AI broke the value exchange between content creators and big tech.
If you create content for a living or just as a hobby, and that content exists on the web, you have a choice to make as of August 8, 2023.
Do you want your creative output to be input for the big data model behind OpenAI’s ChatGPT?
I’ll assume your answer is no, and I want to show you how to opt out by telling OpenAI’s web scraper that it is “disallowed” from scraping your site.
If you simply want the instructions on how to opt out, jump to that section of the post.
Update July 19, 2024: In the past month or two, there have been multiple reports of some AI services simply ignoring rules you place in the robots.txt file. To combat this, Cloudflare has a new service called Bot Fight Mode that tries to “block artificial intelligence (AI) bots, crawlers, and scrapers from scraping your website content and training large language models.” If you use Cloudflare for your DNS, this seems like a great option to use.
ChatGPT vs other AI bots
OpenAI now lets us prevent its scraping tool from adding our content to its data model. Of course, the company is giving us this option only after having scraped up our content, largely without our knowledge. The new settings only let us opt out of future content scraping.
OpenAI is not the only company with AI tools in development. What about large language models built by other companies?
Google, for example, has Bard, but its blog post, A principled approach to evolving choice and control for web content, from July 6, 2023, says the company is “kicking off a public discussion” about giving content creators some control over the use of their data. As I write this, there is no way to opt out of Bard.
I’ve found no information on how to keep your data out of Meta’s LLaMA model. The LLaMA resources page simply says that its 2 trillion tokens of data were “pretrained using publicly available online data.” There is no mention of opting out.
In May 2023, the Electronic Privacy Information Center noted in its report Generating Harms: Generative AI’s Impact & Paths Forward that “without meaningful data minimization or disclosure rules, companies have an incentive to collect and use increasingly more (and more sensitive) data to train AI models.”
For now, OpenAI is the only company giving content creators any control over how their content is used. Other tech firms are unlikely to adopt identical permission settings, so however these tools evolve, expect to revisit your settings in the future. I’ll discuss a couple of options that might work for other LLM bots, but there are consequences you’ll want to consider before using them.
Before we begin the how-to, I want to outline why I chose to opt my site’s posts out of OpenAI’s LLM going forward.
We had a deal
Web scrapers have crawled sites since the early days of the web. Data scraping and cataloging were essential to the web’s growth. Without an index of the content of web pages, search engines wouldn’t be able to find things.
An unwritten deal evolved.
Creators allowed and encouraged search engine bots to index their art, writing, and ideas. Big tech thrived on the data. In return, content creators got an audience.
That’s the very rough outline of “the deal.”
You could debate whether this deal was equitable, but it worked for quite a while.
The broken deal
Things started to change in 2023.
The introduction of image generators that could conjure images from simple text descriptions without the input of a photographer or illustrator bordered on supernatural. With a simple incantation, an image would appear. Soon after, ChatGPT debuted, and the magic show continued. Now a writer was seemingly no longer needed to create content.
Creativity, an essential human talent, was happening without a living person involved. Devoting years to refining your craft and perfecting your skills seemed like a poor use of your time when the machine could do creative jobs faster and better than humanly possible.
Was it magic or sleight of hand?
What was the spark of the creativity from which the machines conjured their works? The creative source was our work. It was the writing, photos, and images we’d shared on the web. This was our content, silently ingested into big data models.
On April 19, 2023, the Washington Post published Inside the secret list of websites that make AI like ChatGPT sound smart by Kevin Schaul, Szu Yu Chen, and Nitasha Tiku.
Chatbots cannot think like humans: They do not understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet.
The article’s authors analyzed Google’s C4 data set, “a snapshot of the contents” used to power Google’s T5 and Facebook’s LLaMA. The article includes a search tool to check whether a particular URL has been included. Of course, I wanted to check if supergeekery.com had been included. The screenshot below shows “supergeekery.com” in the 3,026,152nd spot out of 15 million.
So, yes, supergeekery.com was caught in the net. Being able to know this is a positive side effect of this data model being open source.
So, what about the OpenAI dataset? According to the article, “OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT.” But I was curious. Was my blog in there too? Since I can’t see inside the OpenAI data model, I needed to ask someone.
Who do you ask in 2023? ChatGPT.
I asked ChatGPT to “Write two paragraphs about learning the basics of Javascript in the style of John Morton from supergeekery.com.” The screenshot shows what it returned.
I can see some of my writing style in the result, but perhaps I was reading too much into it. It could be spitting out some text about Javascript and ignoring the parts of my request that it didn’t know about.
I asked the same question about an entirely made-up person and site to find out.
I opened a new window and asked ChatGPT, “Write two paragraphs about learning the basics of Javascript in the style of Blurg Smartbugg from example.com.” (If your name is Blurg Smartbugg, please accept my apologies for using your name without your permission. Also, that’s a really unfortunate name. Damn.)
ChatGPT is “not familiar with the specific style of Blurg Smartbugg from example.com.”
Interesting. It does know “John Morton from supergeekery.com” but it doesn’t know “Blurg Smartbugg from example.com.”
So that means my data is part of the ChatGPT data model.
I didn’t opt in, but I can now opt out
Before I go further, I want to make one point clear. I find AI exciting and useful. I use ChatGPT and other generative AI tools. I don’t want to stop their use or their development. BUT… I prefer to be asked for my permission before my work is included in the data models.
On the positive side, OpenAI is the first AI company I know of that has provided tools to let them know we don’t want to be included in their data model. I applaud that move, and I expect it will pressure other tech companies to consider similar steps.
OpenAI is giving us this option after scraping up a large chunk of the internet, but we can now take back some small measure of control over our content going forward.
If these opt-out tools prove popular enough, the companies may even consider compensating content creators for their work. Dream big! (Can someone use AI to make a vision board for that?)
How to opt out of OpenAI using your content
OpenAI has documentation on how to disallow their web scraper, GPTBot, from cataloging your site. You do this with a small text file at the root of your domain: robots.txt.
Learn all about this file type here on the robotstxt.org site. The robots.txt file is a public statement about who and what is allowed to scrape your site’s content.
To immediately pour some cold water on your hopes, though, I’ll also point you to Google’s page on the robots.txt file.
The instructions in robots.txt files cannot enforce crawler behavior to your site; it’s up to the crawler to obey them. While Googlebot and other respectable web crawlers obey the instructions in a robots.txt file, other crawlers might not.
That means the people that follow the rules will follow the rules. Sigh. 😐
Let’s shake that off and continue on our way. 🙃
The solution I use can be found in the robots.txt for this site here. I’ve got comments above the two rules that are relevant to this task.
# Don't add any content to the GPT model
User-agent: GPTBot
Disallow: /
# No ChatGPT user allowed anywhere on the site
User-agent: ChatGPT-User
Disallow: /
The first rule tells the GPTBot not to scrape pages from my site. The GPTBot is what’s slurping up data for the next big iteration of the model behind ChatGPT.
The second rule tells ChatGPT-User, another page-scraping robot from OpenAI, that it is also not allowed to access my site. This rule comes into play when a ChatGPT user specifically asks it to pull something from your site.
Just add these rules to the robots.txt file at the root level of your site. If that file doesn’t already exist, make a plain text file named robots.txt and include either or both of those rules.
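If you’re wondering how these rules fit alongside rules you might already have, here’s a sketch of a complete robots.txt file. The /private/ rule and the Sitemap line are placeholders for whatever your existing file contains; the two OpenAI rules are the only new additions.

# An existing rule you might already have
User-agent: *
Disallow: /private/

# Don't add any content to the GPT model
User-agent: GPTBot
Disallow: /

# No ChatGPT user allowed anywhere on the site
User-agent: ChatGPT-User
Disallow: /

# An existing sitemap reference, if you have one
Sitemap: https://example.com/sitemap.xml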
Opt out of Google Bard
On September 28, 2023, Google followed OpenAI’s path and gave site owners the ability to opt out of the AI models Google is building. The documentation covering the full range of Google crawlers offers the level of granularity I hoped to see: Google lets you opt in and out of many different types of crawling. This means your site can be indexed for search without being included in Google’s AI research.
The rule that goes in your robots.txt file looks like this:
User-agent: Google-Extended
Disallow: /
In other words, we’re disallowing the Google-Extended crawler, which Google uses for “Bard and Vertex AI generative APIs.”
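Because Google-Extended is a separate user agent from Googlebot, you can keep regular search indexing while opting out of AI training. A minimal sketch of that combination looks like this (an empty Disallow line means the crawler is allowed everywhere):

# Regular search indexing can continue
User-agent: Googlebot
Disallow:

# But nothing goes to Bard or Vertex AI training
User-agent: Google-Extended
Disallow: /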
Other tactics
These rules should be enough when it comes to keeping your content out of the OpenAI and Google models. If you are determined to stop additional scraping, you will need to be more aggressive.
Block all web crawlers with your robots.txt
Your robots.txt file can tell all bots to beat it. The following snippet tells all user agents that all URLs are off limits.
# All web crawlers are disallowed
User-agent: *
Disallow: /
I don’t do this because it would remove my site from the search indexes of Bing and other search engines. (I’ve removed Google from the list of search engines since they updated their options on September 28, 2023.) I don’t want to do that. If this post could not be found in search results, you probably wouldn’t be here.
Also, as noted previously, rules in your robots.txt file will only deter the bots that have agreed to play by the rules. In my opinion, this aggressive solution limits the big tech companies that provide some value to you and does nothing to deter the bad actors we’d really like to stop.
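If you want a middle ground, one option, sketched below, is to explicitly allow the search engine bots you care about and turn everyone else away. Crawlers that honor the standard obey the most specific User-agent group that matches them, so Googlebot and Bingbot would ignore the wildcard rule here (an empty Disallow means full access):

# Search engine bots I want to keep
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

# Everyone else is turned away
User-agent: *
Disallow: /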
Require registration
Another tactic to limit the scraping of your data would be to remove the majority of your content from the open web by hiding it behind a registration page.
For example, if I were to do this, I could allow anyone to read this site’s home page, including the title of each post and a truncated blurb. But, if a person clicks on a link to read the full post, I could require user registration before giving access. This would throw up a roadblock for all users, including the bots. The bots might skip over the site’s content and move on to their next target. Real users might move along too.
I don’t employ this technique either. Requiring registration puts a burden on the reader that I don’t want to impose.
Wrapping up
So now you’ve got some tactics for limiting your content’s AI exposure.
If you’re from an LLM company, I’d love to hear some ways you can compensate creators, even small bloggers, who are willing to allow their content into your models. I know I’m not the only person interested in having more control over the inclusion of data in LLMs. The creative community wants control over our content.
I am happy with OpenAI’s first steps in giving us the option to opt out. Other LLMs now need to step up and do the same.
I expect we’ll hear more on this topic soon. For example, Google has launched the AI Web Publisher Controls Mailing List, which I recently signed up for. Will Meta join the discussion too? Will Amazon? Time will tell.
Addendum: Other bots
Evan Warner, a member of the Craft CMS community in the Craft CMS Discord group, mentioned a web crawler that I think is worth pointing out due to its expressed use of AI.
The TurnitinBot crawler scans the internet with the goal of identifying plagiarism, and it uses AI to do so. You can read more about it here. I don’t block this crawler on my site because its expressed goal, stopping plagiarism, is one I agree with. If your goal is to eliminate all AI-related crawling, though, you should be aware of it.
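If you do want to turn that crawler away as well, the rule follows the same pattern as the others in this post:

# Block Turnitin's plagiarism-detection crawler
User-agent: TurnitinBot
Disallow: /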
The reality is that there are many web crawlers. The crawler-user-agents project was recently updated to include GPTBot. There are approximately 500 web crawlers in the list. Whoa.
Since my initial posting, I’ve also received messages about CCBot, the Common Crawl bot. At least one post says its crawled data ends up in the Stable Diffusion model. Another post says the Common Crawl data also ended up in an early version of Midjourney.
To block the Common Crawl bot, add the following to your robots.txt file.
User-agent: CCBot
Disallow: /
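For reference, here’s what a robots.txt that blocks every AI-related crawler mentioned in this post would contain:

# OpenAI's model-training crawler
User-agent: GPTBot
Disallow: /

# OpenAI's crawler for ChatGPT user requests
User-agent: ChatGPT-User
Disallow: /

# Google's crawler for Bard and Vertex AI
User-agent: Google-Extended
Disallow: /

# Common Crawl, whose data feeds models like Stable Diffusion
User-agent: CCBot
Disallow: /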
What I mentioned earlier in the post is still valid: the bots that respect what’s in the robots.txt file are what I’d consider legitimate bots. The bots that actively slurp up our content against our expressed wishes are the biggest problem.
Addendum: news items
As I write this short update, it’s September 4, 2023, and I see momentum. In today’s news I read The Guardian blocks ChatGPT owner OpenAI from trawling its content. As the headline states, The Guardian is now blocking OpenAI. The article also says, “news websites now blocking the GPTBot crawler, which takes data from webpages to feed into its AI models, include CNN, Reuters, the Washington Post, Bloomberg, the New York Times and its sports site the Athletic. Other sites that have blocked GPTBot include Lonely Planet, Amazon, the job listings site Indeed, the question-and-answer site Quora, and dictionary.com.” If the big sites are blocking AI and small sites block it too, I hope we’ll see actual change in how these models are built.
As of September 11, 2023, about 35% of news organizations were blocking OpenAI.
References
- http://www.robotstxt.org/robotstxt.html
- https://platform.openai.com/docs/gptbot
- https://platform.openai.com/docs/plugins/bot
- https://developers.google.com/search/docs/crawling-indexing/robots/intro
- https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
- https://venturebeat.com/ai/openai-launches-web-crawling-gptbot-sparking-blocking-effort-by-website-owners-and-creators
- https://blog.google/technology/ai/ai-web-publisher-controls-sign-up/
- https://gizmodo.com/google-says-itll-scrape-everything-you-post-online-for-1850601486
- https://gizmodo.com/google-bard-ai-scrape-websites-data-australia-opt-out-1850720633
- https://ai.meta.com/llama/
- https://ai.meta.com/resources/models-and-libraries/llama/
- https://interaktiv.br.de/ki-trainingsdaten/en/
- https://katharinabrunner.de/2023/08/robots-txt-openais-gptbot-common-crawls-ccbot-how-to-block-ai-crawlers-from-gathering-text-and-images-from-your-website/
- https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/