SuperGeekery: A blog probably of interest only to nerds by John F Morton.

AI and your content. How to opt to opt-out.

A robot stealing an apple from an apple cart, generated by AI.

TLDR: This post will show you how to opt out of OpenAI’s large lan­guage mod­el (LLM) used for Chat­G­PT by updat­ing the robots.txt file. We’ll also dis­cuss ways to opt out of oth­er LLMs and web scrap­ers. The arti­cle starts by dis­cussing how the val­ue exchange is bro­ken between con­tent cre­ators and big tech after the intro­duc­tion of AI.

If you cre­ate con­tent for a liv­ing or just as a hob­by, and that con­tent exists on the web, you have a choice to make as of August 8, 2023.

Do you want your cre­ative out­put to be input for the big data mod­el behind OpenAI’s Chat­G­PT?

I’ll assume your answer is no, and I want to show you how to opt out by telling OpenAI’s web scraper that it is “disallowed” from scraping your site.

If you sim­ply want the instruc­tions on how to opt out, jump to that sec­tion of the post.

ChatGPT vs other AI bots

OpenAI now lets us prevent its scraping tool from adding our content to its data model. The company has given us this option only after scraping up our content, without our knowledge, in the recent past. The new settings let us opt out of future content scraping.

Ope­nAI is not the only com­pa­ny with AI tools in devel­op­ment. What about large lan­guage mod­els built by oth­er com­pa­nies? 

Google, for example, has Bard, but their blog post, “A principled approach to evolving choice and control for web content,” from July 6, 2023, says they are “kicking off a public discussion” on the topic of providing content creators some control over the use of their data. At the time I write this, there is no way to opt out for Bard.

I’ve found no information on how to limit your data from being included in Meta’s LLaMA model. The LLaMA resources page simply says that the model was pretrained on 2 trillion tokens of “publicly available online data.”

In May 2023, the Electronic Privacy Information Center noted in its report, “Generating Harms: Generative AI’s Impact & Paths Forward,” that “without meaningful data minimization or disclosure rules, companies have an incentive to collect and use increasingly more (and more sensitive) data to train AI models.”

Currently, only OpenAI gives content creators options for controlling how their content is used. Other tech firms are unlikely to use permission settings identical to OpenAI’s, so however the tools evolve, expect that you’ll need to revisit these settings in the future. I’ll discuss a couple of options that might work for other LLM bots, but there are consequences you will want to consider before using them.

Before we begin the how-to, I want to out­line why I choose to opt out of includ­ing my site’s posts in OpenAI’s LLM in the future.

We had a deal

Web scrap­ers have crawled sites since the ear­ly days of the web. Data scrap­ing and cat­a­loging were essen­tial to the web’s growth. With­out an index of the con­tent of web pages, search engines wouldn’t be able to find things. 

An unwrit­ten deal evolved. 

Cre­ators allowed and encour­aged search engine bots to index their art, writ­ing, and ideas. Big tech thrived on the data. In return, con­tent cre­ators got an audi­ence.

That’s the very rough outline of “the deal.”

You could debate whether this deal was equi­table, but it worked for quite a while.

The broken deal

Things start­ed to change in 2023.

The introduction of image generators that could conjure images from simple text descriptions, without the input of a photographer or illustrator, bordered on the supernatural. With a simple incantation, an image would appear. Soon after, ChatGPT debuted, and the magic show continued. Now a writer was seemingly no longer needed to create content.

Cre­ativ­i­ty, an essen­tial human tal­ent, was hap­pen­ing with­out a liv­ing per­son involved. Devot­ing years to refin­ing your craft and per­fect­ing your skills seemed like a poor use of your time when the machine could do cre­ative jobs faster and bet­ter than human­ly pos­si­ble. 

Was it magic or sleight of hand?

What was the spark of the cre­ativ­i­ty from which the machines con­jured their works? The cre­ative source was our work. It was the writ­ing, pho­tos, and images we’d shared on the web. This was our con­tent, silent­ly ingest­ed into big data mod­els. 

On April 19, 2023, the Washington Post published “Inside the secret list of websites that make AI like ChatGPT sound smart” by Kevin Schaul, Szu Yu Chen, and Nitasha Tiku.

Chat­bots can­not think like humans: They do not under­stand what they say. They can mim­ic human speech because the arti­fi­cial intel­li­gence that pow­ers them has ingest­ed a gar­gan­tu­an amount of text, most­ly scraped from the inter­net.

The article’s authors analyzed Google’s C4 data set, a “snapshot of the contents” used to power Google’s T5 and Facebook’s LLaMA. The article includes a search tool to check if a particular URL has been included. Of course, I wanted to check if supergeekery.com had been included. The screenshot below shows “supergeekery.com” in the 3,026,152nd spot out of 15 million.

A search for "supergeekery.com" in the Google C4 dataset shows a rank of 3,026,152.

So, yes, supergeekery.com was caught in the net. Being able to know this is a positive side effect of this data set being open source.

So, what about the OpenAI dataset? According to the article, “OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT.” But I was curious. Was my blog in there too? Since I can’t see inside the OpenAI data model, I needed to ask someone.

Who do you ask in 2023? ChatGPT.

I asked ChatGPT to “Write two paragraphs about learning the basics of Javascript in the style of John Morton from supergeekery.com.” The screenshot shows what it returned.

Screenshot of ChatGPT when prompted to "Write two paragraphs about learning the basics of Javascript in the style of John Morton from supergeekery.com." The result mimics the style of this site.

I can see some of my writ­ing style in the result, but per­haps I was read­ing too much into it. It could be spit­ting out some text about Javascript and ignor­ing the parts of my request that it didn’t know about. 

I asked the same ques­tion about an entire­ly made-up per­son and site to find out. 

I opened a new window and asked ChatGPT, “Write two paragraphs about learning the basics of Javascript in the style of Blurg Smartbugg from example.com.” (If your name is Blurg Smartbugg, please accept my apologies for using your name without your permission. Also, that’s a really unfortunate name. Damn.)

Screenshot of ChatGPT when prompted to "Write two paragraphs about learning the basics of Javascript in the style of Blurg Smartbugg from example.com."

ChatGPT is “not familiar with the specific style of Blurg Smartbugg from example.com.”

Interesting. It does know “John Morton from supergeekery.com” but it doesn’t know “Blurg Smartbugg from example.com.”

So that means my data is part of the Chat­G­PT data mod­el. 

I didn’t opt-in, but I can now opt-out

Before I go fur­ther, I want to make one point clear. I find AI excit­ing and use­ful. I use Chat­G­PT and oth­er gen­er­a­tive AI tools. I don’t want to stop their use or their devel­op­ment. BUT… I pre­fer to be asked for my per­mis­sion before my work is includ­ed in the data mod­els.

On the positive side, OpenAI is the first AI company I know of that has provided us with tools to let them know we don’t want to be included in their data model. I applaud that move, and I expect it will pressure other tech companies to consider similar steps.

OpenAI is giving us this option after scraping up a large chunk of the internet, but we can now take back some small measure of control of our content going forward.

If these opt-out tools prove pop­u­lar enough, the com­pa­nies may even con­sid­er com­pen­sat­ing con­tent cre­ators for their work. Dream big! (Can some­one use AI to make a vision board for that?)

How to opt out of OpenAI using your content

OpenAI has documentation on how to stop their web scraper, GPTBot, from cataloging your site. You do this with a small text file at the root of your domain, robots.txt.

You can learn all about this file type on the robotstxt.org site. The robots.txt file is a public statement regarding who and what may scrape your site’s content.

To imme­di­ate­ly pour some cold water on your hopes, though, I’ll also point you to Google’s page on the robots.txt file.

The instruc­tions in robots.txt files can­not enforce crawler behav­ior to your site; it’s up to the crawler to obey them. While Google­bot and oth­er respectable web crawlers obey the instruc­tions in a robots.txt file, oth­er crawlers might not.

That means the peo­ple that fol­low the rules will fol­low the rules. Sigh. 😐

Let’s shake that off and con­tin­ue on our way. 🙃

The solution I use can be found in the robots.txt for this site. I’ve got comments above the two rules that are relevant to this task.

# Don't add any content to the GPT model
User-agent: GPTBot
Disallow: /

# No ChatGPT user allowed anywhere on the site
User-agent: ChatGPT-User
Disallow: /

The first rule tells the GPT­Bot not to scrape pages from my site. The GPT­Bot is what’s slurp­ing up data for the next big iter­a­tion of the mod­el behind Chat­G­PT.

The sec­ond rule tells Chat­G­PT-User, anoth­er page scrap­ing robot from Ope­nAI, that it is also not allowed access to my site. The sce­nario where this comes into play is when a user of Chat­G­PT specif­i­cal­ly asks to sam­ple some­thing from your site.

Just add these rules to the robots.txt file at the root level of your site. If that file doesn’t exist already, just make a plain text file named robots.txt and include either or both of those rules.
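Before uploading the file, it can be reassuring to check that the rules say what you think they say. Here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `is_allowed` and the example.com URL are placeholders of mine, not anything from OpenAI, and Python's parser is only an approximation of how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The two OpenAI rules from above, as they would appear in robots.txt.
ROBOTS_TXT = """\
# Don't add any content to the GPT model
User-agent: GPTBot
Disallow: /

# No ChatGPT user allowed anywhere on the site
User-agent: ChatGPT-User
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/some-post"  # placeholder URL
    for agent in ("GPTBot", "ChatGPT-User", "Googlebot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, page) else "blocked"
        print(f"{agent}: {verdict}")
```

With these rules, the two OpenAI agents should come back as blocked while a search crawler such as Googlebot stays allowed. To test a live site instead of a string, `RobotFileParser` also offers `set_url()` and `read()`, which download the file for you.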

Opt out of Google Bard

On September 28, 2023, Google followed OpenAI’s path and gave site owners the ability to opt out of the AI models that Google is building. The documentation about the full range of Google crawlers offers the level of granularity that I hoped to see. Google lets you opt in and out of many different types of crawling. This means your site can be indexed for search without being included in their AI research.

The rule that goes in your robots.txt file looks like this:

User-agent: Google-Extended
Disallow: /

In other words, we’re telling Google-Extended, which is the crawler used for “Bard and Vertex AI generative APIs,” that it is disallowed from the whole site.
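The granularity is the point here: the rule names only Google-Extended, so Google's search crawler should be unaffected. A small sketch of that check follows, again with Python's standard-library parser. The `allowed_agents` helper and the example.com URL are assumptions of mine for illustration, and the result reflects Python's reading of the file rather than a guarantee about Google's behavior.

```python
from urllib.robotparser import RobotFileParser

# The Google-Extended rule from above.
GOOGLE_AI_OPT_OUT = """\
User-agent: Google-Extended
Disallow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user agent to whether robots_txt lets it fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    result = allowed_agents(
        GOOGLE_AI_OPT_OUT,
        ["Google-Extended", "Googlebot"],
        "https://example.com/",  # placeholder URL
    )
    for agent, allowed in result.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```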

Other tactics

These rules should be enough when it comes to con­trol­ling your data and the Ope­nAI mod­el. 

If you are determined to stop additional scraping, you will need to be more aggressive.

Block all web crawlers with your robots.txt

Your robots.txt file can tell all bots to beat it. The following code snippet tells every user agent that all URLs are disallowed.

# All web crawlers are disallowed
User-agent: *
Disallow: /

I don’t do this because it will remove your site from the search indexes of Google, Bing, and other search engines. (Google’s options from September 28, 2023 let you opt out of its AI crawling separately, but a wildcard rule like this one still applies to its search crawler.) I don’t want to do that. If this post could not be found in search results, you probably wouldn’t be here.

Also, as noted previously, rules in your robots.txt file will only deter the bots that have agreed to play by the rules. In my opinion, this aggressive solution limits the big tech companies that provide some value to you and does nothing to deter the bad ones we’d really like to stop.
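To see how blunt the wildcard is, here is a short sketch that asks Python's standard-library parser which crawlers the rule turns away. The `blocked_crawlers` helper, the agent list, and the example.com URL are illustrative assumptions of mine. It only models well-behaved crawlers, since a bot that ignores robots.txt is unaffected by any of this.

```python
from urllib.robotparser import RobotFileParser

# The block-everything rule from above.
BLOCK_EVERYONE = """\
# All web crawlers are disallowed
User-agent: *
Disallow: /
"""


def blocked_crawlers(robots_txt: str, agents: list[str],
                     url: str = "https://example.com/") -> list[str]:
    """Return the agents that robots_txt forbids from fetching url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # AI crawlers and search crawlers are treated exactly the same.
    print(blocked_crawlers(BLOCK_EVERYONE, ["GPTBot", "Googlebot", "bingbot"]))
```

Every agent in the list ends up blocked, search crawlers included, which is the trade-off described above.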

Require registration

Another tactic to limit the scraping of your data would be to remove the majority of your content from the open web by hiding it behind a registration page.

For exam­ple, if I were to do this, I could allow any­one to read this site’s home page, includ­ing the title of each post and a trun­cat­ed blurb. But, if a per­son clicks on a link to read the full post, I could require user reg­is­tra­tion before giv­ing access. This would throw up a road­block for all users, includ­ing the bots. The bots might skip over the site’s con­tent and move on to their next tar­get. Real users might move along too.

I don’t employ this tech­nique either. Requir­ing reg­is­tra­tion puts a bur­den on the read­er that I don’t want to impose. 

Wrapping up

So you’ve got some tac­tics on how to lim­it your AI expo­sure. 

If you’re from an LLM com­pa­ny, I’d love to hear some ways you can com­pen­sate cre­ators, even small blog­gers, who are will­ing to allow their con­tent into your mod­els. I know I’m not the only per­son inter­est­ed in hav­ing more con­trol over the inclu­sion of data in LLMs. The cre­ative com­mu­ni­ty wants con­trol over our con­tent.

I am hap­py with OpenAI’s first steps in giv­ing us the option to opt out. Oth­er LLMs now need to step up and do the same. 

I expect we’ll hear more on this topic soon. For example, Google has launched the AI Web Publisher Controls Mailing List. I’ve signed up recently. Will Meta join the discussion too? Will Amazon? Time will tell.

Addendum: Other bots

Evan Warner, a member of the Craft CMS community in the Craft CMS Discord group, mentioned a web crawler that I think is worth pointing out due to its expressed use in AI work.

TurnitinBot scans the internet intending to identify plagiarism. It uses AI to make this identification. I don’t block this crawler on my site because its expressed goal, stopping plagiarism, is one I agree with. If your goal is to eliminate all AI-related work, you should be aware of it.

The real­i­ty is that there are many web crawlers. The crawler-user-agents project was recent­ly updat­ed to include the GPTBot. There are approx­i­mate­ly 500 web crawlers in the list. Whoa.

Since my initial posting, I’ve also received messages about CCBot, the Common Crawl bot. At least one post says its crawled data ends up in the Stable Diffusion model. Another post says the Common Crawl data also ended up in an early version of Midjourney.

To block the Com­mon Crawl bot, you’ll add the fol­low­ing to your robots.txt file.

User-agent: CCBot
Disallow: /
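As the list of bots grows, writing each rule by hand gets tedious. One option is to generate the rules from a list of user agents. The sketch below does that for the four agents named in this post. The `build_robots_txt` helper and the `AI_CRAWLERS` list are hypothetical names of mine, and the output is meant to be merged into any robots.txt you already have rather than to replace it blindly.

```python
from urllib.robotparser import RobotFileParser

# User agents named in this post; extend the list as new crawlers appear.
AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "Google-Extended", "CCBot"]


def build_robots_txt(blocked_agents: list[str]) -> str:
    """Build robots.txt text disallowing the whole site for each agent (hypothetical helper)."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    # A blank line between groups keeps each rule visually separate.
    return "\n".join(groups)


if __name__ == "__main__":
    text = build_robots_txt(AI_CRAWLERS)
    print(text)

    # Sanity check the generated text with the standard-library parser.
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    for agent in AI_CRAWLERS + ["bingbot"]:
        allowed = parser.can_fetch(agent, "https://example.com/")  # placeholder URL
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```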

What I men­tioned ear­li­er in the post is still valid. The bots that will respect what’s in the robots.txt file are what I’d con­sid­er legit­i­mate bots. The bots that active­ly slurp up our con­tent against our expressed wish­es are the biggest prob­lem.

Addendums: news items

As I write this short update, it’s September 4, 2023, and I see momentum. In today’s news I read “The Guardian blocks ChatGPT owner OpenAI from trawling its content.” As the headline states, The Guardian is now blocking OpenAI. The article also says, “news websites now blocking the GPTBot crawler, which takes data from webpages to feed into its AI models, include CNN, Reuters, the Washington Post, Bloomberg, the New York Times and its sports site the Athletic. Other sites that have blocked GPTBot include Lonely Planet, Amazon, the job listings site Indeed, the question-and-answer site Quora, and dictionary.com.” If the big sites are blocking AI and small sites also block, I hope we’ll see actual change in how these models are built.

As of Sept 11, 2023, about 35% of news orga­ni­za­tions are now block­ing Ope­nAI. 

References