Your robots.txt Might Be Blocking AI Crawlers (and Costing You Traffic)
AI search tools can only cite your content if their bots can access it. Many websites are unknowingly blocking the crawlers that power ChatGPT, Perplexity, and other AI platforms.
A growing number of websites have added blanket AI crawler blocks to their robots.txt files over the past two years. Some did it deliberately to prevent their content from being used as training data. Others inherited the blocks from WordPress plugins, CDN defaults, or security hardening guides that treat all bots as hostile. The unintended consequence: these sites are also invisible to AI-powered search tools that could be sending them qualified traffic.
According to an eMarketer report on AI tool adoption, more than 40% of US adults used an AI assistant for information gathering in 2025, and that figure is projected to exceed 55% by the end of 2026. When someone asks ChatGPT or Perplexity a question and gets an answer that cites a source, that citation generates a real click. Blocking the crawlers that enable those citations means forfeiting that traffic entirely.
The AI Crawler Landscape
At least seven major AI crawlers are actively indexing the web right now. Each serves a different operator and a different purpose, and the distinction between “training” and “retrieval” crawlers is critical for making informed blocking decisions.
AI Crawler Reference
| Bot Name | Operator | Purpose | Sends Traffic? |
|---|---|---|---|
| GPTBot | OpenAI | Training + retrieval | Yes (ChatGPT) |
| ChatGPT-User | OpenAI | Real-time retrieval only | Yes (ChatGPT) |
| Google-Extended | Google | Gemini training | Indirectly (AI Overviews) |
| anthropic-ai | Anthropic | Training | Indirectly (Claude citations) |
| PerplexityBot | Perplexity | Real-time retrieval | Yes (inline citations) |
| Bytespider | ByteDance | Training (TikTok AI) | Minimal |
| CCBot | Common Crawl | Open dataset (used by many LLMs) | No |
In the “Sends Traffic?” column, “Yes” marks a direct traffic source, “Indirectly” marks an indirect traffic influence, and “No” or “Minimal” marks a training-only crawler with no traffic benefit.
The most important distinction in that table is between training data crawlers and retrieval crawlers. ChatGPT-User and PerplexityBot fetch your pages in real time when a user asks a question - blocking them is functionally the same as blocking Googlebot. GPTBot, Google-Extended, and CCBot crawl for model training purposes, which is a different value proposition.
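Before changing anything, it helps to know which of these crawlers already visit your site. A quick way is to scan your server access logs for their user-agent strings. Here is a minimal sketch in Python, assuming logs in the common Combined Log Format; the bot names come from the table above, and since real user-agent strings vary, substring matching is only an approximation:

```python
import re
from collections import Counter

# AI crawler tokens from the table above, matched case-insensitively
# against the User-Agent field of each log line.
AI_BOTS = ["GPTBot", "ChatGPT-User", "Google-Extended", "anthropic-ai",
           "PerplexityBot", "Bytespider", "CCBot"]

def count_ai_crawler_hits(log_lines):
    """Count requests per AI crawler in Combined Log Format lines.

    The User-Agent is the last double-quoted field on each line.
    """
    hits = Counter()
    for line in log_lines:
        quoted = re.findall(r'"([^"]*)"', line)
        if not quoted:
            continue
        user_agent = quoted[-1].lower()
        for bot in AI_BOTS:
            if bot.lower() in user_agent:
                hits[bot] += 1
    return hits

# Two illustrative log lines (hypothetical IPs and paths)
sample = [
    '1.2.3.4 - - [01/Mar/2025:10:00:00 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '5.6.7.8 - - [01/Mar/2025:10:01:00 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"',
]
print(count_ai_crawler_hits(sample))
```

Run it over a few days of logs and you get a quick census of which AI crawlers are already requesting your pages.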
How to Check What You're Blocking
Open your site's robots.txt file - it lives at yourdomain.com/robots.txt - and search for any of the bot names listed above. Common patterns that block AI crawlers look like this:
```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```
Some WordPress security plugins and CDN providers add these blocks automatically. Cloudflare's “AI Scrapers and Crawlers” toggle, for instance, blocks a long list of AI bots when enabled. If you turned that on without checking the specifics, you may be blocking traffic-generating retrieval bots alongside the training-only crawlers.
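You can also check programmatically with Python's standard-library robots.txt parser. A minimal sketch - the `robots_txt` content below is a hypothetical example, and in practice you would fetch your own site's file; note that `urllib.robotparser`'s user-agent matching only approximates how individual crawlers interpret the file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice, fetch
# https://yourdomain.com/robots.txt and use that text instead.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

AI_BOTS = ["GPTBot", "ChatGPT-User", "Google-Extended", "anthropic-ai",
           "PerplexityBot", "Bytespider", "CCBot"]

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Report which AI crawlers are blocked from the site root
blocked = [bot for bot in AI_BOTS
           if not parser.can_fetch(bot, "https://example.com/")]
print("Blocked AI crawlers:", blocked)
```

For the hypothetical file above, this reports GPTBot and Google-Extended as blocked while the retrieval bots remain allowed.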
I keep seeing sites that blocked GPTBot months ago wondering why they don't show up in ChatGPT search results. If you block the bot, you block the citations. You can't have it both ways.
The Training vs Retrieval Tradeoff
Many publishers object to AI companies using their content to train LLMs without compensation - a legitimate concern that has driven multiple lawsuits. Blocking training crawlers like CCBot, Google-Extended, and anthropic-ai is a reasonable choice if you want to withhold your content from future model training.
Blocking retrieval crawlers, though, carries a direct cost. When you block ChatGPT-User or PerplexityBot, your content cannot appear as a cited source when users ask questions on those platforms. Glenn Gabe, an SEO consultant who has tracked AI traffic patterns across client portfolios, describes the dynamic bluntly.
“Blocking retrieval bots is like deindexing yourself from a new search engine. You can absolutely do it, but understand that you are giving up a growing traffic channel to make a philosophical point.”
The practical solution for most sites: block training bots, allow retrieval bots. This approach lets AI search tools cite your content while preventing that content from being absorbed into training datasets.
A Balanced robots.txt Configuration
The following configuration blocks training-only crawlers while preserving access for bots that generate organic search traffic from AI platforms:
```
# Allow AI retrieval bots (these generate traffic)
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Block training-only crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: anthropic-ai
Disallow: /
```
Note that each bot gets its own user-agent group: ChatGPT-User and PerplexityBot receive explicit Allow directives, while GPTBot is blocked in a separate group. Crawlers obey the group matching their own user-agent name (the order of groups in the file does not matter), so this configuration prevents OpenAI from using your content for model training while still allowing ChatGPT's real-time search feature to access and cite your pages.
Important nuance: GPTBot and ChatGPT-User are different bots. Blocking GPTBot stops training. Blocking ChatGPT-User stops real-time search citations. Most sites should block the first and allow the second.
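You can sanity-check that the balanced configuration treats the two OpenAI bots differently using the same standard-library parser. Again, this is an approximation of real crawler behavior, run against an assumed example URL:

```python
from urllib.robotparser import RobotFileParser

# The balanced configuration from above
balanced = """\
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: anthropic-ai
Disallow: /
"""

parser = RobotFileParser()
parser.parse(balanced.splitlines())

page = "https://example.com/blog/post"  # hypothetical URL
assert parser.can_fetch("ChatGPT-User", page)   # retrieval bot: allowed
assert parser.can_fetch("PerplexityBot", page)  # retrieval bot: allowed
assert not parser.can_fetch("GPTBot", page)     # training bot: blocked
print("Balanced config behaves as intended")
```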
Monitoring the Impact
After updating your robots.txt, the next step is measuring whether AI platforms are actually sending traffic. Standard GA4 reporting does not break out AI-referred visits by default - ChatGPT traffic appears under referral or sometimes direct traffic, mixed in with everything else.
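If you export raw referrer data from your analytics, a simple hostname lookup can separate AI-platform visits from the rest. A sketch with an illustrative - not exhaustive, and possibly out of date - list of referrer domains:

```python
from urllib.parse import urlparse

# Referrer hostnames commonly associated with AI platforms.
# This mapping is illustrative only; real referrer domains change over time.
AI_REFERRERS = {
    "chatgpt.com": "ChatGPT",
    "chat.openai.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "www.perplexity.ai": "Perplexity",
    "claude.ai": "Claude",
    "gemini.google.com": "Gemini",
    "copilot.microsoft.com": "Copilot",
}

def classify_referrer(referrer_url):
    """Map a referrer URL to an AI platform name, or None if unmatched."""
    host = urlparse(referrer_url).netloc.lower()
    return AI_REFERRERS.get(host)

print(classify_referrer("https://chatgpt.com/"))                  # ChatGPT
print(classify_referrer("https://www.perplexity.ai/search?q=x"))  # Perplexity
print(classify_referrer("https://www.google.com/"))               # None
```

Aggregating the classified sessions by platform gives you the same per-tool breakdown described above, even without a dedicated analytics feature.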
MeasureBoard's AI Traffic Intelligence feature automatically identifies visits from ChatGPT, Perplexity, Claude, Gemini, Copilot, and other AI platforms, separating them into a dedicated view. You can see which AI tools are sending the most sessions, which of your pages are being cited, and whether your AI traffic share is growing month over month.
Pair that with the AI Rank Tracker to see whether AI platforms are actually mentioning your brand when users ask questions in your space. If you unblock retrieval crawlers and your mention frequency increases in the weeks that follow, you have a direct signal that the robots.txt change is working.
What Happens Next
The robots.txt standard was designed in 1994 for a web with a handful of search engine crawlers. It was never intended to handle the nuance of distinguishing between training and retrieval, between scraping and citing. Several proposals for more granular machine-readable permissions are circulating - including TDMRep headers and AI-specific meta tags - but none have achieved broad adoption yet.
In the meantime, robots.txt remains the only widely supported mechanism for controlling AI bot access. Reviewing yours takes five minutes. The traffic implications of getting it wrong will compound for years as AI Overviews and AI-powered search become a larger share of how people find information online.