Your robots.txt Might Be Blocking AI Crawlers (and Costing You Traffic)
AI search tools can only cite your content if their bots can access it. Many websites are unknowingly blocking the crawlers that power ChatGPT, Perplexity, and other AI platforms.
A growing number of websites have added blanket AI crawler blocks to their robots.txt files over the past two years. Some did it deliberately to prevent their content from being used as training data. Others inherited the blocks from WordPress plugins, CDN defaults, or security hardening guides that treat all bots as hostile. The unintended consequence: these sites are also invisible to AI-powered search tools that could be sending them qualified traffic.
According to an eMarketer report on AI tool adoption, more than 40% of US adults used an AI assistant for information gathering in 2025, and that figure is projected to exceed 55% by the end of 2026. When someone asks ChatGPT or Perplexity a question and gets an answer that cites a source, that citation generates a real click. Blocking the crawlers that enable those citations means forfeiting that traffic entirely.
The AI Crawler Landscape
At least seven major AI crawlers are actively indexing the web right now. Each serves a different operator and a different purpose, and the distinction between “training” and “retrieval” crawlers is critical for making informed blocking decisions.
AI Crawler Reference
| Bot Name | Operator | Purpose | Sends Traffic? |
|---|---|---|---|
| GPTBot | OpenAI | Training + retrieval | Yes (ChatGPT) |
| ChatGPT-User | OpenAI | Real-time retrieval only | Yes (ChatGPT) |
| Google-Extended | Google | Gemini training | Indirectly (AI Overviews) |
| anthropic-ai | Anthropic | Training | Indirectly (Claude citations) |
| PerplexityBot | Perplexity | Real-time retrieval | Yes (inline citations) |
| Bytespider | ByteDance | Training (TikTok AI) | Minimal |
| CCBot | Common Crawl | Open dataset (used by many LLMs) | No |
In the “Sends Traffic?” column, “Yes” marks a direct traffic source, “Indirectly” marks an indirect traffic influence, and “No” or “Minimal” marks a training-only crawler with no traffic benefit.
The most important distinction in that table is between training data crawlers and retrieval crawlers. ChatGPT-User and PerplexityBot fetch your pages in real time when a user asks a question - blocking them is functionally the same as blocking Googlebot. GPTBot, Google-Extended, and CCBot crawl for model training purposes, which is a different value proposition.
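Before changing anything, it helps to know which of these crawlers already visit your site. A quick way is to scan your server access logs for their user-agent strings. Here is a minimal sketch in Python, assuming logs in the common Combined Log Format; the bot names come from the table above, and since real user-agent strings vary, substring matching is only an approximation:

```python
import re
from collections import Counter

# AI crawler tokens from the table above, matched case-insensitively
# against the User-Agent field of each log line.
AI_BOTS = ["GPTBot", "ChatGPT-User", "Google-Extended", "anthropic-ai",
           "PerplexityBot", "Bytespider", "CCBot"]

def count_ai_crawler_hits(log_lines):
    """Count requests per AI crawler in Combined Log Format lines.

    The User-Agent is the last double-quoted field on each line.
    """
    hits = Counter()
    for line in log_lines:
        quoted = re.findall(r'"([^"]*)"', line)
        if not quoted:
            continue
        user_agent = quoted[-1].lower()
        for bot in AI_BOTS:
            if bot.lower() in user_agent:
                hits[bot] += 1
    return hits

# Two illustrative log lines (hypothetical IPs and paths)
sample = [
    '1.2.3.4 - - [01/Mar/2025:10:00:00 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '5.6.7.8 - - [01/Mar/2025:10:01:00 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"',
]
print(count_ai_crawler_hits(sample))
```

Run it over a few days of logs and you get a quick census of which AI crawlers are already requesting your pages.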
How to Check What You're Blocking
Open your site's robots.txt file - it lives at yourdomain.com/robots.txt - and search for any of the bot names listed above. Common patterns that block AI crawlers look like this:
```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```
Some WordPress security plugins and CDN providers add these blocks automatically. Cloudflare's “AI Scrapers and Crawlers” toggle, for instance, blocks a long list of AI bots when enabled. If you turned that on without checking the specifics, you may be blocking traffic-generating retrieval bots alongside the training-only crawlers.
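You can also check programmatically with Python's standard-library robots.txt parser. A minimal sketch - the `robots_txt` content below is a hypothetical example, and in practice you would fetch your own site's file; note that `urllib.robotparser`'s user-agent matching only approximates how individual crawlers interpret the file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice, fetch
# https://yourdomain.com/robots.txt and use that text instead.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

AI_BOTS = ["GPTBot", "ChatGPT-User", "Google-Extended", "anthropic-ai",
           "PerplexityBot", "Bytespider", "CCBot"]

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Report which AI crawlers are blocked from the site root
blocked = [bot for bot in AI_BOTS
           if not parser.can_fetch(bot, "https://example.com/")]
print("Blocked AI crawlers:", blocked)
```

For the hypothetical file above, this reports GPTBot and Google-Extended as blocked while the retrieval bots remain allowed.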
I keep seeing sites that blocked GPTBot months ago wondering why they don't show up in ChatGPT search results. If you block the bot, you block the citations. You can't have it both ways.
The Training vs Retrieval Tradeoff
Many publishers object to AI companies using their content to train LLMs without compensation - a legitimate concern that has driven multiple lawsuits. Blocking training crawlers like CCBot, Google-Extended, and anthropic-ai is a reasonable choice if you want to withhold your content from future model training.
Blocking retrieval crawlers, though, carries a direct cost. When you block ChatGPT-User or PerplexityBot, your content cannot appear as a cited source when users ask questions on those platforms. Glenn Gabe, an SEO consultant who has tracked AI traffic patterns across client portfolios, describes the dynamic bluntly.
“Blocking retrieval bots is like deindexing yourself from a new search engine. You can absolutely do it, but understand that you are giving up a growing traffic channel to make a philosophical point.”
The practical solution for most sites: block training bots, allow retrieval bots. This approach lets AI search tools cite your content while preventing that content from being absorbed into training datasets.
A Balanced robots.txt Configuration
The following configuration blocks training-only crawlers while preserving access for bots that generate organic search traffic from AI platforms:
```
# Allow AI retrieval bots (these generate traffic)
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Block training-only crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: anthropic-ai
Disallow: /
```
Note that each bot gets its own user-agent group: ChatGPT-User and PerplexityBot receive explicit Allow directives, while GPTBot is blocked in a separate group. Crawlers obey the group matching their own user-agent name (the order of groups in the file does not matter), so this configuration prevents OpenAI from using your content for model training while still allowing ChatGPT's real-time search feature to access and cite your pages.
Important nuance: GPTBot and ChatGPT-User are different bots. Blocking GPTBot stops training. Blocking ChatGPT-User stops real-time search citations. Most sites should block the first and allow the second.
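You can sanity-check that the balanced configuration treats the two OpenAI bots differently using the same standard-library parser. Again, this is an approximation of real crawler behavior, run against an assumed example URL:

```python
from urllib.robotparser import RobotFileParser

# The balanced configuration from above
balanced = """\
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: anthropic-ai
Disallow: /
"""

parser = RobotFileParser()
parser.parse(balanced.splitlines())

page = "https://example.com/blog/post"  # hypothetical URL
assert parser.can_fetch("ChatGPT-User", page)   # retrieval bot: allowed
assert parser.can_fetch("PerplexityBot", page)  # retrieval bot: allowed
assert not parser.can_fetch("GPTBot", page)     # training bot: blocked
print("Balanced config behaves as intended")
```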
Monitoring the Impact
After updating your robots.txt, the next step is measuring whether AI platforms are actually sending traffic. Standard GA4 reporting does not break out AI-referred visits by default - ChatGPT traffic appears under referral or sometimes direct traffic, mixed in with everything else.
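If you export raw referrer data from your analytics, a simple hostname lookup can separate AI-platform visits from the rest. A sketch with an illustrative - not exhaustive, and possibly out of date - list of referrer domains:

```python
from urllib.parse import urlparse

# Referrer hostnames commonly associated with AI platforms.
# This mapping is illustrative only; real referrer domains change over time.
AI_REFERRERS = {
    "chatgpt.com": "ChatGPT",
    "chat.openai.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "www.perplexity.ai": "Perplexity",
    "claude.ai": "Claude",
    "gemini.google.com": "Gemini",
    "copilot.microsoft.com": "Copilot",
}

def classify_referrer(referrer_url):
    """Map a referrer URL to an AI platform name, or None if unmatched."""
    host = urlparse(referrer_url).netloc.lower()
    return AI_REFERRERS.get(host)

print(classify_referrer("https://chatgpt.com/"))                  # ChatGPT
print(classify_referrer("https://www.perplexity.ai/search?q=x"))  # Perplexity
print(classify_referrer("https://www.google.com/"))               # None
```

Aggregating the classified sessions by platform gives you the same per-tool breakdown described above, even without a dedicated analytics feature.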
MeasureBoard's AI Traffic Intelligence feature automatically identifies visits from ChatGPT, Perplexity, Claude, Gemini, Copilot, and other AI platforms, separating them into a dedicated view. You can see which AI tools are sending the most sessions, which of your pages are being cited, and whether your AI traffic share is growing month over month.
Pair that with the AI Rank Tracker to see whether AI platforms are actually mentioning your brand when users ask questions in your space. If you unblock retrieval crawlers and your mention frequency increases in the weeks that follow, you have a direct signal that the robots.txt change is working.
What Happens Next
The robots.txt standard was designed in 1994 for a web with a handful of search engine crawlers. It was never intended to handle the nuance of distinguishing between training and retrieval, between scraping and citing. Several proposals for more granular machine-readable permissions are circulating - including TDMRep headers and AI-specific meta tags - but none have achieved broad adoption yet.
In the meantime, robots.txt remains the only widely supported mechanism for controlling AI bot access. Reviewing yours takes five minutes. The traffic implications of getting it wrong will compound for years as AI Overviews and AI-powered search become a larger share of how people find information online.