How to fix robots.txt blocking AI crawlers

The Brimm team · grounded in the docs

If your pages never get cited in AI answers, the first thing to check is your robots.txt. There are two kinds of AI bots, and most owners block both by accident: the retrieval bots that fetch pages to cite, and the training bots that collect text for models. Per OpenAI's bot documentation, these are separate user-agents you can control independently. The rule is plain: “if you block the crawler, you block the citation.” Allow the search bots, decide on training separately, and verify.

The symptom: you are nowhere in the answers

You write good pages. A person can find them in Google. But when you ask ChatGPT, Perplexity, or Google's AI Overviews about your topic, your site is never quoted. A competitor with a thinner page gets named instead. You check your content, your schema, your speed, and none of it explains why you are invisible.

Before you rewrite anything, look at one file. Open yoursite.com/robots.txt in a browser and read it. In a large share of the sites we audit, the reason a page is never cited is not the page at all. It is a single line in robots.txt telling the AI search crawlers to stay out. The page is fine. The door is locked.

The cause: blocking "AI" blocks the wrong thing

Two patterns cause almost all of these failures, and both come from good intentions.

The first is the blanket disallow. A rule like User-agent: * followed by Disallow: / tells every well-behaved bot to skip the whole site. People leave it in place from a staging setup, or a plugin writes it, and it quietly excludes the citation crawlers along with everyone else.
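The effect of a blanket disallow can be reproduced locally. The sketch below uses Python's standard-library urllib.robotparser; the helper name allowed and the example.com URL are illustrative assumptions, and this parser is only one implementation, so treat its verdicts as a sanity check rather than a statement of how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# A leftover staging rule: every bot, whole site.
BLANKET = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, bot: str, url: str = "https://example.com/page") -> bool:
    """Return True if `bot` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


# None of the citation crawlers has its own group, so the wildcard applies to all.
for bot in ("OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "Googlebot"):
    print(bot, "allowed" if allowed(BLANKET, bot) else "blocked")
```

Every bot in the loop is reported as blocked, because none has a group of its own and the wildcard rule covers them all.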

The second is the deliberate AI block. An owner reads that AI companies scrape sites to train models, decides to "keep the AI out," and blocks every AI user-agent they can find. The problem is that the same companies run two different bots for two different jobs, and blocking all of them throws away the citations to stop the training.

Retrieval and search bots fetch your page so an AI answer can quote and link it. These are the ones that earn you a citation: OAI-SearchBot (OpenAI), PerplexityBot (Perplexity), and Claude-SearchBot (Anthropic). Googlebot belongs here too, because Google AI Overviews are built on the normal Google index.
Training bots collect text to train future models. They have nothing to do with whether you get cited: GPTBot (OpenAI), ClaudeBot (Anthropic), and Google-Extended (Google).

The key fact is that these are independent decisions. You can welcome the bots that cite you and refuse the bots that train on you, all in the same file. Blocking training is a real choice some owners want to make. Blocking retrieval is almost never what anyone means to do, and it is the line that removes you from the answers.
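The independence of the two decisions can be sketched as a small generator. This is a minimal illustration, not a tool from the article: build_robots_txt and its block_training switch are hypothetical names, and the bot lists simply repeat the user-agents named above.

```python
# Bot names and roles as listed in this article; adjust to the bots you care about.
SEARCH_BOTS = ("OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "Googlebot")
TRAINING_BOTS = ("GPTBot", "ClaudeBot", "Google-Extended")


def build_robots_txt(block_training: bool) -> str:
    """Build robots.txt text that always admits the search bots.

    `block_training` is the only switch: it adds or omits Disallow groups
    for the training bots and never touches the search-bot groups.
    """
    lines = []
    for bot in SEARCH_BOTS:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    if block_training:
        for bot in TRAINING_BOTS:
            lines += [f"User-agent: {bot}", "Disallow: /", ""]
    # Default group for every bot not named above.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


print(build_robots_txt(block_training=True))
```

Flipping block_training changes only the training groups; the search-bot groups are emitted identically in both cases, which is the whole point of treating the two choices separately.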

The fix: allow the search bots, choose on training

Open your robots.txt and make the split explicit. Allow the retrieval and search crawlers so they can read and quote you. Allow Googlebot so you stay eligible for AI Overviews. Then, only if you specifically want to stay out of model training, block the training bots. Here is a copy-paste starting point:

# Let the AI search crawlers fetch and cite you
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Keep normal Google access (AI Overviews ride the index)
User-agent: Googlebot
Allow: /

# Optional: stay out of model training only
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Make sure no blanket rule locks everyone out
User-agent: *
Allow: /

The non-negotiable part is allowing the search bots. If you want to be in the answers, the four Allow: / lines for OAI-SearchBot, PerplexityBot, Claude-SearchBot, and Googlebot are what put you there. The three training blocks are entirely optional. Delete them if you are happy to be in training data, keep them if you are not. Either way, your citation eligibility is unchanged, because retrieval and training are separate user-agents.

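One caution while you edit. The last block above sets User-agent: * Allow: / so that any bot without a block of its own is let in by default. Crawlers that follow the standard apply the most specific group that names them and fall back to the wildcard only when none does, so the named blocks keep working either way, but a leftover Disallow: / under User-agent: * would still lock out every bot you did not name. If a specific bot needs a rule, give it its own block rather than tightening the wildcard, because a single Disallow: / under User-agent: * is the exact mistake that started this.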
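How a named block and the wildcard interact can be checked with Python's standard-library parser. A minimal sketch: the MIXED text is a made-up example, not a recommended file, and urllib.robotparser is one implementation among many, so individual crawlers may differ on edge cases.

```python
from urllib.robotparser import RobotFileParser

# One bot has its own group; everyone else falls through to the wildcard.
MIXED = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(MIXED.splitlines())

# The named group is used for the bot it names.
named = parser.can_fetch("OAI-SearchBot", "https://example.com/page")
# A bot with no group of its own is governed by the wildcard.
unnamed = parser.can_fetch("PerplexityBot", "https://example.com/page")

print("OAI-SearchBot:", "allowed" if named else "blocked")
print("PerplexityBot:", "allowed" if unnamed else "blocked")
```

In this parser the named bot is allowed and the unnamed one is blocked, which is why a tightened wildcard silently shuts out every crawler you forgot to list.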

Verify: read it the way a bot does

A change you cannot confirm is not a fix. Once you have saved the file, check it the same way a crawler would:

Fetch yoursite.com/robots.txt directly in a browser and confirm the file you see is the file you edited, not a cached or plugin-generated version.
Read each user-agent block in turn. For every search bot you care about, the matching rule must be Allow: / or no disallow at all. For every training bot, confirm the rule says exactly what you intended.
Confirm there is no stray User-agent: * Disallow: / anywhere in the file. Its position does not matter: wherever it sits, it blocks every bot that lacks a block of its own.
Re-run an audit so a tool reads the live file and reports each bot's verdict back to you in plain language.
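The checklist above can be sketched as a small script. This is an illustrative sketch, not Brimm's audit: audit, report, and fetch_robots_txt are hypothetical helper names, the bot lists repeat the user-agents named in this article, and the verdicts come from Python's urllib.robotparser, which may not match every crawler's own parser.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

SEARCH_BOTS = ("OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "Googlebot")
TRAINING_BOTS = ("GPTBot", "ClaudeBot", "Google-Extended")


def fetch_robots_txt(site: str) -> str:
    """Download the live robots.txt for `site` (e.g. 'https://example.com')."""
    with urlopen(site.rstrip("/") + "/robots.txt", timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def audit(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Map each known bot to True (may fetch `url`) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in SEARCH_BOTS + TRAINING_BOTS}


def report(verdicts: dict) -> list:
    """Turn the verdicts into one plain-language line per bot."""
    lines = []
    for bot, ok in verdicts.items():
        kind = "search" if bot in SEARCH_BOTS else "training"
        lines.append(f"{bot} ({kind}): {'allowed' if ok else 'blocked'}")
    return lines


if __name__ == "__main__":
    # Pasted text is used here; swap in fetch_robots_txt(...) to read a live file.
    sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
    print("\n".join(report(audit(sample))))
```

Any search bot reported as blocked is the line to fix first; a training bot's verdict only needs to match the choice you made.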

One thing to keep in perspective: robots.txt is a public, voluntary standard, not access control. It tells well-behaved bots what you would prefer, and the major search and AI crawlers honor it, which is exactly why a wrong line here is so costly. It is not a security boundary, so never use it to hide private pages. Use it to invite the bots that cite you and to make a clear, deliberate choice about the bots that train on you.

Check your own file

You can read all of this by hand, or you can paste your link into Brimm and see in about 30 seconds which AI crawlers your robots.txt lets in and which it locks out, named bot by named bot. We read the live file the way the engines do and print the failures in fix order. Start from the fix library for related issues, learn the bigger picture in what AEO actually is, or run the focused GEO audit when you want the crawler-access view on its own. When you are ready for the full read, open the audit.