Technical guide

GPTBot and robots.txt: allow or block the AI crawlers

The Brimm team · grounded in the docs

The trick is that "the AI crawlers" are not one thing. OpenAI alone runs three separate bots, and per OpenAI's bot documentation they do different jobs: one trains models, one fetches pages to cite, one fetches links a user opens. So “you can allow OAI-SearchBot while disallowing GPTBot for training.” Training access and citation access are separate decisions in your robots.txt, and blocking everything to keep AI out also removes you from AI answers.

Two kinds of bot, two kinds of decision

Most major AI companies now split their crawlers into two camps, and the split is the whole point of this page. Training crawlers collect your content to train future models. Retrieval crawlers, sometimes called search or user bots, fetch your page live so the assistant can quote it in an answer right now. They use different user-agent names, and your robots.txt can treat them differently.

That gives you two independent decisions. Do you want your content used to train models? And do you want your content cited in AI answers? You can say yes to one and no to the other. The mistake we catch constantly is owners who block all of it to "keep AI out," not realizing they have also blocked the bot that would have put them in the answer box.

The crawlers, by company

Here is the current map. The names matter, because robots.txt groups its rules by user-agent token. Matching is case-insensitive, but a misspelled token matches nothing.

OpenAI runs three. GPTBot collects content for model training. OAI-SearchBot fetches pages to cite in ChatGPT search results. ChatGPT-User fetches a page when a user explicitly asks ChatGPT to open a link. You can allow OAI-SearchBot and ChatGPT-User for citations while disallowing GPTBot to stay out of training.

Anthropic runs the same shape. ClaudeBot is the training crawler. Claude-SearchBot retrieves pages so Claude can answer with current information. Claude-User fetches a page a user pointed Claude at. All three are controlled independently in robots.txt.

Perplexity runs PerplexityBot for indexing and citations, and Perplexity-User for fetches triggered by a user's question. If you want to appear in Perplexity answers, PerplexityBot needs access.

Google is the one that trips people up. Googlebot powers both classic Search and AI Overviews. Block it and you disappear from both, so do not block it if you want to appear. Google's separate control token, Google-Extended, opts your content out of Gemini and Vertex model training and grounding. It does not affect Search ranking or indexing. So a Disallow: / rule under User-agent: Google-Extended keeps you in Search while keeping you out of Gemini training.

Bing runs bingbot, which also powers Microsoft Copilot. The same crawler that indexes you for Bing feeds Copilot's answers.
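The map above can be written down as data, which makes the two decisions explicit. This is a minimal sketch, not anyone's official tooling: the CRAWLERS table and the build_robots_txt function are names we made up for illustration, the roles are the ones described in this guide, and Googlebot and bingbot are left out on purpose because they also power classic search.

```python
# Crawler tokens and roles as described in this guide. Verify them against
# each vendor's current documentation before relying on this table.
CRAWLERS = {
    "GPTBot": ("OpenAI", "training"),
    "OAI-SearchBot": ("OpenAI", "retrieval"),
    "ChatGPT-User": ("OpenAI", "retrieval"),
    "ClaudeBot": ("Anthropic", "training"),
    "Claude-SearchBot": ("Anthropic", "retrieval"),
    "Claude-User": ("Anthropic", "retrieval"),
    "PerplexityBot": ("Perplexity", "retrieval"),
    "Perplexity-User": ("Perplexity", "retrieval"),
    "Google-Extended": ("Google", "training"),
}


def build_robots_txt(allow_citation: bool, allow_training: bool) -> str:
    """Render one robots.txt group per token from the two decisions."""
    decisions = {"retrieval": allow_citation, "training": allow_training}
    groups = []
    for token, (_company, role) in CRAWLERS.items():
        rule = "Allow: /" if decisions[role] else "Disallow: /"
        groups.append(f"User-agent: {token}\n{rule}")
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # The common choice: be cited, stay out of training.
    print(build_robots_txt(allow_citation=True, allow_training=False))
```

Keeping the role next to each token means the training decision and the citation decision are two separate arguments, so one cannot silently flip the other.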

A robots.txt you can copy

This example does what most owners actually want: it lets the retrieval and search bots in so you can be cited, and keeps the training bots out. Edit it to your own preference. The comments mark the two decisions.

# --- Allow the bots that CITE you in AI answers ---
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

# --- Optional: keep your content OUT of model training ---
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Note: do NOT disallow Googlebot or bingbot if you want
# to appear in Google AI Overviews or Microsoft Copilot.

If you want the opposite, to be in both training and citations, delete the second block (the GPTBot, ClaudeBot, and Google-Extended groups). A crawler that no group matches is allowed by default. If you want to be invisible to everything, disallow them all, but understand that this also removes you from every AI answer, not only from training sets.
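Before you publish a file like the one above, you can check it locally. The sketch below uses Python's standard urllib.robotparser to ask, token by token, whether a fetch of the site root would be allowed. The check_access helper and the example.com URL are our own placeholders. Note also that the standard-library parser uses simple first-match rules, so treat it as an approximation of how any given crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The example file from this guide, without the comment lines.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def check_access(robots_txt, tokens, url="https://example.com/"):
    """Return {token: True/False} for whether each token may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


if __name__ == "__main__":
    tokens = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
              "Claude-User", "PerplexityBot", "GPTBot", "ClaudeBot",
              "Google-Extended", "Googlebot", "bingbot"]
    for token, allowed in check_access(ROBOTS_TXT, tokens).items():
        print(f"{token:18} {'allowed' if allowed else 'blocked'}")
```

Googlebot and bingbot come back as allowed here because no group names them, which is what the note in the file asks for.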

The one line that decides whether you get cited

To be quoted in AI answers, the retrieval and search bots have to reach you. That is the line that matters. To stay out of model training, the training bots have to be blocked. Those are two separate switches, and the most expensive accident is flipping the wrong one. We see sites that blanket-blocked OpenAI to protect their writing from training, and in doing so blocked OAI-SearchBot too, which is why they never show up when someone asks ChatGPT about their topic.
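Here is what that accident can look like in practice. This is a sketch under assumptions: the two robots.txt snippets are hypothetical examples of a blanket OpenAI block and a targeted one, and the may_fetch helper is ours. It uses Python's standard urllib.robotparser, which approximates how a crawler reads the file but is not guaranteed to be identical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical "protect my writing" file that names every OpenAI bot.
BLANKET_BLOCK = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
Disallow: /
"""

# Targeted version: only the training crawler is disallowed.
TRAINING_ONLY_BLOCK = """\
User-agent: GPTBot
Disallow: /
"""


def may_fetch(robots_txt, token, url="https://example.com/"):
    """True if the given user-agent token may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    for name, text in [("blanket", BLANKET_BLOCK),
                       ("training only", TRAINING_ONLY_BLOCK)]:
        print(name,
              "| GPTBot:", may_fetch(text, "GPTBot"),
              "| OAI-SearchBot:", may_fetch(text, "OAI-SearchBot"))
```

Both files keep GPTBot out. Only the second one still lets OAI-SearchBot reach the page.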

So before you change anything, decide the two questions on purpose. Most businesses that publish to be found want the search and user bots allowed. The training decision is a values call, and either answer is fine, as long as it is the answer you meant.

One honest caveat about robots.txt

The robots.txt standard is public and voluntary. Well-behaved crawlers from OpenAI, Anthropic, Perplexity, Google, and Bing read it and honor it. But it is a request, not a lock. It is not access control, and it cannot stop a crawler that chooses to ignore it. If your goal is to truly prevent a page from being fetched, robots.txt is the wrong tool. Use authentication or server-side blocking instead. For deciding which reputable AI bots may use your public content, robots.txt is exactly the right tool, and it is the one these companies document and respect.
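To make the difference concrete, here is a rough sketch of server-side blocking as a small WSGI wrapper that answers 403 when the User-Agent header contains a listed token. Everything in it is an assumption for illustration: the BLOCKED_TOKENS list, the block_user_agents function, and the hello_app stand-in are ours. User-agent strings can be spoofed, so this is still weaker than authentication, and many sites would put a rule like this in their web server or CDN rather than in application code.

```python
# Hypothetical deny-list for this sketch. Adjust to your own policy.
BLOCKED_TOKENS = ("gptbot", "claudebot")


def block_user_agents(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app so requests from listed user agents get a 403."""
    blocked_lower = tuple(token.lower() for token in blocked)

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_lower):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Not on the deny-list: hand the request to the real application.
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """Stand-in for your real site."""
    body = b"Hello"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


application = block_user_agents(hello_app)
```

Unlike a robots.txt rule, this refuses the request itself, whether or not the client chose to read your robots.txt first.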

Check what the bots actually see

You can read your own robots.txt by hand, or you can paste your link into Brimm and we will tell you which of these crawlers your file currently allows and which it blocks, mapped to what each one actually does. We read your site the way the engines do, so you can see in plain language whether you are open to citations, closed to training, or accidentally locked out of both. Then we print the fixes in order.