
How to allow AI search crawlers to read your site

The Brimm team · grounded in the docs

An AI search engine cites you only if its crawler can fetch your page. There is more than one way to block it: robots.txt is the obvious gate, but a CDN, a WAF, or a server returning 403 will stop the bot before it ever reads a rule. See Google's robots.txt documentation for how the first gate works. The principle is plain: “a page a crawler can't reach is a page an AI can't quote.” Find the blocker, allow the retrieval bots, and verify with a real fetch.

The symptom: you are invisible to AI, and nothing looks wrong

Your site loads fine in a browser. Your content is good. But you never appear in ChatGPT search, Perplexity, or Google AI Overviews, and a competitor with a thinner page does. The page looks healthy to you because you are a human with a browser. The crawler is not. Somewhere between the AI engine and your HTML, a request is being refused, and you cannot see it from the front door.

Of the failures owners cannot diagnose alone, this is the one we catch most often. The block is real, it is silent, and it almost never shows up in the place people look first. To fix it you have to stop thinking like a visitor and start thinking like a request that arrives wearing a bot name.

The cause: robots.txt is only one of several gates

People assume that if robots.txt allows the bot, the bot can read the site. That is half the picture. A request from a crawler passes through several layers, and any one of them can refuse it before your page is ever served:

robots.txt tells well-behaved bots what they may fetch. A wildcard Disallow, or a block aimed at AI, will keep the retrieval crawlers out.
A CDN or WAF sits in front of your origin. Cloudflare's Bot Fight Mode, and similar tools at other providers, challenge or drop traffic from non-browser user-agents. The AI crawler gets a challenge page it cannot solve, not your content.
The server itself can return a 403 Forbidden or 429 Too Many Requests to user-agents it does not recognize, often from a security plugin or a hand-written rule.
Rate limiting that is too aggressive throttles a crawler into errors, so it gives up before it has fetched enough to cite you.
IP allowlists, login walls, and geoblocking each refuse a request that does not come from an approved address, an authenticated session, or an allowed country. A crawler has none of those.

The retrieval bots that matter here are the ones that fetch a page to quote it in an answer: OAI-SearchBot for ChatGPT search, PerplexityBot for Perplexity, Claude-SearchBot for Claude, and Googlebot, which AI Overviews depends on. These are distinct from the bots that train models. Blocking one of these retrieval crawlers, at any layer, removes you from the answer it builds.

The fix: identify the blocker, then allow the bots at every layer

Work from the outside in. The order matters, because a fix at one layer is wasted if a layer above it is still refusing the request.

1. Find the blocker in your logs

Open your server access logs, and your CDN or WAF logs if you have them. Search for the bot user-agents by name: OAI-SearchBot, PerplexityBot, Claude-SearchBot, Googlebot. Look at the status code each one received. A 200 means the bot read the page. A 403 or 429 means something refused it, and the layer that logged the refusal is your culprit. If the bots do not appear in the logs at all, a layer above your origin is dropping them before they reach you, which points at the CDN or WAF.
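If your logs are large, a short script can do the counting for you. The sketch below is a minimal example, assuming your server writes the common "combined" access log format; the regular expression, the function name tally_bot_statuses, and the sample lines are illustrative and not taken from any real log. Adjust the pattern if your format differs.

```python
import re
from collections import Counter, defaultdict

# Retrieval crawlers named in this article.
BOTS = ("OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "Googlebot")

# Assumption: "combined" log format, which ends with
# "METHOD /path PROTOCOL" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def tally_bot_statuses(lines):
    """Count HTTP status codes per retrieval bot across access log lines."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent").lower()
        for bot in BOTS:
            if bot.lower() in agent:
                counts[bot][int(match.group("status"))] += 1
                break
    return {bot: dict(statuses) for bot, statuses in counts.items()}

# Invented sample lines for illustration only (documentation IP ranges,
# simplified user-agent strings).
sample = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET / HTTP/1.1" 403 162 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.6 - - [01/Jan/2025:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.5 - - [01/Jan/2025:10:00:09 +0000] "GET /docs HTTP/1.1" 403 162 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:10:00:12 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"',
]

print(tally_bot_statuses(sample))
```

To run it against a real file, pass the open file object in place of sample. A bot that shows only 403 or 429 is being refused at the layer that wrote the log. A bot that is missing from the result never reached that layer.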

2. Keep robots.txt permissive for the retrieval bots

Make the allow explicit so there is no ambiguity. If you want to keep training bots out, that is a separate decision and a separate user-agent. Do not let a broad anti-AI rule sweep up the crawlers that cite you. For the robots.txt layer specifically, see how to fix robots.txt blocking AI crawlers.

# Let the retrieval crawlers fetch everything
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Googlebot
Allow: /
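
You can check a robots.txt file against the four bot names before you deploy it. The sketch below uses Python's standard urllib.robotparser; the function name blocked_bots and the yoursite.com URL are placeholders. Python's parser does not resolve conflicting rules exactly the way every crawler does, so treat the result as a first pass, not a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Retrieval crawlers named in this article.
BOTS = ("OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "Googlebot")

def blocked_bots(robots_txt, url="https://yoursite.com/"):
    """Return the retrieval bots that this robots.txt text forbids from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in BOTS if not parser.can_fetch(bot, url)]

# A broad block that only makes an exception for one crawler.
too_strict = """\
User-agent: *
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

# The explicit allow list from this article.
permissive = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Googlebot
Allow: /
"""

print(blocked_bots(too_strict))
print(blocked_bots(permissive))
```

The first file shows how a wildcard Disallow sweeps up every crawler that does not have its own group.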

3. Allow the bots at the CDN and WAF

If you run Cloudflare, check whether Bot Fight Mode is challenging these user-agents. Depending on your plan, you can either add an explicit allow rule that matches them and skips the challenge, or you may have to turn the mode off, because not every plan lets a rule make exceptions to it. Other CDNs and WAFs have the same concept under different names. The goal is the same everywhere: a request that identifies itself as one of the retrieval crawlers should reach your origin without a challenge it cannot pass. Do not rely on the default bot rules to make an exception for AI search. They usually do not.
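
Before you change CDN settings, it can help to confirm that a challenge is what the bot actually received. The sketch below assumes Cloudflare, which marks challenge responses with a cf-mitigated response header set to challenge; other providers signal challenges differently, so the check does not carry over as written. The function name is hypothetical.

```python
def looks_like_cloudflare_challenge(headers):
    """Return True if response headers carry Cloudflare's challenge marker.

    headers: a mapping of response header names to values.
    Assumption: the site is behind Cloudflare, which sets
    "cf-mitigated: challenge" on challenge pages.
    """
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    return lowered.get("cf-mitigated") == "challenge"

print(looks_like_cloudflare_challenge({"CF-Mitigated": "challenge"}))
print(looks_like_cloudflare_challenge({"Content-Type": "text/html"}))
```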

4. Fix the server and the rate limits

If a security plugin or a server rule is returning 403 or 429 to these user-agents, add them to its allowlist. Loosen rate limits enough that a crawler fetching a handful of pages in sequence is not throttled into errors. If you use an IP allowlist, a login wall, or geoblocking on pages you want cited, those pages are off-limits to every crawler by design, so move the content you want quoted to a path that is publicly reachable.
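
To judge whether a rate limit is too tight, you can replay a crawl pattern against it on paper. The sketch below models a simple sliding-window limiter; your CDN or plugin may count requests differently, and the crawl pattern (ten pages, one request every half second) is an invented example, not measured crawler behavior.

```python
from collections import deque

def count_throttled(request_times, limit, window_seconds):
    """Count the requests a sliding-window limiter would reject.

    request_times: ascending timestamps in seconds.
    limit: maximum requests accepted within any window_seconds span.
    """
    accepted = deque()
    throttled = 0
    for now in request_times:
        # Forget accepted requests that have aged out of the window.
        while accepted and now - accepted[0] >= window_seconds:
            accepted.popleft()
        if len(accepted) < limit:
            accepted.append(now)
        else:
            throttled += 1  # this request would receive a 429
    return throttled

# Hypothetical crawl: 10 pages, one request every 0.5 seconds.
crawl = [i * 0.5 for i in range(10)]

print(count_throttled(crawl, limit=5, window_seconds=10))
print(count_throttled(crawl, limit=20, window_seconds=10))
```

With a limit of 5 requests per 10 seconds, half of this short crawl is rejected. With 20 per 10 seconds, all of it passes.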

Verify: fetch your own page as the bot

Do not trust that the fix worked because the rule looks right. Test it the honest way. Request your page from the command line while sending one of the bot user-agents, and read what comes back:

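# Pretend to be the ChatGPT search crawler; -I asks for the response headers only
curl -A "OAI-SearchBot" -I https://yoursite.com/

# A pass looks like this
HTTP/2 200

# Then fetch the body with the same user-agent and confirm the HTML is there
curl -A "OAI-SearchBot" -s https://yoursite.com/ | head -n 20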

You want a 200 and the full HTML of your page, the same content a browser sees. A 403, a 429, a redirect to a challenge, or a near-empty body means a layer is still refusing the bot. Repeat the check for each retrieval crawler you care about, because a rule that allows one can easily miss another. This is also where you confirm a separate but related failure: if the body comes back empty because the page only fills in after JavaScript runs, the crawler still cannot quote you. That is a different fix, covered in how to fix a website that doesn't render without JavaScript.
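
The same check can be repeated for every bot name in one pass. The sketch below uses only Python's standard library. The function names, the 500-byte threshold for a "near-empty" body, and the verdict labels are assumptions made for illustration. Sending the bare bot name is also a simplification: the real crawlers send longer user-agent strings from their own networks, so a layer that checks IP addresses may treat your test request differently from the real bot.

```python
import urllib.error
import urllib.request

# Retrieval crawlers named in this article.
BOTS = ("OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "Googlebot")

def classify(status, body_bytes, min_body=500):
    """Turn a status code and body size into a rough verdict.

    min_body is an arbitrary threshold; tune it to your own pages.
    """
    if status in (403, 429):
        return "refused"
    if status != 200:
        return "unexpected status"
    if body_bytes < min_body:
        return "near-empty body"
    return "ok"

def fetch_as(bot, url, timeout=10):
    """GET url with bot as the User-Agent; return (status, body size in bytes)."""
    request = urllib.request.Request(url, headers={"User-Agent": bot})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, len(response.read())
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx and 5xx; the error still carries the response.
        return error.code, len(error.read())

def check_site(url):
    """Fetch url once per bot name and map each bot to a verdict."""
    return {bot: classify(*fetch_as(bot, url)) for bot in BOTS}

# Example (makes real network requests):
# print(check_site("https://yoursite.com/"))
```

Any verdict other than "ok" is a reason to go back through the layers above. urllib follows redirects on its own, so a redirect to a challenge page will usually surface as a refusal or a near-empty body rather than as a 3xx code.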

Why this is worth doing

Crawler access is the floor under everything else. You can write the most quotable answer on the internet, but if the bot that builds the answer cannot fetch your page, none of it counts. This is the first thing we check, because it gates all the rest. If you are new to this, our explainer on what answer engine optimization is puts crawler access in context with the work that follows it.

Check your own site

You can run every one of these checks by hand, or you can paste your link into Brimm and see in seconds whether the retrieval crawlers get a clean fetch or a refusal. We request your pages the way the engines do, report the status code each bot receives, and print the failures in fix order. Start with our free GEO audit, or open the full audit to see every layer at once. The whole fix library lives at /learn/fix when you want the next problem.