
AI Crawler Audit: Robots.txt Is Not Enough

Run a 7-layer AI crawler audit across 13 bots, robots.txt, Content-Signal, WAF rules, status codes, logs, IP verification, retrieval proof, and fixes.

By Senior Editor · 13 min read
AI crawler audit workflow showing robots.txt policy, edge firewall checks, bot logs, and retrieval outcomes.
TL;DR: A real AI crawler audit checks policy, bot coverage, file discovery, edge behavior, status codes, logs, and retrieval outcomes. robots.txt is only the first layer. Audit all major AI bot user agents, verify WAF/CDN behavior, separate training from search and user-fetch traffic, and record evidence for every public page class.

Most AI crawler advice stops at robots.txt. That is too thin. A site can publish a perfect robots file and still fail AI retrieval because a WAF challenges the request, a CDN blocks unknown agents, a sitemap points to stale URLs, or the policy ignores half the bots that now matter.

Use this guide after the AI crawler access policy and the llms.txt vs robots.txt publishing guide. It also supports the GEO vs SEO strategy, because AI citations fail when public pages cannot be retrieved.

What is an AI crawler audit?

An AI crawler audit is a production evidence review, not a robots.txt lint check. OpenAI documents separate agents for search, training, and user-triggered actions, including OAI-SearchBot, GPTBot, and ChatGPT-User (OpenAI, Overview of OpenAI Crawlers, retrieved 2026-06-03).

An AI crawler audit should compare intended crawler policy with real retrieval evidence. The audit should show which bots are named, which page classes they can fetch, which edge controls intervene, and whether logs confirm the intended outcome.

The audit goal is to answer five questions. Which AI bots does the site recognize? Which ones are allowed, blocked, or missing? Do public pages return clean 2xx responses? Do security tools challenge or block legitimate traffic? Do logs prove that the policy works after deployment?

The highest-value finding is usually a contradiction. SEO sees an allow rule, security sees a bot block, engineering sees a redirect, and logs show 403. A useful audit forces those systems into one evidence table.

Three-layer audit model showing crawler policy, transport, and outcome evidence.

Which AI bots should your audit include?

Start with the bots your scanner or policy actually checks. CanAgentUse currently audits explicit robots.txt user-agent rules for 13 AI and AI-adjacent crawlers: GPTBot, ChatGPT-User, ClaudeBot, Claude-Web, Anthropic-AI, Google-Extended, PerplexityBot, CCBot, Applebot-Extended, Meta-ExternalAgent, Bytespider, Amazonbot, and Cohere-AI.

| Bot or user agent | Typical policy question | Why it matters |
| --- | --- | --- |
| GPTBot | Allow training use? | OpenAI identifies GPTBot separately from search and user actions. |
| OAI-SearchBot | Allow ChatGPT search discovery? | OpenAI says this affects surfacing in ChatGPT search answers. |
| ChatGPT-User | Allow user-triggered fetches? | OpenAI describes this as user-initiated, not automatic crawling. |
| ClaudeBot | Allow Anthropic model-development crawling? | Anthropic separates training, search, and user retrieval. |
| Claude-SearchBot | Allow Claude search indexing? | Blocking it may reduce Claude search visibility. |
| Claude-User | Allow user-directed retrieval? | User retrieval should be treated differently from training. |
| Claude-Web | Legacy or related Anthropic crawler policy? | Some sites still name it explicitly in robots.txt. |
| Anthropic-AI | Anthropic policy coverage? | Explicit naming reduces ambiguity in older policies. |
| Google-Extended | Allow Google model training use? | Google documents it as a control for Gemini/Vertex training use. |
| PerplexityBot | Allow Perplexity discovery? | Perplexity documents bot access and WAF allowlisting guidance. |
| Perplexity-User | Allow user-requested Perplexity fetches? | It belongs in a user-fetch policy class. |
| CCBot | Allow Common Crawl? | Common Crawl data can flow into many AI datasets. |
| Applebot-Extended | Allow Apple AI use preferences? | Apple documents Applebot-Extended for data-use control. |
| Meta-ExternalAgent | Allow Meta external agent access? | Needed for Meta AI-related crawler policies. |
| Bytespider | Allow ByteDance crawler access? | Commonly appears in AI crawler block lists. |
| Amazonbot | Allow Amazon crawler access? | Can appear in AI and assistant crawler policies. |
| Cohere-AI | Allow Cohere AI crawler access? | Relevant for enterprise AI/search ecosystems. |

Bot coverage must be explicit. A robots file that names only GPTBot and ClaudeBot can still leave PerplexityBot, Google-Extended, CCBot, Applebot-Extended, Meta-ExternalAgent, Bytespider, Amazonbot, and Cohere-AI ambiguous.

The table includes more agents than a single scanner rule because crawler ecosystems do not stay neat. A strong audit uses vendor docs as the source of truth, then keeps a local watchlist for bots the business cares about. Missing names are not always failures, but they are policy gaps.
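
A coverage check like this can be sketched in a few lines of Python. The sketch below only looks for explicit User-agent lines for the 13 tokens listed earlier; the function names and the sample file are illustrative assumptions, not CanAgentUse's actual implementation.

```python
import re

# The 13 crawler tokens named in this guide; adjust to your own watchlist.
AUDITED_BOTS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web", "Anthropic-AI",
    "Google-Extended", "PerplexityBot", "CCBot", "Applebot-Extended",
    "Meta-ExternalAgent", "Bytespider", "Amazonbot", "Cohere-AI",
]


def named_user_agents(robots_txt: str) -> set[str]:
    """Return the lowercased token from every User-agent line."""
    agents = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r"(?i)user-agent\s*:\s*(.+)$", line)
        if match:
            agents.add(match.group(1).strip().lower())
    return agents


def coverage_gaps(robots_txt: str, watchlist=AUDITED_BOTS) -> list[str]:
    """List watchlist bots that have no explicit User-agent line."""
    named = named_user_agents(robots_txt)
    return [bot for bot in watchlist if bot.lower() not in named]


# Hypothetical robots.txt that names only two AI bots.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

gaps = coverage_gaps(SAMPLE)
print(f"{len(gaps)} watchlist bots are not named explicitly: {gaps}")
```

A gap in this list is a policy question to resolve, not automatically a failure: the wildcard group still applies to unnamed bots, but the intent for each one is undocumented.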

How should you separate bot purpose classes?

Separate bots by purpose before deciding allow or disallow. OpenAI says OAI-SearchBot is for ChatGPT search, GPTBot is for training-related crawling, and ChatGPT-User supports user actions. Anthropic similarly documents robots for model development, search, and user-directed retrieval (Anthropic Help Center, retrieved 2026-06-03).

| Purpose class | Examples | Default audit posture |
| --- | --- | --- |
| Search or answer discovery | OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot | Usually allow public content if AI visibility matters. |
| User-directed fetch | ChatGPT-User, Claude-User, Perplexity-User | Usually allow public pages; protect private data with auth. |
| Training or model improvement | GPTBot, ClaudeBot, Google-Extended, CCBot, Applebot-Extended | Decide with legal, content, and business owners. |
| Broad AI-adjacent crawlers | Meta-ExternalAgent, Bytespider, Amazonbot, Cohere-AI | Decide by data policy and log behavior. |
| Unknown or spoofed agents | Fake user agents, unverified IPs | Verify identity or challenge. |

Training, search, and user-fetch bots should not share one blanket rule. A site can block model-training crawlers while allowing AI search and user-directed retrieval, but the policy must name the agents and logs must prove the behavior.
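
To make the split concrete, here is a minimal sketch that groups a few bots by purpose and checks a sample robots.txt with Python's standard `urllib.robotparser`. The sample policy, the `PURPOSE` mapping, and the example URL are assumptions for illustration. Python's parser applies its own user-agent matching, so treat the result as a policy sanity check, not proof of how each vendor's crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative purpose mapping; extend it with the bots you care about.
PURPOSE = {
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    "ChatGPT-User": "user-fetch",
    "Claude-User": "user-fetch",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
}

# Hypothetical policy: block training crawlers, allow search and user fetch.
ROBOTS_TXT = """\
# Training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /

# Search and user-fetch agents
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
User-agent: Claude-User
Allow: /
"""


def policy_by_purpose(robots_txt: str, url: str) -> dict:
    """Return {purpose: {bot: allowed}} for one representative URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    report = {}
    for bot, purpose in PURPOSE.items():
        report.setdefault(purpose, {})[bot] = parser.can_fetch(bot, url)
    return report


report = policy_by_purpose(ROBOTS_TXT, "https://example.com/blog/post")
for purpose, bots in report.items():
    print(purpose, bots)
```

A mixed result inside one purpose class, such as one search bot allowed and another blocked, is usually the first thing worth investigating.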

This is where many teams overcorrect. Blocking every AI crawler may protect training preference, but it can also remove your public docs from AI search. Allowing every crawler may help visibility, but it can conflict with content licensing or publisher policy.

What should robots.txt prove?

robots.txt should prove that the site has a parseable crawler policy with User-agent directives, clear Allow or Disallow choices, and sitemap references. RFC 9309 specifies the Robots Exclusion Protocol that robots.txt implements, and Google still documents it as a standard crawl-control file (Google Search Central, retrieved 2026-06-03).

Use a policy matrix before writing rules:

| Page class | Search bots | User-fetch bots | Training bots | Notes |
| --- | --- | --- | --- | --- |
| Homepage and product pages | Allow | Allow | Decide | Needed for AI visibility. |
| Blog and docs | Allow | Allow | Decide | Usually the core citation surface. |
| Pricing and contact | Allow | Allow | Decide | Useful for buyer answers. |
| Reports and dashboards | Block by auth | Block by auth | Block by auth | Do not rely on robots.txt. |
| Admin and account routes | Block by auth | Block by auth | Block by auth | Server authorization is mandatory. |

robots.txt is policy evidence, not privacy enforcement. It should declare crawler intent for public URLs, while private dashboards, reports, invoices, and admin routes must be protected with authentication and server authorization.

CanAgentUse parses robots.txt for user agents, sitemap directives, and Content-Signal entries. It also checks whether major AI bots are named at all. That means a "valid" robots file can still score poorly if it never names AI crawler policy.
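
As a rough illustration of that kind of parsing, the sketch below pulls Sitemap and Content-Signal lines out of a robots.txt body. It assumes Content-Signal values are written as comma-separated `key=value` pairs (for example `search=yes, ai-train=no`); the function name and sample file are hypothetical, and this is not CanAgentUse's actual parser.

```python
EXPECTED_SIGNALS = ("ai-train", "search", "ai-input")


def parse_policy_signals(robots_txt: str) -> dict:
    """Collect Sitemap URLs and Content-Signal key=value pairs."""
    sitemaps, signals = [], {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "sitemap":
            sitemaps.append(value)
        elif field == "content-signal":
            # Assumed syntax: comma-separated key=value pairs.
            for pair in value.split(","):
                if "=" in pair:
                    key, setting = pair.split("=", 1)
                    signals[key.strip().lower()] = setting.strip().lower()
    missing = [key for key in EXPECTED_SIGNALS if key not in signals]
    return {"sitemaps": sitemaps, "content_signal": signals, "missing": missing}


# Hypothetical robots.txt for illustration.
SAMPLE = """\
User-agent: *
Content-Signal: search=yes, ai-train=no, ai-input=yes
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

result = parse_policy_signals(SAMPLE)
print(result)
```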

For the broader machine-readable content layer, pair this with the AI agent readiness playbook. Crawler access is only one part of the stack.

What else belongs in the crawler-policy layer?

The policy layer should include robots.txt, sitemap links, llms.txt, ai.txt where relevant, TDMRep where relevant, and Content-Signal directives. These files do not all do the same job, but together they reduce ambiguity for crawlers, agents, publishers, and auditors.

| Signal | Scanner question | Practical value |
| --- | --- | --- |
| robots.txt | Is there a valid crawl policy? | Names user agents, rules, and sitemaps. |
| AI bot rules | Are major AI crawlers explicit? | Reduces ambiguity around training/search/retrieval. |
| Content-Signal | Are ai-train, search, and ai-input declared? | Gives machine-readable AI usage preference where supported. |
| llms.txt | Is there a concise agent-readable site guide? | Helps summarize important public context and links. |
| ai.txt | Is there human-readable AI policy text? | Useful for advisory usage, attribution, and contact guidance. |
| TDMRep | Is there a text/data mining reservation signal? | Relevant for rightsholders using W3C TDM reservation policy. |
| Sitemap | Are canonical URLs discoverable? | Gives crawlers updated public URLs. |

AI crawler policy is multi-file. robots.txt controls crawl intent, llms.txt organizes public context, Content-Signal declares usage preferences, ai.txt can explain policy, and TDMRep can support text/data mining reservation where applicable.

Do not add every emerging file because a checklist says so. Add the signals that match your business policy, then keep them consistent. The worst outcome is a site that blocks training in robots.txt, invites broad AI use in ai.txt, and lists blocked URLs in llms.txt.
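
One of those contradictions, an llms.txt that links to URLs robots.txt blocks, is easy to test for. The sketch below assumes llms.txt uses Markdown-style `[title](url)` links and relies on Python's standard robots parser; both sample files and the function name are hypothetical.

```python
import re
from urllib.robotparser import RobotFileParser


def blocked_llms_links(llms_txt: str, robots_txt: str, user_agent: str) -> list[str]:
    """Return llms.txt links that robots.txt disallows for the given agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Assumes Markdown-style links: [Title](https://...)
    links = re.findall(r"\]\((https?://[^)\s]+)\)", llms_txt)
    return [url for url in links if not parser.can_fetch(user_agent, url)]


# Hypothetical files for illustration.
ROBOTS = """\
User-agent: OAI-SearchBot
Disallow: /reports/
"""

LLMS = """\
# Example Co

## Docs
- [Quickstart](https://example.com/docs/quickstart): Setup guide
- [Sample report](https://example.com/reports/sample): Example output
"""

conflicts = blocked_llms_links(LLMS, ROBOTS, "OAI-SearchBot")
print(conflicts)
```

Any URL in the result needs a decision: either remove it from llms.txt or relax the robots rule.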

Why can robots.txt pass while retrieval fails?

Robots.txt can pass while retrieval fails because fetches still traverse DNS, redirects, CDN rules, bot scoring, WAF policies, rate limits, geo restrictions, TLS settings, and application routing. Perplexity documents WAF allowlisting considerations for its crawlers (Perplexity Crawlers, retrieved 2026-06-03).

Common failure modes:

  1. robots.txt allows the bot, but Cloudflare or another CDN returns a challenge.
  2. The homepage returns 200, but docs routes return 403 to crawler-like traffic.
  3. A redirect chain drops the bot onto a localized or consent page.
  4. The server returns 200 with a JavaScript shell and no extractable content.
  5. A managed bot rule blocks published crawler IP ranges.
  6. Rate limits return 429 during repeated assistant fetches.
  7. A noindex meta tag or X-Robots-Tag header conflicts with organic search and AI overview goals.

A crawler audit needs transport evidence. A page is not AI-retrievable just because policy allows it; the audit must confirm status code, final URL, response body, headers, and edge decision for representative public pages.

Personal Experience

When we inspect readiness reports, the sneakiest issue is a "good" 200 response that contains a consent wall, challenge page, or empty app shell. The status code passes. The content fails.
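
A body check along these lines can catch that case. The sketch below is a rough heuristic, not a production classifier: the marker phrases and the 200-character threshold are assumptions you would tune against your own challenge and consent pages.

```python
import re

# Illustrative marker phrases; replace with strings from your own edge pages.
CHALLENGE_MARKERS = ("just a moment", "verify you are human", "captcha", "attention required")
CONSENT_MARKERS = ("accept all cookies", "cookie consent", "manage consent")


def classify_body(html: str, min_text_chars: int = 200) -> str:
    """Return 'challenge', 'consent-wall', 'empty-shell', or 'content'."""
    if any(marker in html.lower() for marker in CHALLENGE_MARKERS):
        return "challenge"
    # Drop script/style blocks, then all tags, to estimate visible text.
    without_code = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    visible = re.sub(r"(?s)<[^>]+>", " ", without_code)
    visible = re.sub(r"\s+", " ", visible).strip()
    short_page = len(visible) < 4 * min_text_chars
    if short_page and any(marker in visible.lower() for marker in CONSENT_MARKERS):
        return "consent-wall"
    if len(visible) < min_text_chars:
        return "empty-shell"
    return "content"


shell = '<html><body><div id="root"></div><script>window.__APP__ = {}</script></body></html>'
challenge = "<html><title>Just a moment...</title><body>Checking your browser</body></html>"
consent = "<div>We value your privacy. Accept all cookies to continue.</div>"
article = "<article>" + "<p>Robots.txt is only the first layer of an audit.</p>" * 10 + "</article>"

for name, body in [("shell", shell), ("challenge", challenge), ("consent", consent), ("article", article)]:
    print(name, "->", classify_body(body))
```

Recording this class next to the status code is what separates a real 200 from a 200 that serves nothing useful.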

How do you test edge and WAF behavior?

Test edge behavior with representative URLs and crawler classes. You do not need to spoof every bot in production, but you do need evidence from logs, vendor verification, and controlled fetches that the intended crawlers are not receiving challenges.

Use this workflow:

  1. Fetch /robots.txt, /sitemap.xml, /llms.txt, and key public pages.
  2. Record status code, content type, final URL, canonical, and body excerpt.
  3. Compare normal browser fetches with crawler-like fetches where safe.
  4. Check WAF events for allowed, challenged, blocked, and rate-limited requests.
  5. Verify official IP ranges for vendors that publish them.
  6. Confirm the response body contains the expected page content.
  7. Rescan after security-rule changes, CDN migrations, or framework rewrites.

WAF validation should combine user-agent policy with identity evidence. User-agent strings can be spoofed, so high-confidence allow rules should use official vendor guidance, IP verification where available, and ongoing log review.
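
Steps 1 and 2 of the workflow can be sketched with Python's standard library. The helper names are hypothetical, and the fetch should only be pointed at sites you operate. Note the limitation: a controlled fetch with a crawler-like User-Agent comes from your IP, not the vendor's, so it exercises user-agent rules only and does not replace log and WAF evidence.

```python
import urllib.error
import urllib.request


def summarize(status, final_url, headers, body, excerpt_chars=160):
    """Build one evidence record from the parts of an HTTP response."""
    return {
        "status": status,
        "final_url": final_url,
        "content_type": headers.get("Content-Type", ""),
        "x_robots_tag": headers.get("X-Robots-Tag", ""),
        "body_excerpt": " ".join(body.split())[:excerpt_chars],
    }


def fetch_evidence(url, user_agent, timeout=10):
    """Fetch a URL with a chosen User-Agent and record what came back."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read(65536).decode("utf-8", errors="replace")
            return summarize(response.status, response.geturl(), response.headers, body)
    except urllib.error.HTTPError as error:
        # 403, 429, and similar responses are still audit evidence.
        body = error.read(65536).decode("utf-8", errors="replace")
        return summarize(error.code, error.geturl(), error.headers, body)


# Offline example of the record shape, using made-up response parts.
record = summarize(200, "https://example.com/docs", {"Content-Type": "text/html"}, "<h1>Docs</h1>\n  Getting started")
print(record)
```

For a live check you might call `fetch_evidence("https://example.com/robots.txt", "audit-check/0.1")` once per representative URL and compare the records across user agents.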

Do not blindly allow every request with an AI-looking user agent. That invites spoofing. Also do not blindly block every unknown bot if your public content strategy depends on AI search visibility. Treat crawler identity as a security problem and crawler purpose as a policy problem.
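
For vendors that publish IP ranges, the identity half of that check can be sketched with Python's `ipaddress` module. The bot name and CIDR ranges below are placeholders taken from reserved documentation address blocks, not real vendor data; load the actual ranges from each vendor's official publication.

```python
import ipaddress

# Placeholder data: documentation-only address ranges, not real vendor ranges.
HYPOTHETICAL_VENDOR_RANGES = {
    "ExampleBot": ["192.0.2.0/24", "198.51.100.0/28"],
}


def ip_in_published_ranges(ip: str, cidr_ranges: list[str]) -> bool:
    """True when the address falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


def verify_claim(user_agent_token: str, ip: str) -> str:
    """Classify a request that claims to be a given crawler."""
    ranges = HYPOTHETICAL_VENDOR_RANGES.get(user_agent_token)
    if ranges is None:
        return "unverifiable"  # no published ranges on file for this bot
    return "verified" if ip_in_published_ranges(ip, ranges) else "spoof-suspect"


print(verify_claim("ExampleBot", "192.0.2.44"))
print(verify_claim("ExampleBot", "203.0.113.9"))
print(verify_claim("OtherBot", "192.0.2.44"))
```

The three outcomes map to the distinction above: "verified" requests follow the purpose policy, "spoof-suspect" requests are a security matter, and "unverifiable" ones need another identity signal before an allow rule is trusted.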

What should your log table show?

Your log table should show one row per bot, page class, and result. The minimum useful fields are user agent, purpose, URL path, policy decision, status code, final URL, edge action, IP verification, body class, last seen, owner, and next action.

| User agent | Purpose | Path | Policy | Status | Edge action | Body class | Action |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OAI-SearchBot | Search | /blog/ | Allow | 200 | Allowed | Article HTML | Keep |
| GPTBot | Training | /blog/ | Disallow | 403 | Blocked | Block page | Intentional |
| ChatGPT-User | User fetch | /checks | Allow | 200 | Allowed | Product HTML | Keep |
| Claude-SearchBot | Search | /docs | Allow | 403 | Challenge | Challenge page | Fix WAF |
| PerplexityBot | Search | /pricing | Allow | 429 | Rate limit | Error page | Tune limit |
| Googlebot | Search | /blog/ | Allow | 200 | Allowed | Article HTML | Keep |

A useful AI crawler log table captures both policy and outcome. Status code alone is insufficient; the row should also show edge action, final URL, body class, and whether the result matches the intended crawler purpose.
Log matrix showing AI crawler user agent, purpose, path, status, edge decision, and recommended action.
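
The policy-versus-outcome comparison in the log table can be expressed as a small rule function. This is a sketch under simple assumptions: the field names mirror the table columns, and the rules reproduce the Action column for the sample rows only. Real logs will need more cases.

```python
from dataclasses import dataclass


@dataclass
class CrawlerLogRow:
    user_agent: str
    purpose: str
    path: str
    policy: str       # "Allow" or "Disallow"
    status: int
    edge_action: str  # "Allowed", "Blocked", "Challenge", or "Rate limit"
    body_class: str


def next_action(row: CrawlerLogRow) -> str:
    """Compare intended policy with the observed outcome."""
    if row.policy == "Disallow":
        # A blocked bot receiving a denial matches the intent.
        return "Intentional" if row.status in (401, 403) else "Review enforcement"
    if row.status == 429:
        return "Tune limit"
    if row.status == 403 or row.edge_action in ("Blocked", "Challenge"):
        return "Fix WAF"
    if row.status == 200 and row.body_class.endswith("HTML"):
        return "Keep"
    return "Investigate"


rows = [
    CrawlerLogRow("OAI-SearchBot", "Search", "/blog/", "Allow", 200, "Allowed", "Article HTML"),
    CrawlerLogRow("GPTBot", "Training", "/blog/", "Disallow", 403, "Blocked", "Block page"),
    CrawlerLogRow("Claude-SearchBot", "Search", "/docs", "Allow", 403, "Challenge", "Challenge page"),
    CrawlerLogRow("PerplexityBot", "Search", "/pricing", "Allow", 429, "Rate limit", "Error page"),
]

actions = [next_action(row) for row in rows]
for row, action in zip(rows, actions):
    print(f"{row.user_agent:18} {row.path:10} {row.status} -> {action}")
```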

Which pages should you audit first?

Prioritize pages that teach, prove, or convert. For a SaaS site, audit the homepage, product pages, pricing, docs, API reference, blog hub, top blog posts, changelog, contact routes, and any public report pages. Then audit private routes to confirm they are protected by authentication.

| Priority | Page class | Why it matters |
| --- | --- | --- |
| P0 | Homepage, pricing, product pages | AI systems use them for brand and offer summaries. |
| P0 | Docs and API reference | Assistants use them to answer implementation questions. |
| P1 | Blog hub and strategic posts | These feed GEO, AIO, and topical authority. |
| P1 | openapi.json, API catalog, MCP card | These support agent action discovery. |
| P2 | Changelog and status pages | These support freshness and operational trust. |
| P0 private | Reports, dashboards, admin | These must require authentication. |

Page selection changes crawler-audit quality. Auditing only the homepage misses docs, blog posts, API contracts, and conversion paths; auditing private routes verifies that sensitive content is protected by auth, not by voluntary crawler rules.
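
The private-route half of that check can be reduced to a simple test on what an unauthenticated request receives. The sketch below is an assumption-laden heuristic: the login path hints are hypothetical, and a 403 may come from the WAF instead of the application, so confirm the source in your edge logs.

```python
from urllib.parse import urlparse

# Hypothetical login path prefixes; adjust to your application's routes.
LOGIN_PATH_HINTS = ("/login", "/signin", "/auth")


def is_auth_protected(status: int, requested_url: str, final_url: str) -> bool:
    """True when an anonymous request was denied or redirected to a login page."""
    if status in (401, 403):
        return True
    redirected = final_url != requested_url
    final_path = urlparse(final_url).path.lower()
    return redirected and any(final_path.startswith(hint) for hint in LOGIN_PATH_HINTS)


print(is_auth_protected(401, "https://example.com/admin", "https://example.com/admin"))
print(is_auth_protected(200, "https://example.com/dashboard", "https://example.com/login?next=/dashboard"))
print(is_auth_protected(200, "https://example.com/reports/1", "https://example.com/reports/1"))
```

A private route that returns 200 at its own URL to an anonymous request is the finding to escalate, regardless of what robots.txt says.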

This is where crawler audits connect to the OpenAPI to MCP action layer. A crawler may reach your docs, but an agent still cannot act if your API contract, OAuth metadata, and MCP server card are missing.

Once the retrieval layer is clean, the MCP server SEO guide shows how to audit action discovery and protocol evidence.

How should you decide allow, block, or monitor?

Use three decisions: allow, block, or monitor. Allow means the bot is useful and the content is public. Block means the bot conflicts with policy or content rights. Monitor means the bot is not yet important enough to allow broadly, but logs should track behavior.

For many B2B sites, a balanced default is:

  1. Allow search and user-fetch bots for public marketing, docs, and blog pages.
  2. Decide training access separately with legal and content owners.
  3. Block private routes with authentication, not robots.txt.
  4. Monitor AI-adjacent crawlers that are not tied to clear business value.
  5. Review WAF events monthly and after every security change.

The best crawler posture is not "allow AI" or "block AI." It is purpose-level access: public retrieval for visibility, explicit training preference, authenticated private content, and logs that reveal when the policy stops matching production behavior.
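
That default can be written down as a small decision function so the rule is reviewable alongside the robots file. The purpose labels and the `training_approved` flag are illustrative assumptions; the real inputs come from your legal and content owners.

```python
def decide(purpose: str, content_is_public: bool, training_approved: bool = False) -> str:
    """Map a bot's purpose class and the content type to a policy decision."""
    if not content_is_public:
        # Private content is protected by authentication, never by robots.txt.
        return "block (enforce with authentication)"
    if purpose in ("search", "user-fetch"):
        return "allow"
    if purpose == "training":
        return "allow" if training_approved else "block"
    # AI-adjacent or unclassified crawlers: watch the logs before deciding.
    return "monitor"


for purpose in ("search", "user-fetch", "training", "ai-adjacent"):
    print(f"{purpose:12} public  -> {decide(purpose, True)}")
print(f"{'search':12} private -> {decide('search', False)}")
```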

FAQ

Should I block GPTBot?

Maybe. OpenAI documents GPTBot separately from OAI-SearchBot and ChatGPT-User. Many publishers block training while allowing AI search and user-fetch. The key is to name the specific bot and verify the result in logs.

Is ChatGPT-User different from GPTBot?

Yes. OpenAI documents ChatGPT-User for certain user actions, while GPTBot relates to foundation-model training. A crawler audit should treat them as different policy classes and avoid one blanket rule.

Why does CanAgentUse check bots like CCBot and Applebot-Extended?

Because AI visibility and training exposure extend beyond the most famous assistant brands. Common Crawl, Apple, Meta, ByteDance, Amazon, and Cohere-related crawlers may matter for publisher policy, AI datasets, or assistant ecosystems.

Does Cloudflare block AI crawlers?

It can, depending on bot-management, WAF, challenge, and rate-limit settings. The audit should inspect edge events and verify official crawler identity guidance where available.

How often should crawler access be audited?

Audit monthly, plus after CDN, WAF, bot-management, robots.txt, CMS, framework, or routing changes. Crawler docs and security defaults change too quickly for one-time setup.

Research sources