1. Abstract
Declare deliberate robots.txt rules for major AI training, AI search, user-triggered, and dataset crawlers.
AI crawler product tokens differ in purpose: some govern model training, others AI search indexing or user-triggered retrieval. Explicit robots.txt groups make training, search, and retrieval access policy auditable for compliant crawler operators.
2. Classification
- Check ID: ai-bot-rules
- Check version: 1.0.0
- Package path: lib/checks/ai-bot-rules/versions/1.0.0
- Category: AI Discoverability
- Subcategory: Bot Access Control
- Check group: Bot Policy
- Check group ID: bot-policy
- Maturity: Established
- Scope: site
- Check weight: 1
3. Input And Output Contracts
- Input: [email protected]
- Output: [email protected]
- Resources inspected: /robots.txt
4. Scoring Semantics
| Step ID | Title | Weight | Description |
|---|---|---|---|
| fetch-robots | Fetch robots.txt | 0.25 | Fetch robots.txt before inspecting AI crawler rules. |
| classify-ai-bots | Classify AI crawler rules | 0.55 | Evaluate explicit AI crawler User-agent groups and effective root-path policy. |
| policy-review | Review AI crawler policy risks | 0.2 | Warn on broad search crawler blocks or likely policy mistakes. |
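The step weights sum to 1, which suggests a weighted combination of per-step results. The sketch below shows one way that could work. It is an assumption, not the documented aggregation rule: the name `weightedScore` and the idea that each step reports a score between 0 and 1 are hypothetical.

```typescript
// Step weights copied from the scoring table; they sum to 1.
const STEP_WEIGHTS = {
  "fetch-robots": 0.25,
  "classify-ai-bots": 0.55,
  "policy-review": 0.2,
};

type StepId = keyof typeof STEP_WEIGHTS;

// Hypothetical aggregation: each step reports a score in [0, 1] and the
// check score is the weighted sum. The real aggregation rule may differ.
function weightedScore(stepScores: Record<StepId, number>): number {
  let total = 0;
  for (const id of Object.keys(STEP_WEIGHTS) as StepId[]) {
    // Clamp defensively so out-of-range inputs cannot skew the total.
    const clamped = Math.min(1, Math.max(0, stepScores[id]));
    total += STEP_WEIGHTS[id] * clamped;
  }
  return total;
}

// Example: robots.txt fetched, no explicit AI groups found, no policy warnings.
const example = weightedScore({
  "fetch-robots": 1,
  "classify-ai-bots": 0,
  "policy-review": 1,
});
console.log(example.toFixed(2)); // prints "0.45"
```

Under this assumed rule, classification dominates the result: a site with a reachable robots.txt but no explicit AI crawler groups would score below one half.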
5. Package Documentation
AI Bot Rules Check v1.0.0
Checks whether /robots.txt declares explicit policy for major AI training, AI search, user-triggered retrieval, and dataset crawlers.
This check is separate from the generic robots.txt check. The generic check validates discoverability and RFC-shaped parsing. This check interprets provider-specific AI crawler product tokens as Bot Access Control evidence.
Input Contract
Requires the scan origin. The check fetches ${origin}/robots.txt.
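As a minimal sketch of how the robots.txt location could be derived from the scan origin, the helper below uses the standard `URL` constructor. The function name `robotsUrl` is hypothetical, and the package may build the URL differently.

```typescript
// Hypothetical helper: resolve the robots.txt URL for a scan origin.
function robotsUrl(origin: string): string {
  // Resolving an absolute path against the base drops any path or query
  // that was passed along with the origin.
  return new URL("/robots.txt", origin).toString();
}

console.log(robotsUrl("https://example.com")); // prints "https://example.com/robots.txt"
```

Resolving against the origin rather than concatenating strings avoids doubled slashes when the input carries a trailing slash or a path.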
Output Contract
Emits stepped evidence for fetch, AI crawler classification, and policy review.
Pass Criteria
- /robots.txt is available.
- At least one explicit User-agent group is present for a known AI crawler token.
- No broad search crawler blocks are detected at /.
Warning Criteria
- Explicit AI crawler rules exist, but broad search crawlers such as Googlebot or Applebot are blocked at /. This may be intentional, but often indicates that the publisher meant to use narrower tokens such as Google-Extended or Applebot-Extended.
Failure Criteria
- No robots.txt content is available.
- No explicit User-agent rules are found for known AI crawler tokens.
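The pass, warning, and failure criteria above can be read as a small decision function. The sketch below is one possible reading, not the package's code: the `CrawlerFinding` shape, the two-way `kind` split, and the function name `verdict` are all assumptions made for illustration.

```typescript
type Verdict = "pass" | "warn" | "fail";

// Hypothetical per-crawler result; field names are illustrative only.
interface CrawlerFinding {
  token: string;
  // "ai" covers training, opt-out, AI search, user-triggered, and dataset tokens.
  kind: "ai" | "general-search";
  explicit: boolean; // true when the token has its own User-agent group
  policy: "allowed" | "blocked" | "unspecified"; // effective policy at /
}

function verdict(robotsAvailable: boolean, findings: CrawlerFinding[]): Verdict {
  // Failure: no robots.txt content is available.
  if (!robotsAvailable) return "fail";

  // Failure: no explicit group for any known AI crawler token.
  const hasExplicitAi = findings.some((f) => f.kind === "ai" && f.explicit);
  if (!hasExplicitAi) return "fail";

  // Warning: a broad search crawler such as Googlebot is blocked at /.
  const broadBlock = findings.some(
    (f) => f.kind === "general-search" && f.policy === "blocked"
  );
  return broadBlock ? "warn" : "pass";
}

// Example: GPTBot is explicitly blocked, but Googlebot is blocked too.
const sample: CrawlerFinding[] = [
  { token: "GPTBot", kind: "ai", explicit: true, policy: "blocked" },
  { token: "Googlebot", kind: "general-search", explicit: true, policy: "blocked" },
];
console.log(verdict(true, sample)); // prints "warn"
```

Note that blocking an AI crawler is not itself a warning in this reading; only the block on a general search crawler changes the outcome.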
Crawler Purpose Model
| Purpose | Examples |
|---|---|
| AI training | GPTBot, ClaudeBot, Amazonbot, Bytespider, Meta-ExternalAgent |
| AI training opt-out token | Google-Extended, Applebot-Extended |
| AI search | OAI-SearchBot, Claude-SearchBot, PerplexityBot, Amzn-SearchBot, YouBot |
| User-triggered retrieval | ChatGPT-User, Claude-User, Perplexity-User, Amzn-User |
| Dataset crawlers | CCBot |
| General search | Googlebot, Applebot, bingbot |
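The purpose table can be expressed as a lookup keyed by purpose. The sketch below copies only the example tokens listed in the table. The `Purpose` labels, the `purposeOf` name, and the case-insensitive comparison are assumptions, and the package's actual token list may be longer.

```typescript
type Purpose =
  | "ai-training"
  | "ai-training-opt-out"
  | "ai-search"
  | "user-triggered"
  | "dataset"
  | "general-search";

// Example tokens copied from the purpose table; not an exhaustive registry.
const PURPOSE_EXAMPLES: Record<Purpose, string[]> = {
  "ai-training": ["GPTBot", "ClaudeBot", "Amazonbot", "Bytespider", "Meta-ExternalAgent"],
  "ai-training-opt-out": ["Google-Extended", "Applebot-Extended"],
  "ai-search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Amzn-SearchBot", "YouBot"],
  "user-triggered": ["ChatGPT-User", "Claude-User", "Perplexity-User", "Amzn-User"],
  "dataset": ["CCBot"],
  "general-search": ["Googlebot", "Applebot", "bingbot"],
};

// Hypothetical lookup: returns the purpose for a known token, else undefined.
function purposeOf(token: string): Purpose | undefined {
  const wanted = token.toLowerCase();
  for (const purpose of Object.keys(PURPOSE_EXAMPLES) as Purpose[]) {
    const known = PURPOSE_EXAMPLES[purpose].some((t) => t.toLowerCase() === wanted);
    if (known) return purpose;
  }
  return undefined;
}

console.log(purposeOf("gptbot")); // prints "ai-training"
```

Matching whole tokens rather than prefixes keeps Googlebot and Google-Extended distinct, which matters because they carry different purposes.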
For each known crawler, this version reports effective policy at / as allowed, blocked, or unspecified. Exact crawler groups take precedence over User-agent: * fallback groups.
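The precedence rule can be sketched as follows: rules from groups naming the crawler exactly are used when any exist, and `User-agent: *` groups are consulted only otherwise. This is a simplified illustration under stated assumptions, not the package's parser. The function names are hypothetical, only a few root-matching patterns are recognised, and "unspecified" is assumed to mean that no applicable rule covers /.

```typescript
type Policy = "allowed" | "blocked" | "unspecified";

interface Rule { allow: boolean; path: string }
interface Group { agents: string[]; rules: Rule[] }

// Parse robots.txt into groups; consecutive User-agent lines share one group.
function parseGroups(text: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (key === "user-agent") {
      if (!current || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((key === "allow" || key === "disallow") && current) {
      current.rules.push({ allow: key === "allow", path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}

// Simplification: only these patterns are treated as covering the root path.
function matchesRoot(path: string): boolean {
  return path === "/" || path === "/*" || path === "*" || path === "/$";
}

function effectivePolicyAtRoot(text: string, token: string): Policy {
  const groups = parseGroups(text);
  const wanted = token.toLowerCase();
  // Exact crawler groups win; fall back to "*" only when none exist.
  let matched = groups.filter((g) => g.agents.indexOf(wanted) !== -1);
  if (matched.length === 0) matched = groups.filter((g) => g.agents.indexOf("*") !== -1);

  let best: Rule | null = null;
  for (const group of matched) {
    for (const rule of group.rules) {
      if (!matchesRoot(rule.path)) continue;
      // Longest pattern wins; on a tie, Allow is preferred.
      const longer = best === null || rule.path.length > best.path.length;
      const tieAllow = best !== null && rule.path.length === best.path.length && rule.allow;
      if (longer || tieAllow) best = rule;
    }
  }
  if (best === null) return "unspecified";
  return best.allow ? "allowed" : "blocked";
}

const robots = [
  "User-agent: *",
  "Disallow: /",
  "",
  "User-agent: GPTBot",
  "Allow: /",
  "",
  "User-agent: CCBot",
  "Disallow: /private/",
].join("\n");

console.log(effectivePolicyAtRoot(robots, "GPTBot"));    // prints "allowed"
console.log(effectivePolicyAtRoot(robots, "ClaudeBot")); // prints "blocked"
console.log(effectivePolicyAtRoot(robots, "CCBot"));     // prints "unspecified"
```

In this example ClaudeBot has no group of its own, so it inherits the wildcard block, while CCBot's exact group shadows the wildcard even though that group says nothing about /.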
Scoring Steps
| Step | Weight | Purpose |
|---|---|---|
| fetch-robots | 0.25 | Fetch robots.txt. |
| classify-ai-bots | 0.55 | Classify explicit AI crawler groups and effective root-path policy. |
| policy-review | 0.2 | Warn on broad search crawler blocks or likely policy mistakes. |
Current v1.0.0 Coverage
This version checks:
- Explicit AI crawler User-agent groups in robots.txt.
- Effective allow/block/unspecified policy at /.
- Purpose grouping for training, AI search, user-triggered retrieval, dataset, and general search crawlers.
- Broad Googlebot and Applebot blocking warnings.
This version does not validate:
- Whether crawlers comply with the declared policy.
- WAF/CDN blocks, IP verification, or server logs.
- Content-Signal, TDMRep, ai.txt, Web Bot Auth, or RSL; those are sibling Bot Access Control checks.
References
- www.rfc-editor.org/rfc/rfc9309
- platform.openai.com/docs/bots
- support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
- developers.google.com/search/docs/crawling-indexing/google-common-crawlers
- support.apple.com/en-us/119829
- developer.amazon.com/amazonbot
- www.perplexity.ai/perplexitybot
- developers.facebook.com/docs/sharing/webmasters/web-crawlers
- commoncrawl.org/ccbot
Source: lib/checks/ai-bot-rules/versions/1.0.0/docs.md
6. Version Changelog
ai-bot-rules v1.0.0 Changelog
Initial versioned package for ai-bot-rules.
- Classifies known AI crawler tokens by purpose.
- Evaluates effective root-path allow/block/unspecified policy.
- Warns on broad Googlebot and Applebot blocks that may indicate accidental search crawler blocking.
Source: lib/checks/ai-bot-rules/versions/1.0.0/changelog.md