1. Abstract
Publish /robots.txt with clear crawl rules.
robots.txt gives crawlers and agents a standard place to read crawl permissions, disallowed paths, and sitemap locations before requesting site content.
2. Classification
- Check ID: robots.txt
- Check version: 1.0.0
- Package path: lib/checks/robots-txt/versions/1.0.0
- Category: AI Discoverability
- Subcategory: Discoverability
- Check group: Crawl Discovery
- Check group ID: crawl-discovery
- Maturity: Established
- Scope: site
- Check weight: 1
3. Input And Output Contracts
- Input: [email protected]
- Output: [email protected]
- Resources inspected: /robots.txt
4. Scoring Semantics
| Step ID | Title | Weight | Description |
|---|---|---|---|
| fetch | Fetch robots.txt | 0.3 | Fetch the root /robots.txt file with a successful HTTP response. |
| core-syntax | Validate RFC 9309 core syntax | 0.45 | Validate User-agent groups and core Allow/Disallow records. |
| extensions | Classify extension records | 0.25 | Report known nonstandard extensions, unsupported legacy records, and unknown records without failing otherwise valid core syntax. |
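The table lists step weights but does not say how step results combine into a score. The sketch below assumes a plain weighted sum over passed steps; `STEP_WEIGHTS` and `score` are hypothetical names used only for illustration.

```python
from typing import Set

# Step weights copied from the table above.
STEP_WEIGHTS = {"fetch": 0.3, "core-syntax": 0.45, "extensions": 0.25}


def score(passed_steps: Set[str]) -> float:
    """Sum the weights of the steps that passed (assumed aggregation rule)."""
    total = sum(w for step, w in STEP_WEIGHTS.items() if step in passed_steps)
    # Round to avoid floating-point noise in the reported value.
    return round(total, 2)
```

Under this assumption, a site whose file is fetched but fails core syntax validation would keep only the fetch share of the score.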
5. Package Documentation
robots.txt Check v1.0.0
Validates that a site publishes a fetchable and parseable /robots.txt file.
This check separates the formal Robots Exclusion Protocol from popular nonstandard extensions. The RFC-defined protocol is intentionally small; search engines, AI crawlers, and site owners commonly add extra records that should be parsed and reported, but not treated as RFC-required directives.
Input Contract
Requires the scan origin. The check fetches ${origin}/robots.txt.
Output Contract
Emits the parsed robots body, sitemap directives, Content-Signal directives, and a report check result. The result is stepped: fetch, RFC core syntax, and extension classification.
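As an illustration of the sitemap and Content-Signal parts of the output, the following sketch pulls both directive types out of a robots.txt body. The function name `extract_directives` and the shape of the returned dictionary are assumptions, not the package's actual output contract.

```python
from typing import Dict, List


def extract_directives(body: str) -> Dict[str, list]:
    """Collect Sitemap URLs and Content-Signal key/value pairs from a robots body."""
    sitemaps: List[str] = []
    content_signals: List[Dict[str, str]] = []
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)  # split on the first colon only
        field, value = field.strip().lower(), value.strip()
        if field == "sitemap" and value:
            sitemaps.append(value)
        elif field == "content-signal" and value:
            signals: Dict[str, str] = {}
            for pair in value.split(","):
                if "=" in pair:
                    key, val = pair.split("=", 1)
                    signals[key.strip().lower()] = val.strip().lower()
            content_signals.append(signals)
    return {"sitemaps": sitemaps, "content_signals": content_signals}
```

Splitting on the first colon keeps the `https://` in a sitemap URL intact.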
Pass Criteria
- `/robots.txt` responds with a successful HTTP status.
- The body parses as valid robots.txt content.
- At least one valid `User-agent` group is present.
Failure Criteria
- `/robots.txt` cannot be fetched.
- `/robots.txt` responds with a non-2xx status.
- The body cannot be parsed as a valid robots.txt file.
Warning Criteria
- The RFC core syntax is valid, but unsupported legacy records such as `Noindex` are present.
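The pass, failure, and warning criteria above can be read as a small decision function. This is a minimal sketch: `RobotsFacts` and `outcome` are hypothetical names, and treating a file with no valid `User-agent` group as a failure is an assumption drawn from the pass criteria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobotsFacts:
    status: Optional[int]     # None when the fetch itself failed
    parses: bool              # body parsed as robots.txt content
    user_agent_groups: int    # number of valid User-agent groups
    legacy_records: int       # unsupported legacy records such as Noindex


def outcome(facts: RobotsFacts) -> str:
    """Map observed facts to 'pass', 'warn', or 'fail'."""
    if facts.status is None or not (200 <= facts.status < 300):
        return "fail"  # not fetched, or non-2xx status
    if not facts.parses or facts.user_agent_groups < 1:
        return "fail"  # assumption: an unmet pass criterion is a failure
    if facts.legacy_records > 0:
        return "warn"  # core syntax valid, legacy records present
    return "pass"
```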
HTTP Status Semantics
The check records the HTTP status because crawlers interpret robots.txt fetch outcomes differently from ordinary page fetches.
| Status class | Robots interpretation |
|---|---|
| 2xx | The robots.txt file was fetched and should be parsed. |
| 3xx | Redirects may be followed, but long redirect chains are treated as unavailable. |
| 4xx | A missing or inaccessible robots.txt file generally means there are no robots.txt crawl restrictions. |
| 5xx | Server errors are temporary failures; crawlers may pause or reduce crawling rather than assume access is allowed. |
| Network / DNS error | Treated like a temporary fetch failure by crawlers. |
For CanAgentUse scoring, a non-2xx response fails this check because the site did not publish a fetchable policy document at /robots.txt, even though crawler behavior for 4xx can be permissive.
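The difference between crawler interpretation and check scoring can be sketched as a function that returns both. The name `interpret_fetch` and the label strings are hypothetical; only the status classes and the "non-2xx fails" rule come from the text above.

```python
from typing import Optional, Tuple


def interpret_fetch(status: Optional[int]) -> Tuple[str, bool]:
    """Return (crawler interpretation, whether the fetch step passes)."""
    if status is None:
        # Network or DNS error: no HTTP status was received.
        return ("temporary-failure", False)
    if 200 <= status < 300:
        return ("parse", True)
    if 300 <= status < 400:
        # A final 3xx means the redirect was not resolved to a document.
        return ("redirect", False)
    if 400 <= status < 500:
        # Crawlers are generally permissive here, but the check still fails.
        return ("no-restrictions", False)
    if 500 <= status < 600:
        return ("temporary-failure", False)
    return ("unknown", False)
```

For example, a 404 yields a permissive crawler interpretation while the step result is still a failure.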
RFC 9309 Core Protocol
RFC 9309 defines the core Robots Exclusion Protocol. The core records are:
| Record | Status | Purpose |
|---|---|---|
| User-agent | Required for a group | Declares the crawler product token that the following rules apply to. |
| Allow | Core rule | Allows access to matching URI paths. |
| Disallow | Core rule | Disallows access to matching URI paths. |
Core structure:
- A file is made of one or more groups.
- A group begins with one or more `User-agent` records.
- A group may contain zero or more `Allow` or `Disallow` rules.
- Empty lines separate groups.
- Comments begin with `#`.
- Rules before the first `User-agent` record are ignored by compliant parsers.
- Matching is path-based.
- The longest matching rule wins.
- If an equivalent `Allow` and `Disallow` both match, `Allow` wins.
- `*` can match zero or more characters.
- `$` anchors a pattern to the end of the path.
- Unknown records do not invalidate the file; parsers should ignore unsupported records and continue processing valid lines.
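The grouping rules in this list can be sketched as a small line-oriented parser. `parse_groups` is a hypothetical name, and the boundary rule used here (a new group starts when a `User-agent` record follows a rule) is an assumption about how group boundaries are detected; it is not taken from the package source.

```python
from typing import Dict, List, Optional


def parse_groups(body: str) -> List[Dict[str, list]]:
    """Parse User-agent groups with their Allow/Disallow rules."""
    groups: List[Dict[str, list]] = []
    current: Optional[Dict[str, list]] = None
    seen_rule = False  # True once the current group has at least one rule
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # comments begin with '#'
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            # Consecutive User-agent records share one group.
            if current is None or seen_rule:
                current = {"agents": [], "rules": []}
                groups.append(current)
                seen_rule = False
            current["agents"].append(value.lower())
        elif field in ("allow", "disallow"):
            if current is None:
                continue  # rule before the first User-agent record: ignored
            current["rules"].append((field, value))
            seen_rule = True
        # Any other record is skipped without invalidating the file.
    return groups
```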
Example:
User-agent: *
Allow: /
Disallow: /admin/
User-agent: ExampleBot
Disallow: /private/
Allow: /private/public/
Popular Nonstandard Extensions
These records are not part of the RFC 9309 core rule set, but are common enough that the check should parse and report them separately from RFC validity.
| Record | Example | Notes |
|---|---|---|
| Sitemap | Sitemap: https://example.com/sitemap.xml | Widely supported by major search engines as sitemap discovery metadata. |
| Crawl-delay | Crawl-delay: 10 | Nonstandard. Google does not support it; some crawlers and AI bots document support or expected respect for it. |
| Request-rate | Request-rate: 1/5 | Legacy extended robots proposal for request pacing. Rare today. |
| Visit-time | Visit-time: 0600-0845 | Legacy extended robots proposal for crawl time windows. Rare today. |
| Host | Host: example.com | Legacy/search-engine-specific preferred host directive, historically associated with Yandex-style usage. |
| Clean-param | Clean-param: ref /products/ | Search-engine-specific URL parameter cleanup directive, historically associated with Yandex-style usage. |
| Content-Signal | Content-Signal: ai-train=no, search=yes, ai-input=no | AI-era usage preference signal. It expresses content-use policy, not crawl permission. |
| IndexNow-Key | IndexNow-Key: https://example.com/key.txt | Optional IndexNow key-location discovery convention. Not part of RFC robots rules. |
| Noindex | Noindex: /private/ | Historical unsupported pattern. Detect and warn; use meta robots or X-Robots-Tag for indexing controls instead. |
Extension records should not make an otherwise valid file fail. They should be classified as known extensions, unsupported legacy records, or unknown records. This version emits that classification in the extensions step evidence.
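A minimal sketch of that three-way classification follows. The function name `classify_records` and the evidence shape are hypothetical, and placing `Request-rate` and `Visit-time` among the known extensions rather than the unsupported legacy records is an assumption; the source only names `Noindex` as unsupported legacy.

```python
from typing import Dict, List

CORE = {"user-agent", "allow", "disallow"}
# Assumption: every non-Noindex record from the table counts as a known extension.
KNOWN_EXTENSIONS = {
    "sitemap", "crawl-delay", "request-rate", "visit-time",
    "host", "clean-param", "content-signal", "indexnow-key",
}
UNSUPPORTED_LEGACY = {"noindex"}


def classify_records(body: str) -> Dict[str, List[str]]:
    """Bucket non-core record names; nothing here fails the file."""
    report: Dict[str, List[str]] = {
        "known_extensions": [], "unsupported_legacy": [], "unknown": [],
    }
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        name = line.split(":", 1)[0].strip()
        key = name.lower()  # record names are compared case-insensitively
        if key in CORE:
            continue
        if key in KNOWN_EXTENSIONS:
            report["known_extensions"].append(name)
        elif key in UNSUPPORTED_LEGACY:
            report["unsupported_legacy"].append(name)
        else:
            report["unknown"].append(name)
    return report
```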
Known User-Agent Tokens
User-agent tokens are not centrally registered and change over time. The check should treat unknown tokens as syntactically valid when they fit the RFC product token shape, while classifying known tokens for reporting and AI/search policy analysis.
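The following sketch separates the two questions: whether a token fits the product token shape, and whether it is a known token. It assumes the RFC 9309 shape is limited to ASCII letters, hyphens, and underscores; `classify_token` and the small `KNOWN_TOKENS` sample (drawn from the lists in this section) are hypothetical.

```python
import re
from typing import Dict, Union

# Assumed RFC 9309 product token shape: letters, '-' and '_' only.
TOKEN_SHAPE = re.compile(r"[A-Za-z_-]+")

# Small illustrative sample, not the full lists.
KNOWN_TOKENS = {
    "googlebot": "search",
    "bingbot": "search",
    "gptbot": "ai",
    "claudebot": "ai",
    "ccbot": "ai",
    "ahrefsbot": "seo",
}


def classify_token(token: str) -> Dict[str, Union[bool, str]]:
    """Report syntactic validity and known/unknown category for a token."""
    token = token.strip()
    if token == "*":
        return {"valid": True, "category": "wildcard"}
    valid = TOKEN_SHAPE.fullmatch(token) is not None
    # Tokens are compared case-insensitively.
    category = KNOWN_TOKENS.get(token.lower(), "unknown")
    return {"valid": valid, "category": category}
```

Under this strict reading, tokens that contain digits or spaces, such as MJ12bot or Screaming Frog SEO Spider, fall outside the shape; whether the check relaxes the shape for them is not stated here.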
Search and platform crawlers:
APIs-Google
AdsBot-Google
Applebot
Applebot-Extended
Baiduspider
BingPreview
DuckDuckBot
Google-CloudVertexBot
Google-Extended
Google-InspectionTool
GoogleOther
GoogleOther-Image
GoogleOther-Video
Googlebot
Googlebot-Image
Googlebot-News
Googlebot-Video
KagiBot
Mediapartners-Google
MojeekBot
PetalBot
Qwantify
SeznamBot
Slurp
Sogou
Storebot-Google
YandexBot
YandexImages
Yeti
bingbot
AI and answer-engine crawlers:
Amazonbot
Bytespider
CCBot
ChatGPT-User
Claude-SearchBot
Claude-User
ClaudeBot
Cohere-AI
Diffbot
FacebookBot
GPTBot
ImagesiftBot
Meta-ExternalAgent
Meta-ExternalFetcher
OAI-AdsBot
OAI-SearchBot
Perplexity-User
PerplexityBot
YouBot
omgili
omgilibot
SEO and commercial crawlers:
AhrefsBot
DataForSeoBot
DotBot
MJ12bot
Screaming Frog SEO Spider
SemrushBot
Source Notes
- RFC core behavior comes from RFC 9309: `https://www.rfc-editor.org/rfc/rfc9309.html`.
- Google documents supported robots fields and unsupported records in its robots.txt documentation: `https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt`.
- Google crawler and fetcher tokens are documented at `https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers`.
- OpenAI crawler tokens are documented at `https://platform.openai.com/docs/bots`.
- Anthropic crawler behavior, including robots.txt and crawl-delay handling, is documented at `https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler`.
- Applebot and Applebot-Extended are documented at `https://support.apple.com/en-us/119829`.
- Common Crawl documents `CCBot` at `https://commoncrawl.org/ccbot`.
- IndexNow key-location behavior is documented at `https://www.indexnow.org/documentation`.
Scoring Steps
| Step | Weight | Purpose |
|---|---|---|
| fetch | 0.3 | Fetch /robots.txt with a successful HTTP response. |
| core-syntax | 0.45 | Validate RFC 9309 core User-agent grouping and core rule records. |
| extensions | 0.25 | Report known extensions, unsupported legacy records, unknown records, and known/unknown crawler tokens. |
Current v1.0.0 Coverage
This version checks:
- Fetch success for `/robots.txt`.
- Presence of at least one valid `User-agent` group.
- Core `Allow` and `Disallow` rule grouping.
- `Sitemap` and `Content-Signal` evidence.
- Known nonstandard extension records such as `Crawl-delay`, `Host`, `Clean-param`, and `IndexNow-Key`.
- Unsupported legacy `Noindex` records as warnings.
- Unknown records as evidence.
- Known and unknown crawler product tokens as evidence.
This version documents, but does not yet validate:
- Full RFC longest-match path evaluation for specific URLs.
- Whether specific AI/search crawler policies are desirable for the site.
- Whether sitemap URLs advertised in robots.txt are valid; the sitemap check owns sitemap validation.
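For reference, the longest-match evaluation that this version documents but does not perform could look roughly like the sketch below. It is not part of the package: `is_allowed` and `_pattern_to_regex` are hypothetical helpers, specificity is approximated by pattern length, and percent-encoding normalization is left out.

```python
import re
from typing import List, Tuple


def _pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots path pattern ('*' wildcard, trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal parts and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(rules: List[Tuple[str, str]], path: str) -> bool:
    """Apply longest-match-wins; on an equal-length tie, Allow wins."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if _pattern_to_regex(pattern).match(path):  # match() anchors at the start
            allow = kind == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow
```

With the rules from the earlier example, `/private/public/a` would be allowed for ExampleBot because `Allow: /private/public/` is longer than `Disallow: /private/`.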
References
- www.rfc-editor.org/rfc/rfc9309
- developers.google.com/search/docs/crawling-indexing/robots/intro
- www.rfc-editor.org/rfc/rfc9309.html
- developers.google.com/search/docs/crawling-indexing/robots/robots_txt
- developers.google.com/search/docs/crawling-indexing/google-common-crawlers
- platform.openai.com/docs/bots
- support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
- support.apple.com/en-us/119829
- commoncrawl.org/ccbot
- www.indexnow.org/documentation
Source: lib/checks/robots-txt/versions/1.0.0/docs.md
6. Version Changelog
robots.txt v1.0.0 Changelog
Initial versioned package for robots.txt.
- Adds stepped scoring for fetch, RFC core syntax, and extension classification.
- Emits known extension records, unsupported legacy records, unknown records, orphan records, invalid lines, and known/unknown crawler token evidence.
- Warns on unsupported legacy `Noindex` records while preserving pass/fail behavior for the RFC core syntax.
Source: lib/checks/robots-txt/versions/1.0.0/changelog.md