1. Abstract

Publish /robots.txt with clear crawl rules.

robots.txt gives crawlers and agents a standard place to read crawl permissions, disallowed paths, and sitemap locations before requesting site content.

2. Classification

Check ID: robots.txt
Check version: 1.0.0
Package path: lib/checks/robots-txt/versions/1.0.0
Category: AI Discoverability
Subcategory: Discoverability
Check group: Crawl Discovery
Check group ID: crawl-discovery
Maturity: Established
Scope: site
Check weight: 1

3. Input And Output Contracts

Input: [email protected]
Output: [email protected]
Resources inspected: /robots.txt

4. Scoring Semantics

Step ID	Title	Weight	Description
`fetch`	Fetch robots.txt	`0.3`	Fetch the root /robots.txt file with a successful HTTP response.
`core-syntax`	Validate RFC 9309 core syntax	`0.45`	Validate User-agent groups and core Allow/Disallow records.
`extensions`	Classify extension records	`0.25`	Report known nonstandard extensions, unsupported legacy records, and unknown records without failing otherwise valid core syntax.
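As a worked example of how these weights could combine, the sketch below sums the weights of the steps that passed. The aggregation rule (a plain weighted sum) and the name `weighted_score` are assumptions for illustration; the table only states the weights.

```python
# Step weights copied from the table above.
STEP_WEIGHTS = {"fetch": 0.3, "core-syntax": 0.45, "extensions": 0.25}


def weighted_score(passed: dict) -> float:
    """Sum the weights of passing steps (assumed aggregation rule)."""
    total = sum(w for step, w in STEP_WEIGHTS.items() if passed.get(step, False))
    return round(total, 2)


# A file that was fetched and classified but has invalid core syntax.
print(weighted_score({"fetch": True, "core-syntax": False, "extensions": True}))
```

Under this assumption the three weights sum to 1, so a file that passes every step scores 1.0.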

5. Package Documentation

robots.txt Check v1.0.0

Validates that a site publishes a fetchable and parseable /robots.txt file.

This check separates the formal Robots Exclusion Protocol from popular nonstandard extensions. The RFC-defined protocol is intentionally small; search engines, AI crawlers, and site owners commonly add extra records that should be parsed and reported, but not treated as RFC-required directives.

Input Contract

[email protected]

Requires the scan origin. The check fetches ${origin}/robots.txt.
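A minimal sketch of deriving the fetch URL from the scan origin, using only the Python standard library. The helper name `robots_url` is hypothetical, and rejecting non-HTTP(S) origins is an assumption rather than documented behavior.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(origin: str) -> str:
    """Return the root robots.txt URL for an origin such as https://example.com."""
    parts = urlsplit(origin)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an http(s) origin: {origin!r}")
    # Drop any path, query, or fragment: robots.txt is read from the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/docs/page?x=1"))
```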

Output Contract

[email protected]

Emits the parsed robots body, sitemap directives, Content-Signal directives, and a report check result. The result is stepped: fetch, RFC core syntax, and extension classification.
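To illustrate the kind of data this contract describes, the sketch below pulls Sitemap and Content-Signal records out of a robots.txt body. The function name `extract_directives` and the returned dictionary shape are assumptions, not the package's actual output schema.

```python
def extract_directives(body: str) -> dict:
    """Collect Sitemap URLs and Content-Signal key/value pairs."""
    sitemaps = []
    signals = []
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        name, sep, value = line.partition(":")  # split at the first colon only
        if not sep:
            continue
        name, value = name.strip().lower(), value.strip()
        if name == "sitemap" and value:
            sitemaps.append(value)
        elif name == "content-signal":
            pairs = (item.partition("=") for item in value.split(","))
            signals.append({k.strip(): v.strip() for k, _, v in pairs if k.strip()})
    return {"sitemaps": sitemaps, "content_signals": signals}
```

Splitting at the first colon keeps the `https://` in a sitemap URL intact.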

Pass Criteria

/robots.txt responds with a successful HTTP status.
The body parses as valid robots.txt content.
At least one valid User-agent group is present.

Failure Criteria

/robots.txt cannot be fetched.
/robots.txt responds with a non-2xx status.
The body cannot be parsed as a valid robots.txt file.

Warning Criteria

The RFC core syntax is valid, but unsupported legacy records such as Noindex are present.
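The pass, failure, and warning criteria above can be read as a small decision function. This is a sketch only: the labels `pass`, `warn`, and `fail` and the parameter names are assumptions, not identifiers from the package.

```python
def result_status(fetched_2xx: bool, core_valid: bool, legacy_records: list) -> str:
    """Map the criteria onto 'pass', 'warn', or 'fail' (assumed labels)."""
    if not fetched_2xx or not core_valid:
        return "fail"
    # Valid core syntax plus legacy records such as Noindex yields a warning.
    return "warn" if legacy_records else "pass"


print(result_status(True, True, ["Noindex: /private/"]))
```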

HTTP Status Semantics

The check records the HTTP status because crawlers interpret robots.txt fetch outcomes differently from ordinary page fetches.

Status class	Robots interpretation
`2xx`	The robots.txt file was fetched and should be parsed.
`3xx`	Redirects may be followed; RFC 9309 says crawlers should follow at least five consecutive redirects and may treat longer chains as unavailable.
`4xx`	A missing or inaccessible robots.txt file generally means there are no robots.txt crawl restrictions.
`5xx`	Server errors mean the file is unreachable; RFC 9309 tells crawlers to assume complete disallow in that case, so crawlers pause or reduce crawling rather than assume access is allowed.
Network / DNS error	Treated like a temporary fetch failure by crawlers.

For CanAgentUse scoring, a non-2xx response fails this check because the site did not publish a fetchable policy document at /robots.txt, even though crawler behavior for 4xx can be permissive.
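The table and scoring rule above might be sketched as a single lookup. The function name `interpret_status`, the wording of the returned strings, and the use of `None` for a network or DNS error are assumptions; the status passed in is taken to be the final one after any redirects the HTTP client followed.

```python
from typing import Optional, Tuple


def interpret_status(status: Optional[int]) -> Tuple[str, bool]:
    """Return (crawler-side interpretation, whether the fetch step passes)."""
    if status is None:  # network or DNS error (assumed encoding)
        return ("temporary fetch failure", False)
    if 200 <= status < 300:
        return ("parse the file", True)
    if 300 <= status < 400:
        return ("unresolved redirect", False)
    if 400 <= status < 500:
        return ("no robots.txt crawl restrictions", False)
    if 500 <= status < 600:
        return ("temporary server failure", False)
    return ("unexpected status", False)


print(interpret_status(404))
```

Only the 2xx branch passes the fetch step, even though a crawler reading a 404 would generally proceed without restrictions.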

RFC 9309 Core Protocol

RFC 9309 defines the core Robots Exclusion Protocol. The core records are:

Record	Status	Purpose
`User-agent`	Required for a group	Declares the crawler product token that the following rules apply to.
`Allow`	Core rule	Allows access to matching URI paths.
`Disallow`	Core rule	Disallows access to matching URI paths.

Core structure:

A file is made of one or more groups.
A group begins with one or more User-agent records.
A group may contain zero or more Allow or Disallow rules.
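Empty lines may appear between records and conventionally separate groups, but the RFC 9309 grammar does not rely on them: a new group starts at the first User-agent record that follows a rule.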
Comments begin with #.
Rules before the first User-agent record are ignored by compliant parsers.
Matching is path-based.
The longest matching rule wins.
If an equivalent Allow and Disallow both match, Allow wins.
* can match zero or more characters.
$ anchors a pattern to the end of the path.
Unknown records do not invalidate the file; parsers should ignore unsupported records and continue processing valid lines.
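The grouping rules above can be sketched as a small line-oriented parser. This is not the package's parser: the name `parse_groups` and the returned shapes are assumptions. The sketch skips blank lines and starts a new group at a User-agent record that follows a rule, and it collects rules that precede any User-agent record as orphans.

```python
def parse_groups(body: str):
    """Group core records; return (groups, orphan rule lines)."""
    groups, orphans = [], []
    current = None
    last_was_agent = False
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # comments begin with '#'
        name, sep, value = line.partition(":")
        if not sep:
            continue  # blank, comment-only, or malformed line
        name, value = name.strip().lower(), value.strip()
        if name == "user-agent":
            if not last_was_agent:  # first agent line opens a new group
                current = {"agents": [], "rules": []}
                groups.append(current)
            current["agents"].append(value)
            last_was_agent = True
        elif name in ("allow", "disallow"):
            if current is None:
                orphans.append(line)  # rule before any User-agent record
            else:
                current["rules"].append((name, value))
            last_was_agent = False
        # Any other record is ignored here without invalidating the file.
    return groups, orphans
```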

Example:

User-agent: *
Allow: /
Disallow: /admin/

User-agent: ExampleBot
Disallow: /private/
Allow: /private/public/
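To show how longest-match evaluation plays out on the ExampleBot group, here is a hedged sketch of a path matcher. It is illustrative only and not part of what this check version validates. The names `_to_regex` and `is_allowed` are hypothetical, and percent-encoding normalization is deliberately left out.

```python
import re


def _to_regex(pattern: str):
    """Translate a rule path: '*' is a wildcard, a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = ".*".join(re.escape(part) for part in core.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path: str) -> bool:
    """Longest matching rule wins; Allow wins ties; no match means allowed."""
    best_len, allowed = -1, True
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty rule path matches nothing
        if _to_regex(pattern).match(path):  # match() anchors at the path start
            length = len(pattern)
            if length > best_len or (length == best_len and kind == "allow"):
                best_len, allowed = length, kind == "allow"
    return allowed


example_bot = [("disallow", "/private/"), ("allow", "/private/public/")]
print(is_allowed(example_bot, "/private/public/page.html"))
print(is_allowed(example_bot, "/private/secret.html"))
```

The first call is allowed because `/private/public/` is the longer match; the second is disallowed because only `/private/` matches.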

Popular Nonstandard Extensions

These records are not part of the RFC 9309 core rule set, but are common enough that the check should parse and report them separately from RFC validity.

Record	Example	Notes
`Sitemap`	`Sitemap: https://example.com/sitemap.xml`	Widely supported by major search engines as sitemap discovery metadata.
`Crawl-delay`	`Crawl-delay: 10`	Nonstandard. Google does not support it; some other crawlers, including some AI bots, document that they honor it.
`Request-rate`	`Request-rate: 1/5`	Legacy extended robots proposal for request pacing. Rare today.
`Visit-time`	`Visit-time: 0600-0845`	Legacy extended robots proposal for crawl time windows. Rare today.
`Host`	`Host: example.com`	Legacy/search-engine-specific preferred host directive, historically associated with Yandex-style usage.
`Clean-param`	`Clean-param: ref /products/`	Search-engine-specific URL parameter cleanup directive, historically associated with Yandex-style usage.
`Content-Signal`	`Content-Signal: ai-train=no, search=yes, ai-input=no`	AI-era usage preference signal. It expresses content-use policy, not crawl permission.
`IndexNow-Key`	`IndexNow-Key: https://example.com/key.txt`	Optional IndexNow key-location discovery convention. Not part of RFC robots rules.
`Noindex`	`Noindex: /private/`	Historical unsupported pattern. Detect and warn; use meta robots or `X-Robots-Tag` for indexing controls instead.

Extension records should not make an otherwise valid file fail. They should be classified as known extensions, unsupported legacy records, or unknown records. This version emits that classification in the extensions step evidence.
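The classification described above can be sketched as a lookup on the lowercased record name. The set contents come from the table; the function name `classify_record` and the category labels are assumptions and may not match the evidence keys the package actually emits.

```python
CORE = {"user-agent", "allow", "disallow"}
KNOWN_EXTENSIONS = {
    "sitemap", "crawl-delay", "request-rate", "visit-time",
    "host", "clean-param", "content-signal", "indexnow-key",
}
UNSUPPORTED_LEGACY = {"noindex"}


def classify_record(name: str) -> str:
    """Classify a record name without affecting core validity."""
    key = name.strip().lower()  # record names are compared case-insensitively
    if key in CORE:
        return "core"
    if key in KNOWN_EXTENSIONS:
        return "known-extension"
    if key in UNSUPPORTED_LEGACY:
        return "unsupported-legacy"
    return "unknown"


print([classify_record(n) for n in ("Allow", "Crawl-delay", "Noindex", "Foo")])
```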

Known User-Agent Tokens

User-agent tokens are not centrally registered and change over time. The check should treat unknown tokens as syntactically valid when they fit the RFC product token shape, while classifying known tokens for reporting and AI/search policy analysis.

Search and platform crawlers:

APIs-Google
AdsBot-Google
Applebot
Applebot-Extended
Baiduspider
BingPreview
DuckDuckBot
Google-CloudVertexBot
Google-Extended
Google-InspectionTool
GoogleOther
GoogleOther-Image
GoogleOther-Video
Googlebot
Googlebot-Image
Googlebot-News
Googlebot-Video
KagiBot
Mediapartners-Google
MojeekBot
PetalBot
Qwantify
SeznamBot
Slurp
Sogou
Storebot-Google
YandexBot
YandexImages
Yeti
bingbot

AI and answer-engine crawlers:

Amazonbot
Bytespider
CCBot
ChatGPT-User
Claude-SearchBot
Claude-User
ClaudeBot
Cohere-AI
Diffbot
FacebookBot
GPTBot
ImagesiftBot
Meta-ExternalAgent
Meta-ExternalFetcher
OAI-AdsBot
OAI-SearchBot
Perplexity-User
PerplexityBot
YouBot
omgili
omgilibot

SEO and commercial crawlers:

AhrefsBot
DataForSeoBot
DotBot
MJ12bot
Screaming Frog SEO Spider
SemrushBot
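A sketch of how a token could be checked against the RFC product token shape and against a known-token list. RFC 9309 limits product tokens to letters, hyphens, and underscores, with `*` as the wildcard. The names `PRODUCT_TOKEN`, `KNOWN_TOKENS`, and `classify_token` are hypothetical, and the known set below is only a small sample of the lists above.

```python
import re

# Strict RFC 9309 shape: letters, '-' and '_' only, or the wildcard '*'.
PRODUCT_TOKEN = re.compile(r"\*|[A-Za-z_-]+")

# Small illustrative sample, stored lowercase for case-insensitive lookup.
KNOWN_TOKENS = {"googlebot", "bingbot", "gptbot", "claudebot", "ccbot"}


def classify_token(token: str) -> dict:
    """Report whether a token fits the RFC shape and whether it is known."""
    value = token.strip()
    return {
        "token": value,
        "rfc_shape": PRODUCT_TOKEN.fullmatch(value) is not None,
        "known": value.lower() in KNOWN_TOKENS,
    }


print(classify_token("GPTBot"))
print(classify_token("MJ12bot"))
```

Under this strict reading, tokens such as MJ12bot (digits) and Screaming Frog SEO Spider (spaces) fall outside the RFC shape, so an implementation may prefer a more lenient pattern and report the difference as evidence rather than as a failure.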

Source Notes

RFC core behavior comes from RFC 9309: https://www.rfc-editor.org/rfc/rfc9309.html.
Google documents supported robots fields and unsupported records in its robots.txt documentation: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt.
Google crawler and fetcher tokens are documented at https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers.
OpenAI crawler tokens are documented at https://platform.openai.com/docs/bots.
Anthropic crawler behavior, including robots.txt and crawl-delay handling, is documented at https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler.
Applebot and Applebot-Extended are documented at https://support.apple.com/en-us/119829.
Common Crawl documents CCBot at https://commoncrawl.org/ccbot.
IndexNow key-location behavior is documented at https://www.indexnow.org/documentation.

Scoring Steps

Step	Weight	Purpose
`fetch`	0.3	Fetch `/robots.txt` with a successful HTTP response.
`core-syntax`	0.45	Validate RFC 9309 core User-agent grouping and core rule records.
`extensions`	0.25	Report known extensions, unsupported legacy records, unknown records, and known/unknown crawler tokens.

Current v1.0.0 Coverage

This version checks:

Fetch success for /robots.txt.
Presence of at least one valid User-agent group.
Core Allow and Disallow rule grouping.
Sitemap and Content-Signal evidence.
Known nonstandard extension records such as Crawl-delay, Host, Clean-param, and IndexNow-Key.
Unsupported legacy Noindex records as warnings.
Unknown records as evidence.
Known and unknown crawler product tokens as evidence.

This version documents, but does not yet validate:

Full RFC longest-match path evaluation for specific URLs.
Whether specific AI/search crawler policies are desirable for the site.
Whether sitemap URLs advertised in robots.txt are valid; the sitemap check owns sitemap validation.

References

Source: lib/checks/robots-txt/versions/1.0.0/docs.md

6. Version Changelog

robots.txt v1.0.0 Changelog

Initial versioned package for robots.txt.

Adds stepped scoring for fetch, RFC core syntax, and extension classification.
Emits known extension records, unsupported legacy records, unknown records, orphan records, invalid lines, and known/unknown crawler token evidence.
Warns on unsupported legacy Noindex records while preserving pass/fail behavior for the RFC core syntax.

Source: lib/checks/robots-txt/versions/1.0.0/changelog.md