1. Abstract

Publish a standards-aligned sitemap or sitemap index at a discoverable URL.

Sitemaps help crawlers and agents discover canonical URLs, update timestamps, and deeper content that may not be obvious from homepage navigation alone.

2. Classification

Check ID: sitemap
Check version: 1.0.0
Package path: lib/checks/sitemap/versions/1.0.0
Category: AI Discoverability
Subcategory: Discoverability
Check group: Crawl Discovery
Check group ID: crawl-discovery
Maturity: Established
Scope: site
Check weight: 1

3. Input And Output Contracts

Input: [email protected]
Output: [email protected]
Resources inspected: /sitemap.xml, /sitemap.txt, /sitemap_index.xml, /sitemap-index.xml, /robots.txt

4. Scoring Semantics

Step ID	Title	Weight	Description
`discover`	Discover sitemap candidates	`0.2`	Build candidate sitemap URLs from standard locations and independently fetched robots.txt Sitemap directives.
`fetch`	Fetch a sitemap candidate	`0.25`	Fetch at least one candidate with a successful HTTP response.
`parse`	Parse sitemap	`0.25`	Confirm the response is a sitemap document with absolute HTTP(S) URL entries.
`field-quality`	Validate sitemap field quality	`0.1`	Validate optional fields, extension metadata, and protocol size limits when present.
`scope`	Validate URL scope	`0.2`	Confirm listed page URLs belong to the scanned origin host.

5. Package Documentation

Sitemap Check v1.0.0

Validates that a site exposes a fetchable sitemap discovery surface using the Sitemaps protocol.

There is no IETF RFC for XML Sitemaps. The normative reference is the Sitemaps XML protocol published at sitemaps.org. That protocol cites the URI and IRI syntax standards (RFC 3986 and RFC 3987) for URL values, but it is not itself an RFC.

The check is isolated. It does not consume the result of the robots.txt check. It independently discovers sitemap candidates from standard sitemap paths and, when available, Sitemap: directives in /robots.txt.

Input Contract

[email protected]

Requires the scan origin. The check fetches standard sitemap candidates and may fetch ${origin}/robots.txt only to discover additional Sitemap: URLs.

Output Contract

[email protected]

Emits a stepped report check result with discovery, fetch, parse, field-quality, and URL-scope evidence.

Pass Criteria

At least one sitemap candidate is discoverable.
At least one candidate responds with a successful HTTP status.
The response parses as a supported sitemap format: XML <urlset>, XML <sitemapindex>, plain text, RSS, or Atom.

The document contains at least one absolute HTTP(S) URL.
For URL sitemaps, page URLs belong to the scanned origin host.
Optional fields, when present, use valid values: <lastmod> is a parseable W3C/ISO-style date, <changefreq> is one of the protocol values, and <priority> is between 0.0 and 1.0.
URL sitemap and sitemap index entry counts do not exceed the protocol's 50,000-entry limit.
Sitemap content stays within the 50 MB uncompressed size limit.
Supported extension records contain required fields and absolute HTTP(S) URL values.

Warning Criteria

A valid URL sitemap is found, but one or more page URLs point to a different host than the scanned origin.
A sitemap is valid, but optional fields such as <lastmod>, <changefreq>, or <priority> contain invalid values.
A sitemap is valid, but image, video, news, or hreflang extension records are incomplete or contain relative/non-HTTP(S) URL values.

Failure Criteria

No sitemap candidates can be discovered.
No candidate returns a successful HTTP response.
Fetched content is not a supported sitemap format.
URL values are missing or are not absolute HTTP(S) URLs.
The sitemap exceeds the protocol's 50,000 URL or 50,000 child sitemap limit.
The sitemap exceeds the 50 MB uncompressed size limit.

Core Protocol Documents

The Sitemaps protocol defines two XML document shapes.

Document	Root element	Child element	Required data
URL sitemap	`<urlset>`	`<url>`	One or more `<loc>` entries for page URLs.
Sitemap index	`<sitemapindex>`	`<sitemap>`	One or more `<loc>` entries for child sitemap files.
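
The two document shapes above can be told apart by the root element. The following is a minimal sketch using Python's standard library, not the check's actual parser; the function name `classify_sitemap` is hypothetical, and the namespace URI is the one published by sitemaps.org.

```python
import xml.etree.ElementTree as ET

# Core Sitemaps protocol namespace, as published at sitemaps.org.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def classify_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return (kind, locs); kind is 'urlset', 'sitemapindex', or 'unknown'."""
    root = ET.fromstring(xml_text)
    # ElementTree renders namespaced tags as "{namespace}local".
    local = root.tag.rsplit("}", 1)[-1]
    # Map each supported root element to its expected child element.
    child = {"urlset": "url", "sitemapindex": "sitemap"}.get(local)
    if child is None:
        return "unknown", []
    ns = {"sm": SITEMAP_NS}
    locs = [
        (el.text or "").strip()
        for el in root.findall(f"sm:{child}/sm:loc", ns)
    ]
    return local, locs


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "</urlset>"
)
kind, locs = classify_sitemap(sample)
```

Malformed XML raises `xml.etree.ElementTree.ParseError` in this sketch; a real check would presumably record that as parse-step evidence instead of propagating the exception.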

URL Sitemap Fields

Field	Status	Meaning
`<loc>`	Required	Absolute URL for a page. The URL must include the protocol and must be entity-escaped in XML.
`<lastmod>`	Optional	Last modification date for the page, in W3C Datetime format. The time portion may be omitted, leaving `YYYY-MM-DD`.
`<changefreq>`	Optional	Crawl-change hint. It is not a command and crawlers may ignore it.
`<priority>`	Optional	Relative priority from `0.0` to `1.0` within this site only. It does not compare one site against another.
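
As an illustration of how the optional fields in this table might be validated, here is a small sketch. The helper names are hypothetical, and the strictness shown (exact lowercase `<changefreq>` values, no range check on the time portion of `<lastmod>`) is an assumption, not necessarily what the check implements.

```python
import re
from datetime import datetime

# The seven <changefreq> values defined by the Sitemaps protocol.
CHANGEFREQ_VALUES = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}

# W3C Datetime shapes: YYYY, YYYY-MM, YYYY-MM-DD, or a timestamp with a
# timezone designator (Z or +hh:mm / -hh:mm).
W3C_DATETIME = re.compile(
    r"^\d{4}(-\d{2}(-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?)?)?$"
)


def is_valid_lastmod(value: str) -> bool:
    value = value.strip()
    if not W3C_DATETIME.match(value):
        return False
    if len(value) == 7:  # YYYY-MM: range-check the month only
        return 1 <= int(value[5:7]) <= 12
    if len(value) >= 10:  # a full calendar date is present
        try:
            datetime.strptime(value[:10], "%Y-%m-%d")
        except ValueError:
            return False
    return True


def is_valid_changefreq(value: str) -> bool:
    return value.strip() in CHANGEFREQ_VALUES


def is_valid_priority(value: str) -> bool:
    try:
        number = float(value)
    except ValueError:
        return False
    # NaN and infinities fail this comparison and are rejected.
    return 0.0 <= number <= 1.0
```

Under the warning criteria described earlier, a failed validation here would lower field quality rather than invalidate the sitemap.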

Sitemap Index Fields

Field	Status	Meaning
`<loc>`	Required	Absolute URL for a child sitemap, feed, or text sitemap file.
`<lastmod>`	Optional	Last modification date for the child sitemap file.

Protocol Limits

Limit	Value	Applies to
URLs per sitemap	50,000	URL sitemap files.
Sitemaps per index	50,000	Sitemap index files.
Uncompressed file size	50 MB	Sitemap files and sitemap index files.
Encoding	UTF-8	Sitemap files.
Compression	gzip allowed	Compressed file must still respect the uncompressed size limit.

Large sites should split URLs across multiple sitemap files and publish a sitemap index.
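
A sketch of both sides of these limits: splitting a large URL inventory into protocol-sized batches, and checking a fetched body against the limits. The function names are hypothetical, and interpreting 50 MB as 50 × 1024 × 1024 bytes is an assumption.

```python
import gzip
from itertools import islice
from typing import Iterable, Iterator

MAX_ENTRIES = 50_000
MAX_BYTES = 50 * 1024 * 1024  # assumed reading of "50 MB"


def chunk_urls(urls: Iterable[str], size: int = MAX_ENTRIES) -> Iterator[list[str]]:
    """Yield successive batches, each small enough for one sitemap file."""
    it = iter(urls)
    while batch := list(islice(it, size)):
        yield batch


def within_limits(body: bytes, entry_count: int) -> bool:
    """Check the entry count and the uncompressed size of a sitemap body."""
    if body[:2] == b"\x1f\x8b":  # gzip magic bytes
        # Note: a production check should cap decompression to avoid
        # unbounded memory use on hostile input.
        body = gzip.decompress(body)
    return len(body) <= MAX_BYTES and entry_count <= MAX_ENTRIES
```

Each batch from `chunk_urls` would become one child sitemap file, with a sitemap index listing the child file URLs.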

URL Scope And Host Rules

Sitemap URLs are scoped. The protocol requires sitemap entries to belong to the same site or appropriate host scope as the sitemap location. In practice:

A sitemap at https://example.com/sitemap.xml should list URLs under https://example.com/.
A sitemap should not mix unrelated hosts.
Separate hosts or subdomains should publish their own sitemaps, or a sitemap index should be placed at a scope that is valid for those child sitemaps.
URL values should be canonical, absolute, HTTP(S) URLs.

This check warns when URL sitemap entries point outside the scanned origin host.
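
The host comparison behind that warning could look like the sketch below. The function name `out_of_scope` is hypothetical, and exact host equality is an assumption: this sketch treats `www.example.com` and `example.com` as different hosts, which may or may not match the check's behavior.

```python
from urllib.parse import urlsplit


def out_of_scope(origin: str, urls: list[str]) -> list[str]:
    """Return the URLs that are not absolute HTTP(S) URLs on the origin host."""
    # urlsplit lowercases the hostname and strips any port.
    origin_host = urlsplit(origin).hostname or ""
    flagged = []
    for url in urls:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or parts.hostname != origin_host:
            flagged.append(url)
    return flagged


flagged = out_of_scope(
    "https://example.com",
    [
        "https://example.com/pricing",
        "https://blog.example.com/post",
        "/relative/path",
    ],
)
```

Here `flagged` contains the subdomain URL and the relative path, while the same-host URL passes.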

XML Requirements

Sitemap files must be valid XML.
Special characters in data values must be entity-escaped: &, <, >, double quotes, and apostrophes.
URLs must use valid URI or IRI syntax.
The document should declare sitemap XML namespaces when using the XML protocol or extensions.
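
For the escaping requirement, a short sketch of producing a `<loc>` element with Python's standard `xml.sax.saxutils.escape`. The helper name `loc_element` is hypothetical.

```python
from xml.sax.saxutils import escape


def loc_element(url: str) -> str:
    """Render a <loc> element with the five XML special characters escaped."""
    # escape() handles &, <, and > by default; quotes are passed explicitly.
    escaped = escape(url, {'"': "&quot;", "'": "&apos;"})
    return f"<loc>{escaped}</loc>"


element = loc_element("https://example.com/search?q=a&lang=en")
# element == "<loc>https://example.com/search?q=a&amp;lang=en</loc>"
```

The most common real-world failure is a bare `&` in a query string, which makes the whole file invalid XML.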

Discovery Model

The check probes these conventional locations:

/sitemap.xml
/sitemap.txt
/sitemap_index.xml
/sitemap-index.xml

These paths are conventions, not the only valid sitemap locations.

The check also reads Sitemap: directives from /robots.txt when robots.txt is available. Robots discovery is additive; a missing or invalid robots.txt file does not make this check fail if a standard sitemap URL is valid.

Sitemap: directives:

Use a full sitemap URL.
Are not scoped to a specific User-agent group.
May appear multiple times.
Can point to sitemap indexes or individual sitemap files.
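
The discovery model above can be sketched as follows: build candidates from the conventional paths, then append any `Sitemap:` URLs found in robots.txt. The function names are hypothetical, and details such as comment stripping and de-duplication order are assumptions about one reasonable implementation.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit

CONVENTIONAL_PATHS = (
    "/sitemap.xml",
    "/sitemap.txt",
    "/sitemap_index.xml",
    "/sitemap-index.xml",
)


def sitemap_directives(robots_txt: str) -> list[str]:
    """Collect absolute HTTP(S) URLs from Sitemap: lines, in any group."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            value = value.strip()
            if urlsplit(value).scheme in ("http", "https"):
                found.append(value)
    return found


def build_candidates(origin: str, robots_txt: Optional[str] = None) -> list[str]:
    """Conventional paths first, then robots.txt directives, de-duplicated."""
    candidates = [urljoin(origin, path) for path in CONVENTIONAL_PATHS]
    # A missing robots.txt (None) simply contributes no extra candidates.
    for url in sitemap_directives(robots_txt or ""):
        if url not in candidates:
            candidates.append(url)
    return candidates
```

Because robots.txt only adds candidates, passing `None` for `robots_txt` still yields the four conventional URLs, which mirrors the additive behavior described above.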

Submission And Notification

Mechanism	Status	Notes
`robots.txt` `Sitemap:` directive	Standard discovery	Search engines can discover sitemap URLs while fetching robots.txt.
Search engine console submission	Search-engine-specific	Google Search Console and similar tools accept direct sitemap submission.
Ping URL submission	Protocol-defined, engine-dependent	The protocol documents ping-style submission, but support varies by search engine.
IndexNow	Related separate protocol	URL-change notification protocol; not part of the Sitemap protocol.

Alternate Sitemap Formats

The protocol allows alternatives to XML sitemap files.

Format	Status	Notes
Plain text sitemap	Supported alternate	One URL per line. Same URL count and uncompressed size limits.
RSS feed	Supported alternate	Often best for recent URLs, not full canonical inventories.
Atom feed	Supported alternate	Often best for recent URLs, not full canonical inventories.

This v1.0.0 check validates plain text, RSS, and Atom sitemap formats when they are found at a conventional path or advertised through a Sitemap: directive.
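
For the plain text format, a minimal parsing sketch: one URL per line, with each line accepted only if it is an absolute HTTP(S) URL. The function name `parse_text_sitemap` is hypothetical, and skipping blank lines is an assumption.

```python
from urllib.parse import urlsplit


def parse_text_sitemap(body: str) -> tuple[list[str], list[str]]:
    """Split a text sitemap into (valid_urls, invalid_lines)."""
    valid, invalid = [], []
    # Strip a leading UTF-8 byte order mark if one survived decoding.
    for raw in body.lstrip("\ufeff").splitlines():
        line = raw.strip()
        if not line:
            continue
        try:
            parts = urlsplit(line)
        except ValueError:  # e.g. a malformed bracketed host
            invalid.append(line)
            continue
        if parts.scheme in ("http", "https") and parts.netloc:
            valid.append(line)
        else:
            invalid.append(line)
    return valid, invalid
```

The same 50,000-entry and uncompressed size limits apply to the `valid` list and the fetched body.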

Extension Namespaces

The Sitemaps protocol supports XML namespace extensions. Extension support is crawler-specific; unsupported extension fields should not invalidate a core sitemap document.

Extension	Status	Notes
Image sitemap extension	Google-supported extension	Adds image discovery metadata, commonly with `image:image` and `image:loc`.
Video sitemap extension	Google-supported extension	Adds video discovery metadata for video indexing.
News sitemap extension	Google-supported extension	Adds Google News publication metadata with stricter freshness and publisher requirements.
`xhtml:link` hreflang annotations	Google-supported extension	Declares localized alternate URLs inside a sitemap.
Custom namespaces	Protocol-supported mechanism	Valid XML extensions can exist, but crawler support depends on the namespace and search engine.

This version validates basic extension correctness for image, video, news, and hreflang records. It checks required fields and absolute HTTP(S) URL values, but does not attempt crawler-specific eligibility decisions such as whether a news publisher is approved for Google News.
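
As one example of what basic extension correctness might mean, the sketch below inspects image extension records for a missing or non-absolute `image:loc`. The function name `image_issues` and the message wording are hypothetical; the image namespace URI is the one Google documents for image sitemaps.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}


def image_issues(xml_text: str) -> list[str]:
    """Return one message per incomplete or non-absolute image record."""
    root = ET.fromstring(xml_text)
    issues = []
    for url in root.findall("sm:url", NS):
        page = url.findtext("sm:loc", default="", namespaces=NS).strip()
        for image in url.findall("image:image", NS):
            loc = image.findtext("image:loc", default="", namespaces=NS).strip()
            if not loc:
                issues.append(f"{page}: image:image is missing image:loc")
            elif urlsplit(loc).scheme not in ("http", "https"):
                issues.append(f"{page}: image:loc is not an absolute HTTP(S) URL: {loc}")
    return issues
```

Issues found this way would map to the warning criteria, since an incomplete extension record does not invalidate the core sitemap document.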

Search Engine Interpretation Notes

Field or behavior	Protocol status	Search-engine notes
`<lastmod>`	Optional standard field	Google may use it when it is accurate and consistently verifiable.
`<changefreq>`	Optional standard hint	Google generally ignores it.
`<priority>`	Optional standard hint	Google generally ignores it.
Image extension deprecated tags	Extension-specific	Google no longer documents support for older image tags such as caption, title, geographic location, and license fields.
Multiple extensions in one sitemap	Extension mechanism	Search engines may support combining multiple namespaces in one XML sitemap.

Related But Not Sitemap Protocol

Feature	Status	Notes
HTML sitemap	Separate UX/navigation page	Useful for users, not the XML sitemap protocol.
IndexNow	Separate URL notification protocol	Complements sitemaps but does not replace the sitemap document.
`llms.txt`	Separate AI/content guidance file	Not a sitemap extension.
`agents.json` and agent manifests	Separate agent discovery surfaces	Not sitemap extensions.
API catalogs and OpenAPI	Separate machine interface discovery	Not sitemap extensions.

Scoring Steps

Step	Weight	Purpose
`discover`	0.2	Build candidate sitemap URLs from standard locations and robots.txt directives.
`fetch`	0.25	Fetch at least one candidate with a successful HTTP response.
`parse`	0.25	Confirm valid XML, plain text, RSS, or Atom sitemap structure.
`field-quality`	0.1	Validate optional field values, extension records, and size/entry-count limits.
`scope`	0.2	Confirm listed page URLs belong to the scanned origin host.
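
The step weights sum to 1.0. The documentation does not state how per-step results are combined, so the sketch below assumes the simplest model: each step yields a value between 0 and 1, and the overall score is the weighted sum. The function name `weighted_score` and that combination rule are assumptions.

```python
import math

# Weights copied from the scoring steps table.
STEP_WEIGHTS = {
    "discover": 0.2,
    "fetch": 0.25,
    "parse": 0.25,
    "field-quality": 0.1,
    "scope": 0.2,
}


def weighted_score(step_results: dict[str, float]) -> float:
    """Weighted sum of per-step results, each clamped to [0, 1]."""
    return sum(
        weight * min(max(step_results.get(step, 0.0), 0.0), 1.0)
        for step, weight in STEP_WEIGHTS.items()
    )


# Hypothetical run: sitemap found and parsed, half credit on field
# quality, and a failed scope step.
score = weighted_score(
    {"discover": 1.0, "fetch": 1.0, "parse": 1.0, "field-quality": 0.5, "scope": 0.0}
)
assert math.isclose(score, 0.75)
```

Under this model, a site with a valid sitemap whose URLs all point off-host would still earn most of the credit, which is consistent with scope problems being warnings rather than failures.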

Current v1.0.0 Coverage

This version checks:

Conventional XML sitemap paths.
Conventional plain text sitemap path.
robots.txt Sitemap: directives.
Fetch success.
XML <urlset> and <sitemapindex> structure.
Plain text, RSS, and Atom sitemap formats.
Absolute HTTP(S) URL values.
Same-origin host scope for URL sitemap entries.
Same-origin host scope for child sitemap index entries.
Optional <lastmod>, <changefreq>, and <priority> value quality.
50,000 URL and 50,000 child sitemap entry-count limits.
50 MB uncompressed sitemap size limits, including decompressed response bodies as exposed by fetch.

Basic image, video, news, and hreflang extension correctness.

External Signals Not Emitted By This Check

This package validates the public sitemap file. It does not emit search engine console submission state, ping submission state, indexing status, or crawler-specific eligibility decisions because those require external account data or live search-engine submission telemetry rather than the sitemap document itself.

References

Source: lib/checks/sitemap/versions/1.0.0/docs.md

6. Version Changelog

sitemap v1.0.0 Changelog

Initial versioned package for sitemap.

Declares discovery, fetch, parse, and scope scoring steps.
Documents XML sitemap and sitemap index semantics.
Documents robots.txt Sitemap: discovery as an isolated additive discovery mechanism.
Adds optional field-quality validation for <lastmod>, <changefreq>, and <priority>.
Adds entry-count limit checks for URL sitemaps and sitemap indexes.
Adds same-origin scope evidence for child sitemap URLs in sitemap indexes.
Adds /sitemap.txt discovery and validates plain text, RSS, and Atom sitemap formats.
Adds uncompressed 50 MB sitemap size validation.
Adds basic image, video, news, and hreflang extension-quality validation.

Source: lib/checks/sitemap/versions/1.0.0/changelog.md