Check specification

sitemap 1.0.0

Sitemap

Validates that a site exposes a fetchable sitemap or sitemap index.

Assessment Suite: 2026.06.10
Maturity: Established
Category: AI Discoverability
Subcategory: Crawl Discovery

1. Abstract

Publish a standards-aligned sitemap or sitemap index at a discoverable URL.

Sitemaps help crawlers and agents discover canonical URLs, update timestamps, and deeper content that may not be obvious from homepage navigation alone.

2. Classification

Check ID: sitemap
Check version: 1.0.0
Package path: lib/checks/sitemap/versions/1.0.0
Category: AI Discoverability
Subcategory: Discoverability
Check group: Crawl Discovery
Check group ID: crawl-discovery
Maturity: Established
Scope: site
Check weight: 1

3. Input And Output Contracts

Resources inspected: /sitemap.xml, /sitemap.txt, /sitemap_index.xml, /sitemap-index.xml, /robots.txt

4. Scoring Semantics

| Step ID | Title | Weight | Description |
| --- | --- | --- | --- |
| discover | Discover sitemap candidates | 0.2 | Build candidate sitemap URLs from standard locations and independently fetched robots.txt Sitemap directives. |
| fetch | Fetch a sitemap candidate | 0.25 | Fetch at least one candidate with a successful HTTP response. |
| parse | Parse sitemap | 0.25 | Confirm the response is a sitemap document with absolute HTTP(S) URL entries. |
| field-quality | Validate sitemap field quality | 0.1 | Validate optional fields, extension metadata, and protocol size limits when present. |
| scope | Validate URL scope | 0.2 | Confirm listed page URLs belong to the scanned origin host. |

5. Package Documentation

Sitemap Check v1.0.0

Validates that a site exposes a fetchable sitemap discovery surface using the Sitemaps protocol.

There is no IETF RFC for XML Sitemaps. The normative reference is the Sitemaps XML protocol published at sitemaps.org. That protocol relies on the URI and IRI syntax standards, which are RFCs, but the sitemap protocol itself is not.

The check is isolated. It does not consume the result of the robots.txt check. It independently discovers sitemap candidates from standard sitemap paths and, when available, Sitemap: directives in /robots.txt.

Input Contract

[email protected]

Requires the scan origin. The check fetches standard sitemap candidates and may fetch ${origin}/robots.txt only to discover additional Sitemap: URLs.

Output Contract

[email protected]

Emits a stepped report check result with discovery, fetch, parse, field-quality, and URL-scope evidence.

Pass Criteria

  • At least one sitemap candidate is discoverable.
  • At least one candidate responds with a successful HTTP status.
  • The response parses as a supported sitemap format: XML <urlset>, XML <sitemapindex>, plain text, RSS, or Atom.
  • The document contains at least one absolute HTTP(S) URL.
  • For URL sitemaps, page URLs belong to the scanned origin host.
  • Optional fields, when present, use valid values: <lastmod> is a parseable W3C/ISO-style date, <changefreq> is one of the protocol values, and <priority> is between 0.0 and 1.0.
  • URL sitemap and sitemap index entry counts do not exceed the protocol's 50,000-entry limit.
  • Sitemap content stays within the 50 MB uncompressed size limit.
  • Supported extension records contain required fields and absolute HTTP(S) URL values.

Warning Criteria

  • A valid URL sitemap is found, but one or more page URLs point to a different host than the scanned origin.
  • A sitemap is valid, but optional fields such as <lastmod>, <changefreq>, or <priority> contain invalid values.
  • A sitemap is valid, but image, video, news, or hreflang extension records are incomplete or contain relative/non-HTTP(S) URL values.

Failure Criteria

  • No sitemap candidates can be discovered.
  • No candidate returns a successful HTTP response.
  • Fetched content is not a supported sitemap format.
  • URL values are missing or are not absolute HTTP(S) URLs.
  • The sitemap exceeds the protocol's 50,000 URL or 50,000 child sitemap limit.
  • The sitemap exceeds the 50 MB uncompressed size limit.

Core Protocol Documents

The Sitemaps protocol defines two XML document shapes.

| Document | Root element | Child element | Required data |
| --- | --- | --- | --- |
| URL sitemap | <urlset> | <url> | One or more <loc> entries for page URLs. |
| Sitemap index | <sitemapindex> | <sitemap> | One or more <loc> entries for child sitemap files. |
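
As an illustration of the two document shapes, the sketch below classifies an XML body by its root element and collects the <loc> values. The function name `classify_sitemap` is hypothetical and not part of this package; it assumes the document declares the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace published by the Sitemaps protocol for <urlset> and <sitemapindex>.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def classify_sitemap(xml_text: str):
    """Return (kind, locs); kind is 'urlset', 'sitemapindex', or None.

    Hypothetical helper for illustration only.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None, []
    # ElementTree renders namespaced tags as "{namespace}local-name".
    local = root.tag.rsplit("}", 1)[-1]
    child = {"urlset": "url", "sitemapindex": "sitemap"}.get(local)
    if child is None:
        return None, []
    ns = {"sm": SITEMAP_NS}
    # Only namespaced <loc> children are collected in this sketch.
    locs = [
        (el.text or "").strip()
        for el in root.findall(f"sm:{child}/sm:loc", ns)
    ]
    return local, locs
```

A real parser would probably also tolerate documents that omit the namespace; this sketch deliberately does not.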

URL Sitemap Fields

| Field | Status | Meaning |
| --- | --- | --- |
| <loc> | Required | Absolute URL for a page. The URL must include the protocol and must be entity-escaped in XML. |
| <lastmod> | Optional | Last modification date for the page. The protocol allows W3C datetime format. |
| <changefreq> | Optional | Crawl-change hint. It is not a command and crawlers may ignore it. |
| <priority> | Optional | Relative priority from 0.0 to 1.0 within this site only. It does not compare one site against another. |
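
The optional-field rules can be sketched as three small validators. This is a minimal illustration, not the package's implementation: the names `field_issues`, `is_valid_lastmod`, and `is_valid_priority` are hypothetical, and the <lastmod> pattern checks only the shape of a W3C Datetime value, not calendar ranges.

```python
import re

# The seven <changefreq> values defined by the Sitemaps protocol.
CHANGEFREQ_VALUES = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}

# Shape-only check for W3C Datetime: YYYY, YYYY-MM, YYYY-MM-DD, or a full
# timestamp that carries a time zone designator (Z or +hh:mm / -hh:mm).
W3C_DATETIME = re.compile(
    r"^\d{4}(-\d{2}(-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?)?)?$"
)

def is_valid_lastmod(value: str) -> bool:
    return bool(W3C_DATETIME.match(value.strip()))

def is_valid_priority(value: str) -> bool:
    try:
        number = float(value)
    except ValueError:
        return False
    return 0.0 <= number <= 1.0

def field_issues(entry: dict) -> list[str]:
    """Return the names of optional fields that are present but invalid."""
    issues = []
    if "lastmod" in entry and not is_valid_lastmod(entry["lastmod"]):
        issues.append("lastmod")
    if "changefreq" in entry and entry["changefreq"].strip() not in CHANGEFREQ_VALUES:
        issues.append("changefreq")
    if "priority" in entry and not is_valid_priority(entry["priority"]):
        issues.append("priority")
    return issues
```

Absent fields produce no issues, which matches the "when present" wording of the pass criteria.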

Sitemap Index Fields

| Field | Status | Meaning |
| --- | --- | --- |
| <loc> | Required | Absolute URL for a child sitemap, feed, or text sitemap file. |
| <lastmod> | Optional | Last modification date for the child sitemap file. |

Protocol Limits

| Limit | Value | Applies to |
| --- | --- | --- |
| URLs per sitemap | 50,000 | URL sitemap files. |
| Sitemaps per index | 50,000 | Sitemap index files. |
| Uncompressed file size | 50 MB | Sitemap files and sitemap index files. |
| Encoding | UTF-8 | Sitemap files. |
| Compression | gzip allowed | Compressed file must still respect the uncompressed size limit. |

Large sites should split URLs across multiple sitemap files and publish a sitemap index.
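
A limit check along these lines might look like the sketch below. The helper names are hypothetical, and it assumes the whole response body is already in memory; a production check would more likely stream and cap decompression to avoid unbounded memory use.

```python
import gzip

MAX_ENTRIES = 50_000
MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024  # 52,428,800 bytes

def uncompressed_size(body: bytes) -> int:
    """Size of the body after gzip decompression, if it is gzip-compressed."""
    # gzip streams begin with the magic bytes 0x1f 0x8b.
    if body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    return len(body)

def limit_violations(body: bytes, entry_count: int) -> list[str]:
    """Return which protocol limits the sitemap exceeds (hypothetical labels)."""
    violations = []
    if entry_count > MAX_ENTRIES:
        violations.append("entry-count")
    if uncompressed_size(body) > MAX_UNCOMPRESSED_BYTES:
        violations.append("uncompressed-size")
    return violations
```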

URL Scope And Host Rules

Sitemap URLs are scoped. The protocol requires sitemap entries to belong to the same site or appropriate host scope as the sitemap location. In practice:

  • A sitemap at https://example.com/sitemap.xml should list URLs under https://example.com/.
  • A sitemap should not mix unrelated hosts.
  • Separate hosts or subdomains should publish their own sitemaps, or a sitemap index should be placed at a scope that is valid for those child sitemaps.
  • URL values should be canonical, absolute, HTTP(S) URLs.

This check warns when URL sitemap entries point outside the scanned origin host.
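
The host comparison could be sketched as follows. This is an assumption about how "belongs to the scanned origin host" is decided: the hypothetical `out_of_scope_urls` helper compares hostnames exactly and case-insensitively, so www.example.com and example.com count as different hosts.

```python
from urllib.parse import urlsplit

def out_of_scope_urls(origin: str, urls: list[str]) -> list[str]:
    """Return the URLs that are not absolute HTTP(S) URLs on the origin host."""
    origin_host = (urlsplit(origin).hostname or "").lower()
    flagged = []
    for url in urls:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        # Relative URLs have an empty scheme and are flagged as well.
        if parts.scheme not in ("http", "https") or host != origin_host:
            flagged.append(url)
    return flagged
```

For example, scanning https://example.com with a sitemap that lists https://cdn.example.net/b would flag that entry, which corresponds to the warning case described above.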

XML Requirements

  • Sitemap files must be valid XML.
  • XML entities must be escaped, including &, <, >, quotes, and apostrophes where required.
  • URLs must use valid URI or IRI syntax.
  • The document should declare sitemap XML namespaces when using the XML protocol or extensions.
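
To make the escaping requirement concrete, here is a small sketch using Python's standard library. The helper names `loc_element` and `round_trips` are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def loc_element(url: str) -> str:
    """Wrap a URL in <loc>, escaping &, <, >, double quotes, and apostrophes."""
    # escape() handles &, <, and > by default; quotes are added explicitly.
    escaped = escape(url, {'"': "&quot;", "'": "&apos;"})
    return f"<loc>{escaped}</loc>"

def round_trips(url: str) -> bool:
    """True when an XML parser recovers the original URL from the element."""
    return ET.fromstring(loc_element(url)).text == url
```

A URL such as https://example.com/view?item=12&desc='a' is written with &amp; and &apos; inside the element, and a parser restores the original characters when reading it back.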

Discovery Model

The check probes these conventional locations:

/sitemap.xml
/sitemap.txt
/sitemap_index.xml
/sitemap-index.xml

These paths are conventions, not the only valid sitemap locations.

The check also reads Sitemap: directives from /robots.txt when robots.txt is available. Robots discovery is additive; a missing or invalid robots.txt file does not make this check fail if a standard sitemap URL is valid.

Sitemap: directives:

  • Use a full sitemap URL.
  • Are not scoped to a specific User-agent group.
  • May appear multiple times.
  • Can point to sitemap indexes or individual sitemap files.

Submission And Notification

| Mechanism | Status | Notes |
| --- | --- | --- |
| robots.txt Sitemap: directive | Standard discovery | Search engines can discover sitemap URLs while fetching robots.txt. |
| Search engine console submission | Search-engine-specific | Google Search Console and similar tools accept direct sitemap submission. |
| Ping URL submission | Protocol-defined, engine-dependent | The protocol documents ping-style submission, but support varies by search engine. |
| IndexNow | Related separate protocol | URL-change notification protocol; not part of the Sitemap protocol. |

Alternate Sitemap Formats

The protocol allows alternatives to XML sitemap files.

| Format | Status | Notes |
| --- | --- | --- |
| Plain text sitemap | Supported alternate | One URL per line. Same URL count and uncompressed size limits. |
| RSS feed | Supported alternate | Often best for recent URLs, not full canonical inventories. |
| Atom feed | Supported alternate | Often best for recent URLs, not full canonical inventories. |

This v1.0.0 check validates plain text, RSS, and Atom sitemap formats when they are found at a conventional path or advertised through a Sitemap: directive.
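
URL extraction for the alternate formats might be sketched as follows. The helper names are hypothetical; the sketch assumes RSS 2.0 item links appear as <link> text and Atom entry links as <link href="...">, and it does not validate the feeds beyond that.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

def text_sitemap_urls(body: str) -> list[str]:
    """Plain text sitemap: one URL per line, blank lines ignored."""
    return [line.strip() for line in body.splitlines() if line.strip()]

def feed_urls(xml_text: str) -> list[str]:
    """Collect entry links from an RSS 2.0 or Atom feed; [] for other roots."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        return [
            (el.text or "").strip()
            for el in root.findall("channel/item/link")
        ]
    if root.tag == f"{{{ATOM_NS}}}feed":
        links = root.findall(f"{{{ATOM_NS}}}entry/{{{ATOM_NS}}}link")
        return [el.get("href", "") for el in links]
    return []
```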

Extension Namespaces

The Sitemaps protocol supports XML namespace extensions. Extension support is crawler-specific; unsupported extension fields should not invalidate a core sitemap document.

| Extension | Status | Notes |
| --- | --- | --- |
| Image sitemap extension | Google-supported extension | Adds image discovery metadata, commonly with image:image and image:loc. |
| Video sitemap extension | Google-supported extension | Adds video discovery metadata for video indexing. |
| News sitemap extension | Google-supported extension | Adds Google News publication metadata with stricter freshness and publisher requirements. |
| xhtml:link hreflang annotations | Google-supported extension | Declares localized alternate URLs inside a sitemap. |
| Custom namespaces | Protocol-supported mechanism | Valid XML extensions can exist, but crawler support depends on the namespace and search engine. |

This version validates basic extension correctness for image, video, news, and hreflang records. It checks required fields and absolute HTTP(S) URL values, but does not attempt crawler-specific eligibility decisions such as whether a news publisher is approved for Google News.
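
As one example of basic extension correctness, the sketch below checks image records for a present, absolute HTTP(S) image:loc. It covers only the image extension, the function names and issue strings are hypothetical, and it assumes the Google image sitemap namespace shown in the code.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}

def is_absolute_http_url(value: str) -> bool:
    parts = urlsplit(value.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def image_extension_issues(xml_text: str) -> list[tuple[str, str]]:
    """Return (page URL, problem) pairs for incomplete image records."""
    root = ET.fromstring(xml_text)
    issues = []
    for url in root.findall("sm:url", NS):
        page = (url.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        for image in url.findall("image:image", NS):
            loc = image.findtext("image:loc", default=None, namespaces=NS)
            if loc is None or not loc.strip():
                issues.append((page, "missing image:loc"))
            elif not is_absolute_http_url(loc):
                issues.append((page, "image:loc is not an absolute HTTP(S) URL"))
    return issues
```

Issues of this kind correspond to the warning criteria rather than failures, since the core sitemap document remains valid.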

Search Engine Interpretation Notes

| Field or behavior | Protocol status | Search-engine notes |
| --- | --- | --- |
| <lastmod> | Optional standard field | Google may use it when it is accurate and consistently verifiable. |
| <changefreq> | Optional standard hint | Google generally ignores it. |
| <priority> | Optional standard hint | Google generally ignores it. |
| Image extension deprecated tags | Extension-specific | Google no longer documents support for older image tags such as caption, title, geographic location, and license fields. |
| Multiple extensions in one sitemap | Extension mechanism | Search engines may support combining multiple namespaces in one XML sitemap. |

Related But Not Sitemap Protocol

| Feature | Status | Notes |
| --- | --- | --- |
| HTML sitemap | Separate UX/navigation page | Useful for users, not the XML sitemap protocol. |
| IndexNow | Separate URL notification protocol | Complements sitemaps but does not replace the sitemap document. |
| llms.txt | Separate AI/content guidance file | Not a sitemap extension. |
| agents.json and agent manifests | Separate agent discovery surfaces | Not sitemap extensions. |
| API catalogs and OpenAPI | Separate machine interface discovery | Not sitemap extensions. |

Scoring Steps

| Step | Weight | Purpose |
| --- | --- | --- |
| discover | 0.2 | Build candidate sitemap URLs from standard locations and robots.txt directives. |
| fetch | 0.25 | Fetch at least one candidate with a successful HTTP response. |
| parse | 0.25 | Confirm valid XML, plain text, RSS, or Atom sitemap structure. |
| field-quality | 0.1 | Validate optional field values, extension records, and size/entry-count limits. |
| scope | 0.2 | Confirm listed page URLs belong to the scanned origin host. |
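
The step weights sum to 1.0. This document does not state how step results are combined, so the sketch below is only one plausible reading: it assumes each step yields a score between 0 and 1 and that the check score is their weighted sum. The name `weighted_score` is hypothetical.

```python
# Weights copied from the Scoring Steps table.
STEP_WEIGHTS = {
    "discover": 0.2,
    "fetch": 0.25,
    "parse": 0.25,
    "field-quality": 0.1,
    "scope": 0.2,
}

def weighted_score(step_results: dict[str, float]) -> float:
    """Weighted sum of per-step scores in [0, 1]; missing steps count as 0.

    Assumption: the aggregation rule is not specified in this document.
    """
    total = sum(
        weight * step_results.get(step, 0.0)
        for step, weight in STEP_WEIGHTS.items()
    )
    return round(total, 4)
```

Under this assumption, a sitemap that passes discovery, fetch, and parse but fails field-quality and half-passes scope would score 0.8.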

Current v1.0.0 Coverage

This version checks:

  • Conventional XML sitemap paths.
  • Conventional plain text sitemap path.
  • robots.txt Sitemap: directives.
  • Fetch success.
  • XML <urlset> and <sitemapindex> structure.
  • Plain text, RSS, and Atom sitemap formats.
  • Absolute HTTP(S) URL values.
  • Same-origin host scope for URL sitemap entries.
  • Same-origin host scope for child sitemap index entries.
  • Optional <lastmod>, <changefreq>, and <priority> value quality.
  • 50,000 URL and 50,000 child sitemap entry-count limits.
  • 50 MB uncompressed sitemap size limits, including decompressed response bodies as exposed by fetch.
  • Basic image, video, news, and hreflang extension correctness.

External Signals Not Emitted By This Check

This package validates the public sitemap file. It does not emit search engine console submission state, ping submission state, indexing status, or crawler-specific eligibility decisions because those require external account data or live search-engine submission telemetry rather than the sitemap document itself.

References

Source: lib/checks/sitemap/versions/1.0.0/docs.md

6. Version Changelog

sitemap v1.0.0 Changelog

Initial versioned package for sitemap.

  • Declares discovery, fetch, parse, and scope scoring steps.
  • Documents XML sitemap and sitemap index semantics.
  • Documents robots.txt Sitemap: discovery as an isolated additive discovery mechanism.
  • Adds optional field-quality validation for <lastmod>, <changefreq>, and <priority>.
  • Adds entry-count limit checks for URL sitemaps and sitemap indexes.
  • Adds same-origin scope evidence for child sitemap URLs in sitemap indexes.
  • Adds /sitemap.txt discovery and validates plain text, RSS, and Atom sitemap formats.
  • Adds uncompressed 50 MB sitemap size validation.
  • Adds basic image, video, news, and hreflang extension-quality validation.

Source: lib/checks/sitemap/versions/1.0.0/changelog.md