1. Abstract
Publish a standards-aligned sitemap or sitemap index at a discoverable URL.
Sitemaps help crawlers and agents discover canonical URLs, update timestamps, and deeper content that may not be obvious from homepage navigation alone.
2. Classification
- Check ID: sitemap
- Check version: 1.0.0
- Package path: lib/checks/sitemap/versions/1.0.0
- Category: AI Discoverability
- Subcategory: Discoverability
- Check group: Crawl Discovery
- Check group ID: crawl-discovery
- Maturity: Established
- Scope: site
- Check weight: 1
3. Input And Output Contracts
- Input: [email protected]
- Output: [email protected]
- Resources inspected: /sitemap.xml, /sitemap.txt, /sitemap_index.xml, /sitemap-index.xml, /robots.txt
4. Scoring Semantics
| Step ID | Title | Weight | Description |
|---|---|---|---|
| `discover` | Discover sitemap candidates | 0.2 | Build candidate sitemap URLs from standard locations and independently fetched robots.txt Sitemap directives. |
| `fetch` | Fetch a sitemap candidate | 0.25 | Fetch at least one candidate with a successful HTTP response. |
| `parse` | Parse sitemap | 0.25 | Confirm the response is a sitemap document with absolute HTTP(S) URL entries. |
| `field-quality` | Validate sitemap field quality | 0.1 | Validate optional fields, extension metadata, and protocol size limits when present. |
| `scope` | Validate URL scope | 0.2 | Confirm listed page URLs belong to the scanned origin host. |
5. Package Documentation
Sitemap Check v1.0.0
Validates that a site exposes a fetchable sitemap discovery surface using the Sitemaps protocol.
There is no IETF RFC for XML Sitemaps. The normative protocol reference is the Sitemaps XML protocol published at sitemaps.org. That protocol references URI and IRI syntax standards, but the sitemap protocol itself is not an RFC.
The check is isolated. It does not consume the result of the robots.txt check. It independently discovers sitemap candidates from standard sitemap paths and, when available, Sitemap: directives in /robots.txt.
Input Contract
Requires the scan origin. The check fetches standard sitemap candidates and may fetch ${origin}/robots.txt only to discover additional Sitemap: URLs.
Output Contract
Emits a stepped report check result with discovery, fetch, parse, field-quality, and URL-scope evidence.
Pass Criteria
- At least one sitemap candidate is discoverable.
- At least one candidate responds with a successful HTTP status.
- The response parses as a supported sitemap format: XML `<urlset>`, XML `<sitemapindex>`, plain text, RSS, or Atom.
- The document contains at least one absolute HTTP(S) URL.
- For URL sitemaps, page URLs belong to the scanned origin host.
- Optional fields, when present, use valid values: `<lastmod>` is a parseable W3C/ISO-style date, `<changefreq>` is one of the protocol values, and `<priority>` is between `0.0` and `1.0`.
- URL sitemap and sitemap index entry counts do not exceed the protocol's 50,000-entry limit.
- Sitemap content stays within the 50 MB uncompressed size limit.
- Supported extension records contain required fields and absolute HTTP(S) URL
values.
Warning Criteria
- A valid URL sitemap is found, but one or more page URLs point to a different
host than the scanned origin.
- A sitemap is valid, but optional fields such as `<lastmod>`, `<changefreq>`, or `<priority>` contain invalid values.
- A sitemap is valid, but image, video, news, or hreflang extension records are
incomplete or contain relative/non-HTTP(S) URL values.
Failure Criteria
- No sitemap candidates can be discovered.
- No candidate returns a successful HTTP response.
- Fetched content is not a supported sitemap format.
- URL values are missing or are not absolute HTTP(S) URLs.
- The sitemap exceeds the protocol's 50,000 URL or 50,000 child sitemap limit.
- The sitemap exceeds the 50 MB uncompressed size limit.
Core Protocol Documents
The Sitemaps protocol defines two XML document shapes.
| Document | Root element | Child element | Required data |
|---|---|---|---|
| URL sitemap | <urlset> | <url> | One or more <loc> entries for page URLs. |
| Sitemap index | <sitemapindex> | <sitemap> | One or more <loc> entries for child sitemap files. |
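As an illustration of the two document shapes, the following minimal sketch tells them apart by root element and collects their `<loc>` values. It assumes Python's standard `xml.etree.ElementTree`; the `read_sitemap` helper is hypothetical and is not the package's parser.

```python
import xml.etree.ElementTree as ET

# Namespace declared by the Sitemaps 0.9 schema.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def read_sitemap(xml_text):
    """Return (kind, locs) for a URL sitemap or a sitemap index (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    if root.tag == f"{{{SITEMAP_NS}}}urlset":
        kind, child = "urlset", "url"
    elif root.tag == f"{{{SITEMAP_NS}}}sitemapindex":
        kind, child = "sitemapindex", "sitemap"
    else:
        raise ValueError(f"unsupported root element: {root.tag}")
    # Each <url> or <sitemap> child carries its address in a <loc> element.
    path = f"{{{SITEMAP_NS}}}{child}/{{{SITEMAP_NS}}}loc"
    locs = [(loc.text or "").strip() for loc in root.findall(path)]
    return kind, locs
```

RSS and Atom feeds use different root elements, so a fuller parser would need extra branches for them; this sketch only covers the two core XML shapes.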
URL Sitemap Fields
| Field | Status | Meaning |
|---|---|---|
| `<loc>` | Required | Absolute URL for a page. The URL must include the protocol and must be entity-escaped in XML. |
| `<lastmod>` | Optional | Last modification date for the page. The protocol allows W3C datetime format. |
| `<changefreq>` | Optional | Crawl-change hint. It is not a command and crawlers may ignore it. |
| `<priority>` | Optional | Relative priority from 0.0 to 1.0 within this site only. It does not compare one site against another. |
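The optional-field rules in the table above can be sketched as three small validators. This is an illustrative sketch only: the function names are hypothetical, and the regular expression checks the shape of a W3C datetime rather than calendar validity (it would accept month 13, for example).

```python
import re

# Allowed <changefreq> values defined by the Sitemaps protocol.
CHANGEFREQ_VALUES = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}

# W3C datetime shapes: YYYY, YYYY-MM, YYYY-MM-DD, or a timestamp with a timezone designator.
W3C_DATETIME = re.compile(
    r"^\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?)?)?$"
)

def valid_lastmod(value):
    """True when the value has a W3C datetime shape (shape check only)."""
    return bool(W3C_DATETIME.match(value.strip()))

def valid_changefreq(value):
    """True when the value is one of the protocol's change-frequency keywords."""
    return value.strip() in CHANGEFREQ_VALUES

def valid_priority(value):
    """True when the value parses as a number between 0.0 and 1.0 inclusive."""
    try:
        return 0.0 <= float(value) <= 1.0
    except ValueError:
        return False
```

Invalid values in these fields would map to the warning criteria rather than a failure, since the fields are optional.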
Sitemap Index Fields
| Field | Status | Meaning |
|---|---|---|
| `<loc>` | Required | Absolute URL for a child sitemap, feed, or text sitemap file. |
| `<lastmod>` | Optional | Last modification date for the child sitemap file. |
Protocol Limits
| Limit | Value | Applies to |
|---|---|---|
| URLs per sitemap | 50,000 | URL sitemap files. |
| Sitemaps per index | 50,000 | Sitemap index files. |
| Uncompressed file size | 50 MB | Sitemap files and sitemap index files. |
| Encoding | UTF-8 | Sitemap files. |
| Compression | gzip allowed | Compressed file must still respect the uncompressed size limit. |
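The size limit applies to the uncompressed bytes, so a gzip-compressed response has to be measured after decompression. A minimal sketch, assuming the raw response body is available as `bytes`; the helper names and the `limit` parameter are hypothetical:

```python
import gzip

# 50 MB as stated by the Sitemaps protocol (52,428,800 bytes).
MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024

def uncompressed_size(body):
    """Return the body length in bytes, decompressing gzip content first."""
    if body[:2] == b"\x1f\x8b":  # gzip magic number
        body = gzip.decompress(body)
    return len(body)

def within_size_limit(body, limit=MAX_UNCOMPRESSED_BYTES):
    """True when the uncompressed body fits within the limit."""
    return uncompressed_size(body) <= limit
```

This sketch decompresses the whole body in memory; a production check would more likely stream the decompression and stop once the limit is exceeded.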
Large sites should split URLs across multiple sitemap files and publish a sitemap index.
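A minimal sketch of that split, assuming the full URL list is already in memory. The `chunk_urls` and `build_index` helpers and the child sitemap URLs passed to them are hypothetical:

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # protocol limit per URL sitemap file

def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a URL list into consecutive chunks that each fit in one sitemap file."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def build_index(child_sitemap_urls):
    """Render a sitemap index document that lists the given child sitemap URLs."""
    entries = "".join(
        f"  <sitemap><loc>{escape(url)}</loc></sitemap>\n" for url in child_sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}</sitemapindex>\n"
    )
```

With 120,000 URLs, `chunk_urls` yields three chunks (50,000, 50,000, and 20,000), and the index would list one child sitemap per chunk.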
URL Scope And Host Rules
Sitemap URLs are scoped. The protocol requires sitemap entries to belong to the same site or appropriate host scope as the sitemap location. In practice:
- A sitemap at `https://example.com/sitemap.xml` should list URLs under `https://example.com/`.
- A sitemap should not mix unrelated hosts.
- Separate hosts or subdomains should publish their own sitemaps, or a sitemap
index should be placed at a scope that is valid for those child sitemaps.
- URL values should be canonical, absolute, HTTP(S) URLs.
This check warns when URL sitemap entries point outside the scanned origin host.
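A host-scope comparison of this kind might look like the sketch below. It assumes an exact, case-insensitive hostname match; whether the package treats variants such as `www.example.com` and `example.com` as the same host is not stated here, and the `out_of_scope` helper is hypothetical.

```python
from urllib.parse import urlsplit

def out_of_scope(origin, urls):
    """Return the URLs whose host differs from the scanned origin's host."""
    # urlsplit().hostname is lowercased, so the comparison ignores case.
    origin_host = urlsplit(origin).hostname
    return [url for url in urls if urlsplit(url).hostname != origin_host]
```

A non-empty result would correspond to the warning described above rather than a hard failure.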
XML Requirements
- Sitemap files must be valid XML.
- XML entities must be escaped, including `&`, `<`, `>`, quotes, and apostrophes where required.
- URLs must use valid URI or IRI syntax.
- The document should declare sitemap XML namespaces when using the XML
protocol or extensions.
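For the escaping requirement, a short sketch using the standard library's `xml.sax.saxutils.escape`; the `loc_element` helper is hypothetical:

```python
from xml.sax.saxutils import escape

def loc_element(url):
    """Wrap a URL in a <loc> element with XML entities escaped."""
    # escape() handles &, <, and > by default; the extra mapping covers quotes.
    return "<loc>" + escape(url, {'"': "&quot;", "'": "&apos;"}) + "</loc>"
```

A query string such as `?a=1&b=2` is the most common case: the ampersand has to be written as `&amp;` inside the XML document.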
Discovery Model
The check probes these conventional locations:
- `/sitemap.xml`
- `/sitemap.txt`
- `/sitemap_index.xml`
- `/sitemap-index.xml`
These paths are conventions, not the only valid sitemap locations.
The check also reads Sitemap: directives from /robots.txt when robots.txt is available. Robots discovery is additive; a missing or invalid robots.txt file does not make this check fail if a standard sitemap URL is valid.
Sitemap: directives:
- Use a full sitemap URL.
- Are not scoped to a specific `User-agent` group.
- May appear multiple times.
- Can point to sitemap indexes or individual sitemap files.
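The discovery model can be sketched as follows: conventional paths come first, and any `Sitemap:` directives found in robots.txt are appended. This is an illustrative sketch, not the package's implementation; the helper names are hypothetical, and treating `#` as a comment marker follows common robots.txt parsing practice.

```python
from urllib.parse import urljoin

# Conventional locations probed by the check.
CONVENTIONAL_PATHS = ["/sitemap.xml", "/sitemap.txt", "/sitemap_index.xml", "/sitemap-index.xml"]

def sitemap_directives(robots_txt):
    """Collect Sitemap: URLs from robots.txt, ignoring User-agent grouping."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found

def candidates(origin, robots_txt=None):
    """Build the candidate list; robots.txt discovery is additive and optional."""
    urls = [urljoin(origin, path) for path in CONVENTIONAL_PATHS]
    for url in sitemap_directives(robots_txt or ""):
        if url not in urls:
            urls.append(url)
    return urls
```

When robots.txt is missing, `candidates(origin)` still returns the four conventional URLs, which matches the additive behavior described above.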
Submission And Notification
| Mechanism | Status | Notes |
|---|---|---|
| `robots.txt` `Sitemap:` directive | Standard discovery | Search engines can discover sitemap URLs while fetching robots.txt. |
| Search engine console submission | Search-engine-specific | Google Search Console and similar tools accept direct sitemap submission. |
| Ping URL submission | Protocol-defined, engine-dependent | The protocol documents ping-style submission, but support varies by search engine. |
| IndexNow | Related separate protocol | URL-change notification protocol; not part of the Sitemap protocol. |
Alternate Sitemap Formats
The protocol allows alternatives to XML sitemap files.
| Format | Status | Notes |
|---|---|---|
| Plain text sitemap | Supported alternate | One URL per line. Same URL count and uncompressed size limits. |
| RSS feed | Supported alternate | Often best for recent URLs, not full canonical inventories. |
| Atom feed | Supported alternate | Often best for recent URLs, not full canonical inventories. |
This v1.0.0 check validates plain text, RSS, and Atom sitemap formats when they are found at a conventional path or advertised through a Sitemap: directive.
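Plain text is the simplest of the alternate formats to validate. A minimal sketch, assuming one URL per line and blank lines ignored; the `parse_text_sitemap` helper is hypothetical:

```python
from urllib.parse import urlsplit

def parse_text_sitemap(body):
    """Split a plain text sitemap into absolute HTTP(S) URLs and rejected lines."""
    urls, invalid = [], []
    for raw in body.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = urlsplit(line)
        if parts.scheme in ("http", "https") and parts.netloc:
            urls.append(line)
        else:
            invalid.append(line)
    return urls, invalid
```

The same URL-count and size limits would still apply to the parsed result.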
Extension Namespaces
The Sitemaps protocol supports XML namespace extensions. Extension support is crawler-specific; unsupported extension fields should not invalidate a core sitemap document.
| Extension | Status | Notes |
|---|---|---|
| Image sitemap extension | Google-supported extension | Adds image discovery metadata, commonly with image:image and image:loc. |
| Video sitemap extension | Google-supported extension | Adds video discovery metadata for video indexing. |
| News sitemap extension | Google-supported extension | Adds Google News publication metadata with stricter freshness and publisher requirements. |
| `xhtml:link` hreflang annotations | Google-supported extension | Declares localized alternate URLs inside a sitemap. |
| Custom namespaces | Protocol-supported mechanism | Valid XML extensions can exist, but crawler support depends on the namespace and search engine. |
This version validates basic extension correctness for image, video, news, and hreflang records. It checks required fields and absolute HTTP(S) URL values, but does not attempt crawler-specific eligibility decisions such as whether a news publisher is approved for Google News.
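As one example of basic extension correctness, the sketch below checks that every `image:image` record carries an absolute HTTP(S) `image:loc`. It assumes Google's image sitemap namespace URI and covers only the image extension; the `image_issues` helper and its message wording are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Prefix map for ElementTree lookups; the image URI is Google's image extension namespace.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}

def image_issues(xml_text):
    """Return a list of problems found in image extension records."""
    issues = []
    root = ET.fromstring(xml_text)
    for image in root.findall("sm:url/image:image", NS):
        loc = (image.findtext("image:loc", default="", namespaces=NS) or "").strip()
        parts = urlsplit(loc)
        if not loc:
            issues.append("image:image is missing image:loc")
        elif parts.scheme not in ("http", "https") or not parts.netloc:
            issues.append(f"image:loc is not an absolute HTTP(S) URL: {loc}")
    return issues
```

An empty list means the image records pass this basic check; any entries would correspond to the warning criteria for incomplete extension records.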
Search Engine Interpretation Notes
| Field or behavior | Protocol status | Search-engine notes |
|---|---|---|
| `<lastmod>` | Optional standard field | Google may use it when it is accurate and consistently verifiable. |
| `<changefreq>` | Optional standard hint | Google generally ignores it. |
| `<priority>` | Optional standard hint | Google generally ignores it. |
| Image extension deprecated tags | Extension-specific | Google no longer documents support for older image tags such as caption, title, geographic location, and license fields. |
| Multiple extensions in one sitemap | Extension mechanism | Search engines may support combining multiple namespaces in one XML sitemap. |
Related But Not Sitemap Protocol
| Feature | Status | Notes |
|---|---|---|
| HTML sitemap | Separate UX/navigation page | Useful for users, not the XML sitemap protocol. |
| IndexNow | Separate URL notification protocol | Complements sitemaps but does not replace the sitemap document. |
| `llms.txt` | Separate AI/content guidance file | Not a sitemap extension. |
| `agents.json` and agent manifests | Separate agent discovery surfaces | Not sitemap extensions. |
| API catalogs and OpenAPI | Separate machine interface discovery | Not sitemap extensions. |
Scoring Steps
| Step | Weight | Purpose |
|---|---|---|
| `discover` | 0.2 | Build candidate sitemap URLs from standard locations and robots.txt directives. |
| `fetch` | 0.25 | Fetch at least one candidate with a successful HTTP response. |
| `parse` | 0.25 | Confirm valid XML, plain text, RSS, or Atom sitemap structure. |
| `field-quality` | 0.1 | Validate optional field values, extension records, and size/entry-count limits. |
| `scope` | 0.2 | Confirm listed page URLs belong to the scanned origin host. |
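The step weights sum to 1.0. As a rough illustration of how they might combine, the sketch below adds up the weights of the steps that passed. The all-or-nothing credit per step and the `score` helper are assumptions; the package's actual scoring logic is not documented here.

```python
# Step weights copied from the table above.
STEP_WEIGHTS = {
    "discover": 0.2,
    "fetch": 0.25,
    "parse": 0.25,
    "field-quality": 0.1,
    "scope": 0.2,
}

def score(passed_steps):
    """Return the summed weight of passed steps, rounded to two decimals."""
    total = sum(weight for step, weight in STEP_WEIGHTS.items() if step in passed_steps)
    return round(total, 2)
```

Under this assumption, a site whose sitemap is discovered, fetched, and parsed but fails both field-quality and scope would score 0.7.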
Current v1.0.0 Coverage
This version checks:
- Conventional XML sitemap paths.
- Conventional plain text sitemap path.
- `robots.txt` `Sitemap:` directives.
- Fetch success.
- XML `<urlset>` and `<sitemapindex>` structure.
- Plain text, RSS, and Atom sitemap formats.
- Absolute HTTP(S) URL values.
- Same-origin host scope for URL sitemap entries.
- Same-origin host scope for child sitemap index entries.
- Optional `<lastmod>`, `<changefreq>`, and `<priority>` value quality.
- 50,000 URL and 50,000 child sitemap entry-count limits.
- 50 MB uncompressed sitemap size limits, including decompressed response bodies
as exposed by fetch.
- Basic image, video, news, and hreflang extension correctness.
External Signals Not Emitted By This Check
This package validates the public sitemap file. It does not emit search engine console submission state, ping submission state, indexing status, or crawler-specific eligibility decisions because those require external account data or live search-engine submission telemetry rather than the sitemap document itself.
References
- www.sitemaps.org/protocol.html
- www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd
- developers.google.com/search/docs/crawling-indexing/sitemaps/overview
- developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap
- developers.google.com/search/docs/crawling-indexing/sitemaps/image-sitemaps
- developers.google.com/search/docs/crawling-indexing/sitemaps/video-sitemaps
- developers.google.com/search/docs/crawling-indexing/sitemaps/news-sitemap
- developers.google.com/search/docs/specialty/international/localized-versions
- www.indexnow.org/documentation
Source: lib/checks/sitemap/versions/1.0.0/docs.md
6. Version Changelog
sitemap v1.0.0 Changelog
Initial versioned package for sitemap.
- Declares discovery, fetch, parse, and scope scoring steps.
- Documents XML sitemap and sitemap index semantics.
- Documents robots.txt `Sitemap:` discovery as an isolated additive discovery mechanism.
- Adds optional field-quality validation for `<lastmod>`, `<changefreq>`, and `<priority>`.
- Adds entry-count limit checks for URL sitemaps and sitemap indexes.
- Adds same-origin scope evidence for child sitemap URLs in sitemap indexes.
- Adds `/sitemap.txt` discovery and validates plain text, RSS, and Atom sitemap formats.
- Adds uncompressed 50 MB sitemap size validation.
- Adds basic image, video, news, and hreflang extension-quality validation.
Source: lib/checks/sitemap/versions/1.0.0/changelog.md