1. Abstract

Make page content easy for AI answer engines to extract, cite, and attribute without promising inclusion in any proprietary answer surface.

Generative answer systems work best with visible, self-contained, evidence-backed content, clear entities, trustworthy attribution, structured page sections, and crawler-accessible HTML. These signals improve machine understanding even though they do not guarantee AI citations.

2. Classification

Check ID: geo-readiness
Check version: 1.0.0
Package path: lib/checks/geo-readiness/versions/1.0.0
Category: GEO, AIO and AEO
Subcategory: GEO, AIO & AEO
Check group: AI Answer Readiness
Check group ID: ai-answer-readiness
Maturity: Emerging recommendation
Scope: page
Check weight: 1

3. Input And Output Contracts

Input: [email protected]
Output: [email protected]
Resources inspected: Initial HTML, Citation-ready passages, Entity clarity, Structured extraction signals, Trust signals, AI crawler access

4. Scoring Semantics

Step ID	Title	Weight	Description
`extractable-html`	Extractable HTML	`0.18`	Verify meaningful visible text exists in the initial HTML without depending on client-side rendering.
`entity-clarity`	Entity clarity	`0.18`	Align title, h1, description, and primary visible copy around the page's main entity or topic.
`citable-passages`	Citable passages	`0.24`	Identify self-contained explanatory passages with context and evidence signals.
`structured-extraction`	Structured extraction	`0.18`	Check headings, lists, tables, definitions, summaries, and structured data that help answer extraction.
`source-and-trust`	Source and trust signals	`0.12`	Check authorship, publisher, freshness dates, source links, and accountability signals.
`ai-retrieval-access`	AI retrieval access	`0.10`	Evaluate robots.txt and snippet controls for relevant AI retrieval and user-triggered crawlers.

5. Package Documentation

GEO Readiness Check v1.0.0

Status

Version: 1.0.0
Check identifier: geo-readiness
Input contract: [email protected]
Output contract: [email protected]
Scope: page

Abstract

The GEO Readiness check evaluates whether a public HTML page is easy for AI answer engines and retrieval systems to extract, understand, cite, and attribute. It checks initial HTML extractability, entity clarity, self-contained citable passages, structured extraction patterns, source and trust signals, and access controls for relevant AI retrieval crawlers.

This check does not promise inclusion in ChatGPT, Perplexity, Google AI Overviews, AI Mode, or any other proprietary answer surface.

Motivation

Generative answer systems can only use what they can fetch, parse, understand, and attribute. Pages with visible explanatory content, clear entities, evidence-backed answer passages, structured sections, and accountable sources are easier to quote or summarize than pages that rely on vague copy, hidden JavaScript-rendered content, generic metadata, or blocked crawlers.

GEO readiness complements traditional SEO, AIO readiness, AEO readiness, structured data checks, robots.txt, llms.txt, and policy/licensing checks. It is a page-level extraction and citability check, not a query-level visibility monitor.

Normative Model

The check models GEO readiness as six independent validation steps:

Step	Weight	Purpose
`extractable-html`	0.18	Verify meaningful visible text exists in the initial HTML without depending on client-side rendering.
`entity-clarity`	0.18	Align title, h1, description, and primary visible copy around the page's main entity or topic.
`citable-passages`	0.24	Identify self-contained explanatory passages with context and evidence signals.
`structured-extraction`	0.18	Check headings, lists, tables, definitions, summaries, and structured data that help answer extraction.
`source-and-trust`	0.12	Check authorship, publisher, freshness dates, source links, and accountability signals.
`ai-retrieval-access`	0.10	Evaluate robots.txt and snippet controls for relevant AI retrieval and user-triggered crawlers.

Step weights are normalized decimals and sum to 1.0.

The check uses the page's initial HTML as primary evidence. It may use homepage response headers for X-Robots-Tag and fetch /robots.txt from the scanned origin for path-specific AI crawler access evaluation.
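As an illustration of what treating initial HTML as primary evidence can look like, the following minimal sketch extracts visible text with Python's standard `html.parser` while skipping script, style, SVG, and noscript content. The names `VisibleTextParser` and `visible_text` are hypothetical and are not taken from the package; the real check may tokenize and count differently.

```python
from html.parser import HTMLParser

# Tags whose content is ignored when collecting visible text.
SKIPPED_TAGS = {"script", "style", "svg", "noscript"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes that sit outside the skipped tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def visible_word_count(html):
    """Count whitespace-separated words in the visible text."""
    return len(visible_text(html).split())
```

A page whose initial HTML is only an empty application root plus scripts yields a word count near zero under this sketch, which is the situation the extractability step is meant to surface.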

Applicability

The check applies to public HTML pages that are expected to be understood, summarized, compared, cited, or attributed by AI answer engines.

Applicable pages include:

Homepages and landing pages.
Product, service, and feature pages.
Blog posts, guides, docs, reports, and FAQs.
Public comparison, pricing, help, and knowledge-base pages.

The check is lower-signal for login-only pages, thin utility routes, binary assets, API JSON, redirects, non-HTML resources, and pages intentionally blocked from public crawling.

Pass Criteria

The check passes when the weighted score is at least 90/100.

A passing page normally has:

Meaningful visible text in the initial HTML.
Clear title, single h1, description, and primary-topic alignment.
At least one self-contained explanatory passage or answer block that can be quoted with context.
Visible evidence signals for factual claims, such as source links, dates, statistics, named research, or reports.
Structured headings and sections that expose definitions, comparisons, grouped facts, or steps.
Relevant typed JSON-LD or equivalent structured entity evidence.
Authorship, publisher, freshness, source, contact/about, policy, or sameAs/entity trust signals where appropriate.
No path-specific robots.txt rule blocking relevant AI retrieval or user-triggered crawlers from the scanned URL.
No snippet controls that remove important text from extraction.

Warning Criteria

The check warns when the weighted score is from 50 through 89.

Warnings include:

Initial HTML has visible content but is thin, low-density, or likely dependent on client-side rendering.
Paragraphs are too short, too long, vague, or missing visible evidence signals.
Title, h1, description, and opening copy are weakly aligned.
JSON-LD is present but does not expose useful typed entity evidence.
Heading hierarchy, summaries, lists, tables, definitions, or FAQ-style sections are incomplete.
Source, author, publisher, date, sameAs, about/contact, policy, or citation signals are missing.
robots.txt could not be fetched, making AI crawler access unconfirmed.
nosnippet, max-snippet:0, X-Robots-Tag, or data-nosnippet constrains extractable text.
AI policy/training crawlers are blocked. These are reported separately from retrieval crawler blockers.
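The snippet controls named in the warning list above (nosnippet, max-snippet:0, X-Robots-Tag, and data-nosnippet) could be detected roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the inputs are assumed to be already-extracted strings, and user-agent-scoped X-Robots-Tag values such as `googlebot: nosnippet` are not handled.

```python
import re


def parse_robots_directives(value):
    """Split a comma-separated robots directive string into {name: argument or None}."""
    directives = {}
    for part in value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition(":")
        directives[name.strip()] = arg.strip() or None
    return directives


def snippet_constraints(meta_robots="", x_robots_tag="", html=""):
    """Return a list of human-readable findings for controls that limit extractable text."""
    findings = []
    sources = (("meta robots", meta_robots), ("X-Robots-Tag", x_robots_tag))
    for source, value in sources:
        directives = parse_robots_directives(value)
        if "nosnippet" in directives:
            findings.append(f"{source}: nosnippet")
        if directives.get("max-snippet") == "0":
            findings.append(f"{source}: max-snippet:0")
    # data-nosnippet is an HTML attribute, so look for it inside any opening tag.
    if re.search(r"<[^>]+\sdata-nosnippet\b", html, re.IGNORECASE):
        findings.append("html: data-nosnippet")
    return findings
```

An empty result means none of these controls were found in the supplied inputs; it does not by itself confirm that the text is extractable.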

Failure Criteria

The check fails when the weighted score is below 50.

Failures include:

The page has almost no visible, extractable explanatory text in the initial HTML.
The page entity cannot be determined from title, h1, description, or opening content.
No meaningful citation-length or evidence-backed explanatory passages are present.
Relevant AI retrieval or user-triggered crawlers are blocked for the scanned URL path.
Snippet controls remove text the page appears to need for extraction.
The page exposes materially misleading machine-readable content compared with visible content.

Evidence Model

The result emits:

Final URL, raw HTML byte count, and visible word count.
Step-level score and weight.
Title, h1 count, h1 snippets, description, topic overlap, and missing entity terms.
Candidate and citable passages with word counts, snippets, headings, and improvement reasons.
Heading outline, heading issues, summaries, table/list counts, FAQ pattern evidence, definition counts, JSON-LD count, and JSON-LD @type values.
Source and trust evidence including author, publisher, dates, sameAs values, source links, about/contact links, and policy links.
/robots.txt fetch status, excerpt, crawler-specific access decisions, matched user-agent groups, matched rules, and scanned URL path.
Meta robots, X-Robots-Tag, and data-nosnippet extraction controls.

Evidence must not include full HTML, credentials, cookies, authorization headers, or private data.

Validation And Scoring Steps

Parse the initial HTML and extract visible text while ignoring script, style, SVG, and noscript content.
Score initial HTML extractability using visible word count, text density, HTML size, and client-rendering root signals.
Score entity clarity using title, h1 count, description, topic overlap, and title/h1 terms missing from the description.
Inspect paragraph passages. The definition of a useful citation passage is a heuristic, not a standard: roughly 80-220 words, self-contained, and supported by visible evidence signals.
Score structured extraction using heading structure, summaries, tables with headers, lists, FAQ patterns, definition sentences, and typed JSON-LD.
Score source and trust signals using authorship, publisher/organization, freshness dates, source links, about/contact links, policy links, and sameAs/entity references.
Fetch /robots.txt from the origin and evaluate the scanned URL path against relevant AI retrieval, user-triggered, and policy crawlers.
Inspect snippet controls from meta robots, X-Robots-Tag, and data-nosnippet.
Compute the weighted 0-100 score using the published step weights. Pass is >=90, warning is 50-89, and fail is <50.
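The final scoring step can be sketched as below, using the published step weights and thresholds. The function names are hypothetical, each step score is assumed to be on a 0-100 scale, and rounding to the nearest integer is an assumption made here, since the documentation does not state how fractional scores are handled.

```python
# Published step weights; they are expected to sum to 1.0.
STEP_WEIGHTS = {
    "extractable-html": 0.18,
    "entity-clarity": 0.18,
    "citable-passages": 0.24,
    "structured-extraction": 0.18,
    "source-and-trust": 0.12,
    "ai-retrieval-access": 0.10,
}


def weighted_score(step_scores):
    """Combine per-step scores (0-100) into one 0-100 score.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    if abs(sum(STEP_WEIGHTS.values()) - 1.0) > 1e-9:
        raise ValueError("step weights must sum to 1.0")
    total = sum(weight * step_scores[step] for step, weight in STEP_WEIGHTS.items())
    return round(total)


def status(score):
    """Map a 0-100 score to pass (>=90), warning (50-89), or fail (<50)."""
    if score >= 90:
        return "pass"
    if score >= 50:
        return "warning"
    return "fail"
```

For example, step scores of 100, 80, 70, 90, 50, and 100 (in the table's order) give 18 + 14.4 + 16.8 + 16.2 + 6 + 10 = 81.4, which this sketch rounds to 81 and reports as a warning.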

Standard Behavior

Robots evaluation is modeled on RFC 9309:

User-agent groups are parsed from /robots.txt.
Relevant groups are matched case-insensitively by crawler product token.
Matching groups with the most specific user-agent token are evaluated.
Path rules are evaluated against the scanned URL path and query.
Allow and Disallow path patterns support * and $.
The most specific (longest) matching rule wins; when an Allow rule and a Disallow rule are equally specific, Allow wins.
Empty or absent matching rules allow crawling for the evaluated path.
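The rules listed above can be illustrated with a compact matcher. This is a simplified sketch, not the package's implementation: the function names are hypothetical, user-agent groups are matched by exact (case-insensitive) product token with a fallback to the `*` group, and percent-encoding normalization is omitted.

```python
import re


def parse_groups(robots_txt):
    """Return a list of (user_agents, rules) groups; each rule is (kind, pattern)."""
    groups, agents, rules, seen_rule = [], [], [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if seen_rule:  # a rule line closed the previous group
                groups.append((agents, rules))
                agents, rules, seen_rule = [], [], False
            agents.append(value.lower())
        elif key in ("allow", "disallow") and agents:
            seen_rule = True
            if value:  # an empty pattern matches nothing
                rules.append((key, value))
    if agents:
        groups.append((agents, rules))
    return groups


def pattern_to_regex(pattern):
    """Translate a path pattern: * matches any run of characters, trailing $ anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(robots_txt, product_token, path):
    """Decide access for one crawler token and one URL path (including any query string)."""
    groups = parse_groups(robots_txt)
    token = product_token.lower()
    matched = [rules for agents, rules in groups if token in agents]
    if not matched:  # fall back to the wildcard group
        matched = [rules for agents, rules in groups if "*" in agents]
    best = None  # (pattern length, is_allow); True sorts above False, so Allow wins ties
    for rules in matched:
        for kind, pattern in rules:
            if pattern_to_regex(pattern).match(path):
                candidate = (len(pattern), kind == "allow")
                if best is None or candidate > best:
                    best = candidate
    return True if best is None else best[1]
```

Comparing `(pattern length, is_allow)` tuples keeps the precedence logic in one place: a longer pattern always wins, and an equally long Allow beats a Disallow.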

The check evaluates retrieval and user-triggered crawlers separately from policy or training crawlers. Google-Extended is reported as a Google Gemini/Vertex AI product-policy signal, not as a Google Search or AI Overview crawler.

Non-Standard And Real-World Behavior

GEO is an emerging practice, not a formal web standard. The check treats llms.txt, sameAs links, FAQ patterns, and paragraph length targets as supporting evidence rather than mandatory requirements.
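To show how a paragraph length target can act as supporting evidence rather than a hard requirement, the sketch below applies the roughly 80-220 word range from the validation steps and returns improvement reasons instead of a verdict. The function name and the evidence cues (a URL, a four-digit year, or a percentage) are assumptions chosen for illustration; the sketch does not attempt to judge whether a passage is self-contained.

```python
import re

# Heuristic word-count bounds taken from the validation steps.
MIN_WORDS, MAX_WORDS = 80, 220

# Hypothetical evidence cues: a link, a four-digit year, or a percentage.
EVIDENCE_PATTERN = re.compile(r"https?://|\b(?:19|20)\d{2}\b|\d+(?:\.\d+)?%")


def assess_passage(text):
    """Return word count, a citable flag, and improvement reasons for one paragraph."""
    words = text.split()
    reasons = []
    if len(words) < MIN_WORDS:
        reasons.append("too short")
    elif len(words) > MAX_WORDS:
        reasons.append("too long")
    if not EVIDENCE_PATTERN.search(text):
        reasons.append("no visible evidence signal")
    return {"word_count": len(words), "citable": not reasons, "reasons": reasons}
```

Returning reasons alongside the flag mirrors the evidence model, which reports candidate passages together with why they fall short.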

The check recognizes that platforms use different crawlers and purposes:

OAI-SearchBot for OpenAI search retrieval.
ChatGPT-User for user-triggered OpenAI fetches.
PerplexityBot and Perplexity-User for Perplexity retrieval and user-triggered fetches.
ClaudeBot, Claude-SearchBot, and Claude-User for Anthropic retrieval and user-triggered fetches.
GPTBot for OpenAI model improvement.
Google-Extended for Google Gemini/Vertex AI product policy.
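The crawler list above could be represented as a simple lookup so that retrieval and user-triggered blockers are reported apart from policy or training signals. The mapping and the `split_blocked` helper are hypothetical. In particular, the source lists ClaudeBot together with the Anthropic retrieval and user-triggered crawlers without saying which it is, so placing it under "retrieval" is an assumption of this sketch.

```python
# Purpose per crawler product token, following the list above.
# ClaudeBot's category is an assumption; the source does not disambiguate it.
CRAWLER_PURPOSES = {
    "OAI-SearchBot": "retrieval",
    "ChatGPT-User": "user-triggered",
    "PerplexityBot": "retrieval",
    "Perplexity-User": "user-triggered",
    "ClaudeBot": "retrieval",
    "Claude-SearchBot": "retrieval",
    "Claude-User": "user-triggered",
    "GPTBot": "policy",
    "Google-Extended": "policy",
}


def split_blocked(blocked_tokens):
    """Group blocked crawler tokens by purpose; unknown tokens are ignored."""
    report = {"retrieval": [], "user-triggered": [], "policy": []}
    for token in blocked_tokens:
        purpose = CRAWLER_PURPOSES.get(token)
        if purpose is not None:
            report[purpose].append(token)
    return report
```

Keeping the policy bucket separate lets a blocked GPTBot or Google-Extended surface as a reported signal without being counted as a retrieval blocker.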

Non-Goals And Limitations

This check does not:

Promise inclusion or ranking in any AI answer surface.
Perform query-level AI visibility monitoring.
Replace traditional SEO indexability checks.
Fully validate structured data; specialized structured-data checks own that behavior.
Fully validate /robots.txt; the robots-txt check owns full site-level robots syntax validation.
Require llms.txt, RSL, FAQ schema, Wikipedia presence, Reddit mentions, YouTube mentions, or brand-mention metrics as hard pass criteria.
Execute arbitrary client-side JavaScript to recover hidden content.

AI systems are proprietary and change frequently. A page can be technically ready and still not be selected or cited. A page can also be cited despite imperfect structure if it has unique, authoritative information.

References

Source: lib/checks/geo-readiness/versions/1.0.0/docs.md

6. Version Changelog

geo-readiness v1.0.0 Changelog

Added RFC 9309-oriented path-specific AI crawler access evaluation.
Added OAI-SearchBot, Claude-SearchBot, Claude-User, and Perplexity-User crawler evidence.
Separated AI retrieval and user-triggered crawler access from policy/training crawler signals such as GPTBot and Google-Extended.
Added snippet-control, initial HTML extractability, JSON-LD type, and source/trust evidence.

Source: lib/checks/geo-readiness/versions/1.0.0/changelog.md