Check specification

semantic-html 1.0.0

Semantic HTML

Validates whether page HTML exposes an extraction-friendly semantic structure for agents, browsers, crawlers, and assistive technology.

Assessment Suite: 2026.06.10
Maturity: Established
Category: AI Discoverability
Subcategory: Page Structure

1. Abstract

Expose readable page structure through semantic HTML and accessible controls.

Semantic HTML gives browsers, assistive technology, search systems, and agents reliable landmarks, headings, controls, form semantics, and image context.

2. Classification

Check ID: semantic-html
Check version: 1.0.0
Package path: lib/checks/semantic-html/versions/1.0.0
Category: AI Discoverability
Subcategory: Content Readiness
Check group: Page Structure
Check group ID: page-structure
Maturity: Established
Scope: page
Check weight: 1

3. Input And Output Contracts

Resources inspected: HTML landmarks, Heading hierarchy, Accessible controls, Form field autocomplete, Image alt text

4. Scoring Semantics

Step ID | Title | Weight | Description
landmarks | Page landmarks | 0.18 | Validate one visible main landmark plus navigation/header/footer semantics where appropriate.
heading-structure | Heading structure | 0.18 | Validate a meaningful h1 and ordered, non-empty headings.
links | Links | 0.16 | Validate accessible names, crawlable href values, and descriptive anchor text.
controls | Buttons and interactive controls | 0.12 | Validate accessible names for native and ARIA button controls.
forms | Form labels and autocomplete | 0.16 | Validate form labels and valid autocomplete tokens for personal-data inputs.
images | Image text alternatives | 0.12 | Validate image alt coverage and common alt quality issues.
native-semantics | Native semantics and ARIA | 0.08 | Detect ARIA roles that duplicate or conflict with native HTML semantics.

5. Package Documentation

Semantic HTML Check v1.0.0

Check identifier: semantic-html

Abstract

This check validates whether a page exposes an extraction-friendly HTML surface through native semantic landmarks, ordered headings, accessible links and controls, labeled forms, appropriate autocomplete hints, image text alternatives, and non-conflicting ARIA usage.

The goal is not to audit visual accessibility from screenshots. It is to determine whether agents, crawlers, browsers, and assistive technologies can infer page structure and interaction meaning from the HTML itself.

Scope

This check is page-local. It evaluates the scanned HTML document only. It does not crawl linked pages, compute the full accessibility tree that assistive technology consumes, prove WCAG conformance, or replace browser accessibility audits.

Standards Basis

HTML defines semantic sectioning and grouping elements such as main, nav, header, footer, headings, anchors, buttons, form controls, and images. These native elements expose meaning to user agents without requiring site-specific JavaScript or visual interpretation.

The main element represents the dominant contents of the document. A page should expose one visible main landmark for the primary content. nav represents a major navigation section, not every cluster of links. header and footer can apply to the whole page or to sectioning content depending on their ancestor context.

Headings define the hierarchy of page content. A page should expose a meaningful top-level heading and should not skip heading ranks for visual styling. Empty headings provide no useful structure.

WCAG and WAI guidance require controls and links to expose programmatically determinable names, roles, and values. Accessible names can come from visible text, associated labels, aria-label, aria-labelledby, host-language features such as alt, and other fallback sources described by the Accessible Name and Description Computation specification.

HTML autocomplete tokens apply to fields collecting known user information such as name, email, username, password, address, telephone, organization, birthday, URL, and payment data. This check applies autocomplete validation only to inputs that appear to collect those known personal, account, address, contact, or payment values.

Image alt text is required for meaningful image semantics. Empty alt is appropriate for decorative or redundant images, but informative and functional images need useful text alternatives. Image links rely on image alt for link purpose when no other link text is present.

ARIA can provide fallback semantics, but it should not duplicate or conflict with native HTML semantics. Native HTML is preferred when an element already has the right meaning.

Normative Requirements

  • A page MUST expose exactly one visible primary content landmark through native <main> or role="main".
  • A page SHOULD expose page-level navigation, header, and footer/contentinfo landmarks when those regions exist.
  • A page SHOULD expose one meaningful visible h1.
  • Headings SHOULD be non-empty and SHOULD NOT skip ranks.
  • Navigational links MUST be crawlable <a> elements with usable href values.
  • Links and button controls MUST expose accessible names.
  • User-fillable form controls MUST expose labels or accessible names.
  • Inputs that collect known personal, account, address, contact, or payment information SHOULD use valid HTML autocomplete tokens.
  • Images MUST include an alt attribute. Decorative images MAY use alt="".
  • Functional image links MUST NOT rely on an empty or missing image alt.
  • ARIA roles SHOULD NOT duplicate or override native semantics unless there is a specific compatibility reason.

Validation Steps

Page Landmarks

The check records native and ARIA fallback landmarks:

  • <main> and role="main".
  • <nav> and role="navigation".
  • page-level <header> and role="banner".
  • page-level <footer> and role="contentinfo".

The requirement for the main landmark is strict: the step fails unless the page exposes exactly one visible main landmark. Navigation, header, and footer are treated as warnings when the dominant content landmark exists but surrounding page landmarks are incomplete.
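The landmark rules above can be sketched with Python's standard html.parser module. This is a minimal illustration, not the package's implementation: the names LandmarkCounter, count_landmarks, and landmark_status are hypothetical, visibility is not modeled, and a missing navigation, header, or footer is always reported as a warning because a static sketch cannot tell whether those regions exist on the page.

```python
from html.parser import HTMLParser

# Native elements and the landmark role each one exposes at page level.
NATIVE_LANDMARKS = {"main": "main", "nav": "navigation",
                    "header": "banner", "footer": "contentinfo"}
ARIA_LANDMARKS = set(NATIVE_LANDMARKS.values())
# A header or footer nested in one of these is scoped to that section, not the page.
SECTIONING = {"article", "aside", "main", "nav", "section"}


class LandmarkCounter(HTMLParser):
    """Counts page-level landmarks; visibility is not modeled in this sketch."""

    def __init__(self):
        super().__init__()
        self.counts = {role: 0 for role in ARIA_LANDMARKS}
        self._section_depth = 0

    def handle_starttag(self, tag, attrs):
        role = (dict(attrs).get("role") or "").strip().lower()
        if role in ARIA_LANDMARKS:
            self.counts[role] += 1  # explicit role, e.g. <div role="main">
        elif not role and tag in NATIVE_LANDMARKS:
            scoped = tag in ("header", "footer") and self._section_depth > 0
            if not scoped:
                self.counts[NATIVE_LANDMARKS[tag]] += 1
        if tag in SECTIONING:
            self._section_depth += 1

    def handle_endtag(self, tag):
        if tag in SECTIONING and self._section_depth > 0:
            self._section_depth -= 1


def count_landmarks(html):
    parser = LandmarkCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def landmark_status(html):
    counts = count_landmarks(html)
    if counts["main"] != 1:
        return "fail"  # exactly one main landmark is required
    surrounding = ("navigation", "banner", "contentinfo")
    return "pass" if all(counts[r] for r in surrounding) else "warning"
```

Counting an element with both a native tag and a matching role once (for example <main role="main">) keeps the strict main count from being inflated by redundant ARIA.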

Heading Structure

The check records visible h1 count, total heading count, skipped heading ranks, and empty heading elements. It fails when the page lacks one meaningful visible h1, contains empty headings, or skips heading levels.
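As a rough sketch of the heading evidence described above, the following collects heading levels and text with html.parser and derives the recorded counts. The names HeadingCollector, heading_evidence, and heading_status are hypothetical. The sketch reads "one meaningful visible h1" as exactly one non-empty h1, which is an interpretation, and it does not model visibility.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headings = []  # (level, normalized text) in document order
        self._level = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            text = " ".join("".join(self._buffer).split())
            self.headings.append((self._level, text))
            self._level = None


def heading_evidence(html):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    levels = [level for level, _ in parser.headings]
    return {
        "h1_count": sum(1 for level, text in parser.headings if level == 1 and text),
        "total": len(levels),
        "empty": sum(1 for _, text in parser.headings if not text),
        # A skip is any step down the hierarchy of more than one rank, e.g. h2 to h4.
        "skipped": [(a, b) for a, b in zip(levels, levels[1:]) if b > a + 1],
    }


def heading_status(evidence):
    ok = (evidence["h1_count"] == 1 and evidence["empty"] == 0
          and not evidence["skipped"])
    return "pass" if ok else "fail"
```

Moving back up the hierarchy (h4 followed by h2) is not counted as a skip, since closing a subsection is normal document structure.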

Links

The check validates:

  • Accessible link names.
  • Crawlable anchors with real href values.
  • Descriptive anchor text, flagging generic phrases such as click here, read more, and learn more.

Generic anchor text is a warning-level issue when names and crawlability are otherwise correct.
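The three link rules can be illustrated with a short sketch. It assumes that an empty href, a bare "#", and javascript: URLs count as non-crawlable, which this document does not spell out, and it resolves names only from aria-label, text content, and nested image alt. The names LinkCollector, collect_links, and classify_link are hypothetical.

```python
from html.parser import HTMLParser

GENERIC_TEXT = {"click here", "read more", "learn more"}


def is_crawlable(href):
    """Assumption: empty, bare '#', and javascript: values are not crawlable."""
    value = (href or "").strip().lower()
    return bool(value) and value != "#" and not value.startswith("javascript:")


class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
        self._open = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._open = {"href": attrs.get("href"),
                          "label": attrs.get("aria-label") or "", "text": []}
        elif tag == "img" and self._open is not None:
            # An image inside a link contributes its alt to the link name.
            self._open["text"].append(attrs.get("alt") or "")

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            text = " ".join(" ".join(self._open["text"]).split())
            name = self._open["label"].strip() or text
            self.links.append({"href": self._open["href"], "name": name})
            self._open = None


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def classify_link(link):
    if not link["name"] or not is_crawlable(link["href"]):
        return "fail"
    if link["name"].lower() in GENERIC_TEXT:
        return "warning"  # generic text only matters once name and href are fine
    return "pass"
```

Checking the name and href before the generic-text rule mirrors the ordering above: generic anchor text is only a warning when the link is otherwise correct.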

Buttons and Interactive Controls

The check validates accessible names for:

  • Native <button> elements.
  • Button-like <input> controls.
  • Elements with role="button".

Names may come from visible text, aria-label, aria-labelledby, title, value, alt, or equivalent host-language mechanisms.
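A simplified sketch of name resolution for the three control types follows. It is an approximation rather than the Accessible Name and Description Computation algorithm: aria-labelledby is not resolved, void elements carrying role="button" are not handled, and the fallback names for submit and reset inputs are assumed to match the browser defaults. ControlCollector, collect_controls, and unnamed_controls are hypothetical names.

```python
from html.parser import HTMLParser

BUTTON_INPUT_TYPES = {"button", "submit", "reset", "image"}
# Submit and reset inputs get a default label from the browser when value is absent.
DEFAULT_INPUT_NAMES = {"submit": "Submit", "reset": "Reset"}


class ControlCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.controls = []  # (tag, resolved name); an empty name means inaccessible
        self._open = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            input_type = (attrs.get("type") or "text").lower()
            if input_type in BUTTON_INPUT_TYPES:
                name = (attrs.get("aria-label") or attrs.get("value")
                        or attrs.get("alt") or attrs.get("title")
                        or DEFAULT_INPUT_NAMES.get(input_type, ""))
                self.controls.append(("input", name.strip()))
            return
        if self._open is not None:
            if tag == self._open["tag"]:
                self._open["depth"] += 1  # nested same-name element, e.g. div in div
            elif tag == "img":
                self._open["text"].append(attrs.get("alt") or "")
            return
        is_button = tag == "button" or (attrs.get("role") or "").strip().lower() == "button"
        if is_button and tag != "img":
            self._open = {"tag": tag, "depth": 0, "text": [],
                          "label": attrs.get("aria-label") or "",
                          "title": attrs.get("title") or ""}

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"].append(data)

    def handle_endtag(self, tag):
        if self._open is None or tag != self._open["tag"]:
            return
        if self._open["depth"] > 0:
            self._open["depth"] -= 1
            return
        text = " ".join(" ".join(self._open["text"]).split())
        name = self._open["label"].strip() or text or self._open["title"].strip()
        self.controls.append((tag, name))
        self._open = None


def collect_controls(html):
    parser = ControlCollector()
    parser.feed(html)
    parser.close()
    return parser.controls


def unnamed_controls(html):
    return [tag for tag, name in collect_controls(html) if not name]
```

An icon-only button with no text, aria-label, or title is the typical case this surfaces.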

Form Labels and Autocomplete

The check validates label coverage for user-fillable inputs, selects, and textareas. It accepts visible labels, explicit for/id labels, implicit wrapping labels, aria-label, and aria-labelledby.

Autocomplete validation is narrower: it applies to fields whose name, id, type, placeholder, or accessible name indicates known personal/account/contact/address/payment data. Search fields, filters, coupon codes, scanner URL inputs, and arbitrary query fields are not required to provide autocomplete tokens unless they clearly collect one of the known user-data purposes.
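The applicability rule can be sketched as a two-stage heuristic: decide whether a field looks like it collects known user data, then validate its token. Everything below is illustrative. The hint patterns are hypothetical, VALID_TOKENS is an abbreviated subset of the autofill field names in the HTML standard, and the sketch assumes that autocomplete="on" or "off" does not satisfy the requirement.

```python
import re

# Abbreviated subset of autofill field names from the HTML standard.
VALID_TOKENS = {
    "name", "given-name", "family-name", "email", "username",
    "new-password", "current-password", "tel", "organization",
    "street-address", "postal-code", "country", "bday", "url",
    "cc-name", "cc-number", "cc-exp", "cc-csc",
}

# Hypothetical hint patterns; a real check would need a broader, tested list.
PERSONAL_DATA_HINTS = re.compile(
    r"e-?mail|phone|first.?name|last.?name|full.?name|user.?name|"
    r"password|address|postal|zip|card|birth", re.I)
EXEMPT_HINTS = re.compile(r"search|query|filter|coupon|promo", re.I)


def autocomplete_applies(attrs):
    """attrs is a plain dict of the input's attributes."""
    input_type = (attrs.get("type") or "text").lower()
    if input_type == "search":
        return False
    if input_type in {"email", "tel", "password"}:
        return True
    haystack = " ".join(attrs.get(key) or ""
                        for key in ("name", "id", "placeholder", "aria-label"))
    if EXEMPT_HINTS.search(haystack):
        return False
    return bool(PERSONAL_DATA_HINTS.search(haystack))


def autocomplete_status(attrs):
    if not autocomplete_applies(attrs):
        return "not-applicable"
    tokens = (attrs.get("autocomplete") or "").lower().split()
    if not tokens:
        return "missing"
    # In the common case the last token is the field name; earlier tokens
    # may be section-*, shipping, or billing qualifiers.
    return "valid" if tokens[-1] in VALID_TOKENS else "invalid"
```

Running the exemption test before the personal-data test keeps fields such as a coupon code or a site search box out of scope even when their placeholder mentions a word that would otherwise match.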

Image Text Alternatives

The check flags:

  • Missing alt attributes.
  • Empty alt on image-only links without another link name.
  • Generic alt text such as image, photo, icon, screenshot, or chart.
  • Alt text that merely repeats the image filename.

The check does not attempt full visual interpretation of complex charts or diagrams.
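The four alt-quality rules listed above map naturally onto one small function. The sketch below is an assumption about how such a rule set might look: alt_issue and its in_link_without_text parameter are hypothetical names, and the caller is expected to know whether the image sits inside a link that has no other name.

```python
import posixpath
from urllib.parse import urlparse

GENERIC_ALT = {"image", "photo", "icon", "screenshot", "chart"}


def alt_issue(src, alt, in_link_without_text=False):
    """Return an issue label for one image, or None when no rule fires.

    alt is None when the attribute is absent and "" when it is present but empty.
    """
    if alt is None:
        return "missing-alt"
    text = " ".join(alt.split()).lower()
    if not text:
        # Empty alt is fine for decorative images, but not when the image
        # is the only source of a link's name.
        return "empty-alt-on-image-link" if in_link_without_text else None
    if text in GENERIC_ALT:
        return "generic-alt"
    filename = posixpath.basename(urlparse(src or "").path).lower()
    stem = posixpath.splitext(filename)[0]
    if filename and text in {filename, stem}:
        return "filename-alt"
    return None
```

Distinguishing an absent attribute from an empty one matters here, because only the absent case violates the MUST-level requirement on its own.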

Native Semantics and ARIA

The check records cases where ARIA duplicates or conflicts with native semantics, including:

  • Native landmarks with redundant matching roles.
  • Native links or buttons overridden with conflicting roles.
  • Landmark or interactive elements hidden from the accessibility tree.

These are warning-level issues unless they also cause another step, such as accessible names, to fail.
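A minimal sketch of the three cases above compares an element's explicit role with its implicit native role. The table of implicit roles is deliberately small and omits header and footer, whose implicit role depends on ancestor context. The names implicit_role and aria_issue and the issue labels are hypothetical.

```python
# Small, illustrative table of implicit roles for native elements.
IMPLICIT_ROLES = {"main": "main", "nav": "navigation",
                  "aside": "complementary", "button": "button"}
LANDMARK_OR_INTERACTIVE = {"main", "navigation", "banner", "contentinfo",
                           "complementary", "button", "link"}


def implicit_role(tag, attrs):
    if tag == "a":
        # An anchor only has link semantics when it carries an href.
        return "link" if "href" in attrs else None
    return IMPLICIT_ROLES.get(tag)


def aria_issue(tag, attrs):
    """attrs is a plain dict of attributes; returns an issue label or None."""
    role = (attrs.get("role") or "").strip().lower()
    native = implicit_role(tag, attrs)
    effective = role or native
    if attrs.get("aria-hidden") == "true" and effective in LANDMARK_OR_INTERACTIVE:
        return "hidden-from-accessibility-tree"
    if role and native:
        return "redundant-role" if role == native else "conflicting-role"
    return None
```

A role on an element with no native semantics, such as a div with role="navigation", is treated as a legitimate fallback and produces no issue.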

Evidence Model

The check records:

  • Landmark counts for native and ARIA fallback regions.
  • Heading counts, visible h1 counts, skipped heading sequences, and empty headings.
  • Link counts, inaccessible links, non-crawlable anchors, and generic anchor text.
  • Button/control counts and inaccessible controls.
  • Form-control counts, unlabeled controls, autocomplete-applicable inputs, valid autocomplete coverage, and missing tokens.
  • Image counts, alt coverage, and alt-quality issues.
  • ARIA/native semantic conflicts.
  • Step-level pass, warning, and fail statuses.
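One way to picture the evidence model is as a per-step record of counts, issues, and a status, grouped under the check. The dataclasses below are a hypothetical shape for illustration only; the field names are assumptions, not the package's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class StepEvidence:
    step_id: str                                 # e.g. "landmarks", "links"
    status: str                                  # "pass", "warning", or "fail"
    counts: dict = field(default_factory=dict)   # e.g. {"main": 1, "navigation": 0}
    issues: list = field(default_factory=list)   # e.g. ["missing navigation landmark"]


@dataclass
class CheckEvidence:
    check_id: str = "semantic-html"
    version: str = "1.0.0"
    steps: list = field(default_factory=list)

    def statuses(self):
        """Map each step ID to its status for scoring."""
        return {step.step_id: step.status for step in self.steps}
```

Keeping raw counts alongside the status lets a reader see why a step received a warning rather than only that it did.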

Scoring Model

The check uses weighted step scoring:

  • Page landmarks: 18%
  • Heading structure: 18%
  • Links: 16%
  • Buttons and interactive controls: 12%
  • Form labels and autocomplete: 16%
  • Image text alternatives: 12%
  • Native semantics and ARIA: 8%

Warnings receive partial credit. Clear failures in main content, heading structure, accessible names, labels, or image alt coverage reduce the result more strongly.
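The weighted model can be written out as a short calculation. The step weights come from the list above. The credit values are an assumption: this document says warnings receive partial credit but does not state how much, so 0.5 is used purely for illustration, and check_score is a hypothetical name.

```python
STEP_WEIGHTS = {
    "landmarks": 0.18,
    "heading-structure": 0.18,
    "links": 0.16,
    "controls": 0.12,
    "forms": 0.16,
    "images": 0.12,
    "native-semantics": 0.08,
}

# Assumed credit per status; the real partial-credit value is not documented here.
STATUS_CREDIT = {"pass": 1.0, "warning": 0.5, "fail": 0.0}


def check_score(statuses):
    """statuses maps every step ID to 'pass', 'warning', or 'fail'."""
    return sum(weight * STATUS_CREDIT[statuses[step]]
               for step, weight in STEP_WEIGHTS.items())
```

Under these assumed credits, a page that passes every step except a warning on links and a failure on images would score 1 - (0.16 × 0.5) - 0.12 = 0.80.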

Limitations

This check is static and heuristic. It does not compute the complete browser accessibility tree, inspect CSS-generated text, evaluate visual reading order, determine whether an image is actually decorative, or validate every ARIA role/attribute combination. It intentionally favors native HTML and practical extraction evidence over attempting full WCAG certification.

References

Source: lib/checks/semantic-html/versions/1.0.0/docs.md

6. Version Changelog

semantic-html v1.0.0 Changelog

Initial versioned package for semantic-html.

Source: lib/checks/semantic-html/versions/1.0.0/changelog.md