1. Abstract
Expose readable page structure through semantic HTML and accessible controls.
Semantic HTML gives browsers, assistive technology, search systems, and agents reliable landmarks, headings, controls, form semantics, and image context.
2. Classification
- Check ID: semantic-html
- Check version: 1.0.0
- Package path: lib/checks/semantic-html/versions/1.0.0
- Category: AI Discoverability
- Subcategory: Content Readiness
- Check group: Page Structure
- Check group ID: page-structure
- Maturity: Established
- Scope: page
- Check weight: 1
3. Input And Output Contracts
- Input: [email protected]
- Output: [email protected]
- Resources inspected: HTML landmarks, Heading hierarchy, Accessible controls, Form field autocomplete, Image alt text
4. Scoring Semantics
| Step ID | Title | Weight | Description |
|---|---|---|---|
| landmarks | Page landmarks | 0.18 | Validate one visible main landmark plus navigation/header/footer semantics where appropriate. |
| heading-structure | Heading structure | 0.18 | Validate a meaningful h1 and ordered, non-empty headings. |
| links | Links | 0.16 | Validate accessible names, crawlable href values, and descriptive anchor text. |
| controls | Buttons and interactive controls | 0.12 | Validate accessible names for native and ARIA button controls. |
| forms | Form labels and autocomplete | 0.16 | Validate form labels and valid autocomplete tokens for personal-data inputs. |
| images | Image text alternatives | 0.12 | Validate image alt coverage and common alt quality issues. |
| native-semantics | Native semantics and ARIA | 0.08 | Detect ARIA roles that duplicate or conflict with native HTML semantics. |
5. Package Documentation
Semantic HTML Check v1.0.0
Check identifier: semantic-html
Abstract
This check validates whether a page exposes an extraction-friendly HTML surface through native semantic landmarks, ordered headings, accessible links and controls, labeled forms, appropriate autocomplete hints, image text alternatives, and non-conflicting ARIA usage.
The goal is not visual accessibility auditing by screenshot. The goal is to determine whether agents, crawlers, browsers, and assistive technologies can infer page structure and interaction meaning from the HTML itself.
Scope
This check is page-local. It evaluates only the scanned HTML document: it does not crawl linked pages, build a full browser accessibility tree, prove WCAG conformance, or replace browser accessibility audits.
Standards Basis
HTML defines semantic sectioning and grouping elements such as main, nav, header, footer, headings, anchors, buttons, form controls, and images. These native elements expose meaning to user agents without requiring site-specific JavaScript or visual interpretation.
The main element represents the dominant contents of the document. A page should expose one visible main landmark for the primary content. nav represents a major navigation section, not every cluster of links. header and footer can apply to the whole page or to sectioning content depending on their ancestor context.
Headings define the hierarchy of page content. A page should expose a meaningful top-level heading and should not skip heading ranks for visual styling. Empty headings provide no useful structure.
WCAG and WAI guidance require controls and links to expose programmatically determinable names, roles, and values. Accessible names can come from visible text, associated labels, aria-label, aria-labelledby, host-language features such as alt, and other fallback sources described by the Accessible Name and Description Computation specification.
HTML autocomplete tokens apply to fields collecting known user information such as name, email, username, password, address, telephone, organization, birthday, URL, and payment data. This check applies autocomplete validation only to inputs that appear to collect those known personal, account, address, contact, or payment values.
Image alt text is required for meaningful image semantics. Empty alt is appropriate for decorative or redundant images, but informative and functional images need useful text alternatives. Image links rely on image alt for link purpose when no other link text is present.
ARIA can provide fallback semantics, but it should not duplicate or conflict with native HTML semantics. Native HTML is preferred when an element already has the right meaning.
Normative Requirements
- A page MUST expose exactly one visible primary content landmark through native `<main>` or `role="main"`.
- A page SHOULD expose page-level navigation, header, and footer/contentinfo landmarks when those regions exist.
- A page SHOULD expose one meaningful visible `h1`.
- Headings SHOULD be non-empty and SHOULD NOT skip ranks.
- Navigational links MUST be crawlable `<a>` elements with usable `href` values.
- Links and button controls MUST expose accessible names.
- User-fillable form controls MUST expose labels or accessible names.
- Inputs that collect known personal, account, address, contact, or payment information SHOULD use valid HTML autocomplete tokens.
- Images MUST include an `alt` attribute. Decorative images MAY use `alt=""`.
- Functional image links MUST NOT rely on an empty or missing image `alt`.
- ARIA roles SHOULD NOT duplicate or override native semantics unless there is a specific compatibility reason.
Validation Steps
Page Landmarks
The check records native and ARIA fallback landmarks:
- `<main>` and `role="main"`.
- `<nav>` and `role="navigation"`.
- Page-level `<header>` and `role="banner"`.
- Page-level `<footer>` and `role="contentinfo"`.
The `<main>` requirement is strict: the step fails unless the page exposes exactly one visible main landmark. Navigation, header, and footer are treated as warnings when the dominant content landmark exists but surrounding page landmarks are incomplete.
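The landmark rule can be illustrated with a minimal sketch built on Python's standard `html.parser`. The names `LandmarkCounter` and `landmark_status` are hypothetical, and the sketch makes simplifying assumptions: visibility is approximated by the `hidden` attribute only, and every `<header>` or `<footer>` is counted as page-level regardless of its ancestors.

```python
from html.parser import HTMLParser

# Native tag -> ARIA fallback role, as listed above.
LANDMARK_ROLES = {
    "main": "main",
    "nav": "navigation",
    "header": "banner",
    "footer": "contentinfo",
}


class LandmarkCounter(HTMLParser):
    """Counts landmarks exposed by native tags or ARIA fallback roles."""

    def __init__(self):
        super().__init__()
        self.counts = {role: 0 for role in LANDMARK_ROLES.values()}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # ASSUMPTION: only the `hidden` attribute is treated as "not visible".
        if "hidden" in attr:
            return
        if tag in LANDMARK_ROLES:
            self.counts[LANDMARK_ROLES[tag]] += 1
        elif attr.get("role") in self.counts:
            self.counts[attr["role"]] += 1


def landmark_status(html):
    """Return 'fail', 'warning', or 'pass' for the landmark step."""
    counter = LandmarkCounter()
    counter.feed(html)
    counter.close()
    counts = counter.counts
    if counts["main"] != 1:
        return "fail"  # the main requirement is strict
    if any(counts[r] == 0 for r in ("navigation", "banner", "contentinfo")):
        return "warning"  # surrounding landmarks are incomplete
    return "pass"
```

A page with only `<main>` would come back as a warning under these assumptions, while a page with two visible `<main>` elements would fail.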
Heading Structure
The check records visible h1 count, total heading count, skipped heading ranks, and empty heading elements. It fails when the page lacks one meaningful visible h1, contains empty headings, or skips heading levels.
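As a rough illustration of the heading evidence described above, the following sketch collects heading levels and text, then derives the `h1` count, skipped ranks, and empty headings. `HeadingCollector` and `heading_evidence` are hypothetical names, and the sketch assumes every heading is visible because it does not inspect CSS or `hidden` state.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for h1..h6 in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            self.headings.append((self._level, "".join(self._parts).strip()))
            self._level = None


def heading_evidence(html):
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    levels = [level for level, _ in collector.headings]
    return {
        # Only non-empty h1 elements count as meaningful.
        "h1_count": sum(1 for level, text in collector.headings if level == 1 and text),
        "total": len(levels),
        # A skip is any step down the hierarchy of more than one rank.
        "skipped": [(a, b) for a, b in zip(levels, levels[1:]) if b > a + 1],
        "empty": sum(1 for _, text in collector.headings if not text),
    }
```

Under the rule stated above, a step built on this evidence would fail when `h1_count` is not 1, when `skipped` is non-empty, or when `empty` is greater than zero.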
Links
The check validates:
- Accessible link names.
- Crawlable anchors with real `href` values.
- Generic anchor text such as `click here`, `read more`, and `learn more`.
Generic anchor text is a warning-level issue when names and crawlability are otherwise correct.
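The three link checks might be combined roughly as follows. This is a sketch with hypothetical names (`LinkCollector`, `is_crawlable`, `link_statuses`). It assumes a link name comes from `aria-label`, text content, or nested image `alt` only, and that an empty `href`, a bare `#`, or a `javascript:` URL is not crawlable; the real check may draw these lines differently.

```python
from html.parser import HTMLParser

GENERIC_TEXT = {"click here", "read more", "learn more"}


class LinkCollector(HTMLParser):
    """Collects href and a simplified accessible name for each <a>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._open = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a":
            self._open = {"href": attr.get("href"), "label": attr.get("aria-label"), "parts": []}
        elif tag == "img" and self._open is not None:
            # Image alt contributes to the link name.
            self._open["parts"].append(attr.get("alt") or "")

    def handle_data(self, data):
        if self._open is not None:
            self._open["parts"].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            text = " ".join("".join(self._open["parts"]).split())
            name = (self._open["label"] or text).strip()
            self.links.append({"href": self._open["href"], "name": name})
            self._open = None


def is_crawlable(href):
    # ASSUMPTION: placeholder and script hrefs count as non-crawlable.
    value = (href or "").strip().lower()
    return bool(value) and value != "#" and not value.startswith("javascript:")


def link_statuses(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    statuses = []
    for link in collector.links:
        if not link["name"] or not is_crawlable(link["href"]):
            statuses.append("fail")
        elif link["name"].lower() in GENERIC_TEXT:
            statuses.append("warning")  # named and crawlable, but generic
        else:
            statuses.append("pass")
    return statuses
```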
Buttons and Interactive Controls
The check validates accessible names for:
- Native `<button>` elements.
- Button-like `<input>` controls.
- Elements with `role="button"`.
Names may come from visible text, aria-label, aria-labelledby, title, value, alt, or equivalent host-language mechanisms.
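A simplified name lookup over those sources could look like the sketch below. The function `control_name` and its parameters are hypothetical. The precedence shown (labelled-by references, then `aria-label`, then text and host-language attributes, then `title`) loosely follows the Accessible Name and Description Computation order and is not claimed to match the check's exact implementation.

```python
def control_name(attrs, text="", id_text=None):
    """Return the first non-empty candidate name, or '' when none exists.

    attrs   -- the control's attributes as a dict
    text    -- the control's visible text content, if any
    id_text -- hypothetical map of element id -> text, used to resolve
               aria-labelledby references
    """
    id_text = id_text or {}
    refs = (attrs.get("aria-labelledby") or "").split()
    labelled_by = " ".join(id_text.get(ref, "") for ref in refs)
    candidates = [
        labelled_by,
        attrs.get("aria-label"),
        text,
        attrs.get("value"),
        attrs.get("alt"),
        attrs.get("title"),
    ]
    for candidate in candidates:
        if candidate and candidate.strip():
            # Collapse internal whitespace so names compare cleanly.
            return " ".join(candidate.split())
    return ""
```

A control for which this returns an empty string would be recorded as inaccessible, for example a `<div role="button">` that contains only an unlabeled icon.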
Form Labels and Autocomplete
The check validates label coverage for user-fillable inputs, selects, and textareas. It accepts visible labels, explicit for/id labels, implicit wrapping labels, aria-label, and aria-labelledby.
Autocomplete validation is narrower: it applies to fields whose name, id, type, placeholder, or accessible name indicates known personal/account/contact/address/payment data. Search fields, filters, coupon codes, scanner URL inputs, and arbitrary query fields are not required to provide autocomplete tokens unless they clearly collect one of the known user-data purposes.
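The applicability heuristic can be sketched as below. Everything here is illustrative: the keyword patterns, the `autocomplete_status` function, and its status strings are assumptions, and `VALID_TOKENS` is only a subset of the autofill field names defined by the HTML standard.

```python
import re

# Subset of HTML autofill field names; the standard defines more.
VALID_TOKENS = {
    "name", "given-name", "family-name", "email", "username",
    "current-password", "new-password", "organization",
    "street-address", "postal-code", "country", "tel", "bday", "url",
    "cc-name", "cc-number", "cc-exp", "cc-csc",
}

# ASSUMPTION: hypothetical keyword heuristics, not the check's real rules.
PURPOSE_TYPES = {"email", "tel", "password"}
PURPOSE_HINTS = re.compile(
    r"e-?mail|phone|password|user.?name|first.?name|last.?name|full.?name"
    r"|address|zip|postal|card.?number|birth",
    re.I,
)
EXEMPT_HINTS = re.compile(r"search|query|filter|coupon|promo", re.I)


def autocomplete_status(attrs):
    """Classify one input as not-applicable, valid, missing, or unrecognized."""
    field_type = (attrs.get("type") or "text").lower()
    hint = " ".join(attrs.get(k) or "" for k in ("name", "id", "placeholder", "aria-label"))
    if field_type in PURPOSE_TYPES:
        applies = True
    elif field_type == "search" or EXEMPT_HINTS.search(hint):
        applies = False
    else:
        applies = bool(PURPOSE_HINTS.search(hint))
    if not applies:
        return "not-applicable"
    tokens = (attrs.get("autocomplete") or "").lower().split()
    if not tokens:
        return "missing"
    # Values may carry prefixes such as "shipping"; one known field name suffices here.
    return "valid" if any(t in VALID_TOKENS for t in tokens) else "unrecognized"
```

Keeping the exemption test ahead of the purpose keywords mirrors the narrower scope described above: a search box is skipped even if its placeholder happens to mention a name or address.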
Image Text Alternatives
The check validates:
- Missing `alt` attributes.
- Empty `alt` on image-only links without another link name.
- Generic alt text such as `image`, `photo`, `icon`, `screenshot`, or `chart`.
- Alt text that merely repeats the image filename.
The check does not attempt full visual interpretation of complex charts or diagrams.
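The four alt checks listed above could be expressed as a single classifier, sketched here under stated assumptions. The function `alt_issue`, its issue labels, and the `in_link_without_text` flag are hypothetical; the caller is assumed to know whether the image is the only content of a link that has no other name.

```python
import posixpath
from urllib.parse import urlparse

GENERIC_ALT = {"image", "photo", "icon", "screenshot", "chart"}


def alt_issue(attrs, in_link_without_text=False):
    """Return an issue label for one <img>, or None when no issue is found."""
    if "alt" not in attrs:
        return "missing-alt"
    alt = (attrs["alt"] or "").strip()
    if not alt:
        # Empty alt is fine for decorative images, but not for image-only links.
        return "empty-alt-in-link" if in_link_without_text else None
    if alt.lower() in GENERIC_ALT:
        return "generic-alt"
    filename = posixpath.basename(urlparse(attrs.get("src") or "").path)
    stem = filename.rsplit(".", 1)[0]
    if filename and alt.lower() in {filename.lower(), stem.lower()}:
        return "filename-alt"
    return None
```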
Native Semantics and ARIA
The check records cases where ARIA duplicates or conflicts with native semantics, including:
- Native landmarks with redundant matching roles.
- Native links or buttons overridden with conflicting roles.
- Landmark or interactive elements hidden from the accessibility tree.
These are warning-level issues unless they also cause another step, such as accessible names, to fail.
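A minimal sketch of these three cases follows. The function `aria_issue` and its labels are hypothetical, and the implicit-role table is deliberately small. It also assumes every `<header>` and `<footer>` is page-level, which is a simplification, since those elements only map to `banner` and `contentinfo` in that context.

```python
# Implicit ARIA roles for a few native elements.
IMPLICIT_ROLES = {
    "main": "main",
    "nav": "navigation",
    "header": "banner",       # ASSUMPTION: treated as page-level
    "footer": "contentinfo",  # ASSUMPTION: treated as page-level
    "button": "button",
    "a": "link",
}


def aria_issue(tag, attrs):
    """Return an issue label for one element, or None when no issue is found."""
    role = attrs.get("role")
    implicit = IMPLICIT_ROLES.get(tag)
    if tag == "a" and "href" not in attrs:
        implicit = None  # an anchor without href does not act as a link
    is_semantic = bool(implicit) or role in IMPLICIT_ROLES.values()
    if attrs.get("aria-hidden") == "true" and is_semantic:
        return "hidden-from-accessibility-tree"
    if role and implicit:
        return "redundant-role" if role == implicit else "conflicting-role"
    return None
```

For instance, `<nav role="navigation">` would be reported as redundant, while `<a href="/x" role="button">` would be reported as conflicting.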
Evidence Model
The check records:
- Landmark counts for native and ARIA fallback regions.
- Heading counts, visible `h1` counts, skipped heading sequences, and empty headings.
- Link counts, inaccessible links, non-crawlable anchors, and generic anchor text.
- Button/control counts and inaccessible controls.
- Form-control counts, unlabeled controls, autocomplete-applicable inputs, valid autocomplete coverage, and missing tokens.
- Image counts, alt coverage, and alt-quality issues.
- ARIA/native semantic conflicts.
- Step-level pass, warning, and fail statuses.
Scoring Model
The check uses weighted step scoring:
- Page landmarks: 18%
- Heading structure: 18%
- Links: 16%
- Buttons and interactive controls: 12%
- Form labels and autocomplete: 16%
- Image text alternatives: 12%
- Native semantics and ARIA: 8%
Warnings receive partial credit. Clear failures in main content, heading structure, accessible names, labels, or image alt coverage reduce the result more strongly.
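The weighted combination can be sketched as follows. The step weights are taken from the scoring table; the credit assigned to each status is an assumption made for illustration, since the documentation states only that warnings receive partial credit. The sketch also does not model the stronger reduction applied to clear failures.

```python
# Step weights from the scoring table; they sum to 1.0.
STEP_WEIGHTS = {
    "landmarks": 0.18,
    "heading-structure": 0.18,
    "links": 0.16,
    "controls": 0.12,
    "forms": 0.16,
    "images": 0.12,
    "native-semantics": 0.08,
}

# ASSUMPTION: illustrative credit values, not the check's actual ones.
STATUS_CREDIT = {"pass": 1.0, "warning": 0.5, "fail": 0.0}


def weighted_score(statuses):
    """Return a score in [0, 1] from a dict of step ID -> status."""
    return sum(
        weight * STATUS_CREDIT[statuses[step]]
        for step, weight in STEP_WEIGHTS.items()
    )
```

With these assumed credits, a page that passes every step except for a warning on landmarks would lose half of that step's 0.18 weight.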
Limitations
This check is static and heuristic. It does not compute the complete browser accessibility tree, inspect CSS-generated text, evaluate visual reading order, determine whether an image is actually decorative, or validate every ARIA role/attribute combination. It intentionally favors native HTML and practical extraction evidence over attempting full WCAG certification.
References
- html.spec.whatwg.org/multipage/sections.html
- html.spec.whatwg.org/multipage/grouping-content.html#the-main-element
- html.spec.whatwg.org/multipage/embedded-content.html#alt
- html.spec.whatwg.org/multipage/form-control-infrastructure.html#autofill
- www.w3.org/WAI/tutorials/page-structure
- www.w3.org/WAI/tutorials/forms/labels
- www.w3.org/WAI/tutorials/images
- www.w3.org/WAI/WCAG22/Understanding/name-role-value.html
- www.w3.org/WAI/WCAG22/Understanding/identify-input-purpose.html
- www.w3.org/TR/accname-1.2
- www.w3.org/TR/html-aria
- developers.google.com/search/docs/crawling-indexing/links-crawlable
Source: lib/checks/semantic-html/versions/1.0.0/docs.md
6. Version Changelog
semantic-html v1.0.0 Changelog
Initial versioned package for semantic-html.
Source: lib/checks/semantic-html/versions/1.0.0/changelog.md