AI bot traffic taxonomy guide for websites

AI bot traffic taxonomy guide: 8 classes, route risk, signed agents, verification, logs, and policy for crawlers and user agents.

AI bot traffic taxonomy showing search crawlers, answer bots, training bots, user-directed agents, and signed agents.

"AI bot traffic" is too broad to operate on.

A request from an AI system might be indexing a page for search, retrieving a source for an answer, collecting public content for model training, fetching a URL because a user asked, operating a browser as a user's assistant, calling an MCP tool, or spoofing a familiar user agent to scrape an expensive route.

Those are not the same thing. They should not share one allow/block switch.

OpenAI documents multiple crawler and user agent roles, including GPTBot, OAI-SearchBot, ChatGPT-User, and ChatGPT-Search (OpenAI crawler docs, 2026-06-29). Google separates Googlebot, special-case crawlers, and user-triggered fetchers in its crawler overview (Google Search Central crawler overview, 2026-06-29). Perplexity publishes separate user agents for its crawler behavior (Perplexity crawler docs, 2026-06-29). The ecosystem is already telling us that purpose matters.

TL;DR: AI bot traffic should be classified by purpose, identity proof, route risk, and permission. The useful classes are search crawlers, answer/retrieval bots, training crawlers, user-directed fetchers, browser agents, protocol clients, signed agents, and spoofed or unknown automation. A crawl policy file expresses preference. Verification proves identity. Authorization decides what the request may do.

AI bot traffic taxonomy is a model for naming automated AI-related requests before applying policy. Declared identity is what the request says it is through headers or metadata. Verified identity is evidence that the request came from the claimed actor. Route risk is the sensitivity and cost of the URL being accessed. Authorization is the permission decision for the action.

The mistake we see most often is treating "AI" as one risk class. That creates two bad outcomes. Some teams block useful retrieval and make their public docs harder for assistants to cite. Others allow too much and expose search, checkout, account, or expensive dynamic pages to automation they never intended to support.

We tested this taxonomy against access-log exports, crawl policy files, and route inventories. From our analysis, the useful break was never "AI versus not AI." It was purpose, proof, route risk, and permission. In our experience, teams make better decisions when every request has a class and every policy decision has a reason.

This article is a technical taxonomy for website teams. It connects crawler policy, edge verification, WAF decisions, logs, route classification, user-directed agents, and signed-agent work such as Cloudflare's Web Bot Auth proposal.

Reviewed under the CanAgentUse editorial process; see our about and contact pages for context.

Why does the old bot model break?

The old bot model was mostly search crawler versus scraper. That was already imperfect, but teams had a working habit: verify Googlebot, allow public pages, block private routes, rate limit suspicious automation, and watch logs.

AI traffic breaks that model because the same provider may send different traffic for different purposes.

| Old question | New question |
| --- | --- |
| Is this Googlebot? | Which Google crawler or fetcher is it, and why is it here? |
| Is this a search crawler? | Is it indexing, answering, training, fetching for a user, or acting? |
| Should bots reach this page? | Should this class reach this route for this purpose? |
| Is the user agent allowed? | Is the identity verified, and is the action authorized? |
| Did the request return 200? | Did the policy decision match the route risk and crawler purpose? |

Google's crawler overview explicitly says its crawlers and fetchers fall into categories, and some are triggered by user actions rather than general search indexing (Google Search Central, 2026-06-29). OpenAI's docs also separate crawler roles. That distinction matters operationally because a public documentation page, a pricing page, a cart endpoint, and an account page have different policy boundaries.

AI bot policy fails when teams classify traffic only by brand name. OpenAI, Google, and Perplexity all document multiple crawler or fetcher roles, which means website teams need purpose-level classification. A verified provider identity is useful, but the route and action still decide whether the request should be allowed.

The operational model should have four layers:

  1. Purpose: what kind of automated request is this?
  2. Proof: how confident are we about who sent it?
  3. Route risk: what page or endpoint is being accessed?
  4. Permission: should this request be allowed, limited, challenged, authenticated, or denied?

Skipping any layer creates weird policy. A training crawler may be allowed on a blog but not a customer forum. A user-directed fetcher may be allowed to read a public guide but not submit a form. A signed browser agent may still need user authorization before checkout.

What are the eight useful AI traffic classes?

The taxonomy below is a working model, not a legal ontology. It is meant to drive edge rules, logs, dashboards, crawl policy, and product decisions.

| Class | Purpose | Typical identity signal | Main policy question |
| --- | --- | --- | --- |
| Search crawler | Discover and index public pages for search | Published user agent plus verification | Should this URL appear in search? |
| Answer/retrieval bot | Fetch sources for AI answers or citations | Published user agent, IP verification, provider docs | Should this page ground generated answers? |
| Training crawler | Collect public content for model training | Published user agent or crawler class | Do we allow training use of this content? |
| User-directed fetcher | Retrieve a URL because a user asked an assistant | User-facing agent string or product-specific fetcher | Should a user's assistant be able to read this page? |
| Browser agent | Operate the site through a browser | Browser-like traffic, session behavior, signed identity if available | Which tasks are safe through the UI? |
| Protocol client | Call APIs, MCP tools, UCP, A2A, or agent endpoints | OAuth, API keys, mTLS, signed requests, tool client metadata | What scopes and schemas apply? |
| Signed agent | Present cryptographic proof of identity or delegation | Signature, issuer, key, attestation, token | What can this verified agent do for this user? |
| Spoofed or unknown automation | Claim identity without proof or hide intent | Suspicious user agent, failed verification, behavior signals | Should we challenge, rate limit, tarpit, or block? |

The classes overlap. A provider can run a search crawler and a training crawler. A browser agent may call APIs. A protocol client may fetch public docs first. A signed agent may still be unknown to your application.

That is fine. The point is not to force every request into one forever bucket. The point is to give your policy engine a useful first label.
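A minimal sketch of that first label, assuming a simple substring match on the declared user agent. The token table is illustrative only (an assumption, not a provider-approved list), the function name `first_label` is hypothetical, and the result is a declared label that still needs verification before any policy trusts it.

```python
from enum import Enum


class TrafficClass(str, Enum):
    """The eight classes from the taxonomy table."""
    SEARCH_CRAWLER = "search_crawler"
    ANSWER_RETRIEVAL = "answer_retrieval"
    TRAINING_CRAWLER = "training_crawler"
    USER_DIRECTED_FETCHER = "user_directed_fetcher"
    BROWSER_AGENT = "browser_agent"
    PROTOCOL_CLIENT = "protocol_client"
    SIGNED_AGENT = "signed_agent"
    UNKNOWN = "unknown"


# Illustrative mapping (assumption): check each provider's current crawler
# docs before relying on any entry here.
DECLARED_TOKENS = {
    "gptbot": ("openai", TrafficClass.TRAINING_CRAWLER),
    "oai-searchbot": ("openai", TrafficClass.SEARCH_CRAWLER),
    "chatgpt-user": ("openai", TrafficClass.USER_DIRECTED_FETCHER),
    "googlebot": ("google", TrafficClass.SEARCH_CRAWLER),
}


def first_label(user_agent: str) -> tuple[str, TrafficClass]:
    """Return (normalized actor, traffic class) from the declared user agent.

    This is only a declaration-based label; it proves nothing about identity.
    """
    ua = user_agent.lower()
    for token, label in DECLARED_TOKENS.items():
        if token in ua:
            return label
    return ("unknown", TrafficClass.UNKNOWN)
```

Anything that does not match a known token falls into the unknown class, which keeps the default conservative.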

How should teams separate purpose from identity?

Purpose and identity are different questions.

Purpose asks: why is this request being made?

Identity asks: who made it?

Permission asks: what is it allowed to do here?

A request can have a declared identity and still lack verified identity. It can have verified identity and still lack permission. It can be a known crawler and still hit the wrong route.

| Scenario | Identity | Permission result |
| --- | --- | --- |
| Verified search crawler on public blog | Strong enough | Allow if page is indexable |
| Verified search crawler on checkout | Strong enough | Block or require auth because route is wrong |
| User-directed fetcher on public docs | Provider may be verified | Usually allow if docs are public |
| User-directed fetcher on account invoice | Provider may be verified | Require user auth and authorization |
| Training crawler on public blog | Provider may be verified | Allow only if training policy permits |
| Unknown bot on expensive search route | Weak | Rate limit or challenge |
| Signed agent calling checkout API | Stronger | Require user mandate and scoped permission |

This layered thinking prevents a common mistake: "We verified it, so it is allowed." Verification is not authorization. It only improves confidence about the actor.

Cloudflare's verified bots documentation frames verified bots as bots that Cloudflare has validated as legitimate for purposes such as search engine crawlers and monitoring services (Cloudflare verified bots, 2026-06-29). That helps with identity. It does not decide whether a verified actor should access /checkout/complete.

AI bot identity and permission layers showing declared user agent, verification, signed request, route risk, auth context, policy decision, and audit log.

What is the difference between search, answer, and training traffic?

Search, answer, and training traffic are often discussed together because they can all fetch public pages. The business question behind each one is different.

Search crawlers

Search crawlers support discovery in search results. Most sites want these crawlers to access canonical public pages, while excluding private, duplicate, or low-value routes.

Good default:

| Route | Search crawler policy |
| --- | --- |
| Homepage, core landing pages | Allow |
| Blog, docs, help center | Allow if current and canonical |
| Product and category pages | Allow if intended to rank |
| Internal search results | Usually limit or block |
| Faceted duplicates | Control with canonicals, route rules, or parameters |
| Cart, checkout, account | Block or require auth |
| Staging and previews | Block and protect |

Answer and retrieval bots

Answer or retrieval bots fetch pages to support generated answers, citations, search-grounded responses, or assistant summaries. A page can be good for search but bad for answers if it is stale, contradictory, or missing direct definitions.

Answer readiness depends on:

| Signal | Why it matters |
| --- | --- |
| Canonical page | Avoids conflicting generated answers |
| Updated date | Helps freshness-sensitive systems |
| Direct definitions | Lets assistants quote cleanly |
| Source links | Builds trust and auditability |
| Structured sections | Makes extraction easier |
| Stable policy pages | Prevents old terms from being cited |

If your old refund page is still public and internally linked, an answer bot may retrieve it. That is not the bot's fault. It is a content governance failure.

Training crawlers

Training crawlers collect content for model development or training corpora. That is a different decision than answer retrieval. A company may want its current docs cited in answers but not used as training input. Another company may allow public blog training but exclude user-generated forums or license-restricted material.

Policy should therefore separate:

| Policy dimension | Example decision |
| --- | --- |
| Search indexing | Allow canonical docs and product pages |
| Answer retrieval | Allow current docs, pricing, and support pages |
| Training use | Exclude docs, allow blog, or decide by license |
| User-directed read | Allow public pages when requested by a user |
| Action execution | Require auth, scopes, and user approval |

Search, answer, and training traffic should not share one rule. Search crawlers support indexing. Answer bots fetch source material for assistant responses. Training crawlers collect content for model development. A business may allow one and refuse another, so route rules, policy files, and logs should preserve the purpose distinction.

This is where legal, SEO, security, and support teams need the same vocabulary. Otherwise, "allow AI bots" means one thing to growth and another to counsel.

What makes user-directed fetchers different?

User-directed fetchers are automated requests caused by a user's intent. A user pastes a URL into an assistant and asks it to summarize, compare, translate, explain, or evaluate. The fetch is automated, but it is closer to assisted browsing than bulk crawling.

OpenAI documents user-facing agents separately from crawlers, including ChatGPT-User for user actions (OpenAI crawler docs, 2026-06-29). Google also describes fetchers that may be triggered by user actions, separate from broad crawler behavior (Google Search Central, 2026-06-29).

That category is awkward because old bot controls were built for site-owner intent, not user intent.

| User task | Suggested treatment |
| --- | --- |
| Summarize a public blog post | Allow if public content policy allows retrieval |
| Compare two public product pages | Allow, with rate limits if needed |
| Read public support docs | Allow current canonical docs |
| Summarize a paywalled article | Enforce paywall and session access |
| Read an account invoice | Require user authentication |
| Fill a support form | Require CSRF, session controls, and explicit submit boundary |
| Prepare checkout | Allow cart preparation where appropriate |
| Complete purchase | Require user approval and payment authorization |

The wrong move is blanket blocking every user-directed fetcher because it contains "AI" in the identity. That can make your public documentation harder for customers to use with their own tools. The other wrong move is treating user intent as universal permission. A user's assistant should not bypass auth, paywalls, consent, or fraud controls.

How do browser agents change bot policy?

Browser agents are not classic crawlers. They click, type, scroll, wait, retry, open modals, fill forms, and sometimes carry a user's authenticated session.

That means bot policy moves from "can this actor read the page?" to "can this actor perform this task?"

| Browser-agent action | Risk | Control |
| --- | --- | --- |
| Read public article | Low | Allow public access |
| Filter product catalog | Medium if expensive | Rate limit, cache, expose state |
| Add item to cart | Medium | Session controls, duplicate guards |
| Submit lead form | Medium to high | CSRF, spam controls, confirmation |
| Change account setting | High | Auth, reauth for sensitive fields |
| Cancel subscription | High | Explicit confirmation and audit log |
| Complete checkout | High | User mandate, payment controls, fraud checks |
| Download bulk data | High | Auth, scopes, quotas, abuse detection |

This is where UX and security meet. If a browser agent cannot tell whether "Continue" means "next step" or "charge the card," security risk increases. If the site cannot distinguish a product-filtering task from a checkout-completion task, edge policy becomes too coarse.

Browser agents also make bot detection harder. A legitimate agent might look like a normal browser, because it is using one. A hostile scraper might also use a browser. Behavior signals help, but they are not enough for high-trust actions. For those, identity and authorization need to move into signed delegation, user approval, OAuth-like scopes, or payment mandates.

Security research on autonomous LLM agents points to the same conclusion from another angle: once agents can plan, call tools, and execute multi-step tasks, identity alone is not enough. Systems need scoped authority, runtime checks, audit trails, and recovery paths for tool misuse or prompt-driven redirection (Mao et al., "SoK, Security of Autonomous LLM Agents in Agentic Commerce", 2026-06-29).

Browser agents turn bot policy from page access into task authorization. Reading a public article, filtering products, submitting a form, changing account settings, and completing checkout have different risk profiles. A browser-like request is not automatically human or hostile, so high-risk actions need auth, user approval, audit logs, and duplicate guards.

How are protocol clients different from browser agents?

Protocol clients call structured machine interfaces: REST APIs, GraphQL, MCP tools, A2A endpoints, UCP commerce profiles, or internal agent APIs. They can be safer than browser automation because the action and schema are explicit. They can also be more dangerous because they bypass the friction and visibility of the UI.

| Client path | Strength | Risk |
| --- | --- | --- |
| Browser agent | Works on today's web, uses existing auth | Ambiguous UI state, fragile interaction, hard identity |
| REST or GraphQL API | Mature auth and rate limits | Overbroad endpoints, hidden business controls |
| MCP tool | Agent-friendly action schema | Tool may be too powerful or underspecified |
| A2A endpoint | Agent-to-agent task exchange | Delegation and lifecycle complexity |
| UCP commerce path | Shared commerce objects and payment handlers | Requires strict cart, mandate, and payment controls |
| Signed-agent channel | Stronger identity proof | Still needs route policy and user authorization |

The policy rule is simple: a protocol client should not get a weaker business control than the browser route.

If the UI requires reauthentication before changing payout details, the API should too. If the browser checkout requires final review, the UCP or API checkout path needs an equivalent mandate. If the support portal rate limits exports, the MCP tool should not allow unlimited export just because the schema is neat.

What does signed-agent identity add?

Signed-agent work tries to solve the gap between declared identity and trusted identity. Instead of relying only on an IP list or user-agent string, an agent can present a cryptographic proof that the request came from a known actor or delegated user.

Cloudflare's Web Bot Auth proposal describes a way for bots to sign requests using public key cryptography, allowing websites to verify bot identity more reliably than user-agent strings alone (Cloudflare Web Bot Auth, 2026-06-29). The direction is important: bots need something closer to authentication.

Signed identity helps answer:

| Question | Signed identity contribution |
| --- | --- |
| Did this request come from the claimed agent? | Signature verification |
| Is the key controlled by a known issuer? | Public key registry or trust anchor |
| Was this request modified in transit? | Request binding |
| Is the signature fresh? | Timestamp and nonce |
| Is this agent acting for a user? | Delegation token or user consent record |

Signed identity does not answer:

| Question | Missing layer |
| --- | --- |
| Is this route public? | Route policy |
| Can this user access this account? | Authentication and authorization |
| Can this agent perform this action? | Scopes and consent |
| Is this payment approved? | Payment mandate or wallet authorization |
| Is this request abusive at scale? | Rate limits and abuse controls |

Think of signed agents as a stronger identity layer, not a magic allowlist. The phrase "verified agent" should never mean "allowed everywhere."
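To make the signature, request-binding, and freshness checks concrete, here is a rough sketch using only the Python standard library. It swaps in a shared-secret HMAC for the public-key signatures that Web Bot Auth describes, so it illustrates the shape of the checks rather than that proposal's wire format. The function names, the signing string layout, the 300-second window, and the in-memory nonce set are all assumptions.

```python
import hashlib
import hmac
import time

MAX_SKEW_SECONDS = 300  # assumed freshness window
_seen_nonces = set()    # stand-in for a shared nonce store with expiry


def signing_input(method: str, path: str, timestamp: int, nonce: str) -> bytes:
    """Bind the signature to the method, path, time, and nonce."""
    return f"{method}\n{path}\n{timestamp}\n{nonce}".encode()


def sign_request(key: bytes, method: str, path: str, timestamp: int, nonce: str) -> str:
    """HMAC-SHA256 stand-in for a real public-key signature."""
    message = signing_input(method, path, timestamp, nonce)
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_request(key, method, path, timestamp, nonce, signature, now=None) -> str:
    """Return 'verified', 'stale', 'bad_signature', or 'replayed'."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return "stale"
    expected = sign_request(key, method, path, timestamp, nonce)
    if not hmac.compare_digest(expected, signature):
        return "bad_signature"  # wrong key, or the request was altered
    if nonce in _seen_nonces:
        return "replayed"
    _seen_nonces.add(nonce)
    return "verified"
```

A "verified" result here only feeds the identity layer. Route policy, user authorization, and scopes still decide whether the request proceeds.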

A signed-agent request flow showing agent key, request signature, edge verification, route policy, user authorization, scoped action, and audit log.

How should route risk drive policy?

Route risk is the part many teams skip. They write bot rules by actor and forget that URLs differ wildly in sensitivity, cost, and action power.

Start with route classes:

| Route class | Examples | Default AI traffic policy |
| --- | --- | --- |
| Public content | Blog, docs, guides, homepage | Allow search and answer retrieval; decide training separately |
| Canonical commercial pages | Pricing, products, plans | Allow search, answer, and user-directed read |
| Dynamic discovery | Internal search, filters, recommendations | Allow carefully with caching and rate limits |
| Expensive public tools | Calculators, validators, previews | Rate limit, cache, require API key if abused |
| Forms | Lead, contact, support, newsletter | Require CSRF, spam controls, confirmation |
| Cart | Cart read and edit | Session controls, duplicate guards, rate limits |
| Checkout and payment | Shipping, taxes, payment, order completion | Require auth, user approval, fraud controls |
| Account | Profile, billing, invoices, settings | Require auth and step-up for sensitive actions |
| Admin and internal | CMS, dashboards, preview, debug | Deny public automation |
| APIs and MCP tools | Machine endpoints | Require auth, scopes, schema validation, audit logs |

Then combine route risk with bot class:

| Bot class | Public docs | Product pages | Search/filter | Cart | Checkout | Account | API/tool |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Search crawler | Allow if canonical | Allow if indexable | Limit | Deny | Deny | Deny | Deny |
| Answer/retrieval | Allow current docs | Allow current pages | Limit | Deny | Deny | Deny | Deny |
| Training crawler | Policy-specific | Policy-specific | Usually deny | Deny | Deny | Deny | Deny |
| User-directed fetcher | Allow public | Allow public | Limit | Auth/session | Auth plus approval | Auth | Scope |
| Browser agent | Allow public | Allow public | Limit | Session controls | User mandate | Auth plus step-up | Scope |
| Protocol client | N/A or allow docs | N/A | Scope | Scope | Scope plus mandate | Scope | Scope |
| Signed agent | Allow by policy | Allow by policy | Rate and scope | Session or scope | User mandate | Auth plus scope | Scope |
| Unknown automation | Observe or limit | Limit | Rate limit | Challenge | Deny | Deny | Deny |

This matrix turns the vague debate into engineering work. You can implement it at the CDN, WAF, app middleware, API gateway, and product layer.
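One way to carry the matrix above into code is a plain lookup table that fails closed. The sketch below is an assumption about structure, not a prescribed implementation: the snake_case labels approximate the matrix cells, and `lookup_policy` is a hypothetical helper name.

```python
# Route columns, in the same order as the matrix above.
ROUTES = ("public_docs", "product_page", "search", "cart", "checkout", "account", "api")

# One row per bot class; labels are snake_case approximations of the cells.
POLICY_MATRIX = {
    "search_crawler": ("allow_if_canonical", "allow_if_indexable", "limit",
                       "deny", "deny", "deny", "deny"),
    "answer_retrieval": ("allow_current_docs", "allow_current_pages", "limit",
                         "deny", "deny", "deny", "deny"),
    "training_crawler": ("policy_specific", "policy_specific", "usually_deny",
                         "deny", "deny", "deny", "deny"),
    "user_directed_fetcher": ("allow_public", "allow_public", "limit",
                              "auth_or_session", "auth_plus_approval", "auth", "scope"),
    "browser_agent": ("allow_public", "allow_public", "limit",
                      "session_controls", "user_mandate", "auth_plus_step_up", "scope"),
    "protocol_client": ("na_or_allow_docs", "na", "scope",
                        "scope", "scope_plus_mandate", "scope", "scope"),
    "signed_agent": ("allow_by_policy", "allow_by_policy", "rate_and_scope",
                     "session_or_scope", "user_mandate", "auth_plus_scope", "scope"),
    "unknown": ("observe_or_limit", "limit", "rate_limit",
                "challenge", "deny", "deny", "deny"),
}


def lookup_policy(traffic_class: str, route_class: str) -> str:
    """Return the matrix cell for a class and route; fail closed on gaps."""
    if route_class not in ROUTES:
        return "deny"  # unclassified routes stay denied until someone labels them
    # Classes missing from the table are treated as unknown automation.
    row = POLICY_MATRIX.get(traffic_class, POLICY_MATRIX["unknown"])
    return row[ROUTES.index(route_class)]
```

Keeping the table as data rather than branching logic makes it easier to review with legal, SEO, and security in the same meeting, and to diff when the policy changes.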

A route-based AI bot policy matrix mapping public docs, product pages, search, cart, checkout, account, and API routes against bot classes and access decisions.

What should verification actually check?

Verification methods vary by provider and infrastructure. Treat them as signals with different strengths.

| Verification method | Strength | Weakness |
| --- | --- | --- |
| User-agent match | Easy first signal | Trivial to spoof |
| Published IP range | Stronger for known providers | Operational maintenance, cloud churn |
| Reverse DNS plus forward DNS | Common for major crawlers | Provider-specific and not universal |
| CDN verified bot flag | Easy at edge | Depends on CDN vendor classification |
| Signed request | Strong identity proof | Ecosystem still emerging |
| OAuth or API key | Strong for protocol clients | Not suited to public crawlers |
| Session authentication | User-specific access | Does not identify the agent cleanly |
| Behavior model | Catches abuse patterns | Can confuse legitimate agents and attackers |

For major search crawlers, reverse DNS verification remains a common technique. For newer AI traffic, provider docs and CDN bot products may be more practical. For high-risk actions, move beyond crawler verification and require user auth, scopes, and explicit approval.
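The reverse DNS plus forward DNS check can be sketched roughly as follows. The hostname suffixes are an assumption modeled on Googlebot-style hostnames, so confirm them against the provider's current documentation. The resolver functions are injectable (a design choice made here for testing, not something the source prescribes), and `verify_crawler_ip` is a hypothetical name.

```python
import socket

# Assumed suffixes; confirm against the provider's published guidance.
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> list[str]:
    """Forward lookup: hostname to the set of addresses it resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def verify_crawler_ip(ip, suffixes=ALLOWED_SUFFIXES,
                      reverse=reverse_lookup, forward=forward_lookup) -> str:
    """Return 'reverse_dns_passed' only if both lookups agree, else 'failed'."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return "failed"
    if not hostname.endswith(suffixes):
        return "failed"  # hostname is outside the provider's domains
    try:
        addresses = forward(hostname)
    except OSError:
        return "failed"
    # The forward lookup must map back to the original IP; otherwise the
    # PTR record may simply be forged by whoever controls the address block.
    return "reverse_dns_passed" if ip in addresses else "failed"
```

In practice the result would be cached per IP for a bounded time, since two DNS lookups per request are too slow for an edge decision path.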

Verification should produce a normalized value:

{
  "declared_identity": "GPTBot",
  "normalized_actor": "openai",
  "traffic_class": "training_crawler",
  "verification_method": "published_provider_rule",
  "verification_status": "verified",
  "confidence": "high"
}

Then route policy decides what happens next.

What should logs capture?

If logs keep only raw user-agent strings, you cannot debug AI traffic. You need the classification and the decision reason.

Log at least:

| Field | Example |
| --- | --- |
| Request ID | req_01h... |
| Timestamp | 2026-06-29T10:31:54Z |
| Method and path | GET /docs/agent-card-discovery |
| Declared user agent | Raw header |
| Normalized actor | openai, google, perplexity, anthropic, unknown |
| Traffic class | search, answer, training, user_directed, browser_agent, protocol_client, signed_agent, unknown |
| Verification status | verified, signed, reverse_dns_passed, unverified, failed |
| Route class | public_docs, product_page, search, cart, checkout, account, api |
| Auth context | anonymous, user_session, service_token, delegated_user |
| Policy decision | allow, deny, challenge, rate_limit, require_auth |
| Decision reason | answer_bot_allowed_public_docs |
| Status code | 200, 403, 429 |
| Cache status | hit, miss, bypass |
| Cost marker | cheap, dynamic, expensive, write_action |
| User impact | none, assisted_read, attempted_action |

The decision_reason field is the difference between useful logs and archaeology.

Example log record:

{
  "request_id": "req_7az4",
  "method": "GET",
  "path": "/guides/returns-policy",
  "declared_user_agent": "ChatGPT-User",
  "normalized_actor": "openai",
  "traffic_class": "user_directed_fetcher",
  "verification_status": "verified",
  "route_class": "public_docs",
  "auth_context": "anonymous",
  "policy_decision": "allow",
  "decision_reason": "user_directed_public_docs_allowed",
  "status_code": 200,
  "cache_status": "hit"
}

AI bot logs should record declared identity, normalized actor, traffic class, verification status, route class, auth context, policy decision, decision reason, status code, and cache status. Cloudflare's Web Bot Auth points toward stronger identity proof, but logs still need route and permission context to explain why a request was allowed or denied.

Logs should not collect sensitive user data just because an agent is involved. Record classification and decisions, not private form values.
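A small sketch of a log writer that enforces both points: it refuses to emit a record without a decision and reason, and it copies only an allowlist of fields so private form values never reach the log. The field list mirrors the example record above; the function name `build_log_record` and the `form_values` key in the usage example are hypothetical.

```python
import json

# Allowlisted fields, taken from the example log record above.
LOG_FIELDS = (
    "request_id", "method", "path", "declared_user_agent", "normalized_actor",
    "traffic_class", "verification_status", "route_class", "auth_context",
    "policy_decision", "decision_reason", "status_code", "cache_status",
)

REQUIRED_FIELDS = ("policy_decision", "decision_reason")


def build_log_record(event: dict) -> str:
    """Serialize a policy event, keeping only allowlisted fields."""
    missing = [name for name in REQUIRED_FIELDS if not event.get(name)]
    if missing:
        raise ValueError(f"log record is missing: {', '.join(missing)}")
    # Anything not in LOG_FIELDS (form bodies, tokens, cookies) is dropped.
    record = {name: event[name] for name in LOG_FIELDS if name in event}
    return json.dumps(record, sort_keys=True)
```

For example, an event carrying a hypothetical `form_values` key would be logged without it:

```python
line = build_log_record({
    "request_id": "req_7az4",
    "path": "/guides/returns-policy",
    "policy_decision": "allow",
    "decision_reason": "user_directed_public_docs_allowed",
    "form_values": {"email": "person@example.com"},  # dropped by the allowlist
})
```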

An AI bot traffic log dashboard showing actor, traffic class, verification status, route class, decision reason, status code, and cache status.

What are bad AI bot policies?

Bad policies usually sound simple.

Bad policy 1: block all AI

This blocks useful user-directed retrieval, answer citation, and public documentation access. It may be defensible for a narrow site with sensitive content, but many businesses will harm support, discovery, and customer workflows if they apply it globally.

Better: block or limit specific classes on specific route types.

Bad policy 2: allow all verified bots

Verification proves identity, not permission. A verified crawler should not access checkout, account, staging, debug, or write endpoints.

Better: combine verified identity with route policy.

Bad policy 3: treat public as training-approved

Public access does not automatically settle training use. This is a business, legal, and licensing decision.

Better: separate search, answer retrieval, and training in policy language.

Bad policy 4: user-agent string as the source of truth

User-agent strings are declarations. Unknown automation can spoof them.

Better: use user agent only as an input into verification and classification.

Bad policy 5: no reason in logs

Without a reason field, teams cannot tell whether a request was denied because of route risk, failed verification, missing auth, rate limit, or a broken rule.

Better: log normalized class, route class, decision, and reason.

How should crawl policy, terms, and technical enforcement line up?

Policy signals should agree across layers. If your public terms say training is not allowed, your crawl policy and edge rules should not imply the opposite. If your docs are meant to appear in AI answers, do not hide canonical pages while leaving stale PDFs open.

Use a three-layer model:

| Layer | Role |
| --- | --- |
| Published preference | Crawl policy file, terms, AI access policy, license notices |
| Technical verification | Provider docs, IP/rDNS checks, CDN verified bot, signed request |
| Enforcement | Edge rules, auth, rate limits, route policy, API scopes |

The published preference helps cooperative actors. Technical verification improves identity confidence. Enforcement protects routes when cooperation or identity is not enough.

Do not make a crawl policy file carry more weight than it can. It is not authentication. It is not authorization. It is one input into a larger system.

How does caching affect AI bot traffic?

AI retrieval can make stale-content problems worse. If a docs page is cached for a long time after a policy change, an answer bot may fetch an old version and cite it. If an expensive route is not cached at all, harmless retrieval can become a cost problem.

Cache policy should vary by route:

| Route | Cache approach |
| --- | --- |
| Blog and docs | Cache strongly, purge on update |
| Pricing | Cache carefully, purge on change |
| Product pages | Cache public shell, update availability carefully |
| Search results | Cache popular queries, rate limit high-cardinality filters |
| Cart and checkout | Do not public-cache user state |
| Account | Private cache only, auth required |
| API tools | Use quotas and explicit freshness rules |

For AI answer readiness, freshness is a policy problem as much as a content problem. A stale page with good structure is still a stale source.
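The route-by-route cache approach could be expressed as a mapping from route class to a `Cache-Control` header value. This is a sketch under assumptions: the TTL numbers are placeholders rather than recommendations, the route class names follow the log fields used earlier plus a hypothetical `pricing` class, and purge-on-update is assumed to happen elsewhere in the publishing pipeline.

```python
# Cache-Control per route class. TTL values are placeholders (assumptions).
CACHE_CONTROL_BY_ROUTE = {
    "public_docs": "public, max-age=3600",   # cache strongly, purge on update
    "pricing": "public, max-age=300",        # cache carefully, purge on change
    "product_page": "public, max-age=300",   # public shell only
    "search": "public, max-age=60",          # short TTL for popular queries
    "cart": "private, no-store",             # never public-cache user state
    "checkout": "private, no-store",
    "account": "private, no-cache",          # private cache, always revalidate
}


def cache_control_for(route_class: str) -> str:
    """Return the header value for a route class; unknown routes are not stored."""
    return CACHE_CONTROL_BY_ROUTE.get(route_class, "no-store")
```

Defaulting unknown routes to `no-store` trades some cache efficiency for the guarantee that an unlabeled route never leaks user state through a shared cache.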

What is the implementation plan?

Start with observation. Then add policy. Then enforce.

Phase 1: classify and log

  1. Inventory routes and assign route classes.
  2. Normalize known crawler user agents into traffic classes.
  3. Add verification status where possible.
  4. Add policy decision and reason fields to logs.
  5. Build a dashboard by actor, class, route, decision, and status code.
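Step 1 of this phase might start as an ordered list of path patterns, as in the sketch below. The path prefixes are hypothetical and will differ per site, and the `route_class` helper name is an assumption. Rules are checked in order, so more sensitive routes are listed first.

```python
import re

# Hypothetical path prefixes; replace with your own route inventory.
ROUTE_RULES = [
    (re.compile(r"^/(admin|preview|debug)(/|$)"), "admin_internal"),
    (re.compile(r"^/checkout(/|$)"), "checkout"),
    (re.compile(r"^/cart(/|$)"), "cart"),
    (re.compile(r"^/account(/|$)"), "account"),
    (re.compile(r"^/api(/|$)"), "api"),
    (re.compile(r"^/search(/|$)"), "search"),
    (re.compile(r"^/products(/|$)"), "product_page"),
    (re.compile(r"^/(blog|docs|guides)(/|$)"), "public_docs"),
]


def route_class(path: str) -> str:
    """Map a request path to a route class; the first matching rule wins."""
    clean = path.split("?", 1)[0]  # ignore the query string
    for pattern, label in ROUTE_RULES:
        if pattern.search(clean):
            return label
    # Surface gaps in the inventory instead of guessing a class.
    return "unclassified"
```

Counting `unclassified` hits in the dashboard is a cheap way to find routes the inventory missed.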

Phase 2: publish policy

  1. Decide search, answer, training, and user-directed retrieval preferences.
  2. Align crawl policy files, terms, and public AI access pages.
  3. Remove stale public pages that should not ground answers.
  4. Make canonical public docs clear and current.

Phase 3: enforce route rules

  1. Allow search and answer retrieval on public canonical pages.
  2. Separate training rules from answer rules.
  3. Rate limit dynamic discovery routes.
  4. Require auth for account, checkout, and protected APIs.
  5. Require scoped tokens or signed requests for agent endpoints.
  6. Add user approval and idempotency for high-risk actions.
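For the last step, a rough sketch of user approval combined with idempotency, assuming the client sends an idempotency key with each high-risk action. The in-memory store and the names `IdempotencyStore` and `perform_high_risk_action` are illustrative; a real deployment would need a shared store with expiry.

```python
from typing import Any, Callable


class IdempotencyStore:
    """In-memory stand-in for a shared store keyed by idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def run_once(self, key: str, action: Callable[[], Any]) -> tuple[Any, bool]:
        """Run the action the first time a key is seen; replay the result after."""
        if key in self._results:
            return self._results[key], True
        result = action()
        self._results[key] = result
        return result, False


def perform_high_risk_action(store: IdempotencyStore, idempotency_key: str,
                             user_approved: bool, action: Callable[[], Any]) -> dict:
    """Require explicit user approval, then execute at most once per key."""
    if not user_approved:
        return {"decision": "deny", "reason": "missing_user_approval"}
    result, replayed = store.run_once(idempotency_key, action)
    return {"decision": "allow", "replayed": replayed, "result": result}
```

With this shape, an agent that retries a checkout after a timeout gets the original result back instead of creating a second order.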

Phase 4: test with real scenarios

Test these cases:

| Scenario | Expected result |
| --- | --- |
| Search crawler fetches canonical guide | 200 allowed |
| Answer bot fetches stale policy URL | 301 to canonical or 410 |
| Training crawler fetches excluded docs | Denied or policy-specific response |
| User-directed fetcher reads public support page | 200 allowed |
| Browser agent filters product catalog quickly | Allowed with rate limits and cache |
| Browser agent submits lead form repeatedly | Spam control or rate limit |
| Signed agent starts checkout | Allowed only with session and user mandate |
| Unknown bot hits account route | Deny or require auth |

This is not a one-time checklist. AI traffic categories will change. Keep the taxonomy in code and logs, not just in a slide deck.

FAQ

What is AI bot traffic taxonomy?

AI bot traffic taxonomy is a classification model for automated AI-related requests. It separates search crawlers, answer bots, training crawlers, user-directed fetchers, browser agents, protocol clients, signed agents, and unknown automation so teams can apply route-specific policy instead of one broad allow/block rule.

Is a crawl policy file enough for AI bot control?

No. A crawl policy file expresses preferences to cooperative crawlers, but it does not prove identity or authorize actions. Website teams still need verification, route-level policy, authentication for protected areas, rate limits, and logs that explain each allow or deny decision.

Should answer bots and training crawlers get the same access?

Not necessarily. Many sites want current public docs, pricing, and support pages to appear in AI answers, while applying different rules to model training. Separate search, answer retrieval, and training in policy language, technical rules, and logs.

What is the difference between a browser agent and a crawler?

A crawler usually reads pages for indexing, retrieval, or training. A browser agent operates the site through the UI: clicking buttons, filling forms, opening modals, adding items to carts, or completing tasks. Browser agents need task authorization, not just page access policy.

What does signed-agent identity solve?

Signed-agent identity helps prove that a request came from the claimed agent or issuer. It is stronger than trusting a user-agent string. It does not decide whether the route is public, whether the user is authorized, whether the action is safe, or whether a payment is approved.

Conclusion

AI bot traffic needs taxonomy before policy.

Search crawlers, answer bots, training crawlers, user-directed fetchers, browser agents, protocol clients, signed agents, and unknown automation all touch the web differently. Some read. Some retrieve for a user. Some collect for training. Some act.

The practical model is purpose, proof, route risk, and permission. A user-agent string declares identity. Verification improves confidence. Signed requests can improve it further. None of those replace authorization.

Treat "AI bot" as the beginning of the investigation, not the answer.
