# Evidence Mode

## The problem
Method mode gives your AI the right methods. Your AI still produces unverifiable output.
Your AI can call analyze_page, receive structured data, and then claim "site A has better performance than site B" without citing what it measured, without noting that security data failed to load, and without adjusting its confidence to reflect that gap.
Three failure modes recur:
- Untyped output. Findings are free-text prose. Nothing enforces required fields.
- Invisible gaps. When a data source times out, nothing records that absence.
- Uncalibrated confidence. Your AI claims "high confidence" regardless of whether supporting evidence loaded.
Evidence mode addresses all three.
## What evidence mode adds
Evidence mode is what you get when method mode's composite tools return typed evidence records instead of ad-hoc objects.
| Layer | What it adds | Variance |
|---|---|---|
| Code Mode | Search + execute | Medium |
| Method Mode | Composite tools, normalized output | Low |
| Evidence Mode | Typed records + enforcement | Minimal |
Each layer constrains more. Evidence mode constrains the most.
## Four core concepts
| Concept | What it does |
|---|---|
| Evidence Record | Typed return value from extraction methods. Structured data with explicit null fields. |
| Gap | What data is missing, why it is missing, and whether it affects confidence. |
| Finding | Validated research claim with required fields. Enforced at the tool level. |
| Quality | Computed from gaps. Tells callers how much to trust the evidence. |
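The relationships between these concepts can be sketched as types. This is a minimal TypeScript sketch with illustrative field names, not the exact schema either substrate uses:

```typescript
type Confidence = "high" | "medium" | "low";
type Quality = "complete" | "partial" | "degraded" | "unavailable";

// A gap records what is missing, why, and whether it affects confidence.
interface Gap {
  dimension: string;
  reason: string;
  reducesConfidence: boolean;
}

// An evidence record is structured data with explicit null fields
// rather than silently absent ones.
interface EvidenceRecord {
  evidenceId: string;
  url: string;
  timestamp: string;
  enrichment: { framework: string | null }; // null when enrichment timed out
  gaps: Gap[];
}

// A finding is a validated claim that points back at evidence records.
interface Finding {
  claim: string;
  evidence: string[]; // evidenceIds backing the claim
  confidence: Confidence;
}
```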
## Evidence on the Crawlio App side

### Evidence records

Each `analyze_page` call creates an observation log entry and returns an `evidenceId`. The record includes:

- `url`, `timestamp`, `captureTriggered`
- `enrichment` (framework, network, console, DOM)
- `enrichmentStatus` (`ok` or `timeout`)
- `crawlStatus` (download state for the URL)
### Evidence gaps

Gaps are tied to HTTP outcomes:

| Gap | Triggered by |
|---|---|
| `captureRejected` | Capture endpoint returned non-202 |
| `captureUnreachable` | Capture transport failure |
| `enrichmentTimeout` | Polling exhausted without data |
| `enrichmentError` | Server error during polling |
| `crawlStatusMissing` | No crawl data for the URL |
### Evidence quality

Quality is computed from gaps:

| Quality | Condition |
|---|---|
| `complete` | No gaps |
| `partial` | Gaps present but capture succeeded |
| `degraded` | Capture-level failure |
| `unavailable` | Cannot produce usable evidence |
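One plausible mapping from recorded gaps to the quality enum, sketched in TypeScript. The actual server-side rules may differ in detail, for example in exactly which gaps count as capture-level failures:

```typescript
type Quality = "complete" | "partial" | "degraded" | "unavailable";

function evidenceQuality(gaps: string[]): Quality {
  if (gaps.includes("captureUnreachable")) return "unavailable"; // no usable evidence
  if (gaps.includes("captureRejected")) return "degraded";       // capture-level failure
  if (gaps.length > 0) return "partial";                         // gaps, but capture succeeded
  return "complete";                                             // no gaps
}
```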
### Comparison readiness

`compare_pages` produces a readiness signal from both sides:

| Readiness | Condition |
|---|---|
| `ready` | Both sides complete |
| `cautious` | One side partial |
| `unreliable` | Either side degraded |
The comparison also includes `symmetric` (whether both sides have the same gap profile), `degradationNotes`, `timingDelta`, and paired `evidenceId` values for verification via `get_observation`.
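The readiness rules above reduce to a small function over both sides' quality. A sketch, with illustrative names:

```typescript
type Quality = "complete" | "partial" | "degraded" | "unavailable";
type Readiness = "ready" | "cautious" | "unreliable";

function comparisonReadiness(a: Quality, b: Quality): Readiness {
  const failed = (q: Quality) => q === "degraded" || q === "unavailable";
  if (failed(a) || failed(b)) return "unreliable";          // either side degraded
  if (a === "complete" && b === "complete") return "ready"; // both sides complete
  return "cautious";                                        // at least one side partial
}
```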
### Creating findings
Use create_finding to record a research claim with evidence:
```
create_finding(
  title: "React hydration errors on /products",
  url: "https://example.com/products",
  evidence: ["obs-1", "obs-2"],
  synthesis: "SSR/client HTML mismatch causing hydration failures.",
  confidence: "high",
  category: "framework"
)
```

Findings are persisted in the observation log. Retrieve them with `get_findings`.
## Evidence on the browser side
The browser agent implements the same concepts with substrate-appropriate mechanisms.
### Typed findings

`smart.finding()` validates every research claim synchronously. Required fields: `claim`, `evidence` (array of strings), `sourceUrl`, `confidence`, `method`, `dimension`. Malformed input throws immediately.
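A hypothetical validator mirroring that contract: every required field must be present, and `evidence` must be an array of strings. `smart.finding()` itself is part of the browser agent runtime; this sketch only illustrates the throw-on-malformed-input behavior:

```typescript
const REQUIRED_FIELDS = [
  "claim", "evidence", "sourceUrl", "confidence", "method", "dimension",
] as const;

function validateFinding(input: Record<string, unknown>): void {
  for (const field of REQUIRED_FIELDS) {
    if (input[field] == null) {
      throw new Error(`finding missing required field: ${field}`);
    }
  }
  const ev = input.evidence;
  if (!Array.isArray(ev) || !ev.every((e) => typeof e === "string")) {
    throw new Error("finding.evidence must be an array of strings");
  }
}
```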
### Coverage gaps

When `extractPage()` runs, it fires seven parallel operations. If supplementary calls fail, the field returns as `null` and a gap is recorded:

```
{
  dimension: "performance",
  reason: "CDP domain disabled",
  impact: "method-failed",
  reducesConfidence: true
}
```

Not all gaps reduce confidence:
| Supplementary call | Dimension | Reduces confidence |
|---|---|---|
| Performance metrics | `performance` | Yes |
| Security state | `security` | Yes |
| Font detection | `fonts` | No |
| Accessibility tree | `accessibility` | No |
| Mobile readiness | `mobile-readiness` | No |
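The fail-to-null-plus-gap pattern can be sketched as a small wrapper around each supplementary call. The helper name and shape are illustrative, not the runtime's actual internals:

```typescript
interface CoverageGap {
  dimension: string;
  reason: string;
  impact: "method-failed";
  reducesConfidence: boolean;
}

// Per the table above, only performance and security gaps reduce confidence.
const REDUCES_CONFIDENCE = new Set(["performance", "security"]);

// Run one supplementary call; on failure, return null for the field
// and append a gap to the session's gap list.
function withGap<T>(
  dimension: string,
  call: () => T,
  gaps: CoverageGap[],
): T | null {
  try {
    return call();
  } catch (err) {
    gaps.push({
      dimension,
      reason: String(err),
      impact: "method-failed",
      reducesConfidence: REDUCES_CONFIDENCE.has(dimension),
    });
    return null;
  }
}
```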
### Confidence propagation

When a finding's dimension matches an active gap with `reducesConfidence: true`, the runtime caps confidence one level down:
| Input | Active gap? | Output |
|---|---|---|
| `high` | Yes | `medium` (capped, with `cappedBy` field) |
| `medium` | Yes | `low` (capped) |
| `low` | Yes | `low` (floor) |
| any | No | unchanged |
This is automatic. Your AI does not choose whether to cap. The runtime enforces it based on what data loaded.
### Session aggregation

`smart.findings()` returns all findings in the session. `smart.clearFindings()` resets both findings and gaps. Gaps are append-only until cleared.
### Comparison scaffolds

`comparePages()` returns a comparison scaffold with 11 fixed dimensions:
| Dimension | Data source |
|---|---|
| `framework` | Detected framework |
| `performance` | Performance metrics + gaps |
| `security` | Security state |
| `seo` | Title, canonical URL, structured data |
| `accessibility` | Accessibility tree summary |
| `error-surface` | Console errors |
| `third-party-load` | Network requests |
| `architecture` | Framework analysis |
| `content-delivery` | Protocol and TLS info |
| `mobile-readiness` | Viewport meta, media queries |
| `data-structure` | Structured data presence |

A dimension is `comparable: true` only when both sides are present.
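A sketch of scaffold construction over a subset of those dimensions. The real `comparePages()` output carries all 11 plus per-side metadata; the builder below is illustrative:

```typescript
interface DimensionEntry {
  a: unknown;
  b: unknown;
  comparable: boolean;
}

const DIMENSIONS = ["framework", "performance", "security"]; // subset of the 11

function buildScaffold(
  sideA: Record<string, unknown>,
  sideB: Record<string, unknown>,
): Record<string, DimensionEntry> {
  const scaffold: Record<string, DimensionEntry> = {};
  for (const d of DIMENSIONS) {
    const a = sideA[d] ?? null;
    const b = sideB[d] ?? null;
    // comparable only when both sides produced data
    scaffold[d] = { a, b, comparable: a !== null && b !== null };
  }
  return scaffold;
}
```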
## Key differences between substrates
| Aspect | Crawlio App | Browser Agent |
|---|---|---|
| Gap model | Per-HTTP-outcome | Per-CDP-call |
| Quality signal | Explicit enum | Implicit (`gaps.length === 0`) |
| Confidence propagation | Not yet automatic | Automatic via `reducesConfidence` |
| Finding enforcement | `create_finding` HTTP POST | `smart.finding()` synchronous validation |
| Persistence | Observation log (durable) | Session memory (ephemeral) |
| Evidence chain | `evidenceId` via `get_observation` | `findings()` returns copy |
## Known limits

- Sequential comparison. Both substrates compare sites sequentially. Site B timing is affected by site A processing.
- Append-only gaps. Session gaps persist until explicitly cleared. Old gaps affect later findings.
- One-level confidence cap. A gap drops `high` to `medium`, not `high` to `low`. Two gaps on different dimensions cap independently but cannot compound on the same finding.
- Crawlio App confidence propagation pending. The app accepts `confidence` on `create_finding` but does not yet auto-adjust it based on evidence quality.
- Accessibility depth limited. The accessibility tree is capped at depth 3.
- Mobile readiness is read-only. Viewport analysis only. No viewport resizing or touch target testing.
## Next steps
- Method Mode: the composite tools that evidence mode builds on
- Code Mode: the search-and-execute pattern underneath
- Tool Reference: all 49 full-mode tools with parameters
- JIT Context: how the aggregator handles context loading