# Evidence Mode

## The problem
Method mode gives your AI the right methods. Your AI still produces unverifiable output.
Your AI can call analyze_page, receive structured data, and then claim "site A has better performance than site B" without citing what it measured, without noting that security data failed to load, and without adjusting its confidence to reflect that gap.
Three failure modes recur:
- Untyped output. Findings are free-text prose. Nothing enforces required fields.
- Invisible gaps. When a data source times out, nothing records that absence.
- Uncalibrated confidence. Your AI claims "high confidence" regardless of whether supporting evidence loaded.
Evidence mode addresses all three.
## What evidence mode adds
Evidence mode is what you get when method mode's composite tools return typed evidence records instead of ad-hoc objects.
| Layer | What it adds | Variance |
|---|---|---|
| Code Mode | Search + execute | Medium |
| Method Mode | Composite tools, normalized output | Low |
| Evidence Mode | Typed records + enforcement | Minimal |
Each layer constrains more. Evidence mode constrains the most.
## Four core concepts
| Concept | What it does |
|---|---|
| Evidence Record | Typed return value from extraction methods. Structured data with explicit null fields. |
| Gap | What data is missing, why it is missing, and whether it affects confidence. |
| Finding | Validated research claim with required fields. Enforced at the tool level. |
| Quality | Computed from gaps. Tells callers how much to trust the evidence. |
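The relationships between these concepts can be sketched as types. This is a minimal TypeScript sketch with illustrative field names, not the exact schema either substrate uses:

```typescript
type Confidence = "high" | "medium" | "low";
type Quality = "complete" | "partial" | "degraded" | "unavailable";

// A gap records what is missing, why, and whether it affects confidence.
interface Gap {
  dimension: string;
  reason: string;
  reducesConfidence: boolean;
}

// An evidence record is structured data with explicit null fields
// rather than silently absent ones.
interface EvidenceRecord {
  evidenceId: string;
  url: string;
  timestamp: string;
  enrichment: { framework: string | null }; // null when enrichment timed out
  gaps: Gap[];
}

// A finding is a validated claim that points back at evidence records.
interface Finding {
  claim: string;
  evidence: string[]; // evidenceIds backing the claim
  confidence: Confidence;
}
```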
## Evidence on the Crawlio App side

### Evidence records

Each `analyze_page` call creates an observation log entry and returns an `evidenceId`. The record includes:

- `url`, `timestamp`, `captureTriggered`
- `enrichment` (framework, network, console, DOM)
- `enrichmentStatus` (`ok` or `timeout`)
- `crawlStatus` (download state for the URL)
### Evidence gaps

Gaps are tied to HTTP outcomes:

| Gap | Triggered by |
|---|---|
| `captureRejected` | Capture endpoint returned non-202 |
| `captureUnreachable` | Capture transport failure |
| `enrichmentTimeout` | Polling exhausted without data |
| `enrichmentError` | Server error during polling |
| `crawlStatusMissing` | No crawl data for the URL |
### Evidence quality

Quality is computed from gaps:

| Quality | Condition |
|---|---|
| `complete` | No gaps |
| `partial` | Gaps present but capture succeeded |
| `degraded` | Capture-level failure |
| `unavailable` | Cannot produce usable evidence |
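One plausible mapping from recorded gaps to the quality enum, sketched in TypeScript. The actual server-side rules may differ in detail, for example in exactly which gaps count as capture-level failures:

```typescript
type Quality = "complete" | "partial" | "degraded" | "unavailable";

function evidenceQuality(gaps: string[]): Quality {
  if (gaps.includes("captureUnreachable")) return "unavailable"; // no usable evidence
  if (gaps.includes("captureRejected")) return "degraded";       // capture-level failure
  if (gaps.length > 0) return "partial";                         // gaps, but capture succeeded
  return "complete";                                             // no gaps
}
```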
### Comparison readiness

`compare_pages` produces a readiness signal from both sides:

| Readiness | Condition |
|---|---|
| `ready` | Both sides complete |
| `cautious` | One side partial |
| `unreliable` | Either side degraded |
The comparison also includes `symmetric` (whether both sides have the same gap profile), `degradationNotes`, `timingDelta`, and paired `evidenceId` values for verification via `get_observation`.
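The readiness rules above reduce to a small function over both sides' quality. A sketch, with illustrative names:

```typescript
type Quality = "complete" | "partial" | "degraded" | "unavailable";
type Readiness = "ready" | "cautious" | "unreliable";

function comparisonReadiness(a: Quality, b: Quality): Readiness {
  const failed = (q: Quality) => q === "degraded" || q === "unavailable";
  if (failed(a) || failed(b)) return "unreliable";          // either side degraded
  if (a === "complete" && b === "complete") return "ready"; // both sides complete
  return "cautious";                                        // at least one side partial
}
```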
### Creating findings
Use create_finding to record a research claim with evidence:
```
create_finding(
  title: "React hydration errors on /products",
  url: "https://example.com/products",
  evidence: ["obs-1", "obs-2"],
  synthesis: "SSR/client HTML mismatch causing hydration failures.",
  confidence: "high",
  category: "framework"
)
```

Findings are persisted in the observation log. Retrieve them with `get_findings`.
## Evidence on the browser side
The browser agent implements the same concepts with substrate-appropriate mechanisms.
### Typed findings

`smart.finding()` validates every research claim synchronously. Required fields: `claim`, `evidence` (array of strings), `sourceUrl`, `confidence`, `method`, `dimension`. Malformed input throws immediately.
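A hypothetical validator mirroring that contract: every required field must be present, and `evidence` must be an array of strings. `smart.finding()` itself is part of the browser agent runtime; this sketch only illustrates the throw-on-malformed-input behavior:

```typescript
const REQUIRED_FIELDS = [
  "claim", "evidence", "sourceUrl", "confidence", "method", "dimension",
] as const;

function validateFinding(input: Record<string, unknown>): void {
  for (const field of REQUIRED_FIELDS) {
    if (input[field] == null) {
      throw new Error(`finding missing required field: ${field}`);
    }
  }
  const ev = input.evidence;
  if (!Array.isArray(ev) || !ev.every((e) => typeof e === "string")) {
    throw new Error("finding.evidence must be an array of strings");
  }
}
```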
### Coverage gaps

When `extractPage()` runs, it fires seven parallel operations. If supplementary calls fail, the field returns as `null` and a gap is recorded:

```
{
  dimension: "performance",
  reason: "CDP domain disabled",
  impact: "method-failed",
  reducesConfidence: true
}
```

Not all gaps reduce confidence:
| Supplementary call | Dimension | Reduces confidence |
|---|---|---|
| Performance metrics | `performance` | Yes |
| Security state | `security` | Yes |
| Font detection | `fonts` | No |
| Accessibility tree | `accessibility` | No |
| Mobile readiness | `mobile-readiness` | No |
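The fail-to-null-plus-gap pattern can be sketched as a small wrapper around each supplementary call. The helper name and shape are illustrative, not the runtime's actual internals:

```typescript
interface CoverageGap {
  dimension: string;
  reason: string;
  impact: "method-failed";
  reducesConfidence: boolean;
}

// Per the table above, only performance and security gaps reduce confidence.
const REDUCES_CONFIDENCE = new Set(["performance", "security"]);

// Run one supplementary call; on failure, return null for the field
// and append a gap to the session's gap list.
function withGap<T>(
  dimension: string,
  call: () => T,
  gaps: CoverageGap[],
): T | null {
  try {
    return call();
  } catch (err) {
    gaps.push({
      dimension,
      reason: String(err),
      impact: "method-failed",
      reducesConfidence: REDUCES_CONFIDENCE.has(dimension),
    });
    return null;
  }
}
```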
### Confidence propagation

When a finding's dimension matches an active gap with `reducesConfidence: true`, the runtime caps confidence one level down:
| Input | Active gap? | Output |
|---|---|---|
| `high` | Yes | `medium` (capped, with `cappedBy` field) |
| `medium` | Yes | `low` (capped) |
| `low` | Yes | `low` (floor) |
| any | No | unchanged |
This is automatic. Your AI does not choose whether to cap. The runtime enforces it based on what data loaded.
### Session aggregation

`smart.findings()` returns all findings in the session. `smart.clearFindings()` resets both findings and gaps. Gaps are append-only until cleared.
### Comparison scaffolds

`comparePages()` returns a comparison scaffold with 11 fixed dimensions:
| Dimension | Data source |
|---|---|
| `framework` | Detected framework |
| `performance` | Performance metrics + gaps |
| `security` | Security state |
| `seo` | Title, canonical URL, structured data |
| `accessibility` | Accessibility tree summary |
| `error-surface` | Console errors |
| `third-party-load` | Network requests |
| `architecture` | Framework analysis |
| `content-delivery` | Protocol and TLS info |
| `mobile-readiness` | Viewport meta, media queries |
| `data-structure` | Structured data presence |

A dimension is `comparable: true` only when both sides are present.
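A sketch of scaffold construction over a subset of those dimensions. The real `comparePages()` output carries all 11 plus per-side metadata; the builder below is illustrative:

```typescript
interface DimensionEntry {
  a: unknown;
  b: unknown;
  comparable: boolean;
}

const DIMENSIONS = ["framework", "performance", "security"]; // subset of the 11

function buildScaffold(
  sideA: Record<string, unknown>,
  sideB: Record<string, unknown>,
): Record<string, DimensionEntry> {
  const scaffold: Record<string, DimensionEntry> = {};
  for (const d of DIMENSIONS) {
    const a = sideA[d] ?? null;
    const b = sideB[d] ?? null;
    // comparable only when both sides produced data
    scaffold[d] = { a, b, comparable: a !== null && b !== null };
  }
  return scaffold;
}
```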
## Key differences between substrates
| Aspect | Crawlio App | Browser Agent |
|---|---|---|
| Gap model | Per-HTTP-outcome | Per-CDP-call |
| Quality signal | Explicit enum | Implicit (`gaps.length === 0`) |
| Confidence propagation | Not yet automatic | Automatic via `reducesConfidence` |
| Finding enforcement | `create_finding` HTTP POST | `smart.finding()` synchronous validation |
| Persistence | Observation log (durable) | Session memory (ephemeral) |
| Evidence chain | `evidenceId` via `get_observation` | `findings()` returns copy |
## Known limits

- Sequential comparison. Both substrates compare sites sequentially. Site B timing is affected by site A processing.
- Append-only gaps. Session gaps persist until explicitly cleared. Old gaps affect later findings.
- One-level confidence cap. A gap drops `high` to `medium`, not `high` to `low`. Two gaps on different dimensions cap independently but cannot compound on the same finding.
- Crawlio App confidence propagation pending. The app accepts `confidence` on `create_finding` but does not yet auto-adjust it based on evidence quality.
- Accessibility depth limited. The accessibility tree is capped at depth 3.
- Mobile readiness is read-only. Viewport analysis only. No viewport resizing or touch target testing.
## Next steps
- Method Mode: the composite tools that evidence mode builds on
- Code Mode: the search-and-execute pattern underneath
- Tool Reference: all 49 full-mode tools with parameters
- JIT Context: how the aggregator handles context loading