AI Enrichment
Overview
Crawlio's enrichment pipeline captures signals that static HTML parsing cannot detect. Three components work together:
- Browser runtime capture. An off-screen WebKit view (or Chrome extension) captures framework detection, network requests, console logs, DOM snapshots, and screenshots.
- Vision OCR. Apple Vision.framework extracts text from downloaded raster images.
- Enrichment store. All enrichment data is stored per-URL, persisted to disk, and scoped per project.
Enrichment data flows into all export formats and is queryable via MCP tools.
Browser runtime capture
What gets captured
| Signal | What it captures |
|---|---|
| Framework detection | Client-side framework markers (React, Vue, Angular, Next.js, etc.) via `window` globals and DOM queries |
| Network requests | Intercepted `fetch()` and `XMLHttpRequest` calls plus `PerformanceObserver` entries |
| Console logs | All `console.log`/`warn`/`error` output |
| DOM snapshot | Full `document.documentElement.outerHTML` after JS execution |
| Screenshot | Page rendering captured as a PNG image |
How it works
- A shared WebKit view loads the target URL
- User scripts injected at document start intercept network calls and console output
- Framework detection JS probes `window` globals (`__NEXT_DATA__`, `__NUXT__`, `__vue_app__`) and DOM markers (`#__next`, `[data-reactroot]`, `astro-island`)
- Results are collected and stored in the enrichment store
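The global and DOM probes above can be sketched as a small page-context script. This is illustrative only: the real detector covers far more technologies and also reports confidence, version, and SSR mode.

```javascript
// Minimal framework probe in the spirit of the injected detection script.
// Checks a few of the window globals and DOM markers listed above.
// Parameters default to the page's own globals; they are injectable
// here purely so the sketch can be exercised outside a browser.
function detectFramework(win = window, doc = document) {
  const signals = [];
  if (win.__NEXT_DATA__ || doc.querySelector("#__next")) signals.push("Next.js");
  if (win.__NUXT__) signals.push("Nuxt");
  if (win.__vue_app__) signals.push("Vue");
  if (doc.querySelector("[data-reactroot]")) signals.push("React");
  if (doc.querySelector("astro-island")) signals.push("Astro");
  return signals;
}
```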
Triggering capture
Capture can be triggered three ways:
- Automatic: During crawl when enrichment is enabled
- MCP tool: `trigger_capture(url: "https://example.com")`
- HTTP API: `POST /capture` endpoint on the control server
Chrome extension capture
The Crawlio Agent Chrome extension provides richer capture via Chrome DevTools Protocol:
| Data type | Description |
|---|---|
| Framework detection | Same probes as WebKit, plus additional DOM markers |
| Network requests | Full request/response correlation with body sizes |
| Console logs | JS console output plus browser-level errors |
| DOM snapshot | Recursive DOM walker with Shadow DOM support |
| Screenshots | Full-page capture via device metrics override |
The extension sends captured data to Crawlio via HTTP POST to the control server enrichment endpoints.
See Framework Detection for the full list of 59 detected technologies.
Vision OCR pipeline
Crawlio extracts text from downloaded images using Apple's Vision.framework. The pipeline is opt-in and runs after link localization.
Configuration
| Setting | Default | Description |
|---|---|---|
| `ocr.isEnabled` | `false` | Enable OCR (zero overhead when off) |
| `ocr.maxImageSize` | 10 MB | Maximum image size for OCR |
| `ocr.languages` | `[]` | Recognition languages (empty = auto-detect) |
| `ocr.recognitionLevel` | `accurate` | Vision recognition level: `accurate` or `fast` |
| `ocr.maxConcurrentJobs` | `2` | Maximum parallel OCR jobs |
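If these settings are stored as JSON, an enabled configuration might look like the sketch below. This is illustrative only: the key layout and the byte unit for `maxImageSize` are assumptions, not the documented file format.

```json
{
  "ocr": {
    "isEnabled": true,
    "maxImageSize": 10485760,
    "languages": [],
    "recognitionLevel": "accurate",
    "maxConcurrentJobs": 2
  }
}
```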
Eligible images
OCR runs on raster images only:
| Eligible | Rejected |
|---|---|
| PNG, JPEG, TIFF, BMP, WebP | SVG (vector, no pixel data), GIF (animation frames) |
Images larger than `maxImageSize` and `data:` URIs are skipped.
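The eligibility rules above reduce to a single predicate. A sketch, assuming sizes in bytes; the MIME list is the eligible set from the table:

```javascript
// OCR eligibility: raster MIME types only, under the size cap,
// and never data: URIs.
const OCR_ELIGIBLE_MIME = new Set([
  "image/png", "image/jpeg", "image/tiff", "image/bmp", "image/webp",
]);

function isOcrEligible(url, mimeType, sizeBytes, maxImageSize = 10 * 1024 * 1024) {
  if (url.startsWith("data:")) return false;          // data: URIs are skipped
  if (!OCR_ELIGIBLE_MIME.has(mimeType)) return false; // rejects SVG, GIF, etc.
  return sizeBytes <= maxImageSize;                   // size cap
}
```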
How it works
- After link localization, the engine scans each downloaded HTML page for `<img>` tags
- Each image is checked for MIME type eligibility and file size
- Eligible images are submitted as jobs to the OCR pipeline
- Jobs run in parallel (capped at `maxConcurrentJobs`)
- Each job uses `VNRecognizeTextRequest` for text recognition
- Results are sorted by bounding box position (top-to-bottom) and deduplicated
- Text is grouped by source page URL into a single `ocrText` string per page
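Sorting recognized text top-to-bottom and grouping it into one `ocrText` string per page can be sketched as follows. Field names (`pageUrl`, `boundingBox.y`) are assumptions for illustration, and `y` is treated here as increasing downward:

```javascript
// Combine per-image OCR results into one ocrText string per page.
// Each result: { pageUrl, text, boundingBox: { y } }.
function groupOcrText(results) {
  const byPage = new Map();
  for (const r of results) {
    if (!byPage.has(r.pageUrl)) byPage.set(r.pageUrl, []);
    byPage.get(r.pageUrl).push(r);
  }
  const out = {};
  for (const [pageUrl, items] of byPage) {
    items.sort((a, b) => a.boundingBox.y - b.boundingBox.y); // top-to-bottom
    const seen = new Set();
    const lines = [];
    for (const { text } of items) {
      if (!seen.has(text)) { seen.add(text); lines.push(text); } // dedupe
    }
    out[pageUrl] = lines.join("\n");
  }
  return out;
}
```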
Where OCR text appears
- Per-item storage: each downloaded item gets an `ocrText` field
- deploy.json: `pages[].ocrText`
- crawl-manifest.json: `items[].ocrText`
- WARC export: `metadata` record with `WARC-Refers-To` linking to the response record
Enrichment store
The enrichment store is the central persistence layer for all enrichment data.
Data model
Each URL maps to an enrichment record:
| Field | Type | Description |
|---|---|---|
| `url` | String | Page URL |
| `capturedAt` | Date | Capture timestamp |
| `framework` | object | JS framework detection (name, subtype, confidence, signals, version, SSR mode) |
| `networkRequests` | array | Captured HTTP requests (URL, method, status, MIME, size, duration, type) |
| `consoleLogs` | array | Console output (level, text, timestamp, source URL, line number) |
| `domSnapshotJSON` | String | Serialized DOM tree |
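Assembled, a single record might look like the following. Values are illustrative; the field names follow the table above, but the nested layouts are assumptions:

```json
{
  "url": "https://example.com",
  "capturedAt": "2025-01-15T10:30:00Z",
  "framework": { "name": "Next.js", "confidence": "high", "signals": ["__NEXT_DATA__", "#__next"] },
  "networkRequests": [
    { "url": "https://example.com/api/data", "method": "GET", "status": 200, "mimeType": "application/json" }
  ],
  "consoleLogs": [
    { "level": "warn", "text": "Deprecated API", "timestamp": "2025-01-15T10:30:01Z" }
  ],
  "domSnapshotJSON": "…"
}
```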
Storage behavior
- URL normalization: Strips `#fragment` and trailing `/` for consistent keying
- Merge semantics: Updates overwrite only non-nil fields. Framework detection and network entries can arrive independently and merge under the same URL key.
- Persistence: Debounced atomic writes to disk, with an immediate flush on app termination
- Per-project isolation: Each project gets its own enrichment store, persisted to `{destination}/.crawlio/enrichment.json`
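The normalization and merge rules above can be sketched like this (an illustration of the behavior, not the store's actual implementation):

```javascript
// Key URLs consistently: drop the #fragment and any trailing slash.
function normalizeUrl(url) {
  const noFragment = url.split("#")[0];
  return noFragment.endsWith("/") ? noFragment.slice(0, -1) : noFragment;
}

// Merge a partial update into an existing record: only non-nil fields
// overwrite, so framework detection and network entries that arrive
// separately accumulate under the same URL key.
function mergeRecord(existing, update) {
  const merged = { ...existing };
  for (const [key, value] of Object.entries(update)) {
    if (value !== null && value !== undefined) merged[key] = value;
  }
  return merged;
}
```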
Engine feedback
URLs discovered by browser capture (API endpoints, CDN assets, dynamic routes) are fed back into the crawl frontier. These URLs pass through the same filtering as crawled URLs: scope checks, robots.txt, URL normalization.
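That admission path can be pictured as a filter chain. A sketch under stated assumptions: the function and predicate names are hypothetical, standing in for the crawler's own scope, robots.txt, and normalization stages:

```javascript
// Discovered URLs pass the same gates as crawled URLs before
// entering the frontier: scope check, robots.txt, normalization.
function admitDiscoveredUrls(urls, { inScope, robotsAllows, normalize }) {
  const admitted = new Set();
  for (const url of urls) {
    if (!inScope(url)) continue;       // scope check
    if (!robotsAllows(url)) continue;  // robots.txt
    admitted.add(normalize(url));      // normalization also dedupes
  }
  return [...admitted];
}
```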
Querying enrichment data
Via MCP
```
get_enrichment()                             # All enrichment data
get_enrichment(url: "https://example.com")   # Per-URL
```

Or in code mode:

```
execute_api("GET", "/enrichment")
execute_api("GET", "/enrichment?url=https://example.com")
```

Via HTTP API

```shell
curl --unix-socket ~/Library/Logs/Crawlio/control.sock \
  http://localhost/enrichment

curl --unix-socket ~/Library/Logs/Crawlio/control.sock \
  "http://localhost/enrichment?url=https://example.com"
```

Via the app
The Enrichment Inspector shows enrichment data inline for the selected download item:
- Framework badge (name, confidence, signals)
- Network request summary (count, total size)
- Console log breakdown by level
- DOM snapshot availability indicator
Example: AI-driven analysis workflow
```
User: "What framework does crawlio.app use?"

Claude:
1. trigger_capture(url: "https://crawlio.app")
   --> WebKit captures framework, network, console, DOM

2. get_enrichment(url: "https://crawlio.app")
   --> Framework: Next.js (App Router), confidence: high
   --> 47 network requests, 3 console warnings

3. "crawlio.app uses Next.js with App Router. The site makes
   47 network requests including API calls to /api/* and
   CDN assets on vercel.live. There are 3 console warnings
   related to deprecated React APIs."
```

Next steps
- Framework Detection: Full list of 59 detected technologies
- Export Formats: How enrichment flows into exports
- MCP Tools Reference: Enrichment tools (submit and query)
- Settings Reference: OCR configuration options