AI Enrichment
Overview
Crawlio's enrichment pipeline captures signals that static HTML parsing cannot detect. Three components work together:
- Browser runtime capture. An off-screen WebKit view (or Chrome extension) captures framework detection, network requests, console logs, DOM snapshots, and screenshots.
- Vision OCR. Apple Vision.framework extracts text from downloaded raster images.
- Enrichment store. All enrichment data is stored per-URL, persisted to disk, and scoped per project.
Enrichment data flows into all export formats and is queryable via MCP tools.
Browser runtime capture
What gets captured
| Signal | What it captures |
|---|---|
| Framework detection | Client-side framework markers (React, Vue, Angular, Next.js, etc.) via `window` globals and DOM queries |
| Network requests | Intercepted `fetch()` and `XMLHttpRequest` calls plus `PerformanceObserver` entries |
| Console logs | All `console.log`/`warn`/`error` output |
| DOM snapshot | Full `document.documentElement.outerHTML` after JS execution |
| Screenshot | Page rendering captured as a PNG image |
How it works
- A shared WebKit view loads the target URL
- User scripts injected at document start intercept network calls and console output
- Framework detection JS probes `window` globals (`__NEXT_DATA__`, `__NUXT__`, `__vue_app__`) and DOM markers (`#__next`, `[data-reactroot]`, `astro-island`)
- Results are collected and stored in the enrichment store
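The global and DOM probes above can be sketched as a small page-context script. This is illustrative only: the real detector covers far more technologies and also reports confidence, version, and SSR mode.

```javascript
// Minimal framework probe in the spirit of the injected detection script.
// Checks a few of the window globals and DOM markers listed above.
// Parameters default to the page's own globals; they are injectable
// here purely so the sketch can be exercised outside a browser.
function detectFramework(win = window, doc = document) {
  const signals = [];
  if (win.__NEXT_DATA__ || doc.querySelector("#__next")) signals.push("Next.js");
  if (win.__NUXT__) signals.push("Nuxt");
  if (win.__vue_app__) signals.push("Vue");
  if (doc.querySelector("[data-reactroot]")) signals.push("React");
  if (doc.querySelector("astro-island")) signals.push("Astro");
  return signals;
}
```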
Triggering capture
Capture can be triggered three ways:
- Automatic: During crawl when enrichment is enabled
- MCP tool: `trigger_capture(url: "https://example.com")`
- HTTP API: `POST /capture` endpoint on the control server
Chrome extension capture
The Crawlio Agent Chrome extension provides richer capture via Chrome DevTools Protocol:
| Data type | Description |
|---|---|
| Framework detection | Same probes as WebKit, plus additional DOM markers |
| Network requests | Full request/response correlation with body sizes |
| Console logs | JS console output plus browser-level errors |
| DOM snapshot | Recursive DOM walker with Shadow DOM support |
| Screenshots | Full-page capture via device metrics override |
The extension sends captured data to Crawlio via HTTP POST to the control server enrichment endpoints.
See Framework Detection for the full list of 59 detected technologies.
Vision OCR pipeline
Crawlio extracts text from downloaded images using Apple's Vision.framework. The pipeline is opt-in and runs after link localization.
Configuration
| Setting | Default | Description |
|---|---|---|
| `ocr.isEnabled` | `false` | Enable OCR (zero overhead when off) |
| `ocr.maxImageSize` | 10 MB | Maximum image size for OCR |
| `ocr.languages` | `[]` | Recognition languages (empty = auto-detect) |
| `ocr.recognitionLevel` | `accurate` | Vision recognition level: `accurate` or `fast` |
| `ocr.maxConcurrentJobs` | `2` | Maximum parallel OCR jobs |
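If these settings are stored as JSON, an enabled configuration might look like the sketch below. This is illustrative only: the key layout and the byte unit for `maxImageSize` are assumptions, not the documented file format.

```json
{
  "ocr": {
    "isEnabled": true,
    "maxImageSize": 10485760,
    "languages": [],
    "recognitionLevel": "accurate",
    "maxConcurrentJobs": 2
  }
}
```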
Eligible images
OCR runs on raster images only:
| Eligible | Rejected |
|---|---|
| PNG, JPEG, TIFF, BMP, WebP | SVG (vector, no pixel data), GIF (animation frames) |
Images larger than `maxImageSize` and `data:` URIs are skipped.
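The eligibility rules above reduce to a single predicate. A sketch, assuming sizes in bytes; the MIME list is the eligible set from the table:

```javascript
// OCR eligibility: raster MIME types only, under the size cap,
// and never data: URIs.
const OCR_ELIGIBLE_MIME = new Set([
  "image/png", "image/jpeg", "image/tiff", "image/bmp", "image/webp",
]);

function isOcrEligible(url, mimeType, sizeBytes, maxImageSize = 10 * 1024 * 1024) {
  if (url.startsWith("data:")) return false;          // data: URIs are skipped
  if (!OCR_ELIGIBLE_MIME.has(mimeType)) return false; // rejects SVG, GIF, etc.
  return sizeBytes <= maxImageSize;                   // size cap
}
```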
How it works
- After link localization, the engine scans each downloaded HTML page for `<img>` tags
- Each image is checked for MIME type eligibility and file size
- Eligible images are submitted as jobs to the OCR pipeline
- Jobs run in parallel (capped at `maxConcurrentJobs`)
- Each job uses `VNRecognizeTextRequest` for text recognition
- Results are sorted by bounding box position (top-to-bottom) and deduplicated
- Text is grouped by source page URL into a single `ocrText` string per page
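Sorting recognized text top-to-bottom and grouping it into one `ocrText` string per page can be sketched as follows. Field names (`pageUrl`, `boundingBox.y`) are assumptions for illustration, and `y` is treated here as increasing downward:

```javascript
// Combine per-image OCR results into one ocrText string per page.
// Each result: { pageUrl, text, boundingBox: { y } }.
function groupOcrText(results) {
  const byPage = new Map();
  for (const r of results) {
    if (!byPage.has(r.pageUrl)) byPage.set(r.pageUrl, []);
    byPage.get(r.pageUrl).push(r);
  }
  const out = {};
  for (const [pageUrl, items] of byPage) {
    items.sort((a, b) => a.boundingBox.y - b.boundingBox.y); // top-to-bottom
    const seen = new Set();
    const lines = [];
    for (const { text } of items) {
      if (!seen.has(text)) { seen.add(text); lines.push(text); } // dedupe
    }
    out[pageUrl] = lines.join("\n");
  }
  return out;
}
```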
Where OCR text appears
- Per-item storage: each downloaded item gets an `ocrText` field
- deploy.json: `pages[].ocrText`
- crawl-manifest.json: `items[].ocrText`
- WARC export: `metadata` record with `WARC-Refers-To` linking to the response record
Enrichment store
The enrichment store is the central persistence layer for all enrichment data.
Data model
Each URL maps to an enrichment record:
| Field | Type | Description |
|---|---|---|
| `url` | String | Page URL |
| `capturedAt` | Date | Capture timestamp |
| `framework` | object | JS framework detection (name, subtype, confidence, signals, version, SSR mode) |
| `networkRequests` | array | Captured HTTP requests (URL, method, status, MIME, size, duration, type) |
| `consoleLogs` | array | Console output (level, text, timestamp, source URL, line number) |
| `domSnapshotJSON` | String | Serialized DOM tree |
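Assembled, a single record might look like the following. Values are illustrative; the field names follow the table above, but the nested layouts are assumptions:

```json
{
  "url": "https://example.com",
  "capturedAt": "2025-01-15T10:30:00Z",
  "framework": { "name": "Next.js", "confidence": "high", "signals": ["__NEXT_DATA__", "#__next"] },
  "networkRequests": [
    { "url": "https://example.com/api/data", "method": "GET", "status": 200, "mimeType": "application/json" }
  ],
  "consoleLogs": [
    { "level": "warn", "text": "Deprecated API", "timestamp": "2025-01-15T10:30:01Z" }
  ],
  "domSnapshotJSON": "…"
}
```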
Storage behavior
- URL normalization: Strips `#fragment` and trailing `/` for consistent keying
- Merge semantics: Updates overwrite only non-nil fields. Framework detection and network entries can arrive independently and merge under the same URL key.
- Persistence: Debounced atomic writes to disk, with an immediate flush on app termination
- Per-project isolation: Each project gets its own enrichment store, persisted to `{destination}/.crawlio/enrichment.json`
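The normalization and merge rules above can be sketched like this (an illustration of the behavior, not the store's actual implementation):

```javascript
// Key URLs consistently: drop the #fragment and any trailing slash.
function normalizeUrl(url) {
  const noFragment = url.split("#")[0];
  return noFragment.endsWith("/") ? noFragment.slice(0, -1) : noFragment;
}

// Merge a partial update into an existing record: only non-nil fields
// overwrite, so framework detection and network entries that arrive
// separately accumulate under the same URL key.
function mergeRecord(existing, update) {
  const merged = { ...existing };
  for (const [key, value] of Object.entries(update)) {
    if (value !== null && value !== undefined) merged[key] = value;
  }
  return merged;
}
```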
Engine feedback
URLs discovered by browser capture (API endpoints, CDN assets, dynamic routes) are fed back into the crawl frontier. These URLs pass through the same filtering as crawled URLs: scope checks, robots.txt, URL normalization.
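That admission path can be pictured as a filter chain. A sketch under stated assumptions: the function and predicate names are hypothetical, standing in for the crawler's own scope, robots.txt, and normalization stages:

```javascript
// Discovered URLs pass the same gates as crawled URLs before
// entering the frontier: scope check, robots.txt, normalization.
function admitDiscoveredUrls(urls, { inScope, robotsAllows, normalize }) {
  const admitted = new Set();
  for (const url of urls) {
    if (!inScope(url)) continue;       // scope check
    if (!robotsAllows(url)) continue;  // robots.txt
    admitted.add(normalize(url));      // normalization also dedupes
  }
  return [...admitted];
}
```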
Querying enrichment data
Via MCP
```
get_enrichment()                             # All enrichment data
get_enrichment(url: "https://example.com")   # Per-URL
```

Or in code mode:

```
execute_api("GET", "/enrichment")
execute_api("GET", "/enrichment?url=https://example.com")
```

Via HTTP API

```shell
curl --unix-socket ~/Library/Logs/Crawlio/control.sock \
  http://localhost/enrichment

curl --unix-socket ~/Library/Logs/Crawlio/control.sock \
  "http://localhost/enrichment?url=https://example.com"
```

Via the app
The Enrichment Inspector shows enrichment data inline for the selected download item:
- Framework badge (name, confidence, signals)
- Network request summary (count, total size)
- Console log breakdown by level
- DOM snapshot availability indicator
Example: AI-driven analysis workflow
```
User: "What framework does crawlio.app use?"

Claude:
1. trigger_capture(url: "https://crawlio.app")
   --> WebKit captures framework, network, console, DOM

2. get_enrichment(url: "https://crawlio.app")
   --> Framework: Next.js (App Router), confidence: high
   --> 47 network requests, 3 console warnings

3. "crawlio.app uses Next.js with App Router. The site makes
   47 network requests including API calls to /api/* and
   CDN assets on vercel.live. There are 3 console warnings
   related to deprecated React APIs."
```

Next steps
- Framework Detection: Full list of 59 detected technologies
- Export Formats: How enrichment flows into exports
- MCP Tools Reference: Enrichment tools (submit and query)
- Settings Reference: OCR configuration options