Crawlio Docs

AI Enrichment

Overview

Crawlio's enrichment pipeline captures signals that static HTML parsing cannot detect. Three components work together:

  1. Browser runtime capture. An off-screen WebKit view (or Chrome extension) captures framework detection, network requests, console logs, DOM snapshots, and screenshots.
  2. Vision OCR. Apple Vision.framework extracts text from downloaded raster images.
  3. Enrichment store. All enrichment data is stored per-URL, persisted to disk, and scoped per project.

Enrichment data flows into all export formats and is queryable via MCP tools.


Browser runtime capture

What gets captured

| Signal | What it captures |
| --- | --- |
| Framework detection | Client-side framework markers (React, Vue, Angular, Next.js, etc.) via window globals and DOM queries |
| Network requests | Intercepted fetch() and XMLHttpRequest calls plus PerformanceObserver entries |
| Console logs | All console.log/warn/error output |
| DOM snapshot | Full document.documentElement.outerHTML after JS execution |
| Screenshot | Page rendering captured as a PNG image |
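The console interception above can be sketched as a user script that wraps the console methods before page scripts run. This is a minimal illustration; the entry shape and the `recordLog` sink are assumptions, not Crawlio's actual internals.

```javascript
// Minimal sketch of a console-intercepting user script.
// `capturedLogs` / `recordLog` are hypothetical names for the capture sink.
const capturedLogs = [];

function recordLog(level, args) {
  capturedLogs.push({
    level,
    text: args.map(String).join(' '),
    timestamp: Date.now(),
  });
}

for (const level of ['log', 'warn', 'error']) {
  const original = console[level].bind(console);
  console[level] = (...args) => {
    recordLog(level, args); // capture first...
    original(...args);      // ...then pass through to the real console
  };
}
```

Because the wrapper delegates to the original method, page behavior is unchanged while every call is recorded.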

How it works

  1. A shared WebKit view loads the target URL
  2. User scripts injected at document start intercept network calls and console output
  3. Framework detection JS probes window globals (__NEXT_DATA__, __NUXT__, __vue_app__) and DOM markers (#__next, [data-reactroot], astro-island)
  4. Results are collected and stored in the enrichment store
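The probing in step 3 can be sketched as a function over the page's `window` and `document`. Only the markers named above are checked here; the real detector covers far more technologies, and the returned shape is an assumption for illustration.

```javascript
// Minimal sketch of framework detection via window globals and DOM markers.
// Checks only the probes named in the docs; return shape is illustrative.
function detectFramework(win, doc) {
  if (win.__NEXT_DATA__ || doc.querySelector('#__next')) {
    return { name: 'Next.js', signals: ['__NEXT_DATA__ / #__next'] };
  }
  if (win.__NUXT__) return { name: 'Nuxt', signals: ['__NUXT__'] };
  if (win.__vue_app__) return { name: 'Vue', signals: ['__vue_app__'] };
  if (doc.querySelector('[data-reactroot]')) {
    return { name: 'React', signals: ['[data-reactroot]'] };
  }
  if (doc.querySelector('astro-island')) {
    return { name: 'Astro', signals: ['astro-island'] };
  }
  return null; // no known marker found
}
```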

Triggering capture

Capture can be triggered in three ways:

  • Automatic: During crawl when enrichment is enabled
  • MCP tool: trigger_capture(url: "https://example.com")
  • HTTP API: POST /capture endpoint on the control server

Chrome extension capture

The Crawlio Agent Chrome extension provides richer capture via Chrome DevTools Protocol:

| Data type | Description |
| --- | --- |
| Framework detection | Same probes as WebKit, plus additional DOM markers |
| Network requests | Full request/response correlation with body sizes |
| Console logs | JS console output plus browser-level errors |
| DOM snapshot | Recursive DOM walker with Shadow DOM support |
| Screenshots | Full-page capture via device metrics override |

The extension sends captured data to Crawlio via HTTP POST to the control server enrichment endpoints.

See Framework Detection for the full list of 59 detected technologies.


Vision OCR pipeline

Crawlio extracts text from downloaded images using Apple's Vision.framework. The pipeline is opt-in and runs after link localization.

Configuration

| Setting | Default | Description |
| --- | --- | --- |
| ocr.isEnabled | false | Enable OCR (zero overhead when off) |
| ocr.maxImageSize | 10 MB | Maximum image size for OCR |
| ocr.languages | [] | Recognition languages (empty = auto-detect) |
| ocr.recognitionLevel | accurate | Vision recognition level: accurate or fast |
| ocr.maxConcurrentJobs | 2 | Maximum number of parallel OCR jobs |

Eligible images

OCR runs on raster images only:

| Eligible | Rejected |
| --- | --- |
| PNG, JPEG, TIFF, BMP, WebP | SVG (vector, no pixel data), GIF (animation frames) |

Images larger than maxImageSize are skipped, as are data: URIs.
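The eligibility rules can be sketched as a single predicate. The MIME list and the default 10 MB cap come from the configuration above; the function name and signature are illustrative.

```javascript
// Sketch of the OCR eligibility rules: raster MIME types only,
// no data: URIs, and a size cap (default 10 MB per the config table).
const OCR_ELIGIBLE_TYPES = new Set([
  'image/png', 'image/jpeg', 'image/tiff', 'image/bmp', 'image/webp',
]);

function isOcrEligible(url, mimeType, sizeBytes, maxImageSize = 10 * 1024 * 1024) {
  if (url.startsWith('data:')) return false;           // data: URIs are skipped
  if (!OCR_ELIGIBLE_TYPES.has(mimeType)) return false; // SVG, GIF, etc. rejected
  return sizeBytes <= maxImageSize;                    // size cap
}
```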

How it works

  1. After link localization, the engine scans each downloaded HTML page for <img> tags
  2. Each image is checked for MIME type eligibility and file size
  3. Eligible images are submitted as jobs to the OCR pipeline
  4. Jobs run in parallel (capped at maxConcurrentJobs)
  5. Each job uses VNRecognizeTextRequest for text recognition
  6. Results are sorted by bounding box position (top-to-bottom) and deduplicated
  7. Text is grouped by source page URL into a single ocrText string per page
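Steps 6 and 7 can be sketched as a small assembly function. The fragment shape (`{ text, boundingBox: { y } }`, with `y` increasing top-to-bottom) is an assumption for illustration; Vision's native coordinates differ.

```javascript
// Sketch of steps 6-7: sort recognized fragments top-to-bottom by
// bounding box, drop exact duplicates, and join into one ocrText string.
// Fragment shape is assumed: { text, boundingBox: { y } }, top-origin y.
function assembleOcrText(fragments) {
  const sorted = [...fragments].sort((a, b) => a.boundingBox.y - b.boundingBox.y);
  const seen = new Set();
  const lines = [];
  for (const { text } of sorted) {
    if (!seen.has(text)) { // deduplicate repeated fragments
      seen.add(text);
      lines.push(text);
    }
  }
  return lines.join('\n');
}
```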

Where OCR text appears

  • Per-item storage: each downloaded item gets an ocrText field
  • deploy.json: pages[].ocrText
  • crawl-manifest.json: items[].ocrText
  • WARC export: metadata record with WARC-Refers-To linking to the response record

Enrichment store

The enrichment store is the central persistence layer for all enrichment data.

Data model

Each URL maps to an enrichment record:

| Field | Type | Description |
| --- | --- | --- |
| url | String | Page URL |
| capturedAt | Date | Capture timestamp |
| framework | object | JS framework detection (name, subtype, confidence, signals, version, SSR mode) |
| networkRequests | array | Captured HTTP requests (URL, method, status, MIME, size, duration, type) |
| consoleLogs | array | Console output (level, text, timestamp, source URL, line number) |
| domSnapshotJSON | String | Serialized DOM tree |
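An illustrative record matching the data model above might look like this. All field values here are hypothetical, and the exact property names within the nested objects are assumptions.

```javascript
// Illustrative enrichment record; every value is hypothetical.
const record = {
  url: 'https://example.com',
  capturedAt: new Date('2026-01-15T12:00:00Z'),
  framework: {
    name: 'Next.js', subtype: 'App Router', confidence: 'high',
    signals: ['__NEXT_DATA__', '#__next'], version: null, ssrMode: 'ssr',
  },
  networkRequests: [
    { url: 'https://example.com/api/data', method: 'GET', status: 200,
      mimeType: 'application/json', size: 1024, duration: 42, type: 'fetch' },
  ],
  consoleLogs: [
    { level: 'warn', text: 'deprecated API', timestamp: 1700000000000,
      sourceURL: 'https://example.com/app.js', lineNumber: 12 },
  ],
  domSnapshotJSON: '{"tag":"html","children":[]}',
};
```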

Storage behavior

  • URL normalization: Strips #fragment and trailing / for consistent keying
  • Merge semantics: Updates only overwrite non-nil fields. Framework detection and network entries can arrive independently and merge under the same URL key.
  • Persistence: Debounced atomic writes to disk. Immediate flush on app termination.
  • Per-project isolation: Each project gets its own enrichment store, persisted to {destination}/.crawlio/enrichment.json
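The normalization and merge rules above can be sketched as two small functions; the function names and the flat-record merge are simplifying assumptions.

```javascript
// Sketch of URL normalization: strip #fragment and trailing slash
// so independent captures key to the same record.
function normalizeKey(url) {
  const u = new URL(url);
  u.hash = '';                             // strip #fragment
  let s = u.toString();
  if (s.endsWith('/')) s = s.slice(0, -1); // strip trailing /
  return s;
}

// Sketch of merge semantics: updates overwrite only non-nil fields,
// so framework detection and network entries arriving separately
// both survive under the same key.
function mergeRecord(existing, update) {
  const merged = { ...existing };
  for (const [key, value] of Object.entries(update)) {
    if (value !== null && value !== undefined) merged[key] = value;
  }
  return merged;
}
```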

Engine feedback

URLs discovered by browser capture (API endpoints, CDN assets, dynamic routes) are fed back into the crawl frontier. These URLs pass through the same filtering as crawled URLs: scope checks, robots.txt, URL normalization.
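That feedback filter can be sketched as follows. The same-origin scope rule and the `isAllowedByRobots` callback are stand-ins for illustration; Crawlio's actual scope and robots handling are richer.

```javascript
// Sketch of frontier feedback: discovered URLs pass scope, robots,
// and normalization checks before entering the crawl frontier.
// `scopeOrigin` and `isAllowedByRobots` are hypothetical parameters.
function filterDiscovered(urls, scopeOrigin, isAllowedByRobots) {
  const seen = new Set();
  const accepted = [];
  for (const raw of urls) {
    let u;
    try { u = new URL(raw); } catch { continue; } // drop malformed URLs
    u.hash = '';                                  // normalize fragment away
    const key = u.toString();
    if (u.origin !== scopeOrigin) continue;       // scope check
    if (!isAllowedByRobots(u.pathname)) continue; // robots.txt check
    if (!seen.has(key)) { seen.add(key); accepted.push(key); }
  }
  return accepted;
}
```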


Querying enrichment data

Via MCP

get_enrichment()                                    # All enrichment data
get_enrichment(url: "https://example.com")          # Per-URL

Or in code mode:

execute_api("GET", "/enrichment")
execute_api("GET", "/enrichment?url=https://example.com")

Via HTTP API

curl --unix-socket ~/Library/Logs/Crawlio/control.sock \
  http://localhost/enrichment
 
curl --unix-socket ~/Library/Logs/Crawlio/control.sock \
  "http://localhost/enrichment?url=https://example.com"

Via the app

The Enrichment Inspector shows enrichment data inline for the selected download item:

  • Framework badge (name, confidence, signals)
  • Network request summary (count, total size)
  • Console log breakdown by level
  • DOM snapshot availability indicator

Example: AI-driven analysis workflow

User: "What framework does crawlio.app use?"
 
Claude:
  1. trigger_capture(url: "https://crawlio.app")
     --> WebKit captures framework, network, console, DOM
 
  2. get_enrichment(url: "https://crawlio.app")
     --> Framework: Next.js (App Router), confidence: high
     --> 47 network requests, 3 console warnings
 
  3. "crawlio.app uses Next.js with App Router. The site makes
      47 network requests including API calls to /api/* and
      CDN assets on vercel.live. There are 3 console warnings
      related to deprecated React APIs."

© 2026 Crawlio. All rights reserved.