Crawlio Docs

Architecture

The crawl pipeline

When you start a crawl, Crawlio runs a five-stage pipeline:

  1. Acquire. Download every reachable artifact over HTTP.
  2. Parse. Tokenize every byte into its semantic parts.
  3. Extract. Classify every artifact by type and role.
  4. Localize. Rewrite every URL for offline browsing.
  5. Export. Package into a format a human or AI can consume.

The stages overlap rather than running strictly one after another: downloading, parsing, and localizing happen in parallel.

Stage 1: Acquire

Download every reachable resource: HTML pages, stylesheets, scripts, images, fonts, PDFs, sitemaps, and robots.txt. Respect scope, depth limits, and politeness rules. Per-host connection limits prevent overwhelming any single server.

Stage 2: Parse

Run each downloaded file through a type-specific parser. Extract every URL, link, and reference. Build a complete graph of what references what.
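
One way to picture the output of this stage is an adjacency map from each resource to the URLs it references. A minimal sketch, with illustrative names rather than Crawlio's internal types:

import Foundation

/// Records which resource references which, as discovered by the parsers.
struct ReferenceGraph {
    private(set) var edges: [URL: Set<URL>] = [:]

    /// Register that `source` (an HTML page, stylesheet, script, ...) references `target`.
    mutating func addReference(from source: URL, to target: URL) {
        edges[source, default: []].insert(target)
    }

    /// Every resource that `source` points at.
    func references(of source: URL) -> Set<URL> {
        edges[source] ?? []
    }

    /// Every resource that points at `target` (useful later for orphan-page analysis).
    func referrers(of target: URL) -> [URL] {
        edges.filter { $0.value.contains(target) }.map { $0.key }
    }
}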

Stage 3: Extract

Classify each file by role. Separate the page content from scaffolding (tracking scripts, hydration code, build tooling). Normalize HTML. Generate clean Markdown.

Stage 4: Localize

Rewrite every URL in every file so the entire site works offline. Convert absolute CDN URLs to relative local paths. Preserve the link structure exactly.

Stage 5: Export

Package everything into one of seven formats: browsable folder, ZIP archive, single self-contained HTML file, WARC web archive, PDF snapshot, deploy-ready static site, or extracted structured content.


What gets parsed

Crawlio selects a parser based on the response content type. 12 parsers cover all common web formats:

  • HTML: Links, images, scripts, stylesheets, meta tags, srcset, inline styles, data attributes
  • CSS: @import URLs, url() references, @font-face sources, image-set()
  • SVG: <use>, <image>, <a>, <feImage>, <textPath> URLs, embedded CSS
  • JavaScript (URLs): fetch(), import(), ES module paths, absolute URL literals
  • JavaScript (chunks): Webpack/Vite chunk paths from build manifests
  • PDF: Link annotations, URL patterns in text content
  • XML sitemap: <loc> URLs, recursive sitemap index discovery, gzip decompression
  • robots.txt: Allow/Disallow rules, Crawl-delay, Sitemap URLs (RFC 9309)
  • Web App Manifest: Icons, start_url, scope
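
Internally, parser selection is a dispatch on the MIME type from the Content-Type header. A minimal sketch, with illustrative type names rather than Crawlio's actual classes:

import Foundation

/// A parser extracts outgoing URLs from one kind of resource.
protocol ResourceParser {
    func extractURLs(from data: Data, baseURL: URL) -> [URL]
}

// Illustrative stand-ins; each real parser implements ResourceParser.
struct HTMLParser: ResourceParser { func extractURLs(from data: Data, baseURL: URL) -> [URL] { [] } }
struct CSSParser: ResourceParser { func extractURLs(from data: Data, baseURL: URL) -> [URL] { [] } }
struct SitemapParser: ResourceParser { func extractURLs(from data: Data, baseURL: URL) -> [URL] { [] } }

/// Pick a parser from a Content-Type value such as "text/html; charset=utf-8".
func parser(forContentType contentType: String) -> ResourceParser? {
    let mime = contentType
        .split(separator: ";").first?
        .trimmingCharacters(in: .whitespaces)
        .lowercased() ?? ""
    switch mime {
    case "text/html", "application/xhtml+xml":
        return HTMLParser()
    case "text/css":
        return CSSParser()
    case "application/xml", "text/xml":
        return SitemapParser()
    default:
        return nil   // binary assets (images, fonts, ...) are stored but not parsed for links
    }
}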

The HTML parser is a custom character-by-character scanner. Real-world HTML ships with unclosed tags, mismatched quotes, and encoding errors. A DOM parser would reject half the web.
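
To illustrate the tolerance (a heavily simplified sketch, not Crawlio's actual scanner): find a link-bearing attribute, then read its value whether it is double-quoted, single-quoted, or bare, and never let a missing quote swallow the rest of the page:

import Foundation

/// Pull attribute values (e.g. href) out of HTML without requiring well-formed markup.
func scanAttributeValues(in html: String, attribute: String = "href") -> [String] {
    var values: [String] = []
    var cursor = html.startIndex
    while let hit = html.range(of: attribute + "=", options: .caseInsensitive,
                               range: cursor..<html.endIndex) {
        var i = hit.upperBound
        // Optional opening quote (either kind); remember it so we know where the value ends.
        var quote: Character? = nil
        if i < html.endIndex, html[i] == "\"" || html[i] == "'" {
            quote = html[i]
            i = html.index(after: i)
        }
        var value = ""
        while i < html.endIndex {
            let c = html[i]
            // Quoted values end at the matching quote; bare values end at whitespace.
            // Either kind ends at '>' so a missing close quote cannot eat the rest of the page.
            if c == quote || c == ">" || (quote == nil && c.isWhitespace) { break }
            value.append(c)
            i = html.index(after: i)
        }
        if !value.isEmpty { values.append(value) }
        cursor = i
    }
    return values
}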

Framework-aware content processors

After parsing, framework-specific processors discover additional URLs. Next.js sites trigger RSC payload extraction and _next/static/ chunk scanning. Astro sites trigger island component URL discovery. WordPress sites get wp-content and wp-includes path scanning.

Supported frameworks: Next.js, Astro, Svelte/SvelteKit, Vue/Nuxt, React, Angular, Gatsby, WordPress, Hugo, Jekyll.
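
For example, a Next.js-aware pass can sweep raw page source for /_next/static/ asset paths that generic link extraction might miss. A hedged sketch, not Crawlio's actual processor:

import Foundation

/// Collect "/_next/static/..." asset paths referenced anywhere in a page's source.
func nextStaticPaths(in source: String) -> [String] {
    var paths: Set<String> = []
    var searchRange = source.startIndex..<source.endIndex
    while let hit = source.range(of: "/_next/static/", range: searchRange) {
        // Extend the match until a character that cannot appear in the path.
        var end = hit.upperBound
        let terminators: Set<Character> = ["\"", "'", "\\", ")", ">", "<", " ", "\n", "\t", ","]
        while end < source.endIndex, !terminators.contains(source[end]) {
            end = source.index(after: end)
        }
        paths.insert(String(source[hit.lowerBound..<end]))
        searchRange = end..<source.endIndex
    }
    return paths.sorted()
}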


What gets analyzed

19 site analyzers run after the crawl completes. Each analyzer produces findings with severity levels (critical, warning, notice, info) and specific recommendations.

What each analyzer checks:

  • SEO: Title, meta description, headings, canonical URLs, word count, readability
  • Social meta: Open Graph tags, Twitter cards, relative image URLs, URL mismatches
  • Accessibility: Alt text, ARIA attributes, heading hierarchy, color contrast hints
  • Security: HTTPS, HSTS, CSP, X-Frame-Options, cookie flags, server header leaks
  • Best practices: Viewport meta, favicon, broken links, large images, deprecated HTML
  • Content quality: Thin content, duplicate titles, missing descriptions
  • Duplicate content: Exact duplicates (SHA-256) and near-duplicates with similarity scores
  • Link intelligence: Internal/external link counts, orphan pages, link graph analysis
  • Orphan pages: Pages with no internal links pointing to them, sitemap cross-reference
  • Redirect chains: Loops, long chains, mixed permanent/temporary, HTTP-to-HTTPS
  • URL hygiene: Trailing slashes, mixed case, query parameter bloat, fragment misuse
  • Tracking: Third-party trackers, analytics scripts, ad networks, cookie surface
  • Images: Missing alt text, oversized images, missing dimensions, format suggestions
  • Keywords: Top keywords by frequency, co-occurring groups, density analysis
  • Design system: Colors, typography, spacing, breakpoints, component detection
  • Hreflang: Language/region tag validation, self-referencing, return link verification
  • 404 detection: Broken internal links, missing assets, soft 404 pages
  • Parity: HTML/JS rendering differences (title, h1, meta, canonical, robots)
  • Technology: Technology fingerprinting across 59+ technologies

Results feed into a scoring system that grades the site across 15 categories. Each category gets a 0-100 score with color-coded thresholds.
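
The banding itself is simple; a sketch with illustrative cut-offs, not Crawlio's actual thresholds:

/// Map a 0-100 category score to a color-coded band. Thresholds here are illustrative.
enum GradeBand: String {
    case good = "green", needsWork = "orange", poor = "red"
}

func band(forScore score: Int) -> GradeBand {
    switch score {
    case 90...:   return .good
    case 50..<90: return .needsWork
    default:      return .poor
    }
}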


Intelligence layer

Beyond basic analysis, Crawlio provides deeper intelligence capabilities.

Technology fingerprinting. Identifies 59+ technologies including frameworks, CMS platforms, analytics tools, CDNs, and hosting providers. Detection uses a 4-layer approach (see Framework Detection below).

API schema discovery. Analyzes network traffic to discover API endpoints, extract request/response patterns, and generate OpenAPI 3.0.3 specifications.

Traffic analysis. Classifies network requests into categories (document, script, style, image, font, fetch, tracker, analytics, ad) and maps the service dependency graph.

Vision OCR. Extracts text from downloaded images using Apple's Vision framework. Supports PNG, JPEG, TIFF, BMP, and WebP. Results are included in exports (deploy.json, crawl-manifest.json, WARC metadata records). Opt-in via settings.
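
A minimal sketch of the recognition step using the Vision framework (error handling and the export plumbing are omitted):

import Foundation
import Vision

/// Recognize text in a downloaded image and return it as plain lines.
func recognizeText(at imageURL: URL) throws -> [String] {
    var lines: [String] = []
    let request = VNRecognizeTextRequest { request, _ in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        lines = observations.compactMap { $0.topCandidates(1).first?.string }
    }
    request.recognitionLevel = .accurate
    request.usesLanguageCorrection = true

    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    try handler.perform([request])   // synchronous; the completion handler runs before this returns
    return lines
}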


Concurrency model

Crawlio coordinates 25+ concurrent workers for downloading, parsing, and analysis. You control concurrency through settings:

  • Parallel connections (1-40). How many downloads run at once.
  • Per-host connection limit (default: 6). Prevents overwhelming a single server.
  • Crawl delay. Minimum time between requests to the same host.
  • Bandwidth throttle. Token bucket rate limiter for download speed.
  • Circuit breaker. Automatically backs off from hosts returning errors. Resumes with a probe request after cooldown.

The download loop uses fair queuing across hosts. Without fairness, a CDN with 10,000 assets would starve the main domain's HTML pages. Round-robin dequeuing ensures every domain gets download slots proportionally.
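
A per-host round-robin queue captures the idea. A minimal sketch (Crawlio's scheduler also accounts for per-host connection limits and crawl delay):

import Foundation

/// Round-robin queue: each host keeps its own FIFO, and dequeuing cycles across hosts
/// so a huge CDN cannot starve the main domain.
struct FairURLQueue {
    private var perHost: [String: [URL]] = [:]
    private var hostOrder: [String] = []
    private var cursor = 0

    mutating func enqueue(_ url: URL) {
        guard let host = url.host else { return }
        if perHost[host] == nil { hostOrder.append(host) }
        perHost[host, default: []].append(url)
    }

    mutating func dequeue() -> URL? {
        guard !hostOrder.isEmpty else { return nil }
        // Try each host once, starting after the last one we served.
        for _ in 0..<hostOrder.count {
            cursor = (cursor + 1) % hostOrder.count
            let host = hostOrder[cursor]
            if var queue = perHost[host], !queue.isEmpty {
                let url = queue.removeFirst()
                perHost[host] = queue
                return url
            }
        }
        return nil
    }
}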

URL deduplication uses a normalized string set. Normalization includes lowercasing the host, stripping fragments, sorting query parameters, and IDNA punycode conversion.
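
A sketch of that normalization with URLComponents (the IDNA/punycode step is left out here):

import Foundation

/// Normalize a URL into a stable deduplication key: lowercase host, no fragment,
/// query parameters sorted by name. (IDNA/punycode conversion is omitted in this sketch.)
func dedupeKey(for url: URL) -> String? {
    guard var components = URLComponents(url: url, resolvingAgainstBaseURL: true) else { return nil }
    components.scheme = components.scheme?.lowercased()
    components.host = components.host?.lowercased()
    components.fragment = nil
    if let items = components.queryItems, !items.isEmpty {
        components.queryItems = items.sorted { $0.name < $1.name }
    }
    return components.string
}

With this, "https://Example.com/a?b=2&a=1#top" and "https://example.com/a?a=1&b=2" collapse to the same key.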


Framework detection

Crawlio identifies the JavaScript framework powering a site through 4 detection layers:

  • Static: Inspects HTML markup, script attributes, and path patterns in downloaded source
  • Dynamic: Runs detection JavaScript in a WebKit browser to check global variables and DOM state
  • Merged: Combines static and dynamic results; on conflict, dynamic confidence wins and the signals are unioned
  • Technography: Full technology fingerprinting across 59+ technologies with version and confidence

Detection signals per framework:

  • React: data-reactroot, _reactRootContainer
  • Next.js: __NEXT_DATA__ script tag, _next/ paths
  • Vue: data-v- attributes, __vue__
  • Angular: ng-version, ng-app
  • Svelte: Svelte component class patterns
  • Gatsby: ___gatsby div, page-data paths
  • Nuxt: __NUXT__ global, _nuxt/ paths
  • Astro: astro-island elements
  • WordPress: wp-content, wp-includes paths
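
A minimal sketch of the static layer, matching a few of the signals above against downloaded HTML (signal lists abbreviated, confidence scoring omitted):

import Foundation

/// Tiny static detector: count how many known markers for each framework
/// appear in the HTML source and report the best match.
func detectFramework(inHTML html: String) -> (name: String, signals: [String])? {
    let markers: [String: [String]] = [
        "Next.js":   ["__NEXT_DATA__", "/_next/"],
        "React":     ["data-reactroot", "_reactRootContainer"],
        "Vue":       ["data-v-", "__vue__"],
        "Nuxt":      ["__NUXT__", "/_nuxt/"],
        "Angular":   ["ng-version"],
        "Gatsby":    ["___gatsby", "page-data"],
        "Astro":     ["astro-island"],
        "WordPress": ["wp-content", "wp-includes"],
    ]
    var best: (name: String, signals: [String])? = nil
    for (name, patterns) in markers {
        let hits = patterns.filter { html.contains($0) }
        if hits.count > (best?.signals.count ?? 0) {
            best = (name, hits)
        }
    }
    return best
}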

Detection helps the engine make better decisions. Next.js sites trigger RSC payload extraction. WordPress sites get specialized URL normalization. SPAs get a recommendation to enable WebKit mode.

See the Framework Detection guide for details.


Export

Seven export formats, plus three metadata exports:

  • Folder: Browsable offline copy with localized links
  • ZIP: Compressed archive of the folder export
  • Single HTML: One self-contained HTML file with inlined assets
  • WARC: ISO 28500 web archive with SHA-1 digests, CDX index, deduplication, and optional gzip compression
  • PDF: PDF snapshot via WebKit rendering
  • Deploy: Static site ready for hosting (flattened paths, redirect stubs, sitemap, junk removal)
  • Extracted: Structured content (normalized HTML, Markdown, metadata)

Metadata exports (included alongside the main format):

  • deploy.json: Per-page metadata, enrichment data, OCR text
  • crawl-manifest.json: Complete crawl manifest with all URLs and metadata
  • CSV: Tabular data with configurable columns, presets, and filters

See the Export Formats guide for details.


Localization

After downloading, Crawlio rewrites URLs so the site works offline.

Before:

<a href="https://example.com/about">About</a>
<link rel="stylesheet" href="https://cdn.example.com/style.css">

After:

<a href="../about/index.html">About</a>
<link rel="stylesheet" href="../style.css">

Localization covers HTML (15 tag types), CSS (url(), @import, image-set()), JavaScript (absolute URLs, JSON-escaped paths), SVG (href attributes), inline styles, srcset, data-* attributes, and protocol-relative URLs. SRI attributes are stripped, since integrity hashes no longer match rewritten files. Base tags are neutralized.
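
The core of the rewrite is mapping each remote URL to its local file (via the link map), then computing a relative path from the referencing file to that target. A sketch of the relative-path step; the local file layout is illustrative:

/// Relative path from the directory containing `fromFile` to `toFile`,
/// e.g. relativePath(from: "about/index.html", to: "style.css") == "../style.css".
func relativePath(from fromFile: String, to toFile: String) -> String {
    let fromDir = fromFile.split(separator: "/").dropLast()   // directories of the referencing file
    let toParts = toFile.split(separator: "/")

    // Skip the directories the two paths share.
    var common = 0
    while common < fromDir.count, common < toParts.count - 1, fromDir[common] == toParts[common] {
        common += 1
    }
    let ups = Array(repeating: "..", count: fromDir.count - common)
    let rest = toParts.dropFirst(common).map { String($0) }
    return (ups + rest).joined(separator: "/")
}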

Progressive localization

Links are localized twice:

  1. During crawl. Each file is localized immediately after download using a partial link map. Even interrupted crawls produce partially-browsable offline sites.
  2. After crawl. A final pass re-localizes all files using the complete link map.

Scope and depth rules

Depth limit

Depth counts link-clicks from the seed URL:

  • example.com = depth 0
    • /about = depth 1
      • /about/team = depth 2
        • /about/team/alice = depth 3

Setting max depth to 2 downloads everything except /about/team/alice.
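
The bookkeeping behind this is simple: every discovered link inherits its parent page's depth plus one, and links that would exceed the limit are never enqueued. A minimal sketch with illustrative names:

import Foundation

struct CrawlItem {
    let url: URL
    let depth: Int   // link-clicks from the seed; the seed itself is depth 0
}

/// Decide whether links found on `page` should be enqueued.
func enqueueLinks(from page: CrawlItem, links: [URL], maxDepth: Int, into queue: inout [CrawlItem]) {
    let nextDepth = page.depth + 1
    guard nextDepth <= maxDepth else { return }   // children would exceed the limit
    for link in links {
        queue.append(CrawlItem(url: link, depth: nextDepth))
    }
}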

Scope modes

What gets downloaded in each mode:

  • Same domain: Only example.com
  • Include subdomains: example.com, blog.example.com, docs.example.com
  • Same domain + external links: Any domain (use with caution)
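
Host matching for the first two modes reduces to an exact or suffix comparison. A minimal sketch:

import Foundation

/// Is `url` in scope for a crawl seeded at `seedHost`?
func isInScope(_ url: URL, seedHost: String, includeSubdomains: Bool) -> Bool {
    guard let host = url.host?.lowercased() else { return false }
    let seed = seedHost.lowercased()
    if host == seed { return true }                        // same domain
    if includeSubdomains, host.hasSuffix("." + seed) {     // blog.example.com, docs.example.com, ...
        return true
    }
    return false
}

The "." prefix in the suffix check is what keeps notexample.com from matching a crawl seeded at example.com.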

Other controls

  • Path filters. Include or exclude URL path patterns.
  • File type filter. Restrict to specific MIME types.
  • File size limits. Skip tiny tracker pixels or huge video files.
  • Max pages per crawl. Cap the total pages downloaded.
  • Max crawl time. Set a time limit.
  • Robots.txt. Respected by default (RFC 9309). Disable in settings if you own the site.

3-pillar MCP

Crawlio exposes ~362 tools across 3 pillars through its MCP server:

  1. Chrome Extension (~114 tools). Live browser automation via CDP.
  2. Headless Agent (~199 tools). Background automation without a visible browser.
  3. Crawlio App (49 tools). Crawl control, export, intelligence, and vault.

Five aggregator meta-tools unify all pillars behind a single interface. See the MCP Overview for details.

