Crawlio Docs

Architecture

The crawl pipeline

When you start a crawl, Crawlio runs a five-stage pipeline:

  1. Acquire. Download every reachable artifact over HTTP.
  2. Parse. Tokenize every byte into its semantic parts.
  3. Extract. Classify every artifact by type and role.
  4. Localize. Rewrite every URL for offline browsing.
  5. Export. Package into a format a human or AI can consume.

The stages overlap rather than running strictly one after another: downloading, parsing, and localizing happen in parallel.

Stage 1: Acquire

Download every reachable resource: HTML pages, stylesheets, scripts, images, fonts, PDFs, sitemaps, and robots.txt. Respect scope, depth limits, and politeness rules. Per-host connection limits prevent overwhelming any single server.

Stage 2: Parse

Run each downloaded file through a type-specific parser. Extract every URL, link, and reference. Build a complete graph of what references what.
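
One way to picture the output of this stage is an adjacency map from each resource to the URLs it references. A minimal sketch, with illustrative names rather than Crawlio's internal types:

import Foundation

/// Records which resource references which, as discovered by the parsers.
struct ReferenceGraph {
    private(set) var edges: [URL: Set<URL>] = [:]

    /// Register that `source` (an HTML page, stylesheet, script, ...) references `target`.
    mutating func addReference(from source: URL, to target: URL) {
        edges[source, default: []].insert(target)
    }

    /// Every resource that `source` points at.
    func references(of source: URL) -> Set<URL> {
        edges[source] ?? []
    }

    /// Every resource that points at `target` (useful later for orphan-page analysis).
    func referrers(of target: URL) -> [URL] {
        edges.filter { $0.value.contains(target) }.map { $0.key }
    }
}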

Stage 3: Extract

Classify each file by role. Separate the page content from scaffolding (tracking scripts, hydration code, build tooling). Normalize HTML. Generate clean Markdown.

Stage 4: Localize

Rewrite every URL in every file so the entire site works offline. Convert absolute CDN URLs to relative local paths. Preserve the link structure exactly.

Stage 5: Export

Package everything into one of seven formats: browsable folder, ZIP archive, single self-contained HTML file, WARC web archive, PDF snapshot, deploy-ready static site, or extracted structured content.


What gets parsed

Crawlio selects a parser based on the response content type. 12 parsers cover all common web formats:

  • HTML: Links, images, scripts, stylesheets, meta tags, srcset, inline styles, data attributes
  • CSS: @import URLs, url() references, @font-face sources, image-set()
  • SVG: <use>, <image>, <a>, <feImage>, <textPath> URLs, embedded CSS
  • JavaScript (URLs): fetch(), import(), ES module paths, absolute URL literals
  • JavaScript (chunks): Webpack/Vite chunk paths from build manifests
  • PDF: Link annotations, URL patterns in text content
  • XML sitemap: <loc> URLs, recursive sitemap index discovery, gzip decompression
  • robots.txt: Allow/Disallow rules, Crawl-delay, Sitemap URLs (RFC 9309)
  • Web App Manifest: Icons, start_url, scope
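
Internally, parser selection is a dispatch on the MIME type from the Content-Type header. A minimal sketch, with illustrative type names rather than Crawlio's actual classes:

import Foundation

/// A parser extracts outgoing URLs from one kind of resource.
protocol ResourceParser {
    func extractURLs(from data: Data, baseURL: URL) -> [URL]
}

// Illustrative stand-ins; each real parser implements ResourceParser.
struct HTMLParser: ResourceParser { func extractURLs(from data: Data, baseURL: URL) -> [URL] { [] } }
struct CSSParser: ResourceParser { func extractURLs(from data: Data, baseURL: URL) -> [URL] { [] } }
struct SitemapParser: ResourceParser { func extractURLs(from data: Data, baseURL: URL) -> [URL] { [] } }

/// Pick a parser from a Content-Type value such as "text/html; charset=utf-8".
func parser(forContentType contentType: String) -> ResourceParser? {
    let mime = contentType
        .split(separator: ";").first?
        .trimmingCharacters(in: .whitespaces)
        .lowercased() ?? ""
    switch mime {
    case "text/html", "application/xhtml+xml":
        return HTMLParser()
    case "text/css":
        return CSSParser()
    case "application/xml", "text/xml":
        return SitemapParser()
    default:
        return nil   // binary assets (images, fonts, ...) are stored but not parsed for links
    }
}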

The HTML parser is a custom character-by-character scanner. Real-world HTML ships with unclosed tags, mismatched quotes, and encoding errors. A DOM parser would reject half the web.
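
To illustrate the tolerance (a heavily simplified sketch, not Crawlio's actual scanner): find a link-bearing attribute, then read its value whether it is double-quoted, single-quoted, or bare, and never let a missing quote swallow the rest of the page:

import Foundation

/// Pull attribute values (e.g. href) out of HTML without requiring well-formed markup.
func scanAttributeValues(in html: String, attribute: String = "href") -> [String] {
    var values: [String] = []
    var cursor = html.startIndex
    while let hit = html.range(of: attribute + "=", options: .caseInsensitive,
                               range: cursor..<html.endIndex) {
        var i = hit.upperBound
        // Optional opening quote (either kind); remember it so we know where the value ends.
        var quote: Character? = nil
        if i < html.endIndex, html[i] == "\"" || html[i] == "'" {
            quote = html[i]
            i = html.index(after: i)
        }
        var value = ""
        while i < html.endIndex {
            let c = html[i]
            // Quoted values end at the matching quote; bare values end at whitespace.
            // Either kind ends at '>' so a missing close quote cannot eat the rest of the page.
            if c == quote || c == ">" || (quote == nil && c.isWhitespace) { break }
            value.append(c)
            i = html.index(after: i)
        }
        if !value.isEmpty { values.append(value) }
        cursor = i
    }
    return values
}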

Framework-aware content processors

After parsing, framework-specific processors discover additional URLs. Next.js sites trigger RSC payload extraction and _next/static/ chunk scanning. Astro sites trigger island component URL discovery. WordPress sites get wp-content and wp-includes path scanning.

Supported frameworks: Next.js, Astro, Svelte/SvelteKit, Vue/Nuxt, React, Angular, Gatsby, WordPress, Hugo, Jekyll.
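
For example, a Next.js-aware pass can sweep raw page source for /_next/static/ asset paths that generic link extraction might miss. A hedged sketch, not Crawlio's actual processor:

import Foundation

/// Collect "/_next/static/..." asset paths referenced anywhere in a page's source.
func nextStaticPaths(in source: String) -> [String] {
    var paths: Set<String> = []
    var searchRange = source.startIndex..<source.endIndex
    while let hit = source.range(of: "/_next/static/", range: searchRange) {
        // Extend the match until a character that cannot appear in the path.
        var end = hit.upperBound
        let terminators: Set<Character> = ["\"", "'", "\\", ")", ">", "<", " ", "\n", "\t", ","]
        while end < source.endIndex, !terminators.contains(source[end]) {
            end = source.index(after: end)
        }
        paths.insert(String(source[hit.lowerBound..<end]))
        searchRange = end..<source.endIndex
    }
    return paths.sorted()
}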


What gets analyzed

19 site analyzers run after the crawl completes. Each analyzer produces findings with severity levels (critical, warning, notice, info) and specific recommendations.

What each analyzer checks:

  • SEO: Title, meta description, headings, canonical URLs, word count, readability
  • Social meta: Open Graph tags, Twitter cards, relative image URLs, URL mismatches
  • Accessibility: Alt text, ARIA attributes, heading hierarchy, color contrast hints
  • Security: HTTPS, HSTS, CSP, X-Frame-Options, cookie flags, server header leaks
  • Best practices: Viewport meta, favicon, broken links, large images, deprecated HTML
  • Content quality: Thin content, duplicate titles, missing descriptions
  • Duplicate content: Exact duplicates (SHA-256) and near-duplicates with similarity scores
  • Link intelligence: Internal/external link counts, orphan pages, link graph analysis
  • Orphan pages: Pages with no internal links pointing to them, sitemap cross-reference
  • Redirect chains: Loops, long chains, mixed permanent/temporary, HTTP-to-HTTPS
  • URL hygiene: Trailing slashes, mixed case, query parameter bloat, fragment misuse
  • Tracking: Third-party trackers, analytics scripts, ad networks, cookie surface
  • Images: Missing alt text, oversized images, missing dimensions, format suggestions
  • Keywords: Top keywords by frequency, co-occurring groups, density analysis
  • Design system: Colors, typography, spacing, breakpoints, component detection
  • Hreflang: Language/region tag validation, self-referencing, return link verification
  • 404 detection: Broken internal links, missing assets, soft 404 pages
  • Parity: HTML/JS rendering differences (title, h1, meta, canonical, robots)
  • Technology: Technology fingerprinting across 59+ technologies

Results feed into a scoring system that grades the site across 15 categories. Each category gets a 0-100 score with color-coded thresholds.
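
The banding itself is simple; a sketch with illustrative cut-offs, not Crawlio's actual thresholds:

/// Map a 0-100 category score to a color-coded band. Thresholds here are illustrative.
enum GradeBand: String {
    case good = "green", needsWork = "orange", poor = "red"
}

func band(forScore score: Int) -> GradeBand {
    switch score {
    case 90...:   return .good
    case 50..<90: return .needsWork
    default:      return .poor
    }
}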


Intelligence layer

Beyond basic analysis, Crawlio provides deeper intelligence capabilities.

Technology fingerprinting. Identifies 59+ technologies including frameworks, CMS platforms, analytics tools, CDNs, and hosting providers. Detection uses a 4-layer approach (see Framework Detection below).

API schema discovery. Analyzes network traffic to discover API endpoints, extract request/response patterns, and generate OpenAPI 3.0.3 specifications.

Traffic analysis. Classifies network requests into categories (document, script, style, image, font, fetch, tracker, analytics, ad) and maps the service dependency graph.

Vision OCR. Extracts text from downloaded images using Apple's Vision framework. Supports PNG, JPEG, TIFF, BMP, and WebP. Results are included in exports (deploy.json, crawl-manifest.json, WARC metadata records). Opt-in via settings.
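
A minimal sketch of the recognition step using the Vision framework (error handling and the export plumbing are omitted):

import Foundation
import Vision

/// Recognize text in a downloaded image and return it as plain lines.
func recognizeText(at imageURL: URL) throws -> [String] {
    var lines: [String] = []
    let request = VNRecognizeTextRequest { request, _ in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        lines = observations.compactMap { $0.topCandidates(1).first?.string }
    }
    request.recognitionLevel = .accurate
    request.usesLanguageCorrection = true

    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    try handler.perform([request])   // synchronous; the completion handler runs before this returns
    return lines
}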


Concurrency model

Crawlio coordinates 25+ concurrent workers for downloading, parsing, and analysis. You control concurrency through settings:

  • Parallel connections (1-40). How many downloads run at once.
  • Per-host connection limit (default: 6). Prevents overwhelming a single server.
  • Crawl delay. Minimum time between requests to the same host.
  • Bandwidth throttle. Token bucket rate limiter for download speed.
  • Circuit breaker. Automatically backs off from hosts returning errors. Resumes with a probe request after cooldown.

The download loop uses fair queuing across hosts. Without fairness, a CDN with 10,000 assets would starve the main domain's HTML pages. Round-robin dequeuing ensures every domain gets download slots proportionally.
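
A per-host round-robin queue captures the idea. A minimal sketch (Crawlio's scheduler also accounts for per-host connection limits and crawl delay):

import Foundation

/// Round-robin queue: each host keeps its own FIFO, and dequeuing cycles across hosts
/// so a huge CDN cannot starve the main domain.
struct FairURLQueue {
    private var perHost: [String: [URL]] = [:]
    private var hostOrder: [String] = []
    private var cursor = 0

    mutating func enqueue(_ url: URL) {
        guard let host = url.host else { return }
        if perHost[host] == nil { hostOrder.append(host) }
        perHost[host, default: []].append(url)
    }

    mutating func dequeue() -> URL? {
        guard !hostOrder.isEmpty else { return nil }
        // Try each host once, starting after the last one we served.
        for _ in 0..<hostOrder.count {
            cursor = (cursor + 1) % hostOrder.count
            let host = hostOrder[cursor]
            if var queue = perHost[host], !queue.isEmpty {
                let url = queue.removeFirst()
                perHost[host] = queue
                return url
            }
        }
        return nil
    }
}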

URL deduplication uses a normalized string set. Normalization includes lowercasing the host, stripping fragments, sorting query parameters, and IDNA punycode conversion.
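
A sketch of that normalization with URLComponents (the IDNA/punycode step is left out here):

import Foundation

/// Normalize a URL into a stable deduplication key: lowercase host, no fragment,
/// query parameters sorted by name. (IDNA/punycode conversion is omitted in this sketch.)
func dedupeKey(for url: URL) -> String? {
    guard var components = URLComponents(url: url, resolvingAgainstBaseURL: true) else { return nil }
    components.scheme = components.scheme?.lowercased()
    components.host = components.host?.lowercased()
    components.fragment = nil
    if let items = components.queryItems, !items.isEmpty {
        components.queryItems = items.sorted { $0.name < $1.name }
    }
    return components.string
}

With this, "https://Example.com/a?b=2&a=1#top" and "https://example.com/a?a=1&b=2" collapse to the same key.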


Framework detection

Crawlio identifies the JavaScript framework powering a site through 4 detection layers:

  • Static: Inspects HTML markup, script attributes, and path patterns in downloaded source
  • Dynamic: Runs detection JavaScript in a WebKit browser to check global variables and DOM state
  • Merged: Combines static and dynamic results; on conflict, dynamic confidence wins and the signals are unioned
  • Technography: Full technology fingerprinting across 59+ technologies with version and confidence

Detection signals per framework:

  • React: data-reactroot, _reactRootContainer
  • Next.js: __NEXT_DATA__ script tag, _next/ paths
  • Vue: data-v- attributes, __vue__
  • Angular: ng-version, ng-app
  • Svelte: Svelte component class patterns
  • Gatsby: ___gatsby div, page-data paths
  • Nuxt: __NUXT__ global, _nuxt/ paths
  • Astro: astro-island elements
  • WordPress: wp-content, wp-includes paths
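
A minimal sketch of the static layer, matching a few of the signals above against downloaded HTML (signal lists abbreviated, confidence scoring omitted):

import Foundation

/// Tiny static detector: count how many known markers for each framework
/// appear in the HTML source and report the best match.
func detectFramework(inHTML html: String) -> (name: String, signals: [String])? {
    let markers: [String: [String]] = [
        "Next.js":   ["__NEXT_DATA__", "/_next/"],
        "React":     ["data-reactroot", "_reactRootContainer"],
        "Vue":       ["data-v-", "__vue__"],
        "Nuxt":      ["__NUXT__", "/_nuxt/"],
        "Angular":   ["ng-version"],
        "Gatsby":    ["___gatsby", "page-data"],
        "Astro":     ["astro-island"],
        "WordPress": ["wp-content", "wp-includes"],
    ]
    var best: (name: String, signals: [String])? = nil
    for (name, patterns) in markers {
        let hits = patterns.filter { html.contains($0) }
        if hits.count > (best?.signals.count ?? 0) {
            best = (name, hits)
        }
    }
    return best
}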

Detection helps the engine make better decisions. Next.js sites trigger RSC payload extraction. WordPress sites get specialized URL normalization. SPAs get a recommendation to enable WebKit mode.

See the Framework Detection guide for details.


Export

Seven export formats, plus three metadata exports:

  • Folder: Browsable offline copy with localized links
  • ZIP: Compressed archive of the folder export
  • Single HTML: One self-contained HTML file with inlined assets
  • WARC: ISO 28500 web archive with SHA-1 digests, CDX index, deduplication, and optional gzip compression
  • PDF: PDF snapshot via WebKit rendering
  • Deploy: Static site ready for hosting (flattened paths, redirect stubs, sitemap, junk removal)
  • Extracted: Structured content (normalized HTML, Markdown, metadata)

Metadata exports (included alongside the main format):

  • deploy.json: Per-page metadata, enrichment data, OCR text
  • crawl-manifest.json: Complete crawl manifest with all URLs and metadata
  • CSV: Tabular data with configurable columns, presets, and filters

See the Export Formats guide for details.


Localization

After downloading, Crawlio rewrites URLs so the site works offline.

Before:

<a href="https://example.com/about">About</a>
<link rel="stylesheet" href="https://cdn.example.com/style.css">

After:

<a href="../about/index.html">About</a>
<link rel="stylesheet" href="../style.css">

Localization covers HTML (15 tag types), CSS (url(), @import, image-set()), JavaScript (absolute URLs, JSON-escaped paths), SVG (href attributes), inline styles, srcset, data-* attributes, and protocol-relative URLs. SRI attributes are stripped, since integrity hashes no longer match rewritten files. Base tags are neutralized.
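
The core of the rewrite is mapping each remote URL to its local file (via the link map), then computing a relative path from the referencing file to that target. A sketch of the relative-path step; the local file layout is illustrative:

/// Relative path from the directory containing `fromFile` to `toFile`,
/// e.g. relativePath(from: "about/index.html", to: "style.css") == "../style.css".
func relativePath(from fromFile: String, to toFile: String) -> String {
    let fromDir = fromFile.split(separator: "/").dropLast()   // directories of the referencing file
    let toParts = toFile.split(separator: "/")

    // Skip the directories the two paths share.
    var common = 0
    while common < fromDir.count, common < toParts.count - 1, fromDir[common] == toParts[common] {
        common += 1
    }
    let ups = Array(repeating: "..", count: fromDir.count - common)
    let rest = toParts.dropFirst(common).map { String($0) }
    return (ups + rest).joined(separator: "/")
}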

Progressive localization

Links are localized twice:

  1. During crawl. Each file is localized immediately after download using a partial link map. Even interrupted crawls produce partially-browsable offline sites.
  2. After crawl. A final pass re-localizes all files using the complete link map.

Scope and depth rules

Depth limit

Depth counts link-clicks from the seed URL:

  • example.com = depth 0
    • /about = depth 1
      • /about/team = depth 2
        • /about/team/alice = depth 3

Setting max depth to 2 downloads everything except /about/team/alice.
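
The bookkeeping behind this is simple: every discovered link inherits its parent page's depth plus one, and links that would exceed the limit are never enqueued. A minimal sketch with illustrative names:

import Foundation

struct CrawlItem {
    let url: URL
    let depth: Int   // link-clicks from the seed; the seed itself is depth 0
}

/// Decide whether links found on `page` should be enqueued.
func enqueueLinks(from page: CrawlItem, links: [URL], maxDepth: Int, into queue: inout [CrawlItem]) {
    let nextDepth = page.depth + 1
    guard nextDepth <= maxDepth else { return }   // children would exceed the limit
    for link in links {
        queue.append(CrawlItem(url: link, depth: nextDepth))
    }
}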

Scope modes

What gets downloaded in each mode:

  • Same domain: Only example.com
  • Include subdomains: example.com, blog.example.com, docs.example.com
  • Same domain + external links: Any domain (use with caution)
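
Host matching for the first two modes reduces to an exact or suffix comparison. A minimal sketch:

import Foundation

/// Is `url` in scope for a crawl seeded at `seedHost`?
func isInScope(_ url: URL, seedHost: String, includeSubdomains: Bool) -> Bool {
    guard let host = url.host?.lowercased() else { return false }
    let seed = seedHost.lowercased()
    if host == seed { return true }                        // same domain
    if includeSubdomains, host.hasSuffix("." + seed) {     // blog.example.com, docs.example.com, ...
        return true
    }
    return false
}

The "." prefix in the suffix check is what keeps notexample.com from matching a crawl seeded at example.com.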

Other controls

  • Path filters. Include or exclude URL path patterns.
  • File type filter. Restrict to specific MIME types.
  • File size limits. Skip tiny tracker pixels or huge video files.
  • Max pages per crawl. Cap the total pages downloaded.
  • Max crawl time. Set a time limit.
  • Robots.txt. Respected by default (RFC 9309). Disable in settings if you own the site.

3-pillar MCP

Crawlio exposes ~362 tools across 3 pillars through its MCP server:

  1. Chrome Extension (~114 tools). Live browser automation via CDP.
  2. Headless Agent (~199 tools). Background automation without a visible browser.
  3. Crawlio App (49 tools). Crawl control, export, intelligence, and vault.

Five aggregator meta-tools unify all pillars behind a single interface. See the MCP Overview for details.

