Architecture
The crawl pipeline
When you start a crawl, Crawlio runs a five-stage pipeline:
The stages overlap rather than run as strict sequential phases: downloading, parsing, and localizing happen in parallel.
Stage 1: Acquire
Download every reachable resource: HTML pages, stylesheets, scripts, images, fonts, PDFs, sitemaps, and robots.txt. Respect scope, depth limits, and politeness rules. Per-host connection limits prevent overwhelming any single server.
Stage 2: Parse
Run each downloaded file through a type-specific parser. Extract every URL, link, and reference. Build a complete graph of what references what.
Stage 3: Extract
Classify each file by role. Separate the page content from scaffolding (tracking scripts, hydration code, build tooling). Normalize HTML. Generate clean Markdown.
Stage 4: Localize
Rewrite every URL in every file so the entire site works offline. Convert absolute CDN URLs to relative local paths. Preserve the link structure exactly.
Stage 5: Export
Package everything into one of seven formats: browsable folder, ZIP archive, single self-contained HTML file, WARC web archive, PDF snapshot, deploy-ready static site, or extracted structured content.
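The staged-but-overlapping flow can be sketched as a chain of queues. This is a minimal illustration, not Crawlio's implementation; the stage names and payloads are made up:

```python
import asyncio

async def run_stage(inbox: asyncio.Queue, outbox: asyncio.Queue, work):
    """Consume items as soon as they arrive and pass results downstream."""
    while (item := await inbox.get()) is not None:
        await outbox.put(await work(item))
    await outbox.put(None)  # propagate the end-of-stream marker

async def crawl(urls):
    # Three illustrative stages chained by queues: acquire -> parse -> localize.
    # Because each stage consumes as items arrive, one page can be parsed
    # while later pages are still downloading.
    q0, q1, q2, out = (asyncio.Queue() for _ in range(4))
    stages = [
        run_stage(q0, q1, lambda u: asyncio.sleep(0, f"html:{u}")),
        run_stage(q1, q2, lambda h: asyncio.sleep(0, f"parsed:{h}")),
        run_stage(q2, out, lambda p: asyncio.sleep(0, f"local:{p}")),
    ]
    for u in urls:
        q0.put_nowait(u)
    q0.put_nowait(None)
    await asyncio.gather(*stages)
    results = []
    while not out.empty():
        if (item := out.get_nowait()) is not None:
            results.append(item)
    return results

results = asyncio.run(crawl(["/a", "/b"]))
```

The same shape scales to the real pipeline: adding workers per stage is a matter of spawning more `run_stage` consumers on the same queues.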
What gets parsed
Crawlio selects a parser based on the response content type. Twelve parsers cover all common web formats:
| Content type | What gets extracted |
|---|---|
| HTML | Links, images, scripts, stylesheets, meta tags, srcset, inline styles, data attributes |
| CSS | @import URLs, url() references, @font-face sources, image-set() |
| SVG | <use>, <image>, <a>, <feImage>, <textPath> URLs, embedded CSS |
| JavaScript (URLs) | fetch(), import(), ES module paths, absolute URL literals |
| JavaScript (chunks) | Webpack/Vite chunk paths from build manifests |
| PDF | Link annotations, URL patterns in text content |
| XML sitemap | <loc> URLs, recursive sitemap index discovery, gzip decompression |
| robots.txt | Allow/Disallow rules, Crawl-delay, Sitemap URLs (RFC 9309) |
| Web App Manifest | Icons, start_url, scope |
The HTML parser is a custom character-by-character scanner. Real-world HTML ships with unclosed tags, mismatched quotes, and encoding errors. A DOM parser would reject half the web.
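A toy version of that tolerance, assuming a scanner that hunts for `href=`/`src=` values directly (the real parser extracts many more attribute types):

```python
def extract_urls(html: str) -> list[str]:
    """Character-level scan for href=/src= values.

    Unlike a strict parser, it never rejects input: an unclosed quote
    simply ends the current value at end of input.
    """
    urls = []
    low = html.lower()
    n = len(html)
    for attr in ("href=", "src="):
        i = 0
        while (i := low.find(attr, i)) != -1:
            j = i + len(attr)
            if j < n and html[j] in "\"'":
                # Quoted value: read until the matching quote.
                k = html.find(html[j], j + 1)
                k = k if k != -1 else n  # unclosed quote: take the rest
                urls.append(html[j + 1:k])
            else:
                # Unquoted value: read until whitespace or tag close.
                k = j
                while k < n and html[k] not in " >\t\r\n":
                    k += 1
                urls.append(html[j:k])
            i = j
    return urls
```

Note the unquoted-attribute branch: markup like `<img src=/logo.png>` is common in the wild and still yields a usable URL.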
Framework-aware content processors
After parsing, framework-specific processors discover additional URLs. Next.js sites trigger RSC payload extraction and _next/static/ chunk scanning. Astro sites trigger island component URL discovery. WordPress sites get wp-content and wp-includes path scanning.
Supported frameworks: Next.js, Astro, Svelte/SvelteKit, Vue/Nuxt, React, Angular, Gatsby, WordPress, Hugo, Jekyll.
What gets analyzed
19 site analyzers run after the crawl completes. Each analyzer produces findings with severity levels (critical, warning, notice, info) and specific recommendations.
| Analyzer | What it checks |
|---|---|
| SEO | Title, meta description, headings, canonical URLs, word count, readability |
| Social meta | Open Graph tags, Twitter cards, relative image URLs, URL mismatches |
| Accessibility | Alt text, ARIA attributes, heading hierarchy, color contrast hints |
| Security | HTTPS, HSTS, CSP, X-Frame-Options, cookie flags, server header leaks |
| Best practices | Viewport meta, favicon, broken links, large images, deprecated HTML |
| Content quality | Thin content, duplicate titles, missing descriptions |
| Duplicate content | Exact duplicates (SHA-256) and near-duplicates with similarity scores |
| Link intelligence | Internal/external link counts, orphan pages, link graph analysis |
| Orphan pages | Pages with no internal links pointing to them, sitemap cross-reference |
| Redirect chains | Loops, long chains, mixed permanent/temporary, HTTP-to-HTTPS |
| URL hygiene | Trailing slashes, mixed case, query parameter bloat, fragment misuse |
| Tracking | Third-party trackers, analytics scripts, ad networks, cookie surface |
| Images | Missing alt text, oversized images, missing dimensions, format suggestions |
| Keywords | Top keywords by frequency, co-occurring groups, density analysis |
| Design system | Colors, typography, spacing, breakpoints, component detection |
| Hreflang | Language/region tag validation, self-referencing, return link verification |
| 404 detection | Broken internal links, missing assets, soft 404 pages |
| Parity | HTML/JS rendering differences (title, h1, meta, canonical, robots) |
| Technology | Technology fingerprinting across 59+ technologies |
Results feed into a scoring system that grades the site across 15 categories. Each category gets a 0-100 score with color-coded thresholds.
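As one example, the duplicate-content analyzer's two modes map onto well-known techniques. A sketch, assuming SHA-256 digests for exact matches and k-shingle Jaccard overlap for similarity scores (the similarity metric Crawlio actually uses is not specified here):

```python
import hashlib

def exact_fingerprint(text: str) -> str:
    # Exact duplicates: identical content produces identical SHA-256 digests.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def similarity(a: str, b: str, k: int = 3) -> float:
    # Near-duplicates: Jaccard similarity over k-word shingles.
    def shingles(t: str) -> set[str]:
        words = t.split()
        return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
```

Pages with identical fingerprints are flagged as exact duplicates; pairs whose similarity crosses a threshold are reported with their score.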
Intelligence layer
Beyond basic analysis, Crawlio provides deeper intelligence capabilities.
Technology fingerprinting. Identifies 59+ technologies including frameworks, CMS platforms, analytics tools, CDNs, and hosting providers. Detection uses a 4-layer approach (see Framework Detection below).
API schema discovery. Analyzes network traffic to discover API endpoints, extract request/response patterns, and generate OpenAPI 3.0.3 specifications.
Traffic analysis. Classifies network requests into categories (document, script, style, image, font, fetch, tracker, analytics, ad) and maps the service dependency graph.
Vision OCR. Extracts text from downloaded images using Apple Vision framework. Supports PNG, JPEG, TIFF, BMP, and WebP. Results are included in exports (deploy.json, crawl-manifest.json, WARC metadata records). Opt-in via settings.
Concurrency model
Crawlio coordinates 25+ concurrent workers for downloading, parsing, and analysis. You control concurrency through settings:
- Parallel connections (1-40). How many downloads run at once.
- Per-host connection limit (default: 6). Prevents overwhelming a single server.
- Crawl delay. Minimum time between requests to the same host.
- Bandwidth throttle. Token bucket rate limiter for download speed.
- Circuit breaker. Automatically backs off from hosts returning errors. Resumes with a probe request after cooldown.
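The bandwidth throttle's token-bucket idea, in miniature (the rates and bytes-based accounting here are illustrative):

```python
import time

class TokenBucket:
    """Allow roughly `rate` bytes/sec, with bursts up to `capacity` bytes."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def consume(self, n: float) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller should sleep and retry
```

A download worker asks the bucket for permission before each chunk; denied requests wait, which smooths throughput to the configured rate.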
The download loop uses fair queuing across hosts. Without fairness, a CDN with 10,000 assets would starve the main domain's HTML pages. Round-robin dequeuing ensures every domain gets download slots proportionally.
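A minimal sketch of that round-robin dequeuing, assuming one FIFO queue per host:

```python
from collections import OrderedDict, deque

class FairQueue:
    """Round-robin over per-host queues so one host can't starve the others."""

    def __init__(self):
        self.hosts = OrderedDict()  # host -> deque of pending URLs

    def push(self, host: str, url: str):
        self.hosts.setdefault(host, deque()).append(url)

    def pop(self):
        if not self.hosts:
            return None
        host, queue = next(iter(self.hosts.items()))
        url = queue.popleft()
        del self.hosts[host]
        if queue:
            self.hosts[host] = queue  # re-append: host moves to the back
        return url
```

With 10,000 CDN assets and a handful of HTML pages queued, pops still alternate between hosts, so the main domain keeps getting download slots.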
URL deduplication uses a normalized string set. Normalization includes lowercasing the host, stripping fragments, sorting query parameters, and IDNA punycode conversion.
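A plausible reading of those normalization rules (the exact canonical form Crawlio uses may differ, and port handling is omitted here):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Produce a canonical key for the deduplication set."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    host = host.encode("idna").decode("ascii")  # IDNA punycode conversion
    query = urlencode(sorted(parse_qsl(parts.query)))  # sorted query params
    # Fragment is dropped entirely (last tuple element left empty).
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/", query, ""))
```

Two spellings of the same resource then collapse to one set entry, so it is fetched only once.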
Framework detection
Crawlio identifies the JavaScript framework powering a site through 4 detection layers:
| Layer | How it works |
|---|---|
| Static | Inspects HTML markup, script attributes, and path patterns in downloaded source |
| Dynamic | Runs detection JavaScript in a WebKit browser to check global variables and DOM state |
| Merged | Combines static and dynamic results. Dynamic confidence wins on conflict, signals are unioned |
| Technography | Full technology fingerprinting across 59+ technologies with version and confidence |
Detection signals per framework:
| Framework | Signals |
|---|---|
| React | data-reactroot, _reactRootContainer |
| Next.js | __NEXT_DATA__ script tag, _next/ paths |
| Vue | data-v- attributes, __vue__ |
| Angular | ng-version, ng-app |
| Svelte | Svelte component class patterns |
| Gatsby | ___gatsby div, page-data paths |
| Nuxt | __NUXT__ global, _nuxt/ paths |
| Astro | astro-island elements |
| WordPress | wp-content, wp-includes paths |
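Using the signal table above, the static layer can be approximated as substring matching over the downloaded HTML (confidence weighting, version extraction, and the dynamic layer are omitted):

```python
# Markers taken from the signal table; the real detector uses more per framework.
SIGNALS = {
    "Next.js": ["__NEXT_DATA__", "_next/"],
    "React": ["data-reactroot", "_reactRootContainer"],
    "Vue": ["data-v-", "__vue__"],
    "Angular": ["ng-version"],
    "Gatsby": ["___gatsby", "page-data"],
    "Nuxt": ["__NUXT__", "_nuxt/"],
    "Astro": ["astro-island"],
    "WordPress": ["wp-content", "wp-includes"],
}

def detect_static(html: str) -> dict[str, int]:
    """Count matched markers per framework; more hits mean higher confidence."""
    hits = {fw: sum(marker in html for marker in markers)
            for fw, markers in SIGNALS.items()}
    return {fw: n for fw, n in hits.items() if n}
```

The dynamic layer complements this by evaluating globals like `__NUXT__` in a live WebKit page, which catches frameworks that leave no static markup behind.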
Detection helps the engine make better decisions. Next.js sites trigger RSC payload extraction. WordPress sites get specialized URL normalization. SPAs get a recommendation to enable WebKit mode.
See the Framework Detection guide for details.
Export
Seven export formats, plus three metadata exports:
| Format | What you get |
|---|---|
| Folder | Browsable offline copy with localized links |
| ZIP | Compressed archive of the folder export |
| Single HTML | One self-contained HTML file with inlined assets |
| WARC | ISO 28500 web archive with SHA-1 digests, CDX index, deduplication, and optional gzip compression |
| PDF | Page snapshot rendered via WebKit |
| Deploy | Static site ready for hosting (flattened paths, redirect stubs, sitemap, junk removal) |
| Extracted | Structured content (normalized HTML, Markdown, metadata) |
Metadata exports (included alongside the main format):
- deploy.json: Per-page metadata, enrichment data, OCR text
- crawl-manifest.json: Complete crawl manifest with all URLs and metadata
- CSV: Tabular data with configurable columns, presets, and filters
See the Export Formats guide for details.
Link localization
After downloading, Crawlio rewrites URLs so the site works offline.
Before:

    <a href="https://example.com/about">About</a>
    <link rel="stylesheet" href="https://cdn.example.com/style.css">

After:

    <a href="../about/index.html">About</a>
    <link rel="stylesheet" href="../style.css">

Localization covers HTML (15 tag types), CSS (url(), @import, image-set()), JavaScript (absolute URLs, JSON-escaped paths), SVG (href attributes), inline styles, srcset, data-* attributes, and protocol-relative URLs. SRI attributes are stripped (integrity hashes break with rewritten URLs). Base tags are neutralized.
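The rewriting in the example above is, at its core, relative-path math. A sketch assuming directory URLs map to index.html files, as in the example; the real rewriter also handles query strings, cross-host asset mapping, and filename collisions:

```python
import posixpath
from urllib.parse import urlsplit

def localize(link: str, current_page: str) -> str:
    """Rewrite an absolute link to a path relative to the current page."""
    target = urlsplit(link).path or "/"
    # Directory-style URLs become <dir>/index.html on disk.
    if target.endswith("/") or "." not in posixpath.basename(target):
        target = target.rstrip("/") + "/index.html"
    here = posixpath.dirname(urlsplit(current_page).path)
    return posixpath.relpath(target, here or "/")
```

From a page saved at /contact/index.html, a link to /about becomes ../about/index.html, exactly the shape shown above.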
Progressive localization
Links are localized twice:
- During crawl. Each file is localized immediately after download using a partial link map. Even interrupted crawls produce partially-browsable offline sites.
- After crawl. A final pass re-localizes all files using the complete link map.
Scope and depth rules
Depth limit
Depth counts link-clicks from the seed URL:
- example.com = depth 0
- /about = depth 1
- /about/team = depth 2
- /about/team/alice = depth 3
Setting max depth to 2 downloads everything except /about/team/alice.
Scope modes
| Scope | What gets downloaded |
|---|---|
| Same domain | Only example.com |
| Include subdomains | example.com, blog.example.com, docs.example.com |
| Same domain + external links | Any domain (use with caution) |
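A minimal scope check matching the table (the mode names are illustrative):

```python
from urllib.parse import urlsplit

def in_scope(url: str, seed_host: str, mode: str) -> bool:
    """mode: 'same' | 'subdomains' | 'external'."""
    host = (urlsplit(url).hostname or "").lower()
    if mode == "external":
        return True  # any domain; use with caution
    if mode == "subdomains":
        # Suffix match on ".seed" so notexample.com never matches example.com.
        return host == seed_host or host.endswith("." + seed_host)
    return host == seed_host
```

The leading dot in the suffix check matters: without it, a lookalike domain ending in the seed's name would slip into scope.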
Other controls
- Path filters. Include or exclude URL path patterns.
- File type filter. Restrict to specific MIME types.
- File size limits. Skip tiny tracker pixels or huge video files.
- Max pages per crawl. Cap the total pages downloaded.
- Max crawl time. Set a time limit.
- Robots.txt. Respected by default (RFC 9309). Disable in settings if you own the site.
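Python's stdlib robots.txt parser illustrates the Allow/Disallow and Crawl-delay semantics (it predates RFC 9309, so treat it as an approximation of what Crawlio enforces; the user-agent name here is made up):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
""".splitlines())

# Paths under a Disallow rule are refused; everything else is allowed.
allowed = rp.can_fetch("CrawlioBot", "https://example.com/about")
blocked = rp.can_fetch("CrawlioBot", "https://example.com/admin/users")
delay = rp.crawl_delay("CrawlioBot")  # seconds between requests to this host
```

The Crawl-delay value feeds the same per-host pacing described under the concurrency model.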
3-pillar MCP
Crawlio exposes ~362 tools across 3 pillars through its MCP server:
- Chrome Extension (~114 tools). Live browser automation via CDP.
- Headless Agent (~199 tools). Background automation without a visible browser.
- Crawlio App (49 tools). Crawl control, export, intelligence, and vault.
Five aggregator meta-tools unify all pillars behind a single interface. See the MCP Overview for details.
Next steps
- Configure crawl behavior in Settings Reference
- See the HTTP API for programmatic control
- Check File Locations for state files and logs