# Export Formats

## Overview

After a crawl completes, export your downloaded site in any of seven formats. Three metadata exports are also available.
| Format | Output | Best for |
|---|---|---|
| Folder | directory | Default. Offline browsing with rewritten links |
| ZIP | .zip | Sharing compressed archives |
| Single HTML | .html | One-file page snapshots |
| WARC | .warc / .warc.gz | ISO 28500 web archives |
| PDF | .pdf | Full-page PDF rendering |
| Extracted | directory | Clean text + Markdown for AI/search |
| Deploy | directory | Deploy-ready static site |
## Folder

The default export. Preserves the original directory structure with all links rewritten for offline browsing.
```
example.com/
  index.html
  about.html
  css/
    style.css
  js/
    app.js
  images/
    logo.png
    hero.jpg
```

Open `index.html` in any browser. The site works completely offline.
Export via CLI:

```
crawlio export run folder --dest ~/sites/example.com
```

Export via MCP:

```
export_site(format: "folder", destinationPath: "~/sites/example.com")
```

## ZIP

Same structure as Folder, compressed into a single .zip archive. The compression level is configurable (1-9).
```
crawlio export run zip --dest ~/archives/example.zip
```

## Single HTML

Inlines all CSS, JavaScript, and images (as base64 data URIs) into a single self-contained HTML file.
```
crawlio export run singleHTML --dest ~/snapshots/example.html
```

Supported image formats for inlining: PNG, JPG, JPEG, GIF, SVG, WebP, ICO.
Output is capped at 50 MB. Requires an `index.html` in the downloaded content. External resources not in the download set remain as external references.
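To make the mechanics concrete, here is a minimal sketch of the data-URI inlining technique (illustrative only, not Crawlio's actual implementation):

```python
import base64
import mimetypes

def to_data_uri(path: str) -> str:
    """Encode a local file as a base64 data URI (data:image/png;base64,...)."""
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"

# <img src="images/logo.png"> becomes <img src="data:image/png;base64,...">.
# Base64 adds roughly 33% overhead, which is why large sites can approach the 50 MB cap.
```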
## WARC

Web ARChive (ISO 28500), the standard format used by the Internet Archive's Wayback Machine. Each resource is stored with full HTTP headers, response bodies, and metadata.
```
crawlio export run warc --dest ~/archives/example.warc
```

What's included:

- A `warcinfo` header record with Crawlio version and crawl metadata
- Both `request` and `response` records with real HTTP headers
- SHA-1 digests for both block and payload, Base32-encoded per RFC 4648
- Content-Type auto-detected from file extension
- OCR text as `metadata` records (when OCR is enabled), with `WARC-Refers-To` linking
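For illustration, the digest scheme is easy to reproduce (a sketch of the standard WARC convention, not Crawlio's code):

```python
import base64
import hashlib

def warc_payload_digest(payload: bytes) -> str:
    """SHA-1 of the payload, Base32-encoded per RFC 4648, as it appears
    in a WARC-Payload-Digest header."""
    b32 = base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")
    return f"sha1:{b32}"

# A 20-byte SHA-1 digest encodes to a 32-character Base32 string (no padding).
print(warc_payload_digest(b"<html>...</html>"))
```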
Four settings control WARC output:
| Setting | Type | Default | Description |
|---|---|---|---|
| `compressionEnabled` | bool | `true` | Per-record gzip compression. Output file extension is `.warc.gz` when on, `.warc` when off |
| `maxFileSize` | int | `1073741824` | Maximum file size in bytes before splitting (default 1 GB). Set to `0` to disable splitting |
| `cdxEnabled` | bool | `true` | Generate a CDX index file alongside the WARC. CDX entries reference the correct split file |
| `dedupEnabled` | bool | `true` | Skip duplicate responses, detected via SHA-1 payload digest. Duplicates are stored as `revisit` records |
When splitting is active, files are numbered sequentially (`example-00000.warc.gz`, `example-00001.warc.gz`). Each split file gets its own `warcinfo` record, and CDX entries track which split file contains each record.
WARC files can be replayed with ReplayWeb.page or ingested into web archive infrastructure like pywb and warctools.
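To sanity-check an exported archive, one option is the warcio library (our suggestion; any WARC reader works):

```python
from warcio.archiveiterator import ArchiveIterator

# warcio handles both .warc and per-record-gzipped .warc.gz transparently.
with open("example-00000.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            uri = record.rec_headers.get_header("WARC-Target-URI")
            status = record.http_headers.get_statuscode()
            print(status, uri)
```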
## PDF

Full-page PDF rendering via WebKit. Each page is rendered as it would appear in a browser, with full CSS applied.
```
crawlio export run pdf --dest ~/pdfs/example.pdf
```

## Extracted

Clean, structured content extracted from each page: normalized HTML, Markdown, metadata, and asset inventories.
```
output/
  pages/
    index/
      page.html       # Normalized HTML
      content.md      # Markdown conversion
      metadata.json   # Title, description, OG data
      assets.json     # Categorized asset inventory
    about/
      page.html
      content.md
      metadata.json
      assets.json
  manifest.json       # Crawl metadata, site tree
```

The `metadata.json` for each page includes:
```json
{
  "url": "https://example.com/",
  "title": "Example Domain",
  "contentType": "text/html",
  "statusCode": 200,
  "headers": { "h1": ["Example Domain"], "h2": ["About", "Contact"] },
  "links": ["https://example.com/about", "https://example.com/contact"],
  "framework": "nextjs",
  "extractedAt": "2026-02-14T10:30:00Z"
}
```

This format is ideal for AI consumption: feed `content.md` files into RAG pipelines, search indexes, or training datasets.
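As an example, a minimal loader over this layout could pair each page's Markdown with its metadata before handing it to an indexer (a sketch assuming the directory structure shown above):

```python
import json
from pathlib import Path

def load_pages(output_dir: str):
    """Yield (url, title, markdown) for every extracted page."""
    for page_dir in sorted(Path(output_dir, "pages").iterdir()):
        if not page_dir.is_dir():
            continue
        meta = json.loads((page_dir / "metadata.json").read_text())
        markdown = (page_dir / "content.md").read_text()
        yield meta["url"], meta.get("title", ""), markdown

for url, title, md in load_pages("output"):
    print(url, title, len(md))  # hand each document to your RAG store or search index
```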
## Deploy

A deploy-ready static site with clean URL routes and machine-readable manifests.
```
output/
  index.html
  blog/
    index.html          # from blog.html
    series-a/
      index.html        # from blog/series-a.html
  pricing/
    index.html          # from pricing.html
  _assets/
    cdn.example.com/    # cross-domain assets
      fonts/inter.woff2
  _next/static/         # framework assets preserved
  deploy.json           # manifest
  sitemap.xml           # generated
```

Features:
- Multi-host flattening: main domain at root, cross-domain assets under `_assets/{host}/`
- HTML pages converted to clean URL routes (`blog.html` becomes `blog/index.html`; see the sketch below)
- Reference rewriting in HTML/CSS files for the flattened structure
- Skips junk files: API routes, `.css_` duplicates, query-variant hash duplicates
- Generates a `deploy.json` manifest and `sitemap.xml`
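The clean-URL conversion follows a simple rule, sketched here (illustrative only; `clean_route` is a hypothetical helper, not part of Crawlio):

```python
from pathlib import PurePosixPath

def clean_route(html_path: str) -> str:
    """Map a crawled HTML file to its clean-URL location."""
    p = PurePosixPath(html_path)
    if p.name == "index.html":
        return str(p)  # directory indexes are already clean
    return str(p.with_suffix("") / "index.html")

assert clean_route("blog.html") == "blog/index.html"
assert clean_route("blog/series-a.html") == "blog/series-a/index.html"
assert clean_route("index.html") == "index.html"
```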
Deploy to Vercel, Netlify, S3, or any static hosting.
## Metadata exports

Three additional metadata files are generated alongside exports.

### crawl-manifest.json

Full crawl metadata with every downloaded item:
```json
{
  "version": 1,
  "seedURL": "https://example.com",
  "exportedAt": "2026-03-06T10:30:00Z",
  "items": [
    {
      "url": "https://example.com/",
      "localPath": "index.html",
      "contentType": "text/html",
      "size": 15234,
      "statusCode": 200,
      "ocrText": "..."
    }
  ]
}
```
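Consumers can walk this manifest directly. For instance, a quick audit of a crawl (a sketch assuming only the fields shown above):

```python
import json
from collections import Counter

with open("crawl-manifest.json") as f:
    manifest = json.load(f)

items = manifest["items"]
by_type = Counter(item["contentType"] for item in items)

print(f"{len(items)} items from {manifest['seedURL']}")
print(by_type.most_common(5))
print([i["url"] for i in items if i["statusCode"] >= 400])  # failed downloads
```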
### deploy.json

Pages and assets with enrichment data for the Deploy format:
```json
{
  "pages": [
    {
      "url": "https://example.com/",
      "route": "/",
      "localPath": "index.html",
      "ocrText": "..."
    }
  ],
  "assets": [...],
  "enrichment": {
    "framework": "nextjs",
    "pagesEnriched": 45
  }
}
```

### Enrichment dump
All browser capture data (framework detection, network requests, console logs, DOM snapshots) exported as JSON per URL.
## Choosing a format
| I want to... | Use |
|---|---|
| Browse the site offline | Folder |
| Share an archive with someone | ZIP |
| Save a quick single-page snapshot | Single HTML |
| Create a standards-compliant web archive | WARC |
| Generate a printable document | PDF |
| Feed content into an AI or search pipeline | Extracted |
| Deploy a mirrored static site | Deploy |
| Build a retrieval index | Deploy (use deploy.json manifest) |
## Next steps
- Settings Reference: Configure what gets downloaded
- CLI Commands: Automate exports from the terminal
- MCP Overview: AI-driven export workflows