
Export Formats

Overview

After a crawl completes, export your downloaded site in any of 7 formats. Three metadata exports are also available.

Format       Output            Best for
Folder       directory         Default. Offline browsing with rewritten links
ZIP          .zip              Sharing compressed archives
Single HTML  .html             One-file page snapshots
WARC         .warc / .warc.gz  ISO 28500 web archives
PDF          .pdf              Full-page PDF rendering
Extracted    directory         Clean text + Markdown for AI/search
Deploy       directory         Deploy-ready static site

Folder

The default export. Preserves the original directory structure, with all links rewritten for offline browsing.

example.com/
  index.html
  about.html
  css/
    style.css
  js/
    app.js
  images/
    logo.png
    hero.jpg

Open index.html in any browser. The site works completely offline.

Export via CLI:

crawlio export run folder --dest ~/sites/example.com

Export via MCP:

export_site(format: "folder", destinationPath: "~/sites/example.com")

ZIP

Same structure as the Folder export, compressed into a single .zip archive. The compression level is configurable (1-9).

crawlio export run zip --dest ~/archives/example.zip
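
The 1-9 scale follows the usual DEFLATE convention (1 = fastest, 9 = smallest output); that mapping is an assumption here, not something Crawlio documents. As a rough illustration of the trade-off, this sketch packs an exported folder with Python's zipfile module (paths are placeholders, and this is not Crawlio's internal implementation):

import zipfile
from pathlib import Path

def zip_export(src_dir: str, dest_zip: str, level: int = 6) -> None:
    """Compress an exported site folder into a single .zip archive."""
    src = Path(src_dir).expanduser()
    with zipfile.ZipFile(dest_zip, "w",
                         compression=zipfile.ZIP_DEFLATED,
                         compresslevel=level) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store entries relative to the export root
                zf.write(path, path.relative_to(src))

zip_export("~/sites/example.com", "example.zip", level=9)  # 1 = fastest, 9 = smallest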

Single HTML

Inlines all CSS, JavaScript, and images (as base64 data URIs) into a single self-contained HTML file.

crawlio export run singleHTML --dest ~/snapshots/example.html

Supported image formats for inlining: PNG, JPG, JPEG, GIF, SVG, WebP, ICO.

Output is capped at 50 MB. Requires index.html in the downloaded content. External resources not in the download set remain as external references.
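
Conceptually, each referenced image is re-encoded as a data: URI and substituted into the page. A minimal sketch of the idea, not Crawlio's actual implementation:

import base64
import mimetypes
from pathlib import Path

def to_data_uri(image_path: str) -> str:
    """Encode a local image as a base64 data: URI for inlining into HTML."""
    path = Path(image_path)
    mime, _ = mimetypes.guess_type(path.name)        # e.g. image/png
    mime = mime or "application/octet-stream"        # fallback for unknown types
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"

# <img src="images/logo.png"> becomes <img src="data:image/png;base64,iVBOR...">
print(to_data_uri("images/logo.png")[:60])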

WARC

Web ARChive (ISO 28500), the standard format used by the Internet Archive's Wayback Machine. Each resource is stored with full HTTP headers, response bodies, and metadata.

crawlio export run warc --dest ~/archives/example.warc

What's included:

  • warcinfo header record with Crawlio version and crawl metadata
  • Both request and response records with real HTTP headers
  • SHA-1 digests for both block and payload, Base32-encoded per RFC 4648 (reproduced in the sketch after this list)
  • Content-Type auto-detected from file extension
  • OCR text as metadata records (when OCR is enabled) with WARC-Refers-To linking
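
The digests follow the standard WARC convention: an algorithm prefix (sha1:) followed by the Base32-encoded digest. A minimal sketch to reproduce one with the Python standard library:

import base64
import hashlib

def warc_digest(payload: bytes) -> str:
    """Compute a WARC-style payload digest: SHA-1, Base32-encoded (RFC 4648)."""
    digest = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(digest).decode("ascii")

print(warc_digest(b"<html>...</html>"))  # -> sha1: followed by 32 Base32 characters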

Four settings control WARC output:

Setting             Type  Default     Description
compressionEnabled  bool  true        Per-record gzip compression. Output file extension is .warc.gz when on, .warc when off
maxFileSize         int   1073741824  Maximum file size in bytes before splitting (default 1 GB). Set to 0 to disable splitting
cdxEnabled          bool  true        Generate a CDX index file alongside the WARC. CDX entries reference the correct split file
dedupEnabled        bool  true        Skip duplicate responses, detected via SHA-1 payload digest. Duplicates are stored as revisit records

When splitting is active, files are numbered sequentially (example-00000.warc.gz, example-00001.warc.gz). Each split file gets its own warcinfo record. CDX entries track which split file contains each record.

WARC files can be replayed with ReplayWeb.page or ingested into web archive infrastructure like pywb and warctools.
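
For programmatic access, the records can also be iterated with the Python warcio library; for example, listing every archived response (the file name is illustrative):

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

with open("example-00000.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            status = record.http_headers.get_statuscode()
            print(status, url)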

PDF

Full-page PDF rendering via WebKit. Each page is rendered as it would appear in a browser, with full CSS applied.

crawlio export run pdf --dest ~/pdfs/example.pdf

Extracted

Clean, structured content extracted from each page: normalized HTML, Markdown, metadata, and asset inventories.

output/
  pages/
    index/
      page.html          # Normalized HTML
      content.md         # Markdown conversion
      metadata.json      # Title, description, OG data
      assets.json        # Categorized asset inventory
    about/
      page.html
      content.md
      metadata.json
      assets.json
  manifest.json          # Crawl metadata, site tree

The metadata.json includes:

{
  "url": "https://example.com/",
  "title": "Example Domain",
  "contentType": "text/html",
  "statusCode": 200,
  "headers": { "h1": ["Example Domain"], "h2": ["About", "Contact"] },
  "links": ["https://example.com/about", "https://example.com/contact"],
  "framework": "nextjs",
  "extractedAt": "2026-02-14T10:30:00Z"
}

This format is ideal for AI consumption. Feed content.md files into RAG pipelines, search indexes, or training datasets.
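
As a starting point, here is a minimal loader that pairs each page's Markdown with its metadata before indexing, assuming the directory layout shown above:

import json
from pathlib import Path

def load_extracted(output_dir: str):
    """Yield (url, title, markdown) tuples from an Extracted export."""
    for page_dir in Path(output_dir, "pages").iterdir():
        if not page_dir.is_dir():
            continue
        meta = json.loads((page_dir / "metadata.json").read_text())
        markdown = (page_dir / "content.md").read_text()
        yield meta["url"], meta["title"], markdown

for url, title, md in load_extracted("output"):
    print(f"{title} ({url}): {len(md)} chars")  # feed md into your indexer here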

Deploy

A deploy-ready static site with clean URL routes and machine-readable manifests.

output/
  index.html
  blog/
    index.html              # from blog.html
    series-a/
      index.html            # from blog/series-a.html
  pricing/
    index.html              # from pricing.html
  _assets/
    cdn.example.com/        # cross-domain assets
      fonts/inter.woff2
  _next/static/             # framework assets preserved
  deploy.json               # manifest
  sitemap.xml               # generated

Features:

  • Multi-host flattening: main domain at root, cross-domain assets under _assets/{host}/
  • HTML pages converted to clean URL routes (blog.html becomes blog/index.html)
  • Reference rewriting in HTML/CSS files for the flattened structure
  • Skips junk files: API routes, .css_ duplicates, query-variant hash duplicates
  • Generates deploy.json manifest and sitemap.xml

Deploy to Vercel, Netlify, S3, or any static hosting.
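
Before uploading, it can be worth sanity-checking the output against its manifest. A minimal sketch, assuming only the deploy.json fields shown under Metadata exports below:

import json
from pathlib import Path

def check_deploy(output_dir: str) -> None:
    """Verify every page route in deploy.json has a file on disk."""
    root = Path(output_dir)
    manifest = json.loads((root / "deploy.json").read_text())
    missing = [p["route"] for p in manifest["pages"]
               if not (root / p["localPath"]).is_file()]
    print("missing routes:", missing or "none")

check_deploy("output")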

Metadata exports

Three additional metadata files are generated alongside exports:

crawl-manifest.json

Full crawl metadata with every downloaded item:

{
  "version": 1,
  "seedURL": "https://example.com",
  "exportedAt": "2026-03-06T10:30:00Z",
  "items": [
    {
      "url": "https://example.com/",
      "localPath": "index.html",
      "contentType": "text/html",
      "size": 15234,
      "statusCode": 200,
      "ocrText": "..."
    }
  ]
}
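
A short sketch that summarizes a manifest, using only the fields shown above:

import json
from collections import Counter

with open("crawl-manifest.json") as f:
    manifest = json.load(f)

items = manifest["items"]
types = Counter(item["contentType"] for item in items)
total = sum(item["size"] for item in items)
print(f"{manifest['seedURL']}: {len(items)} items, {total} bytes")
print(types.most_common(5))  # top content types in the crawl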

deploy.json

Pages and assets with enrichment data for the deploy format:

{
  "pages": [
    {
      "url": "https://example.com/",
      "route": "/",
      "localPath": "index.html",
      "ocrText": "..."
    }
  ],
  "assets": [...],
  "enrichment": {
    "framework": "nextjs",
    "pagesEnriched": 45
  }
}

Enrichment dump

All browser capture data (framework detection, network requests, console logs, DOM snapshots) exported as JSON per URL.


Choosing a format

I want to...                                 Use
Browse the site offline                      Folder
Share an archive with someone                ZIP
Save a quick single-page snapshot            Single HTML
Create a standards-compliant web archive     WARC
Generate a printable document                PDF
Feed content into an AI or search pipeline   Extracted
Deploy a mirrored static site                Deploy
Build a retrieval index                      Deploy (use deploy.json manifest)
