
Export Formats

Overview

After a crawl completes, export your downloaded site in any of 7 formats. Three metadata exports are also available.

Format       Output            Best for
Folder       directory         Default. Offline browsing with rewritten links
ZIP          .zip              Sharing compressed archives
Single HTML  .html             One-file page snapshots
WARC         .warc / .warc.gz  ISO 28500 web archives
PDF          .pdf              Full-page PDF rendering
Extracted    directory         Clean text + Markdown for AI/search
Deploy       directory         Deploy-ready static site

Folder

The default export. Preserves the original directory structure, with all links rewritten for offline browsing.

example.com/
  index.html
  about.html
  css/
    style.css
  js/
    app.js
  images/
    logo.png
    hero.jpg

Open index.html in any browser. The site works completely offline.

Export via CLI:

crawlio export run folder --dest ~/sites/example.com

Export via MCP:

export_site(format: "folder", destinationPath: "~/sites/example.com")

ZIP

Same structure as the Folder export, compressed into a single .zip archive. The compression level is configurable (1-9).

crawlio export run zip --dest ~/archives/example.zip
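
The 1-9 scale follows the usual DEFLATE convention (1 = fastest, 9 = smallest output); that mapping is an assumption here, not something Crawlio documents. As a rough illustration of the trade-off, this sketch packs an exported folder with Python's zipfile module (paths are placeholders, and this is not Crawlio's internal implementation):

import zipfile
from pathlib import Path

def zip_export(src_dir: str, dest_zip: str, level: int = 6) -> None:
    """Compress an exported site folder into a single .zip archive."""
    src = Path(src_dir).expanduser()
    with zipfile.ZipFile(dest_zip, "w",
                         compression=zipfile.ZIP_DEFLATED,
                         compresslevel=level) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store entries relative to the export root
                zf.write(path, path.relative_to(src))

zip_export("~/sites/example.com", "example.zip", level=9)  # 1 = fastest, 9 = smallest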

Single HTML

Inlines all CSS, JavaScript, and images (as base64 data URIs) into a single self-contained HTML file.

crawlio export run singleHTML --dest ~/snapshots/example.html

Supported image formats for inlining: PNG, JPG, JPEG, GIF, SVG, WebP, ICO.

Output is capped at 50 MB. Requires index.html in the downloaded content. External resources not in the download set remain as external references.
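
Conceptually, each referenced image is re-encoded as a data: URI and substituted into the page. A minimal sketch of the idea, not Crawlio's actual implementation:

import base64
import mimetypes
from pathlib import Path

def to_data_uri(image_path: str) -> str:
    """Encode a local image as a base64 data: URI for inlining into HTML."""
    path = Path(image_path)
    mime, _ = mimetypes.guess_type(path.name)        # e.g. image/png
    mime = mime or "application/octet-stream"        # fallback for unknown types
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"

# <img src="images/logo.png"> becomes <img src="data:image/png;base64,iVBOR...">
print(to_data_uri("images/logo.png")[:60])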

WARC

Web ARChive (ISO 28500), the standard format used by the Internet Archive's Wayback Machine. Each resource is stored with full HTTP headers, response bodies, and metadata.

crawlio export run warc --dest ~/archives/example.warc

What's included:

  • warcinfo header record with Crawlio version and crawl metadata
  • Both request and response records with real HTTP headers
  • SHA-1 digests for both block and payload, Base32-encoded per RFC 4648 (reproduced in the sketch after this list)
  • Content-Type auto-detected from file extension
  • OCR text as metadata records (when OCR is enabled) with WARC-Refers-To linking
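
The digests follow the standard WARC convention: an algorithm prefix (sha1:) followed by the Base32-encoded digest. A minimal sketch to reproduce one with the Python standard library:

import base64
import hashlib

def warc_digest(payload: bytes) -> str:
    """Compute a WARC-style payload digest: SHA-1, Base32-encoded (RFC 4648)."""
    digest = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(digest).decode("ascii")

print(warc_digest(b"<html>...</html>"))  # -> sha1: followed by 32 Base32 characters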

Four settings control WARC output:

Setting             Type  Default     Description
compressionEnabled  bool  true        Per-record gzip compression. Output file extension is .warc.gz when on, .warc when off
maxFileSize         int   1073741824  Maximum file size in bytes before splitting (default 1 GB). Set to 0 to disable splitting
cdxEnabled          bool  true        Generate a CDX index file alongside the WARC. CDX entries reference the correct split file
dedupEnabled        bool  true        Skip duplicate responses, detected via SHA-1 payload digest. Duplicates are stored as revisit records

When splitting is active, files are numbered sequentially (example-00000.warc.gz, example-00001.warc.gz). Each split file gets its own warcinfo record. CDX entries track which split file contains each record.

WARC files can be replayed with ReplayWeb.page or ingested into web archive infrastructure like pywb and warctools.
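
For programmatic access, the records can also be iterated with the Python warcio library; for example, listing every archived response (the file name is illustrative):

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

with open("example-00000.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            status = record.http_headers.get_statuscode()
            print(status, url)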

PDF

Full-page PDF rendering via WebKit. Each page is rendered as it would appear in a browser, with full CSS applied.

crawlio export run pdf --dest ~/pdfs/example.pdf

Extracted

Clean, structured content extracted from each page: normalized HTML, Markdown, metadata, and asset inventories.

output/
  pages/
    index/
      page.html          # Normalized HTML
      content.md         # Markdown conversion
      metadata.json      # Title, description, OG data
      assets.json        # Categorized asset inventory
    about/
      page.html
      content.md
      metadata.json
      assets.json
  manifest.json          # Crawl metadata, site tree

The metadata.json includes:

{
  "url": "https://example.com/",
  "title": "Example Domain",
  "contentType": "text/html",
  "statusCode": 200,
  "headers": { "h1": ["Example Domain"], "h2": ["About", "Contact"] },
  "links": ["https://example.com/about", "https://example.com/contact"],
  "framework": "nextjs",
  "extractedAt": "2026-02-14T10:30:00Z"
}

This format is ideal for AI consumption. Feed content.md files into RAG pipelines, search indexes, or training datasets.
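
As a starting point, here is a minimal loader that pairs each page's Markdown with its metadata before indexing, assuming the directory layout shown above:

import json
from pathlib import Path

def load_extracted(output_dir: str):
    """Yield (url, title, markdown) tuples from an Extracted export."""
    for page_dir in Path(output_dir, "pages").iterdir():
        if not page_dir.is_dir():
            continue
        meta = json.loads((page_dir / "metadata.json").read_text())
        markdown = (page_dir / "content.md").read_text()
        yield meta["url"], meta["title"], markdown

for url, title, md in load_extracted("output"):
    print(f"{title} ({url}): {len(md)} chars")  # feed md into your indexer here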

Deploy

A deploy-ready static site with clean URL routes and machine-readable manifests.

output/
  index.html
  blog/
    index.html              # from blog.html
    series-a/
      index.html            # from blog/series-a.html
  pricing/
    index.html              # from pricing.html
  _assets/
    cdn.example.com/        # cross-domain assets
      fonts/inter.woff2
  _next/static/             # framework assets preserved
  deploy.json               # manifest
  sitemap.xml               # generated

Features:

  • Multi-host flattening: main domain at root, cross-domain assets under _assets/{host}/
  • HTML pages converted to clean URL routes (blog.html becomes blog/index.html)
  • Reference rewriting in HTML/CSS files for the flattened structure
  • Skips junk files: API routes, .css_ duplicates, query-variant hash duplicates
  • Generates deploy.json manifest and sitemap.xml

Deploy to Vercel, Netlify, S3, or any static hosting.
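
Before uploading, it can be worth sanity-checking the output against its manifest. A minimal sketch, assuming only the deploy.json fields shown under Metadata exports below:

import json
from pathlib import Path

def check_deploy(output_dir: str) -> None:
    """Verify every page route in deploy.json has a file on disk."""
    root = Path(output_dir)
    manifest = json.loads((root / "deploy.json").read_text())
    missing = [p["route"] for p in manifest["pages"]
               if not (root / p["localPath"]).is_file()]
    print("missing routes:", missing or "none")

check_deploy("output")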

Metadata exports

Three additional metadata files are generated alongside exports:

crawl-manifest.json

Full crawl metadata with every downloaded item:

{
  "version": 1,
  "seedURL": "https://example.com",
  "exportedAt": "2026-03-06T10:30:00Z",
  "items": [
    {
      "url": "https://example.com/",
      "localPath": "index.html",
      "contentType": "text/html",
      "size": 15234,
      "statusCode": 200,
      "ocrText": "..."
    }
  ]
}
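
A short sketch that summarizes a manifest, using only the fields shown above:

import json
from collections import Counter

with open("crawl-manifest.json") as f:
    manifest = json.load(f)

items = manifest["items"]
types = Counter(item["contentType"] for item in items)
total = sum(item["size"] for item in items)
print(f"{manifest['seedURL']}: {len(items)} items, {total} bytes")
print(types.most_common(5))  # top content types in the crawl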

deploy.json

Pages and assets with enrichment data for the deploy format:

{
  "pages": [
    {
      "url": "https://example.com/",
      "route": "/",
      "localPath": "index.html",
      "ocrText": "..."
    }
  ],
  "assets": [...],
  "enrichment": {
    "framework": "nextjs",
    "pagesEnriched": 45
  }
}

Enrichment dump

All browser capture data (framework detection, network requests, console logs, DOM snapshots) exported as JSON per URL.


Choosing a format

I want to...                                 Use
Browse the site offline                      Folder
Share an archive with someone                ZIP
Save a quick single-page snapshot            Single HTML
Create a standards-compliant web archive     WARC
Generate a printable document                PDF
Feed content into an AI or search pipeline   Extracted
Deploy a mirrored static site                Deploy
Build a retrieval index                      Deploy (use deploy.json manifest)
