
Common Workflows

Download a site for offline reading

Save a documentation site for offline reading.

crawlio crawl start https://docs.example.com \
  --depth 5 \
  --scope same-domain
 
# Wait for completion, then export
crawlio export run folder --dest ~/offline-docs/example

Open ~/offline-docs/example/index.html in any browser. All links are rewritten to work offline.
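If your browser restricts scripts or fonts on file:// pages, you can serve the folder over HTTP instead. A minimal sketch using Python's built-in static file server (any static server works):

# Serve the exported folder locally
cd ~/offline-docs/example && python3 -m http.server 8080
# then open http://localhost:8080 in a browser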

💡 Most documentation sites stay within one domain, so the same-domain scope works well. If docs are split across subdomains, use the include-subdomains scope instead.
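Assuming include-subdomains is passed through the same --scope flag (check the CLI help to confirm), the wider crawl looks like:

crawlio crawl start https://docs.example.com \
  --depth 5 \
  --scope include-subdomains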


Archive a site as WARC

Create an ISO 28500 web archive with full HTTP headers and metadata.

crawlio crawl start https://blog.example.com \
  --depth 0 \
  --scope same-domain
 
# Export as WARC with compression and CDX index
crawlio export run warc --dest ~/archives/blog-backup.warc

The WARC file preserves every page exactly as it appeared. Replay it with ReplayWeb.page or upload to the Internet Archive.

For large sites, WARC output splits automatically at 1 GB. CDX indexing and deduplication are on by default.
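One way to sanity-check the archive from the terminal is the open-source warcio tool, which prints one index line per WARC record (the exact output filename may vary if compression or splitting kicked in):

pip install warcio
warcio index ~/archives/blog-backup.warc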


Extract clean text for AI

Extract structured text content for RAG pipelines, fine-tuning datasets, or context windows.

crawlio crawl start https://docs.example.com \
  --depth 5 \
  --scope same-domain
 
# Export as extracted text
crawlio export run extracted --dest ~/datasets/example

This produces a content.md and metadata.json for each page:

extracted/
  docs.example.com/
    getting-started/
      content.md         # Clean markdown
      metadata.json      # Title, URL, headers, links
    api-reference/
      content.md
      metadata.json

Feed the content.md files into your embedding pipeline or concatenate them for a context window.
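A minimal sketch of the concatenation step using only find and shell (adjust the root to your --dest; the HTML-comment separators keep page boundaries visible in the corpus):

cd ~/datasets/example
find . -name content.md | sort | while read -r f; do
  printf '\n\n<!-- source: %s -->\n\n' "$f"
  cat "$f"
done > corpus.md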

💡 The deploy format also produces a deploy.json manifest that maps URLs to local paths, which is useful for building retrieval indexes.
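If you also run the deploy export, the manifest can seed a retrieval index. A hypothetical jq sketch, assuming deploy.json holds an array of entries with url and path fields (verify against your actual manifest before relying on it):

# hypothetical schema: [{ "url": ..., "path": ... }, ...]
jq -r '.[] | "\(.url)\t\(.path)"' deploy.json > url-to-path.tsv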


Analyze SEO issues

Run a crawl and review the built-in SEO analysis.

crawlio crawl start https://example.com \
  --depth 0 \
  --scope same-domain

After the crawl completes, open the Audit Dashboard in the app. Crawlio runs 19 analyzers automatically:

  • SEO: missing titles, duplicate meta descriptions, broken canonical URLs, missing Open Graph tags
  • Accessibility: missing alt text, ARIA issues, heading hierarchy
  • Security: missing HSTS, permissive CORS, exposed server headers
  • Best practices: missing favicon, render-blocking resources, deprecated HTML
  • Redirects: redirect chains, loops, mixed permanent/temporary, cross-domain redirects
  • Orphan pages: pages with no inbound links, pages not in sitemap
  • Duplicates: exact duplicates (SHA-256) and near-duplicates (TF-IDF cosine similarity)

Export the results as CSV:

crawlio export run csv --preset seo-overview --dest ~/reports/seo.csv

Or via MCP:

get_analysis_summary()
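To skim the CSV report from the terminal without opening a spreadsheet (plain coreutils; the column trick assumes fields without embedded commas):

head -n 5 ~/reports/seo.csv | column -s, -t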

Compare two crawl snapshots

Detect changes between two versions of a site.

  1. Crawl the first version and export:

crawlio crawl start https://example.com --depth 0 --scope same-domain
crawlio export run extracted --dest ~/snapshots/v1

  2. Crawl again later (after a deploy, migration, or content update):

crawlio crawl start https://example.com --depth 0 --scope same-domain
crawlio export run extracted --dest ~/snapshots/v2

  3. Diff the extracted content:

diff -rq ~/snapshots/v1 ~/snapshots/v2

For structured diffs, compare the metadata.json files to find changed titles, missing pages, or new URLs.
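A sketch of one such structured diff with jq, assuming each metadata.json exposes url and title keys (the export stores title, URL, headers, and links, but the exact key names may differ):

for v in v1 v2; do
  find ~/snapshots/$v -name metadata.json \
    -exec jq -r '"\(.url)\t\(.title)"' {} + | sort > /tmp/titles-$v.tsv
done
diff /tmp/titles-v1.tsv /tmp/titles-v2.tsv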


Export design tokens

Extract CSS custom properties, colors, typography, and spacing values from a site.

crawlio crawl start https://example.com \
  --depth 1 \
  --scope same-domain
 
crawlio export run extracted --dest ~/tokens/example

The extracted output includes design token data found in stylesheets:

  • CSS custom properties (--color-primary, --font-size-lg)
  • Color values (hex, rgb, hsl)
  • Font stacks and sizes
  • Spacing and layout values

Design tokens appear in the per-page assets.json and in the extraction pipeline output.
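If the export keeps the raw stylesheets on disk (an assumption; check your --dest), a grep one-liner lists every custom-property declaration it finds:

grep -rhoE --include='*.css' -e '--[A-Za-z0-9_-]+:[^;]+' ~/tokens/example | sort -u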


Generate a deploy-ready static site

Mirror a site and deploy it to Vercel, Netlify, S3, or any static host.

crawlio crawl start https://example.com \
  --depth 0 \
  --scope same-domain
 
crawlio export run deploy --dest ~/deploy/example

The deploy export:

  • Converts pages to clean URL routes (blog.html becomes blog/index.html)
  • Flattens cross-domain assets under _assets/{host}/
  • Rewrites all references in HTML and CSS
  • Generates deploy.json manifest and sitemap.xml

Deploy directly:

# Vercel
cd ~/deploy/example && vercel
 
# Netlify
netlify deploy --dir ~/deploy/example --prod
 
# S3
aws s3 sync ~/deploy/example s3://my-bucket --delete
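Before deploying, it can be worth previewing the folder locally; npx serve (the serve package from npm, assuming Node.js is installed) is one option:

npx serve ~/deploy/example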
