
Common Workflows

Download a site for offline reading

Save a documentation site for offline reading.

crawlio crawl start https://docs.example.com \
  --depth 5 \
  --scope same-domain
 
# Wait for completion, then export
crawlio export run folder --dest ~/offline-docs/example

Open ~/offline-docs/example/index.html in any browser. All links are rewritten to work offline.
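If your browser restricts scripts or fonts on file:// pages, you can serve the folder over HTTP instead. A minimal sketch using Python's built-in static file server (any static server works):

# Serve the exported folder locally
cd ~/offline-docs/example && python3 -m http.server 8080
# then open http://localhost:8080 in a browser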

💡 Most documentation sites stay within one domain, so the same-domain scope works well. If docs are split across subdomains, use the include-subdomains scope instead.
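Assuming include-subdomains is passed through the same --scope flag (check the CLI help to confirm), the wider crawl looks like:

crawlio crawl start https://docs.example.com \
  --depth 5 \
  --scope include-subdomains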


Archive a site as WARC

Create an ISO 28500 web archive with full HTTP headers and metadata.

crawlio crawl start https://blog.example.com \
  --depth 0 \
  --scope same-domain
 
# Export as WARC with compression and CDX index
crawlio export run warc --dest ~/archives/blog-backup.warc

The WARC file preserves every page exactly as it appeared. Replay it with ReplayWeb.page or upload to the Internet Archive.

For large sites, WARC output splits automatically at 1 GB. CDX indexing and deduplication are on by default.
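One way to sanity-check the archive from the terminal is the open-source warcio tool, which prints one index line per WARC record (the exact output filename may vary if compression or splitting kicked in):

pip install warcio
warcio index ~/archives/blog-backup.warc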


Extract clean text for AI

Extract structured text content for RAG pipelines, fine-tuning datasets, or context windows.

crawlio crawl start https://docs.example.com \
  --depth 5 \
  --scope same-domain
 
# Export as extracted text
crawlio export run extracted --dest ~/datasets/example

This produces a content.md and metadata.json for each page:

extracted/
  docs.example.com/
    getting-started/
      content.md         # Clean markdown
      metadata.json      # Title, URL, headers, links
    api-reference/
      content.md
      metadata.json

Feed the content.md files into your embedding pipeline or concatenate them for a context window.
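A minimal sketch of the concatenation step using only find and shell (adjust the root to your --dest; the HTML-comment separators keep page boundaries visible in the corpus):

cd ~/datasets/example
find . -name content.md | sort | while read -r f; do
  printf '\n\n<!-- source: %s -->\n\n' "$f"
  cat "$f"
done > corpus.md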

💡 The deploy format also produces a deploy.json manifest that maps URLs to local paths, which is useful for building retrieval indexes.
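If you also run the deploy export, the manifest can seed a retrieval index. A hypothetical jq sketch, assuming deploy.json holds an array of entries with url and path fields (verify against your actual manifest before relying on it):

# hypothetical schema: [{ "url": ..., "path": ... }, ...]
jq -r '.[] | "\(.url)\t\(.path)"' deploy.json > url-to-path.tsv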


Analyze SEO issues

Run a crawl and review the built-in SEO analysis.

crawlio crawl start https://example.com \
  --depth 0 \
  --scope same-domain

After the crawl completes, open the Audit Dashboard in the app. Crawlio runs 19 analyzers automatically:

  • SEO: missing titles, duplicate meta descriptions, broken canonical URLs, missing Open Graph tags
  • Accessibility: missing alt text, ARIA issues, heading hierarchy
  • Security: missing HSTS, permissive CORS, exposed server headers
  • Best practices: missing favicon, render-blocking resources, deprecated HTML
  • Redirects: redirect chains, loops, mixed permanent/temporary, cross-domain redirects
  • Orphan pages: pages with no inbound links, pages not in sitemap
  • Duplicates: exact duplicates (SHA-256) and near-duplicates (TF-IDF cosine similarity)

Export the results as CSV:

crawlio export run csv --preset seo-overview --dest ~/reports/seo.csv

Or via MCP:

get_analysis_summary()
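To skim the CSV report from the terminal without opening a spreadsheet (plain coreutils; the column trick assumes fields without embedded commas):

head -n 5 ~/reports/seo.csv | column -s, -t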

Compare two crawl snapshots

Detect changes between two versions of a site.

  1. Crawl the first version and export:

crawlio crawl start https://example.com --depth 0 --scope same-domain
crawlio export run extracted --dest ~/snapshots/v1

  2. Crawl again later (after a deploy, migration, or content update):

crawlio crawl start https://example.com --depth 0 --scope same-domain
crawlio export run extracted --dest ~/snapshots/v2

  3. Diff the extracted content:

diff -rq ~/snapshots/v1 ~/snapshots/v2

For structured diffs, compare the metadata.json files to find changed titles, missing pages, or new URLs.
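A sketch of one such structured diff with jq, assuming each metadata.json exposes url and title keys (the export stores title, URL, headers, and links, but the exact key names may differ):

for v in v1 v2; do
  find ~/snapshots/$v -name metadata.json \
    -exec jq -r '"\(.url)\t\(.title)"' {} + | sort > /tmp/titles-$v.tsv
done
diff /tmp/titles-v1.tsv /tmp/titles-v2.tsv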


Export design tokens

Extract CSS custom properties, colors, typography, and spacing values from a site.

crawlio crawl start https://example.com \
  --depth 1 \
  --scope same-domain
 
crawlio export run extracted --dest ~/tokens/example

The extracted output includes design token data found in stylesheets:

  • CSS custom properties (--color-primary, --font-size-lg)
  • Color values (hex, rgb, hsl)
  • Font stacks and sizes
  • Spacing and layout values

Design tokens appear in the per-page assets.json and in the extraction pipeline output.
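If the export keeps the raw stylesheets on disk (an assumption; check your --dest), a grep one-liner lists every custom-property declaration it finds:

grep -rhoE --include='*.css' -e '--[A-Za-z0-9_-]+:[^;]+' ~/tokens/example | sort -u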


Generate a deploy-ready static site

Mirror a site and deploy it to Vercel, Netlify, S3, or any static host.

crawlio crawl start https://example.com \
  --depth 0 \
  --scope same-domain
 
crawlio export run deploy --dest ~/deploy/example

The deploy export:

  • Converts pages to clean URL routes (blog.html becomes blog/index.html)
  • Flattens cross-domain assets under _assets/{host}/
  • Rewrites all references in HTML and CSS
  • Generates deploy.json manifest and sitemap.xml

Deploy directly:

# Vercel
cd ~/deploy/example && vercel
 
# Netlify
netlify deploy --dir ~/deploy/example --prod
 
# S3
aws s3 sync ~/deploy/example s3://my-bucket --delete
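Before deploying, it can be worth previewing the folder locally; npx serve (the serve package from npm, assuming Node.js is installed) is one option:

npx serve ~/deploy/example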
