Common Workflows
Download a site for offline reading
Save a documentation site for offline reading.
crawlio crawl start https://docs.example.com \
--depth 5 \
--scope same-domain
# Wait for completion, then export
crawlio export run folder --dest ~/offline-docs/example
Open ~/offline-docs/example/index.html in any browser. All links are rewritten to work offline.
Most documentation sites stay within one domain, so the same-domain scope works well. If the docs are split across subdomains, use the include-subdomains scope instead.
Archive a site as WARC
Create an ISO 28500 web archive with full HTTP headers and metadata.
crawlio crawl start https://blog.example.com \
--depth 0 \
--scope same-domain
# Export as WARC with compression and CDX index
crawlio export run warc --dest ~/archives/blog-backup.warc
The WARC file preserves every page exactly as it appeared. Replay it with ReplayWeb.page or upload it to the Internet Archive.
For large sites, WARC output splits automatically at 1 GB. CDX indexing and deduplication are on by default.
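A WARC file stores each capture as a record with its own headers, so you can inspect an archive without a replay tool by scanning those headers. A minimal sketch, using a synthetic record as a stand-in for real crawlio output (the fixture path and record content are illustrative only):

```shell
# Build a minimal, illustrative WARC-style record (NOT real crawlio output)
# so the inspection pipeline below has something to read.
printf 'WARC/1.1\r\nWARC-Type: response\r\nWARC-Target-URI: https://blog.example.com/\r\n\r\n' \
  | gzip > /tmp/demo.warc.gz

# List record types and target URIs -- the same scan works on a real .warc.gz.
gunzip -c /tmp/demo.warc.gz | grep -aE '^WARC-(Type|Target-URI)'
```

On a real archive this prints one `WARC-Type` line per record, which is a quick way to count responses before replaying.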
Extract clean text for AI
Extract structured text content for RAG pipelines, fine-tuning datasets, or context windows.
crawlio crawl start https://docs.example.com \
--depth 5 \
--scope same-domain
# Export as extracted text
crawlio export run extracted --dest ~/datasets/example
This produces a content.md and metadata.json for each page:
extracted/
  docs.example.com/
    getting-started/
      content.md       # Clean markdown
      metadata.json    # Title, URL, headers, links
    api-reference/
      content.md
      metadata.json
Feed the content.md files into your embedding pipeline or concatenate them for a context window.
The deploy format also produces a deploy.json manifest that maps URLs to local paths. Useful for building retrieval indexes.
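The concatenation step can be sketched with standard tools. This recreates a small slice of the extracted/ layout as a fixture (the /tmp paths are illustrative, not crawlio defaults) and joins every content.md into one corpus, with a separator naming each source page:

```shell
# Recreate a small slice of the extracted/ layout shown above (fixture only).
mkdir -p /tmp/rag-demo/extracted/docs.example.com/getting-started
mkdir -p /tmp/rag-demo/extracted/docs.example.com/api-reference
echo '# Getting Started' > /tmp/rag-demo/extracted/docs.example.com/getting-started/content.md
echo '# API Reference'   > /tmp/rag-demo/extracted/docs.example.com/api-reference/content.md

# Concatenate every content.md in stable sorted order, with an HTML comment
# marking where each page begins -- useful when the corpus feeds a context window.
find /tmp/rag-demo/extracted -name content.md | sort | while read -r f; do
  printf '\n<!-- source: %s -->\n' "$f"
  cat "$f"
done > /tmp/rag-demo/corpus.md
```

For an embedding pipeline you would typically skip the concatenation and feed each content.md (plus its metadata.json URL) as one document instead.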
Analyze SEO issues
Run a crawl and review the built-in SEO analysis.
crawlio crawl start https://example.com \
--depth 0 \
--scope same-domain
After the crawl completes, open the Audit Dashboard in the app. Crawlio runs 19 analyzers automatically:
- SEO: missing titles, duplicate meta descriptions, broken canonical URLs, missing Open Graph tags
- Accessibility: missing alt text, ARIA issues, heading hierarchy
- Security: missing HSTS, permissive CORS, exposed server headers
- Best practices: missing favicon, render-blocking resources, deprecated HTML
- Redirects: redirect chains, loops, mixed permanent/temporary, cross-domain redirects
- Orphan pages: pages with no inbound links, pages not in sitemap
- Duplicates: exact duplicates (SHA-256) and near-duplicates (TF-IDF cosine similarity)
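The exact-duplicate check can be reproduced by hand from an extracted export with GNU coreutils: hash every page body and keep only repeated digests. A sketch using synthetic fixture files (the /tmp paths are illustrative):

```shell
# Fixture: two identical pages and one distinct page (illustrative only).
mkdir -p /tmp/dup-demo/a /tmp/dup-demo/b /tmp/dup-demo/c
echo 'same body' > /tmp/dup-demo/a/content.md
echo 'same body' > /tmp/dup-demo/b/content.md
echo 'different' > /tmp/dup-demo/c/content.md

# Hash every page body. A SHA-256 digest is 64 hex chars, so -w64 groups on
# the digest alone and -D prints only lines whose digest repeats (the dupes).
find /tmp/dup-demo -name content.md -exec sha256sum {} + | sort | uniq -w64 -D
```

Only the two identical pages appear in the output; the distinct page is filtered out.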
Export the results as CSV:
crawlio export run csv --preset seo-overview --dest ~/reports/seo.csv
Or via MCP:
get_analysis_summary()
Compare two crawl snapshots
Detect changes between two versions of a site.
- Crawl the first version and export:
crawlio crawl start https://example.com --depth 0 --scope same-domain
crawlio export run extracted --dest ~/snapshots/v1
- Crawl again later (after a deploy, migration, or content update):
crawlio crawl start https://example.com --depth 0 --scope same-domain
crawlio export run extracted --dest ~/snapshots/v2
- Diff the extracted content:
diff -rq ~/snapshots/v1/pages ~/snapshots/v2/pages
For structured diffs, compare the metadata.json files to find changed titles, missing pages, or new URLs.
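The metadata comparison can be sketched like this. The fixture below assumes metadata.json carries a "title" field, as described in the extraction workflow above; the /tmp paths and JSON shape are illustrative, not exact crawlio output:

```shell
# Fixture: one page's metadata in two snapshots (shape is illustrative).
mkdir -p /tmp/snap-demo/v1/page /tmp/snap-demo/v2/page
echo '{"url": "https://example.com/page", "title": "Old Title"}' > /tmp/snap-demo/v1/page/metadata.json
echo '{"url": "https://example.com/page", "title": "New Title"}' > /tmp/snap-demo/v2/page/metadata.json

# Pull every title out of each snapshot, then diff the two sorted lists.
for v in v1 v2; do
  grep -rho '"title": "[^"]*"' /tmp/snap-demo/$v | sort > /tmp/snap-demo/titles.$v
done
diff /tmp/snap-demo/titles.v1 /tmp/snap-demo/titles.v2 || echo 'titles changed'
```

The same pattern works for URLs: extract them from both snapshots and the diff shows pages that appeared or disappeared between crawls.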
Export design tokens
Extract CSS custom properties, colors, typography, and spacing values from a site.
crawlio crawl start https://example.com \
--depth 1 \
--scope same-domain
crawlio export run extracted --dest ~/tokens/example
The extracted output includes design token data found in stylesheets:
- CSS custom properties (--color-primary, --font-size-lg)
- Color values (hex, rgb, hsl)
- Font stacks and sizes
- Spacing and layout values
Design tokens appear in the per-page assets.json and in the extraction pipeline output.
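If you want a raw list of custom properties without the structured export, a plain grep over the downloaded stylesheets gets close. A sketch against a fixture stylesheet (the path and contents are illustrative; point the grep at the crawl's CSS files in practice):

```shell
# Fixture stylesheet (illustrative).
mkdir -p /tmp/tokens-demo
cat > /tmp/tokens-demo/style.css <<'CSS'
:root {
  --color-primary: #3366ff;
  --font-size-lg: 1.25rem;
}
CSS

# List declared custom properties: name, then value.
# The leading `--` argument stops option parsing so the pattern's
# own double dash is not read as a flag.
grep -oE -- '--[a-zA-Z0-9-]+: *[^;]+' /tmp/tokens-demo/style.css
```

This catches declarations only; resolving var() references to final values needs the structured assets.json data.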
Generate a deploy-ready static site
Mirror a site and deploy it to Vercel, Netlify, S3, or any static host.
crawlio crawl start https://example.com \
--depth 0 \
--scope same-domain
crawlio export run deploy --dest ~/deploy/example
The deploy export:
- Converts pages to clean URL routes (blog.html becomes blog/index.html)
- Flattens cross-domain assets under _assets/{host}/
- Rewrites all references in HTML and CSS
- Generates a deploy.json manifest and sitemap.xml
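The clean-URL step can be sketched in isolation: each page.html moves to page/index.html so static hosts serve it at /page/. A minimal sketch with fixture files (paths are illustrative, and this mimics only the rename, not the link rewriting crawlio also performs):

```shell
# Fixture pages (illustrative only).
mkdir -p /tmp/deploy-demo && cd /tmp/deploy-demo
echo '<h1>Blog</h1>' > blog.html
echo '<h1>Home</h1>' > index.html

for f in *.html; do
  [ "$f" = "index.html" ] && continue   # the root index stays where it is
  d="${f%.html}"
  mkdir -p "$d" && mv "$f" "$d/index.html"  # blog.html -> blog/index.html
done
```

After this, requesting /blog/ works on any host that serves directory indexes, with no server-side rewrite rules.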
Deploy directly:
# Vercel
cd ~/deploy/example && vercel
# Netlify
netlify deploy --dir ~/deploy/example --prod
# S3
aws s3 sync ~/deploy/example s3://my-bucket --delete
Next steps
- CLI Commands: Full CLI reference
- Export Formats: All 7 formats in detail
- Settings Reference: All configuration options
- Troubleshooting: Common issues and fixes