
Autonomous Loop

The autonomous loop runs a multi-iteration cycle: crawl a site, analyze coverage gaps, adjust settings, and repeat. No human intervention required. This is the engine behind both crawlio <url> and crawlio loop.

Tier: Core.

Quick start

crawlio loop https://docs.stripe.com --target-coverage 0.95 --export warc

Crawls the site, re-crawls any gaps, and exports as WARC when 95% coverage is reached.

Flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| <url> | string | required | URL to crawl |
| --max-iterations | int | 5 | Maximum loop iterations |
| --target-coverage | double | 0.95 | Target downloaded/discovered ratio (0.0-1.0) |
| --dest | string | auto | Destination directory |
| --export | string | -- | Auto-export format on completion: folder, zip, singleHTML, warc |
| --log-file | string | -- | Write JSON report to this path |
| --agent | bool | false | Enable browser agent intelligence |
| --auth-url | string | -- | URL for interactive browser auth before crawling |
| --auth-file | string | -- | Path to cookie file (Netscape or JSON format) |
| --block-trackers | bool | false | Block tracker/analytics requests via agent |

The crawlio loop command requires --agent to enable browser intelligence (opt-in). The default crawlio <url> command auto-detects the agent instead.
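For example, a fuller invocation combining several of these flags (the URL and file paths are placeholders):

crawlio loop https://example.com/docs \
  --dest ./archive \
  --auth-file ~/cookies.txt \
  --log-file report.json \
  --export zip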

How it works

Each iteration follows a four-step cycle:

1. Crawl

Download pages using current settings. The engine respects scope, depth, robots.txt, and rate limits.

2. Analyze

Compute coverage (downloaded / discovered) and identify gaps:

  • Which URLs failed and why (timeout, 4xx, 5xx, connection error)
  • Whether the site is an SPA with empty HTML shells
  • Whether rate limiting is occurring (429 responses)
  • Whether authentication is required (401/403 patterns)

3. Adjust

Tune settings based on what was learned:

| Detection | Adjustment |
| --- | --- |
| SPA detected (empty body + framework markers) | Enable JS rendering |
| Rate limiting (429 responses) | Increase crawl delay |
| Timeouts | Increase timeout, reduce concurrency |
| Auth required (401/403 patterns) | Flag for manual auth |
| Many failed URLs | Retry with relaxed settings |

4. Repeat

If coverage is below target, start the next iteration with adjusted settings. Failed URLs from the previous iteration are re-injected into the frontier.

Coverage calculation

Coverage = successfully downloaded URLs / total discovered URLs
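For example, if 1,023 of 1,247 discovered URLs have downloaded successfully, coverage is 1023 / 1247 ≈ 0.82; with a 0.95 target, the loop starts another iteration.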

The loop stops when any of these conditions is met:

| Condition | Exit code |
| --- | --- |
| Coverage reaches the target | 0 |
| Coverage stalls for 2 consecutive iterations (circuit breaker) | 1 |
| Maximum iterations reached | 1 |
| Unrecoverable error | 2 |

The circuit breaker prevents infinite loops on sites where some URLs will never succeed (gated content, dead links, etc.).
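Because the exit code distinguishes these outcomes, wrapper scripts can branch on it. A minimal shell sketch (the URL and flag values are placeholders):

# Run the loop, then branch on the documented exit codes.
crawlio loop https://example.com --target-coverage 0.9 --export warc
case $? in
  0) echo "Target coverage reached" ;;
  1) echo "Stopped early (stall or max iterations); archive may be incomplete" ;;
  2) echo "Unrecoverable error" >&2; exit 1 ;;
esac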

Example output

crawlio loop https://docs.stripe.com \
  --max-iterations 5 \
  --target-coverage 0.95 \
  --export warc \
  --log-file crawl-report.json
Autonomous Crawl Loop v1.0.0
 
URL:              https://docs.stripe.com
Max iterations:   5
Target coverage:  95%
Auto-export:      warc
 
  -- Iteration 1 --------------------------
  Crawling... 1,247 pages discovered
  Coverage: 0.82 (1,023 / 1,247)
  Adjusting: increasing maxConcurrent to 20
 
  -- Iteration 2 --------------------------
  Re-crawling 224 remaining pages...
  Coverage: 0.94 (1,173 / 1,247)
  Adjusting: retrying failed pages with longer timeout
 
  -- Iteration 3 --------------------------
  Re-crawling 74 remaining pages...
  Coverage: 0.97 (1,210 / 1,247)
  Target coverage reached!
 
  Exporting as warc...
  Export completed: ~/docs.stripe.com.warc
  Report written to crawl-report.json

JSON report

When --log-file is specified, a JSON report is written on exit:

  • Per-iteration results (discovered, downloaded, failed, localized counts)
  • Settings adjustments made at each iteration
  • All failed URLs with error details
  • Exit reason and final coverage
  • Timing information

Useful for scripting and CI/CD pipelines.
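For example, the report can be queried with jq in a pipeline. The field names below (finalCoverage, failedUrls) are hypothetical; inspect a generated report for the actual schema:

# Field names are assumptions -- check your own crawl-report.json for the real keys.
jq '.finalCoverage' crawl-report.json
jq -r '.failedUrls[].url' crawl-report.json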

Agent integration

The loop can connect to crawlio-browser (Chrome extension) for browser-powered intelligence.

What the agent provides

  • Framework detection. Identifies React, Next.js, Vue, Angular, and more. Adjusts crawl strategy.
  • Tracker blocking. Blocks known analytics/tracker requests to speed up crawls.
  • URL discovery. Captures dynamically loaded URLs from JavaScript-rendered pages.
  • Network monitoring. Observes actual browser requests to find hidden resources.

Usage

  1. Start crawlio-browser: cd crawlio-browser && npm start
  2. Add --agent to your loop command:
crawlio loop https://spa-example.com --agent --block-trackers

The CLI connects via WebSocket at 127.0.0.1:9333-9342 (first free port).

The agent is optional: without --agent, the loop runs without browser intelligence. With --agent, the agent becomes a hard requirement, and the loop exits with code 2 if it cannot be reached.

Authentication

For crawling login-gated sites, the loop supports two methods.

Interactive browser auth

crawlio loop https://example.com/dashboard --agent --auth-url https://example.com/login

Opens the login page in the browser agent, waits for you to log in, then transfers session cookies to the crawler.

Cookie file auth

crawlio loop https://example.com/dashboard --auth-file ~/cookies.json

Supports both Netscape cookie format and JSON cookie arrays.
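As an illustration, a Netscape-format entry is one tab-separated line (domain, include-subdomains flag, path, secure flag, expiry as a Unix epoch, name, value); the values here are placeholders:

# Netscape HTTP Cookie File
.example.com	TRUE	/	TRUE	1767225600	session	abc123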
