
Autonomous Loop

The autonomous loop runs a multi-iteration cycle: crawl a site, analyze coverage gaps, adjust settings, and repeat. No human intervention required. This is the engine behind both crawlio <url> and crawlio loop.

Tier: Core.

Quick start

crawlio loop https://docs.stripe.com --target-coverage 0.95 --export warc

Crawls the site, re-crawls any gaps, and exports as WARC when 95% coverage is reached.

Flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| <url> | string | required | URL to crawl |
| --max-iterations | int | 5 | Maximum loop iterations |
| --target-coverage | double | 0.95 | Target downloaded/discovered ratio (0.0-1.0) |
| --dest | string | auto | Destination directory |
| --export | string | -- | Auto-export format on completion: folder, zip, singleHTML, warc |
| --log-file | string | -- | Write JSON report to this path |
| --agent | bool | false | Enable browser agent intelligence |
| --auth-url | string | -- | URL for interactive browser auth before crawling |
| --auth-file | string | -- | Path to cookie file (Netscape or JSON format) |
| --block-trackers | bool | false | Block tracker/analytics requests via agent |

The crawlio loop command requires --agent to enable browser intelligence (opt-in). The default crawlio <url> command auto-detects the agent instead.
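For example, a fuller invocation combining several of these flags (the URL and file paths are placeholders):

crawlio loop https://example.com/docs \
  --dest ./archive \
  --auth-file ~/cookies.txt \
  --log-file report.json \
  --export zip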

How it works

Each iteration follows a four-step cycle:

1. Crawl

Download pages using current settings. The engine respects scope, depth, robots.txt, and rate limits.

2. Analyze

Compute coverage (downloaded / discovered) and identify gaps:

  • Which URLs failed and why (timeout, 4xx, 5xx, connection error)
  • Whether the site is an SPA with empty HTML shells
  • Whether rate limiting is occurring (429 responses)
  • Whether authentication is required (401/403 patterns)

3. Adjust

Tune settings based on what was learned:

| Detection | Adjustment |
| --- | --- |
| SPA detected (empty body + framework markers) | Enable JS rendering |
| Rate limiting (429 responses) | Increase crawl delay |
| Timeouts | Increase timeout, reduce concurrency |
| Auth required (401/403 patterns) | Flag for manual auth |
| Many failed URLs | Retry with relaxed settings |

4. Repeat

If coverage is below target, start the next iteration with adjusted settings. Failed URLs from the previous iteration are re-injected into the frontier.

Coverage calculation

Coverage = successfully downloaded URLs / total discovered URLs
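For example, if 1,023 of 1,247 discovered URLs have downloaded successfully, coverage is 1023 / 1247 ≈ 0.82; with a 0.95 target, the loop starts another iteration.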

The loop stops when any of these conditions is met:

| Condition | Exit code |
| --- | --- |
| Coverage reaches the target | 0 |
| Coverage stalls for 2 consecutive iterations (circuit breaker) | 1 |
| Maximum iterations reached | 1 |
| Unrecoverable error | 2 |

The circuit breaker prevents infinite loops on sites where some URLs will never succeed (gated content, dead links, etc.).
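Because the exit code distinguishes these outcomes, wrapper scripts can branch on it. A minimal shell sketch (the URL and flag values are placeholders):

# Run the loop, then branch on the documented exit codes.
crawlio loop https://example.com --target-coverage 0.9 --export warc
case $? in
  0) echo "Target coverage reached" ;;
  1) echo "Stopped early (stall or max iterations); archive may be incomplete" ;;
  2) echo "Unrecoverable error" >&2; exit 1 ;;
esac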

Example output

crawlio loop https://docs.stripe.com \
  --max-iterations 5 \
  --target-coverage 0.95 \
  --export warc \
  --log-file crawl-report.json
Autonomous Crawl Loop v1.0.0
 
URL:              https://docs.stripe.com
Max iterations:   5
Target coverage:  95%
Auto-export:      warc
 
  -- Iteration 1 --------------------------
  Crawling... 1,247 pages discovered
  Coverage: 0.82 (1,023 / 1,247)
  Adjusting: increasing maxConcurrent to 20
 
  -- Iteration 2 --------------------------
  Re-crawling 224 remaining pages...
  Coverage: 0.94 (1,173 / 1,247)
  Adjusting: retrying failed pages with longer timeout
 
  -- Iteration 3 --------------------------
  Re-crawling 74 remaining pages...
  Coverage: 0.97 (1,210 / 1,247)
  Target coverage reached!
 
  Exporting as warc...
  Export completed: ~/docs.stripe.com.warc
  Report written to crawl-report.json

JSON report

When --log-file is specified, a JSON report is written on exit:

  • Per-iteration results (discovered, downloaded, failed, localized counts)
  • Settings adjustments made at each iteration
  • All failed URLs with error details
  • Exit reason and final coverage
  • Timing information

Useful for scripting and CI/CD pipelines.
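For example, the report can be queried with jq in a pipeline. The field names below (finalCoverage, failedUrls) are hypothetical; inspect a generated report for the actual schema:

# Field names are assumptions -- check your own crawl-report.json for the real keys.
jq '.finalCoverage' crawl-report.json
jq -r '.failedUrls[].url' crawl-report.json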

Agent integration

The loop can connect to crawlio-browser (Chrome extension) for browser-powered intelligence.

What the agent provides

  • Framework detection. Identifies React, Next.js, Vue, Angular, and more. Adjusts crawl strategy.
  • Tracker blocking. Blocks known analytics/tracker requests to speed up crawls.
  • URL discovery. Captures dynamically loaded URLs from JavaScript-rendered pages.
  • Network monitoring. Observes actual browser requests to find hidden resources.

Usage

  1. Start crawlio-browser: cd crawlio-browser && npm start
  2. Add --agent to your loop command:
crawlio loop https://spa-example.com --agent --block-trackers

The CLI connects via WebSocket at 127.0.0.1:9333-9342 (first free port).

The agent is optional: without --agent, the loop runs without browser intelligence. With --agent, the agent becomes a hard requirement, and the loop exits with code 2 if it cannot be reached.

Authentication

For crawling login-gated sites, the loop supports two methods.

Interactive browser auth

crawlio loop https://example.com/dashboard --agent --auth-url https://example.com/login

Opens the login page in the browser agent, waits for you to log in, then transfers session cookies to the crawler.

Cookie file auth

crawlio loop https://example.com/dashboard --auth-file ~/cookies.json

Supports both Netscape cookie format and JSON cookie arrays.
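As an illustration, a Netscape-format entry is one tab-separated line (domain, include-subdomains flag, path, secure flag, expiry as a Unix epoch, name, value); the values here are placeholders:

# Netscape HTTP Cookie File
.example.com	TRUE	/	TRUE	1767225600	session	abc123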
