# Autonomous Loop
The autonomous loop runs a multi-iteration cycle: crawl a site, analyze coverage gaps, adjust settings, and repeat. No human intervention required. This is the engine behind both `crawlio <url>` and `crawlio loop`.
Tier: Core.
## Quick start

```bash
crawlio loop https://docs.stripe.com --target-coverage 0.95 --export warc
```

Crawls the site, re-crawls any gaps, and exports as WARC when 95% coverage is reached.
## Flags

| Flag | Type | Default | Description |
|---|---|---|---|
| `<url>` | string | required | URL to crawl |
| `--max-iterations` | int | 5 | Maximum loop iterations |
| `--target-coverage` | double | 0.95 | Target downloaded/discovered ratio (0.0-1.0) |
| `--dest` | string | auto | Destination directory |
| `--export` | string | -- | Auto-export format on completion: `folder`, `zip`, `singleHTML`, `warc` |
| `--log-file` | string | -- | Write JSON report to this path |
| `--agent` | bool | false | Enable browser agent intelligence |
| `--auth-url` | string | -- | URL for interactive browser auth before crawling |
| `--auth-file` | string | -- | Path to cookie file (Netscape or JSON format) |
| `--block-trackers` | bool | false | Block tracker/analytics requests via agent |
The `crawlio loop` command requires `--agent` to enable browser intelligence (opt-in). The default `crawlio <url>` command auto-detects the agent instead.
## How it works
Each iteration follows a four-step cycle:
### 1. Crawl
Download pages using current settings. The engine respects scope, depth, robots.txt, and rate limits.
### 2. Analyze
Compute coverage (downloaded / discovered) and identify gaps:
- Which URLs failed and why (timeout, 4xx, 5xx, connection error)
- Whether the site is an SPA with empty HTML shells
- Whether rate limiting is occurring (429 responses)
- Whether authentication is required (401/403 patterns)
### 3. Adjust
Tune settings based on what was learned:
| Detection | Adjustment |
|---|---|
| SPA detected (empty body + framework markers) | Enable JS rendering |
| Rate limiting (429 responses) | Increase crawl delay |
| Timeouts | Increase timeout, reduce concurrency |
| Auth required (401/403 patterns) | Flag for manual auth |
| Many failed URLs | Retry with relaxed settings |
### 4. Repeat
If coverage is below target, start the next iteration with adjusted settings. Failed URLs from the previous iteration are re-injected into the frontier.
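To make the cycle concrete, here is a minimal Python sketch of the control flow. Everything in it is hypothetical: `CrawlResult`, `crawl_fn`, and the settings keys stand in for crawlio's internals, which this page does not specify.

```python
# Illustrative sketch of the crawl -> analyze -> adjust -> repeat cycle.
# All names here (CrawlResult, crawl_fn, settings keys) are hypothetical;
# they mirror the documented behavior, not crawlio's actual API.
from dataclasses import dataclass

@dataclass
class CrawlResult:
    discovered: set[str]      # every URL seen so far
    downloaded: set[str]      # URLs fetched successfully
    failures: dict[str, str]  # URL -> error kind ("timeout", "429", "spa_shell", ...)

def autonomous_loop(crawl_fn, url, target=0.95, max_iterations=5):
    settings = {"js_rendering": False, "delay_ms": 0, "timeout_s": 30, "concurrency": 10}
    discovered, downloaded = {url}, set()
    frontier = {url}

    for _ in range(max_iterations):
        result = crawl_fn(frontier, settings)        # 1. Crawl with current settings
        discovered |= result.discovered
        downloaded |= result.downloaded

        coverage = len(downloaded) / len(discovered) # 2. Analyze
        if coverage >= target:
            return coverage                          # target reached

        kinds = set(result.failures.values())        # 3. Adjust per detected gap types
        if "spa_shell" in kinds:
            settings["js_rendering"] = True          # empty shells -> render JS
        if "429" in kinds:
            settings["delay_ms"] += 500              # rate limited -> slow down
        if "timeout" in kinds:
            settings["timeout_s"] *= 2               # timeouts -> longer timeout,
            settings["concurrency"] = max(1, settings["concurrency"] // 2)  # less parallelism

        frontier = discovered - downloaded           # 4. Repeat: re-inject failed URLs

    return len(downloaded) / len(discovered)         # max iterations reached
```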
## Coverage calculation

```
coverage = successfully downloaded URLs / total discovered URLs
```

The loop stops when any of these conditions is met:
| Condition | Exit code |
|---|---|
| Coverage reaches the target | 0 |
| Coverage stalls for 2 consecutive iterations (circuit breaker) | 1 |
| Maximum iterations reached | 1 |
| Unrecoverable error | 2 |
The circuit breaker prevents infinite loops on sites where some URLs will never succeed (gated content, dead links, etc.).
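Because the exit codes are distinct, scripts can branch on them. A minimal sketch using Python's `subprocess` module, assuming `crawlio` is on the PATH:

```python
# Branch on crawlio's documented exit codes:
# 0 = target reached, 1 = stalled or max iterations, 2 = unrecoverable error.
import subprocess
import sys

proc = subprocess.run([
    "crawlio", "loop", "https://docs.stripe.com",
    "--target-coverage", "0.95",
    "--log-file", "crawl-report.json",
])

if proc.returncode == 0:
    print("Target coverage reached")
elif proc.returncode == 1:
    print("Partial crawl: coverage stalled or iteration limit hit")
    sys.exit(0)  # this pipeline chooses to accept partial coverage
else:
    sys.exit(proc.returncode)  # unrecoverable error: fail the build
```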
## Example output
```bash
crawlio loop https://docs.stripe.com \
  --max-iterations 5 \
  --target-coverage 0.95 \
  --export warc \
  --log-file crawl-report.json
```

```
Autonomous Crawl Loop v1.0.0
URL: https://docs.stripe.com
Max iterations: 5
Target coverage: 95%
Auto-export: warc
-- Iteration 1 --------------------------
Crawling... 1,247 pages discovered
Coverage: 0.82 (1,023 / 1,247)
Adjusting: increasing maxConcurrent to 20
-- Iteration 2 --------------------------
Re-crawling 224 remaining pages...
Coverage: 0.94 (1,173 / 1,247)
Adjusting: retrying failed pages with longer timeout
-- Iteration 3 --------------------------
Re-crawling 74 remaining pages...
Coverage: 0.97 (1,210 / 1,247)
Target coverage reached!
Exporting as warc...
Export completed: ~/docs.stripe.com.warc
Report written to crawl-report.json
```

## JSON report
When `--log-file` is specified, a JSON report is written on exit:
- Per-iteration results (discovered, downloaded, failed, localized counts)
- Settings adjustments made at each iteration
- All failed URLs with error details
- Exit reason and final coverage
- Timing information
Useful for scripting and CI/CD pipelines.
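For example, a CI step can gate on the final coverage recorded in the report. The field names below (`final_coverage`, `failed_urls`) are assumptions for illustration; inspect an actual report for the exact schema:

```python
# Read the --log-file report and fail the build below a coverage floor.
# Field names ("final_coverage", "failed_urls") are assumptions; check
# a real report for the schema crawlio actually writes.
import json
import sys

with open("crawl-report.json") as f:
    report = json.load(f)

coverage = report.get("final_coverage", 0.0)
failed = report.get("failed_urls", [])

print(f"final coverage: {coverage:.2%}, failed URLs: {len(failed)}")
if coverage < 0.90:
    sys.exit(1)  # coverage floor for this pipeline
```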
## Agent integration

The loop can connect to `crawlio-browser` (Chrome extension) for browser-powered intelligence.

### What the agent provides
- **Framework detection.** Identifies React, Next.js, Vue, Angular, and more, and adjusts the crawl strategy accordingly.
- **Tracker blocking.** Blocks known analytics/tracker requests to speed up crawls.
- **URL discovery.** Captures dynamically loaded URLs from JavaScript-rendered pages.
- **Network monitoring.** Observes actual browser requests to find hidden resources.
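As a rough illustration of the first point (not crawlio's actual heuristics), frameworks tend to leave recognizable markers in served HTML, and an "empty shell" is a page carrying such markers but almost no text:

```python
# Rough sketch of framework detection via well-known page markers.
# These heuristics are illustrative; crawlio's detection may differ.
import re

FRAMEWORK_MARKERS = {
    "Next.js": ("__NEXT_DATA__", "_next/static"),
    "React": ("data-reactroot", "react-dom"),
    "Vue": ("data-v-app", "vue.runtime"),
    "Angular": ("ng-version",),
}

def detect_frameworks(html: str) -> list[str]:
    """Return the frameworks whose markers appear in the raw HTML."""
    return [name for name, markers in FRAMEWORK_MARKERS.items()
            if any(marker in html for marker in markers)]

def looks_like_spa_shell(html: str) -> bool:
    """Empty HTML shell: framework markers present but almost no text content."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag strip, fine for a heuristic
    return len(text.split()) < 50 and bool(detect_frameworks(html))
```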
### Usage

1. Start crawlio-browser:

   ```bash
   cd crawlio-browser && npm start
   ```

2. Add `--agent` to your loop command:

   ```bash
   crawlio loop https://spa-example.com --agent --block-trackers
   ```

The CLI connects via WebSocket at 127.0.0.1:9333-9342 (first free port).
The agent is optional. Without `--agent`, the loop runs without browser intelligence. With `--agent`, it becomes a hard requirement: the loop exits with code 2 if the agent cannot be reached.
## Authentication
For crawling login-gated sites, the loop supports two methods.
### Interactive browser auth

```bash
crawlio loop https://example.com/dashboard --agent --auth-url https://example.com/login
```

Opens the login page in the browser agent, waits for you to log in, then transfers session cookies to the crawler.
### Cookie file auth

```bash
crawlio loop https://example.com/dashboard --auth-file ~/cookies.json
```

Supports both Netscape cookie format and JSON cookie arrays.
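If you export cookies programmatically, a JSON cookie array typically carries the fields shown in this sketch. The exact schema crawlio accepts is not specified on this page, so treat the keys (`name`, `value`, `domain`, `path`, `secure`) as the common denominator of browser cookie exports:

```python
# Write a JSON cookie array for --auth-file. The keys shown are the
# common fields of browser cookie exports; verify the exact schema
# crawlio expects before relying on this.
import json

cookies = [
    {
        "name": "session_id",    # hypothetical cookie name
        "value": "abc123",       # hypothetical session token
        "domain": ".example.com",
        "path": "/",
        "secure": True,
    }
]

with open("cookies.json", "w") as f:
    json.dump(cookies, f, indent=2)
```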
## Next steps
- See all CLI commands in the Commands Reference
- Try the Interactive Shell for hands-on exploration
- Configure crawl behavior with Settings & Policy