# Settings Reference

## Download
Control how Crawlio downloads individual files.
| Setting | Type | Default | Description |
|---|---|---|---|
| `maxConcurrent` | int | 4 | Maximum parallel download connections (1-40) |
| `maxConnectionsPerHost` | int | 6 | Per-host connection limit (1-20) |
| `crawlDelay` | double | 0.5 | Seconds between requests to the same host |
| `timeout` | int | 60 | Request timeout in seconds (5-300) |
| `maxBytesPerSecond` | int | 0 | Bandwidth throttle via token bucket (0 = unlimited) |
| `maxRetries` | int | 3 | Retry count for transient errors (5xx, timeouts) |
| `userAgent` | string | System default | Presets: Safari, Chrome, Firefox, Googlebot, or a custom string |
| `stripTrackingParams` | bool | true | Remove `utm_*`, `fbclid`, and other tracking query parameters |
| `downloadErrorPages` | bool | false | Save pages that return 4xx/5xx status codes |
| `alwaysDownloadHTML` | bool | false | Force re-download of HTML even if already cached |
| `downloadUsingWebViews` | bool | false | Use WebKit to download JS-rendered pages |
| `limitMinImageSize` | bool | false | Enable minimum image size filtering |
| `minimumImageSize` | int | 0 | Minimum image dimension in pixels; smaller images are rejected |
| `customCookies` | array | [] | Cookie entries: `{ name, value, domain, path }` |
| `customHeaders` | array | [] | HTTP headers: `{ name, value }` |
| `customDataAttributes` | array | [] | Additional `data-*` attributes to scan for URLs |
| `webpagePatterns` | array | [] | URL patterns treated as HTML pages |
| `promptForCredentials` | bool | true | Show credential prompts for 401 responses |
| `storeCredentialsInKeychain` | bool | true | Save credentials in the macOS Keychain |
| `preferHTTP2` | bool | true | Negotiate HTTP/2 via ALPN when the server supports it |
| `captureTimeout` | int | 30 | WebKit capture timeout in seconds |
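As a usage sketch, here is a polite, low-impact download profile set from the CLI using the `crawlio settings set` syntax shown later on this page; the values are illustrative, not recommendations:

```bash
# Be gentle with the target host: fewer connections, longer pauses.
crawlio settings set settings.maxConcurrent 2
crawlio settings set settings.maxConnectionsPerHost 2
crawlio settings set settings.crawlDelay 2.0
# Cap bandwidth at roughly 1 MB/s via the token bucket throttle.
crawlio settings set settings.maxBytesPerSecond 1000000
```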
## Scope
Control which pages and resources Crawlio follows and downloads.
| Setting | Type | Default | Description |
|---|---|---|---|
| `scopeMode` | string | "sameDomain" | URL scope: `sameDomain`, `includeSubdomains`, or `customList` |
| `maxDepth` | int | 5 | Maximum link hops from the seed URL (0-100). 1 = seed page only |
| `externalLinkDepth` | int | 0 | Levels to follow on external domains (0 = don't follow) |
| `maxPagesPerCrawl` | int | 0 | Stop after this many pages (0 = unlimited) |
| `maxDiscoveredURLs` | int | 100000 | Frontier URL cap to prevent unbounded memory growth |
| `includePatterns` | array | [] | URL patterns to include (substring or regex) |
| `excludePatterns` | array | [] | URL patterns to exclude (substring or regex) |
| `useRegexPatterns` | bool | false | Treat include/exclude patterns as regular expressions |
| `includeSupportingFiles` | bool | true | Download supporting assets (CSS, JS, fonts) even outside scope |
| `downloadCrossDomainAssets` | bool | true | Download assets from external domains referenced by in-scope pages |
| `autoUpgradeHTTP` | bool | true | Auto-upgrade `http://` to `https://`, with fallback on certificate errors |
| `scanSitemaps` | bool | true | Discover URLs from sitemap.xml and robots.txt `Sitemap` directives |
### Scope modes

| Mode | Behavior |
|---|---|
| `sameDomain` | Only URLs on the exact same domain as the seed. www-insensitive: www.example.com = example.com |
| `includeSubdomains` | The seed domain plus all subdomains (e.g., blog.example.com, cdn.example.com) |
| `customList` | Only URLs matching the user-defined include patterns |
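For example, a `customList` scope restricted to two site sections might be set like this — assuming array-valued settings accept a JSON literal on the CLI, as `proxyConfiguration` does in the proxy example later on this page:

```bash
# Crawl only the docs and blog sections.
crawlio settings set policy.scopeMode customList
# Assumption: array settings take a JSON array literal, like proxyConfiguration.
crawlio settings set policy.includePatterns '["example.com/docs/", "example.com/blog/"]'
```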
## Policy
Control crawl limits, file types, and content handling.
| Setting | Type | Default | Description |
|---|---|---|---|
| `maxCrawlTime` | double | null | Maximum crawl duration in seconds (null = unlimited) |
| `maxFileSize` | int | 52428800 | Maximum individual file size in bytes (default 50 MB) |
| `minFileSize` | int | null | Minimum file size in bytes (null = no minimum) |
| `maxTotalSize` | int | 524288000 | Maximum total download size in bytes (default 500 MB) |
| `maxRedirectChainDepth` | int | 20 | Maximum redirects per URL before rejection |
| `respectRobotsTxt` | bool | true | Honor robots.txt crawl rules |
| `enableJSRendering` | bool | false | Re-render SPA shells via WebKit when an empty body plus framework markers are detected |
| `noProgressTimeout` | double | 120 | Seconds without progress before the crawl auto-completes |
| `downloadEmbeddedVideos` | bool | false | Download video files from YouTube/Vimeo embeds via yt-dlp |
| `hostBlacklistThreshold` | int | 10 | Consecutive failures before a host is blacklisted |
| `allowedFileTypes` | array | [] | Allowed file extensions (empty = allow all) |
| `blockedFileTypes` | array | [] | Blocked file extensions |
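A bounded crawl budget, for instance, might combine the time and size limits like this (illustrative values, using the CLI syntax shown later on this page):

```bash
# Stop after 30 minutes or 100 MB total, whichever comes first.
crawlio settings set policy.maxCrawlTime 1800
crawlio settings set policy.maxTotalSize 104857600
# Skip any individual file over 10 MB.
crawlio settings set policy.maxFileSize 10485760
```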
### Content type toggles
| Setting | Type | Default | Description |
|---|---|---|---|
| `downloadImages` | bool | true | Download image files (JPEG, PNG, GIF, SVG, WebP) |
| `downloadVideo` | bool | true | Download video files (MP4, WebM) |
| `downloadAudio` | bool | true | Download audio files (MP3, WAV, OGG) |
| `downloadFonts` | bool | true | Download font files (WOFF, WOFF2, TTF, OTF) |
| `downloadScripts` | bool | true | Download JavaScript files |
| `downloadStyles` | bool | true | Download CSS stylesheets |
| `downloadPDFs` | bool | true | Download PDF documents |
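A text-focused archive can switch off the heavy media types. The sketch below assumes these toggles live under the `policy.` namespace, since they appear in this Policy section; this page doesn't state the prefix explicitly:

```bash
# Assumption: content toggles use the policy.* prefix like the other Policy settings.
crawlio settings set policy.downloadVideo false
crawlio settings set policy.downloadAudio false
crawlio settings set policy.downloadFonts false
```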
## Proxy
Route crawl traffic through an HTTP, HTTPS, or SOCKS5 proxy.
| Setting | Type | Default | Description |
|---|---|---|---|
| `proxyConfiguration.type` | string | "http" | Proxy type: `http`, `https`, or `socks5` |
| `proxyConfiguration.host` | string | (required) | Proxy server hostname or IP |
| `proxyConfiguration.port` | int | 8080 | Proxy port. Defaults: HTTP 8080, HTTPS 8443, SOCKS5 1080 |
| `proxyConfiguration.noProxyHosts` | array | [] | Hosts that bypass the proxy. Suffix-matched with dot-boundary semantics |
Proxy credentials (username and password) are accepted at runtime but not persisted to disk.
Crawlio also reads the `http_proxy`, `https_proxy`, and `no_proxy` environment variables when no explicit proxy is configured.
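So with no explicit proxy configured, a standard shell environment setup like this takes effect (these are ordinary environment variables, nothing Crawlio-specific):

```bash
export http_proxy=http://proxy.corp.com:8080
export https_proxy=http://proxy.corp.com:8080
export no_proxy=localhost,.internal.com
```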
`no_proxy` matching rules:

- Exact match: `example.com` matches `example.com`
- Suffix with dot boundary: `example.com` matches `sub.example.com` but not `notexample.com`
- Leading dot: `.example.com` matches the domain and all subdomains
- Wildcard: `*` bypasses all hosts
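These rules reduce to a small matching routine. Here is an illustrative Swift sketch of that logic — a hypothetical helper, not Crawlio's actual implementation:

```swift
/// Hypothetical helper illustrating the no_proxy matching rules above.
func bypassesProxy(host: String, noProxyHosts: [String]) -> Bool {
    for entry in noProxyHosts {
        if entry == "*" { return true }  // wildcard: bypass everything
        if entry.hasPrefix(".") {
            // Leading dot: the bare domain and every subdomain bypass the proxy.
            let bare = String(entry.dropFirst())
            if host == bare || host.hasSuffix(entry) { return true }
        } else if host == entry || host.hasSuffix("." + entry) {
            // Exact match, or suffix match with a dot boundary:
            // "example.com" matches "sub.example.com" but not "notexample.com".
            return true
        }
    }
    return false
}

// bypassesProxy(host: "sub.example.com", noProxyHosts: ["example.com"]) -> true
// bypassesProxy(host: "notexample.com",  noProxyHosts: ["example.com"]) -> false
```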
Example:

```bash
crawlio settings set settings.proxyConfiguration '{"type":"http","host":"proxy.corp.com","port":8080,"noProxyHosts":["localhost",".internal.com"]}'
```

Or via MCP:

```
update_settings(settings: {
  proxyConfiguration: {
    type: "http",
    host: "proxy.corp.com",
    port: 8080,
    noProxyHosts: ["localhost", ".internal.com"]
  }
})
```

## Security
Certificate pinning, HSTS enforcement, and trust evaluation.
### Certificate pinning
Pin specific public keys per host to prevent MITM attacks. Keys are SHA-256 hashes of the DER-encoded SubjectPublicKeyInfo, Base64-encoded.
| Setting | Type | Default | Description |
|---|---|---|---|
| `pinnedPublicKeys` | object | {} | Map of hostname to array of Base64-encoded SHA-256 public key pins |
Example:

```
update_settings(policy: {
  pinnedPublicKeys: {
    "api.example.com": ["sha256//AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="],
    "cdn.example.com": ["sha256//BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB="]
  }
})
```

Crawlio validates the server certificate's public key against the pinned values during the TLS handshake. If no pin matches, the connection fails. Host matching is case-insensitive.
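To compute a pin value for a host, the standard OpenSSL pipeline works: extract the leaf certificate's public key, hash the DER-encoded SubjectPublicKeyInfo with SHA-256, and Base64-encode the digest. This is generic tooling, not a Crawlio command; prefix the output with `sha256//` as shown in the example above.

```bash
# Extract the server's public key, hash the DER-encoded SPKI, Base64-encode it.
openssl s_client -connect api.example.com:443 -servername api.example.com </dev/null 2>/dev/null \
  | openssl x509 -pubkey -noout \
  | openssl pkey -pubin -outform der \
  | openssl dgst -sha256 -binary \
  | base64
```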
### HSTS enforcement
Crawlio maintains an HSTS store that tracks Strict-Transport-Security headers observed during crawling. HSTS entries honor the `max-age` and `includeSubDomains` directives per RFC 6797.

When HSTS is active for a host:

- All `http://` requests to that host are upgraded to `https://` before sending
- Subdomain upgrades apply when `includeSubDomains` was set
- Entries expire after `max-age` seconds
- IP addresses are excluded from HSTS (per the RFC)
HSTS state is persisted per project and loaded on crawl resume.
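As an illustration of those rules (a sketch, not Crawlio's actual code), the upgrade decision for a stored entry reduces to roughly:

```swift
import Foundation

/// Illustrative model of one HSTS store entry.
struct HSTSEntry {
    let host: String              // host that sent Strict-Transport-Security
    let receivedAt: Date          // when the header was observed
    let maxAge: TimeInterval      // from the max-age directive, in seconds
    let includeSubDomains: Bool

    /// Entries expire max-age seconds after they were observed.
    var isExpired: Bool { Date().timeIntervalSince(receivedAt) > maxAge }

    /// Should a request to `requestHost` be upgraded to https:// ?
    func upgrades(_ requestHost: String) -> Bool {
        guard !isExpired else { return false }
        if requestHost == host { return true }
        // Subdomains are upgraded only when includeSubDomains was set.
        return includeSubDomains && requestHost.hasSuffix("." + host)
    }
}
```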
## OCR
Optional Vision OCR pipeline for extracting text from downloaded images. Zero overhead when disabled.
| Setting | Type | Default | Description |
|---|---|---|---|
| `ocr.isEnabled` | bool | false | Enable the OCR pipeline |
| `ocr.maxImageSize` | int | 10485760 | Maximum image size for OCR in bytes (default 10 MB) |
| `ocr.languages` | array | [] | Recognition languages (empty = auto-detect) |
| `ocr.recognitionLevel` | string | "accurate" | Vision recognition level: `accurate` or `fast` |
| `ocr.maxConcurrentJobs` | int | 2 | Maximum parallel OCR jobs |
OCR runs on raster images only (PNG, JPEG, TIFF, BMP, WebP); SVG and GIF are skipped. Results appear in `deploy.json`, `crawl-manifest.json`, and WARC metadata records.
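Enabling the pipeline might look like this — assuming the `ocr.*` keys from the table are addressable through `crawlio settings set` like the other dotted paths on this page, which isn't stated explicitly; the language identifiers are illustrative:

```bash
# Assumption: ocr.* keys follow the same dotted-path convention as settings.* and policy.*.
crawlio settings set ocr.isEnabled true
crawlio settings set ocr.recognitionLevel fast
crawlio settings set ocr.languages '["en-US", "de-DE"]'
```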
## WARC
Control WARC web archive output.
| Setting | Type | Default | Description |
|---|---|---|---|
| `compressionEnabled` | bool | true | Per-record gzip compression. File extension: `.warc.gz` when on, `.warc` when off |
| `maxFileSize` | int | 1073741824 | Maximum file size before splitting (default 1 GB, 0 = no splitting) |
| `cdxEnabled` | bool | true | Generate a CDX index file alongside the WARC |
| `dedupEnabled` | bool | true | Deduplicate responses via SHA-1 payload digest. Duplicates stored as revisit records |
See Export Formats for details on WARC output structure.
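For example, producing a single uncompressed WARC might look like this — assuming the WARC keys take a `warc.` prefix on the CLI, which this page doesn't confirm:

```bash
# Assumption: WARC settings are addressed as warc.<key>; verify against your build.
crawlio settings set warc.compressionEnabled false   # write plain .warc instead of .warc.gz
crawlio settings set warc.maxFileSize 0              # 0 disables file splitting
```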
## Updating settings
Settings can only be changed while the engine is idle (not actively crawling).

In the app, open Settings (Cmd+,). Six tabs: General, Crawl, Filters, Advanced, Auth, AI Agents.

Via MCP:

```
update_settings(settings: { maxConcurrent: 20, crawlDelay: 1.0 })
update_settings(policy: { maxDepth: 3, scopeMode: "includeSubdomains" })
```

Via the CLI:

```bash
crawlio settings set settings.maxConcurrent 20
crawlio settings set policy.maxDepth 3
```

Via the HTTP API:

```bash
curl --unix-socket ~/Library/Logs/Crawlio/control.sock \
  -X PATCH http://localhost/settings \
  -H "Content-Type: application/json" \
  -d '{"settings": {"maxConcurrent": 20}, "policy": {"maxDepth": 3}}'
```

`PATCH /settings` returns HTTP 409 if the engine is active. Stop the crawl first.
## Example: Large SPA crawl
Configure Crawlio for a large single-page application:
```bash
# Increase concurrency for fast crawling
crawlio settings set settings.maxConcurrent 20

# Enable JS rendering for SPA content
crawlio settings set policy.enableJSRendering true

# Allow subdomains (CDN assets)
crawlio settings set policy.scopeMode includeSubdomains

# Download cross-domain assets
crawlio settings set policy.downloadCrossDomainAssets true

# Extend the timeout for slow JS-rendered pages
crawlio settings set settings.timeout 120

# Set a depth limit to avoid infinite routes
crawlio settings set policy.maxDepth 10

# Start the crawl
crawlio crawl start https://my-spa.com --watch
```

Or as a single MCP call:
```
update_settings(
  settings: { maxConcurrent: 20, timeout: 120 },
  policy: {
    enableJSRendering: true,
    scopeMode: "includeSubdomains",
    downloadCrossDomainAssets: true,
    maxDepth: 10
  }
)
```

## Next steps
- MCP Tools Reference: How tools use these settings
- Export Formats: Saving results
- CLI Commands: Configure settings from the terminal