
Settings Reference

Download

Control how Crawlio downloads individual files.

Setting Type Default Description
maxConcurrent int 4 Maximum parallel download connections (1-40)
maxConnectionsPerHost int 6 Per-host connection limit (1-20)
crawlDelay double 0.5 Seconds between requests to the same host
timeout int 60 Request timeout in seconds (5-300)
maxBytesPerSecond int 0 Bandwidth throttle via token bucket (0 = unlimited)
maxRetries int 3 Retry count for transient errors (5xx, timeouts)
userAgent string System default Presets: Safari, Chrome, Firefox, Googlebot, or custom string
stripTrackingParams bool true Remove utm_*, fbclid, and other tracking query parameters
downloadErrorPages bool false Save pages that return 4xx/5xx status codes
alwaysDownloadHTML bool false Force re-download of HTML even if already cached
downloadUsingWebViews bool false Use WebKit for JS-rendered page download
limitMinImageSize bool false Enable minimum image size filtering
minimumImageSize int 0 Minimum image dimension in pixels (rejects images smaller than this)
customCookies array [] Cookie entries: { name, value, domain, path }
customHeaders array [] HTTP headers: { name, value }
customDataAttributes array [] Additional data-* attributes to scan for URLs
webpagePatterns array [] URL patterns treated as HTML pages
promptForCredentials bool true Show credential prompts for 401 responses
storeCredentialsInKeychain bool true Save credentials in macOS Keychain
preferHTTP2 bool true Negotiate HTTP/2 via ALPN when server supports it
captureTimeout int 30 WebKit capture timeout in seconds

Scope

Control which pages and resources Crawlio follows and downloads.

Setting Type Default Description
scopeMode string "sameDomain" URL scope: sameDomain, includeSubdomains, or customList
maxDepth int 5 Maximum link hops from seed URL (0-100). 1 = seed page only
externalLinkDepth int 0 Levels to follow on external domains (0 = don't follow)
maxPagesPerCrawl int 0 Stop after this many pages (0 = unlimited)
maxDiscoveredURLs int 100000 Frontier URL cap to prevent unbounded memory growth
includePatterns array [] URL patterns to include (substring or regex)
excludePatterns array [] URL patterns to exclude (substring or regex)
useRegexPatterns bool false Treat include/exclude patterns as regular expressions
includeSupportingFiles bool true Download supporting assets (CSS, JS, fonts) even outside scope
downloadCrossDomainAssets bool true Download assets from external domains referenced by in-scope pages
autoUpgradeHTTP bool true Auto-upgrade http:// to https://, fallback on cert errors
scanSitemaps bool true Discover URLs from sitemap.xml and robots.txt Sitemap directives

Scope modes

Mode Behavior
sameDomain Only URLs on the exact same domain as the seed. www-insensitive: www.example.com = example.com
includeSubdomains Same domain plus all subdomains (e.g., blog.example.com, cdn.example.com)
customList Only URLs matching the user-defined include patterns
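The first two modes reduce to a host comparison against the seed. A sketch of the matching semantics described above (illustrative only, not Crawlio's code):

```python
def in_scope(host: str, seed_host: str, mode: str) -> bool:
    # Normalization: matching is case-insensitive and www-insensitive.
    h = host.lower().removeprefix("www.")
    s = seed_host.lower().removeprefix("www.")
    if mode == "sameDomain":
        return h == s
    if mode == "includeSubdomains":
        return h == s or h.endswith("." + s)
    # customList is matched against the user-defined include patterns instead.
    raise ValueError(f"unsupported mode: {mode}")
```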

Policy

Control crawl limits, file types, and content handling.

Setting Type Default Description
maxCrawlTime double null Maximum crawl duration in seconds (null = unlimited)
maxFileSize int 52428800 Maximum individual file size in bytes (default 50 MB)
minFileSize int null Minimum file size in bytes (null = no minimum)
maxTotalSize int 524288000 Maximum total download size in bytes (default 500 MB)
maxRedirectChainDepth int 20 Maximum redirects per URL before rejection
respectRobotsTxt bool true Honor robots.txt crawl rules
enableJSRendering bool false Re-render SPA shells via WebKit when empty body + framework markers detected
noProgressTimeout double 120 Seconds without progress before auto-completing
downloadEmbeddedVideos bool false Download video files from YouTube/Vimeo embeds via yt-dlp
hostBlacklistThreshold int 10 Consecutive failures before blacklisting a host
allowedFileTypes array [] Allowed file extensions (empty = allow all)
blockedFileTypes array [] Blocked file extensions
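The enableJSRendering trigger ("empty body + framework markers") suggests a heuristic along these lines. The marker list below is invented for illustration; Crawlio's actual detection rules are not documented here:

```python
import re

# Hypothetical marker list: common SPA mount points and framework attributes.
FRAMEWORK_MARKERS = re.compile(
    r'id="(?:root|app|__next)"|data-reactroot|ng-version|__NUXT__', re.I
)

def looks_like_spa_shell(html: str) -> bool:
    """True when the <body> carries almost no visible text but a JS mount point exists."""
    m = re.search(r"<body[^>]*>(.*?)</body>", html, re.S | re.I)
    body = m.group(1) if m else html
    # Strip script elements and tags, keep only visible text.
    visible = re.sub(r"<script.*?</script>|<[^>]+>", "", body, flags=re.S | re.I).strip()
    return len(visible) < 50 and bool(FRAMEWORK_MARKERS.search(html))
```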

Content type toggles

Setting Type Default Description
downloadImages bool true Download image files (JPEG, PNG, GIF, SVG, WebP)
downloadVideo bool true Download video files (MP4, WebM)
downloadAudio bool true Download audio files (MP3, WAV, OGG)
downloadFonts bool true Download font files (WOFF, WOFF2, TTF, OTF)
downloadScripts bool true Download JavaScript files
downloadStyles bool true Download CSS stylesheets
downloadPDFs bool true Download PDF documents

Proxy

Route crawl traffic through an HTTP, HTTPS, or SOCKS5 proxy.

Setting Type Default Description
proxyConfiguration.type string "http" Proxy type: http, https, or socks5
proxyConfiguration.host string (required) Proxy server hostname or IP
proxyConfiguration.port int 8080 Proxy port. Defaults: HTTP 8080, HTTPS 8443, SOCKS5 1080
proxyConfiguration.noProxyHosts array [] Hosts that bypass the proxy. Suffix-matched with dot-boundary semantics

Proxy credentials (username and password) are accepted at runtime but not persisted to disk.

Crawlio also reads environment variables (http_proxy, https_proxy, no_proxy) when no explicit proxy is configured.

no_proxy matching rules:

  • Exact match: example.com matches example.com
  • Suffix with dot boundary: example.com matches sub.example.com but not notexample.com
  • Leading dot: .example.com matches the domain and all subdomains
  • Wildcard: * bypasses all hosts
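The rules above can be sketched as follows (illustrative; Crawlio's exact parser may differ):

```python
def bypasses_proxy(host: str, no_proxy: list[str]) -> bool:
    """no_proxy matching: exact, dot-boundary suffix, leading dot, or '*' wildcard."""
    h = host.lower()
    for entry in no_proxy:
        e = entry.lower()
        if e == "*":
            return True  # wildcard bypasses all hosts
        if e.startswith("."):
            # .example.com matches the domain itself and all subdomains
            e = e.lstrip(".")
            if h == e or h.endswith("." + e):
                return True
        elif h == e or h.endswith("." + e):
            # exact match, or suffix match with dot boundary
            return True
    return False
```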

Example:

crawlio settings set settings.proxyConfiguration '{"type":"http","host":"proxy.corp.com","port":8080,"noProxyHosts":["localhost",".internal.com"]}'

Or via MCP:

update_settings(settings: {
  proxyConfiguration: {
    type: "http",
    host: "proxy.corp.com",
    port: 8080,
    noProxyHosts: ["localhost", ".internal.com"]
  }
})

Security

Certificate pinning, HSTS enforcement, and trust evaluation.

Certificate pinning

Pin specific public keys per host to prevent MITM attacks. Keys are SHA-256 hashes of the DER-encoded SubjectPublicKeyInfo, Base64-encoded.

Setting Type Default Description
pinnedPublicKeys object {} Map of hostname to array of Base64-encoded SHA-256 public key pins

Example:

update_settings(policy: {
  pinnedPublicKeys: {
    "api.example.com": ["sha256//AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="],
    "cdn.example.com": ["sha256//BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB="]
  }
})

Crawlio validates the server certificate's public key against the pinned values during the TLS handshake. If no pin matches, the connection fails. Host matching is case-insensitive.
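The pin value itself is straightforward to compute once you have the DER-encoded SubjectPublicKeyInfo. A minimal sketch (the function name is illustrative; the examples above additionally carry a "sha256//" prefix in front of the Base64 value):

```python
import base64
import hashlib

def spki_pin(spki_der: bytes) -> str:
    """Base64(SHA-256(DER SubjectPublicKeyInfo)) -- the pin format described above."""
    return base64.b64encode(hashlib.sha256(spki_der).digest()).decode("ascii")
```

One common way to extract the DER SPKI from a PEM certificate is: openssl x509 -in cert.pem -pubkey -noout | openssl pkey -pubin -outform der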

HSTS enforcement

Crawlio maintains an HSTS store that tracks Strict-Transport-Security headers observed during crawling. HSTS entries include max-age and includeSubDomains directives per RFC 6797.

When HSTS is active for a host:

  • All http:// requests to that host are upgraded to https:// before sending
  • Subdomain upgrades apply when includeSubDomains was set
  • Entries expire after max-age seconds
  • IP addresses are excluded from HSTS (per the RFC)

HSTS state is persisted per project and loaded on crawl resume.


OCR

Optional Vision OCR pipeline for extracting text from downloaded images. Zero overhead when disabled.

Setting Type Default Description
ocr.isEnabled bool false Enable OCR pipeline
ocr.maxImageSize int 10485760 Maximum image size for OCR in bytes (default 10 MB)
ocr.languages array [] Recognition languages (empty = auto-detect)
ocr.recognitionLevel string "accurate" Vision recognition level: accurate or fast
ocr.maxConcurrentJobs int 2 Maximum parallel OCR jobs

OCR runs on raster images only (PNG, JPEG, TIFF, BMP, WebP). SVG and GIF are skipped. Results appear in deploy.json, crawl-manifest.json, and WARC metadata records.


WARC

Control WARC web archive output.

Setting Type Default Description
compressionEnabled bool true Per-record gzip compression. File extension: .warc.gz when on, .warc when off
maxFileSize int 1073741824 Maximum file size before splitting (default 1 GB, 0 = no splitting)
cdxEnabled bool true Generate CDX index file alongside the WARC
dedupEnabled bool true Deduplicate responses via SHA-1 payload digest. Duplicates stored as revisit records
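As a sketch of the dedup mechanism, assuming the common WARC digest convention of "sha1:" followed by a Base32-encoded SHA-1 (function names are illustrative):

```python
import base64
import hashlib

def payload_digest(payload: bytes) -> str:
    """WARC-style payload digest: 'sha1:' + Base32(SHA-1(payload))."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")

def dedup(records: list[tuple[str, bytes]]) -> list[tuple[str, str]]:
    """First occurrence of a body is stored as a response; later copies as revisits."""
    seen: set[str] = set()
    out = []
    for url, body in records:
        digest = payload_digest(body)
        out.append((url, "revisit" if digest in seen else "response"))
        seen.add(digest)
    return out
```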

See Export Formats for details on WARC output structure.


Updating settings

Settings can only be changed when the engine is idle (not actively crawling).

In the app, open Settings (Cmd+,); settings are organized into six tabs: General, Crawl, Filters, Advanced, Auth, and AI Agents.

Via MCP:

update_settings(settings: { maxConcurrent: 20, crawlDelay: 1.0 })
update_settings(policy: { maxDepth: 3, scopeMode: "includeSubdomains" })

Via the CLI:

crawlio settings set settings.maxConcurrent 20
crawlio settings set policy.maxDepth 3

Via the HTTP control socket:

curl --unix-socket ~/Library/Logs/Crawlio/control.sock \
  -X PATCH http://localhost/settings \
  -H "Content-Type: application/json" \
  -d '{"settings": {"maxConcurrent": 20}, "policy": {"maxDepth": 3}}'
ℹ️ PATCH /settings returns HTTP 409 if the engine is active. Stop the crawl first.


Example: Large SPA crawl

Configure Crawlio for a large single-page application:

# Increase concurrency for fast crawling
crawlio settings set settings.maxConcurrent 20
 
# Enable JS rendering for SPA content
crawlio settings set policy.enableJSRendering true
 
# Allow subdomains (CDN assets)
crawlio settings set policy.scopeMode includeSubdomains
 
# Download cross-domain assets
crawlio settings set policy.downloadCrossDomainAssets true
 
# Extend timeout for slow JS-rendered pages
crawlio settings set settings.timeout 120
 
# Set a depth limit to avoid infinite routes
crawlio settings set policy.maxDepth 10
 
# Start the crawl
crawlio crawl start https://my-spa.com --watch

Or as a single MCP call:

update_settings(
  settings: { maxConcurrent: 20, timeout: 120 },
  policy: {
    enableJSRendering: true,
    scopeMode: "includeSubdomains",
    downloadCrossDomainAssets: true,
    maxDepth: 10
  }
)


© 2026 Crawlio. All rights reserved.