
Settings Reference

Download

Control how Crawlio downloads individual files.

Setting Type Default Description
maxConcurrent int 4 Maximum parallel download connections (1-40)
maxConnectionsPerHost int 6 Per-host connection limit (1-20)
crawlDelay double 0.5 Seconds between requests to the same host
timeout int 60 Request timeout in seconds (5-300)
maxBytesPerSecond int 0 Bandwidth throttle via token bucket (0 = unlimited)
maxRetries int 3 Retry count for transient errors (5xx, timeouts)
userAgent string System default Presets: Safari, Chrome, Firefox, Googlebot, or custom string
stripTrackingParams bool true Remove utm_*, fbclid, and other tracking query parameters
downloadErrorPages bool false Save pages that return 4xx/5xx status codes
alwaysDownloadHTML bool false Force re-download of HTML even if already cached
downloadUsingWebViews bool false Use WebKit for JS-rendered page download
limitMinImageSize bool false Enable minimum image size filtering
minimumImageSize int 0 Minimum image dimension in pixels (rejects images smaller than this)
customCookies array [] Cookie entries: { name, value, domain, path }
customHeaders array [] HTTP headers: { name, value }
customDataAttributes array [] Additional data-* attributes to scan for URLs
webpagePatterns array [] URL patterns treated as HTML pages
promptForCredentials bool true Show credential prompts for 401 responses
storeCredentialsInKeychain bool true Save credentials in macOS Keychain
preferHTTP2 bool true Negotiate HTTP/2 via ALPN when server supports it
captureTimeout int 30 WebKit capture timeout in seconds

Scope

Control which pages and resources Crawlio follows and downloads.

Setting Type Default Description
scopeMode string "sameDomain" URL scope: sameDomain, includeSubdomains, or customList
maxDepth int 5 Maximum link hops from seed URL (0-100). 1 = seed page only
externalLinkDepth int 0 Levels to follow on external domains (0 = don't follow)
maxPagesPerCrawl int 0 Stop after this many pages (0 = unlimited)
maxDiscoveredURLs int 100000 Frontier URL cap to prevent unbounded memory growth
includePatterns array [] URL patterns to include (substring or regex)
excludePatterns array [] URL patterns to exclude (substring or regex)
useRegexPatterns bool false Treat include/exclude patterns as regular expressions
includeSupportingFiles bool true Download supporting assets (CSS, JS, fonts) even outside scope
downloadCrossDomainAssets bool true Download assets from external domains referenced by in-scope pages
autoUpgradeHTTP bool true Auto-upgrade http:// to https://, fallback on cert errors
scanSitemaps bool true Discover URLs from sitemap.xml and robots.txt Sitemap directives

Scope modes

Mode Behavior
sameDomain Only URLs on the exact same domain as the seed. www-insensitive: www.example.com = example.com
includeSubdomains Same domain plus all subdomains (e.g., blog.example.com, cdn.example.com)
customList Only URLs matching the user-defined include patterns
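The first two modes reduce to a host comparison against the seed. A sketch of the matching semantics described above (illustrative only, not Crawlio's code):

```python
def in_scope(host: str, seed_host: str, mode: str) -> bool:
    # Normalization: matching is case-insensitive and www-insensitive.
    h = host.lower().removeprefix("www.")
    s = seed_host.lower().removeprefix("www.")
    if mode == "sameDomain":
        return h == s
    if mode == "includeSubdomains":
        return h == s or h.endswith("." + s)
    # customList is matched against the user-defined include patterns instead.
    raise ValueError(f"unsupported mode: {mode}")
```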

Policy

Control crawl limits, file types, and content handling.

Setting Type Default Description
maxCrawlTime double null Maximum crawl duration in seconds (null = unlimited)
maxFileSize int 52428800 Maximum individual file size in bytes (default 50 MB)
minFileSize int null Minimum file size in bytes (null = no minimum)
maxTotalSize int 524288000 Maximum total download size in bytes (default 500 MB)
maxRedirectChainDepth int 20 Maximum redirects per URL before rejection
respectRobotsTxt bool true Honor robots.txt crawl rules
enableJSRendering bool false Re-render SPA shells via WebKit when empty body + framework markers detected
noProgressTimeout double 120 Seconds without progress before auto-completing
downloadEmbeddedVideos bool false Download video files from YouTube/Vimeo embeds via yt-dlp
hostBlacklistThreshold int 10 Consecutive failures before blacklisting a host
allowedFileTypes array [] Allowed file extensions (empty = allow all)
blockedFileTypes array [] Blocked file extensions
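The enableJSRendering trigger ("empty body + framework markers") suggests a heuristic along these lines. The marker list below is invented for illustration; Crawlio's actual detection rules are not documented here:

```python
import re

# Hypothetical marker list: common SPA mount points and framework attributes.
FRAMEWORK_MARKERS = re.compile(
    r'id="(?:root|app|__next)"|data-reactroot|ng-version|__NUXT__', re.I
)

def looks_like_spa_shell(html: str) -> bool:
    """True when the <body> carries almost no visible text but a JS mount point exists."""
    m = re.search(r"<body[^>]*>(.*?)</body>", html, re.S | re.I)
    body = m.group(1) if m else html
    # Strip script elements and tags, keep only visible text.
    visible = re.sub(r"<script.*?</script>|<[^>]+>", "", body, flags=re.S | re.I).strip()
    return len(visible) < 50 and bool(FRAMEWORK_MARKERS.search(html))
```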

Content type toggles

Setting Type Default Description
downloadImages bool true Download image files (JPEG, PNG, GIF, SVG, WebP)
downloadVideo bool true Download video files (MP4, WebM)
downloadAudio bool true Download audio files (MP3, WAV, OGG)
downloadFonts bool true Download font files (WOFF, WOFF2, TTF, OTF)
downloadScripts bool true Download JavaScript files
downloadStyles bool true Download CSS stylesheets
downloadPDFs bool true Download PDF documents

Proxy

Route crawl traffic through an HTTP, HTTPS, or SOCKS5 proxy.

Setting Type Default Description
proxyConfiguration.type string "http" Proxy type: http, https, or socks5
proxyConfiguration.host string (required) Proxy server hostname or IP
proxyConfiguration.port int 8080 Proxy port. Defaults: HTTP 8080, HTTPS 8443, SOCKS5 1080
proxyConfiguration.noProxyHosts array [] Hosts that bypass the proxy. Suffix-matched with dot-boundary semantics

Proxy credentials (username and password) are accepted at runtime but not persisted to disk.

Crawlio also reads environment variables (http_proxy, https_proxy, no_proxy) when no explicit proxy is configured.

no_proxy matching rules:

  • Exact match: example.com matches example.com
  • Suffix with dot boundary: example.com matches sub.example.com but not notexample.com
  • Leading dot: .example.com matches the domain and all subdomains
  • Wildcard: * bypasses all hosts
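The rules above can be sketched as follows (illustrative; Crawlio's exact parser may differ):

```python
def bypasses_proxy(host: str, no_proxy: list[str]) -> bool:
    """no_proxy matching: exact, dot-boundary suffix, leading dot, or '*' wildcard."""
    h = host.lower()
    for entry in no_proxy:
        e = entry.lower()
        if e == "*":
            return True  # wildcard bypasses all hosts
        if e.startswith("."):
            # .example.com matches the domain itself and all subdomains
            e = e.lstrip(".")
            if h == e or h.endswith("." + e):
                return True
        elif h == e or h.endswith("." + e):
            # exact match, or suffix match with dot boundary
            return True
    return False
```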

Example:

crawlio settings set settings.proxyConfiguration '{"type":"http","host":"proxy.corp.com","port":8080,"noProxyHosts":["localhost",".internal.com"]}'

Or via MCP:

update_settings(settings: {
  proxyConfiguration: {
    type: "http",
    host: "proxy.corp.com",
    port: 8080,
    noProxyHosts: ["localhost", ".internal.com"]
  }
})

Security

Certificate pinning, HSTS enforcement, and trust evaluation.

Certificate pinning

Pin specific public keys per host to prevent MITM attacks. Keys are SHA-256 hashes of the DER-encoded SubjectPublicKeyInfo, Base64-encoded.

Setting Type Default Description
pinnedPublicKeys object {} Map of hostname to array of Base64-encoded SHA-256 public key pins

Example:

update_settings(policy: {
  pinnedPublicKeys: {
    "api.example.com": ["sha256//AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="],
    "cdn.example.com": ["sha256//BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB="]
  }
})

Crawlio validates the server certificate's public key against the pinned values during the TLS handshake. If no pin matches, the connection fails. Host matching is case-insensitive.
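The pin value itself is straightforward to compute once you have the DER-encoded SubjectPublicKeyInfo. A minimal sketch (the function name is illustrative; the examples above additionally carry a "sha256//" prefix in front of the Base64 value):

```python
import base64
import hashlib

def spki_pin(spki_der: bytes) -> str:
    """Base64(SHA-256(DER SubjectPublicKeyInfo)) -- the pin format described above."""
    return base64.b64encode(hashlib.sha256(spki_der).digest()).decode("ascii")
```

One common way to extract the DER SPKI from a PEM certificate is: openssl x509 -in cert.pem -pubkey -noout | openssl pkey -pubin -outform der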

HSTS enforcement

Crawlio maintains an HSTS store that tracks Strict-Transport-Security headers observed during crawling. HSTS entries include max-age and includeSubDomains directives per RFC 6797.

When HSTS is active for a host:

  • All http:// requests to that host are upgraded to https:// before sending
  • Subdomain upgrades apply when includeSubDomains was set
  • Entries expire after max-age seconds
  • IP addresses are excluded from HSTS (per the RFC)

HSTS state is persisted per project and loaded on crawl resume.


OCR

Optional Vision OCR pipeline for extracting text from downloaded images. Zero overhead when disabled.

Setting Type Default Description
ocr.isEnabled bool false Enable OCR pipeline
ocr.maxImageSize int 10485760 Maximum image size for OCR in bytes (default 10 MB)
ocr.languages array [] Recognition languages (empty = auto-detect)
ocr.recognitionLevel string "accurate" Vision recognition level: accurate or fast
ocr.maxConcurrentJobs int 2 Maximum parallel OCR jobs

OCR runs on raster images only (PNG, JPEG, TIFF, BMP, WebP). SVG and GIF are skipped. Results appear in deploy.json, crawl-manifest.json, and WARC metadata records.


WARC

Control WARC web archive output.

Setting Type Default Description
compressionEnabled bool true Per-record gzip compression. File extension: .warc.gz when on, .warc when off
maxFileSize int 1073741824 Maximum file size before splitting (default 1 GB, 0 = no splitting)
cdxEnabled bool true Generate CDX index file alongside the WARC
dedupEnabled bool true Deduplicate responses via SHA-1 payload digest. Duplicates stored as revisit records
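As a sketch of the dedup mechanism, assuming the common WARC digest convention of "sha1:" followed by a Base32-encoded SHA-1 (function names are illustrative):

```python
import base64
import hashlib

def payload_digest(payload: bytes) -> str:
    """WARC-style payload digest: 'sha1:' + Base32(SHA-1(payload))."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")

def dedup(records: list[tuple[str, bytes]]) -> list[tuple[str, str]]:
    """First occurrence of a body is stored as a response; later copies as revisits."""
    seen: set[str] = set()
    out = []
    for url, body in records:
        digest = payload_digest(body)
        out.append((url, "revisit" if digest in seen else "response"))
        seen.add(digest)
    return out
```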

See Export Formats for details on WARC output structure.


Updating settings

Settings can only be changed when the engine is idle (not actively crawling).

In the app, open Settings (Cmd+,); settings are organized into six tabs: General, Crawl, Filters, Advanced, Auth, and AI Agents.

Via MCP:

update_settings(settings: { maxConcurrent: 20, crawlDelay: 1.0 })
update_settings(policy: { maxDepth: 3, scopeMode: "includeSubdomains" })

Via the CLI:

crawlio settings set settings.maxConcurrent 20
crawlio settings set policy.maxDepth 3

Via the HTTP control socket:

curl --unix-socket ~/Library/Logs/Crawlio/control.sock \
  -X PATCH http://localhost/settings \
  -H "Content-Type: application/json" \
  -d '{"settings": {"maxConcurrent": 20}, "policy": {"maxDepth": 3}}'
ℹ️ PATCH /settings returns HTTP 409 if the engine is active. Stop the crawl first.


Example: Large SPA crawl

Configure Crawlio for a large single-page application:

# Increase concurrency for fast crawling
crawlio settings set settings.maxConcurrent 20
 
# Enable JS rendering for SPA content
crawlio settings set policy.enableJSRendering true
 
# Allow subdomains (CDN assets)
crawlio settings set policy.scopeMode includeSubdomains
 
# Download cross-domain assets
crawlio settings set policy.downloadCrossDomainAssets true
 
# Extend timeout for slow JS-rendered pages
crawlio settings set settings.timeout 120
 
# Set a depth limit to avoid infinite routes
crawlio settings set policy.maxDepth 10
 
# Start the crawl
crawlio crawl start https://my-spa.com --watch

Or as a single MCP call:

update_settings(
  settings: { maxConcurrent: 20, timeout: 120 },
  policy: {
    enableJSRendering: true,
    scopeMode: "includeSubdomains",
    downloadCrossDomainAssets: true,
    maxDepth: 10
  }
)


© 2026 Crawlio. All rights reserved.