# MCP Tools Reference

## Overview
Crawlio exposes ~362 tools across 3 pillars. This page covers:
- Aggregator meta-tools (5 tools) that unify all pillars
- Crawlio App tools (49 tools in Pillar 3) for crawl control, export, intelligence, and vault
- References to the other pillar tool sets
In code mode (the default), the 49 Pillar 3 tools are replaced by 6 code-mode tools: `search_api`, `execute_api`, `trigger_capture`, `analyze_page`, `compare_pages`, and `extract_text_from_image`. Every tool below maps to an endpoint accessible through `execute_api`.
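Whichever mode you use, every tool in this reference is invoked as a standard MCP tool call. A minimal TypeScript sketch using the official MCP SDK, assuming the server speaks stdio; the launch command is a placeholder, not Crawlio's actual binary:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Launch command is a placeholder: substitute however your Crawlio
// MCP server is actually started.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["crawlio-mcp"], // hypothetical package name
});

const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

// Ask the aggregator which tools fit the task (see crawlio_discover below).
const discovered = await client.callTool({
  name: "crawlio_discover",
  arguments: { query: "crawl and export" },
});
console.log(JSON.stringify(discovered, null, 2));
```

The later sketches on this page assume a connected `client` like this one.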
## Section 1: Aggregator meta-tools
These 5 tools are what your AI sees when using the aggregator. They route across all 3 pillars.
### crawlio_discover

List available tools across all pillars. Returns only schemas matching the current task.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `query` | string | Yes | Describe what you need (e.g. "crawl and export", "browser automation") |

Returns: Array of matching tool schemas with names, descriptions, and parameters.
### crawlio_call

Route a tool call to the correct pillar.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `tool` | string | Yes | Tool name |
| `args` | object | No | Tool arguments |

Returns: The tool's response, routed to the appropriate pillar.
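A sketch of routing a Pillar 3 tool through the aggregator, reusing a connected `client` from the setup sketch above; the inner tool name and arguments come from Section 2:

```typescript
import type { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Route a Section 2 tool through crawlio_call instead of calling it
// directly; the aggregator picks the right pillar.
async function crawlStatusViaAggregator(client: Client) {
  return client.callTool({
    name: "crawlio_call",
    arguments: {
      tool: "get_crawl_status", // any tool name from Section 2
      args: {},                 // that tool's own arguments
    },
  });
}
```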
### crawlio_do

Execute a high-level task with automatic pillar selection.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `task` | string | Yes | Natural language task description |

Returns: Task result. The aggregator picks the best pillar based on session state.
### crawlio_cortex

Query intelligence data across pillar boundaries.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `query` | string | Yes | Intelligence query |

Returns: Combined data from enrichment, browser detection, and crawl analysis.
### crawlio_consult

Multi-pillar consultation for complex tasks.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `question` | string | Yes | What you need help with |

Returns: Coordinated response from multiple pillars.
## Section 2: Crawlio App tools (Pillar 3)

49 tools across 11 categories.
### Crawl monitoring (7 tools)

#### get_crawl_status

Returns current engine state and progress counters.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `since` | integer | No | Returns "no changes" when the sequence number has not advanced past this value |

Returns: `{ engineState, seedURL, seq, progress: { totalDiscovered, downloaded, failed, queued, localized }, enrichment: { pagesEnriched, frameworksDetected, ... } }`
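The `since` parameter and the returned `seq` enable cheap change-driven polling. A minimal sketch, assuming the status JSON arrives as the first text content item and that `engineState` reports `idle` when the crawl finishes (neither detail is specified here, so verify against your server):

```typescript
import type { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Change-driven polling: pass the last seen `seq` as `since` so the
// server can answer "no changes" cheaply instead of resending state.
async function pollCrawl(client: Client): Promise<void> {
  let lastSeq = 0;
  for (;;) {
    const res = await client.callTool({
      name: "get_crawl_status",
      arguments: { since: lastSeq },
    });
    const text = (res.content as Array<{ type: string; text?: string }>)
      .find((c) => c.type === "text")?.text ?? "";
    if (!text.includes("no changes")) {
      const status = JSON.parse(text);
      lastSeq = status.seq;
      console.log(status.engineState, status.progress);
      if (status.engineState === "idle") return; // assumed terminal state
    }
    await new Promise((r) => setTimeout(r, 2000)); // back off between polls
  }
}
```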
#### get_crawl_logs

Returns recent log entries with optional filtering.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `category` | string | No | engine, download, parser, localizer, network, ui |
| `level` | string | No | debug, info, default, error, fault |
| `limit` | integer | No | Max entries (default: 50) |

#### get_errors

Returns error-level and fault-level log entries only.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `limit` | integer | No | Max entries (default: 50) |
#### get_downloads

Returns all download items with status, size, and content type.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `status` | string | No | pending, downloading, completed, failed |

#### get_failed_urls

Returns only failed download items with error details.

Parameters: None.

#### get_site_tree

Returns an ASCII directory tree of downloaded files.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `max_depth` | integer | No | Maximum tree depth (default: 5) |
#### get_crawled_urls

Returns downloaded URLs with filtering and pagination.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `status` | string | No | completed, downloading, failed, queued |
| `type` | string | No | Content type substring (e.g. html) |
| `limit` | integer | No | Max results (default: 1000) |
| `offset` | integer | No | Skip first N results |
### Crawl control (5 tools)

#### start_crawl

Start downloading a website.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | One of url/urls | Single URL to crawl |
| `urls` | array | One of url/urls | Multiple seed URLs |
| `destinationPath` | string | No | Local path to save files |
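A sketch of kicking off a single-seed crawl; the URL and destination path are illustrative:

```typescript
import type { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Exactly one of `url` or `urls` must be provided; `destinationPath`
// is optional. Values below are examples only.
async function startExampleCrawl(client: Client) {
  return client.callTool({
    name: "start_crawl",
    arguments: {
      url: "https://example.com",
      destinationPath: "/tmp/example-crawl",
    },
  });
}
```

Pair this with the `get_crawl_status` polling sketch above to watch progress.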
#### stop_crawl

Stop the current download. All in-flight requests are cancelled.

Parameters: None.

#### pause_crawl

Pause the current download. In-flight requests complete; no new requests start.

Parameters: None.

#### resume_crawl

Resume a paused download.

Parameters: None.
#### recrawl_urls

Re-inject URLs into the crawl frontier.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `urls` | array | Yes | URLs to re-crawl |
### Settings (2 tools)

#### get_settings

Returns current download settings and crawl policy.

Parameters: None.

Returns: `{ settings: {...}, policy: {...} }`

#### update_settings

Update download settings and/or crawl policy via merge patch. Only works when the engine is idle.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `settings` | object | No | Download settings fields to merge |
| `policy` | object | No | Crawl policy fields to merge |
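Because `update_settings` is a merge patch, a safe pattern is to read the current values first, then send only the fields to change. The field names below are hypothetical; take the real ones from the `get_settings` response, and remember this only works while the engine is idle:

```typescript
import type { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Merge-patch sketch: fields you omit keep their current values.
async function tightenCrawl(client: Client) {
  const current = await client.callTool({ name: "get_settings", arguments: {} });
  console.log(JSON.stringify(current)); // inspect the real field names here

  return client.callTool({
    name: "update_settings",
    arguments: {
      settings: { maxConcurrency: 2 },    // hypothetical field name
      policy: { respectRobotsTxt: true }, // hypothetical field name
    },
  });
}
```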
### Projects (5 tools)

#### list_projects

List all saved projects.

Parameters: None.

Returns: Array of projects with id, name, seedURL, createdAt.

#### get_project

Get full details for a saved project.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `id` | string | Yes | Project UUID |

#### save_project

Save the current project state.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `name` | string | No | Project name (auto-generated if omitted) |

#### load_project

Load a saved project, restoring settings and state.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `id` | string | Yes | Project UUID |

#### delete_project

Delete a saved project.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `id` | string | Yes | Project UUID |
### Export and extraction (4 tools)

#### export_site

Start an asynchronous export of the downloaded site.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `format` | string | Yes | folder, zip, singleHTML, warc, extracted, deploy |
| `destinationPath` | string | Yes | Where to write the export |
| `warcConfiguration` | object | No | WARC options: `{ compressionEnabled, maxFileSize, cdxEnabled, dedupEnabled }` |

Poll `get_export_status` to track progress.

#### get_export_status

Returns the current export state and progress.

Parameters: None.

Returns: `{ state, format, progress, path, error }`
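A sketch of the full export flow: start the export, then poll until a terminal state. The `"completed"` state name is an assumption; check the `state` values your server actually returns:

```typescript
import type { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Start a ZIP export, then poll get_export_status until done.
// Paths and the terminal state name are assumptions.
async function exportAsZip(client: Client): Promise<void> {
  await client.callTool({
    name: "export_site",
    arguments: { format: "zip", destinationPath: "/tmp/site-export" },
  });

  for (;;) {
    const res = await client.callTool({ name: "get_export_status", arguments: {} });
    const text = (res.content as Array<{ type: string; text?: string }>)
      .find((c) => c.type === "text")?.text ?? "{}";
    const status = JSON.parse(text);
    if (status.error) throw new Error(status.error);
    if (status.state === "completed") return; // assumed terminal state
    await new Promise((r) => setTimeout(r, 2000));
  }
}
```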
#### extract_site

Run the content extraction pipeline on a completed crawl.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `destinationPath` | string | No | Output directory |

Poll `get_extraction_status` to track progress.

#### get_extraction_status

Returns the current extraction state.

Parameters: None.

Returns: `{ state, phase, progress, totalPages, totalAssets }`
### Enrichment (8 tools)

#### trigger_capture

Trigger a WebKit runtime capture for a URL. Runs framework detection JS, intercepts network requests, captures console logs, and takes a DOM snapshot.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | URL to capture |

#### get_enrichment

Returns browser enrichment data for a URL or all URLs.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | No | Specific URL (omit for all) |

Returns: Enrichment objects with framework, networkRequests, consoleLogs, domSnapshotJSON.
#### get_structured_data

Returns JSON-LD, HTML tables, microdata, and RDFa extracted from crawled pages.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | No | Specific URL (omit for site-wide aggregate) |

#### submit_enrichment_bundle

Submit a complete enrichment bundle with all data types.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | The page URL |
| `framework` | object | No | `{ name, version?, confidence? }` |
| `networkRequests` | array | No | Captured network requests |
| `consoleLogs` | array | No | Console output entries |
| `domSnapshotJSON` | string | No | DOM snapshot as JSON |
#### submit_enrichment_framework

Submit framework detection data only.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | The page URL |
| `framework` | object | Yes | `{ name, version?, confidence? }` |

#### submit_enrichment_network

Submit network request data. Discovered URLs are offered to the crawl engine.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | The page URL |
| `networkRequests` | array | Yes | Array of `{ url, method, status, type }` |
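A sketch of submitting externally captured traffic; entries follow the documented `{ url, method, status, type }` shape, and all values are examples:

```typescript
import type { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Feed captured traffic back into Crawlio. Any URLs in the payload
// are offered to the crawl frontier.
async function reportTraffic(client: Client) {
  return client.callTool({
    name: "submit_enrichment_network",
    arguments: {
      url: "https://example.com/app",
      networkRequests: [
        { url: "https://example.com/api/v1/items", method: "GET", status: 200, type: "xhr" },
        { url: "https://cdn.example.com/app.js", method: "GET", status: 200, type: "script" },
      ],
    },
  });
}
```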
#### submit_enrichment_console

Submit console log data.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | The page URL |
| `consoleLogs` | array | Yes | Array of `{ level, message, timestamp }` |

#### submit_enrichment_dom

Submit a DOM snapshot.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | The page URL |
| `domSnapshotJSON` | string | Yes | DOM snapshot as JSON |
### Observations and findings (4 tools)

#### get_observations

Query the append-only observation log.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `host` | string | No | Filter by hostname |
| `op` | string | No | Filter by operation type |
| `source` | string | No | extension, webkit, agent |
| `since` | number | No | Unix timestamp |
| `limit` | integer | No | Max entries |

#### get_observation

Look up a single observation or finding by ID.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `id` | string | Yes | Observation ID |
#### create_finding

Create a curated finding with evidence.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `title` | string | Yes | Finding title |
| `url` | string | No | Related URL |
| `evidence` | array | No | Observation IDs |
| `synthesis` | string | No | Summary analysis |
| `confidence` | string | No | Confidence level |
| `category` | string | No | Finding category |

#### get_findings

Returns curated findings.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `host` | string | No | Filter by hostname |
| `limit` | integer | No | Max entries |
### Composite analysis (3 tools)

#### analyze_page

Composite: trigger capture + poll enrichment with backoff + return unified evidence record.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | URL to analyze |

Returns: `{ url, timestamp, captureTriggered, enrichment, enrichmentStatus, crawlStatus }`

Timeout: 60s
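A usage sketch. An explicit per-request timeout matters here because the tool's own 60s budget can exceed typical default client timeouts; the option shown is the MCP TypeScript SDK's request option, and the URL is illustrative:

```typescript
import type { Client } from "@modelcontextprotocol/sdk/client/index.js";

// One call instead of trigger_capture plus repeated get_enrichment polls.
async function analyze(client: Client) {
  return client.callTool(
    { name: "analyze_page", arguments: { url: "https://example.com" } },
    undefined,           // keep the default result schema
    { timeout: 65_000 }, // headroom over the tool's 60s budget
  );
}
```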
#### compare_pages

Composite: run analyze_page on two URLs, return structured comparison.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `urlA` | string | Yes | First URL |
| `urlB` | string | Yes | Second URL |

Returns: `{ siteA, siteB, comparisonSummary }`

Timeout: 120s
#### synthesize_openapi

Composite: chain traffic analysis + schema extraction + OpenAPI 3.0.3 YAML export.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `exhaustDir` | string | No | Path to flows.jsonl directory |
| `title` | string | No | API title |
| `serverURL` | string | No | Base server URL |
### Intelligence (5 tools)

#### get_tech_stack

Returns detected technologies with name, categories, confidence, version, and detection signals.

Parameters: None.

#### get_seo_findings

Returns SEO analysis: title, meta description, headings, canonical, word count, readability.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `severity` | string | No | Filter by severity |
| `category` | string | No | Filter by category |

#### get_design_intel

Returns design system data: colors, typography, spacing, breakpoints, components.

Parameters: None.

#### get_keyword_intel

Returns keyword analysis: top keywords by frequency, co-occurring groups, density.

Parameters: None.

#### get_duplicate_content

Returns duplicate content detection: exact duplicates and near-duplicates with similarity scores.

Parameters: None.
### Vault (5 tools)

#### vault_list_domains

List all domains with stored auth sessions.

Parameters: None.

#### vault_get_session

Retrieve a stored auth session for a domain. Audit-logged.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `domain` | string | Yes | Domain (e.g. example.com) |

Returns: `{ cookies, userAgent, isExpired }`
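A sketch of a check-and-refresh flow, assuming the session JSON arrives as the first text content item; the login URL is illustrative:

```typescript
import type { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Fetch stored credentials for a domain, refusing to use an expired
// session. Every vault_get_session call is audit-logged.
async function sessionFor(client: Client, domain: string) {
  const res = await client.callTool({
    name: "vault_get_session",
    arguments: { domain },
  });
  const text = (res.content as Array<{ type: string; text?: string }>)
    .find((c) => c.type === "text")?.text ?? "{}";
  const session = JSON.parse(text);
  if (session.isExpired) {
    // Re-authenticate interactively; the loginURL here is illustrative.
    await client.callTool({
      name: "vault_request_login",
      arguments: { domain, loginURL: `https://${domain}/login` },
    });
    return null;
  }
  return session; // { cookies, userAgent, isExpired }
}
```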
#### vault_mark_expired

Mark a stored session as expired without deleting it.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `domain` | string | Yes | Domain |

#### vault_delete

Delete a stored auth session permanently.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `domain` | string | Yes | Domain |

#### vault_request_login

Open the auth browser in Crawlio so you can log in. The session is captured and stored in the vault.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `domain` | string | Yes | Domain to authenticate against |
| `loginURL` | string | Yes | Login page URL |
### OCR (1 tool)

#### extract_text_from_image

Run Vision OCR on a local image file. Does not require Crawlio.app to be running.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `path` | string | Yes | Absolute path to the image |
| `languages` | array | No | Language codes (e.g. ["en-US"]) |
| `recognitionLevel` | string | No | accurate (default) or fast |

Supported formats: PNG, JPEG, TIFF, BMP, WebP. SVG and GIF are not supported.
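A usage sketch; the image path is illustrative and must be absolute:

```typescript
import type { Client } from "@modelcontextprotocol/sdk/client/index.js";

// OCR a local screenshot. The path must point at a supported raster
// format (PNG/JPEG/TIFF/BMP/WebP); SVG and GIF will be rejected.
async function ocr(client: Client) {
  return client.callTool({
    name: "extract_text_from_image",
    arguments: {
      path: "/tmp/screenshot.png",
      languages: ["en-US"],
      recognitionLevel: "accurate", // or "fast" for lower latency
    },
  });
}
```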
## Timeout reference

| Category | Timeout |
|---|---|
| Read-only (`get_*`, `list_*`) | 5s |
| Control (start, stop, settings) | 15s |
| Enrichment (`submit_*`) | 10s |
| Capture (trigger_capture) | 60s |
| Export (export_site, extract_site) | 120s |
| Composite (analyze_page) | 60s |
| Composite (compare_pages, synthesize_openapi) | 120s |
| OCR (extract_text_from_image) | 15s |
## Tool annotations

All tools carry MCP annotations:

| Annotation | Meaning | Applies to |
|---|---|---|
| `readOnlyHint: true` | Does not modify state | All `get_*` and `list_*` tools |
| `destructiveHint: true` | Irreversible action | stop_crawl, delete_project |
| `idempotentHint: true` | Safe to repeat | pause_crawl, resume_crawl, update_settings |
| `openWorldHint: true` | Interacts with external systems | start_crawl, recrawl_urls, trigger_capture, vault_request_login |
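Clients can read these annotations from the tool list, for example to let an agent call read-only tools freely while gating destructive ones behind approval. A sketch using the MCP TypeScript SDK's `listTools()`:

```typescript
import type { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Collect the names of tools annotated as read-only, i.e. safe to
// call without side effects or user confirmation.
async function safeTools(client: Client): Promise<string[]> {
  const { tools } = await client.listTools();
  return tools
    .filter((t) => t.annotations?.readOnlyHint === true)
    .map((t) => t.name);
}
```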
## Section 3: Other pillars

### Chrome Extension (Pillar 1)
~114 tools for live browser automation via CDP. Requires the Crawlio for Chrome extension.
Covers: tab management, navigation, screenshots, DOM interaction, network interception, console capture, accessibility tree, performance metrics, security state, and framework detection.
See Browser Agent Tools for the full reference.
### Headless Agent (Pillar 2)
~199 tools across 5 tiers:
| Tier | What it covers |
|---|---|
| Browser | Headless Chromium automation |
| Converter | Format conversion (PDF, screenshots) |
| mgrep/RE | Pattern matching |
| Interceptor | Network interception |
| Core | File I/O, orchestration |
The headless agent runs without a visible browser. It handles background tasks and is the fallback when no Chrome tab is connected.
## Next steps

- MCP Overview: the 3-pillar architecture
- Code Mode: use `search_api` + `execute_api` instead of 49 individual tools
- Resources: MCP resources, prompts, and skills
- JIT Context: how the aggregator loads context on demand