Crawlio Docs

Skill Reference

All skills live in the crawlio-plugin repo under skills/. Each skill is a single SKILL.md file that any MCP-compatible AI agent can read and follow. There are 7 skills in the current plugin release plus one site-auditor agent.

| Skill | Purpose |
| --- | --- |
| crawlio-mcp | Complete MCP tool reference — 37 tools, 6 code-mode, 4 resources, 4 prompts |
| crawl-site | Crawl a website with site-type-aware configuration |
| extract-and-export | End-to-end crawl → extract → export in 7 formats |
| observe | Query the append-only observation log |
| finding | Create and query evidence-backed findings |
| audit-site | Full multi-phase site audit with findings report |
| web-research | Acquire > Normalize > Analyze research protocol |

crawlio-mcp

Complete reference for the Crawlio MCP server itself — every tool, mode, resource, and prompt. Load this skill first when an agent is orchestrating anything non-trivial through Crawlio.

Trigger phrases: MCP agents load this as their baseline reference; no user invocation needed.

Modes

Crawlio MCP runs in one of two modes depending on how you launch the server.

| Mode | Tools | When to use |
| --- | --- | --- |
| Code mode (default) | 6 | Lower tool count; better for context-constrained clients. Uses search_api + execute_api to drive the full HTTP surface. |
| Full mode (--full) | 37 | Typed parameters and annotations for every operation. Better for clients that can handle many tools. |
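
The code-mode loop is discover-then-execute. A minimal sketch: the endpoint path and the argument names for execute_api are illustrative assumptions, and the search_api results describe the real request shapes.

// 1. Discover endpoints that match the task (keyword search)
search_api({ query: "failed urls" })

// 2. Execute the matching request. Method, path, and body come from the search
//    result; the argument names and path shown here are illustrative.
execute_api({ method: "GET", path: "/crawl/failed-urls" })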

Full-mode tools (37)

Status & Monitoring (6): get_crawl_status, get_crawl_logs, get_errors, get_downloads, get_failed_urls, get_site_tree

Control (4): start_crawl, stop_crawl, pause_crawl, resume_crawl

Settings (3): get_settings, update_settings, recrawl_urls

Projects (5): list_projects, save_project, load_project, delete_project, get_project

Export & Extraction (5): export_site, get_export_status, extract_site, get_extraction_status, trigger_capture

OCR (1): extract_text_from_image (runs Vision OCR locally; no Crawlio.app required)

Enrichment (6): get_enrichment, submit_enrichment_bundle, submit_enrichment_framework, submit_enrichment_network, submit_enrichment_console, submit_enrichment_dom

Observations & Findings (5): get_observations, get_observation, create_finding, get_findings, get_crawled_urls

Code-mode tools (6)

| Tool | Purpose |
| --- | --- |
| search_api | Discover endpoints by keyword |
| execute_api | Execute any HTTP request against ControlServer |
| trigger_capture | WebKit runtime capture (framework + network + console + DOM) |
| extract_text_from_image | Vision OCR on a local image path |
| analyze_page | Composite single-page analysis → evidenceId, evidenceQuality, gaps |
| compare_pages | Side-by-side analysis → comparisonReadiness, symmetric, degradationNotes, timingDelta |

HTTP-only endpoints (3)

Accessible via execute_api but not as MCP tools:

  • GET /health — server health, version, uptime, PID
  • GET /debug/metrics — engine metrics (connections, queue depth, memory)
  • POST /debug/dump-state — full engine state dump
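
From code mode these can be reached the same way; a sketch, assuming execute_api takes a method and a path:

execute_api({ method: "GET", path: "/health" })             // server health, version, uptime, PID
execute_api({ method: "GET", path: "/debug/metrics" })      // engine metrics
execute_api({ method: "POST", path: "/debug/dump-state" })  // full engine state dump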

Resources (4) + template (1)

| URI | Description |
| --- | --- |
| crawlio://status | Engine state and progress |
| crawlio://settings | Current crawl settings |
| crawlio://site-tree | Downloaded file tree |
| crawlio://enrichment | All browser enrichment data |
| crawlio://enrichment/{url} | Per-URL enrichment (template) |

Prompts (4)

| Prompt | Arguments |
| --- | --- |
| crawl-and-analyze | url (req), maxDepth (opt) |
| export-site | url (req), format (req), destination (opt) |
| compare-sites | url1 (req), url2 (req) |
| fix-failed-urls | none |

Advanced options exposed through update_settings + export_site

These are real options on update_settings({settings: ...}) / update_settings({policy: ...}) / export_site:

  • WARC configuration — export_site({format: "warc", warcConfiguration: {compressionEnabled, maxFileSize, cdxEnabled, dedupEnabled}}). maxFileSize: 0 disables splitting.
  • Proxy configuration — update_settings({settings: {proxyConfiguration: {type: "http"|"https"|"socks5", host, port, username?, password?, noProxyHosts?}}}).
  • TLS pinning — update_settings({policy: {pinnedPublicKeys: {"example.com": ["sha256hex..."]}}}).
  • HTTP/2 preference — update_settings({settings: {preferHTTP2: true}}).
  • Auto-upgrade HTTP — update_settings({policy: {autoUpgradeHTTP: true}}).
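
A sketch combining two of these: crawl through a local SOCKS5 proxy, then export a deduplicated, CDX-indexed WARC. The host, port, and maxFileSize value (and its unit) are illustrative.

update_settings({
  settings: {
    proxyConfiguration: { type: "socks5", host: "127.0.0.1", port: 9050 }
  }
})

export_site({
  format: "warc",
  warcConfiguration: {
    compressionEnabled: true,
    cdxEnabled: true,
    dedupEnabled: true,
    maxFileSize: 1073741824   // assumed bytes; 0 disables splitting
  }
})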

crawl-site

Crawl a website with intelligent configuration. Detects site type, optimizes settings, monitors progress, retries failures, and reports results.

Trigger phrases: "crawl a site", "download a website", "mirror a site", "scrape a site"

Invoke: /crawlio:crawl-site https://example.com

Workflow

  1. Determine site type (static, SPA, CMS, docs, or single-page snapshot)
  2. Configure settings via update_settings (concurrency, delay, depth, scope, exclusions)
  3. Start crawl via start_crawl
  4. Monitor progress via get_crawl_status, polling with the since sequence number
  5. Check issues via get_failed_urls, get_errors, get_site_tree
  6. Retry failures via recrawl_urls
  7. Report results (page count, errors, duration)

Site-type presets

| Site Type | Depth | Concurrency | Notes |
| --- | --- | --- | --- |
| Static | 5 | 8 | Standard HTML/CSS |
| SPA (React, Vue) | 3 | 4 | includeSupportingFiles: true; consider crawlio-agent for framework detection |
| CMS (WordPress) | 5 | 4 | Exclude /wp-admin/*, /wp-json/* |
| Documentation | 10 | 6 | Exclude old version paths |
| Single page | 0 | 1 | includeSupportingFiles: true |
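
A minimal sketch of the workflow for an SPA, using the preset above. The maxDepth / maxConcurrent names come from the audit presets later on this page; their placement under settings, the start_crawl / recrawl_urls argument names, and the polling shape are assumptions.

// 1–3. Configure for an SPA, then start the crawl
update_settings({ settings: { maxDepth: 3, maxConcurrent: 4, includeSupportingFiles: true } })
start_crawl({ url: "https://app.example.com" })

// 4. Poll with the last seen sequence number
status = get_crawl_status({ since: lastSeq })   // repeat until the crawl reports completion

// 5–6. Inspect and retry failures
failed = get_failed_urls()
recrawl_urls({ urls: failed.urls })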

MCP tools used

update_settings, start_crawl, get_crawl_status, get_crawl_logs, get_errors, get_failed_urls, recrawl_urls, get_site_tree, stop_crawl


extract-and-export

End-to-end pipeline: crawl a site, extract structured content (clean HTML, markdown, metadata, asset manifests), and export in any of 7 formats.

Trigger phrases: "download and export a site", "crawl and extract content", "archive a website", "export as WARC/ZIP/PDF"

Invoke: /crawlio:extract-and-export https://docs.stripe.com 5 warc

Arguments

| Argument | Required | Default | Description |
| --- | --- | --- | --- |
| URL | Yes | n/a | The URL to crawl |
| maxDepth | No | 3 | Maximum crawl depth |
| format | No | folder | Export format |

Export formats

| Format | Description |
| --- | --- |
| folder | Mirror on disk with original directory structure |
| zip | Compressed archive, ready to share |
| singleHTML | All assets inlined into a single HTML file |
| warc | ISO 28500 web archive standard (supports CDX, dedup, compression, splitting) |
| pdf | Rendered pages as a portable document |
| extracted | Structured JSON only — clean text, metadata, asset manifests |
| deploy | Production-ready bundle with crawl-manifest.json |

Workflow

  1. Configure. update_settings — depth, scope, policy
  2. Crawl. start_crawl + poll get_crawl_status (use since sequence number)
  3. Check. get_failed_urls, get_errors; retry with recrawl_urls
  4. Review. get_site_tree, get_downloads
  5. Extract. extract_site, then poll get_extraction_status
  6. Export. export_site({format}), then poll get_export_status
  7. Report. Pages downloaded, export format, file size, any failures
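
The extract and export steps are both asynchronous jobs: start them, then poll their status tools. A sketch of steps 5–6, where any argument shapes beyond format are assumptions:

// 5. Extract structured content, then wait for completion
extract_site({})
get_extraction_status({})        // poll until the extraction reports done

// 6. Export, then wait for the export job
export_site({ format: "zip" })
get_export_status({})            // poll until done; report file size and location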

MCP tools used

update_settings, start_crawl, get_crawl_status, get_failed_urls, get_errors, recrawl_urls, get_site_tree, get_downloads, extract_site, get_extraction_status, export_site, get_export_status, save_project


observe

Query the append-only observation log — the timeline of everything Crawlio observed during a crawl session. Also supports single-observation lookup for evidence-chain verification.

Trigger phrases: "check observations", "what did Crawlio see", "show crawl timeline", "query the observation log"

Invoke: /crawlio:observe example.com

get_observations filters

| Parameter | Description |
| --- | --- |
| host | Filter by hostname |
| source | Filter by source (see table below) |
| op | Filter by operation type (see table below) |
| since | Unix epoch seconds — only observations after this time |
| limit | Maximum number of results (default: 20) |

Observation sources

| Source | What it captures |
| --- | --- |
| extension | Chrome extension enrichment — framework detection, network requests, console logs, DOM snapshots |
| engine | Crawl lifecycle events (crawl_start, crawl_done) |
| webkit | WebKit runtime capture (triggered via trigger_capture or analyze_page) |
| agent | AI-created findings |

Operations

| Op | Meaning |
| --- | --- |
| observe | Raw data capture |
| finding | Agent-created insight |
| crawl_start | Crawl began |
| crawl_done | Crawl completed (includes progress payload: totalDiscovered, downloaded, failed) |
| page | Single-page observation |

Single-observation lookup

Use get_observation({ id }) to verify evidence referenced by a finding, or to inspect the full payload of an evidenceId returned by analyze_page / compare_pages. Works with both obs_xxx and fnd_xxx IDs.

get_observation({ id: "obs_a1b2c3d4" })

Examples

# Recent observations
get_observations({ limit: 20 })
 
# Filter by host
get_observations({ host: "example.com", limit: 50 })
 
# Extension captures only
get_observations({ host: "example.com", source: "extension" })
 
# Since a specific epoch
get_observations({ since: 1708444200, limit: 100 })
 
# Combined
get_observations({ host: "example.com", source: "extension", op: "observe", limit: 50 })

Observation payload shape

Each entry contains id, op, ts (ISO 8601), url, source, and a composite data payload (framework detection, network requests, console logs, progress, etc.).
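
An illustrative entry; only the field names come from this page, and the values are invented:

{
  id: "obs_a1b2c3d4",
  op: "observe",
  ts: "2025-02-20T14:30:00Z",
  url: "https://example.com/",
  source: "extension",
  data: { /* framework detection, network requests, console logs, progress, ... */ }
}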

MCP tools used

get_observations, get_observation


finding

Create and query evidence-backed findings. Findings are the agent's judgment layer on top of raw observations — curated insights that persist across sessions, each backed by a chain of observation IDs.

Trigger phrases: "create a finding", "record an insight", "what findings exist", "show findings"

Invoke: /crawlio:finding

create_finding parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| title | string | Yes | Short, specific title |
| url | string | No | URL this finding relates to (leave empty for site-wide) |
| evidence | string[] | No | Array of observation IDs (obs_xxx) |
| synthesis | string | No | Detailed explanation of pattern and impact |
| confidence | string | No | high, medium, low, or none |
| category | string | No | Dimension: performance, security, seo, framework, errors, structure, accessibility, etc. |

Note: the API uses synthesis and confidence, not description and severity.

Finding categories

| Category | Example title |
| --- | --- |
| Performance | "Render-blocking scripts delay FCP by 2.3s" |
| Security | "Mixed content: HTTP resources on HTTPS page" |
| SEO | "Missing meta descriptions on 12 pages" |
| Framework | "Next.js App Router with ISR detected" |
| Errors | "3 JavaScript errors on product pages" |
| Structure | "Orphaned pages not linked from navigation" |
| Accessibility | "Missing alt attributes on hero images" |

Evidence chain

The canonical flow:

  1. analyze_page({ url }) → returns evidenceId
  2. create_finding({ evidence: [evidenceId], ... }) → stores the finding with the chain
  3. get_observation({ id: evidenceId }) → verifies the evidence record exists and supports the claim
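
A compact sketch of that chain, where the title and synthesis values are placeholders:

result = analyze_page({ url: "https://example.com" })

create_finding({
  title: "…",                       // specific, evidence-backed title
  url: "https://example.com",
  evidence: [result.evidenceId],
  synthesis: "…",
  confidence: "medium",
  category: "performance"
})

get_observation({ id: result.evidenceId })   // round-trip verification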

Example

create_finding({
  title: "Mixed content: HTTP images on HTTPS page",
  url: "https://example.com",
  evidence: ["obs_a3f7b2c1", "obs_b4e8c3d2"],
  synthesis: "Homepage loads 3 images over HTTP despite serving over HTTPS. Network observations show requests to http://cdn.example.com/img/. Triggers mixed-content warnings in Chrome and may be blocked in strict mode.",
  confidence: "high",
  category: "security"
})

Finding quality checklist

  • Title — specific ("3 images use HTTP on HTTPS pages"), not vague ("mixed content found")
  • Evidence — observation IDs that actually support the claim
  • Synthesis — explains why this matters and what the impact is, not just what was observed
  • Confidence — signal how strongly the evidence supports the claim

Querying findings

get_findings({})                          # all
get_findings({ host: "example.com" })     # per-host
get_findings({ limit: 10 })               # most recent

MCP tools used

get_observations, get_observation, create_finding, get_findings, analyze_page


audit-site

Full site audit: crawl, capture enrichment, analyze observations, and produce a findings report with prioritized recommendations. This is the highest-level skill — it orchestrates crawl-site, observe, and finding into a structured multi-phase analysis.

Trigger phrases: "audit a site", "analyze a website", "review a site", "site health check"

Invoke: /crawlio:audit-site https://example.com

Audit phases

Phase 1: Configure for the target — call update_settings with depth, scope, concurrency, policy. Size-based presets:

| Site Size | Pages | maxDepth | maxConcurrent |
| --- | --- | --- | --- |
| Small | < 100 | 10 | 8 |
| Medium | 100–1,000 | 5 | 4 |
| Large | > 1,000 | 3 | 2 |

Phase 2: Crawl — start_crawl and poll get_crawl_status with the since sequence number. Handle rate-limiting (429s trigger automatic backoff).

Phase 3: Capture enrichment — if the Chrome extension is running, enrichment data is captured automatically during the crawl and appended to the observation log.

Phase 4: Analyze observations — query for patterns:

get_observations({ host: "example.com", limit: 200 })
get_observations({ host: "example.com", source: "extension", limit: 50 })
get_observations({ op: "crawl_done" })
get_errors()
get_failed_urls()

Phase 5: Create findings — for each issue or insight, call create_finding with title, url, evidence, synthesis, confidence, category.

Phase 6: Generate report — compile via get_findings({ host }). Report format:

  • Technology stack — framework, rendering mode, CDN, third-party services
  • Findings — grouped by category, sorted by severity (via confidence)
  • Site structure — tree overview, orphaned pages, broken links
  • Recommendations — prioritized action items

Audit checklist

  • Technology — framework, rendering mode, CDN, third-party services
  • Performance — page count, large files, external dependencies
  • Security — HTTPS enforcement, mixed content, security headers
  • Content — failed URLs, redirect chains, missing resources
  • Structure — site tree, depth distribution, cross-domain assets

MCP tools used

update_settings, start_crawl, get_crawl_status, get_crawl_logs, get_errors, get_failed_urls, recrawl_urls, get_site_tree, get_enrichment, get_observations, get_observation, create_finding, get_findings, save_project, export_site


web-research

Structured web-research protocol built on analyze_page and compare_pages. Teaches agents to follow the Acquire > Normalize > Analyze pattern so evidence records stay canonical and comparisons are reliable.

Trigger phrases: "research a site", "compare sites", "analyze technology", "structured web research"

Invoke: Not a user-facing slash command; agents load this as a protocol reference.

Core protocol

1. Acquire. Use composite tools — never the low-level trigger_capture + sleep + get_enrichment pattern.

| Goal | Tool | Notes |
| --- | --- | --- |
| Single-page evidence | analyze_page | One call = capture + enrichment + crawl status. Returns evidenceId, evidenceQuality, gaps |
| Two-site comparison | compare_pages | Sequential analysis with typed comparison evidence |
| Single evidence lookup | get_observation | Verify a specific evidence record by ID |
| Bulk crawl data | get_crawled_urls | After a completed crawl |
| Historical timeline | get_observations | Append-only audit trail |

2. Normalize. Extract fields from the canonical record before drawing conclusions:

  • Framework — name, version, rendering mode (SSR/SSG/CSR/ISR)
  • Network — request count, external domains, resource types
  • Console — error count, warning patterns
  • Crawl — status, content type, byte count

Check enrichmentStatus before using enrichment data — "ok" means present and usable; "timeout" means capture completed but enrichment didn't arrive in time (note the gap).

Check evidenceQuality for overall health — "complete" (no gaps), "partial" (has gaps but capture succeeded), "degraded" (capture-level failure or enrichment server error).
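
A sketch of the guard. The placement of these fields on the analyze_page result is an assumption; the enrichment access paths mirror the example workflow at the end of this page.

result = analyze_page({ url: "https://example.com" })

// Only read enrichment-derived fields when enrichment actually arrived
if (result.enrichmentStatus === "ok") {
  framework = result.enrichment.framework
}

// Let evidence quality cap how strong a claim the evidence can support
if (result.evidenceQuality !== "complete") {
  // "partial" or "degraded": note the gaps and lower finding confidence
}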

3. Analyze. Compare normalized evidence against a rubric. Record insights via create_finding.

Anti-pattern

Never do this:

trigger_capture({ url })
// sleep 5s
get_enrichment({ url })

Do this instead:

analyze_page({ url })

Don't improvise evidence shapes — the record from analyze_page is the canonical format. Don't restructure it ad hoc.

Comparison protocol

For side-by-side analysis:

compare_pages({ urlA: "https://site-a.com", urlB: "https://site-b.com" })

The response includes a comparisonSummary with typed fields:

  • comparisonReadiness — ready (both complete), cautious (one partial), unreliable (either degraded)
  • symmetric — whether both sides have identical gap profiles
  • degradationNotes — human-readable gaps per side
  • timingDelta — absolute timing differences (capture, enrichment polling)
  • enrichmentAgeDeltaMs — timestamp difference between the two analyses
  • evidenceIdA / evidenceIdB — observation IDs for round-trip verification via get_observation
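
A sketch of gating on the summary before drawing conclusions; the exact response nesting may differ:

cmp = compare_pages({ urlA: "https://site-a.com", urlB: "https://site-b.com" })

// Only make direct claims when both sides produced complete evidence
if (cmp.comparisonSummary.comparisonReadiness === "ready") {
  // compare normalized fields directly
} else {
  // "cautious" or "unreliable": cite degradationNotes and lower confidence
}

// Round-trip verification of either side's evidence
get_observation({ id: cmp.comparisonSummary.evidenceIdA })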

Comparison dimensions (10)

When comparing two sites, evaluate across these dimensions and note gaps explicitly:

  1. Framework — name, version, rendering strategy
  2. Performance — resource count, total bytes, external dependencies
  3. Security — HTTPS enforcement, mixed content, CSP presence
  4. SEO — meta tags, structured data, canonical URLs
  5. Accessibility — semantic HTML, ARIA usage, alt text
  6. Error surface — console errors, failed resources, 4xx/5xx responses
  7. Third-party load — analytics, tracking, CDN usage
  8. Architecture — SPA vs MPA, API patterns, hydration strategy
  9. Content delivery — CDN, caching headers, compression
  10. Mobile readiness — viewport meta, responsive signals

Example workflow

// 1. Acquire
result = analyze_page({ url: "https://example.com" })
 
// 2. Normalize
framework = result.enrichment.framework       // { name: "Next.js", version: "14.2.0" }
networkCount = result.enrichment.networkRequests.length
consoleErrors = result.enrichment.consoleLogs.filter(e => e.level === "error")
 
// 3. Analyze & record
create_finding({
  title: "Next.js 14 with high external dependency count",
  url: "https://example.com",
  evidence: [result.evidenceId],
  synthesis: "Detected Next.js 14.2.0 with 47 network requests including 12 third-party domains...",
  confidence: "high",
  category: "performance"
})

MCP tools used

analyze_page, compare_pages, get_observation, get_observations, get_crawled_urls, create_finding


Agent: site-auditor

The plugin includes a custom agent definition at agents/site-auditor.md. It wraps the audit-site skill into an autonomous agent:

  • Model — Sonnet
  • Tools — Bash, Read, Glob, Grep, WebFetch, WebSearch + all Crawlio MCP tools
  • Protocol — Reconnaissance → Crawl → Multi-pass Analysis (Structure / Errors / Enrichment / Synthesis) → Report

The agent follows the same phases as audit-site but can also use file-system tools and web search to cross-reference findings with external documentation and best practices.

Finding standards enforced by the agent

  • Specific title — "3 images use HTTP on HTTPS pages", not "mixed content found"
  • Evidence — at least one observation ID referenced
  • Impact — synthesis explains why this matters
  • Actionable — recommendations included in the report

When to use the agent vs the skill

| Use case | Choose |
| --- | --- |
| Interactive audit with human guidance | audit-site skill |
| Fully autonomous audit | site-auditor agent |
| Quick crawl without analysis | crawl-site skill |
| Pipeline integration (crawl + export) | extract-and-export skill |
| Single-page or two-site research | web-research skill |
| MCP tool lookup | crawlio-mcp skill |