# Skill Reference

All skills live in the crawlio-plugin repo under `skills/`. Each skill is a single `SKILL.md` file that any MCP-compatible AI agent can read and follow. There are 7 skills in the current plugin release, plus one site-auditor agent.
| Skill | Purpose |
|---|---|
| crawlio-mcp | Complete MCP tool reference — 37 tools, 6 code-mode, 4 resources, 4 prompts |
| crawl-site | Crawl a website with site-type-aware configuration |
| extract-and-export | End-to-end crawl → extract → export in 7 formats |
| observe | Query the append-only observation log |
| finding | Create and query evidence-backed findings |
| audit-site | Full multi-phase site audit with findings report |
| web-research | Acquire > Normalize > Analyze research protocol |
## crawlio-mcp
Complete reference for the Crawlio MCP server itself — every tool, mode, resource, and prompt. Load this skill first when an agent is orchestrating anything non-trivial through Crawlio.
Trigger phrases: MCP agents load this as their baseline reference; no user invocation needed.
### Modes

Crawlio MCP runs in one of two modes depending on how you launch the server.

| Mode | Tools | When to use |
|---|---|---|
| Code mode (default) | 6 | Lower tool count; better for context-constrained clients. Uses search_api + execute_api to drive the full HTTP surface. |
| Full mode (`--full`) | 37 | Typed parameters and annotations for every operation. Better for clients that can handle many tools. |
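
In code mode, anything without a dedicated tool is reached by discovering an endpoint and then executing it. A minimal sketch of that pattern — the parameter names for search_api and execute_api here are assumptions, not confirmed signatures; check the tool schemas at runtime:

```
// Hypothetical shapes: search_api takes a keyword query,
// execute_api takes an HTTP method, path, and optional body.
endpoints = search_api({ query: "recrawl" })    // discover matching endpoints
execute_api({
  method: "POST",
  path: endpoints[0].path,                      // first match, e.g. a recrawl endpoint
  body: { urls: ["https://example.com/docs"] }
})
```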
### Full-mode tools (37)

- Status & Monitoring (6) — get_crawl_status, get_crawl_logs, get_errors, get_downloads, get_failed_urls, get_site_tree
- Control (4) — start_crawl, stop_crawl, pause_crawl, resume_crawl
- Settings (3) — get_settings, update_settings, recrawl_urls
- Projects (5) — list_projects, save_project, load_project, delete_project, get_project
- Export & Extraction (5) — export_site, get_export_status, extract_site, get_extraction_status, trigger_capture
- OCR (1) — extract_text_from_image (runs Vision OCR locally; no Crawlio.app required)
- Enrichment (6) — get_enrichment, submit_enrichment_bundle, submit_enrichment_framework, submit_enrichment_network, submit_enrichment_console, submit_enrichment_dom
- Observations & Findings (5) — get_observations, get_observation, create_finding, get_findings, get_crawled_urls
### Code-mode tools (6)

| Tool | Purpose |
|---|---|
| search_api | Discover endpoints by keyword |
| execute_api | Execute any HTTP request against ControlServer |
| trigger_capture | WebKit runtime capture (framework + network + console + DOM) |
| extract_text_from_image | Vision OCR on a local image path |
| analyze_page | Composite single-page analysis → evidenceId, evidenceQuality, gaps |
| compare_pages | Side-by-side analysis → comparisonReadiness, symmetric, degradationNotes, timingDelta |
### HTTP-only endpoints (3)

Accessible via execute_api but not as MCP tools:

- GET /health — server health, version, uptime, PID
- GET /debug/metrics — engine metrics (connections, queue depth, memory)
- POST /debug/dump-state — full engine state dump
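
A sketch of reaching these through execute_api — assuming, as above, that it takes method and path parameters:

```
execute_api({ method: "GET", path: "/health" })             // health, version, uptime, PID
execute_api({ method: "GET", path: "/debug/metrics" })      // connections, queue depth, memory
execute_api({ method: "POST", path: "/debug/dump-state" })  // full engine state dump
```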
### Resources (4) + template (1)

| URI | Description |
|---|---|
| crawlio://status | Engine state and progress |
| crawlio://settings | Current crawl settings |
| crawlio://site-tree | Downloaded file tree |
| crawlio://enrichment | All browser enrichment data |
| crawlio://enrichment/{url} | Per-URL enrichment (template) |
### Prompts (4)

| Prompt | Arguments |
|---|---|
| crawl-and-analyze | url (req), maxDepth (opt) |
| export-site | url (req), format (req), destination (opt) |
| compare-sites | url1 (req), url2 (req) |
| fix-failed-urls | none |
### Advanced options exposed through update_settings + export_site

These are real options on `update_settings({settings: ...})`, `update_settings({policy: ...})`, and `export_site`:

- WARC configuration — `export_site({format: "warc", warcConfiguration: {compressionEnabled, maxFileSize, cdxEnabled, dedupEnabled}})`. `maxFileSize: 0` disables splitting.
- Proxy configuration — `update_settings({settings: {proxyConfiguration: {type: "http"|"https"|"socks5", host, port, username?, password?, noProxyHosts?}}})`.
- TLS pinning — `update_settings({policy: {pinnedPublicKeys: {"example.com": ["sha256hex..."]}}})`.
- HTTP/2 preference — `update_settings({settings: {preferHTTP2: true}})`.
- Auto-upgrade HTTP — `update_settings({policy: {autoUpgradeHTTP: true}})`.
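
As an illustration, several of these options composed into one call — whether settings and policy can be combined in a single update_settings call is an assumption; issue two calls if the server rejects the combined form:

```
update_settings({
  settings: {
    proxyConfiguration: { type: "socks5", host: "127.0.0.1", port: 1080 },
    preferHTTP2: true                    // prefer HTTP/2 where the origin supports it
  },
  policy: {
    autoUpgradeHTTP: true,               // upgrade http:// links to https://
    pinnedPublicKeys: { "example.com": ["sha256hex..."] }
  }
})
```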
## crawl-site

Crawl a website with intelligent configuration. Detects site type, optimizes settings, monitors progress, retries failures, and reports results.

Trigger phrases: "crawl a site", "download a website", "mirror a site", "scrape a site"

Invoke: `/crawlio:crawl-site https://example.com`
### Workflow

1. Determine site type (static, SPA, CMS, docs, or single-page snapshot)
2. Configure settings via update_settings (concurrency, delay, depth, scope, exclusions)
3. Start crawl via start_crawl
4. Monitor progress via get_crawl_status, polling with the since sequence number
5. Check issues via get_failed_urls, get_errors, get_site_tree
6. Retry failures via recrawl_urls
7. Report results (page count, errors, duration)
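
A sketch of the monitor-and-retry loop (steps 4–6). The since cursor comes from the skill text; the seq/done field names and the recrawl_urls parameter shape are illustrative assumptions:

```
since = 0
do {
  status = get_crawl_status({ since })   // only events after the cursor
  since = status.seq                     // illustrative field name
} while (!status.done)                   // illustrative field name

failed = get_failed_urls()
if (failed.length > 0) {
  recrawl_urls({ urls: failed })         // illustrative parameter shape
}
```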
### Site-type presets
| Site Type | Depth | Concurrency | Notes |
|---|---|---|---|
| Static | 5 | 8 | Standard HTML/CSS |
| SPA (React, Vue) | 3 | 4 | includeSupportingFiles: true, consider crawlio-agent for framework detection |
| CMS (WordPress) | 5 | 4 | Exclude /wp-admin/*, /wp-json/* |
| Documentation | 10 | 6 | Exclude old version paths |
| Single page | 0 | 1 | includeSupportingFiles: true |
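
For example, the CMS preset might translate into calls like these — the settings keys are assumptions inferred from the preset columns and the audit-site presets, and start_crawl's url parameter is likewise assumed:

```
update_settings({
  settings: {
    maxDepth: 5,
    maxConcurrent: 4,
    exclusions: ["/wp-admin/*", "/wp-json/*"]   // skip WordPress admin and REST routes
  }
})
start_crawl({ url: "https://blog.example.com" })
```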
### MCP tools used
update_settings, start_crawl, get_crawl_status, get_crawl_logs, get_errors, get_failed_urls, recrawl_urls, get_site_tree, stop_crawl
## extract-and-export

End-to-end pipeline: crawl a site, extract structured content (clean HTML, markdown, metadata, asset manifests), and export in any of 7 formats.

Trigger phrases: "download and export a site", "crawl and extract content", "archive a website", "export as WARC/ZIP/PDF"

Invoke: `/crawlio:extract-and-export https://docs.stripe.com 5 warc`
### Arguments

| Argument | Required | Default | Description |
|---|---|---|---|
| URL | Yes | n/a | The URL to crawl |
| maxDepth | No | 3 | Maximum crawl depth |
| format | No | folder | Export format |
### Export formats

| Format | Description |
|---|---|
| folder | Mirror on disk with original directory structure |
| zip | Compressed archive, ready to share |
| singleHTML | All assets inlined into a single HTML file |
| warc | ISO 28500 web archive standard (supports CDX, dedup, compression, splitting) |
| pdf | Rendered pages as portable document |
| extracted | Structured JSON only — clean text, metadata, asset manifests |
| deploy | Production-ready bundle with crawl-manifest.json |
### Workflow

1. Configure. update_settings — depth, scope, policy
2. Crawl. start_crawl + poll get_crawl_status (use the since sequence number)
3. Check. get_failed_urls, get_errors; retry with recrawl_urls
4. Review. get_site_tree, get_downloads
5. Extract. extract_site, then poll get_extraction_status
6. Export. export_site({format}), then poll get_export_status
7. Report. Pages downloaded, export format, file size, any failures
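
Steps 5–6 as a sketch, exporting to WARC with the warcConfiguration options from the crawlio-mcp section. The status-polling field names (state, "complete") are illustrative assumptions:

```
extract_site({})
while (get_extraction_status().state !== "complete") { /* wait, then poll again */ }

export_site({
  format: "warc",
  warcConfiguration: {
    compressionEnabled: true,
    cdxEnabled: true,
    dedupEnabled: true,
    maxFileSize: 0               // 0 disables file splitting
  }
})
while (get_export_status().state !== "complete") { /* wait, then poll again */ }
```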
### MCP tools used
update_settings, start_crawl, get_crawl_status, get_failed_urls, get_errors, recrawl_urls, get_site_tree, get_downloads, extract_site, get_extraction_status, export_site, get_export_status, save_project
## observe

Query the append-only observation log — the timeline of everything Crawlio observed during a crawl session. Also supports single-observation lookup for evidence-chain verification.

Trigger phrases: "check observations", "what did Crawlio see", "show crawl timeline", "query the observation log"

Invoke: `/crawlio:observe example.com`
### get_observations filters

| Parameter | Description |
|---|---|
| host | Filter by hostname |
| source | Filter by source (see table below) |
| op | Filter by operation type (see table below) |
| since | Unix epoch seconds — only observations after this time |
| limit | Maximum number of results (default: 20) |
### Observation sources

| Source | What it captures |
|---|---|
| extension | Chrome extension enrichment — framework detection, network requests, console logs, DOM snapshots |
| engine | Crawl lifecycle events (crawl_start, crawl_done) |
| webkit | WebKit runtime capture (triggered via trigger_capture or analyze_page) |
| agent | AI-created findings |
### Operations

| Op | Meaning |
|---|---|
| observe | Raw data capture |
| finding | Agent-created insight |
| crawl_start | Crawl began |
| crawl_done | Crawl completed (includes progress payload: totalDiscovered, downloaded, failed) |
| page | Single-page observation |
### Single-observation lookup

Use `get_observation({ id })` to verify evidence referenced by a finding, or to inspect the full payload of an evidenceId returned by analyze_page / compare_pages. Works with both obs_xxx and fnd_xxx IDs.

```
get_observation({ id: "obs_a1b2c3d4" })
```

### Examples
```
// Recent observations
get_observations({ limit: 20 })

// Filter by host
get_observations({ host: "example.com", limit: 50 })

// Extension captures only
get_observations({ host: "example.com", source: "extension" })

// Since a specific epoch
get_observations({ since: 1708444200, limit: 100 })

// Combined
get_observations({ host: "example.com", source: "extension", op: "observe", limit: 50 })
```

### Observation payload shape
Each entry contains id, op, ts (ISO 8601), url, source, and a composite data payload (framework detection, network requests, console logs, progress, etc.).
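
An illustrative entry, with invented values to show the shape only:

```
{
  "id": "obs_a1b2c3d4",
  "op": "observe",
  "ts": "2024-02-20T14:30:00Z",
  "url": "https://example.com",
  "source": "extension",
  "data": {
    "framework": { "name": "Next.js", "version": "14.2.0" },
    "networkRequests": [ /* ... */ ],
    "consoleLogs": [ /* ... */ ]
  }
}
```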
### MCP tools used
get_observations, get_observation
## finding

Create and query evidence-backed findings. Findings are the agent's judgment layer on top of raw observations — curated insights that persist across sessions, each backed by a chain of observation IDs.

Trigger phrases: "create a finding", "record an insight", "what findings exist", "show findings"

Invoke: `/crawlio:finding`
### create_finding parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| title | string | Yes | Short, specific title |
| url | string | No | URL this finding relates to (leave empty for site-wide) |
| evidence | string[] | No | Array of observation IDs (obs_xxx) |
| synthesis | string | No | Detailed explanation of the pattern and its impact |
| confidence | string | No | high, medium, low, or none |
| category | string | No | Dimension: performance, security, seo, framework, errors, structure, accessibility, etc. |

Note: the API uses synthesis and confidence, not description and severity.
### Finding categories
| Category | Example title |
|---|---|
| Performance | "Render-blocking scripts delay FCP by 2.3s" |
| Security | "Mixed content: HTTP resources on HTTPS page" |
| SEO | "Missing meta descriptions on 12 pages" |
| Framework | "Next.js App Router with ISR detected" |
| Errors | "3 JavaScript errors on product pages" |
| Structure | "Orphaned pages not linked from navigation" |
| Accessibility | "Missing alt attributes on hero images" |
### Evidence chain

The canonical flow:

1. `analyze_page({ url })` → returns evidenceId
2. `create_finding({ evidence: [evidenceId], ... })` → stores the finding with the chain
3. `get_observation({ id: evidenceId })` → verifies the evidence record exists and supports the claim
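
The same flow as a compact sketch, assuming the return shapes described above:

```
evidence = analyze_page({ url: "https://example.com" })   // 1. returns evidenceId
create_finding({
  title: "…",
  url: "https://example.com",
  evidence: [evidence.evidenceId],                        // 2. finding carries the chain
  synthesis: "…",
  confidence: "medium",
  category: "errors"
})
get_observation({ id: evidence.evidenceId })              // 3. round-trip verification
```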
### Example

```
create_finding({
  title: "Mixed content: HTTP images on HTTPS page",
  url: "https://example.com",
  evidence: ["obs_a3f7b2c1", "obs_b4e8c3d2"],
  synthesis: "Homepage loads 3 images over HTTP despite serving over HTTPS. Network observations show requests to http://cdn.example.com/img/. Triggers mixed-content warnings in Chrome and may be blocked in strict mode.",
  confidence: "high",
  category: "security"
})
```

### Finding quality checklist
- Title — specific ("3 images use HTTP on HTTPS pages"), not vague ("mixed content found")
- Evidence — observation IDs that actually support the claim
- Synthesis — explains why this matters and what the impact is, not just what was observed
- Confidence — signal how strongly the evidence supports the claim
### Querying findings

```
get_findings({})                       // all
get_findings({ host: "example.com" }) // per-host
get_findings({ limit: 10 })            // most recent
```

### MCP tools used
get_observations, get_observation, create_finding, get_findings, analyze_page
## audit-site

Full site audit: crawl, capture enrichment, analyze observations, and produce a findings report with prioritized recommendations. This is the highest-level skill — it orchestrates crawl-site, observe, and finding into a structured multi-phase analysis.

Trigger phrases: "audit a site", "analyze a website", "review a site", "site health check"

Invoke: `/crawlio:audit-site https://example.com`
### Audit phases

Phase 1: Configure for the target — call update_settings with depth, scope, concurrency, and policy. Size-based presets:

| Site Size | Pages | maxDepth | maxConcurrent |
|---|---|---|---|
| Small | < 100 | 10 | 8 |
| Medium | 100–1,000 | 5 | 4 |
| Large | > 1,000 | 3 | 2 |
Phase 2: Crawl — start_crawl and poll get_crawl_status with the since sequence number. Handle rate limiting (429s trigger automatic backoff).

Phase 3: Capture enrichment — if the Chrome extension is running, enrichment data is captured automatically during the crawl and appended to the observation log.

Phase 4: Analyze observations — query for patterns:

```
get_observations({ host: "example.com", limit: 200 })
get_observations({ host: "example.com", source: "extension", limit: 50 })
get_observations({ op: "crawl_done" })
get_errors()
get_failed_urls()
```

Phase 5: Create findings — for each issue or insight, call create_finding with title, url, evidence, synthesis, confidence, and category.
Phase 6: Generate report — compile via `get_findings({ host })`. Report format:

- Technology stack — framework, rendering mode, CDN, third-party services
- Findings — grouped by category, sorted by severity (via confidence)
- Site structure — tree overview, orphaned pages, broken links
- Recommendations — prioritized action items
### Audit checklist
- Technology — framework, rendering mode, CDN, third-party services
- Performance — page count, large files, external dependencies
- Security — HTTPS enforcement, mixed content, security headers
- Content — failed URLs, redirect chains, missing resources
- Structure — site tree, depth distribution, cross-domain assets
### MCP tools used
update_settings, start_crawl, get_crawl_status, get_crawl_logs, get_errors, get_failed_urls, recrawl_urls, get_site_tree, get_enrichment, get_observations, get_observation, create_finding, get_findings, save_project, export_site
## web-research

Structured web-research protocol built on analyze_page and compare_pages. Teaches agents to follow the Acquire > Normalize > Analyze pattern so evidence records stay canonical and comparisons are reliable.

Trigger phrases: "research a site", "compare sites", "analyze technology", "structured web research"

Invoke: Not a user-facing slash command; agents load this as a protocol reference.
### Core protocol

1. Acquire. Use composite tools — never the low-level trigger_capture + sleep + get_enrichment pattern.

| Goal | Tool | Notes |
|---|---|---|
| Single-page evidence | analyze_page | One call = capture + enrichment + crawl status. Returns evidenceId, evidenceQuality, gaps |
| Two-site comparison | compare_pages | Sequential analysis with typed comparison evidence |
| Single evidence lookup | get_observation | Verify a specific evidence record by ID |
| Bulk crawl data | get_crawled_urls | After a completed crawl |
| Historical timeline | get_observations | Append-only audit trail |
2. Normalize. Extract fields from the canonical record before drawing conclusions:
- Framework — name, version, rendering mode (SSR/SSG/CSR/ISR)
- Network — request count, external domains, resource types
- Console — error count, warning patterns
- Crawl — status, content type, byte count
Check enrichmentStatus before using enrichment data — "ok" means present and usable; "timeout" means capture completed but enrichment didn't arrive in time (note the gap).
Check evidenceQuality for overall health — "complete" (no gaps), "partial" (has gaps but capture succeeded), "degraded" (capture-level failure or enrichment server error).
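
A sketch of gating on these two fields before trusting the record — evidenceQuality and gaps are returned by analyze_page per the table above; where enrichmentStatus lives on the record is an assumption:

```
result = analyze_page({ url: "https://example.com" })

if (result.evidenceQuality === "degraded") {
  // capture-level failure or enrichment server error — re-run or report, don't analyze
} else if (result.evidenceQuality === "partial") {
  // capture succeeded with gaps — cite result.gaps in any finding built on this record
}

if (result.enrichmentStatus === "timeout") {    // assumed location of the field
  // enrichment never arrived — treat enrichment-derived fields as missing, note the gap
}
```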
3. Analyze. Compare normalized evidence against a rubric. Record insights via create_finding.
### Anti-pattern

Never do this:

```
trigger_capture({ url })
// sleep 5s
get_enrichment({ url })
```

Do this instead:

```
analyze_page({ url })
```

Don't improvise evidence shapes — the record from analyze_page is the canonical format. Don't restructure it ad hoc.
### Comparison protocol

For side-by-side analysis:

```
compare_pages({ urlA: "https://site-a.com", urlB: "https://site-b.com" })
```

The response includes a comparisonSummary with typed fields:

- `comparisonReadiness` — `ready` (both complete), `cautious` (one partial), `unreliable` (either degraded)
- `symmetric` — whether both sides have identical gap profiles
- `degradationNotes` — human-readable gaps per side
- `timingDelta` — absolute timing differences (capture, enrichment polling)
- `enrichmentAgeDeltaMs` — timestamp difference between the two analyses
- `evidenceIdA` / `evidenceIdB` — observation IDs for round-trip verification via get_observation
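
Consuming the summary, as a sketch — assuming comparisonSummary sits at the top level of the response:

```
cmp = compare_pages({ urlA: "https://site-a.com", urlB: "https://site-b.com" })
s = cmp.comparisonSummary

if (s.comparisonReadiness === "unreliable") {
  // at least one side degraded — report s.degradationNotes instead of conclusions
} else if (!s.symmetric) {
  // gap profiles differ — qualify every comparison with the asymmetry
}

get_observation({ id: s.evidenceIdA })   // round-trip verification of side A
```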
### Comparison dimensions (10)
When comparing two sites, evaluate across these dimensions and note gaps explicitly:
- Framework — name, version, rendering strategy
- Performance — resource count, total bytes, external dependencies
- Security — HTTPS enforcement, mixed content, CSP presence
- SEO — meta tags, structured data, canonical URLs
- Accessibility — semantic HTML, ARIA usage, alt text
- Error surface — console errors, failed resources, 4xx/5xx responses
- Third-party load — analytics, tracking, CDN usage
- Architecture — SPA vs MPA, API patterns, hydration strategy
- Content delivery — CDN, caching headers, compression
- Mobile readiness — viewport meta, responsive signals
### Example workflow

```
// 1. Acquire
result = analyze_page({ url: "https://example.com" })

// 2. Normalize
framework = result.enrichment.framework        // { name: "Next.js", version: "14.2.0" }
networkCount = result.enrichment.networkRequests.length
consoleErrors = result.enrichment.consoleLogs.filter(e => e.level === "error")

// 3. Analyze & record
create_finding({
  title: "Next.js 14 with high external dependency count",
  url: "https://example.com",
  evidence: [result.evidenceId],
  synthesis: "Detected Next.js 14.2.0 with 47 network requests including 12 third-party domains...",
  confidence: "high",
  category: "performance"
})
```

### MCP tools used
analyze_page, compare_pages, get_observation, get_observations, get_crawled_urls, create_finding
## Agent: site-auditor
The plugin includes a custom agent definition at agents/site-auditor.md. It wraps the audit-site skill into an autonomous agent:
- Model — Sonnet
- Tools — Bash, Read, Glob, Grep, WebFetch, WebSearch + all Crawlio MCP tools
- Protocol — Reconnaissance → Crawl → Multi-pass Analysis (Structure / Errors / Enrichment / Synthesis) → Report
The agent follows the same phases as audit-site but can also use file-system tools and web search to cross-reference findings with external documentation and best practices.
### Finding standards enforced by the agent
- Specific title — "3 images use HTTP on HTTPS pages", not "mixed content found"
- Evidence — at least one observation ID referenced
- Impact — synthesis explains why this matters
- Actionable — recommendations included in the report
### When to use the agent vs the skill
| Use case | Choose |
|---|---|
| Interactive audit with human guidance | audit-site skill |
| Fully autonomous audit | site-auditor agent |
| Quick crawl without analysis | crawl-site skill |
| Pipeline integration (crawl + export) | extract-and-export skill |
| Single-page or two-site research | web-research skill |
| MCP tool lookup | crawlio-mcp skill |