Crawlio Docs

Skill Reference

All skills live in the crawlio-plugin repo under skills/. Each skill is a single SKILL.md file that any MCP-compatible AI agent can read and follow. There are 7 skills in the current plugin release plus one site-auditor agent.

| Skill | Purpose |
| --- | --- |
| crawlio-mcp | Complete MCP tool reference — 37 tools, 6 code-mode, 4 resources, 4 prompts |
| crawl-site | Crawl a website with site-type-aware configuration |
| extract-and-export | End-to-end crawl → extract → export in 7 formats |
| observe | Query the append-only observation log |
| finding | Create and query evidence-backed findings |
| audit-site | Full multi-phase site audit with findings report |
| web-research | Acquire > Normalize > Analyze research protocol |

crawlio-mcp

Complete reference for the Crawlio MCP server itself — every tool, mode, resource, and prompt. Load this skill first when an agent is orchestrating anything non-trivial through Crawlio.

Trigger phrases: MCP agents load this as their baseline reference; no user invocation needed.

Modes

Crawlio MCP runs in one of two modes depending on how you launch the server.

| Mode | Tools | When to use |
| --- | --- | --- |
| Code mode (default) | 6 | Lower tool count; better for context-constrained clients. Uses search_api + execute_api to drive the full HTTP surface. |
| Full mode (--full) | 37 | Typed parameters and annotations for every operation. Better for clients that can handle many tools. |
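
The code-mode loop is discover-then-execute. A minimal sketch: the endpoint path and the argument names for execute_api are illustrative assumptions, and the search_api results describe the real request shapes.

// 1. Discover endpoints that match the task (keyword search)
search_api({ query: "failed urls" })

// 2. Execute the matching request. Method, path, and body come from the search
//    result; the argument names and path shown here are illustrative.
execute_api({ method: "GET", path: "/crawl/failed-urls" })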

Full-mode tools (37)

Status & Monitoring (6): get_crawl_status, get_crawl_logs, get_errors, get_downloads, get_failed_urls, get_site_tree

Control (4): start_crawl, stop_crawl, pause_crawl, resume_crawl

Settings (3): get_settings, update_settings, recrawl_urls

Projects (5): list_projects, save_project, load_project, delete_project, get_project

Export & Extraction (5): export_site, get_export_status, extract_site, get_extraction_status, trigger_capture

OCR (1): extract_text_from_image (runs Vision OCR locally; no Crawlio.app required)

Enrichment (6): get_enrichment, submit_enrichment_bundle, submit_enrichment_framework, submit_enrichment_network, submit_enrichment_console, submit_enrichment_dom

Observations & Findings (5): get_observations, get_observation, create_finding, get_findings, get_crawled_urls

Code-mode tools (6)

| Tool | Purpose |
| --- | --- |
| search_api | Discover endpoints by keyword |
| execute_api | Execute any HTTP request against ControlServer |
| trigger_capture | WebKit runtime capture (framework + network + console + DOM) |
| extract_text_from_image | Vision OCR on a local image path |
| analyze_page | Composite single-page analysis → evidenceId, evidenceQuality, gaps |
| compare_pages | Side-by-side analysis → comparisonReadiness, symmetric, degradationNotes, timingDelta |

HTTP-only endpoints (3)

Accessible via execute_api but not as MCP tools:

  • GET /health — server health, version, uptime, PID
  • GET /debug/metrics — engine metrics (connections, queue depth, memory)
  • POST /debug/dump-state — full engine state dump
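
From code mode these can be reached the same way; a sketch, assuming execute_api takes a method and a path:

execute_api({ method: "GET", path: "/health" })             // server health, version, uptime, PID
execute_api({ method: "GET", path: "/debug/metrics" })      // engine metrics
execute_api({ method: "POST", path: "/debug/dump-state" })  // full engine state dump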

Resources (4) + template (1)

| URI | Description |
| --- | --- |
| crawlio://status | Engine state and progress |
| crawlio://settings | Current crawl settings |
| crawlio://site-tree | Downloaded file tree |
| crawlio://enrichment | All browser enrichment data |
| crawlio://enrichment/{url} | Per-URL enrichment (template) |

Prompts (4)

| Prompt | Arguments |
| --- | --- |
| crawl-and-analyze | url (req), maxDepth (opt) |
| export-site | url (req), format (req), destination (opt) |
| compare-sites | url1 (req), url2 (req) |
| fix-failed-urls | none |

Advanced options exposed through update_settings + export_site

These are real options on update_settings({settings: ...}) / update_settings({policy: ...}) / export_site:

  • WARC configuration — export_site({format: "warc", warcConfiguration: {compressionEnabled, maxFileSize, cdxEnabled, dedupEnabled}}). maxFileSize: 0 disables splitting.
  • Proxy configuration — update_settings({settings: {proxyConfiguration: {type: "http"|"https"|"socks5", host, port, username?, password?, noProxyHosts?}}}).
  • TLS pinning — update_settings({policy: {pinnedPublicKeys: {"example.com": ["sha256hex..."]}}}).
  • HTTP/2 preference — update_settings({settings: {preferHTTP2: true}}).
  • Auto-upgrade HTTP — update_settings({policy: {autoUpgradeHTTP: true}}).
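
A sketch combining two of these: crawl through a local SOCKS5 proxy, then export a deduplicated, CDX-indexed WARC. The host, port, and maxFileSize value (and its unit) are illustrative.

update_settings({
  settings: {
    proxyConfiguration: { type: "socks5", host: "127.0.0.1", port: 9050 }
  }
})

export_site({
  format: "warc",
  warcConfiguration: {
    compressionEnabled: true,
    cdxEnabled: true,
    dedupEnabled: true,
    maxFileSize: 1073741824   // assumed bytes; 0 disables splitting
  }
})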

crawl-site

Crawl a website with intelligent configuration. Detects site type, optimizes settings, monitors progress, retries failures, and reports results.

Trigger phrases: "crawl a site", "download a website", "mirror a site", "scrape a site"

Invoke: /crawlio:crawl-site https://example.com

Workflow

  1. Determine site type (static, SPA, CMS, docs, or single-page snapshot)
  2. Configure settings via update_settings (concurrency, delay, depth, scope, exclusions)
  3. Start crawl via start_crawl
  4. Monitor progress via get_crawl_status, polling with the since sequence number
  5. Check issues via get_failed_urls, get_errors, get_site_tree
  6. Retry failures via recrawl_urls
  7. Report results (page count, errors, duration)

Site-type presets

| Site Type | Depth | Concurrency | Notes |
| --- | --- | --- | --- |
| Static | 5 | 8 | Standard HTML/CSS |
| SPA (React, Vue) | 3 | 4 | includeSupportingFiles: true; consider crawlio-agent for framework detection |
| CMS (WordPress) | 5 | 4 | Exclude /wp-admin/*, /wp-json/* |
| Documentation | 10 | 6 | Exclude old version paths |
| Single page | 0 | 1 | includeSupportingFiles: true |
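
A minimal sketch of the workflow for an SPA, using the preset above. The maxDepth / maxConcurrent names come from the audit presets later on this page; their placement under settings, the start_crawl / recrawl_urls argument names, and the polling shape are assumptions.

// 1–3. Configure for an SPA, then start the crawl
update_settings({ settings: { maxDepth: 3, maxConcurrent: 4, includeSupportingFiles: true } })
start_crawl({ url: "https://app.example.com" })

// 4. Poll with the last seen sequence number
status = get_crawl_status({ since: lastSeq })   // repeat until the crawl reports completion

// 5–6. Inspect and retry failures
failed = get_failed_urls()
recrawl_urls({ urls: failed.urls })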

MCP tools used

update_settings, start_crawl, get_crawl_status, get_crawl_logs, get_errors, get_failed_urls, recrawl_urls, get_site_tree, stop_crawl


extract-and-export

End-to-end pipeline: crawl a site, extract structured content (clean HTML, markdown, metadata, asset manifests), and export in any of 7 formats.

Trigger phrases: "download and export a site", "crawl and extract content", "archive a website", "export as WARC/ZIP/PDF"

Invoke: /crawlio:extract-and-export https://docs.stripe.com 5 warc

Arguments

| Argument | Required | Default | Description |
| --- | --- | --- | --- |
| URL | Yes | n/a | The URL to crawl |
| maxDepth | No | 3 | Maximum crawl depth |
| format | No | folder | Export format |

Export formats

| Format | Description |
| --- | --- |
| folder | Mirror on disk with original directory structure |
| zip | Compressed archive, ready to share |
| singleHTML | All assets inlined into a single HTML file |
| warc | ISO 28500 web archive standard (supports CDX, dedup, compression, splitting) |
| pdf | Rendered pages as a portable document |
| extracted | Structured JSON only — clean text, metadata, asset manifests |
| deploy | Production-ready bundle with crawl-manifest.json |

Workflow

  1. Configure. update_settings — depth, scope, policy
  2. Crawl. start_crawl + poll get_crawl_status (use since sequence number)
  3. Check. get_failed_urls, get_errors; retry with recrawl_urls
  4. Review. get_site_tree, get_downloads
  5. Extract. extract_site, then poll get_extraction_status
  6. Export. export_site({format}), then poll get_export_status
  7. Report. Pages downloaded, export format, file size, any failures
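
The extract and export steps are both asynchronous jobs: start them, then poll their status tools. A sketch of steps 5–6, where any argument shapes beyond format are assumptions:

// 5. Extract structured content, then wait for completion
extract_site({})
get_extraction_status({})        // poll until the extraction reports done

// 6. Export, then wait for the export job
export_site({ format: "zip" })
get_export_status({})            // poll until done; report file size and location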

MCP tools used

update_settings, start_crawl, get_crawl_status, get_failed_urls, get_errors, recrawl_urls, get_site_tree, get_downloads, extract_site, get_extraction_status, export_site, get_export_status, save_project


observe

Query the append-only observation log — the timeline of everything Crawlio observed during a crawl session. Also supports single-observation lookup for evidence-chain verification.

Trigger phrases: "check observations", "what did Crawlio see", "show crawl timeline", "query the observation log"

Invoke: /crawlio:observe example.com

get_observations filters

| Parameter | Description |
| --- | --- |
| host | Filter by hostname |
| source | Filter by source (see table below) |
| op | Filter by operation type (see table below) |
| since | Unix epoch seconds — only observations after this time |
| limit | Maximum number of results (default: 20) |

Observation sources

| Source | What it captures |
| --- | --- |
| extension | Chrome extension enrichment — framework detection, network requests, console logs, DOM snapshots |
| engine | Crawl lifecycle events (crawl_start, crawl_done) |
| webkit | WebKit runtime capture (triggered via trigger_capture or analyze_page) |
| agent | AI-created findings |

Operations

| Op | Meaning |
| --- | --- |
| observe | Raw data capture |
| finding | Agent-created insight |
| crawl_start | Crawl began |
| crawl_done | Crawl completed (includes progress payload: totalDiscovered, downloaded, failed) |
| page | Single-page observation |

Single-observation lookup

Use get_observation({ id }) to verify evidence referenced by a finding, or to inspect the full payload of an evidenceId returned by analyze_page / compare_pages. Works with both obs_xxx and fnd_xxx IDs.

get_observation({ id: "obs_a1b2c3d4" })

Examples

# Recent observations
get_observations({ limit: 20 })
 
# Filter by host
get_observations({ host: "example.com", limit: 50 })
 
# Extension captures only
get_observations({ host: "example.com", source: "extension" })
 
# Since a specific epoch
get_observations({ since: 1708444200, limit: 100 })
 
# Combined
get_observations({ host: "example.com", source: "extension", op: "observe", limit: 50 })

Observation payload shape

Each entry contains id, op, ts (ISO 8601), url, source, and a composite data payload (framework detection, network requests, console logs, progress, etc.).
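
An illustrative entry; only the field names come from this page, and the values are invented:

{
  id: "obs_a1b2c3d4",
  op: "observe",
  ts: "2025-02-20T14:30:00Z",
  url: "https://example.com/",
  source: "extension",
  data: { /* framework detection, network requests, console logs, progress, ... */ }
}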

MCP tools used

get_observations, get_observation


finding

Create and query evidence-backed findings. Findings are the agent's judgment layer on top of raw observations — curated insights that persist across sessions, each backed by a chain of observation IDs.

Trigger phrases: "create a finding", "record an insight", "what findings exist", "show findings"

Invoke: /crawlio:finding

create_finding parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| title | string | Yes | Short, specific title |
| url | string | No | URL this finding relates to (leave empty for site-wide) |
| evidence | string[] | No | Array of observation IDs (obs_xxx) |
| synthesis | string | No | Detailed explanation of pattern and impact |
| confidence | string | No | high, medium, low, or none |
| category | string | No | Dimension: performance, security, seo, framework, errors, structure, accessibility, etc. |

Note: the API uses synthesis and confidence, not description and severity.

Finding categories

| Category | Example title |
| --- | --- |
| Performance | "Render-blocking scripts delay FCP by 2.3s" |
| Security | "Mixed content: HTTP resources on HTTPS page" |
| SEO | "Missing meta descriptions on 12 pages" |
| Framework | "Next.js App Router with ISR detected" |
| Errors | "3 JavaScript errors on product pages" |
| Structure | "Orphaned pages not linked from navigation" |
| Accessibility | "Missing alt attributes on hero images" |

Evidence chain

The canonical flow:

  1. analyze_page({ url }) → returns evidenceId
  2. create_finding({ evidence: [evidenceId], ... }) → stores the finding with the chain
  3. get_observation({ id: evidenceId }) → verifies the evidence record exists and supports the claim
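
A compact sketch of that chain, where the title and synthesis values are placeholders:

result = analyze_page({ url: "https://example.com" })

create_finding({
  title: "…",                       // specific, evidence-backed title
  url: "https://example.com",
  evidence: [result.evidenceId],
  synthesis: "…",
  confidence: "medium",
  category: "performance"
})

get_observation({ id: result.evidenceId })   // round-trip verification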

Example

create_finding({
  title: "Mixed content: HTTP images on HTTPS page",
  url: "https://example.com",
  evidence: ["obs_a3f7b2c1", "obs_b4e8c3d2"],
  synthesis: "Homepage loads 3 images over HTTP despite serving over HTTPS. Network observations show requests to http://cdn.example.com/img/. Triggers mixed-content warnings in Chrome and may be blocked in strict mode.",
  confidence: "high",
  category: "security"
})

Finding quality checklist

  • Title — specific ("3 images use HTTP on HTTPS pages"), not vague ("mixed content found")
  • Evidence — observation IDs that actually support the claim
  • Synthesis — explains why this matters and what the impact is, not just what was observed
  • Confidence — signal how strongly the evidence supports the claim

Querying findings

get_findings({})                          # all
get_findings({ host: "example.com" })     # per-host
get_findings({ limit: 10 })               # most recent

MCP tools used

get_observations, get_observation, create_finding, get_findings, analyze_page


audit-site

Full site audit: crawl, capture enrichment, analyze observations, and produce a findings report with prioritized recommendations. This is the highest-level skill — it orchestrates crawl-site, observe, and finding into a structured multi-phase analysis.

Trigger phrases: "audit a site", "analyze a website", "review a site", "site health check"

Invoke: /crawlio:audit-site https://example.com

Audit phases

Phase 1: Configure for the target — call update_settings with depth, scope, concurrency, policy. Size-based presets:

| Site Size | Pages | maxDepth | maxConcurrent |
| --- | --- | --- | --- |
| Small | < 100 | 10 | 8 |
| Medium | 100–1,000 | 5 | 4 |
| Large | > 1,000 | 3 | 2 |

Phase 2: Crawl — start_crawl and poll get_crawl_status with the since sequence number. Handle rate-limiting (429s trigger automatic backoff).

Phase 3: Capture enrichment — if the Chrome extension is running, enrichment data is captured automatically during the crawl and appended to the observation log.

Phase 4: Analyze observations — query for patterns:

get_observations({ host: "example.com", limit: 200 })
get_observations({ host: "example.com", source: "extension", limit: 50 })
get_observations({ op: "crawl_done" })
get_errors()
get_failed_urls()

Phase 5: Create findings — for each issue or insight, call create_finding with title, url, evidence, synthesis, confidence, category.

Phase 6: Generate report — compile via get_findings({ host }). Report format:

  • Technology stack — framework, rendering mode, CDN, third-party services
  • Findings — grouped by category, sorted by severity (via confidence)
  • Site structure — tree overview, orphaned pages, broken links
  • Recommendations — prioritized action items

Audit checklist

  • Technology — framework, rendering mode, CDN, third-party services
  • Performance — page count, large files, external dependencies
  • Security — HTTPS enforcement, mixed content, security headers
  • Content — failed URLs, redirect chains, missing resources
  • Structure — site tree, depth distribution, cross-domain assets

MCP tools used

update_settings, start_crawl, get_crawl_status, get_crawl_logs, get_errors, get_failed_urls, recrawl_urls, get_site_tree, get_enrichment, get_observations, get_observation, create_finding, get_findings, save_project, export_site


web-research

Structured web-research protocol built on analyze_page and compare_pages. Teaches agents to follow the Acquire > Normalize > Analyze pattern so evidence records stay canonical and comparisons are reliable.

Trigger phrases: "research a site", "compare sites", "analyze technology", "structured web research"

Invoke: Not a user-facing slash command; agents load this as a protocol reference.

Core protocol

1. Acquire. Use composite tools — never the low-level trigger_capture + sleep + get_enrichment pattern.

| Goal | Tool | Notes |
| --- | --- | --- |
| Single-page evidence | analyze_page | One call = capture + enrichment + crawl status. Returns evidenceId, evidenceQuality, gaps |
| Two-site comparison | compare_pages | Sequential analysis with typed comparison evidence |
| Single evidence lookup | get_observation | Verify a specific evidence record by ID |
| Bulk crawl data | get_crawled_urls | After a completed crawl |
| Historical timeline | get_observations | Append-only audit trail |

2. Normalize. Extract fields from the canonical record before drawing conclusions:

  • Framework — name, version, rendering mode (SSR/SSG/CSR/ISR)
  • Network — request count, external domains, resource types
  • Console — error count, warning patterns
  • Crawl — status, content type, byte count

Check enrichmentStatus before using enrichment data — "ok" means present and usable; "timeout" means capture completed but enrichment didn't arrive in time (note the gap).

Check evidenceQuality for overall health — "complete" (no gaps), "partial" (has gaps but capture succeeded), "degraded" (capture-level failure or enrichment server error).
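
A sketch of the guard. The placement of these fields on the analyze_page result is an assumption; the enrichment access paths mirror the example workflow at the end of this page.

result = analyze_page({ url: "https://example.com" })

// Only read enrichment-derived fields when enrichment actually arrived
if (result.enrichmentStatus === "ok") {
  framework = result.enrichment.framework
}

// Let evidence quality cap how strong a claim the evidence can support
if (result.evidenceQuality !== "complete") {
  // "partial" or "degraded": note the gaps and lower finding confidence
}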

3. Analyze. Compare normalized evidence against a rubric. Record insights via create_finding.

Anti-pattern

Never do this:

trigger_capture({ url })
// sleep 5s
get_enrichment({ url })

Do this instead:

analyze_page({ url })

Don't improvise evidence shapes — the record from analyze_page is the canonical format. Don't restructure it ad hoc.

Comparison protocol

For side-by-side analysis:

compare_pages({ urlA: "https://site-a.com", urlB: "https://site-b.com" })

The response includes a comparisonSummary with typed fields:

  • comparisonReadiness — ready (both complete), cautious (one partial), unreliable (either degraded)
  • symmetric — whether both sides have identical gap profiles
  • degradationNotes — human-readable gaps per side
  • timingDelta — absolute timing differences (capture, enrichment polling)
  • enrichmentAgeDeltaMs — timestamp difference between the two analyses
  • evidenceIdA / evidenceIdB — observation IDs for round-trip verification via get_observation
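
A sketch of gating on the summary before drawing conclusions; the exact response nesting may differ:

cmp = compare_pages({ urlA: "https://site-a.com", urlB: "https://site-b.com" })

// Only make direct claims when both sides produced complete evidence
if (cmp.comparisonSummary.comparisonReadiness === "ready") {
  // compare normalized fields directly
} else {
  // "cautious" or "unreliable": cite degradationNotes and lower confidence
}

// Round-trip verification of either side's evidence
get_observation({ id: cmp.comparisonSummary.evidenceIdA })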

Comparison dimensions (10)

When comparing two sites, evaluate across these dimensions and note gaps explicitly:

  1. Framework — name, version, rendering strategy
  2. Performance — resource count, total bytes, external dependencies
  3. Security — HTTPS enforcement, mixed content, CSP presence
  4. SEO — meta tags, structured data, canonical URLs
  5. Accessibility — semantic HTML, ARIA usage, alt text
  6. Error surface — console errors, failed resources, 4xx/5xx responses
  7. Third-party load — analytics, tracking, CDN usage
  8. Architecture — SPA vs MPA, API patterns, hydration strategy
  9. Content delivery — CDN, caching headers, compression
  10. Mobile readiness — viewport meta, responsive signals

Example workflow

// 1. Acquire
result = analyze_page({ url: "https://example.com" })
 
// 2. Normalize
framework = result.enrichment.framework       // { name: "Next.js", version: "14.2.0" }
networkCount = result.enrichment.networkRequests.length
consoleErrors = result.enrichment.consoleLogs.filter(e => e.level === "error")
 
// 3. Analyze & record
create_finding({
  title: "Next.js 14 with high external dependency count",
  url: "https://example.com",
  evidence: [result.evidenceId],
  synthesis: "Detected Next.js 14.2.0 with 47 network requests including 12 third-party domains...",
  confidence: "high",
  category: "performance"
})

MCP tools used

analyze_page, compare_pages, get_observation, get_observations, get_crawled_urls, create_finding


Agent: site-auditor

The plugin includes a custom agent definition at agents/site-auditor.md. It wraps the audit-site skill into an autonomous agent:

  • Model — Sonnet
  • Tools — Bash, Read, Glob, Grep, WebFetch, WebSearch + all Crawlio MCP tools
  • Protocol — Reconnaissance → Crawl → Multi-pass Analysis (Structure / Errors / Enrichment / Synthesis) → Report

The agent follows the same phases as audit-site but can also use file-system tools and web search to cross-reference findings with external documentation and best practices.

Finding standards enforced by the agent

  • Specific title — "3 images use HTTP on HTTPS pages", not "mixed content found"
  • Evidence — at least one observation ID referenced
  • Impact — synthesis explains why this matters
  • Actionable — recommendations included in the report

When to use the agent vs the skill

| Use case | Choose |
| --- | --- |
| Interactive audit with human guidance | audit-site skill |
| Fully autonomous audit | site-auditor agent |
| Quick crawl without analysis | crawl-site skill |
| Pipeline integration (crawl + export) | extract-and-export skill |
| Single-page or two-site research | web-research skill |
| MCP tool lookup | crawlio-mcp skill |