Crawlio Docs

HTTP API

Overview

Crawlio.app runs a local HTTP server while active. All 45 endpoints are localhost-only and require no authentication. Request and response bodies are JSON.

Transport

The primary transport is a Unix domain socket. A TCP port is also available as a fallback.

Property Value
Protocol AF_UNIX (Unix Domain Socket)
Path ~/Library/Logs/Crawlio/control.sock
Permissions 0600 (owner only)
Max connections 64 concurrent

The TCP port number is written to ~/Library/Logs/Crawlio/control.port on app launch and deleted on quit.

Connect via UDS

curl --unix-socket ~/Library/Logs/Crawlio/control.sock http://localhost/status

Connect via TCP (fallback)

PORT=$(cat ~/Library/Logs/Crawlio/control.port)
curl http://localhost:$PORT/status
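The same socket can be used from Python by overriding `http.client.HTTPConnection.connect` to dial the Unix domain socket instead of TCP. This is a minimal sketch; `UDSHTTPConnection` is a hypothetical helper name, not part of Crawlio, and the socket path comes from the Transport table above.

```python
import http.client
import os
import socket

class UDSHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that tunnels HTTP over a Unix domain socket."""

    def __init__(self, socket_path: str):
        # The host is only used for the Host: header; routing is via the socket.
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)

SOCKET_PATH = os.path.expanduser("~/Library/Logs/Crawlio/control.sock")

# Usage (requires Crawlio to be running):
# conn = UDSHTTPConnection(SOCKET_PATH)
# conn.request("GET", "/status")
# print(conn.getresponse().read())
```
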

Engine control

Method Path Description
GET /status Engine state, progress counters, enrichment summary, output path. Pass ?since=N for change detection (returns 304 if unchanged)
POST /start Start a crawl. Body: { "url": "..." } or { "urls": [...] } with optional destinationPath
POST /stop Stop the current crawl. Cancels all in-flight requests
POST /pause Pause the current crawl. In-flight downloads complete, no new requests start
POST /resume Resume a paused crawl
POST /recrawl Re-inject URLs into the frontier. Body: { "urls": [...] }. Returns 200 if a crawl is active, 202 if it starts a new one

Engine states

idle, crawling, stopping, localizing, extracting, paused, completed, failed

Start crawl examples

Single URL:

{ "url": "https://example.com" }

Multiple URLs with custom destination:

{
  "urls": ["https://example.com", "https://example.org"],
  "destinationPath": "~/Downloads/Crawlio/multi"
}
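The two body shapes above differ only in the `url` vs. `urls` key. A small builder can pick the right shape; `make_start_body` is a hypothetical helper for illustration, not part of Crawlio.

```python
import json

def make_start_body(urls, destination_path=None):
    """Build the JSON body for POST /start: {"url": ...} for a single URL,
    {"urls": [...]} for several, plus the optional destinationPath."""
    if len(urls) == 1:
        body = {"url": urls[0]}
    else:
        body = {"urls": list(urls)}
    if destination_path is not None:
        body["destinationPath"] = destination_path
    return json.dumps(body)

print(make_start_body(["https://example.com"]))
print(make_start_body(["https://example.com", "https://example.org"],
                      "~/Downloads/Crawlio/multi"))
```
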

Status response

{
  "engineState": "crawling",
  "seedURL": "https://example.com",
  "seq": 42,
  "progress": {
    "totalDiscovered": 150,
    "downloaded": 85,
    "failed": 2,
    "queued": 63,
    "localized": 80
  },
  "enrichment": {
    "pagesEnriched": 12,
    "frameworksDetected": 3,
    "totalNetworkRequests": 245,
    "consoleErrors": 1,
    "consoleWarnings": 5
  },
  "outputPath": "/Users/you/Downloads/Crawlio/example.com"
}
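The progress counters can be reduced to a completion fraction by treating downloaded and failed URLs as settled. `percent_complete` is a hypothetical helper; the field names match the response above.

```python
def percent_complete(progress: dict) -> float:
    """Fraction of discovered URLs that are settled (downloaded or failed)."""
    total = progress.get("totalDiscovered", 0)
    if total == 0:
        return 0.0
    settled = progress.get("downloaded", 0) + progress.get("failed", 0)
    return settled / total

status = {
    "progress": {"totalDiscovered": 150, "downloaded": 85,
                 "failed": 2, "queued": 63, "localized": 80}
}
print(f"{percent_complete(status['progress']):.0%}")  # → 58%
```

For change detection, remember the `seq` value from each response and pass it back as `?since=N`; a 304 means nothing changed.
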

Settings

Method Path Description
GET, PATCH /settings Read or update download settings and crawl policy. PATCH uses merge patch (only include the fields to change). Returns 409 if the engine is active

PATCH example

{
  "settings": { "maxConcurrent": 20 },
  "policy": { "maxDepth": 5 }
}

GET response

{
  "settings": {
    "maxConcurrent": 10,
    "crawlDelay": 0.0,
    "timeout": 60,
    "downloadImages": true,
    "downloadVideo": true,
    "stripTrackingParams": true,
    "maxRetries": 3
  },
  "policy": {
    "scopeMode": "sameDomain",
    "maxDepth": 0,
    "maxPagesPerCrawl": 0,
    "respectRobotsTxt": true
  }
}
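The merge-patch behavior described above means a PATCH touches only the fields it names. The sketch below mirrors that semantics client-side so you can predict the result; `merge_patch` is a hypothetical helper, not a Crawlio API.

```python
def merge_patch(current: dict, patch: dict) -> dict:
    """Recursively overlay patch onto current; keys absent from the
    patch are left unchanged."""
    merged = dict(current)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_patch(merged[key], value)
        else:
            merged[key] = value
    return merged

current = {"settings": {"maxConcurrent": 10, "timeout": 60},
           "policy": {"maxDepth": 0, "respectRobotsTxt": True}}
patch = {"settings": {"maxConcurrent": 20}, "policy": {"maxDepth": 5}}

# maxConcurrent -> 20 and maxDepth -> 5; timeout and respectRobotsTxt keep
# their current values because the patch does not mention them.
print(merge_patch(current, patch))
```
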

Projects

Method Path Description
GET /projects List all saved projects
POST /projects Save the current project. Optional body: { "name": "..." }
GET /projects/{id} Get full project details (settings, policy, seed URL, destination)
DELETE /projects/{id} Delete a saved project
POST /projects/{id}/load Load a saved project, restoring settings and state

List response

{
  "projects": [
    {
      "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "name": "Stripe Docs",
      "seedURL": "https://docs.stripe.com",
      "createdAt": "2026-02-14T10:30:00Z",
      "destinationPath": "/Users/you/Downloads/Crawlio/docs.stripe.com"
    }
  ]
}
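To load a project by its seed URL you first need its `id` from the list response. A lookup sketch, using the sample response above; `find_project_id` is a hypothetical helper.

```python
def find_project_id(list_response: dict, seed_url: str):
    """Return the id of the first saved project whose seedURL matches,
    or None if there is no match."""
    for project in list_response.get("projects", []):
        if project.get("seedURL") == seed_url:
            return project["id"]
    return None

listing = {"projects": [{
    "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "name": "Stripe Docs",
    "seedURL": "https://docs.stripe.com",
}]}
print(find_project_id(listing, "https://docs.stripe.com"))
# The returned id is what you POST to /projects/{id}/load.
```
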

Downloads and logs

Method Path Description
GET /downloads Downloaded files with status, size, and timing
GET /failed-urls Failed downloads with error messages
GET /site-tree ASCII directory tree of downloaded files
GET /crawled-urls All URLs with filtering and pagination. Query: ?status=completed&type=html&limit=1000&offset=0
GET /logs Log entries. Query: ?category=engine&level=info&limit=50

Crawled URLs response

{
  "total": 150,
  "offset": 0,
  "limit": 1000,
  "urls": [
    {
      "url": "https://example.com/index.html",
      "status": "completed",
      "bytesReceived": 15234,
      "contentType": "text/html",
      "statusCode": 200
    }
  ]
}

Status filter values: completed, downloading, failed, queued

Log categories: engine, download, parser, localizer, network

Log levels: debug, info, notice, warning, error, fault
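A query-string builder for /crawled-urls that validates the status filter against the values listed above. `crawled_urls_query` is a hypothetical helper; the parameter names match the query table.

```python
from urllib.parse import urlencode

VALID_STATUSES = {"completed", "downloading", "failed", "queued"}

def crawled_urls_query(status=None, type_=None, limit=1000, offset=0):
    """Build the /crawled-urls path with filtering and pagination params."""
    if status is not None and status not in VALID_STATUSES:
        raise ValueError(f"unknown status filter: {status}")
    params = {"limit": limit, "offset": offset}
    if status is not None:
        params["status"] = status
    if type_ is not None:
        params["type"] = type_
    return "/crawled-urls?" + urlencode(params)

print(crawled_urls_query(status="completed", type_="html"))
# → /crawled-urls?limit=1000&offset=0&status=completed&type=html
```
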


Export and extraction

Method Path Description
POST /export Start an async export. Body: { "format": "warc", "destinationPath": "~/archives/site.warc" }
GET /export/status Export progress. States: idle, exporting, completed, failed
POST /extract Start the content extraction pipeline. Body: { "destinationPath": "~/extracted" }
GET /extract/status Extraction progress with phase and completion stats

Export formats

folder, zip, singleHTML, warc, extracted, deploy

Export status response

In progress:

{ "state": "exporting", "format": "warc", "progress": 0.45 }

Completed:

{ "state": "completed", "path": "~/archives/site.warc" }
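Since /export is async, a client typically polls /export/status until it reaches a terminal state. A transport-agnostic polling sketch: `wait_for_export` is a hypothetical helper that takes a `fetch_status` callable (your HTTP GET) so the loop itself needs no live server.

```python
import time

TERMINAL_STATES = {"completed", "failed"}

def wait_for_export(fetch_status, poll_seconds=1.0, max_polls=600):
    """Poll /export/status via fetch_status until the export completes
    or fails, or until max_polls is exhausted."""
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("state") in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("export did not reach a terminal state in time")

# Demo with canned responses instead of a live server:
responses = iter([{"state": "exporting", "format": "warc", "progress": 0.45},
                  {"state": "completed", "path": "~/archives/site.warc"}])
print(wait_for_export(lambda: next(responses), poll_seconds=0))
```

The same loop works for /extract/status, which uses the same terminal states.
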

Extraction status response

{
  "state": "completed",
  "totalPages": 247,
  "totalAssets": 1203,
  "missingSources": 3,
  "frameworkMode": "nextjs"
}

Enrichment

Method Path Description
GET /enrichment Get stored enrichment data. Pass ?url=... to filter by URL
POST /enrichment/bundle Submit a complete enrichment bundle (all data types)
POST /enrichment/framework Submit framework detection data
POST /enrichment/network Submit network request data. Discovered URLs are offered to the crawl engine
POST /enrichment/console Submit console log data
POST /enrichment/dom Submit a DOM snapshot

Bundle request

{
  "url": "https://example.com",
  "framework": { "name": "nextjs", "version": "14.1", "confidence": 0.95 },
  "networkRequests": [
    { "url": "https://cdn.example.com/bundle.js", "method": "GET", "status": 200, "type": "script" }
  ],
  "consoleLogs": [
    { "level": "error", "message": "Failed to load resource", "timestamp": 1708000000 }
  ],
  "domSnapshotJSON": "{...}"
}

Bundle response

{
  "stored": true,
  "url": "https://example.com",
  "fields": ["framework", "networkRequests", "consoleLogs", "domSnapshotJSON"],
  "urlsOffered": 1
}
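A bundle may carry any subset of the four data types, so a builder should omit absent sections rather than send empty ones. `make_bundle` is a hypothetical helper; the key names match the bundle request above.

```python
import json

def make_bundle(url, framework=None, network_requests=None,
                console_logs=None, dom_snapshot_json=None):
    """Assemble an /enrichment/bundle body, including only the
    sections that were actually captured."""
    bundle = {"url": url}
    if framework is not None:
        bundle["framework"] = framework
    if network_requests is not None:
        bundle["networkRequests"] = network_requests
    if console_logs is not None:
        bundle["consoleLogs"] = console_logs
    if dom_snapshot_json is not None:
        bundle["domSnapshotJSON"] = dom_snapshot_json
    return json.dumps(bundle)

print(make_bundle("https://example.com",
                  framework={"name": "nextjs", "version": "14.1",
                             "confidence": 0.95}))
```
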

Capture

Method Path Description
POST /capture Trigger a WebKit runtime capture for a URL. Runs framework detection, intercepts network requests, captures console logs, and takes a DOM snapshot

Request

{ "url": "https://example.com" }

Response (202)

{ "status": "capturing", "url": "https://example.com" }

Analysis

Method Path Description
GET /structured-data JSON-LD, tables, microdata, and RDFa. Query: ?url=...&type=...
GET /observations Query the observation log. Query: ?host=...&op=...&source=...&since=T&limit=N
GET /observation/{id} Look up a single observation by ID (prefix match)
POST /finding Create a curated finding with evidence
GET /findings Query curated findings. Query: ?host=...&limit=N

Observation sources

extension, webkit, agent

Create finding request

{
  "title": "React hydration errors on /products page",
  "url": "https://example.com/products",
  "evidence": ["obs-id-1", "obs-id-2"],
  "synthesis": "Multiple hydration mismatches suggest SSR/client HTML divergence."
}

Create finding response (201)

{ "status": "created", "id": "finding-uuid-..." }

Intelligence (Pro tier)

Method Path Description
GET /tech-stack Detected technologies with name, categories, confidence, version, and signals
GET /seo-findings SEO analysis findings. Query: ?severity=...&category=...
GET /design-intel Design system data (colors, typography, spacing, breakpoints, components)
GET /keyword-intel Keyword analysis (top keywords, co-occurring groups, density)
GET /duplicate-content Exact duplicates and near-duplicates with similarity scores

These endpoints require a Pro license. They return a tier-gating error on Free and Core tiers.


Vault

Method Path Description
POST /vault/request-login Open the auth browser so you can log in. The session is captured and stored. Body: { "domain": "example.com", "loginURL": "https://example.com/login" }

Debug

Method Path Description
GET /debug/metrics Runtime metrics
GET, POST /debug/log-level Read or set the current log level
POST /debug/dump-state Dump the full engine state for debugging

System

Method Path Description
GET /health Server health check. Returns { "status": "ok", "version": "..." }
GET /license License status and tier
POST /open-folder Open the download output directory in Finder

Error handling

All endpoints return JSON. On error:

{ "error": "Description of the error" }

HTTP status codes

Code Meaning
200 Success
201 Created (new finding, new project)
202 Accepted (async operation started)
304 Not Modified (no changes since ?since=N)
400 Bad request (missing fields, invalid parameters)
404 Not found (project, enrichment data)
409 Conflict (settings change while crawling)
429 Rate limited (free tier weekly crawl limit)
500 Internal server error

Free tier rate limit (429)

{
  "error": "free_tier_limit",
  "message": "Free tier limit reached (5/5 crawls this week).",
  "resetsAt": "2026-02-15T00:00:00Z"
}
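A client hitting the free-tier limit can use `resetsAt` to schedule a retry instead of hammering the endpoint. `seconds_until_reset` is a hypothetical helper; the field names match the 429 body above.

```python
from datetime import datetime, timezone

def seconds_until_reset(error_body: dict, now=None):
    """Seconds until the free-tier weekly limit resets, per the 429 body."""
    # fromisoformat predates full "Z" support, so normalize the suffix.
    resets_at = datetime.fromisoformat(
        error_body["resetsAt"].replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return max(0.0, (resets_at - now).total_seconds())

body = {"error": "free_tier_limit", "resetsAt": "2026-02-15T00:00:00Z"}
reference = datetime(2026, 2, 14, 23, 0, tzinfo=timezone.utc)
print(seconds_until_reset(body, now=reference))  # → 3600.0
```
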

© 2026 Crawlio. All rights reserved.