Discovery Commands
Discovery commands auto-locate well-known resources on a target site. Each command tries a ranked list of candidate paths (`robots.txt` directives, `<link>` tags, filesystem conventions) and stops at the first successful hit. The dialogue narrates which candidate matched.
All four discovery commands are Core tier.
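Conceptually, all four commands share one first-hit loop; only the candidate list differs. A minimal TypeScript sketch of that loop, where the `discover` helper and its return shape are illustrative, not the shipped implementation:

```ts
// Illustrative sketch: try ranked candidates in order and stop at the
// first 2xx response. The per-command candidate chains are listed in
// each command's "Discovery chain" below.
async function discover(
  base: string,
  candidates: string[],
): Promise<{ response: Response; discoveredPath: string } | null> {
  for (const path of candidates) {
    const response = await fetch(new URL(path, base));
    if (response.ok) {
      return { response, discoveredPath: path }; // first hit wins
    }
  }
  return null; // every candidate missed
}
```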
/sitemap <url>
Discover the site's sitemap and return every URL it exposes. Results are capped at 500 URLs to avoid pathological crawls on million-URL sitemap indexes. In the desktop app, discovered URLs are seeded into the current project's crawl.
Discovery chain: `/robots.txt` (parse `Sitemap:` directives) → `/sitemap.xml` → `/sitemap_index.xml` → `/wp-sitemap.xml` → `/sitemap/sitemap.xml` → `/sitemap.xml.gz` → `/sitemap_index.xml.gz`.
`robots.txt` is checked first (the RFC-sanctioned location). Only on a miss does the command fall through to the filesystem candidate paths. Gzipped sitemaps are transparently decompressed via a streaming `compression_stream`.
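A sketch of the two non-obvious steps, assuming the standard Fetch and Compression Streams APIs; `sitemapDirectives` and `fetchSitemapXML` are hypothetical names, not the real ones:

```ts
// Hypothetical helper: collect Sitemap: directives from a robots.txt body.
function sitemapDirectives(robotsTxt: string): string[] {
  return robotsTxt
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^sitemap:/i.test(line))
    .map((line) => line.slice("sitemap:".length).trim());
}

// Hypothetical helper: fetch a sitemap candidate, transparently gunzipping
// *.gz responses with the standard DecompressionStream API.
async function fetchSitemapXML(url: string): Promise<string> {
  const res = await fetch(url);
  const body = url.endsWith(".gz")
    ? res.body!.pipeThrough(new DecompressionStream("gzip"))
    : res.body!;
  return new Response(body).text();
}
```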
Output
| Field | Type | Description |
|---|---|---|
| `urls` | `[URL]` | Up to 500 discovered URLs |
| `discoveredPath` | `String` | The candidate path that matched (e.g. `/robots.txt` → `https://example.com/sitemap.xml`) |
| `urlCount` | `Int` | Total count of URLs found |
CLI

```sh
crawlio cmd sitemap https://wordpress.org
```

MCP

```json
{ "name": "slash_sitemap", "arguments": { "url": "https://wordpress.org" } }
```

Returns `{ urls: string[], discoveredPath: string, urlCount: number }`.
/rss <url>
Discover and parse the site's RSS or Atom feed. Prefers the landing page's `<link rel="alternate" type="application/rss+xml">` (or `application/atom+xml`) over filesystem conventions.
Discovery chain: fetch the landing page HTML → extract the first `<link rel="alternate">` with an RSS/Atom MIME type → if no match, try `/feed` → `/rss` → `/feed.xml` → `/atom.xml` → `/rss.xml` → `/index.xml`.
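A minimal sketch of the `rel="alternate"` extraction step, assuming a DOM parser is available in the runtime; `feedLinkFromHTML` is a hypothetical helper:

```ts
const FEED_TYPES = ["application/rss+xml", "application/atom+xml"];

// Hypothetical helper: return the first advertised feed URL, or null to
// signal fall-through to the /feed, /rss, ... candidate paths.
function feedLinkFromHTML(html: string, baseURL: string): string | null {
  const doc = new DOMParser().parseFromString(html, "text/html");
  for (const link of doc.querySelectorAll('link[rel="alternate"]')) {
    const type = link.getAttribute("type")?.toLowerCase() ?? "";
    const href = link.getAttribute("href");
    if (href && FEED_TYPES.includes(type)) {
      return new URL(href, baseURL).href; // resolve relative hrefs
    }
  }
  return null;
}
```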
Output
| Field | Type | Description |
|---|---|---|
| `feed` | `RSSFeed` | Parsed feed object with `title`, `description`, `entries` (each entry has `title`, `link`, `pubDate`, `description`) |
| `discoveredPath` | `String` | The candidate path or `rel=alternate` URL that matched |
CLI

```sh
crawlio cmd rss https://blog.example.com
```

MCP

```json
{ "name": "slash_rss", "arguments": { "url": "https://blog.example.com" } }
```

Returns `{ feed: { title, description, entries[] }, discoveredPath: string }`.
/openapi <url>
Discover and fetch an OpenAPI or legacy Swagger specification. JSON candidates are tried first (more common); YAML variants come last as a fallback. The response body is returned verbatim so YAML stays YAML.
Discovery chain: `/openapi.json` → `/swagger.json` → `/v1/openapi.json` → `/api/openapi.json` → `/docs/openapi.json` → `/.well-known/openapi.json` → `/openapi.yaml` → `/swagger.yaml` → `/openapi.yml`.
Each candidate is validated by checking for a top-level `openapi:` or `swagger:` key (in JSON) or for one on the first non-comment line (in YAML). Invalid documents are skipped.
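A sketch of that validation rule, with a hypothetical `looksLikeOpenAPI` helper standing in for whatever the command actually calls:

```ts
// Hypothetical helper mirroring the validation rule described above.
function looksLikeOpenAPI(body: string, format: "json" | "yaml"): boolean {
  if (format === "json") {
    try {
      const doc = JSON.parse(body);
      return doc !== null && typeof doc === "object" &&
        ("openapi" in doc || "swagger" in doc);
    } catch {
      return false; // not JSON at all
    }
  }
  // YAML: the first line that is neither blank nor a comment must open
  // with an openapi: or swagger: key.
  const first = body
    .split("\n")
    .map((line) => line.trim())
    .find((line) => line !== "" && !line.startsWith("#"));
  return /^(openapi|swagger)\s*:/.test(first ?? "");
}
```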
Output
| Field | Type | Description |
|---|---|---|
| `data` | `Data` | Raw spec bytes (JSON or YAML) |
| `format` | `"json"` or `"yaml"` | Detected format |
| `discoveredPath` | `String` | Candidate path that matched |
| `version` | `String` | OpenAPI/Swagger version string (e.g. `"3.1.0"`) |
| `pathsCount` | `Int` | Number of API paths defined in the spec |
CLI

```sh
crawlio cmd openapi https://api.example.com
```

MCP

```json
{ "name": "slash_openapi", "arguments": { "url": "https://api.example.com" } }
```

Returns `{ format: string, discoveredPath: string, version: string, pathsCount: number }`.
/robots <url>
Fetch and parse the site's `/robots.txt`. Returns both the raw file body and a parsed `RobotsPolicy` (user-agent rules, allowed/disallowed paths, sitemap directives, crawl-delay).
Discovery: single candidate path `/robots.txt`. A 4xx/5xx response is reported as "not found" rather than as a transient error.
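A minimal sketch of what a `RobotsPolicy` along those lines could look like, paired with a line-oriented parser; real robots.txt handling (grouping consecutive `User-agent` lines, per-agent crawl-delay) is more involved:

```ts
// Illustrative shape only; the real RobotsPolicy may differ.
interface RobotsPolicy {
  rules: { userAgent: string; allow: string[]; disallow: string[] }[];
  sitemapURLs: string[];
  crawlDelay?: number;
}

// Hypothetical helper: one pass over the file, comments stripped,
// directives matched case-insensitively.
function parseRobots(body: string): RobotsPolicy {
  const policy: RobotsPolicy = { rules: [], sitemapURLs: [] };
  let current: RobotsPolicy["rules"][number] | null = null;
  for (const raw of body.split("\n")) {
    const line = raw.replace(/#.*$/, "").trim();
    const [key, ...rest] = line.split(":");
    const value = rest.join(":").trim(); // keep ":" inside URLs intact
    if (!value) continue;
    switch (key.trim().toLowerCase()) {
      case "user-agent":
        current = { userAgent: value, allow: [], disallow: [] };
        policy.rules.push(current);
        break;
      case "allow":
        current?.allow.push(value);
        break;
      case "disallow":
        current?.disallow.push(value);
        break;
      case "sitemap":
        policy.sitemapURLs.push(value);
        break;
      case "crawl-delay":
        policy.crawlDelay = Number(value);
        break;
    }
  }
  return policy;
}
```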
Output
| Field | Type | Description |
|---|---|---|
| `body` | `String` | Raw robots.txt content |
| `policy` | `RobotsPolicy` | Parsed policy: user-agent rules, allow/disallow paths, sitemap URLs, crawl-delay |
CLI

```sh
crawlio cmd robots https://example.com
```

MCP

```json
{ "name": "slash_robots", "arguments": { "url": "https://example.com" } }
```

Returns `{ body: string, policy: { rules[], sitemapURLs[], crawlDelay? } }`.