Discovery Commands
Discovery commands auto-locate well-known resources on a target site. Each command tries a ranked list of candidate paths (`robots.txt` directives, `<link>` tags, filesystem conventions) and stops at the first successful hit. The dialogue narrates which candidate matched.
All four discovery commands are Core tier.
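Conceptually, all four commands share one first-hit loop; only the candidate list differs. A minimal TypeScript sketch of that loop, where the `discover` helper and its return shape are illustrative, not the shipped implementation:

```ts
// Illustrative sketch: try ranked candidates in order and stop at the
// first 2xx response. The per-command candidate chains are listed in
// each command's "Discovery chain" below.
async function discover(
  base: string,
  candidates: string[],
): Promise<{ response: Response; discoveredPath: string } | null> {
  for (const path of candidates) {
    const response = await fetch(new URL(path, base));
    if (response.ok) {
      return { response, discoveredPath: path }; // first hit wins
    }
  }
  return null; // every candidate missed
}
```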
/sitemap <url>
Discover the site's sitemap and return every URL it exposes. Results are capped at 500 URLs to avoid pathological crawls on million-URL sitemap indexes. In the desktop app, discovered URLs are seeded into the current project's crawl.
Discovery chain: `/robots.txt` (parse `Sitemap:` directives) → `/sitemap.xml` → `/sitemap_index.xml` → `/wp-sitemap.xml` → `/sitemap/sitemap.xml` → `/sitemap.xml.gz` → `/sitemap_index.xml.gz`.
`robots.txt` is checked first (the RFC-sanctioned location). Only on a miss does the command fall through to the filesystem candidate paths. Gzipped sitemaps are transparently decompressed via a streaming `compression_stream`.
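A sketch of the two non-obvious steps, assuming the standard Fetch and Compression Streams APIs; `sitemapDirectives` and `fetchSitemapXML` are hypothetical names, not the real ones:

```ts
// Hypothetical helper: collect Sitemap: directives from a robots.txt body.
function sitemapDirectives(robotsTxt: string): string[] {
  return robotsTxt
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^sitemap:/i.test(line))
    .map((line) => line.slice("sitemap:".length).trim());
}

// Hypothetical helper: fetch a sitemap candidate, transparently gunzipping
// *.gz responses with the standard DecompressionStream API.
async function fetchSitemapXML(url: string): Promise<string> {
  const res = await fetch(url);
  const body = url.endsWith(".gz")
    ? res.body!.pipeThrough(new DecompressionStream("gzip"))
    : res.body!;
  return new Response(body).text();
}
```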
Output
| Field | Type | Description |
|---|---|---|
| `urls` | `[URL]` | Up to 500 discovered URLs |
| `discoveredPath` | `String` | The candidate path that matched (e.g. `/robots.txt` → `https://example.com/sitemap.xml`) |
| `urlCount` | `Int` | Total count of URLs found |
CLI

```sh
crawlio cmd sitemap https://wordpress.org
```

MCP

```json
{ "name": "slash_sitemap", "arguments": { "url": "https://wordpress.org" } }
```

Returns `{ urls: string[], discoveredPath: string, urlCount: number }`.
/rss <url>
Discover and parse the site's RSS or Atom feed. Prefers the landing page's `<link rel="alternate" type="application/rss+xml">` (or `application/atom+xml`) over filesystem conventions.
Discovery chain: fetch the landing page HTML → extract the first `<link rel="alternate">` with an RSS/Atom MIME type → if no match, try `/feed` → `/rss` → `/feed.xml` → `/atom.xml` → `/rss.xml` → `/index.xml`.
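A minimal sketch of the `rel="alternate"` extraction step, assuming a DOM parser is available in the runtime; `feedLinkFromHTML` is a hypothetical helper:

```ts
const FEED_TYPES = ["application/rss+xml", "application/atom+xml"];

// Hypothetical helper: return the first advertised feed URL, or null to
// signal fall-through to the /feed, /rss, ... candidate paths.
function feedLinkFromHTML(html: string, baseURL: string): string | null {
  const doc = new DOMParser().parseFromString(html, "text/html");
  for (const link of doc.querySelectorAll('link[rel="alternate"]')) {
    const type = link.getAttribute("type")?.toLowerCase() ?? "";
    const href = link.getAttribute("href");
    if (href && FEED_TYPES.includes(type)) {
      return new URL(href, baseURL).href; // resolve relative hrefs
    }
  }
  return null;
}
```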
Output
| Field | Type | Description |
|---|---|---|
| `feed` | `RSSFeed` | Parsed feed object with `title`, `description`, `entries` (each entry has `title`, `link`, `pubDate`, `description`) |
| `discoveredPath` | `String` | The candidate path or `rel=alternate` URL that matched |
CLI

```sh
crawlio cmd rss https://blog.example.com
```

MCP

```json
{ "name": "slash_rss", "arguments": { "url": "https://blog.example.com" } }
```

Returns `{ feed: { title, description, entries[] }, discoveredPath: string }`.
/openapi <url>
Discover and fetch an OpenAPI or legacy Swagger specification. JSON candidates are tried first (more common); YAML variants come last as a fallback. The response body is returned verbatim so YAML stays YAML.
Discovery chain: `/openapi.json` → `/swagger.json` → `/v1/openapi.json` → `/api/openapi.json` → `/docs/openapi.json` → `/.well-known/openapi.json` → `/openapi.yaml` → `/swagger.yaml` → `/openapi.yml`.
Each candidate is validated by checking for a top-level `openapi:` or `swagger:` key (in JSON) or for one on the first non-comment line (in YAML). Invalid documents are skipped.
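A sketch of that validation rule, with a hypothetical `looksLikeOpenAPI` helper standing in for whatever the command actually calls:

```ts
// Hypothetical helper mirroring the validation rule described above.
function looksLikeOpenAPI(body: string, format: "json" | "yaml"): boolean {
  if (format === "json") {
    try {
      const doc = JSON.parse(body);
      return doc !== null && typeof doc === "object" &&
        ("openapi" in doc || "swagger" in doc);
    } catch {
      return false; // not JSON at all
    }
  }
  // YAML: the first line that is neither blank nor a comment must open
  // with an openapi: or swagger: key.
  const first = body
    .split("\n")
    .map((line) => line.trim())
    .find((line) => line !== "" && !line.startsWith("#"));
  return /^(openapi|swagger)\s*:/.test(first ?? "");
}
```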
Output
| Field | Type | Description |
|---|---|---|
| `data` | `Data` | Raw spec bytes (JSON or YAML) |
| `format` | `"json"` or `"yaml"` | Detected format |
| `discoveredPath` | `String` | Candidate path that matched |
| `version` | `String` | OpenAPI/Swagger version string (e.g. `"3.1.0"`) |
| `pathsCount` | `Int` | Number of API paths defined in the spec |
CLI

```sh
crawlio cmd openapi https://api.example.com
```

MCP

```json
{ "name": "slash_openapi", "arguments": { "url": "https://api.example.com" } }
```

Returns `{ format: string, discoveredPath: string, version: string, pathsCount: number }`.
/robots <url>
Fetch and parse the site's `/robots.txt`. Returns both the raw file body and a parsed `RobotsPolicy` (user-agent rules, allowed/disallowed paths, sitemap directives, crawl-delay).
Discovery: single candidate path `/robots.txt`. A 4xx/5xx response is reported as "not found" rather than as a transient error.
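A minimal sketch of what a `RobotsPolicy` along those lines could look like, paired with a line-oriented parser; real robots.txt handling (grouping consecutive `User-agent` lines, per-agent crawl-delay) is more involved:

```ts
// Illustrative shape only; the real RobotsPolicy may differ.
interface RobotsPolicy {
  rules: { userAgent: string; allow: string[]; disallow: string[] }[];
  sitemapURLs: string[];
  crawlDelay?: number;
}

// Hypothetical helper: one pass over the file, comments stripped,
// directives matched case-insensitively.
function parseRobots(body: string): RobotsPolicy {
  const policy: RobotsPolicy = { rules: [], sitemapURLs: [] };
  let current: RobotsPolicy["rules"][number] | null = null;
  for (const raw of body.split("\n")) {
    const line = raw.replace(/#.*$/, "").trim();
    const [key, ...rest] = line.split(":");
    const value = rest.join(":").trim(); // keep ":" inside URLs intact
    if (!value) continue;
    switch (key.trim().toLowerCase()) {
      case "user-agent":
        current = { userAgent: value, allow: [], disallow: [] };
        policy.rules.push(current);
        break;
      case "allow":
        current?.allow.push(value);
        break;
      case "disallow":
        current?.disallow.push(value);
        break;
      case "sitemap":
        policy.sitemapURLs.push(value);
        break;
      case "crawl-delay":
        policy.crawlDelay = Number(value);
        break;
    }
  }
  return policy;
}
```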
Output
| Field | Type | Description |
|---|---|---|
| `body` | `String` | Raw robots.txt content |
| `policy` | `RobotsPolicy` | Parsed policy: user-agent rules, allow/disallow paths, sitemap URLs, crawl-delay |
CLI

```sh
crawlio cmd robots https://example.com
```

MCP

```json
{ "name": "slash_robots", "arguments": { "url": "https://example.com" } }
```

Returns `{ body: string, policy: { rules[], sitemapURLs[], crawlDelay? } }`.