Crawlio Docs

Discovery Commands

Discovery commands auto-locate well-known resources on a target site. Each command tries a ranked list of candidate paths (robots.txt directives, <link> tags, filesystem conventions) and stops at the first successful hit. The dialogue narrates which candidate matched.

All four discovery commands are Core tier.
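
Conceptually, every discovery command is the same loop: walk the ranked candidates in order, skip misses, and stop at the first body that validates. A minimal TypeScript sketch of that shape (the discover, candidates, and looksValid names are illustrative, not Crawlio internals):

// Illustrative sketch of the "try candidates in order, stop at first hit" loop.
// `candidates` and `looksValid` stand in for each command's ranked path list
// and per-format validation; they are not Crawlio's actual code.
async function discover(
  base: string,
  candidates: string[],
  looksValid: (body: string) => boolean,
): Promise<{ body: string; discoveredPath: string } | null> {
  for (const path of candidates) {
    const url = new URL(path, base).toString();
    const res = await fetch(url);              // one request per candidate
    if (!res.ok) continue;                     // 4xx/5xx -> try the next path
    const body = await res.text();
    if (looksValid(body)) {
      return { body, discoveredPath: path };   // first successful hit wins
    }
  }
  return null;                                 // nothing matched
}

Each command below plugs its own candidate list and validator into this shape.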


/sitemap <url>

Discover the site's sitemap and return every URL it exposes. Results are capped at 500 URLs to avoid pathological crawls of million-URL sitemap indexes. In the desktop app, discovered URLs are seeded into the current project's crawl.

Discovery chain: /robots.txt (parse Sitemap: directives) → /sitemap.xml → /sitemap_index.xml → /wp-sitemap.xml → /sitemap/sitemap.xml → /sitemap.xml.gz → /sitemap_index.xml.gz.

Robots.txt is checked first because it is the RFC-sanctioned location. Only on a miss does the command fall through to the filesystem candidate paths. Gzipped sitemaps are transparently decompressed with a streaming compression_stream.
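
As an illustration of those first two steps, here is a TypeScript sketch of pulling Sitemap: directives out of robots.txt and stream-decompressing a .gz candidate; the helper names are hypothetical and this is not Crawlio's actual code:

// Sketch: collect Sitemap: directives from robots.txt.
async function sitemapUrlsFromRobots(base: string): Promise<string[]> {
  const res = await fetch(new URL("/robots.txt", base));
  if (!res.ok) return [];
  const body = await res.text();
  return body
    .split("\n")
    .filter((line) => line.trim().toLowerCase().startsWith("sitemap:"))
    .map((line) => line.trim().slice("sitemap:".length).trim());
}

// Sketch: stream-decompress a gzipped sitemap candidate instead of buffering
// the compressed bytes. DecompressionStream is a standard web API.
async function fetchGzippedSitemap(url: string): Promise<string> {
  const res = await fetch(url);
  const decompressed = res.body!.pipeThrough(new DecompressionStream("gzip"));
  return await new Response(decompressed).text();
}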

Output

Field           Type    Description
urls            [URL]   Up to 500 discovered URLs
discoveredPath  String  The candidate path that matched (e.g. /robots.txt → https://example.com/sitemap.xml)
urlCount        Int     Total count of URLs found

CLI

crawlio cmd sitemap https://wordpress.org

MCP

{ "name": "slash_sitemap", "arguments": { "url": "https://wordpress.org" } }

Returns { urls: string[], discoveredPath: string, urlCount: number }.
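
If you are calling the MCP tool programmatically, the request above maps onto a client call like the following sketch with the MCP TypeScript SDK. The crawlio mcp launch command is a placeholder; use whatever invocation starts your Crawlio MCP server:

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Placeholder launch command; not a documented Crawlio invocation.
const transport = new StdioClientTransport({ command: "crawlio", args: ["mcp"] });
const client = new Client({ name: "sitemap-example", version: "1.0.0" });

await client.connect(transport);
const result = await client.callTool({
  name: "slash_sitemap",
  arguments: { url: "https://wordpress.org" },
});
console.log(result);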


/rss <url>

Discover and parse the site's RSS or Atom feed. Prefers the landing page's <link rel="alternate" type="application/rss+xml"> (or application/atom+xml) over filesystem conventions.

Discovery chain: fetch the landing page HTML → extract first <link rel="alternate"> with an RSS/Atom MIME type → if no match, try /feed → /rss → /feed.xml → /atom.xml → /rss.xml → /index.xml.
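
A rough TypeScript sketch of the rel=alternate step (a regex stands in for a real HTML parser here, and the helper is illustrative rather than Crawlio's implementation):

// Sketch: find the first <link rel="alternate"> advertising an RSS/Atom feed.
async function feedUrlFromHtml(pageUrl: string): Promise<string | null> {
  const html = await (await fetch(pageUrl)).text();
  const linkTags = html.match(/<link\b[^>]*>/gi) ?? [];
  for (const tag of linkTags) {
    if (!/rel=["']?alternate["']?/i.test(tag)) continue;
    if (!/type=["']?application\/(rss|atom)\+xml/i.test(tag)) continue;
    const href = tag.match(/href=["']([^"']+)["']/i);
    if (href) return new URL(href[1], pageUrl).toString(); // resolve relative hrefs
  }
  return null; // fall through to the filesystem candidates above
}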

Output

Field           Type     Description
feed            RSSFeed  Parsed feed object with title, description, entries (each entry has title, link, pubDate, description)
discoveredPath  String   The candidate path or rel=alternate URL that matched

CLI

crawlio cmd rss https://blog.example.com

MCP

{ "name": "slash_rss", "arguments": { "url": "https://blog.example.com" } }

Returns { feed: { title, description, entries[] }, discoveredPath: string }.


/openapi <url>

Discover and fetch an OpenAPI or legacy Swagger specification. JSON candidates are tried first (more common); YAML variants come last as a fallback. The response body is returned verbatim so YAML stays YAML.

Discovery chain: /openapi.json → /swagger.json → /v1/openapi.json → /api/openapi.json → /docs/openapi.json → /.well-known/openapi.json → /openapi.yaml → /swagger.yaml → /openapi.yml.

Each candidate is validated by looking for an openapi: or swagger: marker, as a top-level key in JSON or on the first non-comment line in YAML. Invalid documents are skipped.
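
A TypeScript sketch of that validation check, assuming the JSON case is parsed and the YAML case only inspects the first non-comment line (illustrative only, not Crawlio's code):

// Sketch: decide whether a candidate body looks like an OpenAPI/Swagger spec.
function looksLikeOpenApi(body: string): { format: "json" | "yaml"; version: string } | null {
  try {
    const doc = JSON.parse(body);
    const version = doc.openapi ?? doc.swagger;   // e.g. "3.1.0" or "2.0"
    return typeof version === "string" ? { format: "json", version } : null;
  } catch {
    // Not JSON: check the first non-comment, non-blank line of the YAML.
    const firstLine = body
      .split("\n")
      .map((l) => l.trim())
      .find((l) => l.length > 0 && !l.startsWith("#"));
    const m = firstLine?.match(/^(openapi|swagger):\s*["']?([\w.]+)/);
    return m ? { format: "yaml", version: m[2] } : null;
  }
}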

Output

Field           Type              Description
data            Data              Raw spec bytes (JSON or YAML)
format          "json" or "yaml"  Detected format
discoveredPath  String            Candidate path that matched
version         String            OpenAPI/Swagger version string (e.g. "3.1.0")
pathsCount      Int               Number of API paths defined in the spec

CLI

crawlio cmd openapi https://api.example.com

MCP

{ "name": "slash_openapi", "arguments": { "url": "https://api.example.com" } }

Returns { format: string, discoveredPath: string, version: string, pathsCount: number }.


/robots <url>

Fetch and parse the site's /robots.txt. Returns both the raw file body and a parsed RobotsPolicy (user-agent rules, allowed/disallowed paths, sitemap directives, crawl-delay).

Discovery: single candidate path /robots.txt. A 4xx/5xx response is reported as "not found" rather than a transient error.
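
A compact TypeScript sketch of that parsing step. The shape loosely mirrors the output fields below, but it is illustrative only and simplifies grouped User-agent lines to one rule per line:

// Sketch: parse robots.txt into per-user-agent rules plus global directives.
interface ParsedRobots {
  rules: { userAgent: string; allow: string[]; disallow: string[]; crawlDelay?: number }[];
  sitemapURLs: string[];
}

function parseRobots(body: string): ParsedRobots {
  const result: ParsedRobots = { rules: [], sitemapURLs: [] };
  let current: ParsedRobots["rules"][number] | null = null;
  for (const raw of body.split("\n")) {
    const line = raw.split("#")[0].trim();       // strip comments
    const idx = line.indexOf(":");
    if (idx < 0) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === "user-agent") {
      current = { userAgent: value, allow: [], disallow: [] };
      result.rules.push(current);
    } else if (key === "allow" && current) {
      current.allow.push(value);
    } else if (key === "disallow" && current) {
      current.disallow.push(value);
    } else if (key === "crawl-delay" && current) {
      current.crawlDelay = Number(value);
    } else if (key === "sitemap") {
      result.sitemapURLs.push(value);
    }
  }
  return result;
}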

Output

Field   Type          Description
body    String        Raw robots.txt content
policy  RobotsPolicy  Parsed policy: user-agent rules, allow/disallow paths, sitemap URLs, crawl-delay

CLI

crawlio cmd robots https://example.com

MCP

{ "name": "slash_robots", "arguments": { "url": "https://example.com" } }

Returns { body: string, policy: { rules[], sitemapURLs[], crawlDelay? } }.
