CLI Reference - Flags, Formats, and Data Model
CLI Flags
crawler crawl
crawler crawl [OPTIONS] <URL>

The https:// prefix is added automatically if omitted.

| Flag | Short | Default | Description |
|---|---|---|---|
| --output <PATH> | -o | Auto-generated | Output file path |
| --format <FMT> | -f | ndjson | Output format: ndjson, json, sitemap |
| --max-pages <N> | -p | 100 | Maximum pages to crawl. The effective ceiling is min(--max-pages, your tier cap): 50 anonymous, 400 signed-in, 10,000 Pro. |
| --max-depth <N> | -d | 10 | Maximum crawl depth |
| --concurrency <N> | -c | 5 | Concurrent requests |
| --delay <MS> | | 200 | Delay between requests in ms |
| --no-extract | | false | Disable content extraction (faster, smaller output) |
| --no-check-outgoing | | false | Disable outgoing link checking (faster crawls) |
| --include-html | | false | Include the raw (post-JS-render) HTML body of each HTML page as an html field. Off by default; significantly grows output size on large crawls. |
| --verbose | -v | false | Enable verbose logging |
| --quiet | -q | false | Suppress all output except errors |
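
The --max-pages ceiling rule can be illustrated with a small sketch. The tier names used as dictionary keys are hypothetical labels chosen for this example; only the numeric caps and the min() rule come from the table above.

```python
# Tier caps from the --max-pages description. The key names are
# hypothetical labels, not identifiers used by the CLI itself.
TIER_CAPS = {"anonymous": 50, "signed-in": 400, "pro": 10_000}


def effective_max_pages(requested: int = 100, tier: str = "anonymous") -> int:
    """Return min(--max-pages, tier cap), as described in the flag table."""
    return min(requested, TIER_CAPS[tier])


if __name__ == "__main__":
    for tier in TIER_CAPS:
        # With the default --max-pages of 100, only the anonymous tier is capped lower.
        print(tier, effective_max_pages(100, tier))
```

Under these assumptions, an anonymous crawl with the default --max-pages of 100 stops at 50 pages, while signed-in and Pro crawls stop at 100.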
crawler fetch
crawler fetch [OPTIONS] <URL>

Fetches a single URL. The https:// prefix is added automatically if omitted.

| Flag | Short | Default | Description |
|---|---|---|---|
| --output <PATH> | -o | Auto-generated | Output file path |
| --format <FMT> | -f | ndjson | Output format: ndjson, json, sitemap |
| --no-extract | | false | Disable content extraction |
| --user-agent <UA> | | | Custom User-Agent header |
| --include-html | | false | Include the raw HTML body of the response as an html field. JS rendering is disabled for fetch, so this is the UTF-8-decoded server bytes. |
| --verbose | -v | false | Enable verbose logging |
| --quiet | -q | false | Suppress all output except errors |
crawler info
crawler info <FILE>

No additional flags. Reads a .crawl file and displays summary statistics.
crawler export
crawler export [OPTIONS] <FILE>

| Flag | Short | Default | Description |
|---|---|---|---|
| --format <FMT> | -f | Required | Target format: json or sitemap |
| --output <PATH> | -o | Auto-generated | Output file path |
crawler seo
crawler seo [OPTIONS] <FILE>

| Flag | Short | Default | Description |
|---|---|---|---|
| --export <FMT> | | | Export format: csv, txt |
| --output <PATH> | -o | Auto-generated | Output file path for export |
crawler update
crawler update [OPTIONS]

Self-updates the CLI to the latest release. Downloads are verified against a signed manifest and a SHA-256 hash before they are installed.

| Flag | Short | Default | Description |
|---|---|---|---|
| --check | | false | Only check for a newer version; don’t install. |
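
The SHA-256 part of that verification can be sketched as follows. This is only an illustration of comparing a download against an expected digest; it is not the updater's actual code, it does not cover the signed-manifest check, and the function names are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the lowercase hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_expected(data: bytes, expected_hex: str) -> bool:
    """Compare the digest of `data` with an expected hex digest.

    hmac.compare_digest performs a constant-time comparison. Where the
    expected digest comes from (for example a manifest entry) is an
    assumption left outside this sketch.
    """
    return hmac.compare_digest(sha256_hex(data), expected_hex.strip().lower())


if __name__ == "__main__":
    payload = b"example release bytes"
    digest = sha256_hex(payload)
    print(matches_expected(payload, digest))          # digest of the same bytes
    print(matches_expected(payload + b"x", digest))   # tampered payload
```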
Page Data Model
Each crawled page contains the following fields, serialized to JSON in the output formats.

| Field | Type | Description |
|---|---|---|
| url | String | Full URL of the page |
| status_code | Number | HTTP status code |
| content_type | String? | Content-Type header value |
| title | String? | HTML <title> text |
| meta_description | String? | <meta name="description"> content |
| canonical_url | String? | <link rel="canonical"> href |
| discovered_from | String? | Parent URL that linked to this page |
| links_found | Number | Number of new same-domain links discovered |
| depth | Number | Crawl depth from the start URL |
| response_time_ms | Number | HTTP response time in milliseconds |
| markdown | String? | Extracted content as Markdown (when enabled) |
| html | String? | Raw HTML body of the response. Only present when --include-html is passed; for pages that went through JS rendering, this is the post-render HTML. |
| word_count | Number? | Word count of extracted content |
| byline | String? | Author byline from content extraction |
| excerpt | String? | Article excerpt from content extraction |
| meta_robots | String? | <meta name="robots"> content |
| x_robots_tag | String? | X-Robots-Tag HTTP header value |
| rel_next | String? | <link rel="next"> href (pagination) |
| rel_prev | String? | <link rel="prev"> href (pagination) |
| outgoing_link_errors | Array? | Broken outgoing links and images (when checking is enabled) |

Each entry in the outgoing_link_errors array is an OutgoingLinkError object:
| Field | Type | Description |
|---|---|---|
| url | String | The broken link URL |
| status_code | Number | HTTP status code (0 for connection failure) |
| error | String? | Error message for connection failures |
| link_type | String | "anchor" or "image" |

Fields marked with ? are optional and may be absent depending on the page.
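
For consumers of the output, the two tables above might be mirrored in code roughly as below. This is a sketch, not the crawler's own types: the class and method names are hypothetical, the choice of int for Number fields is an assumption, and unknown keys are simply ignored.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional


def _known_only(cls: type, raw: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only keys that correspond to dataclass fields of `cls`."""
    names = {f.name for f in fields(cls)}
    return {k: v for k, v in raw.items() if k in names}


@dataclass
class OutgoingLinkError:
    url: str
    status_code: int               # 0 for connection failure
    link_type: str                 # "anchor" or "image"
    error: Optional[str] = None


@dataclass
class Page:
    # Required fields (no ? in the table).
    url: str
    status_code: int
    links_found: int
    depth: int
    response_time_ms: int
    # Optional fields (marked with ? in the table) default to None.
    content_type: Optional[str] = None
    title: Optional[str] = None
    meta_description: Optional[str] = None
    canonical_url: Optional[str] = None
    discovered_from: Optional[str] = None
    markdown: Optional[str] = None
    html: Optional[str] = None
    word_count: Optional[int] = None
    byline: Optional[str] = None
    excerpt: Optional[str] = None
    meta_robots: Optional[str] = None
    x_robots_tag: Optional[str] = None
    rel_next: Optional[str] = None
    rel_prev: Optional[str] = None
    outgoing_link_errors: Optional[List[OutgoingLinkError]] = None

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "Page":
        """Build a Page from a decoded JSON object, tolerating absent optional keys."""
        data = _known_only(cls, raw)
        errors = data.get("outgoing_link_errors")
        if errors is not None:
            data["outgoing_link_errors"] = [
                OutgoingLinkError(**_known_only(OutgoingLinkError, e)) for e in errors
            ]
        return cls(**data)


if __name__ == "__main__":
    page = Page.from_dict({
        "url": "https://example.com", "status_code": 200,
        "links_found": 15, "depth": 0, "response_time_ms": 234,
    })
    print(page.title)  # absent optional field falls back to None
```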
Crawl Events
The crawler emits real-time events during a crawl session, used by both the CLI and desktop app to display progress.

| Event | Data | Description |
|---|---|---|
| Started | url | Crawl has begun |
| Discovered | url, depth | New URL found and enqueued |
| PageCrawled | Page object | Page successfully fetched and parsed |
| PageError | url, error | Failed to fetch a page |
| Progress | crawled, total_discovered | Periodic progress update |
| Completed | total_pages, total_errors | Crawl finished |
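
How a front end might consume these events is sketched below. The source does not say how events are encoded, so this example assumes each event arrives as a dict with a hypothetical "event" key naming the event plus the data fields from the table; the ProgressTracker class is likewise illustrative.

```python
from collections import Counter
from typing import Any, Dict, Optional


class ProgressTracker:
    """Tallies crawl events and remembers the latest progress snapshot.

    The event encoding (a dict with an "event" key) is an assumption
    made for this sketch, not a documented wire format.
    """

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.latest_progress: Optional[Dict[str, int]] = None
        self.summary: Optional[Dict[str, int]] = None

    def handle(self, event: Dict[str, Any]) -> None:
        kind = event["event"]
        self.counts[kind] += 1
        if kind == "Progress":
            self.latest_progress = {
                "crawled": event["crawled"],
                "total_discovered": event["total_discovered"],
            }
        elif kind == "Completed":
            self.summary = {
                "total_pages": event["total_pages"],
                "total_errors": event["total_errors"],
            }


if __name__ == "__main__":
    tracker = ProgressTracker()
    for ev in [
        {"event": "Started", "url": "https://example.com"},
        {"event": "Discovered", "url": "https://example.com/about", "depth": 1},
        {"event": "Progress", "crawled": 1, "total_discovered": 2},
        {"event": "Completed", "total_pages": 2, "total_errors": 0},
    ]:
        tracker.handle(ev)
    print(tracker.counts, tracker.summary)
```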
Output Formats
NDJSON (.crawl)
Default format. Newline-delimited JSON with a metadata header line followed by one JSON object per crawled page.

{"_meta":true,"version":"0.1.0","url":"https://example.com","started_at":"2024-01-01T00:00:00Z","config":{"max_pages":100,"max_depth":10,"concurrency":5}}
{"url":"https://example.com","status_code":200,"content_type":"text/html","title":"Example","meta_description":"...","links_found":15,"depth":0,"response_time_ms":234}

The first line contains crawl metadata (_meta: true), including the version, target URL, start time, and config summary. Each subsequent line is a serialized page record. This format is streamable - output is valid after every line.
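
A reader for this layout could look like the sketch below, which separates the metadata header from the page records. The function names are hypothetical, and treating any record with _meta set to true as the header (rather than strictly the first line) is an assumption.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Optional, Tuple


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def split_crawl(lines: Iterable[str]) -> Tuple[Optional[Dict[str, Any]], List[Dict[str, Any]]]:
    """Return (metadata header, page records) from NDJSON lines."""
    meta: Optional[Dict[str, Any]] = None
    pages: List[Dict[str, Any]] = []
    for record in iter_records(lines):
        if record.get("_meta") is True:
            meta = record
        else:
            pages.append(record)
    return meta, pages


if __name__ == "__main__":
    sample = [
        '{"_meta":true,"version":"0.1.0","url":"https://example.com"}',
        '{"url":"https://example.com","status_code":200,"links_found":15,"depth":0}',
    ]
    meta, pages = split_crawl(sample)
    print(meta["version"], len(pages))
```

Because iter_records accepts any iterable of lines, an open file object can be passed directly, for example `with open("example-com.crawl") as f: split_crawl(f)`.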
JSON
A JSON array of page objects, written all at once on finalization.

[
  {
    "url": "https://example.com",
    "title": "Example",
    "meta_description": "An example site",
    "canonical_url": "https://example.com",
    "discovered_from": null,
    "status": 200,
    "markdown": "# Example\n\nContent here...",
    "word_count": 150,
    "byline": "Author Name",
    "excerpt": "A brief summary..."
  }
]

Content fields (markdown, word_count, byline, excerpt) are only included when present.
Sitemap XML
Standard XML sitemap format compatible with search engines. Only includes pages with HTTP 200 status and text/html content type. Limited to 50,000 URLs per sitemap.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
</urlset>

URLs are deduplicated and XML-escaped. The <lastmod> date is set to the crawl date.
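
The filtering, deduplication, escaping, and 50,000-URL limit described above can be sketched as follows. This is an illustration of the stated rules, not the exporter's implementation: the function name is hypothetical, and treating any Content-Type that starts with text/html (such as one with a charset parameter) as HTML is an assumption.

```python
from datetime import date
from typing import Any, Dict, Iterable, List
from xml.sax.saxutils import escape

MAX_URLS = 50_000  # per-sitemap limit stated above


def build_sitemap(pages: Iterable[Dict[str, Any]], crawl_date: date) -> str:
    """Render a sitemap from page records following the rules in this section."""
    seen = set()
    urls: List[str] = []
    for page in pages:
        content_type = page.get("content_type") or ""
        # Assumption: "text/html; charset=utf-8" counts as text/html.
        if page.get("status_code") != 200 or not content_type.startswith("text/html"):
            continue
        url = page["url"]
        if url in seen:
            continue  # deduplicate
        seen.add(url)
        urls.append(url)
        if len(urls) == MAX_URLS:
            break

    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in urls:
        lines.append("  <url>")
        lines.append(f"    <loc>{escape(url)}</loc>")  # escapes &, <, >
        lines.append(f"    <lastmod>{crawl_date.isoformat()}</lastmod>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    pages = [
        {"url": "https://example.com", "status_code": 200, "content_type": "text/html"},
        {"url": "https://example.com/missing", "status_code": 404, "content_type": "text/html"},
    ]
    print(build_sitemap(pages, date(2024, 1, 1)))
```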
SEO CSV
Two columns: Issue Type and URL. One row per affected URL, with values double-quote escaped.

"Issue Type","URL"
"Missing titles","https://example.com/page-1"
"Short content","https://example.com/page-2"

SEO TXT
Human-readable report grouped by issue category with indented URLs.

Missing titles (2)
  https://example.com/page-1
  https://example.com/page-2

Short content (1)
  https://example.com/page-3

Filename Conventions
When --output is omitted, filenames are auto-generated from the domain.
crawler crawl
| Format | Pattern | Example |
|---|---|---|
| NDJSON | {domain}.crawl | example-com.crawl |
| JSON | {domain}.json | example-com.json |
| Sitemap | {domain}-sitemap.xml | example-com-sitemap.xml |
crawler fetch
Filenames include the URL path for unique, readable names:

| Format | Pattern | Example |
|---|---|---|
| NDJSON | {domain}-{slug}.crawl | example-com-about-team.crawl |
| JSON | {domain}-{slug}.json | example-com-about-team.json |
| Sitemap | {domain}-{slug}.xml | example-com-about-team.xml |
Slug rules: path segments are joined with hyphens, file extensions (.html, .php, etc.) are stripped, query parameters and fragments are ignored, and the root path becomes index.
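
The slug rules can be sketched as below. This is an approximation under stated assumptions: any extension on the final path segment is stripped (the source lists only .html and .php as examples), dots in the domain are replaced with hyphens (inferred from example-com), and the function names are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

# Output format to file extension, taken from the fetch filename table.
EXTENSIONS = {"ndjson": "crawl", "json": "json", "sitemap": "xml"}


def slug_for(url: str) -> str:
    """Build a slug from the URL path; query and fragment are ignored."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    if not segments:
        return "index"  # root path
    # Assumption: strip whatever extension the last segment carries.
    stem, _ext = posixpath.splitext(segments[-1])
    segments[-1] = stem
    return "-".join(segments)


def fetch_filename(url: str, fmt: str = "ndjson") -> str:
    """Combine {domain}-{slug} with the extension for the chosen format."""
    host = urlsplit(url).hostname or ""
    domain = host.replace(".", "-")  # inferred from "example-com" in the table
    return f"{domain}-{slug_for(url)}.{EXTENSIONS[fmt]}"


if __name__ == "__main__":
    # Matches the table's example: example-com-about-team.crawl
    print(fetch_filename("https://example.com/about/team.html?ref=nav#top"))
```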
crawler seo
| Format | Pattern | Example |
|---|---|---|
| SEO CSV | {domain}-seo.csv | example-com-seo.csv |
| SEO TXT | {domain}-seo.txt | example-com-seo.txt |
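
An exported SEO CSV such as example-com-seo.csv can be regrouped by issue type with the standard csv module. The sketch below assumes the two-column layout described under SEO CSV; the function names are hypothetical, and the two-space indent in the rendered text is a guess at the TXT report's layout rather than a documented width.

```python
import csv
import io
from typing import Dict, List


def group_issues(csv_text: str) -> Dict[str, List[str]]:
    """Group URLs by the "Issue Type" column of an SEO CSV export."""
    grouped: Dict[str, List[str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped.setdefault(row["Issue Type"], []).append(row["URL"])
    return grouped


def render_txt(grouped: Dict[str, List[str]]) -> str:
    """Render groups as "Issue (count)" headings followed by indented URLs."""
    blocks = []
    for issue, urls in grouped.items():
        lines = [f"{issue} ({len(urls)})"]
        lines.extend(f"  {url}" for url in urls)  # indent width is an assumption
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = (
        '"Issue Type","URL"\n'
        '"Missing titles","https://example.com/page-1"\n'
        '"Short content","https://example.com/page-2"\n'
    )
    print(render_txt(group_issues(sample)))
```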
System Requirements
- macOS - Apple Silicon (M1/M2/M3/M4) or Intel