CLI Reference

crawler crawl [OPTIONS] <URL>

The https:// prefix is added automatically if omitted.

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| --output <PATH> | -o | Auto-generated | Output file path |
| --format <FMT> | -f | ndjson | Output format: ndjson, json, sitemap |
| --max-pages <N> | -p | 100 | Maximum pages to crawl |
| --max-depth <N> | -d | 10 | Maximum crawl depth |
| --concurrency <N> | -c | 5 | Concurrent requests |
| --delay <MS> | | 200 | Delay between requests in ms |
| --no-extract | | false | Disable content extraction (faster, smaller output) |
| --verbose | -v | false | Enable verbose logging |
| --quiet | -q | false | Suppress all output except errors |
crawler info <FILE>

No additional flags. Reads a .crawl file and displays summary statistics.

crawler export [OPTIONS] <FILE>

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| --format <FMT> | -f | Required | Target format: json or sitemap |
| --output <PATH> | -o | Auto-generated | Output file path |

crawler seo [OPTIONS] <FILE>

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| --export <FMT> | | | Export format: csv, txt |
| --output <PATH> | -o | Auto-generated | Output file path for export |

Each crawled page contains the following fields, serialized to JSON in the output formats.

| Field | Type | Description |
|-------|------|-------------|
| url | String | Full URL of the page |
| status_code | Number | HTTP status code |
| content_type | String? | Content-Type header value |
| title | String? | HTML <title> text |
| meta_description | String? | <meta name="description"> content |
| canonical_url | String? | <link rel="canonical"> href |
| discovered_from | String? | Parent URL that linked to this page |
| links_found | Number | Number of new same-domain links discovered |
| depth | Number | Crawl depth from the start URL |
| response_time_ms | Number | HTTP response time in milliseconds |
| markdown | String? | Extracted content as Markdown (when enabled) |
| word_count | Number? | Word count of extracted content |
| byline | String? | Author byline from content extraction |
| excerpt | String? | Article excerpt from content extraction |
| meta_robots | String? | <meta name="robots"> content |
| x_robots_tag | String? | X-Robots-Tag HTTP header value |
| rel_next | String? | <link rel="next"> href (pagination) |
| rel_prev | String? | <link rel="prev"> href (pagination) |

Fields marked with ? are optional and may be absent depending on the page.
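Since only a handful of fields are guaranteed, a consumer can sanity-check records before using them. A minimal sketch in Python (the REQUIRED set and is_valid_page are illustrative helpers, not part of the tool):

```python
import json

# Fields without a "?" in the table above are always present.
REQUIRED = {"url", "status_code", "links_found", "depth", "response_time_ms"}

def is_valid_page(record: dict) -> bool:
    """A page record is valid when every required field is present;
    fields marked '?' may be absent entirely."""
    return REQUIRED <= record.keys()

line = '{"url":"https://example.com","status_code":200,"links_found":15,"depth":0,"response_time_ms":234}'
print(is_valid_page(json.loads(line)))  # True
```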

The crawler emits real-time events during a crawl session, used by both the CLI and desktop app to display progress.

| Event | Data | Description |
|-------|------|-------------|
| Started | url | Crawl has begun |
| Discovered | url, depth | New URL found and enqueued |
| PageCrawled | Page object | Page successfully fetched and parsed |
| PageError | url, error | Failed to fetch a page |
| Progress | crawled, total_discovered | Periodic progress update |
| Completed | total_pages, total_errors | Crawl finished |
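A consumer of these events only needs to dispatch on the event name. A minimal sketch, assuming events arrive as (name, data) pairs; the tuple representation is an assumption for illustration, not the tool's actual wire format:

```python
def summarize(events):
    """Tally fetched pages and errors from a stream of (name, data) events."""
    crawled = errors = 0
    for name, _data in events:
        if name == "PageCrawled":
            crawled += 1
        elif name == "PageError":
            errors += 1
    return crawled, errors

events = [
    ("Started", {"url": "https://example.com"}),
    ("Discovered", {"url": "https://example.com/a", "depth": 1}),
    ("PageCrawled", {"url": "https://example.com"}),
    ("PageError", {"url": "https://example.com/a", "error": "timeout"}),
    ("Completed", {"total_pages": 1, "total_errors": 1}),
]
print(summarize(events))  # (1, 1)
```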

Default format. Newline-delimited JSON with a metadata header line followed by one JSON object per crawled page.

```
{"_meta":true,"version":"0.1.0","url":"https://example.com","started_at":"2024-01-01T00:00:00Z","config":{"max_pages":100,"max_depth":10,"concurrency":5}}
{"url":"https://example.com","status_code":200,"content_type":"text/html","title":"Example","meta_description":"...","links_found":15,"depth":0,"response_time_ms":234}
```

The first line contains crawl metadata (_meta: true), including the version, target URL, start time, and a config summary. Each subsequent line is a serialized page record. This format is streamable: the output remains valid after every line.
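Because the format is line-delimited, a partially written file can be parsed at any point. A minimal reader sketch (read_crawl is a hypothetical helper, not part of the tool):

```python
import io
import json

def read_crawl(stream):
    """Split an NDJSON crawl stream into (metadata, page records)."""
    meta, pages = None, []
    for line in stream:
        if not line.strip():
            continue  # tolerate blank lines
        obj = json.loads(line)
        if obj.get("_meta"):
            meta = obj
        else:
            pages.append(obj)
    return meta, pages

sample = (
    '{"_meta":true,"version":"0.1.0","url":"https://example.com"}\n'
    '{"url":"https://example.com","status_code":200,"depth":0}\n'
)
meta, pages = read_crawl(io.StringIO(sample))
print(meta["version"], len(pages))  # 0.1.0 1
```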

A JSON array of page objects, written all at once on finalization.

```json
[
  {
    "url": "https://example.com",
    "title": "Example",
    "meta_description": "An example site",
    "canonical_url": "https://example.com",
    "discovered_from": null,
    "status": 200,
    "markdown": "# Example\n\nContent here...",
    "word_count": 150,
    "byline": "Author Name",
    "excerpt": "A brief summary..."
  }
]
```

Content fields (markdown, word_count, byline, excerpt) are only included when present.
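When post-processing records yourself, the same filtering can be reproduced by dropping content fields that are missing or null (to_export and CONTENT_FIELDS are illustrative names, not part of the tool):

```python
CONTENT_FIELDS = ("markdown", "word_count", "byline", "excerpt")

def to_export(record: dict) -> dict:
    """Copy a page record, omitting content fields that are absent or None."""
    out = dict(record)
    for field in CONTENT_FIELDS:
        if out.get(field) is None:
            out.pop(field, None)
    return out

page = {"url": "https://example.com", "title": "Example", "markdown": None, "word_count": None}
print(to_export(page))  # {'url': 'https://example.com', 'title': 'Example'}
```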

Standard XML sitemap format compatible with search engines. Only includes pages with HTTP 200 status and text/html content type. Limited to 50,000 URLs per sitemap.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
</urlset>
```

URLs are deduplicated and XML-escaped. The <lastmod> date is set to the crawl date.
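The filtering, deduplication, and escaping rules can be sketched as follows (the sitemap function is illustrative; it assumes page records use the status_code and content_type fields from the data-model table):

```python
from xml.sax.saxutils import escape

def sitemap(pages, crawl_date, limit=50_000):
    """Emit a sitemap for unique 200-OK text/html pages, XML-escaped."""
    seen, urls = set(), []
    for p in pages:
        ok = (p.get("status_code") == 200
              and (p.get("content_type") or "").startswith("text/html"))
        if ok and p["url"] not in seen:
            seen.add(p["url"])
            urls.append(p["url"])
    body = "".join(
        f"  <url>\n    <loc>{escape(u)}</loc>\n"
        f"    <lastmod>{crawl_date}</lastmod>\n  </url>\n"
        for u in urls[:limit]
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{body}</urlset>\n")

pages = [
    {"url": "https://example.com/a?x=1&y=2", "status_code": 200, "content_type": "text/html; charset=utf-8"},
    {"url": "https://example.com/a?x=1&y=2", "status_code": 200, "content_type": "text/html"},
    {"url": "https://example.com/logo.png", "status_code": 200, "content_type": "image/png"},
]
print(sitemap(pages, "2024-01-01"))
```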

Two columns: Issue Type and URL. One row per affected URL, with all values double-quoted and embedded quotes escaped.

```csv
"Issue Type","URL"
"Missing titles","https://example.com/page-1"
"Short content","https://example.com/page-2"
```
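This quoting rule is what Python's csv module calls QUOTE_ALL; a sketch that reproduces the shape above (seo_csv is an illustrative helper, not part of the tool):

```python
import csv
import io

def seo_csv(issues):
    """Write (issue_type, url) rows with every value double-quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["Issue Type", "URL"])
    for issue_type, url in issues:
        writer.writerow([issue_type, url])
    return buf.getvalue()

print(seo_csv([("Missing titles", "https://example.com/page-1")]))
```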

Human-readable report grouped by issue category with indented URLs.

```
Missing titles (2)
  https://example.com/page-1
  https://example.com/page-2
Short content (1)
  https://example.com/page-3
```
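The grouping and counting behind this layout can be sketched in a few lines (seo_txt is an illustrative helper, not part of the tool):

```python
from collections import defaultdict

def seo_txt(issues):
    """Group (issue_type, url) rows by category, with counts and indented URLs."""
    groups = defaultdict(list)
    for issue_type, url in issues:
        groups[issue_type].append(url)
    lines = []
    for issue_type, urls in groups.items():
        lines.append(f"{issue_type} ({len(urls)})")
        lines.extend(f"  {u}" for u in urls)
    return "\n".join(lines)

report = seo_txt([
    ("Missing titles", "https://example.com/page-1"),
    ("Missing titles", "https://example.com/page-2"),
    ("Short content", "https://example.com/page-3"),
])
print(report)
```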

When --output is omitted, filenames are auto-generated from the domain:

| Format | Pattern | Example |
|--------|---------|---------|
| NDJSON | {domain}.crawl | example-com.crawl |
| JSON | {domain}.json | example-com.json |
| Sitemap | {domain}-sitemap.xml | example-com-sitemap.xml |
| SEO CSV | {domain}-seo.csv | example-com-seo.csv |
| SEO TXT | {domain}-seo.txt | example-com-seo.txt |
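Judging by the pattern/example pairs, name generation amounts to hyphenating the dots in the host and applying a per-format template. A sketch under that assumption (the hyphenation rule is inferred from the examples; the format keys and edge cases such as ports or IDN hosts are not specified here):

```python
from urllib.parse import urlparse

# Templates from the table above; the dictionary keys are illustrative.
PATTERNS = {
    "ndjson": "{domain}.crawl",
    "json": "{domain}.json",
    "sitemap": "{domain}-sitemap.xml",
    "seo-csv": "{domain}-seo.csv",
    "seo-txt": "{domain}-seo.txt",
}

def default_filename(url: str, fmt: str) -> str:
    """Derive an output filename from the target URL's host."""
    domain = (urlparse(url).hostname or "").replace(".", "-")
    return PATTERNS[fmt].format(domain=domain)

print(default_filename("https://example.com", "sitemap"))  # example-com-sitemap.xml
```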
  • macOS - Apple Silicon (M1/M2/M3/M4) or Intel