# CLI Reference

## CLI Flags

### crawler crawl

```
crawler crawl [OPTIONS] <URL>
```

The https:// prefix is added automatically if omitted.
| Flag | Short | Default | Description |
|---|---|---|---|
| --output <PATH> | -o | Auto-generated | Output file path |
| --format <FMT> | -f | ndjson | Output format: ndjson, json, sitemap |
| --max-pages <N> | -p | 100 | Maximum pages to crawl |
| --max-depth <N> | -d | 10 | Maximum crawl depth |
| --concurrency <N> | -c | 5 | Concurrent requests |
| --delay <MS> | | 200 | Delay between requests in ms |
| --no-extract | | false | Disable content extraction (faster, smaller output) |
| --verbose | -v | false | Enable verbose logging |
| --quiet | -q | false | Suppress all output except errors |
### crawler info

```
crawler info <FILE>
```

No additional flags. Reads a .crawl file and displays summary statistics.
### crawler export

```
crawler export [OPTIONS] <FILE>
```

| Flag | Short | Default | Description |
|---|---|---|---|
| --format <FMT> | -f | Required | Target format: json or sitemap |
| --output <PATH> | -o | Auto-generated | Output file path |
### crawler seo

```
crawler seo [OPTIONS] <FILE>
```

| Flag | Short | Default | Description |
|---|---|---|---|
| --export <FMT> | | | Export format: csv, txt |
| --output <PATH> | -o | Auto-generated | Output file path for export |
## Page Data Model

Each crawled page contains the following fields, serialized to JSON in the output formats.

| Field | Type | Description |
|---|---|---|
| url | String | Full URL of the page |
| status_code | Number | HTTP status code |
| content_type | String? | Content-Type header value |
| title | String? | HTML <title> text |
| meta_description | String? | <meta name="description"> content |
| canonical_url | String? | <link rel="canonical"> href |
| discovered_from | String? | Parent URL that linked to this page |
| links_found | Number | Number of new same-domain links discovered |
| depth | Number | Crawl depth from the start URL |
| response_time_ms | Number | HTTP response time in milliseconds |
| markdown | String? | Extracted content as Markdown (when enabled) |
| word_count | Number? | Word count of extracted content |
| byline | String? | Author byline from content extraction |
| excerpt | String? | Article excerpt from content extraction |
| meta_robots | String? | <meta name="robots"> content |
| x_robots_tag | String? | X-Robots-Tag HTTP header value |
| rel_next | String? | <link rel="next"> href (pagination) |
| rel_prev | String? | <link rel="prev"> href (pagination) |

Fields marked with ? are optional and may be absent depending on the page.
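As an illustration of how these fields compose, here is a sketch (not part of the tool) that scans already-parsed page records for error responses or slow pages. The sample records and the 1000 ms threshold are invented for the example:

```python
# Sketch: flag slow or failing pages from parsed page records.
# The sample records below are invented; real records come from a crawl.
pages = [
    {"url": "https://example.com/", "status_code": 200, "response_time_ms": 234, "depth": 0},
    {"url": "https://example.com/a", "status_code": 404, "response_time_ms": 120, "depth": 1},
    {"url": "https://example.com/b", "status_code": 200, "response_time_ms": 2100, "depth": 2},
]

def problem_pages(pages, slow_ms=1000):
    """Return URLs that errored or exceeded the response-time budget."""
    return [
        p["url"]
        for p in pages
        if p["status_code"] >= 400 or p["response_time_ms"] > slow_ms
    ]

print(problem_pages(pages))  # ['https://example.com/a', 'https://example.com/b']
```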
## Crawl Events

The crawler emits real-time events during a crawl session, used by both the CLI and desktop app to display progress.

| Event | Data | Description |
|---|---|---|
| Started | url | Crawl has begun |
| Discovered | url, depth | New URL found and enqueued |
| PageCrawled | Page object | Page successfully fetched and parsed |
| PageError | url, error | Failed to fetch a page |
| Progress | crawled, total_discovered | Periodic progress update |
| Completed | total_pages, total_errors | Crawl finished |
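A consumer of these events might look like the following sketch. The wire format is not specified here, so events are modeled as hypothetical (name, data) pairs matching the table above:

```python
# Sketch of a progress consumer for the events above. The exact
# serialization is an assumption; events are modeled as (name, data) pairs.
def handle_event(name, data, state):
    if name == "Started":
        state["crawled"] = 0
    elif name == "Discovered":
        state.setdefault("queued", []).append(data["url"])
    elif name == "PageCrawled":
        state["crawled"] = state.get("crawled", 0) + 1
    elif name == "Progress":
        state["total_discovered"] = data["total_discovered"]
    elif name == "Completed":
        state["done"] = True
    return state

state = {}
for name, data in [
    ("Started", {"url": "https://example.com"}),
    ("Discovered", {"url": "https://example.com/a", "depth": 1}),
    ("PageCrawled", {"url": "https://example.com/a"}),
    ("Completed", {"total_pages": 1, "total_errors": 0}),
]:
    handle_event(name, data, state)

print(state["crawled"], state.get("done"))  # 1 True
```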
## Output Formats

### NDJSON (.crawl)

Default format. Newline-delimited JSON with a metadata header line followed by one JSON object per crawled page.

```
{"_meta":true,"version":"0.1.0","url":"https://example.com","started_at":"2024-01-01T00:00:00Z","config":{"max_pages":100,"max_depth":10,"concurrency":5}}
{"url":"https://example.com","status_code":200,"content_type":"text/html","title":"Example","meta_description":"...","links_found":15,"depth":0,"response_time_ms":234}
```

The first line contains crawl metadata (_meta: true), including the version, target URL, start time, and config summary. Each subsequent line is a serialized page record. This format is streamable: the output is valid after every line.
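Because the metadata header shares the stream with page records, a reader only needs to branch on the _meta marker. A minimal sketch, using an in-memory sample mirroring the format above:

```python
import io
import json

# Sketch: stream NDJSON crawl output, separating the metadata header
# line from the page records. The sample content is illustrative.
sample = (
    '{"_meta":true,"version":"0.1.0","url":"https://example.com"}\n'
    '{"url":"https://example.com","status_code":200,"depth":0}\n'
)

def read_crawl(fp):
    meta, pages = None, []
    for line in fp:
        record = json.loads(line)
        if record.get("_meta"):
            meta = record
        else:
            pages.append(record)
    return meta, pages

meta, pages = read_crawl(io.StringIO(sample))
print(meta["version"], len(pages))  # 0.1.0 1
```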
### JSON

A JSON array of page objects, written all at once on finalization.

```json
[
  {
    "url": "https://example.com",
    "title": "Example",
    "meta_description": "An example site",
    "canonical_url": "https://example.com",
    "discovered_from": null,
    "status": 200,
    "markdown": "# Example\n\nContent here...",
    "word_count": 150,
    "byline": "Author Name",
    "excerpt": "A brief summary..."
  }
]
```

Content fields (markdown, word_count, byline, excerpt) are only included when present.
### Sitemap XML

Standard XML sitemap format compatible with search engines. Only includes pages with HTTP 200 status and text/html content type. Limited to 50,000 URLs per sitemap.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
</urlset>
```

URLs are deduplicated and XML-escaped. The <lastmod> date is set to the crawl date.
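Since the sitemap uses the standard namespace, it can be read back with any XML library. A sketch using Python's standard library against the example above:

```python
import xml.etree.ElementTree as ET

# Sketch: extract <loc> entries from a generated sitemap.
# The sitemap string below mirrors the example output above.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>https://example.com</loc><lastmod>2024-01-01</lastmod></url>'
    '</urlset>'
)
root = ET.fromstring(sitemap)
locs = [el.text for el in root.findall("sm:url/sm:loc", NS)]
print(locs)  # ['https://example.com']
```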
### SEO CSV

Two columns: Issue Type and URL. One row per affected URL, with values double-quote escaped.

```csv
"Issue Type","URL"
"Missing titles","https://example.com/page-1"
"Short content","https://example.com/page-2"
```

### SEO TXT

Human-readable report grouped by issue category with indented URLs.
```
Missing titles (2)
  https://example.com/page-1
  https://example.com/page-2

Short content (1)
  https://example.com/page-3
```

## Filename Conventions
When --output is omitted, filenames are auto-generated from the domain:
| Format | Pattern | Example |
|---|---|---|
| NDJSON | {domain}.crawl | example-com.crawl |
| JSON | {domain}.json | example-com.json |
| Sitemap | {domain}-sitemap.xml | example-com-sitemap.xml |
| SEO CSV | {domain}-seo.csv | example-com-seo.csv |
| SEO TXT | {domain}-seo.txt | example-com-seo.txt |
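The patterns above slugify the domain (non-alphanumeric characters become hyphens) and append a per-format suffix. A sketch of that mapping; the helper name is invented, not part of the tool:

```python
import re

def default_filename(domain: str, fmt: str) -> str:
    """Hypothetical helper mirroring the table above: slugify the
    domain, then append the per-format suffix."""
    slug = re.sub(r"[^a-z0-9]+", "-", domain.lower()).strip("-")
    suffix = {
        "ndjson": ".crawl",
        "json": ".json",
        "sitemap": "-sitemap.xml",
        "seo-csv": "-seo.csv",
        "seo-txt": "-seo.txt",
    }[fmt]
    return slug + suffix

print(default_filename("example.com", "ndjson"))  # example-com.crawl
```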
## System Requirements

- macOS - Apple Silicon (M1/M2/M3/M4) or Intel