# CLI Reference - Flags, Formats, and Data Model

## CLI Flags

### crawler crawl

`crawler crawl [OPTIONS] <URL>`

The `https://` prefix is added automatically if omitted.
| Flag | Short | Default | Description |
|---|---|---|---|
| `--output <PATH>` | `-o` | Auto-generated | Output file path |
| `--format <FMT>` | `-f` | `ndjson` | Output format: `ndjson`, `json`, `sitemap` |
| `--max-pages <N>` | `-p` | 100 | Maximum pages to crawl |
| `--max-depth <N>` | `-d` | 10 | Maximum crawl depth |
| `--concurrency <N>` | `-c` | 5 | Concurrent requests |
| `--delay <MS>` | | 200 | Delay between requests in ms |
| `--no-extract` | | false | Disable content extraction (faster, smaller output) |
| `--no-check-outgoing` | | false | Disable outgoing link checking (faster crawls) |
| `--verbose` | `-v` | false | Enable verbose logging |
| `--quiet` | `-q` | false | Suppress all output except errors |
### crawler fetch

`crawler fetch [OPTIONS] <URL>`

Fetches a single URL. The `https://` prefix is added automatically if omitted.
| Flag | Short | Default | Description |
|---|---|---|---|
| `--output <PATH>` | `-o` | Auto-generated | Output file path |
| `--format <FMT>` | `-f` | `ndjson` | Output format: `ndjson`, `json`, `sitemap` |
| `--no-extract` | | false | Disable content extraction |
| `--user-agent <UA>` | | | Custom User-Agent header |
| `--verbose` | `-v` | false | Enable verbose logging |
| `--quiet` | `-q` | false | Suppress all output except errors |
### crawler info

`crawler info <FILE>`

Reads a `.crawl` file and displays summary statistics. No additional flags.
### crawler export

`crawler export [OPTIONS] <FILE>`

| Flag | Short | Default | Description |
|---|---|---|---|
| `--format <FMT>` | `-f` | Required | Target format: `json` or `sitemap` |
| `--output <PATH>` | `-o` | Auto-generated | Output file path |
### crawler seo

`crawler seo [OPTIONS] <FILE>`

| Flag | Short | Default | Description |
|---|---|---|---|
| `--export <FMT>` | | | Export format: `csv`, `txt` |
| `--output <PATH>` | `-o` | Auto-generated | Output file path for export |
## Page Data Model

Each crawled page contains the following fields, serialized to JSON in the output formats.

| Field | Type | Description |
|---|---|---|
| `url` | String | Full URL of the page |
| `status_code` | Number | HTTP status code |
| `content_type` | String? | Content-Type header value |
| `title` | String? | HTML `<title>` text |
| `meta_description` | String? | `<meta name="description">` content |
| `canonical_url` | String? | `<link rel="canonical">` href |
| `discovered_from` | String? | Parent URL that linked to this page |
| `links_found` | Number | Number of new same-domain links discovered |
| `depth` | Number | Crawl depth from the start URL |
| `response_time_ms` | Number | HTTP response time in milliseconds |
| `markdown` | String? | Extracted content as Markdown (when enabled) |
| `word_count` | Number? | Word count of extracted content |
| `byline` | String? | Author byline from content extraction |
| `excerpt` | String? | Article excerpt from content extraction |
| `meta_robots` | String? | `<meta name="robots">` content |
| `x_robots_tag` | String? | X-Robots-Tag HTTP header value |
| `rel_next` | String? | `<link rel="next">` href (pagination) |
| `rel_prev` | String? | `<link rel="prev">` href (pagination) |
| `outgoing_link_errors` | Array? | Broken outgoing links and images (when checking is enabled) |
Each entry in the `outgoing_link_errors` array is an `OutgoingLinkError` object:

| Field | Type | Description |
|---|---|---|
| `url` | String | The broken link URL |
| `status_code` | Number | HTTP status code (0 for connection failure) |
| `error` | String? | Error message for connection failures |
| `link_type` | String | `"anchor"` or `"image"` |

Fields marked with `?` are optional and may be absent depending on the page.
## Crawl Events

The crawler emits real-time events during a crawl session; both the CLI and the desktop app use them to display progress.

| Event | Data | Description |
|---|---|---|
| `Started` | `url` | Crawl has begun |
| `Discovered` | `url`, `depth` | New URL found and enqueued |
| `PageCrawled` | Page object | Page successfully fetched and parsed |
| `PageError` | `url`, `error` | Failed to fetch a page |
| `Progress` | `crawled`, `total_discovered` | Periodic progress update |
| `Completed` | `total_pages`, `total_errors` | Crawl finished |
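As an illustration of how a consumer might fold this event stream into a summary, here is a Python sketch. The dict encoding (a `type` key plus the per-event fields from the table) is an assumption made for the example; the crawler's actual event representation is not specified here.

```python
def summarize_events(events):
    """Tally a stream of crawl events into a progress summary.

    The {"type": ..., ...} dict shape is assumed for illustration only;
    it is not the crawler's documented wire format.
    """
    summary = {"crawled": 0, "errors": 0, "discovered": 0}
    for event in events:
        kind = event["type"]
        if kind == "Discovered":
            summary["discovered"] += 1
        elif kind == "PageCrawled":
            summary["crawled"] += 1
        elif kind == "PageError":
            summary["errors"] += 1
        elif kind == "Completed":
            # Prefer the crawler's final totals over our running counts.
            summary["crawled"] = event["total_pages"]
            summary["errors"] = event["total_errors"]
    return summary
```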
## Output Formats

### NDJSON (.crawl)

Default format. Newline-delimited JSON: a metadata header line followed by one JSON object per crawled page.

```json
{"_meta":true,"version":"0.1.0","url":"https://example.com","started_at":"2024-01-01T00:00:00Z","config":{"max_pages":100,"max_depth":10,"concurrency":5}}
{"url":"https://example.com","status_code":200,"content_type":"text/html","title":"Example","meta_description":"...","links_found":15,"depth":0,"response_time_ms":234}
```

The first line contains crawl metadata (`_meta: true`), including the version, target URL, start time, and config summary. Each subsequent line is a serialized page record. This format is streamable: the output is valid after every line.
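Because the header is just another JSON object flagged with `_meta`, a consumer can separate metadata from page records in a single pass. A minimal Python sketch (the function name `read_crawl` is ours, not part of the CLI):

```python
import json

def read_crawl(path):
    """Read a .crawl NDJSON file, returning (meta, pages).

    The line flagged with "_meta" becomes the metadata dict; every
    other non-empty line is parsed as a page record.
    """
    meta, pages = None, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if record.get("_meta"):
                meta = record
            else:
                pages.append(record)
    return meta, pages
```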
### JSON

A JSON array of page objects, written all at once on finalization.

```json
[
  {
    "url": "https://example.com",
    "title": "Example",
    "meta_description": "An example site",
    "canonical_url": "https://example.com",
    "discovered_from": null,
    "status": 200,
    "markdown": "# Example\n\nContent here...",
    "word_count": 150,
    "byline": "Author Name",
    "excerpt": "A brief summary..."
  }
]
```

Content fields (`markdown`, `word_count`, `byline`, `excerpt`) are only included when present.
### Sitemap XML

Standard XML sitemap format compatible with search engines. Only pages with HTTP 200 status and a `text/html` content type are included. Limited to 50,000 URLs per sitemap.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
</urlset>
```

URLs are deduplicated and XML-escaped. The `<lastmod>` date is set to the crawl date.
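The filtering, deduplication, escaping, and size-cap rules above can be sketched in Python. This illustrates the documented rules, not the crawler's actual implementation; the function name `pages_to_sitemap` is ours.

```python
from datetime import date
from xml.sax.saxutils import escape

SITEMAP_LIMIT = 50_000  # per-sitemap URL cap stated in the docs

def pages_to_sitemap(pages, crawl_date=None):
    """Render page records as sitemap XML: only HTTP 200 + text/html
    pages, deduplicated by URL, XML-escaped, capped at the limit."""
    lastmod = (crawl_date or date.today()).isoformat()
    seen, entries = set(), []
    for page in pages:
        url = page["url"]
        if page.get("status_code") != 200:
            continue
        if not (page.get("content_type") or "").startswith("text/html"):
            continue
        if url in seen:
            continue
        seen.add(url)
        entries.append(
            f"  <url>\n    <loc>{escape(url)}</loc>\n"
            f"    <lastmod>{lastmod}</lastmod>\n  </url>"
        )
        if len(entries) >= SITEMAP_LIMIT:
            break
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(entries)
        + "\n</urlset>"
    )
```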
### SEO CSV

Two columns, Issue Type and URL, one row per affected URL, with values double-quote escaped.

```csv
"Issue Type","URL"
"Missing titles","https://example.com/page-1"
"Short content","https://example.com/page-2"
```
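For producing or round-tripping this layout, Python's `csv` module with `QUOTE_ALL` reproduces the fully quoted style. The function name and the `issues` dict shape are illustrative, not part of the CLI:

```python
import csv
import io

def seo_issues_to_csv(issues):
    """Write issue/URL pairs in the two-column, fully quoted layout.

    `issues` maps an issue type (e.g. "Missing titles") to a list of
    affected URLs; QUOTE_ALL double-quotes every field.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["Issue Type", "URL"])
    for issue_type, urls in issues.items():
        for url in urls:
            writer.writerow([issue_type, url])
    return buf.getvalue()
```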
### SEO TXT

Human-readable report grouped by issue category with indented URLs.
```text
Missing titles (2)
  https://example.com/page-1
  https://example.com/page-2

Short content (1)
  https://example.com/page-3
```

## Filename Conventions
When `--output` is omitted, filenames are auto-generated from the domain.

### crawler crawl

| Format | Pattern | Example |
|---|---|---|
| NDJSON | `{domain}.crawl` | `example-com.crawl` |
| JSON | `{domain}.json` | `example-com.json` |
| Sitemap | `{domain}-sitemap.xml` | `example-com-sitemap.xml` |
### crawler fetch

Filenames include the URL path for unique, readable names:

| Format | Pattern | Example |
|---|---|---|
| NDJSON | `{domain}-{slug}.crawl` | `example-com-about-team.crawl` |
| JSON | `{domain}-{slug}.json` | `example-com-about-team.json` |
| Sitemap | `{domain}-{slug}.xml` | `example-com-about-team.xml` |
Slug rules: path segments are joined with hyphens, file extensions (`.html`, `.php`, etc.) are stripped, query parameters and fragments are ignored, and the root path becomes `index`.
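These slug rules, plus the domain transformation visible in the examples (dots become hyphens), can be sketched in Python. This is our reading of the documented rules, not the CLI's actual code; real output may differ for edge cases the rules do not cover.

```python
from urllib.parse import urlsplit

def domain_slug(host):
    """example.com -> example-com, as seen in the example filenames."""
    return host.replace(".", "-")

def url_slug(url):
    """Derive a path slug per the documented rules (illustrative):
    join path segments with hyphens, strip a trailing file extension,
    ignore query/fragment, map the root path to "index"."""
    path = urlsplit(url).path  # query string and fragment are dropped
    segments = [s for s in path.split("/") if s]
    if not segments:
        return "index"
    last = segments[-1]
    if "." in last:
        # Strip the file extension from the final segment.
        segments[-1] = last.rsplit(".", 1)[0]
    return "-".join(segments)
```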
### crawler seo

| Format | Pattern | Example |
|---|---|---|
| SEO CSV | `{domain}-seo.csv` | `example-com-seo.csv` |
| SEO TXT | `{domain}-seo.txt` | `example-com-seo.txt` |
## System Requirements

- macOS - Apple Silicon (M1/M2/M3/M4) or Intel