CLI Reference - Flags, Formats, and Data Model

crawler crawl [OPTIONS] <URL>

Crawls a site starting from <URL>. The https:// prefix is added automatically if omitted.

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| --output <PATH> | -o | Auto-generated | Output file path |
| --format <FMT> | -f | ndjson | Output format: ndjson, json, sitemap |
| --max-pages <N> | -p | 100 | Maximum pages to crawl |
| --max-depth <N> | -d | 10 | Maximum crawl depth |
| --concurrency <N> | -c | 5 | Concurrent requests |
| --delay <MS> | | 200 | Delay between requests in ms |
| --no-extract | | false | Disable content extraction (faster, smaller output) |
| --no-check-outgoing | | false | Disable outgoing link checking (faster crawls) |
| --verbose | -v | false | Enable verbose logging |
| --quiet | -q | false | Suppress all output except errors |
crawler fetch [OPTIONS] <URL>

Fetches a single URL. The https:// prefix is added automatically if omitted.

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| --output <PATH> | -o | Auto-generated | Output file path |
| --format <FMT> | -f | ndjson | Output format: ndjson, json, sitemap |
| --no-extract | | false | Disable content extraction |
| --user-agent <UA> | | | Custom User-Agent header |
| --verbose | -v | false | Enable verbose logging |
| --quiet | -q | false | Suppress all output except errors |
crawler info <FILE>

No additional flags. Reads a .crawl file and displays summary statistics.

crawler export [OPTIONS] <FILE>
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| --format <FMT> | -f | Required | Target format: json or sitemap |
| --output <PATH> | -o | Auto-generated | Output file path |
crawler seo [OPTIONS] <FILE>
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| --export <FMT> | | | Export format: csv, txt |
| --output <PATH> | -o | Auto-generated | Output file path for export |

Each crawled page contains the following fields, serialized to JSON in the output formats.

| Field | Type | Description |
|-------|------|-------------|
| url | String | Full URL of the page |
| status_code | Number | HTTP status code |
| content_type | String? | Content-Type header value |
| title | String? | HTML <title> text |
| meta_description | String? | <meta name="description"> content |
| canonical_url | String? | <link rel="canonical"> href |
| discovered_from | String? | Parent URL that linked to this page |
| links_found | Number | Number of new same-domain links discovered |
| depth | Number | Crawl depth from the start URL |
| response_time_ms | Number | HTTP response time in milliseconds |
| markdown | String? | Extracted content as Markdown (when enabled) |
| word_count | Number? | Word count of extracted content |
| byline | String? | Author byline from content extraction |
| excerpt | String? | Article excerpt from content extraction |
| meta_robots | String? | <meta name="robots"> content |
| x_robots_tag | String? | X-Robots-Tag HTTP header value |
| rel_next | String? | <link rel="next"> href (pagination) |
| rel_prev | String? | <link rel="prev"> href (pagination) |
| outgoing_link_errors | Array? | Broken outgoing links and images (when checking is enabled) |

Each entry in the outgoing_link_errors array is an OutgoingLinkError object:

| Field | Type | Description |
|-------|------|-------------|
| url | String | The broken link URL |
| status_code | Number | HTTP status code (0 for connection failure) |
| error | String? | Error message for connection failures |
| link_type | String | "anchor" or "image" |

Fields marked with ? are optional and may be absent depending on the page.
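As a sketch, the two record shapes above can be mirrored as Python dataclasses; the class names are illustrative, and Optional marks the fields tagged with ? in the tables:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OutgoingLinkError:
    url: str
    status_code: int               # 0 for connection failure
    link_type: str                 # "anchor" or "image"
    error: Optional[str] = None    # set for connection failures

@dataclass
class Page:
    # Always-present fields
    url: str
    status_code: int
    links_found: int
    depth: int
    response_time_ms: int
    # Optional fields (marked with ? in the reference table)
    content_type: Optional[str] = None
    title: Optional[str] = None
    meta_description: Optional[str] = None
    canonical_url: Optional[str] = None
    discovered_from: Optional[str] = None
    markdown: Optional[str] = None
    word_count: Optional[int] = None
    byline: Optional[str] = None
    excerpt: Optional[str] = None
    meta_robots: Optional[str] = None
    x_robots_tag: Optional[str] = None
    rel_next: Optional[str] = None
    rel_prev: Optional[str] = None
    outgoing_link_errors: Optional[List[OutgoingLinkError]] = None
```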

The crawler emits real-time events during a crawl session, used by both the CLI and desktop app to display progress.

| Event | Data | Description |
|-------|------|-------------|
| Started | url | Crawl has begun |
| Discovered | url, depth | New URL found and enqueued |
| PageCrawled | Page object | Page successfully fetched and parsed |
| PageError | url, error | Failed to fetch a page |
| Progress | crawled, total_discovered | Periodic progress update |
| Completed | total_pages, total_errors | Crawl finished |
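A consumer might dispatch on these events as sketched below. The dict shape (an "event" key plus the data fields from the table) is an assumption for illustration; the reference does not document a wire format for events:

```python
def handle_event(ev: dict) -> str:
    """Format one crawl event as a progress line (event shape is assumed)."""
    kind = ev["event"]
    if kind == "Started":
        return f"started {ev['url']}"
    if kind == "Discovered":
        return f"found {ev['url']} at depth {ev['depth']}"
    if kind == "PageCrawled":
        return f"crawled {ev['page']['url']}"
    if kind == "PageError":
        return f"error on {ev['url']}: {ev['error']}"
    if kind == "Progress":
        return f"{ev['crawled']}/{ev['total_discovered']}"
    if kind == "Completed":
        return f"done: {ev['total_pages']} pages, {ev['total_errors']} errors"
    raise ValueError(f"unknown event: {kind}")
```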

NDJSON is the default format: newline-delimited JSON with a metadata header line followed by one JSON object per crawled page.

{"_meta":true,"version":"0.1.0","url":"https://example.com","started_at":"2024-01-01T00:00:00Z","config":{"max_pages":100,"max_depth":10,"concurrency":5}}
{"url":"https://example.com","status_code":200,"content_type":"text/html","title":"Example","meta_description":"...","links_found":15,"depth":0,"response_time_ms":234}

The first line contains crawl metadata (_meta: true), including the version, target URL, start time, and a config summary. Each subsequent line is a serialized page record. This format is streamable: the output is valid after every line.
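A minimal consumer sketch, assuming well-formed output: split the lines of a .crawl file into the metadata header and the page records.

```python
import json

def read_crawl(lines):
    """Split NDJSON crawl output into (metadata, list of page records).

    The first line carrying "_meta": true is the crawl header;
    every other non-empty line is one page record.
    """
    meta, pages = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        if obj.get("_meta"):
            meta = obj
        else:
            pages.append(obj)
    return meta, pages
```

Because the format is valid after every line, the same function works on an open file handle while a crawl is still in progress.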

The json format is a JSON array of page objects, written all at once on finalization.

[
{
"url": "https://example.com",
"title": "Example",
"meta_description": "An example site",
"canonical_url": "https://example.com",
"discovered_from": null,
"status_code": 200,
"markdown": "# Example\n\nContent here...",
"word_count": 150,
"byline": "Author Name",
"excerpt": "A brief summary..."
}
]

Content fields (markdown, word_count, byline, excerpt) are only included when present.

The sitemap format is a standard XML sitemap compatible with search engines. Only pages with HTTP 200 status and a text/html content type are included, and each sitemap is limited to 50,000 URLs.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com</loc>
<lastmod>2024-01-01</lastmod>
</url>
</urlset>

URLs are deduplicated and XML-escaped. The <lastmod> date is set to the crawl date.
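The sitemap rules above can be sketched as a small renderer over page records. The function name is illustrative, and the record shape is assumed to match the NDJSON page records; this is not the crawler's own implementation:

```python
from xml.sax.saxutils import escape

def to_sitemap(pages: list, crawl_date: str) -> str:
    """Render page records as sitemap XML, mirroring the documented rules:
    only 200 + text/html pages, deduplicated, XML-escaped, capped at 50,000
    URLs, with <lastmod> set to the crawl date."""
    seen, urls = set(), []
    for p in pages:
        content_type = p.get("content_type") or ""
        if p.get("status_code") != 200 or not content_type.startswith("text/html"):
            continue
        if p["url"] in seen:
            continue
        seen.add(p["url"])
        urls.append(p["url"])
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url in urls[:50_000]:
        lines.append(f"  <url><loc>{escape(url)}</loc>"
                     f"<lastmod>{crawl_date}</lastmod></url>")
    lines.append("</urlset>")
    return "\n".join(lines)
```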

The CSV export has two columns, Issue Type and URL, with one row per affected URL and values double-quote escaped.

"Issue Type","URL"
"Missing titles","https://example.com/page-1"
"Short content","https://example.com/page-2"
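Since the export is plain two-column CSV, it can be regrouped by issue type with the standard library; a sketch (the function name is illustrative):

```python
import csv
import io
from collections import defaultdict

def issues_by_type(csv_text: str) -> dict:
    """Group the two-column SEO CSV ("Issue Type","URL") by issue type."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Issue Type"]].append(row["URL"])
    return dict(grouped)
```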

The TXT export is a human-readable report grouped by issue category with indented URLs.

Missing titles (2)
  https://example.com/page-1
  https://example.com/page-2
Short content (1)
  https://example.com/page-3

When --output is omitted, filenames are auto-generated from the domain.

| Format | Pattern | Example |
|--------|---------|---------|
| NDJSON | {domain}.crawl | example-com.crawl |
| JSON | {domain}.json | example-com.json |
| Sitemap | {domain}-sitemap.xml | example-com-sitemap.xml |
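The patterns above imply that dots in the domain become hyphens. A sketch of that derivation (the function name is illustrative, and a full URL with scheme is assumed):

```python
from urllib.parse import urlsplit

def auto_filename(url: str, fmt: str) -> str:
    """Derive the default output filename from the URL's domain,
    replacing dots with hyphens per the documented patterns."""
    domain = urlsplit(url).hostname.replace(".", "-")
    patterns = {
        "ndjson": f"{domain}.crawl",
        "json": f"{domain}.json",
        "sitemap": f"{domain}-sitemap.xml",
    }
    return patterns[fmt]
```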

For crawler fetch, filenames include the URL path for unique, readable names:

| Format | Pattern | Example |
|--------|---------|---------|
| NDJSON | {domain}-{slug}.crawl | example-com-about-team.crawl |
| JSON | {domain}-{slug}.json | example-com-about-team.json |
| Sitemap | {domain}-{slug}.xml | example-com-about-team.xml |

Slug rules: path segments are joined with hyphens, file extensions (.html, .php, etc.) are stripped, query parameters and fragments are ignored, and the root path becomes index.
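The slug rules can be sketched as follows; the function name and the exact extension list are assumptions (the reference says ".html, .php, etc." without enumerating):

```python
from urllib.parse import urlsplit

def path_slug(url: str) -> str:
    """Build a filename slug from a URL path per the documented rules:
    join segments with hyphens, strip a trailing file extension,
    ignore query parameters and fragments, map the root path to "index"."""
    path = urlsplit(url).path            # query and fragment are dropped here
    segments = [s for s in path.split("/") if s]
    if not segments:
        return "index"                   # root path
    # Extension list is an assumption; the docs only say ".html, .php, etc."
    for ext in (".html", ".htm", ".php", ".asp", ".aspx"):
        if segments[-1].lower().endswith(ext):
            segments[-1] = segments[-1][: -len(ext)]
            break
    return "-".join(segments)
```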

For crawler seo exports:

| Format | Pattern | Example |
|--------|---------|---------|
| SEO CSV | {domain}-seo.csv | example-com-seo.csv |
| SEO TXT | {domain}-seo.txt | example-com-seo.txt |