CLI Reference

crawler crawl [OPTIONS] <URL>

The https:// prefix is added automatically if omitted.

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| --output <PATH> | -o | Auto-generated | Output file path |
| --format <FMT> | -f | ndjson | Output format: ndjson, json, sitemap |
| --max-pages <N> | -p | 100 | Maximum pages to crawl |
| --max-depth <N> | -d | 10 | Maximum crawl depth |
| --concurrency <N> | -c | 5 | Concurrent requests |
| --delay <MS> | | 200 | Delay between requests in ms |
| --no-extract | | false | Disable content extraction (faster, smaller output) |
| --verbose | -v | false | Enable verbose logging |
| --quiet | -q | false | Suppress all output except errors |
crawler info <FILE>

No additional flags. Reads a .crawl file and displays summary statistics.

crawler export [OPTIONS] <FILE>

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| --format <FMT> | -f | Required | Target format: json or sitemap |
| --output <PATH> | -o | Auto-generated | Output file path |

crawler seo [OPTIONS] <FILE>

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| --export <FMT> | | | Export format: csv, txt |
| --output <PATH> | -o | Auto-generated | Output file path for export |

Each crawled page contains the following fields, serialized to JSON in the output formats.

| Field | Type | Description |
|-------|------|-------------|
| url | String | Full URL of the page |
| status_code | Number | HTTP status code |
| content_type | String? | Content-Type header value |
| title | String? | HTML <title> text |
| meta_description | String? | <meta name="description"> content |
| canonical_url | String? | <link rel="canonical"> href |
| discovered_from | String? | Parent URL that linked to this page |
| links_found | Number | Number of new same-domain links discovered |
| depth | Number | Crawl depth from the start URL |
| response_time_ms | Number | HTTP response time in milliseconds |
| markdown | String? | Extracted content as Markdown (when enabled) |
| word_count | Number? | Word count of extracted content |
| byline | String? | Author byline from content extraction |
| excerpt | String? | Article excerpt from content extraction |
| meta_robots | String? | <meta name="robots"> content |
| x_robots_tag | String? | X-Robots-Tag HTTP header value |
| rel_next | String? | <link rel="next"> href (pagination) |
| rel_prev | String? | <link rel="prev"> href (pagination) |

Fields marked with ? are optional and may be absent depending on the page.
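Since only a handful of fields are guaranteed, a consumer can sanity-check records before using them. A minimal sketch in Python (the REQUIRED set and is_valid_page are illustrative helpers, not part of the tool):

```python
import json

# Fields without a "?" in the table above are always present.
REQUIRED = {"url", "status_code", "links_found", "depth", "response_time_ms"}

def is_valid_page(record: dict) -> bool:
    """A page record is valid when every required field is present;
    fields marked '?' may be absent entirely."""
    return REQUIRED <= record.keys()

line = '{"url":"https://example.com","status_code":200,"links_found":15,"depth":0,"response_time_ms":234}'
print(is_valid_page(json.loads(line)))  # True
```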

The crawler emits real-time events during a crawl session, used by both the CLI and desktop app to display progress.

| Event | Data | Description |
|-------|------|-------------|
| Started | url | Crawl has begun |
| Discovered | url, depth | New URL found and enqueued |
| PageCrawled | Page object | Page successfully fetched and parsed |
| PageError | url, error | Failed to fetch a page |
| Progress | crawled, total_discovered | Periodic progress update |
| Completed | total_pages, total_errors | Crawl finished |
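A consumer of these events only needs to dispatch on the event name. A minimal sketch, assuming events arrive as (name, data) pairs; the tuple representation is an assumption for illustration, not the tool's actual wire format:

```python
def summarize(events):
    """Tally fetched pages and errors from a stream of (name, data) events."""
    crawled = errors = 0
    for name, _data in events:
        if name == "PageCrawled":
            crawled += 1
        elif name == "PageError":
            errors += 1
    return crawled, errors

events = [
    ("Started", {"url": "https://example.com"}),
    ("Discovered", {"url": "https://example.com/a", "depth": 1}),
    ("PageCrawled", {"url": "https://example.com"}),
    ("PageError", {"url": "https://example.com/a", "error": "timeout"}),
    ("Completed", {"total_pages": 1, "total_errors": 1}),
]
print(summarize(events))  # (1, 1)
```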

Default format. Newline-delimited JSON with a metadata header line followed by one JSON object per crawled page.

```
{"_meta":true,"version":"0.1.0","url":"https://example.com","started_at":"2024-01-01T00:00:00Z","config":{"max_pages":100,"max_depth":10,"concurrency":5}}
{"url":"https://example.com","status_code":200,"content_type":"text/html","title":"Example","meta_description":"...","links_found":15,"depth":0,"response_time_ms":234}
```

The first line contains crawl metadata (_meta: true), including the version, target URL, start time, and a config summary. Each subsequent line is a serialized page record. This format is streamable: the output remains valid after every line.
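Because the format is line-delimited, a partially written file can be parsed at any point. A minimal reader sketch (read_crawl is a hypothetical helper, not part of the tool):

```python
import io
import json

def read_crawl(stream):
    """Split an NDJSON crawl stream into (metadata, page records)."""
    meta, pages = None, []
    for line in stream:
        if not line.strip():
            continue  # tolerate blank lines
        obj = json.loads(line)
        if obj.get("_meta"):
            meta = obj
        else:
            pages.append(obj)
    return meta, pages

sample = (
    '{"_meta":true,"version":"0.1.0","url":"https://example.com"}\n'
    '{"url":"https://example.com","status_code":200,"depth":0}\n'
)
meta, pages = read_crawl(io.StringIO(sample))
print(meta["version"], len(pages))  # 0.1.0 1
```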

A JSON array of page objects, written all at once on finalization.

```json
[
  {
    "url": "https://example.com",
    "title": "Example",
    "meta_description": "An example site",
    "canonical_url": "https://example.com",
    "discovered_from": null,
    "status": 200,
    "markdown": "# Example\n\nContent here...",
    "word_count": 150,
    "byline": "Author Name",
    "excerpt": "A brief summary..."
  }
]
```

Content fields (markdown, word_count, byline, excerpt) are only included when present.
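When post-processing records yourself, the same filtering can be reproduced by dropping content fields that are missing or null (to_export and CONTENT_FIELDS are illustrative names, not part of the tool):

```python
CONTENT_FIELDS = ("markdown", "word_count", "byline", "excerpt")

def to_export(record: dict) -> dict:
    """Copy a page record, omitting content fields that are absent or None."""
    out = dict(record)
    for field in CONTENT_FIELDS:
        if out.get(field) is None:
            out.pop(field, None)
    return out

page = {"url": "https://example.com", "title": "Example", "markdown": None, "word_count": None}
print(to_export(page))  # {'url': 'https://example.com', 'title': 'Example'}
```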

Standard XML sitemap format compatible with search engines. Only includes pages with HTTP 200 status and text/html content type. Limited to 50,000 URLs per sitemap.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
</urlset>
```

URLs are deduplicated and XML-escaped. The <lastmod> date is set to the crawl date.
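The filtering, deduplication, and escaping rules can be sketched as follows (the sitemap function is illustrative; it assumes page records use the status_code and content_type fields from the data-model table):

```python
from xml.sax.saxutils import escape

def sitemap(pages, crawl_date, limit=50_000):
    """Emit a sitemap for unique 200-OK text/html pages, XML-escaped."""
    seen, urls = set(), []
    for p in pages:
        ok = (p.get("status_code") == 200
              and (p.get("content_type") or "").startswith("text/html"))
        if ok and p["url"] not in seen:
            seen.add(p["url"])
            urls.append(p["url"])
    body = "".join(
        f"  <url>\n    <loc>{escape(u)}</loc>\n"
        f"    <lastmod>{crawl_date}</lastmod>\n  </url>\n"
        for u in urls[:limit]
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{body}</urlset>\n")

pages = [
    {"url": "https://example.com/a?x=1&y=2", "status_code": 200, "content_type": "text/html; charset=utf-8"},
    {"url": "https://example.com/a?x=1&y=2", "status_code": 200, "content_type": "text/html"},
    {"url": "https://example.com/logo.png", "status_code": 200, "content_type": "image/png"},
]
print(sitemap(pages, "2024-01-01"))
```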

Two columns: Issue Type and URL. One row per affected URL, with all values double-quoted and embedded quotes escaped.

```csv
"Issue Type","URL"
"Missing titles","https://example.com/page-1"
"Short content","https://example.com/page-2"
```
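This quoting rule is what Python's csv module calls QUOTE_ALL; a sketch that reproduces the shape above (seo_csv is an illustrative helper, not part of the tool):

```python
import csv
import io

def seo_csv(issues):
    """Write (issue_type, url) rows with every value double-quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["Issue Type", "URL"])
    for issue_type, url in issues:
        writer.writerow([issue_type, url])
    return buf.getvalue()

print(seo_csv([("Missing titles", "https://example.com/page-1")]))
```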

Human-readable report grouped by issue category with indented URLs.

```
Missing titles (2)
  https://example.com/page-1
  https://example.com/page-2
Short content (1)
  https://example.com/page-3
```
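The grouping and counting behind this layout can be sketched in a few lines (seo_txt is an illustrative helper, not part of the tool):

```python
from collections import defaultdict

def seo_txt(issues):
    """Group (issue_type, url) rows by category, with counts and indented URLs."""
    groups = defaultdict(list)
    for issue_type, url in issues:
        groups[issue_type].append(url)
    lines = []
    for issue_type, urls in groups.items():
        lines.append(f"{issue_type} ({len(urls)})")
        lines.extend(f"  {u}" for u in urls)
    return "\n".join(lines)

report = seo_txt([
    ("Missing titles", "https://example.com/page-1"),
    ("Missing titles", "https://example.com/page-2"),
    ("Short content", "https://example.com/page-3"),
])
print(report)
```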

When --output is omitted, filenames are auto-generated from the domain:

| Format | Pattern | Example |
|--------|---------|---------|
| NDJSON | {domain}.crawl | example-com.crawl |
| JSON | {domain}.json | example-com.json |
| Sitemap | {domain}-sitemap.xml | example-com-sitemap.xml |
| SEO CSV | {domain}-seo.csv | example-com-seo.csv |
| SEO TXT | {domain}-seo.txt | example-com-seo.txt |
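Judging by the pattern/example pairs, name generation amounts to hyphenating the dots in the host and applying a per-format template. A sketch under that assumption (the hyphenation rule is inferred from the examples; the format keys and edge cases such as ports or IDN hosts are not specified here):

```python
from urllib.parse import urlparse

# Templates from the table above; the dictionary keys are illustrative.
PATTERNS = {
    "ndjson": "{domain}.crawl",
    "json": "{domain}.json",
    "sitemap": "{domain}-sitemap.xml",
    "seo-csv": "{domain}-seo.csv",
    "seo-txt": "{domain}-seo.txt",
}

def default_filename(url: str, fmt: str) -> str:
    """Derive an output filename from the target URL's host."""
    domain = (urlparse(url).hostname or "").replace(".", "-")
    return PATTERNS[fmt].format(domain=domain)

print(default_filename("https://example.com", "sitemap"))  # example-com-sitemap.xml
```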
  • macOS - Apple Silicon (M1/M2/M3/M4) or Intel