CLI Features - Commands and Crawl Engine
Crawl Engine
The crawler uses BFS (breadth-first search) traversal starting from the given URL. Pages at depth 0 are crawled first, then depth 1, and so on up to max_depth. Concurrency is controlled by the configured limit: requests never exceed the set number of concurrent connections.
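The crawler's own implementation isn't shown here, but the traversal it describes can be sketched in Python. The `get_links` callback is a stand-in for fetching a page and extracting its links:

```python
from collections import deque

def crawl_order(start, get_links, max_depth):
    """Yield (url, depth) in BFS order: all of depth 0, then depth 1, ..."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        yield url, depth
        if depth >= max_depth:
            continue  # don't enqueue links beyond max_depth
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
```

Because the queue is FIFO, every page at one depth is visited before any page at the next, which is the ordering described above.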
Domain Constraint
Only links on the same host as the start URL are followed. Cross-domain links are discovered but not crawled. This keeps the crawl focused on a single site.
URL Normalization
Before enqueuing, URLs are normalized:
- Fragments (#section) are stripped
- Trailing slashes are removed
- Previously seen URLs are skipped
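As an illustration (not the crawler's actual code), the three rules above amount to something like:

```python
from urllib.parse import urldefrag

def normalize(url):
    """Strip the fragment and any trailing slash."""
    url, _fragment = urldefrag(url)
    return url.rstrip("/")

seen = set()

def should_enqueue(url):
    """Skip URLs whose normalized form was already seen."""
    norm = normalize(url)
    if norm in seen:
        return False
    seen.add(norm)
    return True
```

Normalizing before the seen-set check is what makes `https://example.com/a/`, `https://example.com/a`, and `https://example.com/a#top` count as one page.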
Link Filtering
The following link types are ignored:
- javascript: URLs
- mailto: links
- tel: links
- Anchor-only links (#)
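A minimal predicate implementing this filter might look like the following sketch (again illustrative, not the crawler's source):

```python
def is_crawlable(href):
    """Reject non-HTTP schemes and anchor-only links."""
    href = href.strip()
    if not href or href.startswith("#"):
        return False  # anchor-only link
    scheme = href.split(":", 1)[0].lower()
    if scheme in ("javascript", "mailto", "tel"):
        return False  # non-navigational scheme
    return True
```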
Subcommands
The CLI provides five subcommands:
crawl - Crawl a website
```
crawler crawl <URL> [OPTIONS]
```

Crawls a website starting from the given URL, staying within the same domain. Outputs crawled page data in the specified format. The https:// prefix is added automatically if omitted.
```
# Basic crawl
crawler crawl https://example.com

# Crawl with custom limits
crawler crawl https://example.com -p 500 -d 5 -c 10

# Output as JSON
crawler crawl https://example.com -f json -o site.json

# Generate a sitemap
crawler crawl https://example.com -f sitemap

# Fast crawl without content extraction
crawler crawl https://example.com --no-extract --delay 50
```

fetch - Fetch a single URL
```
crawler fetch <URL> [OPTIONS]
```

Fetches a single page without crawling the entire site. The output filename is derived from the URL path, so each page gets a unique, readable file.
```
# Fetch a single page
crawler fetch https://example.com/about/team
# Creates: example-com-about-team.crawl

# Fetch the homepage
crawler fetch https://example.com/
# Creates: example-com-index.crawl

# Output as JSON
crawler fetch --format json https://example.com/pricing

# Skip content extraction
crawler fetch --no-extract https://example.com/about

# Custom output path
crawler fetch -o page.crawl https://example.com/about
```

The output format is identical to crawler crawl, so fetched pages work with the info, seo, and export commands.
info - Inspect a .crawl file
```
crawler info <FILE>
```

Analyzes a .crawl (NDJSON) file and displays summary statistics: domain, page count, file size, status code distribution, and response time stats.
```
crawler info example-com.crawl
```

export - Convert a .crawl file
```
crawler export <FILE> --format <fmt> [OPTIONS]
```

Converts a .crawl file to another format. When --output is omitted, the path is derived from the input filename (e.g., example-com.json or example-com-sitemap.xml).
```
# Export to JSON
crawler export example-com.crawl -f json

# Export as sitemap XML
crawler export example-com.crawl -f sitemap -o sitemap.xml
```

seo - Analyze SEO issues
```
crawler seo <FILE> [OPTIONS]
```

Analyzes a .crawl file for SEO issues across 23 check categories. Only 2xx HTML pages are analyzed.
```
# Display SEO analysis in terminal
crawler seo example-com.crawl

# Export issues as CSV
crawler seo example-com.crawl --export csv

# Export issues as TXT with custom path
crawler seo example-com.crawl --export txt -o report.txt
```

SEO Analysis
The SEO analyzer runs on successful HTML pages - pages with 2xx HTTP status codes and text/html content type. Non-HTML resources (images, CSS, JS) and error pages are excluded.
Duplicate detection is canonical-aware: pages pointing to the same <link rel="canonical"> URL are grouped together before checking for duplicate titles or descriptions. This prevents false positives across canonicalized page groups.
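A sketch of canonical-aware duplicate-title grouping, assuming page records shaped like the .crawl format (`url`, `canonical_url`, `title`); this illustrates the idea, not the analyzer's actual code:

```python
from collections import defaultdict

def duplicate_titles(pages):
    """Group pages by canonical URL (falling back to the page URL),
    then flag titles shared by more than one canonical group."""
    # One representative title per canonical group: pages that all
    # canonicalize to the same URL never count against each other.
    title_by_canonical = {}
    for page in pages:
        key = page.get("canonical_url") or page["url"]
        title_by_canonical.setdefault(key, page.get("title", ""))
    groups = defaultdict(list)
    for canonical, title in title_by_canonical.items():
        if title:
            groups[title].append(canonical)
    return {t: urls for t, urls in groups.items() if len(urls) > 1}
```

Here `https://e.com/a` and `https://e.com/a?ref=1` with the same canonical URL would be collapsed into one group before comparison, avoiding the false positive.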
Check Categories
| # | Category | Condition |
|---|---|---|
| 1 | Missing titles | title is empty or absent |
| 2 | Missing meta descriptions | meta_description is empty or absent |
| 3 | Titles too long | title.length > 60 |
| 4 | Titles too short | title.length < 30 |
| 5 | Descriptions too long | meta_description.length > 160 |
| 6 | Descriptions too short | meta_description.length < 50 |
| 7 | Missing content | word_count is null |
| 8 | Short content | word_count < 200 |
| 9 | Long content | word_count > 5,000 |
| 10 | Long URLs | url.length > 120 |
| 11 | Noindex pages | meta_robots or X-Robots-Tag contains noindex |
| 12 | Nofollow pages | meta_robots or X-Robots-Tag contains nofollow |
| 13 | Non-self canonicals | canonical_url differs from the page URL |
| 14 | Paginated pages | Page has rel="next" or rel="prev" |
| 15 | Duplicate titles | Multiple pages share the same title (grouped by canonical URL) |
| 16 | Duplicate descriptions | Multiple pages share the same meta description (grouped by canonical URL) |
| 17 | Missing H1 | Page has no H1 element |
| 18 | Multiple H1 tags | Page has more than one H1 element |
| 19 | Empty H1 | H1 element contains no text |
| 20 | Long H1 | H1 text > 70 characters |
| 21 | Short H1 | H1 text < 10 characters |
| 22 | Duplicate H1 | Multiple pages share the same H1 text |
| 23 | Broken links | Outgoing links or images return 404/410/5xx or fail to connect |
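As a worked example of how the table's conditions compose, checks 1, 3, and 4 applied to a single page record might look like this sketch (illustrative only):

```python
def title_issues(page):
    """Apply checks 1, 3, and 4 from the table to one page record."""
    issues = []
    title = (page.get("title") or "").strip()
    if not title:
        issues.append("missing_title")        # check 1
    elif len(title) > 60:
        issues.append("title_too_long")       # check 3
    elif len(title) < 30:
        issues.append("title_too_short")      # check 4
    return issues
```

Note the `elif` chain: a missing title is reported once as missing, not additionally as too short.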
Export Formats
- CSV - Two columns: Issue Type and URL. One row per affected URL, with values double-quote escaped.
- TXT - Sections grouped by issue type. Each section shows the issue title with count, followed by indented URLs.
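The CSV shape described above can be reproduced with Python's standard `csv` writer, which handles the double-quote escaping (a sketch, assuming issues arrive as `(issue_type, url)` pairs):

```python
import csv
import io

def export_csv(issues):
    """Write (issue_type, url) pairs as a two-column CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    writer.writerow(["Issue Type", "URL"])  # header row
    for issue_type, url in issues:
        writer.writerow([issue_type, url])
    return buf.getvalue()
```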
Content Extraction
When content extraction is enabled (the default), the crawler extracts the main article content from each HTML page and converts it to clean Markdown. The extraction process is similar to browser reader modes - it strips navigation, sidebars, footers, and scripts, keeping only the primary content.
The result includes:
- Markdown text - clean article content
- Word count - total word count
- Author byline - when available
- Excerpt - article summary
Disable extraction with --no-extract for faster crawling when you only need URLs and metadata.
Broken Link Checking
By default, crawler crawl checks every outgoing link and image src on crawled pages for broken URLs. Internal links are already verified by the crawl itself, so only external URLs are checked.
How it works
- Sends a HEAD request to each external URL (falls back to GET on 405)
- Results are cached per URL so duplicate links across pages are only checked once
- Runs concurrently alongside the crawl and does not significantly slow it down
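The HEAD-with-GET-fallback and per-URL cache can be sketched as follows; `send` is a stand-in for the real HTTP client, and the broken criteria shown are the anchor rules (404, 410, 5xx, or connection failure):

```python
BROKEN_STATUSES = {404, 410}

_cache = {}

def check_url(url, send):
    """send(method, url) -> status code, raising OSError on
    connection failure. Results are cached per URL."""
    if url in _cache:
        return _cache[url]
    try:
        status = send("HEAD", url)
        if status == 405:          # server rejects HEAD: retry with GET
            status = send("GET", url)
        broken = status in BROKEN_STATUSES or status >= 500
    except OSError:                # connection failure counts as broken
        broken = True
    _cache[url] = broken
    return broken
```

The cache is keyed by URL, so a link that appears on fifty pages still costs only one request.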
What counts as broken
Section titled “What counts as broken”- Anchors (
<a>links) - 404, 410, 5xx, or connection failures - Images (
<img>sources) - any 4xx or 5xx response, or connection failures
Disabling link checks
Use --no-check-outgoing to skip outgoing link verification for faster crawls:
```
crawler crawl --no-check-outgoing https://example.com
```

Results in SEO reports
Broken links appear in crawler seo reports (check #23) grouped by broken URL with the source pages listed.
The .crawl File Format
The default output is NDJSON (Newline-Delimited JSON) with a .crawl extension. Each line is a standalone JSON object.
The first line is a metadata header:
```
{"_meta": true, "url": "https://example.com", "max_pages": 100, "max_depth": 10, "concurrency": 5}
```

Subsequent lines are crawled page records:
```
{"url": "https://example.com/", "status_code": 200, "content_type": "text/html", "title": "Example", "meta_description": "...", "canonical_url": "https://example.com/", "links_found": 42, "depth": 0, "response_time_ms": 120, "word_count": 350}
```

Page records include optional SEO-relevant fields when available: meta_robots, x_robots_tag, rel_next, rel_prev, word_count, markdown, byline, excerpt, outgoing_link_errors.
This format is streamable - pages are written as they are crawled, so the file is always in a valid state even if the crawl is interrupted.
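Because each line is independent JSON, consuming a .crawl file in another tool is a few lines of Python (a sketch of a reader, not part of the CLI):

```python
import json

def read_crawl(path):
    """Stream a .crawl (NDJSON) file: the first record is the _meta
    header; every following line is one page record."""
    meta, pages = None, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate a trailing blank line
            record = json.loads(line)
            if record.get("_meta"):
                meta = record
            else:
                pages.append(record)
    return meta, pages
```

A partially written file from an interrupted crawl parses the same way; you simply get fewer page records.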
Polite Crawling
The delay_ms parameter adds a delay between requests to avoid overloading the target server. The CLI defaults to 200ms.
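The effect of delay_ms is simply a pause between consecutive requests, as in this sketch (the real crawler applies it per connection while crawling concurrently):

```python
import time

def polite_fetch_all(urls, fetch, delay_ms=200):
    """Call fetch(url) for each URL, sleeping delay_ms between requests.
    200 ms is the CLI default."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no delay before the very first request
            time.sleep(delay_ms / 1000)
        results.append(fetch(url))
    return results
```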
| Setting | Value |
|---|---|
| User-Agent | crawler.sh/0.1 |
| Timeout | 30 seconds |
| Max Redirects | 10 |