CLI Features - Commands and Crawl Engine

Crawl Engine

The crawler uses BFS (breadth-first search) traversal starting from the given URL. Pages at depth 0 are crawled first, then depth 1, and so on up to max_depth. Concurrency is controlled by the configured limit - requests never exceed the set number of concurrent connections.

Domain Constraint

Only links on the same host as the start URL are followed. Cross-domain links are discovered but not crawled. This keeps the crawl focused on a single site.

URL Normalization

Before enqueuing, URLs are normalized:

Fragments (#section) are stripped
Trailing slashes are removed
Previously seen URLs are skipped

Link Filtering

The following link types are ignored:

javascript: URLs
mailto: links
tel: links
Anchor-only links (#)

JavaScript Rendering

crawler.sh includes a built-in JavaScript rendering engine that executes inline scripts and produces the final DOM. Pages built with client-side frameworks are crawled with their full rendered content - no external browser required.

Three rendering modes:

Mode	Flag	Behavior
Auto (default)	(none)	Samples the first few pages and enables rendering when client-side frameworks are detected
Always	`--render`	Renders every page with JavaScript
Never	`--no-render`	Skips JavaScript rendering for faster crawls

In Auto mode, the engine profiles the first few pages looking for JavaScript framework markers, empty body shells, script-to-text ratios, and bot protection signatures. If it detects that pages need rendering, it enables the JS engine automatically for the rest of the crawl. Resource limits (CPU timeout, memory cap, output size) are enforced to prevent runaway scripts from affecting the crawl.

Subcommands

The CLI provides five subcommands:

`crawl` - Crawl a website

crawler crawl <URL> [OPTIONS]

Crawls a website starting from the given URL, staying within the same domain. Outputs crawled page data in the specified format. The https:// prefix is added automatically if omitted.

# Basic crawl
crawler crawl https://example.com

# Crawl with custom limits
crawler crawl https://example.com -p 500 -d 5 -c 10

# Output as JSON
crawler crawl https://example.com -f json -o site.json

# Generate a sitemap
crawler crawl https://example.com -f sitemap

# Fast crawl without content extraction
crawler crawl https://example.com --no-extract --delay 50

`fetch` - Fetch a single URL

crawler fetch <URL> [OPTIONS]

Fetches a single page without crawling the entire site. The output filename is derived from the URL path, so each page gets a unique, readable file.

# Fetch a single page
crawler fetch https://example.com/about/team
# Creates: example-com-about-team.crawl

# Fetch the homepage
crawler fetch https://example.com/
# Creates: example-com-index.crawl

# Output as JSON
crawler fetch --format json https://example.com/pricing

# Skip content extraction
crawler fetch --no-extract https://example.com/about

# Custom output path
crawler fetch -o page.crawl https://example.com/about

The output format is identical to crawler crawl, so fetched pages work with info, seo, and export commands.

`info` - Inspect a .crawl file

crawler info <FILE>

Analyzes a .crawl (NDJSON) file and displays summary statistics: domain, page count, file size, status code distribution, and response time stats.

crawler info example-com.crawl

`export` - Convert a .crawl file

crawler export <FILE> --format <fmt> [OPTIONS]

Converts a .crawl file to another format. When --output is omitted, the path is derived from the input filename (e.g., example-com.json or example-com-sitemap.xml).

# Export to JSON
crawler export example-com.crawl -f json

# Export as sitemap XML
crawler export example-com.crawl -f sitemap -o sitemap.xml

`seo` - Analyze SEO issues

crawler seo <FILE> [OPTIONS]

Analyzes a .crawl file for SEO issues across 24 check categories. Only 2xx HTML pages are analyzed.

# Display SEO analysis in terminal
crawler seo example-com.crawl

# Export issues as CSV
crawler seo example-com.crawl --export csv

# Export issues as TXT with custom path
crawler seo example-com.crawl --export txt -o report.txt

SEO Analysis

The SEO analyzer runs on successful HTML pages - pages with 2xx HTTP status codes and text/html content type. Non-HTML resources (images, CSS, JS) and error pages are excluded.

Duplicate detection is canonical-aware: pages pointing to the same <link rel="canonical"> URL are grouped together before checking for duplicate titles or descriptions. This prevents false positives across canonicalized page groups.

Check Categories

#	Category	Condition
1	Missing titles	`title` is empty or absent
2	Missing meta descriptions	`meta_description` is empty or absent
3	Titles too long	`title.length > 60`
4	Titles too short	`title.length < 30`
5	Descriptions too long	`meta_description.length > 160`
6	Descriptions too short	`meta_description.length < 50`
7	Missing content	`word_count` is null
8	Short content	`word_count < 200`
9	Long content	`word_count > 5,000`
10	Long URLs	`url.length > 120`
11	Noindex pages	`meta_robots` or `X-Robots-Tag` contains `noindex`
12	Nofollow pages	`meta_robots` or `X-Robots-Tag` contains `nofollow`
13	Non-self canonicals	`canonical_url` differs from the page URL
14	Paginated pages	Page has `rel="next"` or `rel="prev"`
15	Duplicate titles	Multiple pages share the same title (grouped by canonical URL)
16	Duplicate descriptions	Multiple pages share the same meta description (grouped by canonical URL)
17	Missing H1	Page has no H1 element
18	Multiple H1 tags	Page has more than one H1 element
19	Empty H1	H1 element contains no text
20	Long H1	H1 text > 70 characters
21	Short H1	H1 text < 10 characters
22	Duplicate H1	Multiple pages share the same H1 text
23	Broken links	Outgoing links or images return 404/410/5xx or fail to connect
24	Content freshness	Page is missing date signals, stale (>730d), or has inconsistent / invalid / swapped `datePublished` and `dateModified` across JSON-LD, OG, and `Last-Modified`

Export Formats

CSV - Two columns: Issue Type and URL. One row per affected URL, with values double-quote escaped.
TXT - Sections grouped by issue type. Each section shows the issue title with count, followed by indented URLs.

Content Extraction

When content extraction is enabled (the default), the crawler extracts the main article content from each HTML page and converts it to clean Markdown. The extraction process is similar to browser reader modes - it strips navigation, sidebars, footers, and scripts, keeping only the primary content.

The result includes:

Markdown text - clean article content
Word count - total word count
Author byline - when available
Excerpt - article summary

Disable extraction with --no-extract for faster crawling when you only need URLs and metadata.

Raw HTML Capture

By default, crawler.sh extracts content to Markdown and discards the source HTML. Pass --include-html on crawl or fetch to keep the raw HTML body on every HTML page as an html field in the output:

# Whole-site crawl with HTML retained on every page
crawler crawl --include-html https://example.com

# Single-page fetch with HTML
crawler fetch --include-html https://example.com/about

What gets captured:

JS-rendered pages - the post-render HTML (after scripts have populated the DOM)
Static pages - the UTF-8-decoded bytes the server returned
Non-HTML responses (PDFs, images, JSON, etc.) - never populated; the field stays absent

The flag is off by default because raw HTML is significantly larger than the extracted Markdown - on a 1,000-page crawl, enabling --include-html typically multiplies output size by 5-15x depending on page complexity. Use it when you need the source markup (e.g., training data, structured-data extraction, archival snapshots) and stick with the default when you only need crawler.sh’s extracted content.

When the flag is off, the html key is omitted entirely from the serialized output, so existing .crawl files stay byte-identical and downstream tooling that has never seen the field continues to parse without changes.

Broken Link Checking

By default, crawler crawl checks every outgoing link and image src on crawled pages for broken URLs. Internal links are already verified by the crawl itself, so only external URLs are checked.

How it works

Sends a HEAD request to each external URL (falls back to GET on 405)
Results are cached per URL so duplicate links across pages are only checked once
Runs concurrently alongside the crawl and does not significantly slow it down

What counts as broken

Anchors (<a> links) - 404, 410, 5xx, or connection failures
Images (<img> sources) - any 4xx or 5xx response, or connection failures

Disabling link checks

Use --no-check-outgoing to skip outgoing link verification for faster crawls:

crawler crawl --no-check-outgoing https://example.com

Results in SEO reports

Broken links appear in crawler seo reports (check #23) grouped by broken URL with the source pages listed.

The .crawl File Format

The default output is NDJSON (Newline-Delimited JSON) with a .crawl extension. Each line is a standalone JSON object.

The first line is a metadata header:

{
  "_meta": true,
  "url": "https://example.com",
  "max_pages": 100,
  "max_depth": 10,
  "concurrency": 5
}

Subsequent lines are crawled page records:

{
  "url": "https://example.com/",
  "status_code": 200,
  "content_type": "text/html",
  "title": "Example",
  "meta_description": "...",
  "canonical_url": "https://example.com/",
  "links_found": 42,
  "depth": 0,
  "response_time_ms": 120,
  "word_count": 350
}

Page records include optional SEO-relevant fields when available: meta_robots, x_robots_tag, rel_next, rel_prev, word_count, markdown, byline, excerpt, outgoing_link_errors.

This format is streamable - pages are written as they are crawled, so the file is always in a valid state even if the crawl is interrupted.

Polite Crawling

The delay_ms parameter adds a delay between requests to avoid overloading the target server. The CLI defaults to 200ms.

Setting	Value
User-Agent	`crawler.sh/0.1`
Timeout	30 seconds
Max Redirects	10