CLI Features

The crawler uses BFS (breadth-first search) traversal starting from the given URL. Pages at depth 0 are crawled first, then depth 1, and so on up to max_depth. Concurrency is capped at the configured limit: the crawler never opens more than the set number of concurrent connections.

Only links on the same host as the start URL are followed. Cross-domain links are discovered but not crawled. This keeps the crawl focused on a single site.
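
The traversal described above can be sketched in Python. Note that `bfs_crawl` and its `fetch_links` parameter are hypothetical names for illustration; the real tool's internals may differ.

```python
from collections import deque
from urllib.parse import urlparse

def bfs_crawl(start_url, fetch_links, max_depth=10, max_pages=100):
    """Breadth-first crawl: visit pages level by level, staying on one host.

    `fetch_links` stands in for the real HTTP fetch + link extraction;
    it takes a URL and returns the links found on that page.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])   # (url, depth) pairs
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue                  # don't enqueue children past max_depth
        for link in fetch_links(url):
            # Cross-domain links are discovered but never enqueued.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

Because the frontier is a FIFO queue, all depth-n pages are dequeued before any depth-(n+1) page, which is what makes the crawl breadth-first.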

Before enqueuing, URLs are normalized:

  • Fragments (#section) are stripped
  • Trailing slashes are removed
  • Previously seen URLs are skipped
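
A minimal Python sketch of these normalization rules, using the standard library's `urldefrag` (the function names here are illustrative, not the tool's actual code):

```python
from urllib.parse import urldefrag

def normalize(url):
    """Apply the rules above: drop the fragment, then a trailing slash."""
    url, _fragment = urldefrag(url)   # strips '#section'
    return url.rstrip("/")

def enqueue(url, seen):
    """Return the normalized URL if unseen, else None (skip duplicates)."""
    u = normalize(url)
    if u in seen:
        return None
    seen.add(u)
    return u
```

Normalizing before the seen-set check is what makes `https://example.com/docs/` and `https://example.com/docs#intro` count as the same page.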

The following link types are ignored:

  • javascript: URLs
  • mailto: links
  • tel: links
  • Anchor-only links (#)
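
A link filter implementing this skip list might look like the following sketch (assumed behavior, not the tool's source):

```python
def is_crawlable(href):
    """Reject link types the crawler ignores; accept everything else."""
    href = href.strip()
    if not href or href.startswith("#"):   # anchor-only links
        return False
    # Pull off the scheme, if any; relative links have none and pass through.
    scheme = href.split(":", 1)[0].lower() if ":" in href else ""
    return scheme not in {"javascript", "mailto", "tel"}
```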

The CLI provides four subcommands:

crawler crawl <URL> [OPTIONS]

Crawls a website starting from the given URL, staying within the same domain. Outputs crawled page data in the specified format. The https:// prefix is added automatically if omitted.

# Basic crawl
crawler crawl https://example.com
# Crawl with custom limits
crawler crawl https://example.com -p 500 -d 5 -c 10
# Output as JSON
crawler crawl https://example.com -f json -o site.json
# Generate a sitemap
crawler crawl https://example.com -f sitemap
# Fast crawl without content extraction
crawler crawl https://example.com --no-extract --delay 50

crawler info <FILE>

Analyzes a .crawl (NDJSON) file and displays summary statistics: domain, page count, file size, status code distribution, and response time stats.

crawler info example-com.crawl

crawler export <FILE> --format <fmt> [OPTIONS]

Converts a .crawl file to another format. When --output is omitted, the path is derived from the input filename (e.g., example-com.json or example-com-sitemap.xml).
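
The derived-path rule can be expressed in a few lines of Python. This is a sketch of the documented behavior, assuming the two format names shown above:

```python
from pathlib import Path

def derived_output(input_path, fmt):
    """Derive the export path from the input filename when --output is omitted."""
    stem = Path(input_path).stem          # 'example-com.crawl' -> 'example-com'
    if fmt == "json":
        return f"{stem}.json"
    return f"{stem}-sitemap.xml"
```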

# Export to JSON
crawler export example-com.crawl -f json
# Export as sitemap XML
crawler export example-com.crawl -f sitemap -o sitemap.xml

crawler seo <FILE> [OPTIONS]

Analyzes a .crawl file for SEO issues across 16 check categories. Only 2xx HTML pages are analyzed.

# Display SEO analysis in terminal
crawler seo example-com.crawl
# Export issues as CSV
crawler seo example-com.crawl --export csv
# Export issues as TXT with custom path
crawler seo example-com.crawl --export txt -o report.txt

The SEO analyzer runs on successful HTML pages: those with a 2xx HTTP status code and a text/html content type. Non-HTML resources (images, CSS, JS) and error pages are excluded.

Duplicate detection is canonical-aware: pages pointing to the same <link rel="canonical"> URL are grouped together before checking for duplicate titles or descriptions. This prevents false positives across canonicalized page groups.
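
Canonical-aware grouping can be sketched as follows. The function name and page-record shape are illustrative; the records mirror the NDJSON fields documented below (`url`, `canonical_url`, `title`):

```python
from collections import defaultdict

def duplicate_titles(pages):
    """Flag titles shared across canonical groups, not within them."""
    # Keep one representative per canonical group, so URL variants that
    # canonicalize to the same page are never compared against each other.
    by_canonical = {}
    for page in pages:
        key = page.get("canonical_url") or page["url"]
        by_canonical.setdefault(key, page)
    # Count titles across the representatives only.
    by_title = defaultdict(list)
    for page in by_canonical.values():
        if page.get("title"):
            by_title[page["title"]].append(page["url"])
    return {t: urls for t, urls in by_title.items() if len(urls) > 1}
```

Two URL variants pointing at the same canonical collapse into one group, so they cannot trigger a duplicate-title report on their own.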

#    Category                 Condition
1    Missing titles           title is empty or absent
2    Missing meta descriptions  meta_description is empty or absent
3    Titles too long          title.length > 60
4    Titles too short         title.length < 30
5    Descriptions too long    meta_description.length > 160
6    Descriptions too short   meta_description.length < 50
7    Missing content          word_count is null
8    Short content            word_count < 200
9    Long content             word_count > 5,000
10   Long URLs                url.length > 120
11   Noindex pages            meta_robots or X-Robots-Tag contains noindex
12   Nofollow pages           meta_robots or X-Robots-Tag contains nofollow
13   Non-self canonicals      canonical_url differs from the page URL
14   Paginated pages          page has rel="next" or rel="prev"
15   Duplicate titles         multiple pages share the same title (grouped by canonical URL)
16   Duplicate descriptions   multiple pages share the same meta description (grouped by canonical URL)
Export formats:

  • CSV - Two columns, Issue Type and URL. One row per affected URL, with values double-quote escaped.
  • TXT - Sections grouped by issue type. Each section shows the issue title with a count, followed by indented URLs.
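
A sketch of the CSV layout using Python's standard `csv` module (quoting every field is an assumption here; the documented guarantee is only that double quotes inside values are escaped):

```python
import csv
import io

def issues_to_csv(issues):
    """Render (issue_type, url) pairs as the two-column CSV described above."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)  # csv doubles embedded quotes
    writer.writerow(["Issue Type", "URL"])
    for issue_type, url in issues:
        writer.writerow([issue_type, url])
    return buf.getvalue()
```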

When content extraction is enabled (the default), the crawler extracts the main article content from each HTML page and converts it to clean Markdown. The extraction process is similar to a browser's reader mode: it strips navigation, sidebars, footers, and scripts, keeping only the primary content.

The result includes:

  • Markdown text - clean article content
  • Word count - total word count
  • Author byline - when available
  • Excerpt - article summary

Disable extraction with --no-extract for faster crawling when you only need URLs and metadata.

The default output is NDJSON (Newline-Delimited JSON) with a .crawl extension. Each line is a standalone JSON object.

The first line is a metadata header:

{
  "_meta": true,
  "url": "https://example.com",
  "max_pages": 100,
  "max_depth": 10,
  "concurrency": 5
}

Subsequent lines are crawled page records:

{
  "url": "https://example.com/",
  "status_code": 200,
  "content_type": "text/html",
  "title": "Example",
  "meta_description": "...",
  "canonical_url": "https://example.com/",
  "links_found": 42,
  "depth": 0,
  "response_time_ms": 120,
  "word_count": 350
}

Page records include optional SEO-relevant fields when available: meta_robots, x_robots_tag, rel_next, rel_prev, word_count, markdown, byline, excerpt.

This format is streamable: pages are written as they are crawled, so the file is always in a valid state even if the crawl is interrupted.
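
A streaming reader for this layout is a few lines of Python. The function name is illustrative; it separates the `_meta` header from page records and stops cleanly at a partial last line left by an interrupted crawl:

```python
import json

def read_crawl(path):
    """Stream a .crawl (NDJSON) file: return (meta_header, page_records)."""
    meta = None
    pages = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                break                      # truncated final line: stop cleanly
            if record.get("_meta"):
                meta = record              # first line: crawl configuration
            else:
                pages.append(record)       # one JSON object per crawled page
    return meta, pages
```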

The delay_ms parameter adds a delay between requests to avoid overloading the target server. The CLI defaults to 200ms.

Setting         Value
User-Agent      crawler.sh/0.1
Timeout         30 seconds
Max Redirects   10