v0.8.3: Raw HTML Capture

What’s New in v0.8.3

This release adds raw HTML capture to the CLI and MCP server. crawler.sh has always extracted page content to clean Markdown and discarded the source HTML. Some workflows (training data, structured-data extraction, archival snapshots, custom parsers) need the markup itself. Now they can have it.

`--include-html`

Pass --include-html on crawler crawl or crawler fetch to keep the raw HTML body on every HTML page as an html field in the output:

# Whole-site crawl with HTML retained on every page
crawler crawl --include-html https://example.com

# Single-page fetch with HTML
crawler fetch --include-html https://example.com/about

What gets captured:

JavaScript-rendered pages - the post-render HTML, after scripts have populated the DOM
Static pages - the UTF-8-decoded bytes the server returned
Non-HTML responses (PDFs, images, JSON, plain text, etc.) - never populated; the field stays absent

The flag is off by default because raw HTML is significantly larger than the extracted Markdown. On a 1,000-page crawl, enabling --include-html typically multiplies output size by 5-15x depending on page complexity. Use it when you need the source markup and stick with the default when you only need crawler.sh’s extracted content.
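To see where a particular crawl falls in that range, you can compare the captured HTML against the extracted Markdown. The sketch below is a rough measurement, not part of crawler.sh: the helper name `size_ratio` is hypothetical, and it assumes metadata records carry the `_meta` / `_site_info` markers used in the jq examples later in this post.

```python
import json

def size_ratio(lines):
    """Return total HTML bytes / total Markdown bytes across page records."""
    html_bytes = 0
    markdown_bytes = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: non-page records are flagged with _meta or _site_info.
        if record.get("_meta") is True or record.get("_site_info") is True:
            continue
        html = record.get("html")
        if html is None:
            continue  # non-HTML response, or crawled without --include-html
        html_bytes += len(html.encode("utf-8"))
        markdown_bytes += len((record.get("markdown") or "").encode("utf-8"))
    return html_bytes / markdown_bytes if markdown_bytes else None

sample = [
    json.dumps({"url": "https://example.com/", "markdown": "# Hi",
                "html": "<html><body><h1>Hi</h1></body></html>"}),
    json.dumps({"url": "https://example.com/a.pdf", "markdown": ""}),
]
print(size_ratio(sample))
```

This compares only the two text fields, so it approximates rather than reproduces the growth of the whole .crawl file.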

Backward Compatible by Design

When the flag is off, the html key is omitted entirely from the serialized output. A crawl without the flag serializes exactly as it did before this release, and any tooling that has never seen the field continues to parse it without changes.

The new field is also additive in the JSON schema, so older readers built against the previous page shape keep working.
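On the reading side, treating html as optional is enough to handle both shapes. A minimal sketch, assuming a hypothetical `Page` type and `parse_page` helper in your own tooling (neither ships with crawler.sh):

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:
    url: str
    markdown: str
    html: Optional[str] = None  # stays None when the key is absent

def parse_page(line: str) -> Page:
    record = json.loads(line)
    return Page(
        url=record["url"],
        markdown=record.get("markdown", ""),
        html=record.get("html"),  # .get() tolerates records without the key
    )

old_style = '{"url": "https://example.com/", "markdown": "# Home"}'
new_style = ('{"url": "https://example.com/", "markdown": "# Home", '
             '"html": "<h1>Home</h1>"}')

print(parse_page(old_style).html)
print(parse_page(new_style).html)
```

A reader that only picks out the keys it knows, as here, never has to branch on the crawler version that wrote the file.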

MCP Support

The MCP server picks up the same option. Both crawl_site and fetch_page accept a new include_html boolean (default false):

{
  "name": "crawl_site",
  "arguments": {
    "url": "https://example.com",
    "max_pages": 50,
    "include_html": true
  }
}

When the flag is on, each returned HTML page carries a non-null html field with the raw HTML body (post-render for JavaScript-rendered pages). As with the CLI, it is never populated for non-HTML responses. The discover_links tool intentionally does not expose this option - it returns only URLs and titles, so HTML capture would have no observable effect.
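Most MCP clients build the request for you, but if you are scripting against the server directly, the tool call above travels inside a JSON-RPC 2.0 `tools/call` message. The sketch below only constructs that envelope. The helper name `build_tool_call` and the request id are illustrative, and the transport (stdio or HTTP) is left to your client.

```python
import json

def build_tool_call(name, arguments, request_id=1):
    """Wrap a tool name and its arguments in an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }

request = build_tool_call(
    "crawl_site",
    {"url": "https://example.com", "max_pages": 50, "include_html": True},
)
print(json.dumps(request, indent=2))
```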

Output Shape

With --include-html on, each HTML page record in the .crawl file gains one extra field:

{
  "url": "https://example.com/about",
  "status_code": 200,
  "content_type": "text/html; charset=utf-8",
  "title": "About",
  "markdown": "# About\n\nWe build...",
  "word_count": 412,
  "html": "<!doctype html><html><body>...</body></html>"
}

Without the flag, the page record looks the same as before - no html key.

Working with Captured HTML

Because .crawl files are NDJSON (one JSON object per line), the simplest way to extract the captured HTML is with jq:

# Print every page's captured HTML to stdout, one page after another
jq -r 'select(._meta != true and ._site_info != true) | .html // empty' run.crawl

# Just the first 200 chars per page, for inspection (skips records with no html)
jq 'select(.html != null) | {url, snippet: .html[0:200]}' run.crawl
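If you want one file per page rather than a single stream, a short script is easier than jq. This is a sketch under a few assumptions: the helper name `dump_html` is hypothetical, files are named by a hash of the page URL (an arbitrary choice to keep names filesystem-safe), and any record without a string html field is skipped.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def dump_html(lines, out_dir):
    """Write each record's html to <out_dir>/<url hash>.html; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        html = record.get("html")
        url = record.get("url")
        # Metadata records and non-HTML pages have no html to write.
        if not isinstance(html, str) or not isinstance(url, str):
            continue
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
        target = out / name
        target.write_text(html, encoding="utf-8")
        written.append(target)
    return written

sample = [
    '{"url": "https://example.com/", "html": "<h1>Home</h1>"}',
    '{"url": "https://example.com/report.pdf", "markdown": ""}',
]
paths = dump_html(sample, tempfile.mkdtemp())
print([p.name for p in paths])
```

Against a real crawl, pass the open file instead of the sample list, for example `dump_html(open("run.crawl", encoding="utf-8"), "html_out")`.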

If you prefer a single JSON array, convert with crawler export:

crawler export run.crawl -f json

When to Use This

Training datasets - keep both the rendered Markdown and the original markup so downstream pipelines can choose what to feed the model
Structured-data extraction - parse JSON-LD, microdata, RDFa, or custom attributes from the rendered DOM
Archival - capture a faithful snapshot of how a page rendered at crawl time, including JavaScript-injected content
Custom parsers - run your own extractor against the post-render HTML instead of relying on crawler.sh’s content-extraction defaults
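As an illustration of the structured-data case, the sketch below pulls JSON-LD blocks out of a captured html string using only the Python standard library. The names `JsonLdExtractor` and `extract_jsonld` are hypothetical, and skipping malformed blocks silently is a simplification. A production extractor would likely use a full HTML parser and log what it drops.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD blocks

def extract_jsonld(html):
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items

page_html = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Organization", "name": "Example"}'
    '</script></head><body><h1>About</h1></body></html>'
)
print(extract_jsonld(page_html))
```

Because the capture is post-render for JavaScript-rendered pages, this also picks up JSON-LD that scripts inject into the DOM.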

For day-to-day crawling, link discovery, SEO analysis, and broken-link reports, the default extracted-Markdown output is still the right choice. It’s smaller and faster to process, and it is what every other crawler.sh subcommand (info, seo, export) operates on.

Upgrade

CLI and MCP self-install:

curl -fsSL https://install.crawler.sh | sh
curl -fsSL https://install.crawler.sh/install-mcp.sh | sh

The desktop app updates automatically on launch. The desktop UI does not currently expose the raw-HTML toggle; this release ships the option on the CLI and over MCP only.