v0.8.3: Raw HTML Capture
New --include-html flag on crawler crawl and crawler fetch keeps the raw (post-JavaScript-render) HTML body alongside the extracted Markdown. Available in the CLI and over MCP.
What’s New in v0.8.3
This release adds raw HTML capture to the CLI and MCP server. crawler.sh has always extracted page content to clean Markdown and discarded the source HTML. Some workflows (training data, structured-data extraction, archival snapshots, custom parsers) need the markup itself. Now they can have it.
--include-html
Pass --include-html on crawler crawl or crawler fetch to keep the raw HTML body on every HTML page as an html field in the output:
    # Whole-site crawl with HTML retained on every page
    crawler crawl --include-html https://example.com

    # Single-page fetch with HTML
    crawler fetch --include-html https://example.com/about

What gets captured:
- JavaScript-rendered pages - the post-render HTML, after scripts have populated the DOM
- Static pages - the UTF-8-decoded bytes the server returned
- Non-HTML responses (PDFs, images, JSON, plain text, etc.) - never populated; the field stays absent
The flag is off by default because raw HTML is significantly larger than the extracted Markdown. On a 1,000-page crawl, enabling --include-html typically multiplies output size by 5-15x depending on page complexity. Use it when you need the source markup and stick with the default when you only need crawler.sh’s extracted content.
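To see where a particular crawl falls in that range, you can compare the captured HTML bytes against the Markdown bytes. The sketch below is a rough estimate only: it assumes page records carry `markdown` and `html` string fields, and that non-page records are flagged with `_meta` or `_site_info` (as the jq filters in this post suggest). `html_overhead` is a hypothetical helper, not part of crawler.sh, and it ignores the other fields in each record.

```python
import json
from typing import Iterable


def html_overhead(lines: Iterable[str]) -> float:
    """Return (markdown bytes + html bytes) / markdown bytes over all page records."""
    markdown_bytes = 0
    html_bytes = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: metadata records are marked with _meta or _site_info.
        if record.get("_meta") is True or record.get("_site_info") is True:
            continue
        markdown_bytes += len((record.get("markdown") or "").encode("utf-8"))
        html_bytes += len((record.get("html") or "").encode("utf-8"))
    if markdown_bytes == 0:
        raise ValueError("no Markdown content found")
    return (markdown_bytes + html_bytes) / markdown_bytes


# Usage (file name is illustrative):
#   with open("run.crawl", encoding="utf-8") as f:
#       print(f"{html_overhead(f):.1f}x")
```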
Backward Compatible by Design
When the flag is off, the html key is omitted entirely from the serialized output. Existing .crawl files stay byte-identical and any tooling that has never seen the field continues to parse without changes.
The new field is also additive in the JSON schema, so older readers built against the previous page shape keep working.
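A reader that treats `html` as optional handles both old and new files without branching on the crawler version. A minimal sketch, assuming each page line has at least a `url` key; the `Page` class and `parse_page` function are hypothetical names for illustration:

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    url: str
    markdown: str
    html: Optional[str] = None  # stays None unless the crawl used --include-html


def parse_page(line: str) -> Page:
    """Parse one NDJSON page line, tolerating a missing html key."""
    record = json.loads(line)
    return Page(
        url=record["url"],
        markdown=record.get("markdown", ""),
        html=record.get("html"),  # None when the key is omitted
    )
```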
MCP Support
The MCP server picks up the same option. Both crawl_site and fetch_page accept a new include_html boolean (default false):
    {
      "name": "crawl_site",
      "arguments": {
        "url": "https://example.com",
        "max_pages": 50,
        "include_html": true
      }
    }

When the flag is on, each returned page carries a non-null html field with the raw (post-JavaScript-render) HTML body. The discover_links tool intentionally does not expose this option - it returns only URLs and titles, so HTML capture would have no observable effect.
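Most MCP clients build the request for you, but if you are scripting against the server directly, a tool call is a JSON-RPC 2.0 message with the method `tools/call`. The sketch below only builds that message; `build_tool_call` is a hypothetical helper, and the transport and the MCP initialization handshake are left out.

```python
import json
from typing import Any, Dict


def build_tool_call(request_id: int, name: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


message = build_tool_call(
    1,
    "fetch_page",
    {"url": "https://example.com/about", "include_html": True},
)
```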
Output Shape
With --include-html on, each page record in the .crawl file gains one extra field:
    {
      "url": "https://example.com/about",
      "status_code": 200,
      "content_type": "text/html; charset=utf-8",
      "title": "About",
      "markdown": "# About\n\nWe build...",
      "word_count": 412,
      "html": "<!doctype html><html><body>...</body></html>"
    }

Without the flag, the page record looks the same as before - no html key.
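Given that shape, writing each page's HTML to its own file takes only a few lines. This is a sketch under assumptions: records without an `html` key (non-HTML responses, metadata lines) are skipped, and the file-naming scheme, a truncated SHA-1 of the URL, is an arbitrary choice for illustration. `export_html` is a hypothetical helper.

```python
import hashlib
import json
from pathlib import Path
from typing import Iterable, List


def export_html(lines: Iterable[str], out_dir: str) -> List[Path]:
    """Write the html field of each page record to <out_dir>/<url-hash>.html."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        html = record.get("html")
        url = record.get("url")
        if not html or not url:
            continue  # no captured HTML on this record
        name = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16] + ".html"
        path = target / name
        path.write_text(html, encoding="utf-8")
        written.append(path)
    return written


# Usage (paths are illustrative):
#   with open("run.crawl", encoding="utf-8") as f:
#       export_html(f, "html_out")
```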
Working with Captured HTML
Because .crawl files are NDJSON (one JSON object per line), the simplest way to extract the captured HTML is with jq:
    # Print the captured HTML of every page to stdout
    jq -r 'select(._meta != true and ._site_info != true) | .html // empty' run.crawl

    # Just the first 200 chars per page, for inspection
    jq 'select(._meta != true and ._site_info != true) | {url, snippet: (.html | .[0:200])}' run.crawl

If you prefer a single JSON array, convert with crawler export:
    crawler export run.crawl -f json

When to Use This
- Training datasets - keep both the rendered Markdown and the original markup so downstream pipelines can choose what to feed the model
- Structured-data extraction - parse JSON-LD, microdata, RDFa, or custom attributes from the rendered DOM
- Archival - capture a faithful snapshot of how a page rendered at crawl time, including JavaScript-injected content
- Custom parsers - run your own extractor against the post-render HTML instead of relying on crawler.sh’s content-extraction defaults
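As an example of the structured-data case, the sketch below pulls JSON-LD blocks out of a captured `html` string using only the Python standard library. It is a minimal illustration, not a full structured-data parser: it handles `<script type="application/ld+json">` only and silently skips blocks that are not valid JSON. `JsonLdParser` and `extract_json_ld` are hypothetical names.

```python
import json
from html.parser import HTMLParser
from typing import Any, List


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of application/ld+json script tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._chunks: List[str] = []
        self.blocks: List[Any] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD rather than fail the whole page


def extract_json_ld(html: str) -> List[Any]:
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Feed it the `html` value of any page record; pages without JSON-LD return an empty list.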
For day-to-day crawling, link discovery, SEO analysis, and broken-link reports, the default extracted-Markdown output is still the right choice. It’s smaller, faster to process, and is what every other crawler.sh subcommand (info, seo, export) operates on.
Upgrade
CLI and MCP self-install:
    curl -fsSL https://install.crawler.sh | sh
    curl -fsSL https://install.crawler.sh/install-mcp.sh | sh

The desktop app updates automatically on launch. The desktop UI does not currently expose the raw-HTML toggle; this release ships the option on the CLI and over MCP only.
About crawler.sh
crawler.sh is a fast Rust-based web crawler and SEO auditing tool that runs entirely on your own machine. Use the CLI for automation, scripts, and CI pipelines, or the desktop app for a visual dashboard with live crawl progress, SEO issue charts, and one-click exports.
Every release ships across both the CLI and the desktop app.
Download the latest version, or run crawler update from the terminal to upgrade an existing install.