March 4, 2026

v0.4.0: Better Content Extraction

Significantly improved content extraction accuracy, plus richer article metadata including site name, lead image, and publish and modified dates.

Mehmet Kose

What’s New in v0.4.0

More Accurate Content Extraction

The content extraction engine has been replaced with a more accurate implementation that correctly identifies the main article text across a wider range of sites. Pages with complex layouts, sidebars, and navigation-heavy designs now produce cleaner, more focused markdown output.

This means fewer cases where boilerplate content (menus, footers, ads) leaks into your extracted markdown, and fewer cases where the actual article body gets missed entirely.

Richer Article Metadata

Crawl results now include additional metadata when available:

  • Site Name - the publication or site name (e.g. “The New York Times”)
  • Lead Image - the article’s primary image URL
  • Publish Date - when the article was originally published
  • Modified Date - when the article was last updated

These fields appear automatically in your .crawl files and JSON exports. No configuration changes needed.
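For developers consuming the JSON export, reading these fields might look something like the sketch below. The key names (`site_name`, `lead_image`, `publish_date`, `modified_date`), the ISO 8601 date format, and the shape of the sample record are assumptions made for illustration, not the documented export schema, so check a real export before relying on them. Because the fields are only present "when available", the sketch treats every one of them as optional.

```python
import json
from datetime import datetime
from typing import Optional


def parse_date(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string; return None when it is missing or malformed."""
    if not value:
        return None
    try:
        return datetime.fromisoformat(value)
    except (ValueError, TypeError):
        return None


def article_metadata(page: dict) -> dict:
    """Collect the four metadata fields from one page record.

    The key names used here are hypothetical; adjust them to match
    the actual export. Absent fields come back as None.
    """
    return {
        "site_name": page.get("site_name"),
        "lead_image": page.get("lead_image"),
        "publish_date": parse_date(page.get("publish_date")),
        "modified_date": parse_date(page.get("modified_date")),
    }


# Hypothetical page record, standing in for one entry of a JSON export.
sample = json.loads(
    '{"url": "https://example.com/post",'
    ' "site_name": "Example Times",'
    ' "publish_date": "2026-03-04T09:00:00+00:00"}'
)
meta = article_metadata(sample)
```

Using `dict.get` and returning `None` for missing or unparseable values keeps downstream code from failing on pages where a site does not expose a given field.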

Who Benefits

  • Content teams using Content Archive export get cleaner markdown with less noise from page chrome
  • SEO professionals analyzing content across large sites get more reliable word counts and excerpts
  • Developers building on crawl data get structured article metadata without additional scraping

About crawler.sh

crawler.sh is a fast Rust-based web crawler and SEO auditing tool that runs entirely on your own machine. Use the CLI for automation, scripts, and CI pipelines, or the desktop app for a visual dashboard with live crawl progress, SEO issue charts, and one-click exports.

Every release ships across both the CLI and the desktop app. Download the latest version or run crawler update from the terminal to upgrade an existing install.
