March 4, 2026

v0.4.0: Better Content Extraction

Significantly improved content extraction accuracy, plus richer article metadata including site name, lead image, and publish and modified dates.

Mehmet Kose


What’s New in v0.4.0

More Accurate Content Extraction

The content extraction engine has been replaced with a more accurate implementation that correctly identifies the main article text across a wider range of sites. Pages with complex layouts, sidebars, and navigation-heavy designs now produce cleaner, more focused markdown output.

This means fewer cases where boilerplate content (menus, footers, ads) leaks into your extracted markdown, and fewer cases where the actual article body gets missed entirely.

Richer Article Metadata

Crawl results now include additional metadata when available:

Site Name - the publication or site name (e.g. “The New York Times”)
Lead Image - the article’s primary image URL
Publish Date - when the article was originally published
Modified Date - when the article was last updated

These fields appear automatically in your .crawl files and JSON exports. No configuration changes are needed.
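
As a rough illustration of consuming these fields from a JSON export, the sketch below reads one record and tolerates missing metadata. The key names (`site_name`, `lead_image`, `published_at`, `modified_at`) and the record shape are assumptions made for this example, not Crawler's documented schema, so check an actual export for the real names.

```python
import json
from datetime import datetime
from typing import Optional

# Hypothetical key names; the real export schema may differ.
METADATA_KEYS = ("site_name", "lead_image", "published_at", "modified_at")


def parse_date(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp, returning None if absent or malformed."""
    if not value:
        return None
    try:
        return datetime.fromisoformat(value)
    except (ValueError, TypeError):
        return None


def article_metadata(record: dict) -> dict:
    """Pick the metadata fields out of one record; missing ones become None."""
    meta = {key: record.get(key) for key in METADATA_KEYS}
    meta["published_at"] = parse_date(meta["published_at"])
    meta["modified_at"] = parse_date(meta["modified_at"])
    return meta


# Example record, invented for illustration.
sample = json.loads(
    '{"url": "https://example.com/post", "site_name": "Example Times",'
    ' "published_at": "2026-03-01T09:30:00+00:00"}'
)
meta = article_metadata(sample)
```

Because the fields are only present "when available", reading them with `dict.get` and treating absent or unparseable dates as `None` keeps downstream code from failing on pages that expose no metadata.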

Who Benefits

Content teams using Content Archive export get cleaner markdown with less noise from page chrome
SEO professionals analyzing content across large sites get more reliable word counts and excerpts
Developers building on crawl data get structured article metadata without additional scraping
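
To see why cleaner markdown matters for word counts, here is a minimal sketch that counts words in an extracted markdown string after stripping common markup. The function name and the stripping rules are illustrative assumptions, not how Crawler computes its own counts.

```python
import re


def markdown_word_count(markdown: str) -> int:
    """Approximate word count of markdown text (illustrative helper)."""
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", " ", markdown)  # drop images
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)  # keep link text only
    text = re.sub(r"[#*_>`]", " ", text)  # strip common markup characters
    return len(re.findall(r"\b\w+\b", text))


# Invented sample: the same article with and without a leaked navigation menu.
article = "# Title\n\nSome **bold** text with a [link](https://example.com)."
with_menu = "Home | About | Contact\n\n" + article

clean_count = markdown_word_count(article)
noisy_count = markdown_word_count(with_menu)
```

In this toy input the leaked menu adds three words to a seven-word article, which is the kind of inflation that less boilerplate in the extracted markdown avoids.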

Wrap-up

A CMS shouldn't slow you down. Crawler aims to fit into your workflow, whether you're coding content models, collaborating on product copy, or launching updates at 2am.

If that sounds like the kind of tooling you want to use, try Crawler.

Crawler runs locally on your machine. Use the CLI or the desktop app: your workflow, your terms.