v0.4.0: Better Content Extraction
Content extraction is now significantly more accurate, and crawl results carry richer article metadata: site name, lead image, and publish and modified dates.
What’s New in v0.4.0
More Accurate Content Extraction
The content extraction engine has been replaced with a more accurate implementation that correctly identifies the main article text across a wider range of sites. Pages with complex layouts, sidebars, and navigation-heavy designs now produce cleaner, more focused markdown output.
This means fewer cases where boilerplate content (menus, footers, ads) leaks into your extracted markdown, and fewer cases where the actual article body gets missed entirely.
Richer Article Metadata
Crawl results now include additional metadata when available:
- Site Name - the publication or site name (e.g. “The New York Times”)
- Lead Image - the article’s primary image URL
- Publish Date - when the article was originally published
- Modified Date - when the article was last updated
These fields appear automatically in your .crawl files and JSON exports. No configuration changes needed.
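As a rough illustration of consuming these fields from a JSON export, here is a minimal Python sketch. The key names (`site_name`, `lead_image`, `publish_date`, `modified_date`), the record shape, and the ISO 8601 date format are assumptions made for this example, not the documented export schema; check a real export for the actual names. Because the fields are only present "when available", the sketch treats every one of them as optional.

```python
import json
from datetime import datetime
from typing import Optional

# Hypothetical key names, assumed for illustration only.
METADATA_KEYS = ("site_name", "lead_image", "publish_date", "modified_date")


def article_metadata(record: dict) -> dict:
    """Return only the metadata fields that are present and non-empty."""
    return {key: record[key] for key in METADATA_KEYS if record.get(key)}


def parse_date(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string; return None if it is missing or unparseable."""
    if not value:
        return None
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None


# A made-up record standing in for one entry of a JSON export.
sample = json.loads("""
{
  "url": "https://example.com/article",
  "site_name": "Example News",
  "publish_date": "2024-05-01T09:30:00+00:00"
}
""")

meta = article_metadata(sample)
published = parse_date(meta.get("publish_date"))

print(meta.get("site_name", "(unknown site)"))
print(published.date() if published else "(no publish date)")
```

Reading each field with `.get()` and a fallback keeps downstream code from failing on pages where the crawler could not determine a site name, image, or date.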
Who Benefits
- Content teams using Content Archive export get cleaner markdown with less noise from page chrome
- SEO professionals analyzing content across large sites get more reliable word counts and excerpts
- Developers building on crawl data get structured article metadata without additional scraping