Clean Markdown from any website. Without the cloud bill.
crawler.sh is a local crawler that turns websites into RAG-ready Markdown for AI training, fine-tuning, and agent context. Renders JavaScript with a custom engine, respects robots.txt, and runs on your laptop. No headless Chrome, no per-page fees, no API quota.

Markdown for AI
RAG-ready Markdown from any page.
Extract the main article content from any page as clean Markdown, ready for RAG pipelines, fine-tuning corpora, or agent context. Every page ships with word count, author byline, language, and excerpt. Bulk export the entire site as a Markdown archive.

JavaScript Rendering
SPAs rendered without headless Chrome.
A custom JavaScript render engine handles React, Vue, Next, and other SPAs without spinning up headless Chrome. A Chrome 131 TLS fingerprint and a shared cookie jar mean session-walled pages render with the right state. Rendering is auto-detected per site, or you can force it on or off.

Polite Crawling
robots.txt and adaptive pacing by default.
Respects Disallow, Allow, and Crawl-delay out of the box. Adapts per-host pacing on 429 and 403 responses with exponential backoff, and slows down automatically on protected sites. Important when you are building an AI dataset and the source matters.
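
To make the idea concrete, here is a minimal Python sketch of the two behaviors described above: checking Allow, Disallow, and Crawl-delay rules, and doubling a per-host delay after a 429 or 403. It is an illustration only, not crawler.sh's actual implementation. The sample robots.txt, the `next_delay` helper, and the 1-second base and 60-second cap are all assumptions made for this example. Python's standard-library parser applies rules in file order, so the narrower Allow line is listed first here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration. The stdlib parser uses the
# first matching rule, so the specific Allow precedes the broad Disallow.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press/
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def next_delay(current: float, status: int,
               base: float = 1.0, cap: float = 60.0) -> float:
    """Return the delay (seconds) to use before the next request to a host.

    Assumed policy: double the delay on 429/403 up to `cap`,
    otherwise fall back to `base`.
    """
    if status in (429, 403):
        return min(max(current, base) * 2, cap)
    return base


rules = load_rules(ROBOTS_TXT)
# Start from the site's own Crawl-delay when it declares one.
base_delay = float(rules.crawl_delay("*") or 1.0)

delay = base_delay
for status in (200, 429, 429, 200):
    delay = next_delay(delay, status, base=base_delay)
    print(status, delay)
```

A real crawler would keep one delay value per host and would usually add jitter, so that many workers do not retry at the same moment.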

SEO Analysis
Automated checks across every page.
Detect missing titles, duplicate meta descriptions, noindex directives, thin content, broken links, long URLs, content freshness signals, and more. Useful before you ship a site, or before you train on one. Export issues as CSV or TXT.
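
As a rough illustration of what checks like these involve, the Python sketch below flags three of the issues named above (missing titles, duplicate meta descriptions, thin content) and writes them out as CSV. It is not crawler.sh's code. The page dictionary shape, the issue labels, and the 200-word threshold for thin content are assumptions chosen for the example.

```python
import csv
import io
from collections import defaultdict

# Assumed threshold for this sketch; the real tool may use a different one.
THIN_CONTENT_WORDS = 200


def find_issues(pages):
    """Return (url, issue) pairs for a list of crawled page records."""
    issues = []
    by_description = defaultdict(list)
    for page in pages:
        if not page.get("title", "").strip():
            issues.append((page["url"], "missing_title"))
        if page.get("word_count", 0) < THIN_CONTENT_WORDS:
            issues.append((page["url"], "thin_content"))
        description = page.get("meta_description", "").strip()
        if description:
            by_description[description].append(page["url"])
    # A description shared by two or more pages is reported for each page.
    for urls in by_description.values():
        if len(urls) > 1:
            issues.extend((url, "duplicate_meta_description") for url in urls)
    return issues


def to_csv(issues):
    """Serialize issues as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "issue"])
    writer.writerows(issues)
    return buffer.getvalue()


pages = [
    {"url": "/", "title": "Home", "meta_description": "Welcome", "word_count": 450},
    {"url": "/about", "title": "", "meta_description": "Welcome", "word_count": 320},
    {"url": "/tag/misc", "title": "Misc", "meta_description": "", "word_count": 40},
]
print(to_csv(find_issues(pages)))
```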

Workflow Examples
From quick crawl to full pipeline
Built for Every Workflow
Extract readable content from any website as clean Markdown. Perfect for backups, migrations, or feeding content into other tools.
Run 24 automated checks across every page to find missing titles, duplicate descriptions, thin content, content freshness signals, and more before they hurt your rankings.
Generate a valid XML sitemap that follows the sitemaps.org protocol from a live crawl. Keep your sitemaps accurate and up to date without manual maintenance.
Crawl your site regularly to catch broken links, missing pages, and status code changes before your visitors do.
Crawl any website, find every issue, and export the data you need, all from your own machine.
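
For the sitemap workflow above, the following Python sketch shows roughly what turning crawl results into Sitemap XML looks like. It is a hand-rolled illustration, not the tool's own exporter. The `build_sitemap` function and the sample URLs are hypothetical, while the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, lastmod) pairs.

    `lastmod` is an ISO 8601 date string, or None to omit the element.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # special characters are escaped on output
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical crawl results: only pages that returned 200 would be listed.
crawled = [
    ("https://example.com/", "2024-05-01"),
    ("https://example.com/search?q=a&page=2", None),
]
print(build_sitemap(crawled))
```

Building the tree with an XML library rather than string concatenation keeps characters such as `&` in query strings correctly escaped.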