For AI training, RAG, and agent context
Feed your models Markdown from any website. Without a cloud middleman.
crawler.sh is a local crawler that turns websites into RAG-ready Markdown. Renders JavaScript with a custom engine, respects robots.txt, and runs on your laptop. No headless Chrome, no per-page fees, no API quota.
What this is for
Three common kinds of work start with the same problem: you need a website as clean Markdown, not as raw HTML and not as a screenshot.
RAG corpora
Build retrieval-augmented generation pipelines from real product docs, internal wikis, and reference sites. Each crawled page becomes a Markdown chunk with the title, URL, language, and word count attached, ready for embedding and indexing. Re-crawl on a schedule and the freshness signals tell you what to re-embed.
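As a sketch, turning that NDJSON into embedding-ready chunks can be a single jq pass. The field names url, title, language, and markdown here are assumptions about the record schema, not confirmed names; adjust them to what your .crawl file actually contains.

# One embedding-ready chunk per line (field names are illustrative):
jq -c 'select(.markdown) | {url, title, language, markdown}' example.crawl > chunks.ndjson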
Fine-tuning datasets
Assemble Markdown training sets from any public corpus without writing a scraper. Output is consistent across pages: real article body, no page chrome, no nav, no cookie banners. Bulk export the entire crawl as a Markdown archive so you can hand the directory to a tokenizer pipeline.
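One way to tighten a training set before tokenizing is to drop thin pages by word count. This is a sketch: the field name word_count and the 200-word threshold are both assumptions, not part of the documented schema.

# Keep only pages with a substantial body (threshold is illustrative):
jq -c 'select(.word_count != null and .word_count >= 200)' example.crawl > train.ndjson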
Agent context
Give a coding agent or research assistant the actual content of a docs site instead of asking it to browse one URL at a time. One command crawls the site, the next pipes the Markdown into the agent. Run it through the MCP server or stream NDJSON straight to your tool.
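A minimal sketch of that pipe, assuming the crawl wrote docs.example.com.crawl and your agent reads context on stdin. my-agent --context - is a hypothetical stand-in for whatever tool you actually run, not a real command.

crawler crawl https://docs.example.com --extract-content
jq -r 'select(.markdown) | .markdown' docs.example.com.crawl | my-agent --context -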
URL in. Markdown out.
Three commands cover the common path from a public site to a clean Markdown corpus on disk. Same flow from the terminal or the desktop app.
Crawl the site
crawler crawl https://example.com

Crawls every same-host page up to your limit. JavaScript-heavy pages are auto-detected and rendered. robots.txt is honored by default; pacing adapts to the site.
Extract the content
crawler crawl https://example.com --extract-content

Every page is captured as Markdown alongside its title, URL, language, byline, word count, and excerpt. The output is a single NDJSON file you can stream into anything.
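To see the record shape before wiring it into a pipeline, peek at the first line. The filename example.crawl is an assumption; use whatever path the crawl wrote.

head -n 1 example.crawl | jq .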
Bulk export as Markdown
Desktop app -> Downloads -> Content Archive

On the desktop app, the Pro-tier Content Archive download bundles the entire crawl as a Markdown archive ready for your RAG or fine-tune pipeline. From the CLI, the same Markdown lives in the per-page records of the .crawl NDJSON and can be split out with one line of jq.
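A sketch of that jq step, assuming each record carries its Markdown in a field named markdown and its address in url; the exact schema may differ.

# One line: concatenate every page's Markdown into a single corpus file.
jq -r 'select(.markdown) | .markdown' example.crawl > corpus.md

# Or one file per page (bash; URL-derived filenames are illustrative):
mkdir -p corpus
jq -c 'select(.markdown)' example.crawl | while IFS= read -r rec; do
  name=$(jq -r '.url' <<<"$rec" | tr -c 'A-Za-z0-9\n' '-')
  jq -r '.markdown' <<<"$rec" > "corpus/${name}.md"
done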
Why our rendering is different
JavaScript rendering is the part where most scrapers either give up or get expensive. crawler.sh ships a quieter, lighter render path that gets through more sites and stays on the right side of politeness.
Custom JavaScript render engine
No headless Chrome dependency. A purpose-built JavaScript engine renders SPAs in-process, with a small fraction of the memory and startup cost. React, Vue, Next, Nuxt, and similar single-page apps are auto-detected and rendered without you flipping a switch.
Chrome 131 TLS fingerprint
Outbound HTTPS handshakes match what Chrome 131 sends today, including the JA3 and JA4 fingerprints that bot defenses look at. Many sites that block headless Chrome will pass crawler.sh through to the real content instead of serving a placeholder.
Shared cookie jar
JavaScript-driven fetch calls during rendering use the same cookie jar as the main crawler. Session-walled pages and pages that depend on a cookie set earlier in the crawl render with the correct state instead of looking like a fresh anonymous visit.
Polite by default
Respects robots.txt out of the box, including Disallow, Allow, and Crawl-delay. Per-host backoff doubles the inter-request delay on 429 and 403 responses, then halves it after a streak of successes, so crawls that would once have gotten you blocked now finish. Pass --ignore-robots if you really need it.
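For intuition, here is an illustrative shell sketch of that backoff policy; it is not crawler.sh's implementation, and the 1-second starting delay and 5-success streak are made-up constants.

delay=1; streak=0
while IFS= read -r url; do
  sleep "$delay"
  status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  if [ "$status" = 429 ] || [ "$status" = 403 ]; then
    delay=$((delay * 2)); streak=0          # double the delay on pushback
  else
    streak=$((streak + 1))
    if [ "$streak" -ge 5 ] && [ "$delay" -gt 1 ]; then
      delay=$((delay / 2)); streak=0        # halve it after a streak of successes
    fi
  fi
done < urls.txt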
Why local matters
Cloud scrapers charge per page and route every byte through their infrastructure. crawler.sh runs on your machine, so the only thing that scales with crawl size is your own bandwidth.
Cloud scraper figures are ballparks drawn from publicly listed vendor pricing and change frequently; check each vendor for current rates.
Stop renting a scraper. Run your own.
Install in one command. Crawl any site into clean Markdown in seconds. Free up to 1,000 pages, $99 a year for 10,000.