For AI training, RAG, and agent context
Feed your models Markdown from any website. Without a cloud middleman.
crawler.sh is a local crawler that turns websites into RAG-ready Markdown. Renders JavaScript with a custom engine, respects robots.txt, and runs on your laptop. No headless Chrome, no per-page fees, no API quota.
What this is for
Three common kinds of work start with the same problem: you need a website as clean Markdown, not as raw HTML and not as a screenshot.
RAG corpora
Build retrieval-augmented generation pipelines from real product docs, internal wikis, and reference sites. Each crawled page becomes a Markdown chunk with the title, URL, language, and word count attached, ready for embedding and indexing. Re-crawl on a schedule and the freshness signals tell you what to re-embed.
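As a sketch, turning that NDJSON into embedding-ready chunks can be a single jq pass. The field names url, title, language, and markdown here are assumptions about the record schema, not confirmed names; adjust them to what your .crawl file actually contains.

# One embedding-ready chunk per line (field names are illustrative):
jq -c 'select(.markdown) | {url, title, language, markdown}' example.crawl > chunks.ndjson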
Fine-tuning datasets
Assemble Markdown training sets from any public corpus without writing a scraper. Output is consistent across pages: real article body, no page chrome, no nav, no cookie banners. Bulk export the entire crawl as a Markdown archive so you can hand the directory to a tokenizer pipeline.
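One way to tighten a training set before tokenizing is to drop thin pages by word count. This is a sketch: the field name word_count and the 200-word threshold are both assumptions, not part of the documented schema.

# Keep only pages with a substantial body (threshold is illustrative):
jq -c 'select(.word_count != null and .word_count >= 200)' example.crawl > train.ndjson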
Agent context
Give a coding agent or research assistant the actual content of a docs site instead of asking it to browse one URL at a time. One command crawls the site, the next pipes the Markdown into the agent. Run it through the MCP server or stream NDJSON straight to your tool.
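A minimal sketch of that pipe, assuming the crawl wrote docs.example.com.crawl and your agent reads context on stdin. my-agent --context - is a hypothetical stand-in for whatever tool you actually run, not a real command.

crawler crawl https://docs.example.com --extract-content
jq -r 'select(.markdown) | .markdown' docs.example.com.crawl | my-agent --context -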
URL in. Markdown out.
Three commands cover the common path from a public site to a clean Markdown corpus on disk. Same flow from the terminal or the desktop app.
Crawl the site
crawler crawl https://example.com

Crawls every same-host page up to your limit. JavaScript-heavy pages are auto-detected and rendered. robots.txt is honored by default; pacing adapts to the site.
Extract the content
crawler crawl https://example.com --extract-content

Every page is captured as Markdown alongside its title, URL, language, byline, word count, and excerpt. The output is a single NDJSON file you can stream into anything.
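To see the record shape before wiring it into a pipeline, peek at the first line. The filename example.crawl is an assumption; use whatever path the crawl wrote.

head -n 1 example.crawl | jq .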
Bulk export as Markdown
Desktop app -> Downloads -> Content Archive

On the desktop app, the Pro-tier Content Archive download bundles the entire crawl as a Markdown archive ready for your RAG or fine-tune pipeline. From the CLI, the same Markdown lives in the per-page records of the .crawl NDJSON and can be split out with one line of jq.
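A sketch of that jq step, assuming each record carries its Markdown in a field named markdown and its address in url; the exact schema may differ.

# One line: concatenate every page's Markdown into a single corpus file.
jq -r 'select(.markdown) | .markdown' example.crawl > corpus.md

# Or one file per page (bash; URL-derived filenames are illustrative):
mkdir -p corpus
jq -c 'select(.markdown)' example.crawl | while IFS= read -r rec; do
  name=$(jq -r '.url' <<<"$rec" | tr -c 'A-Za-z0-9\n' '-')
  jq -r '.markdown' <<<"$rec" > "corpus/${name}.md"
done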
Why our rendering is different
JavaScript rendering is the part where most scrapers either give up or get expensive. crawler.sh ships a quieter, lighter render path that gets through more sites and stays on the right side of politeness.
Custom JavaScript render engine
No headless Chrome dependency. A purpose-built JavaScript engine renders SPAs in-process, with a small fraction of the memory and startup cost. React, Vue, Next, Nuxt, and similar single-page apps are auto-detected and rendered without you flipping a switch.
Chrome 131 TLS fingerprint
Outbound HTTPS handshakes match what Chrome 131 sends today, including the JA3 and JA4 fingerprints that bot defenses look at. Many sites that block headless Chrome will pass crawler.sh through to the real content instead of serving a placeholder.
Shared cookie jar
JavaScript-driven fetch calls during rendering use the same cookie jar as the main crawler. Session-walled pages and pages that depend on a cookie set earlier in the crawl render with the correct state instead of looking like a fresh anonymous visit.
Polite by default
Respects robots.txt out of the box, including Disallow, Allow, and Crawl-delay. Per-host backoff doubles the inter-request delay on 429 and 403 responses, then halves it after a streak of successes, so crawls that would once have gotten you blocked now finish. Pass --ignore-robots if you really need it.
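For intuition, here is an illustrative shell sketch of that backoff policy; it is not crawler.sh's implementation, and the 1-second starting delay and 5-success streak are made-up constants.

delay=1; streak=0
while IFS= read -r url; do
  sleep "$delay"
  status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  if [ "$status" = 429 ] || [ "$status" = 403 ]; then
    delay=$((delay * 2)); streak=0          # double the delay on pushback
  else
    streak=$((streak + 1))
    if [ "$streak" -ge 5 ] && [ "$delay" -gt 1 ]; then
      delay=$((delay / 2)); streak=0        # halve it after a streak of successes
    fi
  fi
done < urls.txt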
Why local matters
Cloud scrapers charge per page and route every byte through their infrastructure. crawler.sh runs on your machine, so the only thing that scales with crawl size is your own bandwidth.
Cloud scraper figures are ballparks drawn from publicly listed vendor pricing and change frequently; check each vendor for current rates.
Stop renting a scraper. Run your own.
Install in one command. Crawl any site into clean Markdown in seconds. Free up to 1,000 pages, $99 a year for 10,000.