How crawler.sh renders JavaScript without headless Chrome

The cost of “just use headless Chrome”

Most crawlers that promise to render JavaScript do it by spinning up a full headless Chrome process under the hood. It works, and for a one-page-at-a-time SaaS that hides the bill, it is a reasonable default. The problem starts when you try to do it on your own machine, or worse, when you try to crawl a few thousand pages of a JavaScript-heavy site.

A single headless Chrome process takes roughly 200 MB of resident memory before any page is loaded. Spin up four workers for modest concurrency and you are at about 800 MB just to render. Cold start is multiple seconds per worker. The TLS fingerprint and a long list of telltale runtime objects (the missing chrome.runtime.app, the giveaway navigator.webdriver, the wrong WebGL vendor string) make headless Chrome easy to detect, so any site with serious bot defenses serves it a placeholder instead of the real content.

crawler.sh ships its own JavaScript render path instead. It runs in-process, in a fraction of the memory, with a TLS handshake that matches what a current Chrome release sends. This post walks through what that means in practice.

What the render engine actually does

When the crawler hits a page that needs JavaScript to render content, the engine does four things in sequence:

1. Fetch the HTML over a TLS connection that matches Chrome 131. Many sites that block headless Chrome on the basis of the JA3 or JA4 fingerprint let this through.
2. Parse the HTML into a real DOM with CSS selector matching. The same selector engine browsers use, not a regex-and-prayer parser.
3. Run the page’s JavaScript against a populated window, document, location, and navigator. setTimeout, setInterval, fetch, XMLHttpRequest, URL, URLSearchParams, Blob, FormData, FileReader, and history.pushState are all wired up. getComputedStyle returns a CSSStyleDeclaration shape, not a bare proxy. canvas.getContext returns a shimmed 2D context or a WebGL stub with a plausible vendor string instead of null.
4. Extract the rendered text as clean Markdown.
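
To make the JavaScript step above a little more concrete, here is a minimal sketch of how an in-process engine could provide setTimeout without waiting on wall-clock time. It is not crawler.sh's actual implementation: the VirtualTimers class, the runUntilIdle method, and the millisecond budget are all hypothetical names chosen for illustration. The idea is that timers are queued against a virtual clock and drained in due order until the page goes quiet or a budget runs out.

```typescript
// Hypothetical sketch: a virtual-clock timer queue, not crawler.sh's real code.
type TimerCallback = () => void;

interface PendingTimer {
  id: number;
  dueAt: number; // virtual time in milliseconds
  callback: TimerCallback;
}

class VirtualTimers {
  private now = 0;
  private nextId = 1;
  private pending: PendingTimer[] = [];

  setTimeout(callback: TimerCallback, delayMs: number): number {
    const id = this.nextId++;
    this.pending.push({ id, dueAt: this.now + Math.max(0, delayMs), callback });
    return id;
  }

  clearTimeout(id: number): void {
    this.pending = this.pending.filter((t) => t.id !== id);
  }

  // Run timers in due order until none remain or the next one is past the budget.
  // Returns how many callbacks ran.
  runUntilIdle(budgetMs: number): number {
    let ran = 0;
    while (this.pending.length > 0) {
      // Earliest due time first; ties fall back to creation order.
      this.pending.sort((a, b) => a.dueAt - b.dueAt || a.id - b.id);
      const next = this.pending[0];
      if (next === undefined || next.dueAt > budgetMs) break;
      this.pending.shift();
      this.now = next.dueAt; // jump the virtual clock instead of sleeping
      next.callback();
      ran++;
    }
    return ran;
  }
}

// Usage: a page script that schedules work, including a nested timer.
const timers = new VirtualTimers();
const log: string[] = [];
timers.setTimeout(() => { log.push("late"); }, 300);
timers.setTimeout(() => {
  log.push("first");
  timers.setTimeout(() => { log.push("nested"); }, 50);
}, 100);
const cancelled = timers.setTimeout(() => { log.push("never"); }, 10);
timers.clearTimeout(cancelled);
const ran = timers.runUntilIdle(1000);
```

Because the clock is virtual, a page that waits 300 ms before injecting content costs the crawler no real waiting time in this sketch. A real engine would also have to interleave network callbacks and microtasks, which this example leaves out.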

There is no separate browser process, no IPC, no DevTools protocol round-trips. The engine is a library that crawler-core links against. One in-process render worker is on the order of 10 to 20 MB of resident memory, not 200.

Fingerprint hygiene without the heavy lifting

Bot defenses do not need much to flag a page hit as headless. A few high-signal tells we close off by default:

- Synthetic globals used by the engine to bridge JavaScript and the host are hidden from Object.keys(window) and from for...in loops. A site that enumerates the window object sees something that looks like a normal Chrome page, not a debug runtime.
- canvas.getContext('2d') and getContext('webgl' | 'webgl2') return objects, not null. The WebGL stubs report Apple-Silicon-on-Chrome vendor strings, which are plausible for a modern macOS user.
- The TLS ClientHello is shaped to match Chrome 131, including cipher order and extension order. JA3 and JA4 fingerprinting both pass.
- A single cookie jar is shared between the static fetch and the JavaScript-driven fetch calls during rendering. A cf_clearance cookie set on page one is sent on page two automatically. Session-walled sites render with the right state instead of looking like a fresh anonymous visit on every page.
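
The first item in that list, hiding bridge globals from enumeration, can be illustrated with standard JavaScript property descriptors. The sketch below is an assumption about the general technique, not a description of crawler.sh internals: the fakeWindow object and the __hostBridge name are made up for the example. A property defined with enumerable: false does not appear in Object.keys or for...in, yet it stays reachable by name.

```typescript
// Hypothetical stand-in for the page's window object.
const fakeWindow: Record<string, unknown> = {
  document: {},
  navigator: {},
  location: {},
};

// "__hostBridge" is an invented name for illustration only.
// enumerable: false keeps it out of Object.keys() and for...in.
Object.defineProperty(fakeWindow, "__hostBridge", {
  value: { send: (msg: string): string => `host received: ${msg}` },
  enumerable: false,
  writable: false,
  configurable: false,
});

// Collect keys the way a probing script might with for...in.
function enumerate(obj: object): string[] {
  const seen: string[] = [];
  for (const key in obj) {
    seen.push(key);
  }
  return seen;
}

const visibleKeys = Object.keys(fakeWindow);
const forInKeys = enumerate(fakeWindow);
const stillReachable = "__hostBridge" in fakeWindow;
const ownNames = Object.getOwnPropertyNames(fakeWindow);
```

Note that Object.getOwnPropertyNames still lists non-enumerable properties, so ownNames includes the bridge here. That is one example of the probes a determined fingerprinting script could still use against a plain descriptor-based approach.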

None of this is bullet-proof. A determined fingerprinting service with enough JavaScript probes can still tell. But “good enough to get the real content for the long tail of sites” is a much lower bar, and that is the bar that matters for an AI ingestion workflow.

Politeness, by the way

The same release that added the fingerprint hygiene also moved crawler.sh to respect robots.txt by default. Disallow:, Allow:, and Crawl-delay: are honored. Per-host backoff doubles the request delay on a 429 or 403 and halves it after a streak of successes. The effective wait between requests is the maximum of your configured delay, the site profiler’s posture floor, any Crawl-delay from robots.txt, and the current dynamic backoff.
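
The delay rules above can be sketched in a few lines. This is an illustration under stated assumptions, not crawler.sh's code: the HostBackoff class, the DelayInputs field names, the 500 ms base, the 8000 ms cap, and the three-success streak are all hypothetical values picked for the example, and any status other than 429 or 403 is treated as a success for simplicity.

```typescript
// Hypothetical inputs to the effective wait; field names are assumptions.
interface DelayInputs {
  configuredMs: number;   // delay set by the user
  postureFloorMs: number; // floor chosen by the site profiler
  crawlDelayMs: number;   // Crawl-delay from robots.txt, 0 if absent
  backoffMs: number;      // current dynamic backoff for this host
}

// The effective wait is the largest of the four values.
function effectiveDelayMs(d: DelayInputs): number {
  return Math.max(d.configuredMs, d.postureFloorMs, d.crawlDelayMs, d.backoffMs);
}

class HostBackoff {
  private delayMs: number;
  private successes = 0;
  private readonly baseMs: number;
  private readonly maxMs: number;
  private readonly streakLength: number;

  constructor(baseMs: number, maxMs: number, streakLength: number) {
    this.baseMs = baseMs;
    this.maxMs = maxMs;
    this.streakLength = streakLength;
    this.delayMs = baseMs;
  }

  // Update the delay from one response status and return the new delay.
  record(status: number): number {
    if (status === 429 || status === 403) {
      // Double on a rate-limit or forbidden response, up to the cap.
      this.delayMs = Math.min(this.delayMs * 2, this.maxMs);
      this.successes = 0;
    } else {
      this.successes += 1;
      if (this.successes >= this.streakLength) {
        // Halve after a streak of successes, never below the base.
        this.delayMs = Math.max(this.delayMs / 2, this.baseMs);
        this.successes = 0;
      }
    }
    return this.delayMs;
  }
}

// Usage: two 429s, then three successes in a row.
const backoff = new HostBackoff(500, 8000, 3);
const afterFirst429 = backoff.record(429);
const afterSecond429 = backoff.record(429);
backoff.record(200);
backoff.record(200);
const afterStreak = backoff.record(200);

const wait = effectiveDelayMs({
  configuredMs: 250,
  postureFloorMs: 500,
  crawlDelayMs: 2000,
  backoffMs: afterStreak,
});
```

In this run the backoff climbs from 500 ms to 2000 ms and then relaxes to 1000 ms, but the final wait is still 2000 ms because the Crawl-delay from robots.txt is the largest of the four inputs.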

This matters more than it sounds for AI workloads. A training dataset assembled by ignoring robots.txt and triggering rate-limit waves is a liability. A dataset assembled by a crawler that slows down on its own, identifies itself honestly, and stops at every Disallow rule is one you can ship.

What it means for you

Three things change when the render engine runs in-process instead of as a separate headless Chrome process.

Cost. Four render workers fit in about 60 to 80 MB instead of about 800 MB. You can run crawler.sh on a laptop on battery and watch the fan stay quiet. There is no per-page fee because there is no vendor in the middle.

Latency. Cold start is milliseconds, not seconds. A 200-page crawl with JavaScript rendering on a modest network completes in tens of seconds, not minutes.

Where the data goes. Pages are fetched directly by your machine. Nothing is routed through us. If you are building a RAG corpus from a corporate doc site, that means crawler.sh is one of the few options where the corpus never leaves your laptop or your VPC.

If you are building something AI-adjacent and you have been paying a cloud scraper per page, give crawler.sh a try. Free up to 400 pages, $99 a year for 10,000. Same engine in the CLI, the desktop app, and the new MCP server so your agent can drive it.
