v0.8.0: Polite Crawling and Fingerprint Hygiene

What’s New in v0.8.0

This release lands three connected pieces of work on top of v0.7.8. The crawler now behaves like a polite, well-identified client: it slows down when a site pushes back, honors robots.txt by default, and renders JavaScript in a way that is harder to fingerprint as headless. The net result is fewer false 429 / 403 cliffs on protected sites, fewer “you look like a bot” placeholders, and a crawler that adjusts itself to the site instead of running flat out.

Adaptive Crawl Posture

The site profiler already classifies the target’s protection level (Easy, Medium, High, Extreme) by sampling the first few pages. In v0.8.0 that classification is no longer informational. It now feeds a CrawlPosture that the crawler reads on every request:

Heavier protection means a longer minimum delay between requests and a longer drain budget for JavaScript-rendered pages.
Easier sites stay fast.
Posture can only raise the effective delay above your --delay value, never lower it. Whatever floor you configured is still the floor; posture just adds a higher floor when the site clearly needs it.

The practical effect is that protected sites are crawled politely instead of being hammered and blocked partway through. Unprotected sites should feel about the same as before.
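
As a rough illustration of how a posture floor can combine with a user-configured delay, here is a minimal Python sketch. It is not crawler.sh's implementation: the Protection enum, the CrawlPosture fields, and every number in the POSTURES table are hypothetical, chosen only to show the "raise, never lower" rule.

```python
from dataclasses import dataclass
from enum import Enum


class Protection(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HIGH = "high"
    EXTREME = "extreme"


@dataclass(frozen=True)
class CrawlPosture:
    min_delay_ms: int     # floor on the wait between requests
    drain_budget_ms: int  # how long a JS-rendered page is given to settle


# Hypothetical values for illustration only; the real table is not documented here.
POSTURES = {
    Protection.EASY: CrawlPosture(min_delay_ms=0, drain_budget_ms=500),
    Protection.MEDIUM: CrawlPosture(min_delay_ms=500, drain_budget_ms=1500),
    Protection.HIGH: CrawlPosture(min_delay_ms=2000, drain_budget_ms=3000),
    Protection.EXTREME: CrawlPosture(min_delay_ms=5000, drain_budget_ms=6000),
}


def effective_delay_ms(user_delay_ms: int, level: Protection) -> int:
    """Posture may raise the delay above --delay but never lower it."""
    return max(user_delay_ms, POSTURES[level].min_delay_ms)
```

With these made-up numbers, a --delay of 1000 ms stays at 1000 ms on an Easy site and becomes 5000 ms on an Extreme one.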

robots.txt by Default

crawler.sh now reads and respects robots.txt out of the box:

Disallow:, Allow:, and Crawl-delay: directives for the * user-agent group are honored.
The longest matching rule wins, so a more specific Allow overrides a broader Disallow, with RFC 9309 wildcard handling for * and $.
Disallowed URLs are filtered out of the crawl queue, sitemap seeds, and even the start URL itself. A new “skipped by robots” event surfaces these so the desktop app and CLI summary can show what was excluded and why.
Crawl-delay is clamped to 5 minutes so a hostile or accidental value cannot wedge a crawl.
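
The matching rules above can be sketched in a few lines of Python. This is an illustrative approximation, not the crawler's parser: the function names and the (directive, pattern) rule shape are hypothetical, and the tie-break in favor of Allow is assumed from RFC 9309 rather than stated in these notes.

```python
import re

MAX_CRAWL_DELAY_S = 300.0  # 5 minutes, matching the clamp described above


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern: * is any run, a trailing $ anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def is_allowed(path: str, rules: list) -> bool:
    """rules is a list of (directive, pattern) pairs, e.g. ("disallow", "/private/")."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        # An empty pattern matches nothing; re.match gives prefix semantics.
        if not pattern or not pattern_to_regex(pattern).match(path):
            continue
        is_allow = directive == "allow"
        # Longer pattern wins; on equal length, Allow wins (assumed tie-break).
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len, allowed = len(pattern), is_allow
    return allowed


def clamp_crawl_delay(seconds: float) -> float:
    """Keep Crawl-delay within [0, 5 minutes]."""
    return min(max(seconds, 0.0), MAX_CRAWL_DELAY_S)
```

For example, with rules Disallow: /private/ and Allow: /private/public/, the path /private/public/page is allowed because the Allow pattern is the longer match.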

If you genuinely need to ignore robots.txt for a one-off audit, the CLI grew an opt-out flag:

crawler crawl --ignore-robots https://example.com

The default is to respect robots.txt. Existing users see no breakage beyond that new default; the flag itself is purely additive.

Per-Host Adaptive Backoff

The crawler now reacts to pushback in real time. On a 429, 403, or 503 response, the dynamic inter-request delay doubles (starting at 250 ms, capped at 10 s). After 5 consecutive successful responses, it halves back down toward zero. The effective wait between requests is the maximum of:

Your configured --delay
The posture floor for the site’s difficulty
Any Crawl-delay from robots.txt
The current dynamic backoff

Whichever is largest wins. The crawler slows itself down at the first sign of trouble instead of climbing into a hard block, and it speeds back up on its own once the site is happy again.
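
A minimal Python sketch of the backoff described above, using the numbers from this section (250 ms start, 10 s cap, relax after 5 consecutive successes). The class and function names are hypothetical. Two details are assumptions, not documented behavior: a halved value below 250 ms snaps to zero, and statuses outside 2xx/3xx and 429/403/503 leave the state unchanged.

```python
from dataclasses import dataclass

BACKOFF_START_MS = 250
BACKOFF_CAP_MS = 10_000
SUCCESSES_TO_RELAX = 5
PUSHBACK_STATUSES = {429, 403, 503}


@dataclass
class HostBackoff:
    delay_ms: int = 0
    success_streak: int = 0

    def record(self, status: int) -> None:
        """Update the dynamic delay for one host after a response."""
        if status in PUSHBACK_STATUSES:
            self.success_streak = 0
            doubled = self.delay_ms * 2 if self.delay_ms else BACKOFF_START_MS
            self.delay_ms = min(doubled, BACKOFF_CAP_MS)
        elif 200 <= status < 400:
            self.success_streak += 1
            if self.success_streak >= SUCCESSES_TO_RELAX:
                self.success_streak = 0
                halved = self.delay_ms // 2
                # Assumption: anything below the starting step drops to zero.
                self.delay_ms = halved if halved >= BACKOFF_START_MS else 0
        # Assumption: other statuses (e.g. 404, 500) change nothing.


def effective_wait_ms(user_delay_ms: int, posture_floor_ms: int,
                      robots_delay_ms: int, backoff: HostBackoff) -> int:
    """The largest of the four inputs wins."""
    return max(user_delay_ms, posture_floor_ms, robots_delay_ms, backoff.delay_ms)
```

Keeping one HostBackoff per host means a struggling host slows down without penalizing other hosts in the same crawl.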

Quieter JavaScript Rendering

The JavaScript render path used to leak several signals that modern bot defenses fingerprint. v0.8.0 cleans those up:

Synthetic internal globals are hidden from Object.keys(window) and from for...in loops.
canvas.getContext('2d') no longer returns null. It returns a shimmed CanvasRenderingContext2D. getContext('webgl') and getContext('webgl2') return WebGL stubs with plausible Chrome-on-Apple-Silicon vendor strings instead of refusing the call.
URL, URLSearchParams, window.history.pushState, window.history.replaceState, Blob, File, FormData, and FileReader are now defined. Some sites used the absence of these to reject the page before any content rendered.
getComputedStyle returns an object with the real CSSStyleDeclaration shape rather than a bare wrapper.

For most crawls this is invisible. On sites that were silently serving a “you look like a bot” interstitial, the real page now comes through.

One Identity Across the Crawl

The outbound HTTP client used by the JS render path now applies a Chrome 131 TLS fingerprint and accepts a shared cookie jar from the caller. JavaScript-driven fetch() calls during rendering and the main crawl’s link discovery now share one identity:

Same TLS profile.
Same cookies, including session cookies set during a previous page in the same crawl.

This matters for login-walled or session-pinned pages: rendered content matches what the main crawler sees, and a session established on one page carries through to the next.
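
To make the shared cookie jar idea concrete, here is a small Python sketch using only the standard library. It is a stand-in, not crawler.sh's HTTP client: the client names and the make_session_cookie and cookie_header_for helpers are hypothetical, the cookie is built by hand so nothing touches the network, and the TLS fingerprint side is not modeled at all.

```python
import urllib.request
from http.cookiejar import Cookie, CookieJar
from typing import Optional


def make_session_cookie(name: str, value: str, domain: str) -> Cookie:
    """Build a session cookie by hand so the sketch needs no network."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


shared_jar = CookieJar()

# Two independent clients that read and write the same jar.
crawl_client = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(shared_jar))
render_client = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(shared_jar))

# Pretend the main crawl received this cookie on an earlier page.
shared_jar.set_cookie(make_session_cookie("session", "abc123", "example.com"))


def cookie_header_for(url: str, jar: CookieJar) -> Optional[str]:
    """Return the Cookie header a client backed by this jar would send."""
    request = urllib.request.Request(url)
    jar.add_cookie_header(request)
    return request.get_header("Cookie")
```

Because both clients hold the same jar object, a cookie stored by one is sent by the other on its next request to the same host.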

Behavior Notes

Crawls of unprotected sites should feel about the same as v0.7.x.
Crawls of protected or rate-limited sites will be slower but should actually finish, instead of catching a wave of 429s and bailing partway through.
URLs blocked by robots.txt will be skipped by default. Pass --ignore-robots to restore the previous behavior on the CLI.
.crawl output files from earlier versions remain readable.