v0.8.0: Polite Crawling and Fingerprint Hygiene
Adaptive per-site posture, robots.txt support, per-host backoff, and a quieter JS render path that looks less like a headless browser to bot defenses.
What’s New in v0.8.0
This release lands three connected pieces of work on top of v0.7.8. The crawler now behaves like a polite, well-identified client: it slows down when a site pushes back and honors robots.txt by default, and its JavaScript render path is harder to fingerprint as headless. The net result is fewer false 429 / 403 cliffs on protected sites, fewer “you look like a bot” placeholders, and a crawler that adjusts itself to the site instead of running flat out.
Adaptive Crawl Posture
The site profiler already classifies the target’s protection level (Easy, Medium, High, Extreme) by sampling the first few pages. In v0.8.0 that classification is no longer purely informational. It now feeds a CrawlPosture that the crawler reads on every request:
- Heavier protection means a longer minimum delay between requests and a longer drain budget for JavaScript-rendered pages.
- Easier sites stay fast.
- Posture can only raise the effective delay above your --delay value, never lower it. Whatever floor you configured is still the floor; posture just adds a higher floor when the site clearly needs it.
The practical effect is that protected sites are crawled politely instead of being hammered and blocked partway through. Unprotected sites should feel about the same as before.
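As a rough illustration of the "posture only raises the floor" rule, here is a minimal Rust sketch. The `Protection` variants and the `CrawlPosture` name come from the notes above; the field names, the `for_protection` and `effective_delay` functions, and every numeric value are assumptions made for the example, not the values crawler.sh ships with.

```rust
use std::time::Duration;

/// Protection levels reported by the site profiler (names from the release notes).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Protection {
    Easy,
    Medium,
    High,
    Extreme,
}

/// Hypothetical posture shape: a minimum inter-request delay and a JS drain budget.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CrawlPosture {
    pub min_delay: Duration,
    pub js_drain_budget: Duration,
}

impl CrawlPosture {
    /// Illustrative numbers only; the real thresholds are not documented here.
    pub fn for_protection(level: Protection) -> Self {
        let (delay_ms, drain_ms) = match level {
            Protection::Easy => (0, 500),
            Protection::Medium => (250, 1_000),
            Protection::High => (1_000, 2_000),
            Protection::Extreme => (3_000, 4_000),
        };
        CrawlPosture {
            min_delay: Duration::from_millis(delay_ms),
            js_drain_budget: Duration::from_millis(drain_ms),
        }
    }

    /// The posture can raise the user's configured delay but never lower it.
    pub fn effective_delay(&self, user_delay: Duration) -> Duration {
        user_delay.max(self.min_delay)
    }
}
```

With these made-up numbers, a configured delay of 200 ms on a High site would become 1 s, while a configured delay of 5 s would stay at 5 s.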
robots.txt by Default
crawler.sh now reads and respects robots.txt out of the box:
- Disallow:, Allow:, and Crawl-delay: directives for the * user-agent group are honored.
- Longest-match Allow overrides Disallow, with RFC 9309 wildcard handling for * and $.
- Disallowed URLs are filtered out of the crawl queue, sitemap seeds, and even the start URL itself. A new “skipped by robots” event surfaces these so the desktop app and CLI summary can show what was excluded and why.
- Crawl-delay is clamped to 5 minutes so a hostile or accidental value cannot wedge a crawl.
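The matching rules in the list above can be sketched in Rust as follows. This is an independent illustration of RFC 9309-style longest-match evaluation, not crawler.sh's actual code: `Rule`, `Directive`, `pattern_matches`, `is_allowed`, and `clamp_crawl_delay` are hypothetical names. It assumes rules were already parsed from the * user-agent group, and that an Allow wins a tie with an equally long Disallow.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Directive {
    Allow,
    Disallow,
}

/// One parsed rule from the `*` user-agent group (hypothetical shape).
pub struct Rule {
    pub directive: Directive,
    pub pattern: String,
}

/// `*` matches any run of characters; a trailing `$` anchors the end of the path.
pub fn pattern_matches(pattern: &str, path: &str) -> bool {
    let (body, anchored) = match pattern.strip_suffix('$') {
        Some(stripped) => (stripped, true),
        None => (pattern, false),
    };
    let pieces: Vec<&str> = body.split('*').collect();
    // The first literal piece must match at the very start of the path.
    let first = pieces[0];
    if !path.starts_with(first) {
        return false;
    }
    let mut rest = &path[first.len()..];
    if pieces.len() == 1 {
        // No wildcard: a plain prefix match, or an exact match when anchored.
        return !anchored || rest.is_empty();
    }
    // Middle pieces are matched leftmost-first, consuming the path as we go.
    for &piece in &pieces[1..pieces.len() - 1] {
        match rest.find(piece) {
            Some(idx) => rest = &rest[idx + piece.len()..],
            None => return false,
        }
    }
    let last = pieces[pieces.len() - 1];
    if anchored {
        rest.ends_with(last)
    } else {
        rest.contains(last)
    }
}

/// The longest matching pattern decides; on a tie, Allow wins. No match means allowed.
pub fn is_allowed(rules: &[Rule], path: &str) -> bool {
    let mut best: Option<(usize, Directive)> = None;
    for rule in rules {
        // An empty pattern (e.g. a bare "Disallow:") restricts nothing.
        if rule.pattern.is_empty() || !pattern_matches(&rule.pattern, path) {
            continue;
        }
        let len = rule.pattern.len();
        let better = match best {
            None => true,
            Some((best_len, _)) => {
                len > best_len || (len == best_len && rule.directive == Directive::Allow)
            }
        };
        if better {
            best = Some((len, rule.directive));
        }
    }
    !matches!(best, Some((_, Directive::Disallow)))
}

/// Clamp a robots.txt Crawl-delay to the 5-minute ceiling described above.
pub fn clamp_crawl_delay(requested: Duration) -> Duration {
    requested.min(Duration::from_secs(300))
}
```

Under these assumptions, a rule set of Disallow: /private plus Allow: /private/public blocks /private/x but permits /private/public/a, and Disallow: /*.pdf$ blocks /docs/a.pdf but not /docs/a.pdf?x=1.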
If you genuinely need to ignore robots.txt for a one-off audit, the CLI grew an opt-out flag:
    crawler crawl --ignore-robots https://example.com
The default is to respect robots.txt. Existing users see no breaking change beyond that new default; the flag is purely additive.
Per-Host Adaptive Backoff
The crawler now reacts to pushback in real time. On a 429, 403, or 503 response, the dynamic inter-request delay doubles (starting at 250 ms, capped at 10 s). After 5 consecutive successful responses, it halves back down toward zero. The effective wait between requests is the maximum of:
- Your configured --delay
- The posture floor for the site’s difficulty
- Any Crawl-delay from robots.txt
- The current dynamic backoff
Whichever is largest wins. The crawler slows itself down at the first sign of trouble instead of climbing into a hard block, and it speeds back up on its own once the site is happy again.
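The backoff rule and the "largest wins" combination can be sketched as below. The 250 ms start, 10 s cap, and 5-success threshold come from the description above; everything else is an assumption made for illustration: the `HostBackoff` and `effective_wait` names, treating only 2xx responses as "successful", resetting the streak on any other status, and snapping the delay to zero once halving would drop it below 250 ms.

```rust
use std::time::Duration;

const BACKOFF_START: Duration = Duration::from_millis(250);
const BACKOFF_CAP: Duration = Duration::from_secs(10);
const SUCCESSES_TO_RELAX: u32 = 5;

/// Per-host dynamic delay state (hypothetical shape).
#[derive(Debug, Default)]
pub struct HostBackoff {
    delay: Duration,
    success_streak: u32,
}

impl HostBackoff {
    /// Feed one response status into the backoff state.
    pub fn record(&mut self, status: u16) {
        if matches!(status, 429 | 403 | 503) {
            // Pushback: start at 250 ms, then double up to the 10 s cap.
            self.success_streak = 0;
            self.delay = if self.delay.is_zero() {
                BACKOFF_START
            } else {
                (self.delay * 2).min(BACKOFF_CAP)
            };
        } else if (200..300).contains(&status) {
            // Assumption: only 2xx responses count as successes.
            self.success_streak += 1;
            if self.success_streak >= SUCCESSES_TO_RELAX {
                self.success_streak = 0;
                self.delay /= 2;
                // Assumption: below the starting step, drop straight to zero.
                if self.delay < BACKOFF_START {
                    self.delay = Duration::ZERO;
                }
            }
        } else {
            // Assumption: any other status breaks the run of successes.
            self.success_streak = 0;
        }
    }

    /// The current dynamic backoff for this host.
    pub fn current(&self) -> Duration {
        self.delay
    }
}

/// The wait before the next request is the largest of the four inputs.
pub fn effective_wait(
    user_delay: Duration,
    posture_floor: Duration,
    crawl_delay: Option<Duration>,
    backoff: Duration,
) -> Duration {
    user_delay
        .max(posture_floor)
        .max(crawl_delay.unwrap_or(Duration::ZERO))
        .max(backoff)
}
```

In this sketch, two pushback responses in a row move the delay from 0 to 250 ms to 500 ms. Repeated pushback settles at the 10 s cap, and five clean responses bring it back to 5 s.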
Quieter JavaScript Rendering
The JavaScript render path used to leak several signals that modern bot defenses fingerprint. v0.8.0 cleans those up:
- Synthetic internal globals are hidden from Object.keys(window) and from for...in loops.
- canvas.getContext('2d') no longer returns null. It returns a shimmed CanvasRenderingContext2D.
- getContext('webgl') and getContext('webgl2') return WebGL stubs with plausible Chrome-on-Apple-Silicon vendor strings instead of refusing the call.
- URL, URLSearchParams, window.history.pushState, window.history.replaceState, Blob, File, FormData, and FileReader are now defined. Some sites used the absence of these to reject the page before any content rendered.
- getComputedStyle returns an object with the real CSSStyleDeclaration shape rather than a bare wrapper.
For most crawls this is invisible. On sites that were silently serving a “you look like a bot” interstitial, the real page now comes through.
One Identity Across the Crawl
The outbound HTTP client used by the JS render path now applies a Chrome 131 TLS fingerprint and accepts a shared cookie jar from the caller. JavaScript-driven fetch() calls during rendering and the main crawl’s link discovery now share one identity:
- Same TLS profile.
- Same cookies, including session cookies set during a previous page in the same crawl.
This matters for login-walled or session-pinned pages: rendered content matches what the main crawler sees, and a session established on one page carries through to the next.
Behavior Notes
- Crawls of unprotected sites should feel about the same as v0.7.x.
- Crawls of protected or rate-limited sites will be slower but should actually finish, instead of catching a wave of 429s and bailing partway through.
- URLs blocked by robots.txt will be skipped by default. Pass --ignore-robots to restore the previous behavior on the CLI.
- .crawl output files from earlier versions remain readable.
About crawler.sh
crawler.sh is a fast Rust-based web crawler and SEO auditing tool that runs entirely on your own machine. Use the CLI for automation, scripts, and CI pipelines, or the desktop app for a visual dashboard with live crawl progress, SEO issue charts, and one-click exports.
Every release ships across both the CLI and the desktop app.
Download the latest version, or run crawler update from the terminal to upgrade an existing install.