April 28, 2026

v0.7.0: Built-in JavaScript Rendering for Web Crawls

Crawl JavaScript-heavy websites with built-in JS rendering. Auto-detects client-side frameworks and renders pages without a browser.

Mehmet Kose
3 min read

What’s New in v0.7.0

Built-in JavaScript Rendering

crawler.sh can now crawl JavaScript-heavy websites. A built-in JS rendering engine executes inline scripts and produces the final DOM, just like a browser would - but without launching a browser. Pages built with client-side frameworks are crawled with their full rendered content, including dynamically inserted text, links, and meta tags.

This means you no longer need a separate headless browser setup to crawl modern web applications. The rendering engine is lightweight, runs in-process, and is sandboxed with CPU timeouts, memory limits, and output size caps to prevent runaway scripts from affecting your crawl.
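In sketch form, the per-page flow looks something like this - a conceptual skeleton with hypothetical function names (needs_rendering, render_scripts, extract), not crawler.sh's actual internals:

// Conceptual sketch only: the function bodies are placeholders standing in
// for the real fetch/render/extract stages.
fn needs_rendering(html: &str) -> bool { html.contains("<script") } // placeholder heuristic
fn render_scripts(html: &str) -> String { html.to_string() }        // stands in for the JS engine
fn extract(html: &str) -> String { html.to_string() }               // stands in for extraction

fn process_page(raw_html: String) -> String {
    // Rendering is an in-process transformation of the fetched HTML;
    // no browser process is spawned per page.
    let final_html = if needs_rendering(&raw_html) {
        render_scripts(&raw_html) // sandboxed JS engine, same process
    } else {
        raw_html
    };
    extract(&final_html) // titles, links, meta tags, text
}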

Auto-Detection

By default, crawler.sh uses Auto mode: it samples the first few pages of a crawl and analyzes them for signals that indicate client-side rendering is needed. The profiler looks for empty body shells, high script-to-text ratios, framework-specific mount points, and bot protection signatures. If it detects that pages need rendering, the JS engine activates automatically for the rest of the crawl.
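As a rough illustration, signals of that kind can be approximated in a few lines. The string checks and thresholds below are illustrative guesses, not crawler.sh's actual profiler logic:

// Illustrative heuristic only - thresholds and signal strings are guesses.
fn looks_client_rendered(html: &str) -> bool {
    let lower = html.to_lowercase();

    // Framework-specific mount points, e.g. <div id="root"> or <div id="app">
    let mount_point = lower.contains("id=\"root\"") || lower.contains("id=\"app\"");

    // Empty body shell: a <body> containing almost no markup
    let empty_shell = lower
        .split("<body")
        .nth(1)
        .map(|body| body.len() < 500)
        .unwrap_or(false);

    // High script-to-text ratio: many <script> tags in a small document
    let script_heavy = lower.matches("<script").count() >= 5 && lower.len() < 20_000;

    mount_point || empty_shell || script_heavy
}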

No configuration is required for most sites. If you know upfront whether a site needs rendering, you can override the behavior:

# Auto-detect (default)
crawler crawl https://example.com
# Force JS rendering on every page
crawler crawl https://example.com --render
# Skip JS rendering entirely (faster for static sites)
crawler crawl https://example.com --no-render

What Gets Rendered

The JS engine executes all inline <script> tags in the page, runs setTimeout and setInterval callbacks, resolves Promises, and produces the final DOM state. The resulting HTML is then used for content extraction, link discovery, and SEO analysis - so titles set by JavaScript, dynamically loaded navigation links, and client-rendered meta tags are all captured.
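For example, take a page that is an empty shell until its inline script runs (a made-up page, embedded here as a Rust string for illustration):

fn main() {
    // Hypothetical client-rendered page: the initial HTML carries no content.
    let raw = r#"<html>
      <head><title>Loading...</title></head>
      <body>
        <div id="app"></div>
        <script>
          document.title = "Product Catalog";
          document.getElementById("app").innerHTML =
            '<a href="/products">Browse products</a>';
        </script>
      </body>
    </html>"#;

    // Without rendering, extraction sees only "Loading..." and no links.
    // After rendering, it sees the final DOM, roughly:
    //   <title>Product Catalog</title>
    //   <a href="/products">Browse products</a>
    println!("{raw}");
}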

Resource Limits

The rendering engine enforces strict resource limits to keep crawls safe and predictable:

  • CPU timeout - scripts that run too long are terminated
  • Memory cap - prevents memory bombs from affecting the crawl
  • Node limit - caps the number of DOM nodes a script can create
  • Output size limit - prevents oversized rendered pages from consuming disk space
  • Timer limits - caps the number of setTimeout/setInterval invocations

If any limit is hit, the page is returned with whatever content was available before the limit was reached. The crawl continues normally.
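The shape of that fail-soft behavior, sketched with illustrative names (this mirrors the list above; it is not crawler.sh's actual configuration or defaults):

use std::time::{Duration, Instant};

// Illustrative limit set - field names mirror the list above; values and
// names are not crawler.sh's defaults.
struct RenderLimits {
    cpu_timeout: Duration,
    max_memory_bytes: usize,
    max_dom_nodes: usize,
    max_output_bytes: usize,
    max_timer_calls: usize,
}

// Whichever limit trips first, the page keeps the DOM built so far and the
// crawl moves on.
fn limit_hit(
    started: Instant,
    heap_bytes: usize,
    dom_nodes: usize,
    output_bytes: usize,
    timer_calls: usize,
    limits: &RenderLimits,
) -> Option<&'static str> {
    if started.elapsed() > limits.cpu_timeout { return Some("cpu timeout"); }
    if heap_bytes > limits.max_memory_bytes { return Some("memory cap"); }
    if dom_nodes > limits.max_dom_nodes { return Some("node limit"); }
    if output_bytes > limits.max_output_bytes { return Some("output size"); }
    if timer_calls > limits.max_timer_calls { return Some("timer limit"); }
    None
}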

Desktop App

The desktop app includes a new JS Rendering option in the Settings card. Choose between Auto-detect (default), Always, or Never. The setting applies to the current crawl session.

CLI Usage

# Crawl a JavaScript-heavy site
crawler crawl https://spa-example.com
# Auto-detection kicks in and you'll see:
# ⚡ JS rendering detected - enabling for remaining pages
# Force rendering for sites where auto-detection is too conservative
crawler crawl https://spa-example.com --render
# Disable rendering for maximum speed on static sites
crawler crawl https://static-site.com --no-render

All output formats (NDJSON, JSON, Sitemap XML) include the rendered content. SEO analysis runs against the rendered DOM, so JavaScript-injected SEO issues are caught.
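Because NDJSON is one JSON object per line, post-processing a crawl stays simple. A minimal sketch using Rust and the serde_json crate - the file name and the url/title field names here are assumptions for illustration, not a documented schema:

use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> Result<(), Box<dyn Error>> {
    // "crawl.ndjson" and the "url"/"title" fields are assumed names,
    // shown only to illustrate line-by-line post-processing.
    let file = File::open("crawl.ndjson")?;
    for line in BufReader::new(file).lines() {
        let page: serde_json::Value = serde_json::from_str(&line?)?;
        println!("{} -> {}", page["url"], page["title"]);
    }
    Ok(())
}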

Who Benefits

  • SEO professionals can audit JavaScript-heavy sites without setting up a headless browser
  • Content teams get clean Markdown extraction from client-rendered pages
  • MLOps engineers can collect training data from JavaScript applications at scale
  • Developers get a single-binary solution for crawling any site, static or dynamic

About crawler.sh

crawler.sh is a fast Rust-based web crawler and SEO auditing tool that runs entirely on your own machine. Use the CLI for automation, scripts, and CI pipelines, or the desktop app for a visual dashboard with live crawl progress, SEO issue charts, and one-click exports.

Every release ships across both the CLI and the desktop app. Download the latest version or run crawler update from the terminal to upgrade an existing install.
