What is a Headless Browser in Web Scraping

A headless browser is a web browser that runs without a graphical user interface. It can load pages, execute JavaScript, and interact with the DOM just like a regular browser, but it does so in the background without opening a window on screen. This makes headless browsers well suited to automated tasks such as web scraping, testing, and data extraction.

The term “headless” refers to the missing visual component. A standard browser has a window, tabs, and an address bar, and it paints every page to the screen. A headless browser drops the visible interface and exposes only the programmatic one. You interact with it through code rather than clicks and keystrokes.

How headless browsers work

A headless browser includes the same core components as a regular browser: an HTML parser, a CSS engine, and a JavaScript engine. It builds the DOM, applies styles, runs scripts, and maintains state including cookies, local storage, and session storage. The only missing piece is the visible window. Full headless browsers still compute layout and can paint a page off-screen on request, which is what makes screenshot capture and PDF generation possible.

When you ask a headless browser to navigate to a URL, it:

  1. Resolves the domain and establishes a TCP/TLS connection
  2. Sends the HTTP request and receives the response
  3. Parses the HTML and builds the DOM tree
  4. Downloads and applies CSS
  5. Executes JavaScript, which may modify the DOM
  6. Fires events like DOMContentLoaded and load
  7. Returns the final DOM and metadata to your script
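
Step 3 is the easiest of these to see in code. The following is a minimal sketch in Python that uses only the standard library's html.parser to turn markup into a small tree. The Node and TreeBuilder classes and the build_tree and find_all helpers are illustrative names, not part of any browser's API, and a real engine's parser performs far more error recovery than this toy does.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """A very small stand-in for a DOM element."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []
        self.text = ""  # text that sits directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a Node tree by tracking the stack of currently open elements."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Depth-first search for every element with the given tag name."""
    found = [node] if node.tag == tag else []
    for child in node.children:
        found.extend(find_all(child, tag))
    return found
```

Calling build_tree on a page's HTML and then find_all(root, "a") yields the link elements, which is roughly what a crawler wants from the final DOM. What this sketch leaves out is steps 4 to 6: no CSS is applied and no script runs, so content injected by JavaScript would not appear in the tree.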

Common headless browsers

  • Chromium / Chrome - The most widely used headless engine, accessible via Puppeteer or Playwright. Supports the full Chrome DevTools Protocol for advanced automation.
  • Firefox - Gecko engine in headless mode, supported by Playwright and Selenium. Useful for testing Firefox-specific behavior.
  • WebKit - Safari’s engine, used by Playwright for cross-browser testing. Helpful when you need to verify Safari compatibility.
  • QuickJS - A lightweight JavaScript engine used in specialized crawling tools. It is not a browser and ships no DOM of its own, but paired with a tool's own DOM implementation it is sufficient for executing scripts and producing rendered HTML.

Headless browser detection

Websites use various techniques to detect and block headless browsers, including:

  • Checking for missing plugins or non-standard window sizes
  • Analyzing behavior patterns that differ from human users, such as instant form submissions or perfectly linear mouse movements
  • Testing for automation markers left by WebDriver or Puppeteer, such as navigator.webdriver being set to true
  • Canvas fingerprinting to identify automated environments by drawing test patterns and checking for artifacts
  • Checking for missing fonts or language properties that real browsers have
  • Timing attacks that measure how quickly pages load or interactions complete
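
To make the list above concrete, here is a hedged sketch of how a site might combine several of these signals into a simple score. Everything in it is an assumption made for illustration: the signal names in the dictionary, the 800x600 window size treated as an automation default, and the 200 ms form-fill threshold are not taken from any real detection product, which would typically weigh many more signals.

```python
# Hypothetical values, chosen only to make the example concrete.
ASSUMED_DEFAULT_SIZES = {(800, 600)}
MIN_HUMAN_FORM_FILL_MS = 200


def headless_score(signals):
    """Count how many headless indicators appear in a dict of client signals.

    All keys are hypothetical. A missing key is treated as "no evidence".
    Returns a (score, reasons) tuple.
    """
    reasons = []
    if signals.get("webdriver") is True:
        reasons.append("navigator.webdriver is true")
    if signals.get("plugin_count") == 0:
        reasons.append("no plugins reported")
    if signals.get("languages") == []:
        reasons.append("navigator.languages is empty")
    if signals.get("window_size") in ASSUMED_DEFAULT_SIZES:
        reasons.append("window size matches an automation default")
    fill_ms = signals.get("form_fill_ms")
    if fill_ms is not None and fill_ms < MIN_HUMAN_FORM_FILL_MS:
        reasons.append("form submitted faster than a person could type")
    return len(reasons), reasons


suspicious = {
    "webdriver": True,
    "plugin_count": 0,
    "languages": [],
    "window_size": (800, 600),
    "form_fill_ms": 50,
}
score, reasons = headless_score(suspicious)
print(score, reasons)
```

A single signal is rarely conclusive on its own, which is why the sketch returns the list of reasons alongside the count rather than a yes/no verdict.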

Sophisticated headless setups include countermeasures like removing automation markers, spoofing user agents, and mimicking human interaction patterns. None of these is a permanent fix: detection methods keep improving, so the arms race continues.

Headless vs lightweight JS rendering

Traditional headless browsers are powerful but resource-intensive. Each page load spins up a full browser context, often consuming 100 MB or more of memory per tab. When crawling a 10,000-page site, that overhead caps how many pages can be rendered in parallel and stretches out the whole run.

Lightweight alternatives use a JavaScript engine without the full browser, trading some capabilities for dramatically lower memory usage and faster execution. A QuickJS-based renderer might use only 10-20 MB per page. The tradeoff is that lightweight engines may not support every browser API or handle complex features like WebGL or media playback.

For most crawling tasks where the goal is to extract text content and links, a lightweight engine is sufficient and far more efficient.
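
A quick back-of-the-envelope calculation shows what the per-page figures mean in practice. The sketch below takes the 100 MB and 20 MB per-page numbers from the text and assumes a 4 GB memory budget, which is an illustrative value only; real usage varies with the page and the engine.

```python
def peak_memory_mb(concurrent_pages: int, mb_per_page: int) -> int:
    """Memory needed to keep `concurrent_pages` pages open at once."""
    return concurrent_pages * mb_per_page


def max_concurrency(budget_mb: int, mb_per_page: int) -> int:
    """How many pages fit in `budget_mb` at `mb_per_page` each."""
    return budget_mb // mb_per_page


BUDGET_MB = 4096  # assumed 4 GB budget, purely illustrative

for label, per_page in [("full headless tab", 100), ("lightweight engine", 20)]:
    print(label, max_concurrency(BUDGET_MB, per_page), "pages in parallel")
```

Under these assumptions the same machine holds 40 full browser tabs or 204 lightweight pages at once, roughly a fivefold difference in parallelism before any difference in per-page speed is counted.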

Use cases for headless browsers

  • End-to-end testing - Verifying that user flows work correctly in a real browser environment
  • Screenshot capture - Generating visual previews of pages at specific dimensions
  • PDF generation - Converting HTML pages to PDF documents
  • Complex interaction - Filling forms, clicking buttons, and handling multi-step flows
  • SPA crawling - Waiting for client-side routes to resolve before extracting content
  • Bot protection bypass - Some protections are easier to bypass with a real browser than with a simple HTTP client

How crawler.sh uses headless rendering

crawler.sh uses a custom JavaScript rendering engine built on QuickJS rather than a full headless browser. This engine executes JavaScript, builds a DOM tree, and extracts rendered HTML. It includes fingerprint-hygiene features to reduce headless detection: hidden synthetic globals that do not appear in Object.keys(window), Canvas 2D shims that return plausible fingerprint data, WebGL stubs with vendor strings matching Chrome on Apple Silicon, and standard browser APIs like URL, Blob, FileReader, and FormData.

The engine is gated behind the JS rendering mode and is used automatically when the site profiler detects JavaScript-heavy pages. For static sites, the engine is skipped entirely, preserving the speed and efficiency of a direct HTTP crawl.
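
The text does not say how the site profiler decides that a page is JavaScript-heavy, so the following is only a hypothetical heuristic and not crawler.sh's actual logic. It flags a page when the raw HTML contains at least one script but very little visible text, the typical shape of a single-page-app shell. The function name needs_js_rendering and both thresholds are assumptions made for this sketch.

```python
from html.parser import HTMLParser


class _PageStats(HTMLParser):
    """Counts <script> tags and visible text characters in raw HTML."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.text_chars = 0
        self._skip_depth = 0  # above zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside <script>/<style> counts as visible content.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def needs_js_rendering(html, min_text_chars=200, min_scripts=1):
    """Hypothetical check: scripts present but almost no server-rendered text."""
    stats = _PageStats()
    stats.feed(html)
    stats.close()
    return stats.script_count >= min_scripts and stats.text_chars < min_text_chars
```

A crawler built this way would fetch each page over plain HTTP first and hand it to the rendering engine only when the check returns True, which keeps static sites on the fast path described above.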
