Web crawler

A web crawler (also called a spider or bot) is a program that systematically browses the web by following hyperlinks from page to page. It starts from one or more seed URLs, downloads the page content, extracts links, and then visits those linked pages. This process repeats until the crawler reaches a defined limit or runs out of new URLs to visit.

How web crawlers work

A typical web crawler follows this cycle:

  1. Start with a seed URL
  2. Download the page and parse its HTML
  3. Extract all links from the page
  4. Add new, unvisited links to a queue
  5. Pick the next URL from the queue and repeat
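The cycle above can be sketched as a small breadth-first crawler. This is an illustrative Python-stdlib sketch, not any particular tool's implementation; the fetch callable is an assumption standing in for a real HTTP download so the loop itself stays visible.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, resolved against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl. fetch(url) returns HTML text, or None on error."""
    queue = deque([seed_url])              # step 1: start with a seed URL
    visited = set()
    order = []                             # pages in the order they were crawled
    while queue and len(order) < max_pages:
        url = queue.popleft()              # step 5: pick the next URL
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)                  # step 2: download the page
        if html is None:
            continue
        order.append(url)
        parser = LinkExtractor(url)
        parser.feed(html)                  # step 3: extract all links
        for link in parser.links:          # step 4: queue new, unvisited links
            if link not in visited:
                queue.append(link)
    return order
```

Passing fetch as a parameter also makes the loop easy to test against an in-memory "site" (a dict of URL to HTML) without touching the network.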

Crawlers keep a record of the URLs they have already visited to avoid infinite loops and redundant downloads. Well-behaved crawlers also respect robots.txt rules and throttle their request rate to avoid overloading servers.
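Python's standard library includes a robots.txt parser that a crawler can consult before each request. A minimal sketch, with a made-up robots.txt and a hypothetical "mybot" user agent:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, as a server might serve it.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # in practice, fetched from <site>/robots.txt

# Ask before fetching: is this path allowed for our user agent?
rp.can_fetch("mybot", "http://example.test/public/page.html")   # allowed
rp.can_fetch("mybot", "http://example.test/private/page.html")  # disallowed

# Crawl-delay tells the crawler how many seconds to wait between requests.
rp.crawl_delay("mybot")
```

In a real crawler, rp.crawl_delay would feed a sleep between requests, and any URL failing rp.can_fetch would be skipped rather than queued.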

Types of web crawlers

  • Search engine crawlers - Googlebot, Bingbot, and others that build the search engine index
  • SEO crawlers - Tools that audit websites for technical SEO issues
  • Data collection crawlers - Programs that extract specific data for research or analysis
  • Monitoring crawlers - Bots that check for uptime, content changes, or security issues

How crawler.sh works

crawler.sh is an SEO-focused web crawler built in Rust for speed and low memory usage. It performs breadth-first crawling within a single domain, recording status codes, redirects, meta tags, and content for every page. The crawl data can then be analyzed with crawler info for statistics and crawler seo for SEO issues, or exported to JSON or sitemap XML.
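Two of the ideas above, staying within a single domain and recording meta tags per page, can be illustrated with a short Python-stdlib sketch. This is not crawler.sh's code (which is Rust); the names and the example URLs are assumptions for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class MetaExtractor(HTMLParser):
    """Pulls the <title> and <meta name="description"> from one page."""
    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and a.get("name") == "description":
            self.description = a.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

def in_scope(seed_url, candidate_url):
    """Single-domain policy: only follow links on the seed's hostname."""
    return urlparse(candidate_url).hostname == urlparse(seed_url).hostname
```

A crawler applying this policy would run in_scope on every extracted link before queueing it, and run MetaExtractor over each downloaded page to build its per-page record.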