robots.txt

robots.txt is a file that tells search engine crawlers which pages or sections of a site should not be crawled.

robots.txt is a plain text file placed at the root of a website (for example, https://example.com/robots.txt). It uses the Robots Exclusion Protocol (standardized as RFC 9309) to tell crawlers which pages or directories they may or may not request.

How robots.txt works

When a search engine crawler visits a site, it first checks for a robots.txt file. The file contains rules specifying which user agents (crawlers) are allowed or disallowed from accessing certain paths. For example:

User-agent: *
Disallow: /admin/
Disallow: /search/
Sitemap: https://example.com/sitemap.xml

This tells all crawlers to avoid the /admin/ and /search/ directories and provides the location of the XML sitemap.
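The rules above can be checked programmatically. As a minimal sketch, Python's standard-library robotparser module parses the same directives and answers "may this agent fetch this URL?" (the URLs below reuse the example.com rules from the sample file):

```python
from urllib.robotparser import RobotFileParser

# The example rules from the article, as a string instead of a fetched file.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /search/
Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Disallowed path: a well-behaved crawler must skip it.
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False
# Any path not matched by a Disallow rule is allowed.
print(rp.can_fetch("*", "https://example.com/products/"))    # True
# Sitemap URLs declared in the file (Python 3.8+).
print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```

In production the parser would call rp.set_url("https://example.com/robots.txt") and rp.read() to fetch the live file instead of parsing an inline string.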

Important limitations

robots.txt is a suggestion, not an enforcement mechanism. Well-behaved crawlers honor it, but malicious bots may ignore it. It also does not prevent pages from appearing in search results: if other pages link to a disallowed URL, search engines may still index it based on anchor text and context. To keep a page out of the index, use a noindex directive instead, and note that crawlers can only see a noindex directive on pages they are allowed to crawl, so do not combine it with a robots.txt Disallow for the same URL.
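The noindex directive mentioned above is delivered in the page itself rather than in robots.txt, for example as a meta tag in the document head:

```html
<!-- Tells compliant search engines not to include this page in their index -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the equivalent is the X-Robots-Tag: noindex HTTP response header.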

Common mistakes

  • Blocking CSS or JavaScript files that search engines need to render pages
  • Accidentally blocking important sections of the site
  • Using robots.txt to hide sensitive information (it is publicly accessible)
  • Forgetting to update robots.txt after site structure changes

How crawler.sh helps

When you run crawler crawl, the crawler checks for the presence of a robots.txt file and reports it in the site info summary. The crawler seo command includes a site-level check for robots.txt, helping you verify the file exists and is accessible.
