robots.txt
robots.txt is a plain text file placed at the root of a website (for example, https://example.com/robots.txt) that provides instructions to search engine crawlers. It uses the Robots Exclusion Protocol to specify which pages or directories should not be crawled.
How robots.txt works
When a search engine crawler visits a site, it first checks for a robots.txt file. The file contains rules specifying which user agents (crawlers) are allowed or disallowed from accessing certain paths. For example:
```
User-agent: *
Disallow: /admin/
Disallow: /search/
Sitemap: https://example.com/sitemap.xml
```

This tells all crawlers to avoid the /admin/ and /search/ directories and provides the location of the XML sitemap.
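You can test rules like the ones above with Python's standard-library robots.txt parser. The URLs below are illustrative:

```python
from urllib import robotparser

# The same rules as the example above, fed to the parser line by line.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /search/
Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/settings"))  # False
print(rp.can_fetch("*", "https://example.com/blog/post"))       # True
print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```

`can_fetch` answers the same question a well-behaved crawler asks before requesting a URL; `site_maps` (Python 3.8+) returns any Sitemap lines found.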
Important limitations
robots.txt is a suggestion, not an enforcement mechanism. Well-behaved crawlers honor it, but malicious bots may ignore it. It also does not prevent pages from appearing in search results: if other pages link to a disallowed URL, search engines may still index it based on anchor text and context. To keep a page out of the index, use a noindex directive instead.
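The standard way to express noindex is a robots meta tag in the page's HTML. Note that the page must remain crawlable for this to work: if robots.txt disallows the URL, crawlers never fetch the page and never see the directive.

```
<!-- in the page's <head> -->
<meta name="robots" content="noindex">
```

For non-HTML resources (PDFs, images), the equivalent is the HTTP response header `X-Robots-Tag: noindex`.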
Common mistakes
- Blocking CSS or JavaScript files that search engines need to render pages
- Accidentally blocking important sections of the site
- Using robots.txt to hide sensitive information (it is publicly accessible)
- Forgetting to update robots.txt after site structure changes
How crawler.sh helps
When you run `crawler crawl`, the crawler checks for the presence of a robots.txt file and reports it in the site info summary. The `crawler seo` command includes a site-level check for robots.txt, helping you verify the file exists and is accessible.
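As an independent cross-check, you can fetch the file yourself. This sketch uses a common interpretation of the response: a 2xx status means the file is present, 404 means the site publishes no rules, and anything else is treated as unavailable. The base URL is a placeholder:

```python
from urllib import request, error

def classify_status(code):
    """Map an HTTP status code for /robots.txt to a simple verdict."""
    if 200 <= code < 300:
        return "present"
    if code == 404:
        return "absent"
    return "unavailable"

def check_robots(base_url, timeout=10):
    """Fetch {base_url}/robots.txt and classify the result."""
    url = base_url.rstrip("/") + "/robots.txt"
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return classify_status(resp.status)
    except error.HTTPError as exc:
        return classify_status(exc.code)
    except error.URLError:
        return "unreachable"

# check_robots("https://example.com")  # network call
```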