robots.txt

robots.txt is a file that tells search engine crawlers which pages or sections of a site should not be crawled.

robots.txt is a plain text file placed at the root of a website (for example, https://example.com/robots.txt). It uses the Robots Exclusion Protocol (standardized as RFC 9309) to tell crawlers which pages or directories they may or may not request.

How robots.txt works

When a search engine crawler visits a site, it first checks for a robots.txt file. The file contains rules specifying which user agents (crawlers) are allowed or disallowed from accessing certain paths. For example:

User-agent: *
Disallow: /admin/
Disallow: /search/
Sitemap: https://example.com/sitemap.xml

This tells all crawlers to avoid the /admin/ and /search/ directories and provides the location of the XML sitemap.
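The rules above can be checked programmatically. As a minimal sketch, Python's standard-library robotparser module parses the same directives and answers "may this agent fetch this URL?" (the URLs below reuse the example.com rules from the sample file):

```python
from urllib.robotparser import RobotFileParser

# The example rules from the article, as a string instead of a fetched file.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /search/
Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Disallowed path: a well-behaved crawler must skip it.
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False
# Any path not matched by a Disallow rule is allowed.
print(rp.can_fetch("*", "https://example.com/products/"))    # True
# Sitemap URLs declared in the file (Python 3.8+).
print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```

In production the parser would call rp.set_url("https://example.com/robots.txt") and rp.read() to fetch the live file instead of parsing an inline string.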

Important limitations

robots.txt is a suggestion, not an enforcement mechanism. Well-behaved crawlers honor it, but malicious bots may ignore it. It also does not prevent pages from appearing in search results: if other pages link to a disallowed URL, search engines may still index it based on anchor text and context. To keep a page out of the index, use a noindex directive instead, and note that crawlers can only see a noindex directive on pages they are allowed to crawl, so do not combine it with a robots.txt Disallow for the same URL.
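The noindex directive mentioned above is delivered in the page itself rather than in robots.txt, for example as a meta tag in the document head:

```html
<!-- Tells compliant search engines not to include this page in their index -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the equivalent is the X-Robots-Tag: noindex HTTP response header.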

Common mistakes

  • Blocking CSS or JavaScript files that search engines need to render pages
  • Accidentally blocking important sections of the site
  • Using robots.txt to hide sensitive information (it is publicly accessible)
  • Forgetting to update robots.txt after site structure changes

How crawler.sh helps

When you run crawler crawl, the crawler checks for the presence of a robots.txt file and reports it in the site info summary. The crawler seo command includes a site-level check for robots.txt, helping you verify the file exists and is accessible.
