Guides
March 1, 2026

How to Create Sitemap.xml by Crawling Your Website

Generate an accurate XML sitemap from a real crawl of your site. No guessing, no stale URLs - just the pages that actually exist and return 200.

Mehmet Kose
4 min read

Most sitemap generators work from a static list of URLs you give them. The problem is obvious: your sitemap ends up including pages that redirect, return 404, or haven’t existed for months. Search engines crawl those dead URLs, waste their crawl budget, and your real pages get indexed slower.

A better approach is to generate your sitemap from an actual crawl of your live site. You get exactly the pages that exist, respond with 200, and are reachable by following links - nothing more, nothing less.

Step 1: Install crawler.sh CLI

Install the CLI with a single command:

curl -fsSL https://install.crawler.sh | sh

Verify the installation:

crawler --version

Step 2: Generate a sitemap in one command

The fastest way to create a sitemap is to crawl your site directly into sitemap format:

crawler crawl https://yoursite.com -f sitemap

This does everything in a single pass:

  1. Discovers your pages by following links from the homepage.
  2. Checks robots.txt and existing sitemaps to find additional URLs.
  3. Filters out redirects, errors, and non-HTML responses.
  4. Writes a valid XML sitemap containing only pages that returned 200.

The output file is auto-named based on your domain - for example, yoursite-com-sitemap.xml.
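If you're curious what the output contains, here is a minimal file in the standard sitemaps.org protocol format. The generated sitemap follows this protocol; the URLs below are placeholders, and the real file is built from your crawl results.

```shell
# Illustrative only: a minimal sitemap in the sitemaps.org protocol format.
cat > example-sitemap.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yoursite.com/</loc>
  </url>
  <url>
    <loc>https://yoursite.com/about</loc>
  </url>
</urlset>
EOF

# Each page is one <url> entry; count them with grep
grep -c '<loc>' example-sitemap.xml   # prints 2
```

One `<loc>` entry per page that returned 200 - that's the whole format.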

Step 3: Crawl first, export later

If you want to inspect your crawl results before generating the sitemap, use the two-step approach:

# Step 1: Crawl and save as NDJSON
crawler crawl https://yoursite.com
# Step 2: Export to sitemap
crawler export yoursite-com.crawl -f sitemap

The .crawl file is an NDJSON log of every page the crawler visited, including status codes, redirects, and metadata. You can analyze it with crawler info or run SEO checks with crawler seo before deciding to generate the sitemap:

# Check what the crawl found
crawler info yoursite-com.crawl
# Run SEO analysis
crawler seo yoursite-com.crawl

This is useful when you want to fix issues before publishing a new sitemap. No point telling search engines about pages with broken canonical tags or redirect loops.
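As a sketch of that triage step, you can also grep the NDJSON log directly. The field names below (url, status) are assumptions for illustration - inspect a line of your actual .crawl file to confirm its schema.

```shell
# Fake three-line .crawl log for demonstration; a real one comes from
# `crawler crawl`. The field names here are assumed, not documented.
cat > sample.crawl <<'EOF'
{"url":"https://yoursite.com/","status":200}
{"url":"https://yoursite.com/old-page","status":404}
{"url":"https://yoursite.com/moved","status":301}
EOF

# List records that did not return 200 - candidates to fix before exporting
grep -v '"status":200' sample.crawl
```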

Step 4: Control the crawl

By default, the crawler follows links from the homepage with sensible defaults. You can tune it for your site:

crawler crawl https://yoursite.com -f sitemap \
  --max-pages 5000 \
  --max-depth 10 \
  --concurrency 10 \
  --delay 100

  • --max-pages - Maximum number of pages to crawl. Free tier allows up to 600, Pro goes up to 10,000.
  • --max-depth - How many link-hops deep to follow from the starting URL.
  • --concurrency - Number of simultaneous requests. Higher values crawl faster but hit your server harder.
  • --delay - Milliseconds to wait between requests. Be polite to your own servers.

The crawler stays on your domain - it won’t follow links to external sites.
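These settings bound how long a crawl takes. A rough back-of-envelope estimate, assuming an overall throughput of about 10 requests per second (an assumption, not a documented figure - actual throughput depends on how --delay interacts with --concurrency and on your server's response times):

```shell
# Rough crawl-time estimate: pages / requests-per-second = seconds.
# 10 req/s is an assumed overall throughput, not a documented figure.
pages=5000
rps=10
seconds=$((pages / rps))
echo "$seconds seconds (~$((seconds / 60)) minutes)"   # prints: 500 seconds (~8 minutes)
```

If a full crawl of your site fits in minutes at polite settings, there is little reason to raise concurrency.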

Step 5: Keep your sitemap fresh

A sitemap is only useful if it reflects the current state of your site. The easiest way to keep it accurate is to regenerate it periodically:

# Add to a cron job or CI pipeline
crawler crawl https://yoursite.com -f sitemap -o /var/www/yoursite/sitemap.xml

The -o flag lets you specify the output path directly, so you can write the sitemap straight to your web root or a deployment directory.

For sites that change frequently, run this weekly. For mostly static sites, monthly is fine. The point is that your sitemap always matches reality.
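A concrete scheduling sketch, with a hypothetical binary path, web root, and weekly cadence - adjust all three to your setup. Listing the sitemap in robots.txt is part of the standard protocol and helps crawlers find it:

```shell
# crontab entry (crontab -e): regenerate the sitemap every Monday at 03:00.
# The binary path and web root below are examples; use your own.
# 0 3 * * 1 /usr/local/bin/crawler crawl https://yoursite.com -f sitemap -o /var/www/yoursite/sitemap.xml

# Then point crawlers at it from your robots.txt:
# Sitemap: https://yoursite.com/sitemap.xml
```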

What gets included (and what doesn’t)

The generated sitemap only includes URLs that meet these criteria:

  • Same domain as the starting URL.
  • HTTP 200 response. Redirects (301, 302, 307, 308), client errors (4xx), and server errors (5xx) are excluded.
  • HTML content type. PDFs, images, and other non-HTML resources are excluded.

This means your sitemap is clean by default. No redirect chains, no error pages, no broken URLs clogging up your crawl budget.

The sitemap protocol has a hard limit of 50,000 URLs per file. If your crawl exceeds that, the output will be truncated with a warning.
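The protocol's own answer to sites above that limit is a sitemap index: a small file that points at multiple sitemaps of up to 50,000 URLs each. The example below is a minimal index in the sitemaps.org format, with placeholder URLs, in case you split a large sitemap yourself:

```shell
# Illustrative only: a minimal sitemap index in the sitemaps.org format.
# Each <sitemap> entry references one child sitemap file.
cat > sitemap-index.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yoursite.com/sitemap-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-2.xml</loc>
  </sitemap>
</sitemapindex>
EOF
grep -c '<loc>' sitemap-index.xml   # prints 2
```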

From sitemap to full audit

Once you’re generating sitemaps from real crawls, you’re one step away from a full site audit. The same .crawl data powers crawler.sh’s 18 automated SEO checks:

crawler crawl https://yoursite.com
crawler seo yoursite-com.crawl --export csv

You get a CSV with every issue - missing titles, duplicate descriptions, broken canonicals, redirect chains, and more. Fix the issues, regenerate the sitemap, and you’ve got a clean site that search engines can process efficiently.
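To triage a large report, a quick group-by-issue count works well. The column layout below (check,url,detail) is an assumption for illustration - check the header row of your actual export before relying on it.

```shell
# Fake export with an assumed column layout; substitute your real CSV.
cat > seo-issues.csv <<'EOF'
check,url,detail
missing-title,https://yoursite.com/a,
duplicate-description,https://yoursite.com/b,matches /c
missing-title,https://yoursite.com/d,
EOF

# Count issues per check type, most frequent first
tail -n +2 seo-issues.csv | cut -d, -f1 | sort | uniq -c | sort -rn
```

Start with the most frequent check - one template fix often clears dozens of rows at once.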
