Training AI models requires large volumes of clean, structured text. Web crawling is one of the most common ways to build these datasets, but raw HTML is full of noise: navigation menus, footers, ads, and boilerplate that dilute the quality of your training data. What you need is the actual content, extracted cleanly and consistently.

This guide walks you through crawling a website and extracting clean Markdown content suitable for AI model training using the crawler.sh CLI.

Step 1: Install crawler.sh CLI

Install the CLI with a single command:

curl -fsSL https://install.crawler.sh | sh

This downloads the correct binary for your operating system and architecture, places it in ~/.crawler/bin/, and adds it to your PATH. Restart your terminal or run source ~/.bashrc (or ~/.zshrc) to pick up the new PATH entry.

Verify the installation:

crawler --version

Step 2: Crawl with content extraction enabled

Run a crawl with the --extract-content flag to extract clean Markdown from every page:

crawler crawl https://example.com --extract-content

With content extraction enabled, the crawler processes each page through a readability algorithm that strips away navigation, sidebars, headers, footers, and other non-content elements. What remains is the main body text converted to clean Markdown, along with metadata like word count, byline, and excerpt.

For larger sites, increase the page limit:

crawler crawl https://example.com --extract-content --max-pages 5000

Step 3: Review the crawl results

Use the info command on the crawl file (example-com.crawl in this example) to understand what was collected:

crawler info example-com.crawl

This shows total pages crawled, status code distribution, and response time statistics. Check how many pages returned 200 OK; these are the pages with usable content. Pages that returned errors or redirects typically do not contain extractable text.

Step 4: Export to JSON for processing

Export the crawl data to JSON, which is easy to process programmatically:

crawler export example-com.crawl --format json --output dataset.json

The JSON output includes each page’s URL, title, meta description, and the extracted Markdown content. You can then write a script to filter, clean, and format this data for your specific AI training pipeline.
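
As a starting point for such a script, here is a minimal Python sketch that normalizes the exported records. The exact JSON schema is not shown in this guide, so the layout (a top-level list of pages, or an object with a "pages" list) and the field names url, title, description, and markdown are assumptions. Check them against your own dataset.json and adjust.

```python
import json
from pathlib import Path

# Assumed field names; verify them against your actual export.
FIELDS = ("url", "title", "description", "markdown")


def normalize_pages(raw):
    """Return a list of dicts with the assumed fields, skipping pages without content."""
    # Assumption: the export is a top-level list, or an object holding a "pages" list.
    pages = raw.get("pages", []) if isinstance(raw, dict) else raw
    records = []
    for page in pages:
        # Missing or null fields become empty strings so later steps can rely on str values.
        record = {name: (page.get(name) or "") for name in FIELDS}
        if record["markdown"].strip():
            records.append(record)
    return records


def load_pages(path):
    """Read an exported JSON file and normalize its records."""
    with Path(path).open(encoding="utf-8") as fh:
        return normalize_pages(json.load(fh))
```

Calling load_pages("dataset.json") would then give you a list of uniform records to feed into whatever filtering or formatting your pipeline needs.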

Step 5: Export as Markdown archive

Pro subscribers can also export all extracted content as a ZIP archive of individual Markdown files. The command below does this as part of the crawl itself:

crawler crawl https://example.com --extract-content --format markdown

Each page becomes a separate .md file named after its URL path. This format works well for training datasets where you want one document per file, and it preserves the clean text without any HTML or metadata overhead.
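
If you take the archive route, a short Python sketch like the following can read the documents back into a pipeline. The archive's file name and internal layout are not specified in this guide, so the path you pass in and the idea that every .md member corresponds to one page are assumptions for illustration.

```python
import zipfile


def iter_markdown_docs(archive_path):
    """Yield (member_name, text) for every .md file in a ZIP archive."""
    with zipfile.ZipFile(archive_path) as archive:
        for name in sorted(archive.namelist()):
            # Skip directory entries and anything that is not a Markdown file.
            if name.endswith("/") or not name.lower().endswith(".md"):
                continue
            # Assumption: files are UTF-8; undecodable bytes are replaced instead of raising.
            yield name, archive.read(name).decode("utf-8", errors="replace")
```

Because the function is a generator, it decodes one document at a time rather than holding the whole archive's text in memory at once.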

Tips for building quality training data

Once you have your extracted content, here are some best practices for preparing it for AI training:

Filter by word count. Pages with very few words (under 100) are likely navigation pages, error pages, or stubs. Exclude them from your dataset.
Remove duplicate content. Many sites have the same content accessible at multiple URLs. Deduplicate by comparing the extracted Markdown text.
Respect robots.txt. The crawler automatically checks robots.txt before crawling. If a site disallows crawling, respect that decision.
Check content licensing. Make sure the content you crawl is licensed for your intended use. Public documentation, open-source project sites, and Creative Commons content are safer choices.
Combine multiple sources. A single website may not provide enough variety. Crawl multiple relevant sites and merge the results for a more diverse dataset.
Preserve metadata. Keep the page title, URL, and description alongside the content. This metadata can be useful for filtering, categorization, or as additional training signal.
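
Several of these tips (word-count filtering, deduplication, and keeping metadata) can be combined in one small post-processing step. The Python sketch below is one possible approach, not the crawler's own behavior. It assumes each page is a dict with the hypothetical keys url, title, description, and markdown. It counts words by splitting on whitespace and only catches duplicates that are identical after case and whitespace normalization.

```python
import hashlib
import json
import re

MIN_WORDS = 100  # threshold suggested in the tips above; tune it for your corpus


def word_count(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


def fingerprint(text):
    """Hash the text with case and whitespace normalized, for exact-duplicate detection."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_dataset(pages, min_words=MIN_WORDS):
    """Drop short pages and duplicates, keeping metadata next to the content."""
    seen = set()
    kept = []
    for page in pages:
        text = page.get("markdown") or ""  # assumed key name
        if word_count(text) < min_words:
            continue  # likely a navigation page, error page, or stub
        digest = fingerprint(text)
        if digest in seen:
            continue  # same content already kept from another URL
        seen.add(digest)
        kept.append({
            "url": page.get("url", ""),
            "title": page.get("title", ""),
            "description": page.get("description", ""),
            "text": text,
        })
    return kept


def write_jsonl(records, path):
    """Write one JSON object per line, a common layout for training corpora."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

To combine multiple sources, concatenate the page lists from several exports before calling build_dataset, so that duplicates are also removed across sites. Near-duplicates such as pages that differ only in a date or a footer will pass this check; catching those needs a fuzzier method such as shingling or MinHash.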