How to Integrate crawler.sh into MLOps Pipelines
Learn how to use crawler.sh CLI in MLOps workflows to collect training data, validate documentation sites, and automate web crawling in CI/CD pipelines.
MLOps pipelines need reliable, repeatable access to web content. Whether you are collecting text for model training, validating that your model documentation is published correctly, or monitoring the pages that serve your ML API, automated web crawling fills a gap that manual checks and ad-hoc scripts cannot.
This guide shows you how to integrate crawler.sh CLI into MLOps workflows for training data collection, documentation validation, and scheduled crawling in CI/CD pipelines.
Step 1: Install crawler.sh CLI
Install the CLI with a single command:
```bash
curl -fsSL https://install.crawler.sh | sh
```

This downloads the correct binary for your operating system and architecture, places it in ~/.crawler/bin/, and adds it to your PATH. Restart your terminal or run source ~/.bashrc (or ~/.zshrc) to pick up the new PATH entry.
Verify the installation:
```bash
crawler --version
```

For CI/CD environments, add the install command to your pipeline setup step. The installer detects the OS and architecture automatically, so the same command works on macOS and Linux runners.
Step 2: Collect training data with content extraction
Run a crawl with the --extract-content flag to extract clean Markdown from every page:
```bash
crawler crawl https://docs.example.com --extract-content --max-pages 5000
```

The crawler processes each page through a readability algorithm that strips navigation, sidebars, headers, footers, and boilerplate. What remains is the main body text converted to clean Markdown, along with metadata like word count, byline, and excerpt. This is exactly the kind of structured text that ML training pipelines need.
For domain-specific models, target the sites that contain the terminology and writing patterns your model should learn:
```bash
crawler crawl https://research-papers.example.com --extract-content
crawler crawl https://technical-docs.example.com --extract-content
```

Step 3: Export structured datasets
Export the crawl data to JSON for ingestion into your data pipeline:
```bash
crawler export docs-example-com.crawl --format json --output dataset.json
```

The JSON output includes each page's URL, title, meta description, status code, and extracted Markdown content. You can process this programmatically to filter and prepare training data.
A common pattern is to chain the crawl and export in a single script:
```bash
#!/bin/bash
DOMAIN="docs.example.com"
CRAWL_FILE="${DOMAIN//./-}.crawl"

crawler crawl "https://$DOMAIN" --extract-content --max-pages 5000
crawler export "$CRAWL_FILE" --format json --output "dataset-$(date +%Y%m%d).json"
```

Filter by status code and word count downstream to remove error pages and thin content that would add noise to your training set.
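That downstream filtering step can be sketched in Python. The field names here (status_code, word_count) are assumptions based on the export fields described in this step; verify them against your actual JSON before relying on this:

```python
def filter_records(records, min_words=100):
    # Drop error pages and thin content; field names are assumed,
    # not confirmed against the actual crawler.sh export schema
    return [
        r for r in records
        if r.get("status_code") == 200 and r.get("word_count", 0) >= min_words
    ]

# Illustrative records in the assumed export shape
sample = [
    {"url": "https://docs.example.com/a", "status_code": 200, "word_count": 450},
    {"url": "https://docs.example.com/missing", "status_code": 404, "word_count": 30},
    {"url": "https://docs.example.com/stub", "status_code": 200, "word_count": 40},
]
print(len(filter_records(sample)))  # prints 1
```

In a real pipeline you would load the exported file with json.load and write the filtered records back out before handing them to your training job.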
Step 4: Automate crawls in CI/CD
Add crawler commands to your CI/CD pipeline or schedule them with cron for regular data collection. Here is a cron example that runs a weekly crawl:
```bash
# Run every Sunday at 2 AM
0 2 * * 0 /home/user/.crawler/bin/crawler crawl https://docs.example.com --extract-content --max-pages 5000 --output /data/crawls/docs-$(date +\%Y\%m\%d).crawl
```

For CI/CD pipelines, add the crawl as a pipeline step:
```bash
# Install crawler
curl -fsSL https://install.crawler.sh | sh
export PATH="$HOME/.crawler/bin:$PATH"

# Crawl and export
crawler crawl https://docs.example.com --extract-content --max-pages 2000
crawler export docs-example-com.crawl --format json --output training-data.json
```

Use the --max-pages flag to cap crawl size and keep pipeline run times predictable. Use --delay to add a delay, in milliseconds, between requests if you need to avoid overloading target servers.
Step 5: Validate documentation sites
Use the seo command to audit your ML model documentation for structural issues:
```bash
crawler crawl https://docs.ml-platform.example.com
crawler seo docs-ml-platform-example-com.crawl
```

The SEO analysis checks for missing titles, broken internal links, missing descriptions, empty pages, and 22 other issues that affect how users and search engines find your documentation. This is especially useful after deploying auto-generated API docs or model cards.
Export the report for tracking in your issue tracker:
```bash
crawler seo docs-ml-platform-example-com.crawl --format csv --output docs-audit.csv
```

Run this after every documentation deployment to catch regressions before users hit them.
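One way to catch those regressions automatically is to turn the CSV report into a CI gate: count the rows and fail the job when any issues appear. The layout assumed here (a header row plus one row per issue) is a guess; check it against your actual docs-audit.csv:

```python
import csv

def count_issues(csv_path):
    # Assumes a header row followed by one row per reported issue
    with open(csv_path, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))

# In a CI step, fail the job when the audit reports anything, e.g.:
#   raise SystemExit(1) if count_issues("docs-audit.csv") else None
```

A stricter variant could filter rows by severity, if the report includes one, so that minor warnings do not block deployments.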
Best practices for MLOps integration
- Version your crawl outputs. Include dates or commit hashes in output filenames so you can trace which training data corresponds to which model version.
- Set page limits for reproducibility. Use `--max-pages` to ensure crawls produce consistent dataset sizes across runs.
- Respect rate limits. Use `--delay` to add a pause between requests. This prevents overwhelming target servers and avoids getting blocked.
- Use `crawler info` for monitoring. Run `crawler info` on your crawl files to verify page counts, status code distributions, and response times before feeding data into training.
- Combine with content filtering. Not every crawled page is useful for training. Filter by word count (skip pages under 100 words) and status code (keep only 200 OK pages) in your data pipeline.
- Store crawl files alongside model artifacts. Treat `.crawl` files as part of your model lineage so you can reproduce datasets later.
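The versioning practice above can be sketched as a small helper that stamps dataset filenames with a date and an optional commit hash. The naming scheme is illustrative, not a crawler.sh convention:

```python
from datetime import date

def dataset_filename(domain, commit=None, day=None):
    # e.g. dataset-docs-example-com-20250101-ab12cd3.json
    # day defaults to today so scheduled runs get distinct names
    day = day or date.today().strftime("%Y%m%d")
    slug = domain.replace(".", "-")
    suffix = f"-{commit}" if commit else ""
    return f"dataset-{slug}-{day}{suffix}.json"

print(dataset_filename("docs.example.com", commit="ab12cd3", day="20250101"))
```

In CI you might pass the short commit hash (for example from `git rev-parse --short HEAD`) so each dataset file maps back to the pipeline revision that produced it.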