How to Integrate crawler.sh into MLOps Pipelines
Learn how to use crawler.sh CLI in MLOps workflows to collect training data, validate documentation sites, and automate web crawling in CI/CD pipelines.
MLOps pipelines need reliable, repeatable access to web content. Whether you are collecting text for model training, validating that your model documentation is published correctly, or monitoring the pages that serve your ML API, automated web crawling fills a gap that manual checks and ad-hoc scripts cannot.
This guide shows you how to integrate crawler.sh CLI into MLOps workflows for training data collection, documentation validation, and scheduled crawling in CI/CD pipelines.
Step 1: Install crawler.sh CLI
Install the CLI with a single command:
```bash
curl -fsSL https://install.crawler.sh | sh
```

This downloads the correct binary for your operating system and architecture, places it in ~/.crawler/bin/, and adds it to your PATH. Restart your terminal or run source ~/.bashrc (or ~/.zshrc) to pick up the new PATH entry.
Verify the installation:
```bash
crawler --version
```

For CI/CD environments, add the install command to your pipeline setup step. The installer detects the OS and architecture automatically, so the same command works on macOS and Linux runners.
Step 2: Collect training data with content extraction
Run a crawl with the --extract-content flag to extract clean Markdown from every page:
```bash
crawler crawl https://docs.example.com --extract-content --max-pages 5000
```

The crawler processes each page through a readability algorithm that strips navigation, sidebars, headers, footers, and boilerplate. What remains is the main body text converted to clean Markdown, along with metadata like word count, byline, and excerpt. This is exactly the kind of structured text that ML training pipelines need.
For domain-specific models, target the sites that contain the terminology and writing patterns your model should learn:
```bash
crawler crawl https://research-papers.example.com --extract-content
crawler crawl https://technical-docs.example.com --extract-content
```

Step 3: Export structured datasets
Export the crawl data to JSON for ingestion into your data pipeline:
```bash
crawler export docs-example-com.crawl --format json --output dataset.json
```

The JSON output includes each page's URL, title, meta description, status code, and extracted Markdown content. You can process this programmatically to filter and prepare training data.
A common pattern is to chain the crawl and export in a single script:
```bash
#!/bin/bash
DOMAIN="docs.example.com"
CRAWL_FILE="${DOMAIN//./-}.crawl"

crawler crawl "https://$DOMAIN" --extract-content --max-pages 5000
crawler export "$CRAWL_FILE" --format json --output "dataset-$(date +%Y%m%d).json"
```

Filter by status code and word count downstream to remove error pages and thin content that would add noise to your training set.
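That downstream filtering step can be sketched in Python. The field names here (status_code, word_count) are assumptions based on the export fields described in this step; verify them against your actual JSON before relying on this:

```python
def filter_records(records, min_words=100):
    # Drop error pages and thin content; field names are assumed,
    # not confirmed against the actual crawler.sh export schema
    return [
        r for r in records
        if r.get("status_code") == 200 and r.get("word_count", 0) >= min_words
    ]

# Illustrative records in the assumed export shape
sample = [
    {"url": "https://docs.example.com/a", "status_code": 200, "word_count": 450},
    {"url": "https://docs.example.com/missing", "status_code": 404, "word_count": 30},
    {"url": "https://docs.example.com/stub", "status_code": 200, "word_count": 40},
]
print(len(filter_records(sample)))  # prints 1
```

In a real pipeline you would load the exported file with json.load and write the filtered records back out before handing them to your training job.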
Step 4: Automate crawls in CI/CD
Add crawler commands to your CI/CD pipeline or schedule them with cron for regular data collection. Here is a cron example that runs a weekly crawl:
```bash
# Run every Sunday at 2 AM
0 2 * * 0 /home/user/.crawler/bin/crawler crawl https://docs.example.com --extract-content --max-pages 5000 --output /data/crawls/docs-$(date +\%Y\%m\%d).crawl
```

For CI/CD pipelines, add the crawl as a pipeline step:
```bash
# Install crawler
curl -fsSL https://install.crawler.sh | sh
export PATH="$HOME/.crawler/bin:$PATH"

# Crawl and export
crawler crawl https://docs.example.com --extract-content --max-pages 2000
crawler export docs-example-com.crawl --format json --output training-data.json
```

Use the --max-pages flag to cap crawl size and keep pipeline run times predictable. Use --delay to add a delay, in milliseconds, between requests if you need to avoid overloading target servers.
Step 5: Validate documentation sites
Use the seo command to audit your ML model documentation for structural issues:
```bash
crawler crawl https://docs.ml-platform.example.com
crawler seo docs-ml-platform-example-com.crawl
```

The SEO analysis checks for missing titles, broken internal links, missing descriptions, empty pages, and 22 other issues that affect how users and search engines find your documentation. This is especially useful after deploying auto-generated API docs or model cards.
Export the report for tracking in your issue tracker:
```bash
crawler seo docs-ml-platform-example-com.crawl --format csv --output docs-audit.csv
```

Run this after every documentation deployment to catch regressions before users hit them.
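One way to catch those regressions automatically is to turn the CSV report into a CI gate: count the rows and fail the job when any issues appear. The layout assumed here (a header row plus one row per issue) is a guess; check it against your actual docs-audit.csv:

```python
import csv

def count_issues(csv_path):
    # Assumes a header row followed by one row per reported issue
    with open(csv_path, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))

# In a CI step, fail the job when the audit reports anything, e.g.:
#   raise SystemExit(1) if count_issues("docs-audit.csv") else None
```

A stricter variant could filter rows by severity, if the report includes one, so that minor warnings do not block deployments.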
Best practices for MLOps integration
- Version your crawl outputs. Include dates or commit hashes in output filenames so you can trace which training data corresponds to which model version.
- Set page limits for reproducibility. Use `--max-pages` to ensure crawls produce consistent dataset sizes across runs.
- Respect rate limits. Use `--delay` to add a pause between requests. This prevents overwhelming target servers and avoids getting blocked.
- Use `crawler info` for monitoring. Run `crawler info` on your crawl files to verify page counts, status code distributions, and response times before feeding data into training.
- Combine with content filtering. Not every crawled page is useful for training. Filter by word count (skip pages under 100 words) and status code (keep only 200 OK pages) in your data pipeline.
- Store crawl files alongside model artifacts. Treat `.crawl` files as part of your model lineage so you can reproduce datasets later.
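The versioning practice above can be sketched as a small helper that stamps dataset filenames with a date and an optional commit hash. The naming scheme is illustrative, not a crawler.sh convention:

```python
from datetime import date

def dataset_filename(domain, commit=None, day=None):
    # e.g. dataset-docs-example-com-20250101-ab12cd3.json
    # day defaults to today so scheduled runs get distinct names
    day = day or date.today().strftime("%Y%m%d")
    slug = domain.replace(".", "-")
    suffix = f"-{commit}" if commit else ""
    return f"dataset-{slug}-{day}{suffix}.json"

print(dataset_filename("docs.example.com", commit="ab12cd3", day="20250101"))
```

In CI you might pass the short commit hash (for example from `git rev-parse --short HEAD`) so each dataset file maps back to the pipeline revision that produced it.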