A training data pipeline is the automated system that transforms raw source material into a format suitable for training large language models. It is the factory floor of AI development: pages, documents, and conversations enter at one end, and structured, deduplicated, validated datasets exit at the other.
Building a good pipeline is harder than it sounds. Raw data is messy. Web pages contain ads, navigation, and broken HTML. Books have page headers and footers. Conversations include typos, slang, and off-topic digressions. A pipeline must clean this input without destroying the valuable signal within it.
Stages of a training data pipeline
A typical pipeline has six to eight stages, each implemented as a transform that filters or enriches the data:
1. Ingestion
Collect raw source material from web crawls, APIs, file uploads, or existing datasets. At this stage, the data is in its native format: HTML, PDF, Word documents, JSON, or plain text.
2. Extraction
Pull the primary content out of its container. For HTML, this means removing boilerplate (headers, footers, ads, sidebars) and keeping the article body. For PDFs, this means converting to text while preserving paragraph structure.
3. Normalization
Convert everything to a consistent format. Common choices:
- Markdown for text documents
- JSONL for structured records
- Parquet for columnar storage at scale
{"text": "The quick brown fox...", "source": "https://example.com/article", "language": "en"}{"text": "Machine learning is...", "source": "https://example.com/blog", "language": "en"}4. Deduplication
Remove exact duplicates and near-duplicates. Techniques include:
- Exact matching - Hash-based comparison of full documents
- MinHash - Probabilistic algorithm for finding near-duplicates in large datasets
- SimHash - Locality-sensitive hashing for fuzzy deduplication
- Semantic deduplication - Embedding documents and clustering by similarity
Deduplication matters because training on repeated content wastes compute and can cause the model to overfit to common phrases.
5. Quality filtering
Apply heuristics to remove low-quality documents:
- Language detection (keep only high-confidence target language)
- Readability scores
- Text-to-markup ratio
- Punctuation and symbol density
- Sentence length distribution
- Stop word presence
6. Content filtering
Remove harmful, sensitive, or unwanted material:
- Toxicity classifiers
- PII detection and redaction
- Adult content filters
- Copyrighted material detection
- Bias mitigation strategies
7. Formatting for training
Structure the data for the specific training objective:
- Pre-training - Simple text sequences with tokenization
- Instruction tuning -
{instruction, input, output}triples or conversation threads - RLHF - Prompts with ranked response pairs from human annotators
8. Validation
Verify the dataset before it goes to the model:
- Statistical checks on length distributions and token counts
- Manual spot-checks of random samples
- Automated tests for data leakage (ensuring evaluation benchmarks do not appear in training)
- Format validation against the expected schema
Pipeline quality determines model quality
The phrase “garbage in, garbage out” applies strongly to language models. A model trained on clean, diverse, well-structured data will outperform a model trained on a larger but noisier dataset. Recent research has shown that careful curation can match or exceed the benefits of simply scaling data volume.
For example, the Llama 2 paper emphasized data quality improvements over pure scale. The training corpus was heavily deduplicated and filtered, resulting in better performance with less total data than some competitors.
Pipeline tools and frameworks
- HuggingFace Datasets - A library for loading, processing, and sharing datasets
- Dolma - AI2’s open data pipeline for pre-training
- Data-Juicer - A one-stop data processing system for LLM training
- Crawler tools - First-stage ingestion from web sources
- Spark / Ray - Distributed processing for billion-document scales
How crawler.sh fits into training pipelines
crawler.sh serves as the ingestion and extraction layer of a training data pipeline. It replaces the custom crawler and HTML-to-text converter that most teams build themselves:
# Crawl a site and export clean Markdowncrawler crawl https://docs.example.com --max-depth 3 --render --output archive.zip
# Extracted Markdown files are ready for the normalization stageThe output is structured, clean, and formatted consistently. Each page becomes a Markdown file with metadata in the frontmatter. This integrates directly with pipeline tools like HuggingFace Datasets or custom preprocessing scripts.
For teams building domain-specific models, crawler.sh enables targeted collection: crawl a set of authoritative sources, extract the content, and feed it into the pipeline as high-quality training material.