Training data pipeline

What is a Training Data Pipeline for LLMs?

A training data pipeline is the automated process of collecting, cleaning, formatting, and validating data for large language model training.

In practice, the pipeline is the factory floor of AI development: pages, documents, and conversations enter at one end, and structured, deduplicated, validated datasets exit at the other.

Building a good pipeline is harder than it sounds. Raw data is messy. Web pages contain ads, navigation, and broken HTML. Books have page headers and footers. Conversations include typos, slang, and off-topic digressions. A pipeline must clean this input without destroying the valuable signal within it.

Stages of a training data pipeline

A typical pipeline has six to eight stages, each implemented as a transform that filters or enriches the data:

1. Ingestion

Collect raw source material from web crawls, APIs, file uploads, or existing datasets. At this stage, the data is in its native format: HTML, PDF, Word documents, JSON, or plain text.

2. Extraction

Pull the primary content out of its container. For HTML, this means removing boilerplate (headers, footers, ads, sidebars) and keeping the article body. For PDFs, this means converting to text while preserving paragraph structure.
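The extraction step can be sketched with Python's standard library alone. This is a minimal illustration, not a production extractor: the choice of which tags count as boilerplate (SKIP_TAGS) and which tags end a paragraph (BLOCK_TAGS) is an assumption made for the example, and real extractors rely on richer signals such as link density.

```python
from html.parser import HTMLParser

# Assumption for this sketch: everything inside these tags is boilerplate.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Assumption for this sketch: closing one of these tags ends a paragraph.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "li"}


class BodyTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a boilerplate element
        self.blocks = []     # finished paragraphs
        self.current = []    # raw text pieces of the paragraph being built

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)

    def flush(self):
        # Join the pieces and collapse whitespace into single spaces.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(text)
        self.current = []


def extract_text(html: str) -> str:
    """Return the non-boilerplate text, one blank line between paragraphs."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.blocks)
```

Keeping paragraphs as separate blocks preserves the structure that later stages (sentence-length checks, deduplication) depend on.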

3. Normalization

Convert everything to a consistent format. Common choices:

  • Markdown for text documents
  • JSONL for structured records
  • Parquet for columnar storage at scale

A JSONL file stores one record per line, for example:

{"text": "The quick brown fox...", "source": "https://example.com/article", "language": "en"}
{"text": "Machine learning is...", "source": "https://example.com/blog", "language": "en"}
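A small sketch of how records like these might be produced and read back. The helper names (normalize_record, write_jsonl, read_jsonl) are hypothetical, and the specific normalization choices (Unicode NFC, collapsing whitespace within each line) are assumptions rather than a fixed standard.

```python
import io
import json
import unicodedata


def normalize_record(text: str, source: str, language: str) -> dict:
    """Build one record with the same three fields as the JSONL example."""
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of spaces and tabs inside each line, keep line breaks.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return {"text": "\n".join(lines).strip(), "source": source, "language": language}


def write_jsonl(records, fp) -> None:
    # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
    for record in records:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(fp) -> list:
    return [json.loads(line) for line in fp if line.strip()]


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_jsonl(
    [normalize_record("The  quick\tbrown fox \n", "https://example.com/article", "en")],
    buffer,
)
buffer.seek(0)
records = read_jsonl(buffer)
```

Because each line is an independent JSON object, a JSONL file can be streamed, split, and processed in parallel without loading the whole dataset.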

4. Deduplication

Remove exact duplicates and near-duplicates. Techniques include:

  • Exact matching - Hash-based comparison of full documents
  • MinHash - Probabilistic signatures that estimate the Jaccard similarity between documents, used to find near-duplicates in large datasets
  • SimHash - Compact fingerprints whose bit differences approximate how dissimilar two documents are, used for fuzzy deduplication
  • Semantic deduplication - Embedding documents and clustering by similarity

Deduplication matters because training on repeated content wastes compute and pushes the model toward memorizing repeated passages rather than generalizing.
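The first two techniques can be sketched in pure Python. This is a toy version under stated assumptions: documents are compared through 3-word shingles, salted BLAKE2b stands in for the random hash functions of MinHash, and every document is compared against every kept document. At scale, the pairwise comparison is usually replaced by locality-sensitive hashing over signature bands.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set:
    """Set of overlapping n-word windows; short texts yield one shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_perm: int = 64) -> list:
    """For each salted hash function, keep the minimum hash over all shingles."""
    encoded = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big")
            for s in encoded
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold: float = 0.8, num_perm: int = 64) -> list:
    """Drop exact duplicates by hash, then near-duplicates by MinHash estimate."""
    seen_hashes = set()
    kept = []  # (document, signature) pairs that survived so far
    for doc in docs:
        canonical = " ".join(doc.lower().split())
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after case and whitespace normalization
        seen_hashes.add(digest)
        sig = minhash_signature(doc, num_perm)
        if any(estimated_jaccard(sig, other) >= threshold for _, other in kept):
            continue  # near-duplicate of a document already kept
        kept.append((doc, sig))
    return [doc for doc, _ in kept]
```

The threshold of 0.8 is illustrative. Lower values remove more near-duplicates at the cost of occasionally dropping documents that merely share boilerplate.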

5. Quality filtering

Apply heuristics to remove low-quality documents:

  • Language detection (keep only documents identified as the target language with high confidence)
  • Readability scores
  • Text-to-markup ratio
  • Punctuation and symbol density
  • Sentence length distribution
  • Stop word presence
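A few of these heuristics can be sketched as simple signal functions. The stop word list and every threshold below are illustrative assumptions for English text, not recommended values. In practice they are tuned against manually reviewed samples.

```python
import re

# Tiny illustrative stop word list (assumption); real pipelines use larger ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
              "that", "for", "on", "with", "as", "was"}


def quality_signals(text: str) -> dict:
    """Compute a handful of cheap per-document quality signals."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    non_space = [c for c in text if not c.isspace()]
    symbols = [c for c in non_space if not c.isalnum()]
    return {
        "word_count": len(words),
        "stop_word_ratio": sum(w in STOP_WORDS for w in words) / len(words) if words else 0.0,
        "symbol_density": len(symbols) / len(non_space) if non_space else 1.0,
        "mean_sentence_words": len(words) / len(sentences) if sentences else 0.0,
    }


def passes_quality(text: str, min_words: int = 5, min_stop_ratio: float = 0.1,
                   max_symbol_density: float = 0.3, max_sentence_words: int = 100) -> bool:
    """Keep a document only if every signal is inside its (assumed) bounds."""
    s = quality_signals(text)
    return (
        s["word_count"] >= min_words
        and s["stop_word_ratio"] >= min_stop_ratio
        and s["symbol_density"] <= max_symbol_density
        and s["mean_sentence_words"] <= max_sentence_words
    )
```

Natural prose contains a steady share of stop words and few symbols, so a very low stop word ratio or a very high symbol density tends to indicate menus, keyword lists, or markup residue.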

6. Content filtering

Remove harmful, sensitive, or unwanted material:

  • Toxicity classifiers
  • PII detection and redaction
  • Adult content filters
  • Copyrighted material detection
  • Bias mitigation strategies
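As one concrete case, PII redaction is often started with pattern matching. The two regular expressions below are deliberately crude assumptions for illustration: they catch common email and phone formats, miss many others, and can produce false positives (a date such as 2024-01-15 would match the phone pattern). Production systems typically combine patterns with trained entity recognizers.

```python
import re

# Rough email pattern (assumption, not RFC-complete).
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
# Rough phone pattern (assumption): a digit, at least seven digits or
# separators, and a final digit, with an optional leading plus sign.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def redact_pii(text: str):
    """Replace matches with placeholder tokens and report how many were found."""
    text, email_count = EMAIL_RE.subn("[EMAIL]", text)
    text, phone_count = PHONE_RE.subn("[PHONE]", text)
    return text, {"email": email_count, "phone": phone_count}
```

Replacing matches with placeholder tokens, rather than deleting them, keeps the sentence structure intact for training while removing the sensitive value.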

7. Formatting for training

Structure the data for the specific training objective:

  • Pre-training - Plain text that is tokenized and split into training sequences
  • Instruction tuning - {instruction, input, output} triples or conversation threads
  • RLHF - Prompts paired with responses that human annotators have ranked (for example, a preferred and a rejected answer)
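A sketch of how the instruction tuning and preference formats might be rendered. The prompt template with "### Instruction:" style headers is one common convention chosen here as an assumption. The field names prompt, completion, chosen, and rejected are likewise hypothetical and should match whatever the training framework expects.

```python
def format_instruction(example: dict) -> dict:
    """Render an {instruction, input, output} triple as a prompt/completion pair."""
    instruction = example["instruction"].strip()
    context = example.get("input", "").strip()
    prompt = f"### Instruction:\n{instruction}\n\n"
    if context:
        # The input section is optional and omitted when empty.
        prompt += f"### Input:\n{context}\n\n"
    prompt += "### Response:\n"
    return {"prompt": prompt, "completion": example["output"].strip()}


def format_preference(prompt: str, chosen: str, rejected: str) -> dict:
    """Build one preference record from a human-ranked pair of responses."""
    if chosen.strip() == rejected.strip():
        raise ValueError("chosen and rejected responses must differ")
    return {"prompt": prompt, "chosen": chosen.strip(), "rejected": rejected.strip()}
```

Whatever template is chosen, the same one has to be used at inference time, since the model learns to respond to that exact framing.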

8. Validation

Verify the dataset before it goes to the model:

  • Statistical checks on length distributions and token counts
  • Manual spot-checks of random samples
  • Automated tests for data leakage (ensuring evaluation benchmarks do not appear in training)
  • Format validation against the expected schema
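Two of these checks, schema validation and benchmark leakage detection, can be sketched as follows. The required fields mirror the text, source, and language fields used in the normalization example. The 8-word n-gram window for overlap detection is an assumed value, and the function names are hypothetical.

```python
import re

# Expected schema (assumption): the three fields used in the JSONL example.
REQUIRED_FIELDS = {"text": str, "source": str, "language": str}


def validate_record(record: dict) -> list:
    """Return a list of human-readable schema errors; empty means valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}")
    if isinstance(record.get("text"), str) and not record["text"].strip():
        errors.append("empty text")
    return errors


def word_ngrams(text: str, n: int) -> set:
    """Lowercased word n-grams with punctuation ignored."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_leaks(train_texts, benchmark_texts, n: int = 8) -> list:
    """Indices of training documents sharing any n-gram with a benchmark item."""
    benchmark_grams = set()
    for text in benchmark_texts:
        benchmark_grams |= word_ngrams(text, n)
    return [i for i, text in enumerate(train_texts)
            if word_ngrams(text, n) & benchmark_grams]
```

A shorter window catches paraphrased leaks but flags more innocent overlaps, so the window size is a trade-off to tune per benchmark.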

Pipeline quality determines model quality

The phrase “garbage in, garbage out” applies strongly to language models. A model trained on clean, diverse, well-structured data can outperform a model trained on a larger but noisier dataset, and careful curation has been shown to match or exceed the gains from simply adding more data.

For example, the Llama 2 paper reports that, for supervised fine-tuning, setting aside millions of third-party examples in favor of a much smaller set of higher-quality annotations (on the order of tens of thousands) notably improved results.

Pipeline tools and frameworks

  • Hugging Face Datasets - A library for loading, processing, and sharing datasets
  • Dolma - AI2’s open pre-training corpus and the accompanying data curation toolkit
  • Data-Juicer - A one-stop data processing system for LLM training
  • Crawler tools - First-stage ingestion from web sources
  • Spark / Ray - Distributed processing for billion-document scales

How crawler.sh fits into training pipelines

crawler.sh serves as the ingestion and extraction layer of a training data pipeline. It replaces the custom crawler and HTML-to-text converter that most teams build themselves:

# Crawl a site and export clean Markdown
crawler crawl https://docs.example.com --max-depth 3 --render --output archive.zip
# Extracted Markdown files are ready for the normalization stage

The output is structured, clean, and consistently formatted. Each page becomes a Markdown file with metadata in the frontmatter, which can be loaded into pipeline tools like Hugging Face Datasets or custom preprocessing scripts.
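A custom preprocessing script for such files might start by separating the frontmatter from the body. The sketch below assumes the frontmatter is a block of simple "key: value" lines between two "---" delimiters. The exact keys crawler.sh writes are not specified here, so the title and url fields in the usage example are hypothetical.

```python
def split_frontmatter(markdown: str):
    """Split a Markdown document into (metadata dict, body text).

    Only flat "key: value" lines are handled; nested YAML is out of scope
    for this sketch.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown  # no frontmatter block
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[i + 1:]).strip()
            return meta, body
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, markdown  # opening delimiter without a closing one


# Hypothetical file content; the field names are assumptions.
page = (
    "---\n"
    "title: Getting started\n"
    "url: https://docs.example.com/start\n"
    "---\n"
    "# Getting started\n"
    "\n"
    "Install the tool.\n"
)
meta, body = split_frontmatter(page)
record = {"text": body, "source": meta.get("url", ""), "language": "en"}
```

From here, each record can be written as one JSONL line and passed to the deduplication and filtering stages.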

For teams building domain-specific models, crawler.sh enables targeted collection: crawl a set of authoritative sources, extract the content, and feed it into the pipeline as high-quality training material.
