What Is Data Cleaning for LLM Training?

Data cleaning is the process of transforming raw, noisy datasets into high-quality training material for large language models. Raw web data contains boilerplate HTML, duplicate articles, spelling errors, toxic comments, and personally identifiable information. Cleaning removes or fixes these problems so the model learns from signal rather than noise.

The importance of data cleaning has grown as the field shifted from “more data is always better” to “better data beats more data.” As an illustration, a model trained on 100 billion clean tokens can outperform one trained on 500 billion dirty tokens. Cleaning is now recognized as one of the highest-leverage steps in the training pipeline.

Types of noise in raw data

HTML boilerplate - Navigation menus, cookie banners, ads, sidebars, and footers that repeat across pages
Template text - “Click here to learn more”, “Sign up for our newsletter”, and other site-wide phrases that contain no information
Duplicates - Identical or near-identical pages scraped multiple times
Encoding errors - Mojibake, invalid UTF-8 sequences, and mixed character sets
Low-quality text - Auto-generated content, spam, keyword-stuffed pages, and placeholder text
Toxic content - Hate speech, harassment, and harmful instructions
PII - Email addresses, phone numbers, credit cards, and names that should not be in training data
Code and markup - Inline CSS, JavaScript snippets, and JSON blobs mixed with prose
Non-text content - Base64 images, SVG paths, and binary data accidentally included
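
As a small illustration of the encoding-error category, the sketch below shows one common heuristic: UTF-8 text that was mistakenly decoded as Latin-1 can often be recovered by reversing that step. The function names repair_mojibake and has_invalid_utf8 are hypothetical, and the round-trip only covers this one kind of mojibake, so treat it as a starting point rather than a complete repair tool.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Latin-1, if possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not Latin-1 mojibake (or already clean): leave the text unchanged.
        return text


def has_invalid_utf8(raw: bytes) -> bool:
    """Return True if the byte string is not valid UTF-8."""
    try:
        raw.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


print(repair_mojibake("cafÃ©"))      # garbled input is repaired
print(repair_mojibake("café"))       # clean input is returned unchanged
print(has_invalid_utf8(b"caf\xe9"))  # a lone Latin-1 byte is not valid UTF-8
```

Documents that fail the byte-level check are usually re-decoded with a detected encoding or dropped, depending on how much of the corpus they represent.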

Cleaning techniques

Boilerplate removal

Extract the main content from HTML using readability algorithms or DOM-based heuristics. The goal is to keep the article body while removing navigation, sidebars, footers, and similar page chrome:

<!-- Remove this -->
<nav>Home | About | Contact</nav>
<aside>Related articles...</aside>
<footer>Copyright 2024...</footer>

<!-- Keep this -->
<article>
  <h1>Machine Learning Basics</h1>
  <p>Machine learning is a subset of...</p>
</article>
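
A minimal sketch of the tag-based version of this idea, using only Python's standard html.parser module. The BOILERPLATE_TAGS set and the MainTextExtractor class are assumptions made for illustration. Production extractors typically combine this kind of rule with text-density or readability scoring, since many sites do not use semantic tags.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch (an assumption, not a complete list).
BOILERPLATE_TAGS = {"nav", "aside", "footer", "header", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = """
<nav>Home | About | Contact</nav>
<aside>Related articles...</aside>
<footer>Copyright 2024...</footer>
<article>
  <h1>Machine Learning Basics</h1>
  <p>Machine learning is a subset of...</p>
</article>
"""
print(extract_main_text(page))
```

Run on the sample page, this keeps only the heading and paragraph inside the article element.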

Deduplication

Exact deduplication uses cryptographic hashes (MD5, SHA-256) to find identical documents. Near-deduplication uses algorithms like MinHash and SimHash to find documents that are roughly 80-95% similar. Such near-duplicates are often caused by:

Print-friendly versions of the same article
Mobile and desktop variants of the same page
Syndicated content published on multiple sites
Template-generated pages with minor variations

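from datasketch import MinHash

def minhash_of(text, num_perm=128):
    """Build a MinHash from the set of lowercase word tokens in text."""
    m = MinHash(num_perm=num_perm)
    # update() must be called once per token; hashing the whole string
    # as a single element would compare two one-element sets.
    for token in set(text.lower().split()):
        m.update(token.encode("utf8"))
    return m

m1 = minhash_of("The quick brown fox")
m2 = minhash_of("The quick brown fox jumps")

# Estimated Jaccard similarity of the two token sets (true value: 4/5 = 0.8)
print(m1.jaccard(m2))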
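
For the exact-deduplication case, a minimal sketch using the standard hashlib module might look like the following. The normalization step (lowercasing and collapsing whitespace) is an assumption added here so that trivially different copies hash to the same value; a pipeline that needs byte-for-byte matching would hash the raw text instead.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(docs):
    """Keep the first document seen for each content hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = content_hash(doc)
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = [
    "The quick brown fox",
    "the  quick brown fox\n",     # same text, different case and whitespace
    "The quick brown fox jumps",  # near-duplicate, not an exact match
]
print(drop_exact_duplicates(docs))
```

The third document survives because hashing only catches identical content after normalization. Catching it requires a similarity method such as MinHash.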

Quality scoring

Documents are scored on multiple dimensions and discarded if they fall below a threshold:

Perplexity - A reference language model measures how predictable the text is. Unnaturally predictable text (low perplexity) is often templated or repetitive. Unnaturally unpredictable text (high perplexity) is often garbled or ungrammatical.
Readability - Flesch-Kincaid, Gunning Fog, or automated readability scores
Language confidence - FastText or langdetect scores below 0.9 often indicate mixed-language or nonsensical text
Stop word ratio - Legitimate text contains common words. Spam often has unusual word distributions
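
The stop word heuristic is the simplest of these to sketch. The example below is illustrative only: the STOP_WORDS list is a tiny hand-picked English sample and the min_ratio default of 0.2 is a hypothetical threshold, both of which would need tuning per language and corpus.

```python
import re

# A tiny illustrative stop word list; real pipelines use fuller, per-language lists.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "that", "for", "on", "with", "as", "are", "was", "this", "be",
}


def stop_word_ratio(text: str) -> float:
    """Fraction of word tokens that are stop words (0.0 for empty text)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in STOP_WORDS for w in words) / len(words)


def passes_stop_word_filter(text: str, min_ratio: float = 0.2) -> bool:
    # min_ratio is a hypothetical threshold; tune it on labeled samples.
    return stop_word_ratio(text) >= min_ratio


prose = "Machine learning is a subset of the field of artificial intelligence."
spam = "cheap watches buy watches best watches discount watches online"
print(passes_stop_word_filter(prose))  # natural prose contains many stop words
print(passes_stop_word_filter(spam))   # keyword-stuffed text contains none
```

In practice this score is combined with the other signals above, since a single heuristic will misclassify legitimate content such as tables, code, or lists.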

PII removal

Regular expressions and named entity recognition find and redact:

Email addresses: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
Phone numbers: varying regional formats
Credit cards: Luhn-valid number sequences
Names: spaCy or transformer-based NER models

Some pipelines replace PII with placeholder tokens like [EMAIL] or [NAME] rather than deleting it, which preserves sentence structure while anonymizing the text.
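
A minimal sketch of regex-based redaction for two of these categories, using the email pattern above and a Luhn checksum for card numbers. The CARD_RE pattern (13 to 19 digits with optional spaces or hyphens) and the [CARD] token are assumptions made for this example. Phone numbers and names are left out because they need region-specific rules and an NER model respectively.

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
# Candidate card numbers: 13-19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in the string pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pii(text: str) -> str:
    text = EMAIL_RE.sub("[EMAIL]", text)
    # Redact only digit runs that pass the Luhn check, to spare order IDs and similar.
    return CARD_RE.sub(
        lambda m: "[CARD]" if luhn_valid(m.group()) else m.group(), text
    )


sample = "Contact jane.doe@example.com, card 4111 1111 1111 1111, order 1234567890123."
print(redact_pii(sample))
```

The Luhn check is what keeps the 13-digit order number intact while the well-known test card number is replaced.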

Toxicity filtering

Classifiers like Perspective API, Detoxify, or in-house models score text for:

Toxicity
Severe toxicity
Identity attack
Insult
Threat
Sexually explicit content

Pages scoring above a threshold are removed or quarantined for manual review.
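
The remove-or-quarantine decision can be sketched as a two-threshold router. The scores here are made-up values standing in for classifier output, and the REMOVE_THRESHOLD and REVIEW_THRESHOLD constants are hypothetical; suitable values depend on the classifier and on how much manual review capacity is available.

```python
from typing import Dict

# Hypothetical thresholds; real values depend on the classifier and the corpus.
REMOVE_THRESHOLD = 0.9
REVIEW_THRESHOLD = 0.5


def route_document(scores: Dict[str, float]) -> str:
    """Return 'remove', 'quarantine', or 'keep' based on the highest category score."""
    worst = max(scores.values(), default=0.0)
    if worst >= REMOVE_THRESHOLD:
        return "remove"
    if worst >= REVIEW_THRESHOLD:
        return "quarantine"
    return "keep"


# Scores would come from a toxicity classifier; these values are invented.
examples = {
    "doc-1": {"toxicity": 0.02, "insult": 0.01, "threat": 0.00},
    "doc-2": {"toxicity": 0.62, "insult": 0.55, "threat": 0.03},
    "doc-3": {"toxicity": 0.97, "insult": 0.91, "threat": 0.40},
}
for doc_id, scores in examples.items():
    print(doc_id, route_document(scores))
```

Routing on the maximum category score is one simple design choice; per-category thresholds are a natural extension when some categories warrant stricter handling than others.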

The cleaning tradeoff

Aggressive cleaning improves quality but risks removing valuable content:

A page with a few typos might still contain unique technical knowledge
Conversational text with slang might be exactly what you want for chat model training
Technical documentation with code snippets is valuable despite unusual token distributions

Many pipelines use a tiered approach: strict filtering for pre-training, and looser filtering for domain-specific fine-tuning, where every relevant document matters.

How crawler.sh reduces cleaning burden

crawler.sh performs extraction and initial cleaning at crawl time, producing output that requires far less downstream processing:

Boilerplate removal - The extraction engine automatically identifies and removes navigation, ads, and sidebars using DOM heuristics and text density analysis
HTML-to-Markdown conversion - Converts tag soup to clean Markdown, eliminating inline styles, script tags, and presentation markup
Metadata separation - Titles, descriptions, and canonical URLs are stored in frontmatter, not mixed into the content
Consistent formatting - Every output file follows the same structure, simplifying downstream parsing
JavaScript rendering - Captures the final rendered text rather than the raw HTML shell, avoiding extraction of loading spinners and placeholder text

By handling these steps during ingestion, crawler.sh reduces the number of documents that fail quality filters and simplifies the deduplication stage. Teams spend less time writing cleaning rules and more time curating high-value sources.