What is Data Cleaning for LLM Training?

Data cleaning is the process of transforming raw, noisy datasets into high-quality training material for large language models. Raw web data contains boilerplate HTML, duplicate articles, spelling errors, toxic comments, and personally identifiable information (PII). Cleaning removes or fixes these problems so the model learns from signal rather than noise.

The importance of data cleaning has grown as the field shifted from “more data is always better” to “better data beats more data.” As an illustration, a model trained on 100 billion clean tokens can outperform one trained on 500 billion noisy tokens, although the exact tradeoff depends on the model and the task. Cleaning is now widely regarded as one of the highest-leverage steps in the training pipeline.

Types of noise in raw data

  • HTML boilerplate - Navigation menus, cookie banners, ads, sidebars, and footers that repeat across pages
  • Template text - “Click here to learn more”, “Sign up for our newsletter”, and other site-wide phrases that contain no information
  • Duplicates - Identical or near-identical pages scraped multiple times
  • Encoding errors - Mojibake, invalid UTF-8 sequences, and mixed character sets
  • Low-quality text - Auto-generated content, spam, keyword-stuffed pages, and placeholder text
  • Toxic content - Hate speech, harassment, and harmful instructions
  • PII - Email addresses, phone numbers, credit cards, and names that should not be in training data
  • Code and markup - Inline CSS, JavaScript snippets, and JSON blobs mixed with prose
  • Non-text content - Base64 images, SVG paths, and binary data accidentally included

Cleaning techniques

Boilerplate removal

Extract the main content from HTML using readability algorithms or DOM-based heuristics. The goal is to keep the article body while removing navigation, sidebars, and footers:

<!-- Remove this -->
<nav>Home | About | Contact</nav>
<aside>Related articles...</aside>
<footer>Copyright 2024...</footer>
<!-- Keep this -->
<article>
<h1>Machine Learning Basics</h1>
<p>Machine learning is a subset of...</p>
</article>
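
As a rough illustration of the DOM-based approach, the sketch below uses only Python's standard `html.parser` module to drop text nested inside a fixed set of tags. The tag list and the names `BOILERPLATE_TAGS`, `MainTextExtractor`, and `extract_main_text` are assumptions made for this example. Production extractors typically also use text density and link density, which this sketch does not attempt.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch (an assumption, not a standard list)
BOILERPLATE_TAGS = {"nav", "aside", "footer", "header", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = """
<nav>Home | About | Contact</nav>
<aside>Related articles...</aside>
<footer>Copyright 2024...</footer>
<article>
<h1>Machine Learning Basics</h1>
<p>Machine learning is a subset of...</p>
</article>
"""
print(extract_main_text(sample))
```

On the sample page this keeps the heading and paragraph from the article and discards the navigation, sidebar, and footer text. A fixed tag list only works on pages with semantic markup, so it is best seen as a starting point.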

Deduplication

Exact deduplication uses cryptographic hashes (MD5, SHA-256) to find identical documents. Near-deduplication uses algorithms like MinHash and SimHash to find documents that are 80-95% similar, often caused by:

  • Print-friendly versions of the same article
  • Mobile and desktop variants of the same page
  • Syndicated content published on multiple sites
  • Template-generated pages with minor variations
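
A minimal comparison with the datasketch library, treating each word as one set element:

from datasketch import MinHash

def word_minhash(text):
    # Add each word as a separate set element; hashing the whole string
    # as one element would make any two different texts score near 0
    m = MinHash(num_perm=128)
    for word in text.lower().split():
        m.update(word.encode('utf8'))
    return m

m1 = word_minhash("The quick brown fox")
m2 = word_minhash("The quick brown fox jumps")
# Jaccard similarity estimate (the exact value here is 4/5 = 0.8)
print(m1.jaccard(m2))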
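
Exact deduplication, mentioned at the start of this section, can be sketched with the standard `hashlib` module. The whitespace and case normalization step is an assumption of this example, as are the helper names `content_hash` and `drop_exact_duplicates`. Some pipelines hash the raw bytes instead.

```python
import hashlib


def content_hash(text):
    # Collapse whitespace and lowercase so trivial formatting differences
    # do not defeat exact matching (a normalization choice assumed here)
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(docs):
    seen = set()
    unique = []
    for doc in docs:
        digest = content_hash(doc)
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)  # keep the first occurrence only
    return unique


docs = [
    "The quick brown fox",
    "the  quick brown fox\n",
    "The quick brown fox jumps",
]
unique_docs = drop_exact_duplicates(docs)
print(len(unique_docs))
```

The second document differs from the first only in case and whitespace, so it is dropped. The third has different content and survives, which is why exact hashing is usually paired with a near-duplicate method.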

Quality scoring

Documents are scored on multiple dimensions and discarded if they fall below a threshold:

  • Perplexity - A reference language model measures how predictable the text is. Unnaturally predictable text (low perplexity) is often templated or repetitive. Unnaturally unpredictable text (high perplexity) is often garbled or non-grammatical.
  • Readability - Flesch-Kincaid, Gunning Fog, or other automated readability scores
  • Language confidence - FastText or langdetect confidence scores below 0.9 often indicate mixed-language or nonsensical text
  • Stop word ratio - Legitimate prose contains a predictable share of common words such as “the” and “of”. Spam and keyword lists often have unusual word distributions.
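
The stop word ratio is the simplest of these signals to compute. The sketch below is illustrative only: the short `STOP_WORDS` list and the `min_ratio` value of 0.2 are assumptions for this example, not recommended settings.

```python
import re

# A tiny illustrative English stop word list; real pipelines use larger ones
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "that", "for", "on", "with", "as", "this", "are", "was", "be"}


def stop_word_ratio(text):
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in STOP_WORDS for w in words) / len(words)


def passes_stop_word_filter(text, min_ratio=0.2):
    # min_ratio is a hypothetical threshold chosen for this example
    return stop_word_ratio(text) >= min_ratio


prose = "The model is trained on a large corpus of text that is cleaned first."
spam = "cheap watches buy watches discount watches best watches online"
print(passes_stop_word_filter(prose), passes_stop_word_filter(spam))
```

A filter like this is language-specific and would also reject source code or tables, so it is normally combined with the other scores instead of being used alone.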

PII removal

Regular expressions and named entity recognition find and redact:

  • Email addresses: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
  • Phone numbers: varying regional formats
  • Credit cards: Luhn-valid number sequences
  • Names: spaCy or transformer-based NER models

Some pipelines replace PII with tokens like [EMAIL] or [NAME] rather than deleting it, preserving sentence structure while anonymizing.
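
The regex-based part of this step can be sketched as follows. It reuses the email pattern listed above and the `[EMAIL]` placeholder. The card pattern, the `[CREDIT_CARD]` placeholder, and the function names are assumptions for this example. Names and phone numbers are left out because they need NER models or region-specific rules.

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number):
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Double every second digit from the right, subtracting 9 when it exceeds 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pii(text):
    text = EMAIL_RE.sub("[EMAIL]", text)
    # Redact only digit runs that pass the Luhn check, so that other long
    # numbers (order IDs, for instance) are usually left alone
    return CARD_RE.sub(
        lambda m: "[CREDIT_CARD]" if luhn_valid(m.group()) else m.group(),
        text,
    )


sample = "Contact jane.doe@example.com, card 4111 1111 1111 1111, order 1234567890123."
print(redact_pii(sample))
```

The Luhn check reduces false positives but does not eliminate them, since roughly one in ten random digit strings passes it by chance.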

Toxicity filtering

Classifiers like Perspective API, Detoxify, or in-house models score text for:

  • Toxicity
  • Severe toxicity
  • Identity attack
  • Insult
  • Threat
  • Sexually explicit content

Pages scoring above a threshold are removed or quarantined for manual review.
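
The remove-or-quarantine decision can be expressed as a small routing function. The sketch assumes the classifier returns one score between 0 and 1 per category. The two thresholds and the name `route_document` are hypothetical, and the scores in the example are made up, not produced by any real classifier.

```python
# Hypothetical thresholds; real values depend on the classifier and the dataset
REMOVE_THRESHOLD = 0.9
REVIEW_THRESHOLD = 0.5


def route_document(scores):
    """Return 'remove', 'review', or 'keep' from per-category scores in [0, 1]."""
    worst = max(scores.values(), default=0.0)
    if worst >= REMOVE_THRESHOLD:
        return "remove"
    if worst >= REVIEW_THRESHOLD:
        return "review"  # quarantine for manual review
    return "keep"


# Made-up scores for illustration only
print(route_document({"toxicity": 0.95, "threat": 0.10}))
print(route_document({"toxicity": 0.60, "insult": 0.55}))
print(route_document({"toxicity": 0.05, "insult": 0.02}))
```

Routing on the highest category score is one simple design. A pipeline could instead apply a separate threshold per category, for example a stricter one for threats than for insults.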

The cleaning tradeoff

Aggressive cleaning improves quality but risks removing valuable content:

  • A page with a few typos might still contain unique technical knowledge
  • Conversational text with slang might be exactly what you want for chat model training
  • Technical documentation with code snippets is valuable despite unusual token distributions

Many pipelines use a tiered approach: strict filtering for pre-training, and looser filtering for domain-specific fine-tuning, where every relevant document matters.

How crawler.sh reduces cleaning burden

crawler.sh performs extraction and initial cleaning at crawl time, producing output that requires far less downstream processing:

  • Boilerplate removal - The extraction engine automatically identifies and removes navigation, ads, and sidebars using DOM heuristics and text density analysis
  • HTML-to-Markdown conversion - Converts tag soup to clean Markdown, eliminating inline styles, script tags, and presentation markup
  • Metadata separation - Titles, descriptions, and canonical URLs are stored in frontmatter, not mixed into the content
  • Consistent formatting - Every output file follows the same structure, simplifying downstream parsing
  • JavaScript rendering - Captures the final rendered text rather than the raw HTML shell, avoiding extraction of loading spinners and placeholder text

By handling these steps during ingestion, crawler.sh reduces the number of documents that fail quality filters and simplifies the deduplication stage. Teams spend less time writing cleaning rules and more time curating high-value sources.
