Synthetic data is training material generated by AI models rather than collected from real-world sources. A large language model like GPT-4 can write thousands of examples of product reviews, code explanations, or medical Q&A pairs in the time it takes a human to write one. This synthetic output, when properly filtered and validated, becomes training data for smaller or specialized models.
The concept is not new. Data augmentation has long been used in computer vision: rotate an image, add noise, or adjust brightness to create new training examples from a small real dataset. Synthetic data for language models is the textual equivalent: generate variations, paraphrases, and new compositions to expand a training corpus.
How synthetic data is generated
The typical workflow involves three steps:
- Seed collection - Gather a small set of high-quality real examples (the seed set)
- Generation - Use a powerful model to produce variations based on the seeds
- Filtering - Remove low-quality, repetitive, or incorrect synthetic examples
For example, a simplified version of the Self-Instruct method works like this:
```python
import json

# Real seed examples
seeds = [
    {"instruction": "Explain photosynthesis.", "output": "Photosynthesis is the process..."}
]

# Prompt a powerful model to generate new instruction-output pairs
prompt = f"""Given these examples, generate 10 new instruction-output pairs:
{json.dumps(seeds)}"""

# GPT-4 generates:
# {"instruction": "Explain cellular respiration.", "output": "Cellular respiration is..."}
# {"instruction": "What is the Calvin cycle?", "output": "The Calvin cycle..."}
```
When synthetic data works
Synthetic data excels in specific scenarios:
- Format conversion - Converting existing data into a new schema. For example, turning paragraph-style explanations into question-answer pairs.
- Low-resource languages - Real data is scarce for languages like Icelandic or Swahili. A multilingual model can generate synthetic examples to bootstrap a dataset.
- Domain expansion - A model trained on general text can generate domain-specific examples (legal, medical, technical) when seeded with a few real cases.
- Privacy-sensitive domains - Generating synthetic medical records or financial transactions avoids using real patient or customer data.
- Scale amplification - Starting with 1,000 human-written examples and expanding to 100,000 synthetic ones.
When synthetic data fails
Synthetic data has clear limitations:
- Factual accuracy - Models hallucinate. Synthetic examples may contain false facts, invented citations, or incorrect code. A model trained on these errors learns to reproduce them.
- Bias amplification - The teacher model’s biases are passed to the student model. If GPT-4 overrepresents certain viewpoints, the synthetic data will too.
- Lack of novelty - Synthetic data is derived from the teacher model’s training distribution. It rarely captures genuinely new information or recent events.
- Quality ceiling - A student model trained primarily on synthetic data rarely exceeds the teacher model’s capabilities, because it can only learn what the teacher is able to produce.
- Model collapse - Training successive generations of models on synthetic data from previous generations degrades quality over time, like a photocopy of a photocopy.
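The photocopy effect can be illustrated with a toy simulation. This is only a sketch under a strong simplifying assumption: a one-dimensional Gaussian stands in for a model, and each generation is fitted solely to a small sample drawn from the previous one. The function name `simulate_collapse` and its parameters are hypothetical and chosen for illustration.

```python
import random
import statistics


def simulate_collapse(generations=300, samples_per_generation=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation after each generation,
    starting with the original ("real") distribution at index 0.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: stands in for the real data
    history = [std]
    for _ in range(generations):
        # The current "model" generates a small synthetic dataset...
        samples = [rng.gauss(mean, std) for _ in range(samples_per_generation)]
        # ...and the next "model" is fitted only to that synthetic data.
        mean = statistics.fmean(samples)
        std = statistics.stdev(samples)
        history.append(std)
    return history


history = simulate_collapse()
print(f"std at generation 0: {history[0]:.3f}")
print(f"std at generation {len(history) - 1}: {history[-1]:.3g}")
```

Because each fit sees only a finite sample, estimation noise compounds and the fitted spread tends to shrink across generations. Real language models are far more complex, so this shows the mechanism only and does not predict how quickly any real system degrades.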
Synthetic data vs real data
| Real data | Synthetic data |
|---|---|
| Grounded in reality | May contain hallucinations |
| Expensive to collect | Cheap to generate at scale |
| Diverse and novel | Derived from existing model knowledge |
| Contains edge cases | Tends toward average, safe outputs |
| Legally complex (copyright, privacy) | Cleaner legal status, but provenance questions remain |
| Limited quantity | Unlimited quantity with diminishing returns |
The best practice is a hybrid approach: a foundation of real, verified data augmented with synthetic examples for format diversity and scale.
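A hybrid mix can be sketched as follows. This is a minimal illustration, not a recommended recipe: the helper `mix_datasets`, the `source` field, and the 75% synthetic fraction are all assumptions made for the example. It keeps every real example and samples just enough synthetic ones to reach the target share.

```python
import random


def mix_datasets(real, synthetic, synthetic_fraction=0.5, seed=0):
    """Combine all real examples with enough synthetic ones to hit the target fraction.

    Each returned example is tagged with its provenance under the "source" key.
    """
    if not 0 <= synthetic_fraction < 1:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve s / (len(real) + s) == synthetic_fraction for s.
    wanted = round(len(real) * synthetic_fraction / (1 - synthetic_fraction))
    # Never ask for more synthetic examples than are available.
    chosen = rng.sample(synthetic, min(wanted, len(synthetic)))
    mixed = [{**ex, "source": "real"} for ex in real]
    mixed += [{**ex, "source": "synthetic"} for ex in chosen]
    rng.shuffle(mixed)
    return mixed


# Placeholder examples, for illustration only.
real = [{"text": f"real example {i}"} for i in range(100)]
synthetic = [{"text": f"synthetic example {i}"} for i in range(1000)]

mixed = mix_datasets(real, synthetic, synthetic_fraction=0.75)
n_synthetic = sum(1 for ex in mixed if ex["source"] == "synthetic")
print(len(mixed), n_synthetic)
```

Tagging each example with its source keeps provenance visible later, which makes it easier to audit the mix or change the ratio.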
Quality control for synthetic data
Effective synthetic data pipelines include rigorous filtering:
- Deduplication - Remove synthetic examples that are too similar to each other or to the seed set
- Fact-checking - Use retrieval systems or smaller verification models to flag likely hallucinations
- Diversity scoring - Ensure the synthetic set covers the intended topic distribution
- Human review - Spot-check random samples for quality, especially for high-stakes domains
- Perplexity filtering - Remove examples that are too predictable (copied from seeds) or too chaotic (nonsensical output)
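Two of these filters are simple enough to sketch in a few lines. The code below is an illustrative assumption, not a reference pipeline: deduplication uses Jaccard similarity over word trigrams with an arbitrary 0.7 threshold, and the perplexity filter takes a caller-supplied `score` function (hypothetical here, in practice backed by a language model) and keeps examples inside a chosen band.

```python
import re


def shingles(text, n=3):
    """Return the set of lowercased word n-grams in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def deduplicate(candidates, seeds, threshold=0.7):
    """Keep candidates that are not near-duplicates of the seeds or of earlier keepers."""
    seen = [shingles(s) for s in seeds]
    kept = []
    for text in candidates:
        signature = shingles(text)
        if all(jaccard(signature, other) < threshold for other in seen):
            kept.append(text)
            seen.append(signature)
    return kept


def perplexity_band(examples, score, low, high):
    """Keep examples whose score lies in [low, high].

    `score` is any callable mapping an example to a number, such as a
    perplexity computed by a language model (not implemented here).
    """
    return [ex for ex in examples if low <= score(ex) <= high]


seeds = ["Explain photosynthesis in simple terms."]
candidates = [
    "Explain photosynthesis in simple terms.",        # copy of the seed
    "Explain cellular respiration in simple terms.",
    "What is the Calvin cycle?",
    "What is the Calvin cycle?",                      # exact repeat
]
kept = deduplicate(candidates, seeds)
print(kept)
```

The pairwise comparison is quadratic in the number of examples, which is fine for spot checks on small sets. Larger corpora usually need an approximate method such as MinHash.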
How crawler.sh enables better synthetic data
crawler.sh improves synthetic data generation by providing high-quality seed material:
- Verified sources - Crawl authoritative documentation, research papers, and expert blogs to use as factual seeds
- Current information - Unlike stale training data, freshly crawled content reflects what the sources say today, which can reduce hallucination in time-sensitive domains
- Clean extraction - Markdown output removes HTML noise, giving the generation model clean text to work from
- Structured format - Consistent Markdown headings and lists make it easier to programmatically convert content into instruction pairs
- Domain targeting - Crawl specific verticals (legal, medical, technical) to generate domain-specific synthetic data with real grounding
For example, a team building a coding assistant might crawl the latest documentation for a new framework, use those clean docs as seeds, and generate synthetic Q&A pairs. The real documentation grounds the synthetic examples in correct, current information.
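The conversion from crawled Markdown to generation prompts can be sketched as below. This assumes the crawled page is already available as a Markdown string and does not call crawler.sh itself. The function names, the prompt wording, and the sample document are hypothetical, and the heading parser only handles `#`-style headings.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker


def split_markdown_sections(markdown):
    """Split Markdown into (heading, body) pairs at '#'-style headings.

    Lines inside fenced code blocks are never treated as headings,
    and sections with an empty body are skipped.
    """
    sections = []
    heading, body = None, []
    in_fence = False

    def flush():
        if heading is not None and any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
        if match:
            flush()
            heading, body = match.group(1), []
        elif heading is not None:
            body.append(line)
    flush()
    return sections


def build_generation_prompt(heading, body, pairs=5):
    """Build a prompt asking a generator model for Q&A pairs grounded in one section."""
    return (
        f"Using only the documentation below, write {pairs} question-answer pairs "
        "as a JSON list of objects with 'instruction' and 'output' keys.\n\n"
        f"Section: {heading}\n\n{body}"
    )


# Made-up sample standing in for crawled documentation.
doc = "# Intro\n\nWelcome to the framework.\n\n## Install\n\nRun the installer.\n"
for heading, body in split_markdown_sections(doc):
    print(build_generation_prompt(heading, body, pairs=3))
    print("---")
```

Prompting one section at a time keeps each generation request tied to a specific piece of real documentation, so every synthetic pair can be traced back to its source section during review.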