How to Preprocess Web Content for RLHF Training Pairs
A step-by-step guide to crawling web content, cleaning it, and structuring it into preference pairs for RLHF reward model training.
Reinforcement Learning from Human Feedback (RLHF) depends on preference pairs - two model responses to the same prompt, labeled by a human annotator as better or worse. But the quality of those responses starts upstream, with the source content your base model trained on. Noisy, poorly structured training data produces mediocre model outputs, which makes preference annotation harder and less informative.
This guide walks you through crawling web content, cleaning it for model training, and structuring it into a format ready for RLHF preference pair generation.
Step 1: Install crawler.sh CLI
Install the CLI with a single command:
```shell
curl -fsSL https://install.crawler.sh | sh
```

Restart your terminal or run `source ~/.bashrc` (or `~/.zshrc`) to pick up the new PATH entry. Verify the installation:
```shell
crawler --version
```

Step 2: Crawl with content extraction
Run a crawl with the --extract-content flag to pull clean Markdown from every page:
```shell
crawler crawl https://docs.example.com --extract-content --max-pages 10000
```

The crawler processes each page through a readability algorithm that strips navigation, sidebars, footers, and boilerplate. What remains is the main body text converted to Markdown, along with metadata like word count, byline, and excerpt.
For RLHF source data, target sites with high-quality writing in the domain your model will serve:
```shell
crawler crawl https://medical-guidelines.example.com --extract-content --max-pages 5000
crawler crawl https://legal-reference.example.com --extract-content --max-pages 5000
```

Each crawl produces a .crawl file in NDJSON format - one JSON object per line, one line per page.
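Because each NDJSON line is a standalone JSON object, a crawl file can be streamed without loading the whole thing into memory. A minimal sketch (the `read_crawl` helper and its field names are illustrative, not part of the tool):

```python
import json

def read_crawl(path):
    # Stream one parsed page per NDJSON line, skipping blank lines,
    # without loading the entire crawl file into memory
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```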
Step 3: Inspect what you collected
Before processing, review what the crawl captured:
```shell
crawler info docs-example-com.crawl
```

This shows total pages crawled, status code distribution, and response time statistics. Pay attention to:
- 200 OK pages - These contain usable content. Everything else is noise.
- Word count distribution - Pages with fewer than 100 words are likely stubs or navigation shells.
- Redirect chains - Redirected pages rarely carry useful content for training.
Step 4: Export to JSON
Export the crawl data to a structured JSON file:
```shell
crawler export docs-example-com.crawl --format json --output raw-dataset.json
```

The JSON output includes each page’s URL, title, meta description, status code, response time, and extracted Markdown content. This structured format is what your preprocessing scripts will consume.
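Before writing the full filter in the next step, a quick scripted pass over the export can confirm the numbers match what `crawler info` reported. A minimal sketch, assuming the `status_code` and `word_count` fields described above:

```python
from collections import Counter

def summarize(pages, stub_words=100):
    # Distribution of HTTP status codes across the export
    statuses = Counter(p.get("status_code") for p in pages)
    # Count pages too short to carry useful training content
    stubs = sum(1 for p in pages if p.get("word_count", 0) < stub_words)
    return {"statuses": dict(statuses), "stub_pages": stubs}
```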
Step 5: Filter and clean the dataset
Not every crawled page belongs in your training data. Write a preprocessing script to filter and clean the exported JSON. Here are the key filters:
Remove low-value pages
```python
import json

with open("raw-dataset.json") as f:
    pages = json.load(f)

filtered = [
    page for page in pages
    if page.get("status_code") == 200
    and page.get("word_count", 0) >= 150
    and page.get("markdown")  # has extracted content
]

print(f"Kept {len(filtered)} of {len(pages)} pages")
```

Pages with fewer than 150 words rarely contain enough substance to produce meaningful training signal. Error pages and redirects add nothing but noise.
Deduplicate content
Many sites serve identical content at multiple URLs. Deduplicate by comparing the extracted Markdown:
```python
seen_content = set()
unique_pages = []

for page in filtered:
    content_hash = hash(page["markdown"].strip())
    if content_hash not in seen_content:
        seen_content.add(content_hash)
        unique_pages.append(page)

print(f"Removed {len(filtered) - len(unique_pages)} duplicates")
```

Normalize formatting
Strip inconsistent whitespace so the model sees consistent formatting; the same pass is also a natural place to fix broken Unicode and standardize heading levels. A minimal version handles the whitespace:
```python
import re

def normalize(text):
    # Collapse multiple blank lines
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Strip trailing whitespace per line
    text = '\n'.join(line.rstrip() for line in text.split('\n'))
    return text.strip()

for page in unique_pages:
    page["markdown"] = normalize(page["markdown"])
```

Step 6: Segment into training chunks
RLHF works best when prompts target specific topics. Long documents should be split into focused chunks that each cover a single concept or section:
```python
def chunk_by_headings(markdown, min_words=100):
    sections = re.split(r'\n(?=## )', markdown)
    chunks = []
    for section in sections:
        words = len(section.split())
        if words >= min_words:
            chunks.append(section.strip())
    return chunks

all_chunks = []
for page in unique_pages:
    for chunk in chunk_by_headings(page["markdown"]):
        all_chunks.append({
            "source_url": page["url"],
            "content": chunk,
            "word_count": len(chunk.split()),
        })

print(f"Created {len(all_chunks)} training chunks")
```

Splitting on H2 headings preserves the topical coherence of each section. Sections shorter than 100 words are discarded - they lack enough context to generate useful prompts.
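As a quick sanity check, the splitter can be exercised on a toy document to confirm that short sections are dropped (a self-contained sketch repeating the chunker in compact form):

```python
import re

def chunk_by_headings(markdown, min_words=100):
    # Split before each H2 heading; keep only substantial sections
    sections = re.split(r'\n(?=## )', markdown)
    return [s.strip() for s in sections if len(s.split()) >= min_words]

doc = "## Setup\n" + "word " * 120 + "\n## Stub\ntoo short"
chunks = chunk_by_headings(doc)
# Only the long "## Setup" section survives; the stub is dropped
```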
Step 7: Generate prompts for preference pairs
Each chunk becomes the basis for a prompt. The goal is to create questions or instructions that your model can answer, producing two candidate responses for human annotators to compare:
```python
def create_prompt(chunk):
    # Extract the heading as topic context
    lines = chunk["content"].split('\n')
    heading = lines[0].lstrip('#').strip() if lines[0].startswith('#') else ""

    return {
        "prompt": f"Explain the following topic in detail: {heading}",
        "reference_content": chunk["content"],
        "source_url": chunk["source_url"],
    }

prompts = [create_prompt(c) for c in all_chunks if c["word_count"] >= 150]
```

The reference content is not shown to annotators - it serves as ground truth for evaluating whether model responses are factually grounded. Prompts with richer reference content produce more informative preference comparisons.
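A single fixed template produces homogeneous prompts; rotating a few phrasings adds variety to the comparisons annotators see. An illustrative sketch (the template list is an assumption, not part of the pipeline above):

```python
from itertools import cycle

TEMPLATES = [
    "Explain the following topic in detail: {topic}",
    "Write a practical overview of {topic}.",
    "What are the key points to understand about {topic}?",
]

def make_prompts(topics):
    # Pair each topic with the next template in rotation
    return [tpl.format(topic=t) for t, tpl in zip(topics, cycle(TEMPLATES))]
```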
Step 8: Structure the output for annotation
Save the final dataset in a format your annotation pipeline can ingest:
```python
import json

output = {
    "metadata": {
        "source_domains": ["docs.example.com"],
        "total_prompts": len(prompts),
        "created": "2026-03-14",
        "min_word_count": 150,
    },
    "prompts": prompts,
}

with open("rlhf-prompts.json", "w") as f:
    json.dump(output, f, indent=2)
```

From here, feed each prompt to your model to generate two candidate responses, then send the prompt-response pairs to annotators for preference labeling.
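The handoff to annotation can be sketched as one record per comparison: a prompt, two candidate responses, and an empty preference slot. The field names are illustrative; adapt them to whatever your annotation tool ingests:

```python
def make_pair_record(prompt, response_a, response_b):
    # One annotation task: two candidate responses to the same prompt
    return {
        "prompt": prompt["prompt"],
        "source_url": prompt["source_url"],
        "responses": [response_a, response_b],
        "preference": None,  # 0 or 1, filled in by the annotator
    }
```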
Step 9: Automate the pipeline
Source content changes over time. Automate the crawl-and-preprocess pipeline so your training data stays fresh:
```shell
#!/bin/bash
DATE=$(date +%Y%m%d)
DOMAIN="docs.example.com"

# Crawl
crawler crawl "https://$DOMAIN" --extract-content --max-pages 10000

# Export
crawler export "${DOMAIN//./-}.crawl" --format json --output "raw-$DATE.json"

# Preprocess (your Python script from steps 5-8)
python preprocess_rlhf.py "raw-$DATE.json" --output "rlhf-prompts-$DATE.json"
```

Schedule this with cron or run it as a CI/CD step:

```shell
# Weekly refresh every Monday at 3 AM
0 3 * * 1 /path/to/crawl-and-preprocess.sh
```

Versioning each output by date lets you trace which training data produced which reward model, a critical requirement for debugging alignment regressions.
Best practices
- Crawl multiple domains. A single source creates blind spots. Blend content from several authoritative sites in your target domain for broader coverage.
- Preserve metadata. Keep the source URL and page title alongside each chunk. Annotators can reference the original page when judgments are ambiguous.
- Filter before you annotate. Every low-quality prompt wastes annotator time. Annotator disagreement is hard enough without adding noisy prompts to the mix.
- Match chunk size to prompt complexity. Short chunks produce simple prompts. For nuanced preference signals, use chunks of 200-500 words that contain enough depth.
- Refresh on a schedule. Stale source content leads to stale model outputs. Regular crawls keep your pipeline grounded in current information.
- Track lineage. Record which crawl file, export date, and preprocessing version produced each prompt set. When your reward model drifts, you need to trace the cause.
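The lineage practice above can be made concrete with a small stamp saved alongside each prompt set. A sketch, assuming a `lineage_stamp` helper of your own (the field names are illustrative):

```python
import hashlib
from datetime import date

def lineage_stamp(crawl_file, preprocess_version):
    # Fingerprint the crawl file so a prompt set can be
    # traced back to the exact data that produced it
    with open(crawl_file, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "crawl_file": crawl_file,
        "crawl_sha256": digest,
        "export_date": date.today().isoformat(),
        "preprocess_version": preprocess_version,
    }
```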