March 14, 2026

How to Preprocess Web Content for RLHF Training Pairs

A step-by-step guide to crawling web content, cleaning it, and structuring it into preference pairs for RLHF reward model training.

Mehmet Kose
6 min read

Reinforcement Learning from Human Feedback (RLHF) depends on preference pairs - two model responses to the same prompt, labeled by a human annotator as better or worse. But the quality of those responses starts upstream, with the source content your base model trained on. Noisy, poorly structured training data produces mediocre model outputs, which makes preference annotation harder and less informative.

This guide walks you through crawling web content, cleaning it for model training, and structuring it into a format ready for RLHF preference pair generation.

Step 1: Install crawler.sh CLI

Install the CLI with a single command:

curl -fsSL https://install.crawler.sh | sh

Restart your terminal or run source ~/.bashrc (or ~/.zshrc) to pick up the new PATH entry. Verify the installation:

crawler --version

Step 2: Crawl with content extraction

Run a crawl with the --extract-content flag to pull clean Markdown from every page:

crawler crawl https://docs.example.com --extract-content --max-pages 10000

The crawler processes each page through a readability algorithm that strips navigation, sidebars, footers, and boilerplate. What remains is the main body text converted to Markdown, along with metadata like word count, byline, and excerpt.

For RLHF source data, target sites with high-quality writing in the domain your model will serve:

crawler crawl https://medical-guidelines.example.com --extract-content --max-pages 5000
crawler crawl https://legal-reference.example.com --extract-content --max-pages 5000

Each crawl produces a .crawl file in NDJSON format - one JSON object per line, one line per page.
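The .crawl file can also be streamed directly. A minimal sketch for iterating over an NDJSON file one record at a time (the field names inside each object are whatever the crawler emits; nothing here assumes a specific schema):

```python
import json

def iter_pages(path):
    # NDJSON: one JSON object per line, one line per page,
    # so the file can be streamed without loading it all into memory
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```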

Step 3: Inspect what you collected

Before processing, review what the crawl captured:

crawler info docs-example-com.crawl

This shows total pages crawled, status code distribution, and response time statistics. Pay attention to:

  • 200 OK pages - These contain usable content. Everything else is noise.
  • Word count distribution - Pages with fewer than 100 words are likely stubs or navigation shells.
  • Redirect chains - Redirected pages rarely carry useful content for training.
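If you prefer to run the same sanity checks in code, here is a small sketch over the exported pages (assuming each record carries `status_code` and `word_count` fields, as the filtering step below does):

```python
from collections import Counter

def crawl_summary(pages):
    # Status code distribution plus a count of likely stub pages
    status = Counter(p.get("status_code") for p in pages)
    stubs = sum(1 for p in pages if p.get("word_count", 0) < 100)
    return {"status_codes": dict(status), "stub_pages": stubs}
```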

Step 4: Export to JSON

Export the crawl data to a structured JSON file:

crawler export docs-example-com.crawl --format json --output raw-dataset.json

The JSON output includes each page’s URL, title, meta description, status code, response time, and extracted Markdown content. This structured format is what your preprocessing scripts will consume.
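For reference, one record of the shape the preprocessing scripts below expect looks roughly like this (the field names mirror this guide's own code, not an official schema; the CLI's actual output may include more fields):

```python
# A single exported page record, as the scripts in steps 5-8 consume it.
# Field names here are assumptions based on the code below.
example_page = {
    "url": "https://docs.example.com/getting-started",
    "title": "Getting Started",
    "description": "Install and configure the product.",
    "status_code": 200,
    "word_count": 412,
    "markdown": "## Getting Started\n\nInstall the package with...",
}
```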

Step 5: Filter and clean the dataset

Not every crawled page belongs in your training data. Write a preprocessing script to filter and clean the exported JSON. Here are the key filters:

Remove low-value pages

import json

with open("raw-dataset.json") as f:
    pages = json.load(f)

filtered = [
    page for page in pages
    if page.get("status_code") == 200
    and page.get("word_count", 0) >= 150
    and page.get("markdown")  # has extracted content
]

print(f"Kept {len(filtered)} of {len(pages)} pages")

Pages with fewer than 150 words rarely contain enough substance to produce meaningful training signal. Error pages and redirects add nothing but noise.

Deduplicate content

Many sites serve identical content at multiple URLs. Deduplicate by comparing the extracted Markdown:

seen_content = set()
unique_pages = []

for page in filtered:
    # hash() is fine for in-process dedup within a single run
    content_hash = hash(page["markdown"].strip())
    if content_hash not in seen_content:
        seen_content.add(content_hash)
        unique_pages.append(page)

print(f"Removed {len(filtered) - len(unique_pages)} duplicates")

Normalize formatting

Strip inconsistent whitespace, fix broken Unicode, and standardize heading levels so the model sees consistent formatting:

import re
import unicodedata

def normalize(text):
    # Fold mis-composed Unicode into canonical composed form (NFC)
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of three or more newlines into a single blank line
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Strip trailing whitespace per line
    text = '\n'.join(line.rstrip() for line in text.split('\n'))
    return text.strip()

for page in unique_pages:
    page["markdown"] = normalize(page["markdown"])

Step 6: Segment into training chunks

RLHF works best when prompts target specific topics. Long documents should be split into focused chunks that each cover a single concept or section:

def chunk_by_headings(markdown, min_words=100):
    # Split on H2 headings so each chunk covers one section
    sections = re.split(r'\n(?=## )', markdown)
    chunks = []
    for section in sections:
        words = len(section.split())
        if words >= min_words:
            chunks.append(section.strip())
    return chunks

all_chunks = []
for page in unique_pages:
    for chunk in chunk_by_headings(page["markdown"]):
        all_chunks.append({
            "source_url": page["url"],
            "content": chunk,
            "word_count": len(chunk.split())
        })

print(f"Created {len(all_chunks)} training chunks")

Splitting on H2 headings preserves the topical coherence of each section. Sections shorter than 100 words are discarded - they lack enough context to generate useful prompts.
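To see the splitter in action on a small sample (the function is restated here so the snippet runs standalone):

```python
import re

# chunk_by_headings from above, restated for a self-contained example
def chunk_by_headings(markdown, min_words=100):
    sections = re.split(r'\n(?=## )', markdown)
    return [s.strip() for s in sections if len(s.split()) >= min_words]

sample = "## Setup\n" + "word " * 120 + "\n## Stub\nToo short."
chunks = chunk_by_headings(sample)
# Only the "Setup" section survives the min-word filter
```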

Step 7: Generate prompts for preference pairs

Each chunk becomes the basis for a prompt. The goal is to create questions or instructions that your model can answer, producing two candidate responses for human annotators to compare:

def create_prompt(chunk):
    # Use the section heading as topic context for the prompt
    lines = chunk["content"].split('\n')
    heading = lines[0].lstrip('#').strip() if lines[0].startswith('#') else ""
    return {
        "prompt": f"Explain the following topic in detail: {heading}",
        "reference_content": chunk["content"],
        "source_url": chunk["source_url"]
    }

prompts = [create_prompt(c) for c in all_chunks if c["word_count"] >= 150]

The reference content is not shown to annotators - it serves as ground truth for evaluating whether model responses are factually grounded. Prompts with richer reference content produce more informative preference comparisons.

Step 8: Structure the output for annotation

Save the final dataset in a format your annotation pipeline can ingest:

import json

output = {
    "metadata": {
        "source_domains": ["docs.example.com"],
        "total_prompts": len(prompts),
        "created": "2026-03-14",
        "min_word_count": 150
    },
    "prompts": prompts
}

with open("rlhf-prompts.json", "w") as f:
    json.dump(output, f, indent=2)

From here, feed each prompt to your model to generate two candidate responses, then send the prompt-response pairs to annotators for preference labeling.
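A sketch of that generation step, with `generate_response` standing in for whatever sampling API your model exposes (it is a placeholder, not a real library call):

```python
def build_comparison(prompt_record, generate_response):
    # Sample two candidates at nonzero temperature so they differ
    # enough for an annotator to state a preference.
    response_a = generate_response(prompt_record["prompt"], temperature=0.7)
    response_b = generate_response(prompt_record["prompt"], temperature=0.7)
    return {
        "prompt": prompt_record["prompt"],
        "response_a": response_a,
        "response_b": response_b,
        "source_url": prompt_record["source_url"],
    }
```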

Step 9: Automate the pipeline

Source content changes over time. Automate the crawl-and-preprocess pipeline so your training data stays fresh:

#!/bin/bash
set -euo pipefail

DATE=$(date +%Y%m%d)
DOMAIN="docs.example.com"
# Crawl
crawler crawl "https://$DOMAIN" --extract-content --max-pages 10000
# Export
crawler export "${DOMAIN//./-}.crawl" --format json --output "raw-$DATE.json"
# Preprocess (your Python script from steps 5-8)
python preprocess_rlhf.py "raw-$DATE.json" --output "rlhf-prompts-$DATE.json"

Schedule this with cron or run it as a CI/CD step:

# Weekly refresh every Monday at 3 AM
0 3 * * 1 /path/to/crawl-and-preprocess.sh

Versioning each output by date lets you trace which training data produced which reward model, a critical requirement for debugging alignment regressions.

Best practices

  • Crawl multiple domains. A single source creates blind spots. Blend content from several authoritative sites in your target domain for broader coverage.
  • Preserve metadata. Keep the source URL and page title alongside each chunk. Annotators can reference the original page when judgments are ambiguous.
  • Filter before you annotate. Every low-quality prompt wastes annotator time. Annotator disagreement is hard enough without adding noisy prompts to the mix.
  • Match chunk size to prompt complexity. Short chunks produce simple prompts. For nuanced preference signals, use chunks of 200-500 words that contain enough depth.
  • Refresh on a schedule. Stale source content leads to stale model outputs. Regular crawls keep your pipeline grounded in current information.
  • Track lineage. Record which crawl file, export date, and preprocessing version produced each prompt set. When your reward model drifts, you need to trace the cause.
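The lineage point above can be as simple as writing a small provenance record next to each prompt set. A minimal sketch:

```python
from datetime import date

def lineage_record(crawl_file, export_file, preprocess_version):
    # Enough provenance to trace a prompt set back to its crawl
    return {
        "crawl_file": crawl_file,
        "export_file": export_file,
        "preprocess_version": preprocess_version,
        "created": date.today().isoformat(),
    }
```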