Chunking is the process of splitting long documents into smaller pieces that fit within a language model’s context window. A typical Wikipedia article or technical documentation page might be 5,000 tokens. A language model with a 4,000-token limit cannot process the entire document at once. Chunking breaks the text into manageable segments while preserving as much meaning and context as possible.
Chunking is fundamental to RAG pipelines, fine-tuning datasets, and any workflow where documents exceed the model’s input capacity. Poor chunking creates fragments that lose context. Good chunking produces self-contained pieces that a model can understand and reason about independently.
Why chunking matters
Language models have finite context windows, measured in tokens. Limits range from 4,096 tokens (older GPT models) to 200,000 tokens (Claude 3) and 1 million or more (Gemini 1.5 Pro). Even with large windows, processing entire documents is expensive and can dilute attention. Chunking enables:
- RAG retrieval - Finding the most relevant passage from a document rather than feeding the entire document into the model
- Efficient inference - Processing only the relevant chunks rather than the full text
- Parallel processing - Multiple chunks can be embedded or summarized simultaneously
- Scalability - Indexing millions of documents without exceeding memory or context limits
Chunking strategies
Fixed-size chunking
The simplest approach: split text every N tokens or characters.
```python
def fixed_chunk(text, chunk_size=500):
    # Whitespace-separated words stand in for tokens here.
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks
```

Pros: Simple, fast, predictable chunk sizes.

Cons: Cuts sentences and paragraphs in half, destroying context.
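The splitter above counts whitespace-separated words. For comparison, here is a minimal character-based sketch that makes the main drawback visible. `fixed_chunk_chars` is a hypothetical name chosen for illustration, not part of any library.

```python
def fixed_chunk_chars(text: str, chunk_size: int = 500) -> list[str]:
    """Cut text into pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


text = "The quick brown fox jumps over the lazy dog."
for chunk in fixed_chunk_chars(text, chunk_size=12):
    print(repr(chunk))
# 'The quick br'
# 'own fox jump'
# 's over the l'
# 'azy dog.'
```

Every boundary in this small example falls in the middle of a word, which is the kind of context loss the later strategies try to avoid.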
Recursive splitting
Split on the largest natural boundary first, then recursively split on smaller boundaries if a piece is still too large.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = splitter.split_text(long_document)
```

Priority of separators:

- Paragraph breaks (`\n\n`)
- Line breaks (`\n`)
- Sentence ends (`. `)
- Word boundaries (` `)
- Character boundaries (`""`)
This preserves sentence and paragraph boundaries as much as possible.
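To show the idea without a library, here is a rough pure-Python sketch of recursive splitting. It is not LangChain's implementation: `recursive_split` and `_split` are hypothetical helpers, sizes are counted in characters, and overlap is left out for brevity.

```python
def _split(text, chunk_size, separators):
    """Return pieces of at most chunk_size characters, trying coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at character boundaries.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    # Re-attach the separator to every piece but the last so no text is lost.
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks, current = [], ""
    for piece in pieces:
        if len(current) + len(piece) <= chunk_size:
            current += piece          # merge small neighbours into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text, chunk_size=500, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text recursively, then trim whitespace and drop empty chunks."""
    raw = _split(text, chunk_size, tuple(separators))
    return [c.strip() for c in raw if c.strip()]
```

With a 30-character limit, the text `"Para one is short.\n\nPara two is a bit longer. It has two sentences.\n\nTiny."` is first divided at paragraph breaks, and only the oversized middle paragraph is divided again at the sentence boundary.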
Semantic chunking
Group adjacent sentences or paragraphs that are semantically similar, and start a new chunk where the topic shifts.

```python
# Embed each sentence, then cluster nearby sentences together.
# `doc`, `embed`, and `cosine_similarity` are assumed to be provided elsewhere.
sentences = doc.sentences
embeddings = [embed(s.text) for s in sentences]

# Group sentences where cosine similarity > threshold
clusters = []
current_cluster = [sentences[0]]

for i in range(1, len(sentences)):
    if cosine_similarity(embeddings[i-1], embeddings[i]) > 0.8:
        current_cluster.append(sentences[i])
    else:
        clusters.append(current_cluster)
        current_cluster = [sentences[i]]

# Keep the final cluster, which the loop itself never appends
clusters.append(current_cluster)
```

Pros: Each chunk covers a coherent topic.

Cons: More computationally expensive; requires an embedding model.
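For experimenting without an embedding model, the following self-contained sketch swaps in a toy bag-of-words vector for `embed`. That substitution is an assumption made purely for illustration: a real pipeline would call an actual embedding model, and the 0.3 threshold here is tuned to the toy vectors rather than taken from the 0.8 used above.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.3) -> list[str]:
    """Merge each sentence into the current chunk while it resembles its predecessor."""
    if not sentences:
        return []
    embeddings = [embed(s) for s in sentences]
    clusters = [[sentences[0]]]
    for prev, cur, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        if cosine_similarity(prev, cur) > threshold:
            clusters[-1].append(sentence)
        else:
            clusters.append([sentence])   # topic shift: start a new chunk
    return [" ".join(cluster) for cluster in clusters]
```

Given two sentences about cats followed by two about Python containers, this returns two chunks, one per topic.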
Heading-based chunking
Split at Markdown or HTML headings so each chunk is a self-contained section.
```markdown
# Main Title
[Section 1 content...]

## Subheading A
[Section A content...]       <-- Chunk 1

## Subheading B
[Section B content...]       <-- Chunk 2

### Sub-subheading
[Subsection content...]      <-- Chunk 3
```

Pros: Natural topic boundaries; headings provide context.

Cons: Sections may vary wildly in length; some might be too short or too long.
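A heading-based splitter can be sketched in a few lines. The version below is a minimal illustration, assuming ATX-style headings (`#` to `######`). `chunk_by_headings` is a hypothetical helper that also records the trail of parent headings for each chunk and ignores `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # triple backtick that opens or closes a code block


def chunk_by_headings(markdown: str) -> list[dict]:
    """Return one chunk per section, each with its heading trail."""
    chunks, path, body = [], [], []
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headings": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()                       # close the previous section
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()                # leave sibling or deeper sections
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Storing the heading trail alongside the text lets a short section such as a sub-subheading still carry the titles of the sections it belongs to.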
Chunk overlap
Overlap between adjacent chunks preserves context at the boundaries:
```
Chunk 1: "The quick brown fox jumps over the lazy dog. The dog..."
Chunk 2: "...lazy dog. The dog wakes up and runs away. Then..."
```

A typical overlap is 10-20% of the chunk size. This makes it more likely that a concept split across a chunk boundary appears in full in at least one chunk.
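Overlap amounts to a sliding window whose step is the chunk size minus the overlap. The sketch below assumes the input is already a list of tokens (words stand in for tokens in the example). `chunk_with_overlap` is a hypothetical name.

```python
def chunk_with_overlap(tokens: list, chunk_size: int = 500, overlap: int = 50) -> list[list]:
    """Slide a window of chunk_size over tokens, advancing by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


words = "The quick brown fox jumps over the lazy dog".split()
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
# The quick brown fox
# fox jumps over the
# the lazy dog
```

Each chunk begins with the last word of the previous one, so text near a boundary is seen twice.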
Chunk size tradeoffs
| Small chunks (100-300 tokens) | Large chunks (1,000-2,000 tokens) |
|---|---|
| More precise retrieval | Less granular but more context per chunk |
| Lower embedding cost per chunk, but more chunks to store | Fewer total chunks to index |
| Risk of missing cross-sentence context | Risk of including irrelevant information |
| Better for keyword-heavy content | Better for narrative or explanatory text |
The optimal size depends on the content type and the model’s context window. Technical documentation with dense facts works well in smaller chunks. Long-form articles and reports may need larger chunks to preserve narrative flow.
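To get a feel for the indexing side of the tradeoff, the number of chunks a document produces can be estimated from its length, the chunk size, and the overlap. This is a back-of-the-envelope sketch that assumes a simple sliding window. `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(doc_tokens: int, chunk_size: int, overlap: int = 0) -> int:
    """Estimate how many sliding-window chunks a document of doc_tokens yields."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if doc_tokens <= 0:
        return 0
    return max(1, math.ceil((doc_tokens - overlap) / (chunk_size - overlap)))


# A 5,000-token document under three configurations
print(estimate_chunk_count(5000, 200))        # 25
print(estimate_chunk_count(5000, 1500))       # 4
print(estimate_chunk_count(5000, 500, 50))    # 11
```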
Chunking for different formats
- Plain text - Use sentence or paragraph boundaries
- Markdown - Split at heading levels; preserve code blocks as single chunks
- HTML - Split at `<h1>`, `<h2>`, `<p>` tags; preserve `<table>` integrity
- Code - Split at function or class boundaries; preserve import blocks
- Conversations - Split at speaker turns or topic shifts
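As one illustration of a format-aware rule, the sketch below splits Markdown into blocks at blank lines while keeping each fenced code block as a single unit, even when the code itself contains blank lines. `split_markdown_blocks` is a hypothetical helper and handles only triple-backtick fences.

```python
FENCE = "`" * 3  # triple backtick that opens or closes a code block


def split_markdown_blocks(markdown: str) -> list[str]:
    """Split on blank lines, except inside fenced code blocks."""
    blocks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if line.strip() == "" and not in_fence:
            if current:                       # a blank line closes the block
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks
```

The resulting blocks can then be merged up to a target size, so that a size limit never forces a cut inside a code example.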
How crawler.sh enables better chunking
crawler.sh produces Markdown output that chunks naturally and effectively:
- Heading hierarchy preserved - `#`, `##`, `###` map to document sections, making heading-based chunking trivial
- Clean paragraphs - HTML-to-Markdown conversion produces well-formed paragraphs separated by blank lines
- Code block integrity - Triple-backtick blocks are kept intact, preventing splitting inside code examples
- Table preservation - Markdown tables are maintained as single units
- Metadata for context - Each chunk can carry the page title and URL as metadata, providing context even when the chunk is small
- Consistent structure - Every document follows the same Markdown conventions, so one chunking strategy works across the entire corpus
For RAG pipelines, this means:
```bash
# 1. Crawl and export as Markdown
crawler crawl https://docs.example.com --output docs.zip

# 2. Chunk by Markdown headings with 200-token overlap
# Each chunk retains source URL and page title as metadata

# 3. Embed and index chunks in vector database
```

The result is a document store where retrieved chunks are self-contained, well-structured, and traceable to their source.
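Step 2 could look roughly like the sketch below. It assumes the page's Markdown, URL, and title have already been read from the export. How they are stored in the archive is not covered here, and `Chunk` and `chunk_page` are hypothetical names. The 200-token overlap and the embedding step are left out.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    url: str      # source page, kept so results stay traceable
    title: str    # page title, gives context to short chunks


def chunk_page(markdown: str, url: str, title: str) -> list[Chunk]:
    """Split one page at Markdown headings and attach page metadata to each chunk."""
    # Zero-width split: cut just before every line that starts with a heading.
    sections = re.split(r"^(?=#{1,6}\s)", markdown, flags=re.MULTILINE)
    return [Chunk(s.strip(), url, title) for s in sections if s.strip()]


page = "# Install\nRun the installer.\n\n## Configure\nEdit the config file.\n"
for chunk in chunk_page(page, "https://docs.example.com/setup", "Setup"):
    print(chunk.title, "|", chunk.url, "|", chunk.text.splitlines()[0])
```

Each record can then be passed to whichever embedding model and vector database the pipeline uses, with `url` and `title` stored as metadata next to the vector.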