What is Chunking in LLM Document Processing

Chunking is the process of splitting long documents into smaller pieces that fit within a language model’s context window. A typical Wikipedia article or technical documentation page might be 5,000 tokens. A language model with a 4,000-token limit cannot process the entire document at once. Chunking breaks the text into manageable segments while preserving as much meaning and context as possible.

Chunking is fundamental to RAG pipelines, fine-tuning datasets, and any workflow where documents exceed the model’s input capacity. Poor chunking creates fragments that lose context. Good chunking produces self-contained pieces that a model can understand and reason about independently.

Why chunking matters

Language models have finite context windows, measured in tokens. Limits range from 4,096 tokens (older GPT models) to 200,000 tokens (Claude 3) and 1 million or more (Gemini 1.5 Pro). Even with large windows, processing entire documents is expensive and can dilute attention. Chunking enables:

RAG retrieval - Finding the most relevant passage from a document rather than feeding the entire document into the model
Efficient inference - Processing only the relevant chunks rather than the full text
Parallel processing - Multiple chunks can be embedded or summarized simultaneously
Scalability - Indexing millions of documents without exceeding memory or context limits

Chunking strategies

Fixed-size chunking

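The simplest approach: split text every N tokens or characters. The example below counts whitespace-separated words, a rough proxy for tokens.

def fixed_chunk(text, chunk_size=500):
    """Split text into chunks of at most chunk_size words."""
    # Whitespace-separated words only approximate model tokens;
    # swap in a real tokenizer when exact token counts matter.
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks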

Pros: Simple, fast, predictable chunk sizes. Cons: Can cut sentences and paragraphs mid-thought, losing context at the boundaries.

Recursive splitting

Split on the largest natural boundary first, then recursively split on smaller boundaries if a piece is still too large.

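# In newer LangChain releases this class ships in the separate
# langchain-text-splitters package (import from langchain_text_splitters).
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # measured in characters by default, not tokens
    chunk_overlap=50,    # characters shared between adjacent chunks
    separators=["\n\n", "\n", ". ", " ", ""]
)

# long_document is a placeholder for the text to split (a str)
chunks = splitter.split_text(long_document)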

Priority of separators:

Paragraph breaks (\n\n)
Line breaks (\n)
Sentence ends (. )
Word boundaries ( )
Character boundaries ("")

This preserves sentence and paragraph boundaries as much as possible.
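To make the separator priority concrete, here is a minimal dependency-free sketch of the same idea. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, sizes are measured in characters, and overlap is left out to keep the recursion easy to follow.

```python
def recursive_split(text, chunk_size=500, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    preferring the earliest (largest) separator that works."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part is still too large: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks
```

Small parts are merged back together until the next one would exceed the limit, so paragraphs stay whole whenever they fit.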

Semantic chunking

Group adjacent sentences or paragraphs that are semantically similar, and start a new chunk where the topic shifts.

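# Embed each sentence, then group adjacent sentences that are similar.
# `doc`, `embed`, and `cosine_similarity` are placeholders for a sentence
# segmenter, an embedding model, and a vector similarity function.
sentences = doc.sentences
embeddings = [embed(s.text) for s in sentences]

# Extend the current cluster while consecutive sentences have cosine similarity > threshold
clusters = []
current_cluster = [sentences[0]]

for i in range(1, len(sentences)):
    if cosine_similarity(embeddings[i-1], embeddings[i]) > 0.8:
        current_cluster.append(sentences[i])
    else:
        clusters.append(current_cluster)
        current_cluster = [sentences[i]]

# Keep the final cluster, which the loop never appends
clusters.append(current_cluster)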

Pros: Each chunk covers a coherent topic. Cons: More computationally expensive; requires an embedding model.
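For experimenting with the grouping logic without an embedding model, the sketch below substitutes a toy bag-of-words vector. `toy_embed` and `semantic_chunks` are hypothetical names, and the 0.3 threshold is an arbitrary value suited to this toy embedding, not a recommendation. With a real model you would pass its embedding function and tune the threshold.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, threshold=0.3, embed=toy_embed):
    """Merge consecutive sentences while adjacent similarity stays above threshold."""
    if not sentences:
        return []
    embeddings = [embed(s) for s in sentences]
    clusters, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) > threshold:
            current.append(sentences[i])
        else:
            clusters.append(current)
            current = [sentences[i]]
    clusters.append(current)  # do not drop the final group
    return [" ".join(c) for c in clusters]
```

Because only neighbouring sentences are compared, the cost grows linearly with document length, but a single off-topic sentence will split a section in two.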

Heading-based chunking

Split at Markdown or HTML headings so each chunk is a self-contained section.

# Main Title
[Section 1 content...]  <-- Chunk 1

## Subheading A
[Section A content...]  <-- Chunk 2

## Subheading B
[Section B content...]  <-- Chunk 3

### Sub-subheading
[Subsection content...]  <-- Chunk 4

Pros: Natural topic boundaries; headings provide context. Cons: Sections may vary wildly in length; some might be too short or too long.
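A heading-based splitter for Markdown can be sketched in a few lines. This version is an illustration under simple assumptions: it recognises only ATX headings (`#` to `######`), treats triple-backtick fences as the only code-block syntax, and skips sections with no body text. The name `split_by_headings` is hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # triple backtick that opens or closes a code block

def split_by_headings(markdown):
    """Return (heading_path, body) tuples, one per non-empty section."""
    sections = []
    path = []   # stack of (level, title) for the current heading trail
    body = []   # lines collected for the current section
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append((" > ".join(title for _, title in path), text))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines starting with "#" inside a code fence are not headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections
```

Keeping the full heading trail (for example `Main Title > Subheading B`) with each section gives short chunks some of the context their parent headings provide.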

Chunk overlap

Overlap between adjacent chunks preserves context at the boundaries:

Chunk 1:  "The quick brown fox jumps over the lazy dog. The dog..."
Chunk 2:  "...lazy dog. The dog wakes up and runs away. Then..."

A typical overlap is 10-20% of the chunk size. This makes it more likely that a concept straddling a chunk boundary appears in full in at least one chunk, at the cost of storing and embedding some text twice.
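Overlap amounts to a sliding window whose step is smaller than its width. The sketch below assumes the input is already a list of tokens (words or tokenizer IDs both work); `chunk_with_overlap` is a hypothetical helper name.

```python
def chunk_with_overlap(tokens, chunk_size=500, overlap=50):
    """Slide a window of chunk_size tokens, advancing by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks

# Each chunk repeats the last token of the previous one (overlap=1).
print(chunk_with_overlap(list(range(10)), chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The early `break` avoids emitting a trailing chunk that contains only text already covered by the previous window.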

Chunk size tradeoffs

Small chunks (100-300 tokens)	Large chunks (1,000-2,000 tokens)
More precise retrieval	Less granular but more context per chunk
Cheaper to embed per chunk, but more chunks to store	Fewer total chunks to index
Risk of missing cross-sentence context	Risk of including irrelevant information
Better for keyword-heavy content	Better for narrative or explanatory text

The optimal size depends on the content type and the model’s context window. Technical documentation with dense facts works well in smaller chunks. Long-form articles and reports may need larger chunks to preserve narrative flow.
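To put numbers on the tradeoff, the chunk count for fixed-size chunks with overlap can be estimated with simple arithmetic. This is a back-of-the-envelope sketch that assumes every chunk except the last is exactly `chunk_size` tokens; `estimate_chunk_count` is a hypothetical helper.

```python
import math

def estimate_chunk_count(total_tokens, chunk_size, overlap=0):
    """Number of fixed-size chunks needed to cover total_tokens with the given overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    # After the first chunk, each further chunk adds (chunk_size - overlap) new tokens.
    return math.ceil((total_tokens - overlap) / (chunk_size - overlap))

# A 5,000-token page at two different chunk sizes, both with 10% overlap.
print(estimate_chunk_count(5000, 500, 50))     # 11
print(estimate_chunk_count(5000, 1500, 150))   # 4
```

Multiplying the count by the corpus size gives a quick estimate of how many vectors an index will hold at a given chunk size.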

Chunking for different formats

Plain text - Use sentence or paragraph boundaries
Markdown - Split at heading levels; preserve code blocks as single chunks
HTML - Split at <h1>, <h2>, <p> tags; preserve <table> integrity
Code - Split at function or class boundaries; preserve import blocks
Conversations - Split at speaker turns or topic shifts
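As one illustration of format-aware splitting, Python source can be split at top-level function and class boundaries with the standard `ast` module. This sketch assumes Python 3.8 or later (for `end_lineno`) and ignores other top-level statements. `split_python_source` is a hypothetical name.

```python
import ast

def split_python_source(source):
    """Return (import_block, definitions): all top-level imports as one string,
    plus one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    imports, definitions = [], []
    for node in tree.body:
        # Start at the first decorator, if any, so it stays with its definition.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        segment = "\n".join(lines[start - 1:node.end_lineno])
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            imports.append(segment)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            definitions.append(segment)
    return "\n".join(imports), definitions
```

Keeping the import block separate makes it possible to prepend it to each definition chunk, so a retrieved function still shows where its names come from.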

How crawler.sh enables better chunking

crawler.sh produces Markdown output that chunks naturally and effectively:

Heading hierarchy preserved - #, ##, ### map to document sections, making heading-based chunking trivial
Clean paragraphs - HTML-to-Markdown conversion produces well-formed paragraphs separated by blank lines
Code block integrity - Triple-backtick blocks are kept intact, preventing splitting inside code examples
Table preservation - Markdown tables are maintained as single units
Metadata for context - Each chunk can carry the page title and URL as metadata, providing context even when the chunk is small
Consistent structure - Every document follows the same Markdown conventions, so one chunking strategy works across the entire corpus

For RAG pipelines, this means:

# 1. Crawl and export as Markdown
crawler crawl https://docs.example.com --output docs.zip

# 2. Chunk by Markdown headings with 200-token overlap
# Each chunk retains source URL and page title as metadata

# 3. Embed and index chunks in vector database

The result is a document store where retrieved chunks are self-contained, well-structured, and traceable to their source.
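Step 2 of the pipeline above, attaching the source URL and page title to each chunk, might look like the following sketch. The `Chunk` fields, the function names, and the choice to prepend the title and heading to the embedded text are illustrative assumptions, not part of crawler.sh.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source_url: str
    page_title: str
    heading: str

def attach_metadata(sections, source_url, page_title):
    """Wrap (heading, body) pairs from a heading-based splitter as Chunk records."""
    return [
        Chunk(text=body, source_url=source_url, page_title=page_title, heading=heading)
        for heading, body in sections
        if body.strip()  # skip sections with no content
    ]

def embedding_text(chunk):
    """Text to send to the embedding model: title and heading give short chunks context."""
    return f"{chunk.page_title} > {chunk.heading}\n\n{chunk.text}"

chunks = attach_metadata(
    [("Install", "Run the installer."), ("Empty section", "  ")],
    source_url="https://docs.example.com/install",
    page_title="Getting Started",
)
print(embedding_text(chunks[0]))
```

Storing the URL alongside each vector is what lets a retrieved chunk be cited back to its source page.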