What is a Context Window in LLMs

A context window is the maximum number of tokens a language model can process in a single forward pass. It includes everything: the system prompt, the user’s question, any retrieved context, conversation history, and the model’s own generated output. The model never sees anything beyond the limit, so when the total would exceed it, the application must truncate or summarize content to fit.

Think of the context window as the model’s working memory. A human reading a book can refer back to earlier chapters, but only holds so much in mind at once. Similarly, a model can only “pay attention” to tokens within its window. Everything outside is invisible.

How tokens work

A token is the basic unit of text that a model processes. It is not necessarily a word. Depending on the tokenizer:

“hello” = 1 token
“tokenizer” = 1-2 tokens (often split into “token” + “izer”)
“antidisestablishmentarianism” = 5-6 tokens
A space or punctuation mark may be its own token

Tokenizers like tiktoken (OpenAI) and SentencePiece (Google) use subword algorithms that keep common words whole and split rare words into smaller pieces. This means token counts cannot be predicted exactly from word counts. A rough rule of thumb for English is that 100 tokens correspond to about 75 words.
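The rule of thumb above can be turned into a quick estimator. This is a minimal sketch using only the Python standard library; `estimate_tokens` is a hypothetical helper name, and the 100-tokens-per-75-words ratio is only the approximation quoted above, so exact counts still require running the model's actual tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count of English text from its word count.

    Uses the rule of thumb 100 tokens ~ 75 words, rounded up with
    integer arithmetic to avoid floating-point surprises.
    """
    words = len(text.split())
    return (words * 100 + 74) // 75


if __name__ == "__main__":
    sample = "A token is the basic unit of text that a model processes."
    print(estimate_tokens(sample))  # 12 words -> estimate of 16 tokens
```

The estimate is useful for early budgeting; it should not be trusted for text with many rare words, code, or non-English content, where subword splitting behaves differently.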

Context window sizes by model

Model	Context Window	Notes
GPT-3.5	4,096 tokens	Standard; fits ~3,000 words
GPT-4	8,192 / 32,768 tokens	Two variants with different limits
GPT-4 Turbo	128,000 tokens	Fits ~300 pages of text
Claude 3 Haiku	200,000 tokens	Fast, large context
Claude 3 Opus	200,000 tokens	Highest quality, large context
Gemini 1.5 Pro	1,000,000+ tokens	Experimental million-token window
Llama 2	4,096 tokens	Base model limit
Llama 3	8,192 / 128,000 tokens	128K window introduced with Llama 3.1
Mistral	32,000 tokens	Open model with large native window

Input vs output tokens

The context window is a shared budget between input and output:

Total tokens = system_prompt + user_message + context + conversation_history + model_output

If you send a 3,000-token prompt to a 4,000-token model, only 1,000 tokens remain for the response. If the response needs more than that, generation is cut off, often mid-sentence. Applications must budget tokens carefully:

Reserve tokens for the answer
Trim conversation history to the most recent exchanges
Limit retrieved context to top-k most relevant chunks
Use summaries instead of full documents when possible
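The budgeting steps above can be sketched as follows. This is an illustrative sketch, not any specific library's API: `Message` and `fit_history` are hypothetical names, and it assumes each message's token count has already been measured with the model's tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str
    tokens: int  # assumed to be precomputed with the model's tokenizer


def fit_history(history: list[Message], window: int, fixed_tokens: int,
                reserved_for_answer: int) -> list[Message]:
    """Keep the most recent messages that fit in the remaining budget.

    fixed_tokens covers the system prompt, the current user message, and
    any retrieved context; reserved_for_answer is held back for the output.
    """
    budget = window - fixed_tokens - reserved_for_answer
    kept: list[Message] = []
    # Walk backwards from the newest message; stop at the first that does not fit.
    for message in reversed(history):
        if message.tokens > budget:
            break
        kept.append(message)
        budget -= message.tokens
    kept.reverse()  # restore chronological order
    return kept
```

For example, with a 4,000-token window, 2,200 fixed tokens, and 1,000 tokens reserved for the answer, 800 tokens remain for history, so only the newest messages that fit within 800 tokens are kept.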

Why context windows matter for crawled content

When building applications that use crawled web content, the context window is the primary constraint:

A single news article might be 2,000 tokens
A product manual might be 10,000 tokens
A legal brief might be 50,000 tokens

Longer documents will not fit in a smaller context window once the prompt and the answer are accounted for. When a document does not fit, you must either:

Chunk the document - Split into pieces and process individually (see Chunking)
Summarize first - Compress the document into a shorter form that fits
Retrieve relevant passages - Use RAG to find only the sections that answer the user’s question
Use a model with a larger window - Pay more for models with bigger context windows
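The first option, chunking, can be sketched as below. This is a minimal sketch under stated assumptions: `approx_tokens` and `chunk_by_budget` are hypothetical helpers, paragraphs are assumed to be separated by blank lines, and token counts are approximated with the 100-tokens-per-75-words rule of thumb rather than a real tokenizer.

```python
def approx_tokens(text: str) -> int:
    """Approximate tokens from words (100 tokens ~ 75 words, rounded up)."""
    return (len(text.split()) * 100 + 74) // 75


def chunk_by_budget(document: str, max_tokens: int) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_tokens (approximate).

    A single paragraph larger than max_tokens becomes its own oversized
    chunk; a real pipeline would split it further.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for paragraph in document.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        size = approx_tokens(paragraph)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_tokens + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries keeps each chunk readable on its own, which tends to matter more for retrieval than filling every chunk to the exact limit.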

Token efficiency strategies

Different formats consume tokens at different rates:

Format	Tokens per 100 words	Notes
Plain text	~130	Minimal overhead
Markdown	~135	Slight overhead for `#` and `**`
HTML	~250-400	Tag soup wastes massive token budget
JSON	~150-200	Quotes, braces, and keys add overhead
XML	~300-500	Verbose tags consume tokens

This is why converting HTML to Markdown before feeding content to an LLM is standard practice. The same article in HTML might consume 3,000 tokens, while the Markdown version uses 1,500. That difference determines whether the article fits in one pass or requires chunking.
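To see how the format affects whether a document fits, the figures from the table above can be applied to a word count. This is a rough sketch: the ranges are copied from that table, and `estimate_range` and `fits` are hypothetical helper names.

```python
# Approximate tokens per 100 words as (low, high), taken from the table above.
TOKENS_PER_100_WORDS = {
    "plain": (130, 130),
    "markdown": (135, 135),
    "html": (250, 400),
    "json": (150, 200),
    "xml": (300, 500),
}


def estimate_range(words: int, fmt: str) -> tuple[int, int]:
    """Return a (low, high) token estimate for a document of `words` words."""
    low, high = TOKENS_PER_100_WORDS[fmt]
    return words * low // 100, words * high // 100


def fits(words: int, fmt: str, budget: int) -> bool:
    """True if even the high estimate stays within the token budget."""
    return estimate_range(words, fmt)[1] <= budget
```

For a 1,100-word article and a 2,000-token budget, the Markdown estimate (1,485 tokens) fits, while the HTML estimate (2,750 to 4,400 tokens) does not.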

Long context challenges

Even models with large context windows face practical limits:

Attention dilution - The model’s attention mechanism spreads across all tokens in the window. With 100,000 tokens, each token gets a smaller share of “focus” than with 1,000 tokens.
Cost - API pricing is per-token. At the same per-token rate, a 128,000-token input costs 32x more than a 4,000-token input.
Latency - Processing time grows with sequence length (the cost of standard self-attention grows quadratically), so longer inputs take longer to process.
Lost in the middle - Research shows models are better at recalling information at the beginning and end of a long context than in the middle.
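The cost point above is simple arithmetic. A minimal sketch, where the price is a made-up placeholder rather than any provider's actual rate and `input_cost` is a hypothetical helper:

```python
from fractions import Fraction

# Hypothetical price, for illustration only: dollars per 1,000 input tokens.
PRICE_PER_1K_INPUT = Fraction(1, 100)  # $0.01


def input_cost(tokens: int, price_per_1k: Fraction = PRICE_PER_1K_INPUT) -> Fraction:
    """Input cost in dollars at a flat per-token rate."""
    return Fraction(tokens, 1000) * price_per_1k


small = input_cost(4_000)
large = input_cost(128_000)
# The ratio depends only on the token counts, not on the price chosen.
print(float(small), float(large), large / small)
```

Whatever flat rate is used, the ratio between the two inputs stays at 32, which is why trimming input tokens translates directly into savings.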

How crawler.sh optimizes for context windows

crawler.sh is designed to maximize the value of every token in your context window:

HTML-to-Markdown conversion - Reduces token count by 40-60% compared to raw HTML, letting you fit more content per request
Boilerplate removal - Eliminates navigation, ads, and footers that waste tokens without adding information
Clean extraction - Produces well-structured paragraphs and lists that tokenize efficiently
Metadata frontmatter - Stores title, URL, and description outside the main content so they do not compete for the token budget during content processing
Consistent formatting - Standardized Markdown means predictable token counts across documents

For example, when feeding a crawled documentation site into a RAG pipeline:

# Crawl and export as clean Markdown
crawler crawl https://docs.example.com --render --output docs.zip

# The Markdown files use roughly 40-60% fewer tokens than raw HTML
# This means more content fits in each chunk
# More content per chunk can improve retrieval accuracy
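After the export, a pipeline might separate the metadata frontmatter from the body before counting tokens. The sketch below rests on an assumption about the file layout, not on documented crawler.sh behavior: it assumes each Markdown file begins with a block delimited by `---` lines that contains simple `key: value` pairs. `split_frontmatter` is a hypothetical helper.

```python
def split_frontmatter(markdown: str) -> tuple[dict[str, str], str]:
    """Split a leading '---' delimited block of 'key: value' lines from the body.

    Returns (metadata, body). If no complete frontmatter block is found,
    metadata is empty and the body is the input unchanged.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            metadata: dict[str, str] = {}
            for line in lines[1:index]:
                # Split on the first colon only, so URLs in values stay intact.
                key, sep, value = line.partition(":")
                if sep:
                    metadata[key.strip()] = value.strip()
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return metadata, body
    # Opening delimiter without a closing one: treat the whole file as body.
    return {}, markdown
```

Only the returned body would then be chunked and counted against the token budget, while the metadata can be stored alongside each chunk for citation or filtering.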