A context window is the maximum number of tokens a language model can process in a single request. It includes everything: the system prompt, the user’s question, any retrieved context, conversation history, and the model’s own generated output. The model cannot see anything beyond the window limit, so when the total would exceed it, the application must truncate or summarize content to fit.
Think of the context window as the model’s working memory. A human reading a book can refer back to earlier chapters, but only holds so much in mind at once. Similarly, a model can only “pay attention” to tokens within its window. Everything outside is invisible.
How tokens work
A token is the basic unit of text that a model processes. It is not necessarily a word. Depending on the tokenizer:
- “hello” = 1 token
- “tokenizer” = 1-2 tokens
- “antidisestablishmentarianism” = 5-6 tokens
- A space or punctuation mark may be its own token
Tokenizers like tiktoken (OpenAI) and SentencePiece (Google) use subword algorithms to split rare words into smaller pieces. This means token counts cannot be derived exactly from word counts. A rough rule of thumb is that 100 tokens equals approximately 75 words in English.
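The rule of thumb above can be turned into a quick estimator. This is a minimal sketch, assuming the 100-tokens-per-75-words ratio holds for your text; `estimate_tokens` is a hypothetical helper, not part of any tokenizer library.

```python
def estimate_tokens(text: str) -> int:
    """Estimate the token count of English text from its word count.

    Assumption: 100 tokens is roughly 75 words (the rule of thumb in the text).
    """
    word_count = len(text.split())
    # Integer ceiling of word_count * 100 / 75, avoiding float rounding.
    return (word_count * 100 + 74) // 75


if __name__ == "__main__":
    sample = "A token is the basic unit of text that a model processes."
    print(estimate_tokens(sample))
```

When exact counts matter, for example close to the window limit, count with the tokenizer that belongs to the target model instead of estimating.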
Context window sizes by model
| Model | Context Window | Notes |
|---|---|---|
| GPT-3.5 | 4,096 tokens | Standard; fits ~3,000 words |
| GPT-4 | 8,192 / 32,768 tokens | Two variants with different limits |
| GPT-4 Turbo | 128,000 tokens | Fits ~300 pages of text |
| Claude 3 Haiku | 200,000 tokens | Fast, large context |
| Claude 3 Opus | 200,000 tokens | Highest quality, large context |
| Gemini 1.5 Pro | 1,000,000+ tokens | Experimental million-token window |
| Llama 2 | 4,096 tokens | Base model limit |
| Llama 3 | 8,192 / 128,000 tokens | 8K at launch; 128K from Llama 3.1 |
| Mistral | 32,000 tokens | Open model with large native window |
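A table like this is often used as a lookup when deciding where a request can go. The sketch below copies a few rows from the table and returns the models whose window holds a given input plus a reserved output budget. The name `models_that_fit` and the default reserve of 1,000 tokens are assumptions made for illustration.

```python
# Context window sizes (in tokens) copied from a few rows of the table above.
CONTEXT_WINDOWS = {
    "GPT-3.5": 4_096,
    "GPT-4 Turbo": 128_000,
    "Claude 3 Opus": 200_000,
    "Gemini 1.5 Pro": 1_000_000,
}


def models_that_fit(input_tokens: int, output_reserve: int = 1_000) -> list[str]:
    """Return the models whose window holds the input plus the output reserve."""
    needed = input_tokens + output_reserve
    return [name for name, window in CONTEXT_WINDOWS.items() if window >= needed]


if __name__ == "__main__":
    # A 50,000-token document does not fit the smallest window in the table.
    print(models_that_fit(50_000))
```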
Input vs output tokens
The context window is a shared budget between input and output:
Total tokens = system_prompt + user_message + context + conversation_history + model_output
If you send a 3,000-token prompt to a model with a 4,000-token window, only 1,000 tokens remain for the response. If the answer needs more than that, generation is cut off, often mid-sentence. Applications must budget tokens carefully:
- Reserve tokens for the answer
- Trim conversation history to the most recent exchanges
- Limit retrieved context to top-k most relevant chunks
- Use summaries instead of full documents when possible
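The budgeting steps above can be sketched as a history trimmer: subtract the fixed parts and the reserved answer budget from the window, then keep as many of the most recent messages as still fit. The function name, its parameters, and the `count_tokens` callable are hypothetical. Pass in whatever token counter matches your model.

```python
def trim_history(history, window, fixed_tokens, output_reserve, count_tokens):
    """Keep the most recent messages that fit in the remaining token budget.

    history: list of message strings, oldest first.
    fixed_tokens: tokens used by the system prompt, user message, and context.
    output_reserve: tokens held back for the model's answer.
    count_tokens: callable returning the token count of a string (assumed).
    """
    budget = window - fixed_tokens - output_reserve
    kept = []
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(history):
        cost = count_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    words = lambda s: len(s.split())  # toy counter: one token per word
    print(trim_history(["one two three", "four five", "six"], 10, 4, 3, words))
```

Stopping at the first message that does not fit keeps the retained history contiguous, so the model never sees a conversation with a gap in the middle.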
Why context windows matter for crawled content
When building applications that use crawled web content, the context window is the primary constraint:
- A single news article might be 2,000 tokens
- A product manual might be 10,000 tokens
- A legal brief might be 50,000 tokens
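You cannot always feed an entire document into a standard model: a 10,000-token manual already exceeds a 4,096-token window before the prompt and the response are counted. When a document does not fit, you must do one of the following: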
- Chunk the document - Split into pieces and process individually (see Chunking)
- Summarize first - Compress the document into a shorter form that fits
- Retrieve relevant passages - Use RAG to find only the sections that answer the user’s question
- Use a larger window - Pay more for a model with a bigger context window
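As an illustration of the first option, here is a minimal chunking sketch that packs whole paragraphs greedily into chunks under a token budget. It is one simple strategy among many, not a description of any particular tool. `chunk_paragraphs` and `count_tokens` are hypothetical names, and a paragraph larger than the budget is kept whole as its own oversized chunk.

```python
def chunk_paragraphs(text, max_tokens, count_tokens):
    """Greedily pack blank-line-separated paragraphs into chunks.

    Each chunk stays within max_tokens, except when a single paragraph
    is larger than max_tokens; that paragraph becomes its own chunk.
    Token cost is summed per paragraph, so the separators between
    paragraphs are not counted (an approximation).
    """
    chunks, current, used = [], [], 0
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        cost = count_tokens(paragraph)
        # Start a new chunk when this paragraph would overflow the current one.
        if current and used + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(paragraph)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    words = lambda s: len(s.split())  # toy counter: one token per word
    print(chunk_paragraphs("a b\n\nc d\n\ne f g\n\nh", 4, words))
```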
Token efficiency strategies
Different formats consume tokens at different rates:
| Format | Tokens per 100 words | Notes |
|---|---|---|
| Plain text | ~130 | Minimal overhead |
| Markdown | ~135 | Slight overhead for # and ** |
| HTML | ~250-400 | Tag soup wastes massive token budget |
| JSON | ~150-200 | Quotes, braces, and keys add overhead |
| XML | ~300-500 | Verbose tags consume tokens |
This is why converting HTML to Markdown before feeding content to an LLM is standard practice. The same article in HTML might consume 3,000 tokens, while the Markdown version uses 1,500. That difference determines whether the article fits in one pass or requires chunking.
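To get a feel for how much of a page is markup, the sketch below strips tags with the Python standard library and compares rough token estimates before and after. It assumes about 4 characters per token, which is only a coarse approximation and is less reliable for markup than for prose. It also compares HTML against plain text rather than Markdown, and it treats the contents of script and style elements as text. `markup_overhead` and `rough_tokens` are hypothetical helpers.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def rough_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token (rounded up).
    return (len(text) + 3) // 4


def markup_overhead(html: str) -> float:
    """Return the fraction of estimated tokens spent on markup, not text."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any buffered trailing text
    text = "".join(extractor.parts)
    total = rough_tokens(html)
    return 0.0 if total == 0 else 1 - rough_tokens(text) / total


if __name__ == "__main__":
    page = '<div class="post"><p>Hello world</p></div>'
    print(f"{markup_overhead(page):.0%} of estimated tokens are markup")
```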
Long context challenges
Even models with large context windows face practical limits:
- Attention dilution - The model’s attention mechanism spreads across all tokens. With 100,000 tokens, each token gets less “focus” than with 1,000 tokens, so individual details are easier to miss.
- Cost - API pricing is per-token. At the same per-token rate, a 128,000-token input costs 32 times as much as a 4,000-token input.
- Latency - Processing time scales with sequence length. Longer inputs take longer to process.
- Lost in the middle - Research shows models are better at recalling information at the beginning and end of a long context than in the middle.
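The cost point is plain arithmetic, shown here as a small sketch. The price is a made-up placeholder, not any provider’s real rate, and `input_cost` is a hypothetical helper that assumes a flat per-token price.

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost of an input of `tokens` tokens at a flat per-token price."""
    return tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Hypothetical price: 10 currency units per million input tokens.
    price = 10.0
    short_input = input_cost(4_000, price)
    long_input = input_cost(128_000, price)
    print(f"short: {short_input:.2f}, long: {long_input:.2f}")
    print(f"ratio: {long_input / short_input:.0f}x")
```

The ratio depends only on the token counts, 128,000 / 4,000, so it holds at any flat per-token price.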
How crawler.sh optimizes for context windows
crawler.sh is designed to maximize the value of every token in your context window:
- HTML-to-Markdown conversion - Reduces token count by 40-60% compared to raw HTML, letting you fit more content per request
- Boilerplate removal - Eliminates navigation, ads, and footers that waste tokens without adding information
- Clean extraction - Produces well-structured paragraphs and lists that compress efficiently
- Metadata frontmatter - Stores title, URL, and description outside the main content so they do not compete for the token budget during content processing
- Consistent formatting - Standardized Markdown means predictable token counts across documents
For example, when feeding a crawled documentation site into a RAG pipeline:
# Crawl and export as clean Markdown
crawler crawl https://docs.example.com --render --output docs.zip
# The Markdown files use ~40% fewer tokens than HTML
# This means more content fits in each chunk
# More content per chunk = better retrieval accuracy