Tokenization is the process of converting raw text into a sequence of tokens, the basic units that a language model processes. A token is not necessarily a word. It might be a word fragment, a punctuation mark, a number, or even a multi-word expression that appears frequently in training data. Every input to a modern language model, from a single prompt to an entire document, must pass through a tokenizer before the model can process it.
Tokenization affects almost every aspect of working with language models: how much text fits in a context window, how API costs are calculated, and how the model interprets spacing, capitalization, and special characters. Two prompts that look similar to a human can tokenize into very different lengths, producing different costs and behavior.
How tokenization works
Tokenizers use subword algorithms to split text into a vocabulary of common pieces. The goal is to represent frequent words as single tokens while breaking rare words into smaller, reusable fragments.
For example, with a typical tokenizer:
- “hello” = 1 token
- “tokenizer” = 1-2 tokens (depending on the tokenizer)
- “antidisestablishmentarianism” = 5-6 tokens
- “New York” = 2 tokens
- “1234” = 2-4 tokens (depending on the tokenizer)
- A leading space is usually attached to the token that follows it (for example, “ hello”), while newlines are often separate tokens
Common tokenization algorithms
- Byte Pair Encoding (BPE) - Starts with individual characters (or raw bytes, in byte-level variants) and iteratively merges the most frequent adjacent pairs into new tokens. Used by GPT-2, GPT-3, and GPT-4 (through the tiktoken library).
- WordPiece - Similar to BPE but uses a likelihood-based merge criterion. Used by BERT and other Google models.
- SentencePiece - A tokenizer library rather than a single algorithm. It treats text as a raw stream of characters (including spaces) and learns subword units with either BPE or unigram, without language-specific preprocessing. Used by Llama 1 and 2, T5, and Gemini.
- Unigram language modeling - Starts with a large vocabulary and prunes it down by removing the tokens that least affect the overall likelihood of the training data. Used by T5 and ALBERT through SentencePiece.
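The BPE merge loop described above can be sketched in a few lines. This is a toy illustration on a five-word corpus, not the training code of any real tokenizer; the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return None
    # Ties go to the pair that was seen first (an arbitrary choice for this sketch).
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, vocab = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(vocab)
```

After two merges the frequent word "low" is a single symbol, while the rarer "lower" and "lowest" are "low" plus leftover characters, which is the behavior the section describes for frequent and rare words.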
Tokens vs words
There is no fixed ratio between tokens and words. It depends on the language, the vocabulary, and the text structure:
| Language | Approximate tokens per word | Notes |
|---|---|---|
| English | 1.3 | Most efficient for subword tokenizers |
| German | 1.5-2.0 | Compound words split into fragments |
| Japanese | 2.0-3.0 | Character-based scripts require more tokens |
| Chinese | 1.5-2.5 | Each character may be 1-2 tokens |
| Code | 1.5-4.0 | Depends on language; whitespace-heavy code costs more |
A rough rule of thumb for English prose is that 100 tokens correspond to about 75 words. For code or structured data, expect noticeably more tokens per word.
Why tokenization matters
Context window limits
A model’s context window is measured in tokens, not words. A 4,000-token window holds roughly 3,000 words of English text. If your document is 5,000 words, it will not fit regardless of how important the content is.
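The arithmetic behind this check is simple enough to script. The sketch below uses the article's rule of thumb of 0.75 English words per token; the function names and the `reserved_for_output` parameter are hypothetical, and a real tokenizer should be used when the margin is tight.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English prose, not an exact figure

def estimate_tokens(word_count, words_per_token=WORDS_PER_TOKEN):
    """Estimate the token count of English prose from its word count."""
    return math.ceil(word_count / words_per_token)

def fits_in_context(word_count, context_window, reserved_for_output=0):
    """Check whether the text, plus tokens reserved for the reply, fits the window."""
    return estimate_tokens(word_count) + reserved_for_output <= context_window

print(estimate_tokens(5000))        # 6667
print(fits_in_context(5000, 4000))  # False
print(fits_in_context(3000, 4000))  # True
```

Reserving part of the window for the model's reply matters in practice, because input and output share the same token budget.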
API pricing
Most cloud LLM APIs charge per token, and both input and output tokens count toward the bill. A prompt that is 1,000 tokens longer than necessary costs more on every single request. At scale the overhead compounds: 1,000 extra tokens across a million requests is a billion tokens billed for no benefit.
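To make the scale effect concrete, here is a minimal cost sketch. The price of 3.00 per million input tokens and the request volume are purely hypothetical placeholders, not any provider's actual rates; substitute the figures from your own pricing page.

```python
def token_cost(tokens, price_per_million):
    """Cost of a number of tokens at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million

# Hypothetical figures for illustration only.
EXTRA_TOKENS_PER_REQUEST = 1_000
REQUESTS = 1_000_000
INPUT_PRICE_PER_MILLION = 3.00

wasted_tokens = EXTRA_TOKENS_PER_REQUEST * REQUESTS
wasted_cost = token_cost(wasted_tokens, INPUT_PRICE_PER_MILLION)
print(f"Wasted tokens: {wasted_tokens:,}")  # Wasted tokens: 1,000,000,000
print(f"Wasted cost: {wasted_cost:,.2f}")   # Wasted cost: 3,000.00
```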
Model behavior
Token boundaries affect how models process text. A model might handle “password123” differently from “password 123” because the tokenizer groups them differently. Prompt engineering sometimes exploits tokenization quirks to achieve specific effects.
Token efficiency by format
Different text formats consume tokens at different rates for the same semantic content:
| Format | Tokens per 100 words | Notes |
|---|---|---|
| Plain text | ~130 | Minimal overhead |
| Markdown | ~135 | Slight overhead for syntax characters |
| HTML | ~250-400 | Tag soup, attributes, and nesting waste tokens |
| JSON | ~150-200 | Quotes, braces, and keys add overhead |
| XML | ~300-500 | Verbose tags consume tokens |
| Minified code | ~180-250 | Single-letter variables help, but syntax adds up |
This is why converting HTML to Markdown before feeding content to an LLM is standard practice. The same article in HTML might consume 3,000 tokens, while the Markdown version uses 1,500. That difference determines whether the article fits in one pass or requires chunking.
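The markup overhead can be illustrated without a real tokenizer. The sketch below strips tags with Python's standard `html.parser` and compares a crude estimate of four characters per token, which is an assumption rather than a measured value. It performs plain text extraction, not a full HTML-to-Markdown conversion, and the sample HTML is made up for the example.

```python
from html.parser import HTMLParser

CHARS_PER_TOKEN = 4  # crude assumption for English text; not a real tokenizer

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document and discard the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def rough_token_estimate(text):
    """Approximate a token count from the character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def markup_overhead(html):
    """Return (raw estimate, text-only estimate, fraction saved by dropping markup)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    raw = rough_token_estimate(html)
    clean = rough_token_estimate(text)
    return raw, clean, 1 - clean / raw

sample = (
    '<div class="post"><h1 id="title">Tokenization</h1>'
    '<p style="margin:0">Tokens are the basic units a model processes.</p></div>'
)
raw, clean, saved = markup_overhead(sample)
print(f"raw={raw} clean={clean} saved={saved:.0%}")
```

For real measurements, run both versions of a page through the tokenizer of the model you are targeting, since character-based estimates can be far off for markup-heavy input.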
Counting tokens
Most providers expose tokenizers or utilities to count tokens before sending a request:
```python
import tiktoken

# GPT-4 tokenizer
encoder = tiktoken.encoding_for_model("gpt-4")

text = "Tokenization is the process of splitting text into pieces."
tokens = encoder.encode(text)

print(f"Tokens: {len(tokens)}")
print(f"Token IDs: {tokens}")
# The exact count and IDs depend on the encoding the model uses.
```
For open-source models, the Hugging Face transformers library provides equivalent functionality:
```python
from transformers import AutoTokenizer

# This repository is gated on the Hugging Face Hub and requires approved access.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
text = "Tokenization is the process of splitting text into pieces."
tokens = tokenizer.encode(text)

print(f"Tokens: {len(tokens)}")
# The count differs from GPT-4's because Llama uses a different vocabulary;
# encode() also adds a beginning-of-sequence token by default.
```
Tokenization pitfalls
- Invisible characters - Zero-width spaces, non-breaking spaces, and Unicode normalization differences can produce unexpected token counts
- Language mismatch - A tokenizer trained primarily on English will be inefficient for other languages, producing longer sequences and higher costs
- Special tokens - Most tokenizers reserve special tokens for padding, separation, and system messages. These count against the context window
- Case sensitivity - “Hello” and “hello” may tokenize differently. Some tokenizers preserve case; others do not
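The invisible-character pitfall can be reduced by normalizing text before counting or sending it. The following is a minimal sketch using only the standard library; the function name `clean_for_tokenizer` and the choice of characters to strip are assumptions for illustration. Zero-width joiners are deliberately left alone because they carry meaning in some scripts and emoji sequences.

```python
import unicodedata

# Characters that are invisible and rarely intended in prose:
# zero-width space and the byte order mark / zero-width no-break space.
INVISIBLE = {"\u200b", "\ufeff"}

def clean_for_tokenizer(text):
    """Normalize Unicode and drop invisible characters before tokenizing."""
    # NFKC maps compatibility characters to their plain forms,
    # for example a non-breaking space (U+00A0) becomes a regular space.
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if ch not in INVISIBLE)

dirty = "Hello\u200b\u00a0world"
clean = clean_for_tokenizer(dirty)
print(len(dirty), len(clean))  # 12 11
print(clean == "Hello world")  # True
```

NFKC also rewrites ligatures and full-width characters, so it suits prose better than code or data in which those distinctions matter.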
How crawler.sh optimizes token usage
crawler.sh is designed to maximize the value of every token in your context window:
- HTML-to-Markdown conversion - Reduces token count by 40-60% compared to raw HTML, letting you fit more content per request
- Boilerplate removal - Eliminates navigation, ads, footers, and inline styles that waste tokens without adding information
- Clean extraction - Produces well-structured paragraphs and lists that compress efficiently
- Consistent formatting - Standardized Markdown means predictable token counts across documents
- Metadata separation - Title, URL, and description are stored as metadata rather than inline, keeping the main content compact
For example, when feeding crawled documentation into a model:
```bash
# Crawl and export as clean Markdown
crawler crawl https://docs.example.com --render --output docs.zip

# The Markdown files use ~40% fewer tokens than raw HTML
# This means more content fits in each chunk
# More content per chunk = better retrieval accuracy
```