What is Tokenization in Large Language Models

Tokenization is the process of converting raw text into a sequence of tokens, the basic units that a language model processes. A token is not necessarily a word. It might be a word fragment, a punctuation mark, a number, or, in some tokenizers, a short multi-word sequence that appears frequently in the training data. Every input to a modern language model, from a single prompt to an entire document, must pass through a tokenizer before the model can process it.

Tokenization affects almost every aspect of working with language models: how much text fits in a context window, how API costs are calculated, and how the model interprets spacing, capitalization, and special characters. Two prompts that look similar to a human can tokenize into very different lengths, producing different costs and behavior.

How tokenization works

Tokenizers use subword algorithms to split text into a vocabulary of common pieces. The goal is to represent frequent words as single tokens while breaking rare words into smaller, reusable fragments.

For example, with a typical tokenizer:

“hello” = 1 token
“tokenizer” = 1-2 tokens (depending on the tokenizer)
“antidisestablishmentarianism” = 5-6 tokens
“New York” = 2 tokens
“1234” = 2-4 tokens (depending on the tokenizer)
A leading space is usually attached to the start of the following token (for example, “ York”), not to the end of the preceding one
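The splitting behavior in these examples can be imitated with a small sketch. It assumes a hypothetical toy vocabulary (`TOY_VOCAB`) and uses greedy longest-match splitting, which resembles how WordPiece segments a word at inference time. Production BPE tokenizers apply learned merge rules instead, so treat this as an illustration of the idea rather than how any particular tokenizer works.

```python
def greedy_subword_split(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing in the vocabulary matches: fall back to one character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {
    "hello", "token", "ization", "anti", "dis",
    "establish", "ment", "arian", "ism",
}

for word in ["hello", "tokenization", "antidisestablishmentarianism"]:
    pieces = greedy_subword_split(word, TOY_VOCAB)
    print(f"{word}: {len(pieces)} piece(s) {pieces}")
```

With this vocabulary the frequent word stays whole, while the rare word breaks into six reusable fragments.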

Common tokenization algorithms

Byte Pair Encoding (BPE) - Starts with individual characters (or bytes, in byte-level variants) and iteratively merges the most frequent adjacent pair into a new token. Used by GPT-2, GPT-3, and GPT-4 (through the tiktoken library).
WordPiece - Similar to BPE, but picks the merge that most increases the likelihood of the training data instead of the most frequent pair. Used by BERT and other Google models.
SentencePiece - A tokenization library rather than a single algorithm. It treats text as a raw stream of characters (including spaces) and learns subword units with BPE or unigram, without language-specific preprocessing. Used by Llama 1 and 2, T5, and Gemini.
Unigram language modeling - Starts with a large vocabulary and prunes it down by removing the tokens whose loss least affects the overall likelihood. Used by T5, ALBERT, and XLNet (through SentencePiece).
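To make the BPE merge loop concrete, here is a minimal training sketch in pure Python. It works on characters rather than bytes, ignores pre-tokenization and end-of-word markers, and the function name `learn_bpe_merges` and the toy corpus are assumptions made for illustration. Real implementations are considerably more involved.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


# Toy corpus: word frequencies are expressed by repetition.
toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, corpus = learn_bpe_merges(toy_words, num_merges=4)
print("Merges:", merges)
print("Segmented corpus:", dict(corpus))
```

After four merges on this corpus, the frequent word "low" and the shared suffix "est" have each become a single symbol, while rarer material is still split into characters.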

Tokens vs words

There is no fixed ratio between tokens and words. It depends on the language, the vocabulary, and the text structure:

Language	Approximate tokens per word	Notes
English	1.3	Most efficient for subword tokenizers
German	1.5-2.0	Compound words split into fragments
Japanese	2.0-3.0	Character-based scripts require more tokens
Chinese	1.5-2.5	Each character may be 1-2 tokens
Code	1.5-4.0	Depends on language; whitespace-heavy code costs more

A rough rule of thumb for English prose is that 100 tokens correspond to about 75 words. Code and structured data need more tokens for the same amount of content.
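When no tokenizer is at hand, the rule of thumb can be turned into a quick estimator. The sketch below assumes whitespace-separated words and the English ratio of about 1.3 tokens per word from the table. The function name `estimate_tokens` is hypothetical, and the result is only an approximation, not a substitute for a real token count.

```python
def estimate_tokens(text, tokens_per_word=1.3):
    """Rough token estimate based on a whitespace word count."""
    word_count = len(text.split())
    return round(word_count * tokens_per_word)


sample = "Tokenization is the process of splitting text into pieces."
print(f"Words: {len(sample.split())}")
print(f"Estimated tokens: {estimate_tokens(sample)}")
```

For other languages or for code, pass a different `tokens_per_word` value taken from the table.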

Why tokenization matters

Context window limits

A model’s context window is measured in tokens, not words, and it usually has to hold both the prompt and the generated output. A 4,000-token window holds roughly 3,000 words of English text. A 5,000-word document will not fit in a single request, regardless of how important the content is.
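When a document exceeds the window, it has to be split. The following sketch chunks text on word boundaries under an estimated token budget. It assumes the rough 1.3 tokens-per-word ratio for English, and `chunk_by_token_budget` is a hypothetical helper. In practice you would count with the model's real tokenizer and leave headroom for instructions and the response.

```python
def chunk_by_token_budget(text, max_tokens, tokens_per_word=1.3):
    """Split text into word-aligned chunks under an estimated token budget."""
    # Convert the token budget into an approximate word budget.
    words_per_chunk = max(1, int(max_tokens / tokens_per_word))
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


# A stand-in 5,000-word document.
document = " ".join(["word"] * 5000)
chunks = chunk_by_token_budget(document, max_tokens=4000)
print(f"Chunks: {len(chunks)}")
print(f"Words per chunk: {[len(chunk.split()) for chunk in chunks]}")
```

With these numbers the 5,000-word document ends up in two chunks of 3,076 and 1,924 words.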

API pricing

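Cloud LLM APIs typically charge per token, often at different rates for input and output. Both input and output tokens count toward the bill. A prompt that is 1,000 tokens longer than necessary costs more on every single request. At scale, that overhead adds up: 1,000 extra input tokens across a million requests is a billion extra billed tokens.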
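The arithmetic is simple enough to sketch. The prices below are hypothetical placeholders chosen for illustration, not any provider's actual rates, and `request_cost` is an assumed helper name.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in dollars for one request; prices are per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical prices, for illustration only.
INPUT_PRICE = 3.00    # dollars per million input tokens
OUTPUT_PRICE = 15.00  # dollars per million output tokens

lean = request_cost(2_000, 500, INPUT_PRICE, OUTPUT_PRICE)
padded = request_cost(3_000, 500, INPUT_PRICE, OUTPUT_PRICE)
extra_per_million_requests = (padded - lean) * 1_000_000

print(f"Lean prompt:   ${lean:.4f} per request")
print(f"Padded prompt: ${padded:.4f} per request")
print(f"Extra cost over 1M requests: ${extra_per_million_requests:,.0f}")
```

At these assumed rates, 1,000 unnecessary input tokens per request add about $3,000 per million requests.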

Model behavior

Token boundaries affect how models process text. A model might handle “password123” differently from “password 123” because the tokenizer groups them differently. Prompt engineering sometimes exploits tokenization quirks to achieve specific effects.

Token efficiency by format

Different text formats consume tokens at different rates for the same semantic content:

Format	Tokens per 100 words	Notes
Plain text	~130	Minimal overhead
Markdown	~135	Slight overhead for syntax characters
HTML	~250-400	Tag soup, attributes, and nesting waste tokens
JSON	~150-200	Quotes, braces, and keys add overhead
XML	~300-500	Verbose tags consume tokens
Minified code	~180-250	Single-letter variables help, but syntax adds up

This is why converting HTML to Markdown before feeding content to an LLM is common practice. The same article might consume 3,000 tokens as HTML and 1,500 as Markdown. That difference can determine whether the article fits in one pass or requires chunking.
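The table can be applied directly to estimate how a given article would cost in each format. This sketch copies the approximate rates from the table above (as low and high bounds) and scales them to a word count. `FORMAT_RATES` and `token_range` are names introduced here for illustration, and the output is only as accurate as the table's estimates.

```python
# Approximate tokens per 100 words (low, high), copied from the table above.
FORMAT_RATES = {
    "Plain text": (130, 130),
    "Markdown": (135, 135),
    "JSON": (150, 200),
    "HTML": (250, 400),
    "XML": (300, 500),
}


def token_range(word_count, fmt):
    """Estimated (low, high) token count for `word_count` words in a format."""
    low, high = FORMAT_RATES[fmt]
    return word_count * low // 100, word_count * high // 100


for fmt in FORMAT_RATES:
    low, high = token_range(1000, fmt)
    label = f"{low}" if low == high else f"{low}-{high}"
    print(f"{fmt:<10} ~{label} tokens per 1,000 words")
```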

Counting tokens

Most providers expose tokenizers or utilities to count tokens before sending a request:

import tiktoken

# encoding_for_model maps a model name to its encoding (cl100k_base for GPT-4)
encoder = tiktoken.encoding_for_model("gpt-4")

text = "Tokenization is the process of splitting text into pieces."
tokens = encoder.encode(text)

print(f"Tokens: {len(tokens)}")
print(f"Token IDs: {tokens}")
# Decoding each ID on its own shows where the token boundaries fall
print([encoder.decode([t]) for t in tokens])

For open-source models, the Hugging Face transformers library provides equivalent functionality:

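from transformers import AutoTokenizer

# This repository is gated: accept the license on Hugging Face and log in first
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
text = "Tokenization is the process of splitting text into pieces."
# encode() prepends a beginning-of-sequence token by default
tokens = tokenizer.encode(text)

print(f"Tokens: {len(tokens)}")
# The count differs from GPT-4's because Llama uses a different vocabulary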

Tokenization pitfalls

Invisible characters - Zero-width spaces, non-breaking spaces, and Unicode normalization differences can produce unexpected token counts.
Language mismatch - A tokenizer trained primarily on English is inefficient for other languages, producing longer sequences and higher costs.
Special tokens - Most tokenizers reserve special tokens for padding, separation, and system messages. These count against the context window.
Case sensitivity - “Hello” and “hello” may tokenize differently. Some tokenizers preserve case; others, such as uncased BERT models, lowercase the input first.
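The invisible-character pitfall is often handled with a normalization pass before tokenizing. The sketch below uses only the standard library. The set of characters to strip and the name `clean_for_tokenizer` are assumptions for illustration. Zero-width joiners and non-joiners are deliberately left alone because they carry meaning in emoji sequences and some scripts.

```python
import unicodedata

# Invisible code points that commonly arrive through copy-paste:
# zero-width space, word joiner, and byte order mark.
INVISIBLE = {"\u200b", "\u2060", "\ufeff"}


def clean_for_tokenizer(text):
    """Normalize Unicode and drop invisible characters before tokenizing."""
    # NFKC folds compatibility forms, e.g. non-breaking space to a plain space.
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in normalized if ch not in INVISIBLE)


raw = "\ufeffhello\u200bworld\u00a0again"
cleaned = clean_for_tokenizer(raw)
print(f"Before: {len(raw)} characters, after: {len(cleaned)} characters")
print(repr(cleaned))
```

NFKC also rewrites ligatures and full-width forms, so check that this is acceptable for your content before applying it everywhere.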

How crawler.sh optimizes token usage

crawler.sh is designed to maximize the value of every token in your context window:

HTML-to-Markdown conversion - Reduces token count by 40-60% compared to raw HTML, letting you fit more content per request
Boilerplate removal - Eliminates navigation, ads, footers, and inline styles that waste tokens without adding information
Clean extraction - Produces well-structured paragraphs and lists that tokenize efficiently
Consistent formatting - Standardized Markdown means predictable token counts across documents
Metadata separation - Title, URL, and description are stored as metadata rather than inline, keeping the main content compact

For example, when feeding crawled documentation into a model:

# Crawl and export as clean Markdown
crawler crawl https://docs.example.com --render --output docs.zip

# The Markdown files use ~40% fewer tokens than raw HTML,
# so more content fits in each chunk,
# which can improve retrieval quality
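To illustrate why boilerplate removal saves tokens, here is a rough standard-library sketch that keeps visible text and skips common boilerplate containers. It is not crawler.sh's implementation. The class name, the list of skipped tags, and the sample page are all assumptions, and a real extractor would handle inline formatting and malformed markup far more carefully.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect visible text while skipping common boilerplate containers."""

    SKIP_TAGS = {"script", "style", "nav", "footer", "header", "aside"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than zero while inside a skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped container.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><nav><a href='/'>Home</a></nav>"
    "<main><h1>Title</h1><p>Body text.</p></main>"
    "<footer>Copyright</footer></html>"
)
text = extract_main_text(page)
print(text)
print(f"Characters: {len(page)} in HTML, {len(text)} after extraction")
```

Character counts are only a proxy here; run both versions through a real tokenizer to measure the actual token savings.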