Type-Token Ratio (TTR) is the most straightforward way to measure lexical diversity. Calculate it by dividing the number of unique words in a text (types) by the total number of words (tokens). For any non-empty text the result is greater than 0 and at most 1, and higher values indicate more varied vocabulary.
For example, the sentence “the cat sat on the mat” has 5 unique words (the, cat, sat, on, mat) and 6 total words, giving a TTR of 0.83. A more repetitive sentence like “the the the cat cat cat” has 2 unique words and 6 total words, giving a TTR of 0.33.
How TTR is calculated
The formula is simple:
TTR = Number of unique words / Total number of words
A TTR of 1.0 means every word in the text is unique. A TTR approaching 0 means the same words repeat throughout. Most natural language content falls between 0.4 and 0.7, but the figure depends heavily on topic and, above all, on length.
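The formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names tokenize and type_token_ratio are made up for this example, and the tokenizer (lowercase, letters and apostrophes only) is an assumption. A different tokenization rule will give slightly different ratios.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(text: str) -> float:
    """Return unique words (types) divided by total words (tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # convention chosen for this sketch; TTR is undefined for empty text
    return len(set(tokens)) / len(tokens)

print(round(type_token_ratio("the cat sat on the mat"), 2))   # 0.83
print(round(type_token_ratio("the the the cat cat cat"), 2))  # 0.33
```

Lowercasing means "The" and "the" count as one type, which is usually what you want for content analysis.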
The length problem
TTR has a well-known limitation: it tends to decrease as text gets longer. In any language, common words (articles, prepositions, pronouns) get reused more as content grows, so the token count rises faster than the type count. A 100-word paragraph will nearly always have a higher TTR than a 10,000-word article, even if the longer text uses richer vocabulary.
This means TTR is only reliable when comparing texts of similar length. Comparing the TTR of a 200-word product description against a 3,000-word guide will produce misleading results. When texts differ in length, metrics designed to be less length-sensitive, such as MTLD, are more appropriate.
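To see the length effect directly, you can compute TTR over growing prefixes of the same text. The sketch below assumes the same simple tokenizer as before (lowercase, letters and apostrophes). The function name ttr_at_checkpoints and the sample sentence are invented for illustration.

```python
import re

def ttr_at_checkpoints(text: str, checkpoints: list[int]) -> dict[int, float]:
    """Return the TTR of the first n tokens for each n in checkpoints."""
    tokens = re.findall(r"[a-z']+", text.lower())
    wanted = set(checkpoints)
    seen: set[str] = set()
    results: dict[int, float] = {}
    for position, token in enumerate(tokens, start=1):
        seen.add(token)  # types observed so far
        if position in wanted:
            results[position] = len(seen) / position
    return results

sample = "the cat sat on the mat and the dog sat on the rug"
for n, value in ttr_at_checkpoints(sample, [4, 8, 13]).items():
    print(n, round(value, 3))
# 4 1.0
# 8 0.75
# 13 0.615
```

The ratio can tick upward briefly when a new word appears, but the overall trend is downward as the text grows.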
TTR in SEO and content analysis
Despite its limitations, TTR is useful for content quality assessment when scoped well:
- Comparing similar pages - TTR works well when comparing pages of roughly the same length that target the same topic. A category page with a TTR of 0.35 compared to competitors at 0.55 likely needs more varied language.
- Detecting keyword stuffing - Unusually low TTR on a page optimized for specific keywords can indicate over-repetition that search engines may penalize.
- Content audits at scale - When normalized for length (by calculating TTR over fixed-size text windows), it can flag pages across a site that may need vocabulary enrichment.
- AEO readiness - AI answer engines may favor content with varied phrasing, which would make TTR a rough proxy for how well content serves diverse query formulations.
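A length-normalized audit along these lines might look like the following sketch. Everything specific here is an assumption made for illustration: the function name flag_low_diversity, the 200-token window, and the 0.10 margin below the group mean are hypothetical defaults, not established thresholds. Each page is truncated to the same number of tokens before TTR is computed, so the scores are comparable.

```python
import re
from statistics import mean

def flag_low_diversity(pages: dict[str, str], window: int = 200,
                       margin: float = 0.10) -> list[str]:
    """Return URLs whose TTR at a common length sits well below the group mean."""
    tokenized = {url: re.findall(r"[a-z']+", text.lower()) for url, text in pages.items()}
    tokenized = {url: toks for url, toks in tokenized.items() if toks}  # skip empty pages
    if not tokenized:
        return []
    # Compare every page at the same length: the window, or the shortest page if shorter.
    length = min(window, min(len(toks) for toks in tokenized.values()))
    scores = {url: len(set(toks[:length])) / length for url, toks in tokenized.items()}
    baseline = mean(scores.values())
    return sorted(url for url, score in scores.items() if score < baseline - margin)

pages = {
    "/a": "red shoes red shoes red shoes red shoes",
    "/b": "lightweight trail runners with cushioned soles and grip",
    "/c": "waterproof hiking boots built for rocky alpine terrain today",
}
print(flag_low_diversity(pages))  # ['/a']
```

Truncating to the shortest page keeps the comparison fair, but it discards text from longer pages. Averaging over several fixed-size windows is the usual alternative.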
Variants of TTR
Several adjusted versions of TTR exist to address the length sensitivity problem:
- Standardized TTR (STTR) - Calculates TTR over consecutive segments of equal length (usually 1,000 words) and averages the results
- Root TTR - Divides the number of types by the square root of the number of tokens, partially correcting for length (also known as Guiraud's index)
- Log TTR - Divides the logarithm of the number of types by the logarithm of the number of tokens for a more stable measure across different text lengths (also known as Herdan's C)
- MTLD - A fundamentally different approach that measures the average length of the word sequences in a text that keep TTR above a fixed threshold (conventionally 0.72)
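The variants above can be sketched as follows. Treat this as an illustration under stated assumptions rather than a reference implementation: the function names are invented, the tokenizer is the simple lowercase one used throughout, STTR here ignores a trailing partial segment, and the MTLD sketch closes a factor when the running TTR falls strictly below the threshold (published implementations differ on such details).

```python
import math
import re
from typing import Optional

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def sttr(tokens: list[str], segment: int = 1000) -> Optional[float]:
    """Mean TTR over consecutive full segments; None if the text is shorter than one segment."""
    starts = range(0, len(tokens) - segment + 1, segment)
    ratios = [len(set(tokens[i:i + segment])) / segment for i in starts]
    return sum(ratios) / len(ratios) if ratios else None

def root_ttr(tokens: list[str]) -> Optional[float]:
    """Types divided by the square root of tokens."""
    return len(set(tokens)) / math.sqrt(len(tokens)) if tokens else None

def log_ttr(tokens: list[str]) -> Optional[float]:
    """log(types) / log(tokens); undefined for fewer than two tokens."""
    if len(tokens) < 2:
        return None
    return math.log(len(set(tokens))) / math.log(len(tokens))

def _mtld_pass(tokens: list[str], threshold: float) -> float:
    factors, types, count = 0.0, set(), 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count < threshold:  # diversity dropped: close this factor
            factors += 1
            types, count = set(), 0
    if count:  # credit the unfinished segment with a partial factor
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(tokens) / factors if factors else math.inf

def mtld(tokens: list[str], threshold: float = 0.72) -> float:
    """Mean of a forward and a backward pass; 0.0 for empty input (a convention of this sketch)."""
    if not tokens:
        return 0.0
    return (_mtld_pass(tokens, threshold) + _mtld_pass(tokens[::-1], threshold)) / 2

print(mtld(tokenize("the the the cat cat cat")))  # 3.0
```

A text with no repeated words never closes a factor, so this sketch returns infinity for it. In practice all four measures are only meaningful on texts much longer than these toy sentences.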
How crawler.sh helps
The --extract-content flag of the crawler crawl command extracts page content as clean Markdown, giving you the raw text needed to calculate TTR and other lexical diversity metrics site-wide. Pair this with the crawler seo command to identify thin pages, and you can prioritize which content needs vocabulary enrichment and deeper topical coverage.
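Once the Markdown is available, a short script can turn it into a per-page TTR report. The sketch below assumes the extracted content has been saved as one .md file per page in a local directory. That layout is hypothetical and is not a description of crawler.sh's actual output format. The function names and the very rough link stripping are likewise illustrative only.

```python
import re
from pathlib import Path

def strip_links(markdown: str) -> str:
    """Replace Markdown links and images with their visible text so URLs are not counted as words."""
    return re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", markdown)

def ttr_report(directory: str) -> dict[str, float]:
    """Map each non-empty .md file in the directory to its type-token ratio."""
    report: dict[str, float] = {}
    for path in sorted(Path(directory).glob("*.md")):
        text = strip_links(path.read_text(encoding="utf-8"))
        tokens = re.findall(r"[a-z']+", text.lower())
        if tokens:  # skip files with no word tokens
            report[path.name] = len(set(tokens)) / len(tokens)
    return report
```

Because raw TTR depends on length, a report like this is most useful for pages of similar size, or as a first pass before applying a length-adjusted variant.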