Type-Token Ratio

Type-Token Ratio (TTR) is the most straightforward way to measure lexical diversity. Calculate it by dividing the number of unique words in a text (types) by the total number of words (tokens). The result falls between 0 and 1, where higher values indicate more varied vocabulary.

For example, the sentence “the cat sat on the mat” has 5 unique words (the, cat, sat, on, mat) and 6 total words, giving a TTR of 5/6 ≈ 0.83. A more repetitive sentence like “the the the cat cat cat” has 2 unique words and 6 total words, giving a TTR of 2/6 ≈ 0.33.

How TTR is calculated

The formula is simple:

TTR = Number of unique words / Total number of words

A TTR of 1.0 means every word in the text is unique. A TTR approaching 0 means the same words repeat throughout. Most natural language content falls between 0.4 and 0.7, depending on topic and length.
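
The formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `tokenize` helper and its lowercase, regex-based word splitting are assumptions, and a different tokenizer (or a choice to keep case) will give somewhat different scores.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens.

    Assumption: a word is any run of letters, digits, or apostrophes.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(text: str) -> float:
    """Return unique words / total words, or 0.0 for text with no words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


print(round(type_token_ratio("the cat sat on the mat"), 2))   # 0.83
print(round(type_token_ratio("the the the cat cat cat"), 2))  # 0.33
```

Because the tokenizer lowercases everything, "The" and "the" count as one type.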

The length problem

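TTR has a well-known limitation: it tends to decrease as text gets longer. In any language, common words (articles, prepositions, pronouns) get reused more as content grows, so each additional word is increasingly likely to be one the text has already used. The type count grows more slowly than the token count, and the ratio falls. A 100-word paragraph will nearly always have a higher TTR than a 10,000-word article, even if the longer text uses richer vocabulary.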

This means TTR is only reliable when comparing texts of similar length. Comparing the TTR of a 200-word product description against a 3,000-word guide will produce misleading results. For measurement that is far less sensitive to length, metrics like MTLD are more appropriate.
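
One way to see the length effect is to recompute TTR on growing prefixes of the same text. The sketch below assumes a simple regex tokenizer, and `ttr_by_prefix` and its `step` parameter are illustrative names, not part of any library.

```python
import re


def tokenize(text):
    """Assumed tokenizer: lowercase runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ttr_by_prefix(text, step=50):
    """Return (prefix_length, TTR) pairs, sampled every `step` tokens."""
    seen = set()
    points = []
    for i, token in enumerate(tokenize(text), start=1):
        seen.add(token)
        if i % step == 0:
            points.append((i, len(seen) / i))
    return points


# Extreme case for illustration: one sentence repeated four times.
sample = "the cat sat on the mat " * 4
for length, ttr in ttr_by_prefix(sample, step=6):
    print(length, round(ttr, 2))
# 6 0.83
# 12 0.42
# 18 0.28
# 24 0.21
```

Real prose declines far more gradually than this repeated sentence, but running the function on a long article typically shows the same downward trend.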

TTR in SEO and content analysis

Despite its limitations, TTR is useful for content quality assessment when scoped well:

Comparing similar pages - TTR works well when comparing pages of roughly the same length that target the same topic. A category page with a TTR of 0.35 compared to competitors at 0.55 likely needs more varied language.
Detecting keyword stuffing - Unusually low TTR on a page optimized for specific keywords can indicate over-repetition that search engines may penalize.
Content audits at scale - When normalized for length (by calculating TTR over fixed-size text windows), TTR can flag pages across a site that may need vocabulary enrichment.
AEO readiness - AI answer engines may favor content with varied phrasing, which would make TTR a rough proxy for how well content serves diverse query formulations in answer engine optimization (AEO).
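
A comparison of similar pages might be sketched as follows. Everything here is an assumption made for illustration: the `flag_low_diversity` name, the truncation of every page to the shortest page's token count so that lengths match, and the `margin` of 0.10 below the peer median. None of these are established cutoffs.

```python
import re
from statistics import median


def tokenize(text):
    """Assumed tokenizer: lowercase runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def flag_low_diversity(pages: dict[str, str], margin: float = 0.10) -> list[str]:
    """Return names of pages whose TTR is more than `margin` below the median.

    Every page is truncated to the shortest page's token count first,
    so that all scores are computed over the same number of tokens.
    """
    tokenized = {name: tokenize(text) for name, text in pages.items()}
    tokenized = {name: toks for name, toks in tokenized.items() if toks}
    if len(tokenized) < 2:
        return []  # nothing to compare against
    n = min(len(toks) for toks in tokenized.values())
    scores = {
        name: len(set(toks[:n])) / n for name, toks in tokenized.items()
    }
    baseline = median(scores.values())
    return sorted(name for name, s in scores.items() if s < baseline - margin)


pages = {
    "ours": "buy shoes shoes shoes shoes shoes",
    "competitor-a": "red shoes blue shoes green shoes",
    "competitor-b": "trail running shoes for wet weather",
}
print(flag_low_diversity(pages))  # ['ours']
```

Truncating to the shortest page discards text, so this only makes sense when the pages are already of roughly similar length, as the first item above recommends.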

Variants of TTR

Several adjusted versions of TTR exist to address the length sensitivity problem:

Standardized TTR (STTR) - Calculates TTR over consecutive segments of equal length (usually 1,000 words) and averages the results
Root TTR - Divides the number of types by the square root of the number of tokens, partially correcting for length (also known as Guiraud's index)
Log TTR - Uses the logarithm of types divided by the logarithm of tokens for a more stable measure across different text lengths (also known as Herdan's C)
MTLD - Measure of Textual Lexical Diversity, a fundamentally different approach that reports the average length of the consecutive word runs over which TTR stays above a set threshold
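
The variants above can be sketched as small functions over a token list. These are simplified illustrations under stated assumptions: the function names are made up for this example, `standardized_ttr` drops any trailing partial window, and `mtld_forward` makes a single left-to-right pass using 0.72 as the threshold (a commonly cited default), whereas MTLD is usually reported as the average of a forward and a backward pass.

```python
import math


def root_ttr(tokens):
    """Types divided by the square root of tokens."""
    return len(set(tokens)) / math.sqrt(len(tokens)) if tokens else 0.0


def log_ttr(tokens):
    """log(types) / log(tokens); returns 0.0 for fewer than two tokens."""
    if len(tokens) < 2:
        return 0.0
    return math.log(len(set(tokens))) / math.log(len(tokens))


def standardized_ttr(tokens, window=1000):
    """Mean TTR over consecutive full windows of `window` tokens."""
    scores = [
        len(set(tokens[i:i + window])) / window
        for i in range(0, len(tokens) - window + 1, window)
    ]
    return sum(scores) / len(scores) if scores else 0.0


def mtld_forward(tokens, threshold=0.72):
    """One-direction MTLD sketch: token count divided by 'factor' count.

    A factor is completed each time the running TTR drops to the threshold;
    the counters then reset and a new segment starts.
    """
    factors = 0.0
    types, count = set(), 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count <= threshold:
            factors += 1
            types, count = set(), 0
    if count:  # partial credit for the unfinished final segment
        factors += (1 - len(types) / count) / (1 - threshold)
    # Assumption: if TTR never drops, fall back to the token count.
    return len(tokens) / factors if factors else float(len(tokens))


tokens = "the cat sat on the mat".split()
print(round(root_ttr(tokens), 2))          # 2.04
print(round(log_ttr(tokens), 2))           # 0.9
print(standardized_ttr(tokens, window=3))  # 1.0
```

Note that Root TTR is not bounded by 1, so its values are only comparable with other Root TTR values, not with plain TTR.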

How crawler.sh helps

The --extract-content flag of crawler crawl extracts page content as clean Markdown, giving you the raw text needed to calculate TTR and other lexical diversity metrics site-wide. Pair this with the crawler seo command to identify thin pages, and you can prioritize which content needs vocabulary enrichment and deeper topical coverage.
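
Markdown syntax should be stripped before counting words, otherwise link URLs and markup characters distort the token list. The sketch below assumes you already have a page's extracted Markdown as a string; how the crawler stores its output is not covered here, and `markdown_to_text` is a hypothetical helper that handles only the most common syntax, not a full Markdown parser.

```python
import re


def markdown_to_text(md: str) -> str:
    """Crudely reduce Markdown to plain text (hypothetical helper)."""
    md = re.sub(r"!\[[^\]]*\]\([^)]*\)", " ", md)     # drop images entirely
    md = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", md)  # keep link text, drop URL
    md = re.sub(r"[#*_>`|~-]+", " ", md)              # strip markup characters
    return md


def ttr_from_markdown(md: str) -> float:
    """TTR of the visible words in a Markdown string, or 0.0 if there are none."""
    tokens = re.findall(r"[a-z0-9']+", markdown_to_text(md).lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


page = "# Red shoes\n\nBuy [red shoes](https://example.com/red) today."
print(round(ttr_from_markdown(page), 2))  # 0.67
```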
