An embedding is a dense vector of floating-point numbers that represents the semantic meaning of a piece of text. Unlike traditional search, which matches keywords, embeddings capture meaning. The sentences “The cat sat on the mat” and “A feline rested on the rug” share no content words, yet their embeddings are close together in vector space because they describe the same idea.
Embeddings are produced by embedding models, which are neural networks trained to convert text into vectors. The key property is that semantically similar texts produce vectors that are mathematically close, while unrelated texts produce vectors that are far apart. This enables semantic search, recommendation, clustering, and classification at scale.
How embeddings are generated
An embedding model reads text and outputs a fixed-length vector. The process works like this:
- Tokenization - The text is split into tokens using the model’s tokenizer
- Neural encoding - A transformer network processes the tokens and builds contextual representations
- Pooling - The token-level representations are combined into a single vector, typically by averaging or using a special CLS token
- Normalization - The vector is often normalized to unit length so that similarity can be measured with a simple dot product
The result is a vector such as [0.023, -0.156, 0.891, ...] with dimensions ranging from 384 to 4,096 depending on the model.
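The pooling and normalization steps above can be sketched with NumPy. This is a minimal illustration, not any particular model's implementation: `token_vectors` and `attention_mask` are made-up stand-ins for what a transformer encoder and tokenizer would produce, and mean pooling is only one of the pooling strategies mentioned.

```python
import numpy as np

# Hypothetical encoder output: 4 token positions, 3 dimensions each.
# Real models emit hundreds or thousands of dimensions per token.
token_vectors = np.array([
    [0.2, -0.1, 0.5],
    [0.4,  0.3, 0.1],
    [0.0,  0.2, 0.3],
    [0.0,  0.0, 0.0],  # padding position
])
# 1 marks a real token, 0 marks padding that should be ignored.
attention_mask = np.array([1, 1, 1, 0])


def mean_pool(vectors, mask):
    """Average the token vectors, skipping padded positions."""
    mask = mask[:, None].astype(vectors.dtype)
    return (vectors * mask).sum(axis=0) / mask.sum()


def l2_normalize(vector):
    """Scale a vector to unit length."""
    return vector / np.linalg.norm(vector)


pooled = mean_pool(token_vectors, attention_mask)
embedding = l2_normalize(pooled)
print(embedding, np.linalg.norm(embedding))
```

Masking matters because padded positions would otherwise pull the average toward zero for short inputs in a batch.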
Measuring similarity
Once text is converted to vectors, similarity is computed with mathematical operations:
- Cosine similarity - Measures the angle between two vectors. Range: -1 (opposite direction) to 1 (same direction). Values above roughly 0.7 often indicate strong semantic similarity, though the useful threshold varies by model.
- Euclidean distance - Measures the straight-line distance between vector endpoints. Smaller values mean closer vectors.
- Dot product - For unit-length vectors, this is equivalent to cosine similarity. Fast to compute and widely used in vector databases.
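The relationship between the three measures above can be checked numerically. The sketch below uses random vectors (an assumption made purely for illustration) to show that, once vectors are scaled to unit length, the dot product equals cosine similarity and the squared Euclidean distance equals 2 - 2 × cosine.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Unit-length copies of the same vectors.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_unit = np.dot(a_unit, b_unit)
euclidean_unit = np.linalg.norm(a_unit - b_unit)

# Both checks hold for any pair of non-zero vectors.
print(np.isclose(cosine, dot_unit))
print(np.isclose(euclidean_unit ** 2, 2 - 2 * cosine))
```

A practical consequence is that, for normalized embeddings, ranking by dot product, cosine similarity, or Euclidean distance yields the same ordering.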
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# embed() stands in for a call to your embedding model
# Two similar sentences produce vectors with high cosine similarity
vec1 = embed("The bank approved my loan")
vec2 = embed("I received approval for a mortgage from the bank")
print(cosine_similarity(vec1, vec2))
# 0.89 (high similarity)

# A different sense of "bank" produces a dissimilar vector
vec3 = embed("I sat by the river bank")
print(cosine_similarity(vec1, vec3))
# 0.42 (low similarity)
```
Dimensionality
Embedding vectors vary in length depending on the model:
| Model | Dimensions | Notes |
|---|---|---|
| OpenAI text-embedding-3-small | 1,536 | Good balance of quality and cost |
| OpenAI text-embedding-3-large | 3,072 | Higher quality, more expensive |
| Cohere embed-english-v3 | 1,024 | Strong performance on English text |
| BGE-large-en-v1.5 | 1,024 | Open-source leader for retrieval tasks |
| E5-large-v2 | 1,024 | Microsoft model optimized for semantic search |
| GTE-large | 1,024 | General text embeddings, strong on benchmarks |
Higher dimensions can capture finer semantic distinctions, but storage and the cost of each similarity comparison grow in proportion to the vector length. For most applications, 768 to 1,536 dimensions are sufficient.
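To make the storage trade-off concrete, here is a rough back-of-the-envelope calculation. It assumes each value is stored as a 4-byte float32 and counts only the raw vectors. Index structures and metadata in a real vector database add overhead on top, and quantization can reduce the footprint.

```python
def index_size_gib(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for the vectors alone, in GiB (float32 by default)."""
    return num_vectors * dimensions * bytes_per_value / 1024 ** 3


for dims in (384, 1024, 1536, 3072):
    size = index_size_gib(1_000_000, dims)
    print(f"{dims:>5} dims: {size:.2f} GiB per million vectors")
```

Under these assumptions, one million 1,536-dimensional vectors take about 5.7 GiB, and doubling the dimension doubles that figure.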
Use cases
Semantic search
Instead of matching keywords, a query is embedded and the database returns the most similar document vectors. This handles synonyms, paraphrases, and conceptually related content that keyword search misses.
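A brute-force version of this lookup fits in a few lines. The sketch assumes the document vectors are already unit-length and held in a NumPy matrix. The toy vectors are invented for illustration, and a production system would normally delegate this step to a vector database with an approximate index.

```python
import numpy as np


def top_k(query_vec, doc_matrix, k=3):
    """Return (row index, score) pairs for the k most similar documents."""
    scores = doc_matrix @ query_vec      # dot product equals cosine for unit vectors
    order = np.argsort(-scores)[:k]      # highest scores first
    return [(int(i), float(scores[i])) for i in order]


# Toy unit-length "document" vectors and a unit-length query.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([0.6, 0.8, 0.0])

print(top_k(query, docs, k=2))  # rows 1 and 2 rank highest
```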
Retrieval-Augmented Generation (RAG)
A user question is embedded, relevant documents are retrieved from a vector database, and those documents are fed to a language model as context. The quality of retrieval depends heavily on the quality of the embeddings.
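The final step, handing retrieved documents to the language model, usually comes down to assembling a prompt. The sketch below shows one possible layout. The `build_prompt` helper, the chunk dictionary keys, and the instruction wording are all illustrative assumptions, and the retrieval and model calls are omitted.

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and the user question into one prompt string."""
    context = "\n\n".join(
        f"[{i}] {chunk['text']} (source: {chunk['url']})"
        for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieval results; in practice these come from the vector database.
retrieved = [
    {"text": "Embeddings map text to vectors.",
     "url": "https://docs.example.com/embeddings"},
    {"text": "Cosine similarity compares vector angles.",
     "url": "https://docs.example.com/similarity"},
]

prompt = build_prompt("How is similarity measured?", retrieved)
print(prompt)
```

Numbering the chunks and including their source URLs makes it easier for the model to cite where an answer came from.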
Clustering and classification
Documents with similar embeddings can be grouped into clusters without predefined categories. A support ticket system might automatically cluster tickets by topic based on embedding similarity.
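As a minimal sketch of grouping without predefined categories, the following greedy pass assigns each vector to the first cluster whose centroid is similar enough and otherwise starts a new cluster. The 0.8 threshold and the toy vectors are assumptions. Established algorithms such as k-means or HDBSCAN are the more common choice in practice.

```python
import numpy as np


def greedy_cluster(vectors, threshold=0.8):
    """Group unit vectors by cosine similarity to a running cluster centroid."""
    centroid_sums, members = [], []
    for idx, vec in enumerate(vectors):
        for c, total in enumerate(centroid_sums):
            # Normalize the running sum so the comparison is a cosine similarity.
            if np.dot(vec, total) / np.linalg.norm(total) >= threshold:
                members[c].append(idx)
                centroid_sums[c] = total + vec
                break
        else:
            # No existing cluster is close enough, so start a new one.
            centroid_sums.append(vec.copy())
            members.append([idx])
    return members


raw = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(greedy_cluster(unit))  # [[0, 1], [2, 3]]
```

The result depends on input order, which is one reason to treat this only as an illustration of the idea.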
Recommendation
Content recommendation engines embed items and user preferences, then suggest items whose embeddings are close to the user’s profile vector.
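One simple way to build the profile vector is to average the embeddings of items the user has liked. The sketch below assumes that approach, uses invented unit-length item vectors, and excludes items the user has already seen. Real systems typically weight recent or stronger signals more heavily.

```python
import numpy as np


def recommend(item_vectors, liked_ids, k=2):
    """Rank unseen items by similarity to the mean of the liked items."""
    profile = item_vectors[liked_ids].mean(axis=0)
    profile = profile / np.linalg.norm(profile)
    scores = item_vectors @ profile
    scores[liked_ids] = -np.inf          # never recommend what was already liked
    return [int(i) for i in np.argsort(-scores)[:k]]


# Toy unit-length item embeddings.
items = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.6, 0.8, 0.0],
])

print(recommend(items, liked_ids=[0, 1], k=2))  # [4, 2]
```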
Duplicate detection
Two documents with cosine similarity near 1.0 are likely duplicates or near-duplicates, even if their wording differs.
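A direct way to apply this is to compute all pairwise similarities and keep the pairs above a cutoff. The sketch assumes unit-length vectors and a threshold of 0.95, which is an illustrative value that should be tuned per model and dataset.

```python
import numpy as np


def find_near_duplicates(vectors, threshold=0.95):
    """Return index pairs (i, j), i < j, whose cosine similarity meets the threshold."""
    sims = vectors @ vectors.T                       # all pairwise cosine similarities
    rows, cols = np.triu_indices(len(vectors), k=1)  # each unordered pair once
    return [(int(i), int(j)) for i, j in zip(rows, cols) if sims[i, j] >= threshold]


raw = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(find_near_duplicates(unit))  # [(0, 1)]
```

The similarity matrix grows quadratically with the number of documents, so large collections usually rely on a nearest-neighbor index instead of comparing every pair.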
Choosing an embedding model
The right model depends on the task:
| Task | Recommended approach |
|---|---|
| General English search | OpenAI text-embedding-3-small or Cohere embed-v3 |
| High-precision retrieval | OpenAI text-embedding-3-large or BGE-large |
| Multilingual content | Cohere embed-multilingual-v3 or E5-multilingual |
| Cost-sensitive at scale | BGE-small or GTE-base (open-source, self-hosted) |
| Domain-specific (legal, medical) | Fine-tune a base model on domain text |
Key considerations:
- Context length - Most embedding models handle 512-8,192 tokens. Long documents must be chunked before embedding.
- Language coverage - Models trained primarily on English underperform on other languages
- Task alignment - Some models are optimized for similarity (bi-encoders), others for classification. Match the model to the task
- Latency and cost - API-hosted models charge per token. Self-hosted open-source models trade setup complexity for predictable costs
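For the context-length point above, a common workaround is to split long documents into overlapping chunks before embedding. The sketch below counts whitespace-separated words as a crude proxy for tokens. That is an assumption, since real token counts depend on the model's tokenizer, and the default sizes are arbitrary.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(str(n) for n in range(10))
print(chunk_words(sample, max_words=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

The overlap keeps a sentence that straddles a boundary from being cut off from its context in both neighboring chunks.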
Common embedding pitfalls
- Garbage in, garbage out - Noisy, poorly formatted, or irrelevant text produces low-quality vectors
- Vocabulary mismatch - Technical jargon, brand names, and neologisms may not be well-represented in the model’s training data
- Length truncation - Documents longer than the model’s context window are typically truncated, so content near the end never reaches the embedding
- Outdated knowledge - Models trained before a term was coined will produce weaker embeddings for that term
How crawler.sh improves embedding quality
crawler.sh produces text that embeds more accurately and retrieves more reliably:
- Clean text - HTML-to-Markdown conversion removes navigation, ads, and boilerplate that pollute the semantic signal
- Structured content - Preserved headings and paragraphs keep logical sections intact, so each chunk contains coherent meaning
- Consistent formatting - Uniform Markdown structure means similar documents embed predictably
- Metadata for context - Each chunk carries its source URL and page title, enabling the retrieval system to provide provenance
- JavaScript rendering - Dynamically loaded content is captured, ensuring the text that gets embedded is complete
For a RAG pipeline, this means:
```bash
# 1. Crawl and export as Markdown
crawler crawl https://docs.example.com --render --output docs.zip

# 2. Chunk by Markdown headings
# 3. Embed each chunk
# 4. Store in vector database with source URL metadata
```
The result is a vector store where retrieved chunks are semantically coherent, traceable to their source, and free of the HTML noise that degrades embedding quality.
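Step 2 of the pipeline, chunking by Markdown headings, might look like the sketch below. It assumes you already have a page's Markdown as a string along with its source URL. The `split_by_headings` helper and the chunk dictionary layout are hypothetical, and nothing here relies on the structure of crawler.sh's export. Embedding and storage (steps 3 and 4) are left to whichever model and vector database you use.

```python
import re

# ATX-style headings: one to six '#' characters followed by the title text.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown, source_url):
    """Split Markdown into one chunk per heading section, tagged with its URL."""
    chunks = []
    heading, lines = None, []

    def flush():
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"heading": heading, "text": body, "url": source_url})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                      # close the previous section
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()                              # close the final section
    return chunks


page = "# Intro\nHello world.\n\n## Setup\nInstall it.\n"
for chunk in split_by_headings(page, "https://docs.example.com/start"):
    print(chunk)
```

This simple version treats any line starting with `#` as a heading, so comment lines inside fenced code blocks would need extra handling.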