What are Embeddings in AI and Machine Learning

An embedding is a dense vector of floating-point numbers that represents the semantic meaning of a piece of text. Unlike traditional search, which matches keywords, embeddings capture meaning. The sentences “The cat sat on the mat” and “A feline rested on the rug” share no content words, yet their embeddings are close together in vector space because they describe the same idea.

Embeddings are produced by embedding models, which are neural networks trained to convert text into vectors. The key property is that semantically similar texts produce vectors that are mathematically close, while unrelated texts produce vectors that are far apart. This enables semantic search, recommendation, clustering, and classification at scale.

How embeddings are generated

An embedding model reads text and outputs a fixed-length vector. The process works like this:

Tokenization - The text is split into tokens using the model’s tokenizer
Neural encoding - A transformer network processes the tokens and builds contextual representations
Pooling - The token-level representations are combined into a single vector, typically by averaging or using a special CLS token
Normalization - The vector is often normalized to unit length so that similarity can be measured with a simple dot product

The result is a vector such as [0.023, -0.156, 0.891, ...] whose number of dimensions typically ranges from 384 to 4,096 depending on the model.
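The pooling and normalization steps can be sketched with NumPy. This is a minimal illustration, assuming mean pooling over non-padding tokens; the token vectors and attention mask below are made-up values standing in for real transformer output, and the function names are hypothetical.

```python
import numpy as np

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, None].astype(token_vectors.dtype)
    return (token_vectors * mask).sum(axis=0) / mask.sum()

def l2_normalize(vector):
    """Scale a vector to unit length."""
    return vector / np.linalg.norm(vector)

# Made-up encoder output: 4 token positions, 3 dimensions each.
# The last position is padding and is masked out.
token_vectors = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [2.0, 0.0, 2.0],
    [9.0, 9.0, 9.0],
])
attention_mask = np.array([1, 1, 1, 0])

pooled = mean_pool(token_vectors, attention_mask)  # average of the first three rows
embedding = l2_normalize(pooled)
print(pooled, embedding, np.linalg.norm(embedding))
```

Real models differ in the pooling strategy they were trained with, so the pooling used at inference time should match the model's documentation.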

Measuring similarity

Once text is converted to vectors, similarity is computed with mathematical operations:

Cosine similarity - Measures the angle between two vectors. Range: -1 (opposite) to 1 (identical). Values above roughly 0.7 often indicate strong semantic similarity, although useful thresholds vary from model to model.
Euclidean distance - Measures the straight-line distance between vector endpoints. Smaller values mean closer vectors.
Dot product - For unit-length vectors, this is equivalent to cosine similarity. Fast to compute and widely used in vector databases.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

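# embed() is a placeholder for a call to whichever embedding model you use.
# The scores shown below are illustrative; exact values depend on the model.

# Two similar sentences produce vectors with high cosine similarity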
vec1 = embed("The bank approved my loan")
vec2 = embed("I received approval for a mortgage from the bank")

print(cosine_similarity(vec1, vec2))
# 0.89 (high similarity)

# A different sense of "bank" produces a dissimilar vector
vec3 = embed("I sat by the river bank")
print(cosine_similarity(vec1, vec3))
# 0.42 (low similarity)
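The three measures are closely related once vectors have unit length. The check below uses two made-up vectors rather than real embeddings to show that the dot product equals cosine similarity, and that squared Euclidean distance equals 2 - 2 * cosine.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

# Made-up vectors, normalized to unit length
a = normalize(np.array([1.0, 2.0, 2.0]))
b = normalize(np.array([2.0, 1.0, 2.0]))

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
euclidean = np.linalg.norm(a - b)

print(cosine, dot)                      # equal for unit vectors
print(euclidean ** 2, 2 - 2 * cosine)   # equal for unit vectors
```

Because of this relationship, ranking unit-length vectors by any of the three measures gives the same order.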

Dimensionality

Embedding vectors vary in length depending on the model:

Model	Dimensions	Notes
OpenAI text-embedding-3-small	1,536	Good balance of quality and cost
OpenAI text-embedding-3-large	3,072	Higher quality, more expensive
Cohere embed-english-v3	1,024	Strong performance on English text
BGE-large-en-v1.5	1,024	Open-source leader for retrieval tasks
E5-large-v2	1,024	Microsoft model optimized for semantic search
GTE-large	1,024	General text embeddings, strong on benchmarks

Higher dimensions can capture finer semantic distinctions but require more storage and slightly more computation. For most applications, 768-1,536 dimensions are sufficient.
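The storage cost is simple arithmetic. The estimate below assumes each value is stored as a 32-bit float (4 bytes) and ignores index overhead and metadata, which a real vector database would add on top.

```python
def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone, assuming float32 storage."""
    return num_vectors * dimensions * bytes_per_value

for dims in (384, 1024, 1536, 3072):
    size_gb = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{dims:>5} dims: {size_gb:.2f} GB per million vectors")
```

Under these assumptions, one million 1,536-dimensional vectors take about 6.14 GB, and doubling the dimensions doubles the footprint.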

Use cases

Semantic search

Instead of matching keywords, a query is embedded and the database returns the most similar document vectors. This handles synonyms, paraphrases, and conceptually related content that keyword search misses.
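A brute-force version of this lookup fits in a few lines. The sketch assumes the document vectors are already unit length and small enough to hold in memory; the vectors are toy values, and a production system would normally use a vector database with an approximate index instead.

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=3):
    """Return (row index, score) pairs for the k most similar documents."""
    scores = doc_matrix @ query_vec      # dot product equals cosine for unit vectors
    order = np.argsort(-scores)[:k]      # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy unit vectors standing in for document embeddings
docs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
])
query = np.array([0.8, 0.6])

results = top_k(query, docs, k=2)
print(results)  # document 2 ranks first, then document 0
```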

Retrieval-Augmented Generation (RAG)

A user question is embedded, relevant documents are retrieved from a vector database, and those documents are fed to a language model as context. The quality of retrieval depends heavily on the quality of the embeddings.
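The last step, feeding retrieved documents to the model, is mostly string assembly. The sketch below assumes retrieval has already happened and that each chunk is a dict with hypothetical `text` and `url` keys; the prompt wording is only an example.

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and the user question into one prompt string."""
    context = "\n\n".join(
        f"[{i}] (source: {chunk['url']})\n{chunk['text']}"
        for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical retrieval results
chunks = [
    {"url": "https://docs.example.com/install", "text": "Run the installer."},
    {"url": "https://docs.example.com/config", "text": "Edit the config file."},
]
prompt = build_prompt("How do I install it?", chunks)
print(prompt)
```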

Clustering and classification

Documents with similar embeddings can be grouped into clusters without predefined categories. A support ticket system might automatically cluster tickets by topic based on embedding similarity.
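One simple way to group vectors without predefined categories is greedy threshold clustering: each vector joins the first cluster whose centroid is similar enough, otherwise it starts a new cluster. This is an illustrative technique rather than the method of any particular product, and the 0.8 threshold and toy vectors are assumptions.

```python
import numpy as np

def greedy_cluster(vectors, threshold=0.8):
    """Assign each unit vector to the first sufficiently similar cluster."""
    sums = []     # running sum of member vectors, one per cluster
    labels = []
    for vec in vectors:
        for cluster_id, total in enumerate(sums):
            centroid = total / np.linalg.norm(total)
            if float(np.dot(vec, centroid)) >= threshold:
                sums[cluster_id] = total + vec
                labels.append(cluster_id)
                break
        else:
            # No existing cluster was close enough: start a new one
            sums.append(vec.copy())
            labels.append(len(sums) - 1)
    return labels

# Toy vectors standing in for ticket embeddings, normalized to unit length
raw = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
tickets = raw / np.linalg.norm(raw, axis=1, keepdims=True)

labels = greedy_cluster(tickets)
print(labels)  # two groups: the first two tickets and the last two
```

The result depends on input order, so algorithms such as k-means or hierarchical clustering are usually a better fit for larger or noisier datasets.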

Recommendation

Content recommendation engines embed items and user preferences, then suggest items whose embeddings are close to the user’s profile vector.
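A minimal sketch of this idea builds the profile vector as the average of the items a user liked and ranks the remaining items against it. The averaging strategy, function name, and toy vectors are assumptions for illustration.

```python
import numpy as np

def recommend(item_vectors, liked_ids, k=2):
    """Rank items the user has not liked by similarity to their profile vector."""
    profile = item_vectors[liked_ids].mean(axis=0)
    profile = profile / np.linalg.norm(profile)
    scores = item_vectors @ profile
    scores[liked_ids] = -np.inf          # never recommend already-liked items
    order = np.argsort(-scores)[:k]
    return [int(i) for i in order]

# Toy unit vectors standing in for item embeddings
items = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.6, 0.8],
    [0.0, 1.0],
])

suggestions = recommend(items, liked_ids=[0], k=2)
print(suggestions)  # items 1 and 2 are closest to the liked item
```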

Duplicate detection

Two documents with cosine similarity near 1.0 are likely duplicates or near-duplicates, even if their wording differs.
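For a small collection, near-duplicates can be found by comparing every pair of vectors. The sketch below assumes a cutoff of 0.95, which is an arbitrary starting point that should be tuned per model, and uses toy vectors in place of real embeddings.

```python
import numpy as np

def find_near_duplicates(vectors, threshold=0.95):
    """Return index pairs (i, j), i < j, whose cosine similarity meets the threshold."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    rows, cols = np.triu_indices(len(unit), k=1)   # each pair once, no self-pairs
    return [(int(i), int(j)) for i, j in zip(rows, cols) if sims[i, j] >= threshold]

# Toy vectors: the first two point in almost the same direction
doc_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],
    [0.0, 1.0, 0.0],
])

pairs = find_near_duplicates(doc_vectors)
print(pairs)  # only documents 0 and 1 are flagged
```

The pairwise comparison grows quadratically with the number of documents, so larger collections usually rely on a nearest-neighbor index instead.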

Choosing an embedding model

The right model depends on the task:

Task	Recommended approach
General English search	OpenAI text-embedding-3-small or Cohere embed-english-v3
High-precision retrieval	OpenAI text-embedding-3-large or BGE-large
Multilingual content	Cohere embed-multilingual-v3 or E5-multilingual
Cost-sensitive at scale	BGE-small or GTE-base (open-source, self-hosted)
Domain-specific (legal, medical)	Fine-tune a base model on domain text

Key considerations:

Context length - Most embedding models handle 512-8,192 tokens. Long documents must be chunked before embedding.
Language coverage - Models trained primarily on English underperform on other languages.
Task alignment - Some models are optimized for similarity (bi-encoders), others for classification. Match the model to the task.
Latency and cost - API-hosted models charge per token. Self-hosted open-source models trade setup complexity for predictable costs.
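Chunking for the context-length limit can be sketched as a sliding window with overlap. This version approximates tokens with whitespace-separated words, which is an assumption; a real pipeline would count tokens with the embedding model's own tokenizer. The window and overlap sizes are also illustrative.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break   # the last window already reached the end of the text
    return chunks

sample = " ".join(str(i) for i in range(10))
windows = chunk_words(sample, max_words=4, overlap=1)
print(windows)
```

The overlap keeps a sentence that straddles a boundary present in both neighboring chunks, at the cost of embedding some text twice.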

Common embedding pitfalls

Garbage in, garbage out - Noisy, poorly formatted, or irrelevant text produces low-quality vectors
Vocabulary mismatch - Technical jargon, brand names, and neologisms may not be well-represented in the model’s training data
Length truncation - Documents longer than the model’s context window are truncated, losing meaning from the end
Outdated knowledge - Models trained before a term was coined will produce weaker embeddings for that term

How crawler.sh improves embedding quality

crawler.sh produces text that embeds more accurately and retrieves more reliably:

Clean text - HTML-to-Markdown conversion removes navigation, ads, and boilerplate that pollute the semantic signal
Structured content - Preserved headings and paragraphs keep logical sections intact, so each chunk contains coherent meaning
Consistent formatting - Uniform Markdown structure means similar documents embed predictably
Metadata for context - Each chunk carries its source URL and page title, enabling the retrieval system to provide provenance
JavaScript rendering - Dynamically loaded content is captured, ensuring the text that gets embedded is complete

For a RAG pipeline, this means:

# 1. Crawl and export as Markdown
crawler crawl https://docs.example.com --render --output docs.zip

# 2. Chunk by Markdown headings
# 3. Embed each chunk
# 4. Store in vector database with source URL metadata

The result is a vector store where retrieved chunks are semantically coherent, traceable to their source, and free of the HTML noise that degrades embedding quality.
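Step 2, chunking by Markdown headings, might look like the sketch below. It assumes the Markdown for one page is already loaded as a string and that the caller supplies the source URL and page title; it does not read crawler.sh's export format, and the function and field names are hypothetical.

```python
import re

# ATX-style headings: one to six '#' characters followed by text
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_headings(markdown, source_url, page_title):
    """Split Markdown into one chunk per heading, each tagged with its source."""
    sections = []
    current = (page_title, [])   # text before the first heading uses the page title
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append(current)
            current = (match.group(2).strip(), [line])
        else:
            current[1].append(line)
    sections.append(current)

    chunks = []
    for heading, lines in sections:
        text = "\n".join(lines).strip()
        if text:   # skip empty sections
            chunks.append({"heading": heading, "text": text,
                           "url": source_url, "title": page_title})
    return chunks

page = "# Install\nRun the installer.\n\n## Configure\nEdit the config file.\n"
page_chunks = chunk_by_headings(page, "https://docs.example.com/setup", "Setup")
for chunk in page_chunks:
    print(chunk["heading"], "|", chunk["url"])
```

This simple pattern would also match lines starting with `#` inside fenced code blocks, so a fuller implementation would track fences or use a Markdown parser.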