Embedding

What are Embeddings in AI and Machine Learning

Embeddings are dense vectors that capture semantic meaning, enabling machines to compare, search, and cluster content by meaning.

An embedding is a dense vector of floating-point numbers that represents the semantic meaning of a piece of text. Unlike traditional search, which matches keywords, embeddings capture meaning. The sentences “The cat sat on the mat” and “A feline rested on the rug” share no content words, yet their embeddings are close together in vector space because they describe the same idea.

Embeddings are produced by embedding models, which are neural networks trained to convert text into vectors. The key property is that semantically similar texts produce vectors that are mathematically close, while unrelated texts produce vectors that are far apart. This enables semantic search, recommendation, clustering, and classification at scale.

How embeddings are generated

An embedding model reads text and outputs a fixed-length vector. The process works like this:

  1. Tokenization - The text is split into tokens using the model’s tokenizer
  2. Neural encoding - A transformer network processes the tokens and builds contextual representations
  3. Pooling - The token-level representations are combined into a single vector, typically by averaging or using a special CLS token
  4. Normalization - The vector is often normalized to unit length so that similarity can be measured with a simple dot product

The result is a vector such as [0.023, -0.156, 0.891, ...] with dimensions ranging from 384 to 4,096 depending on the model.
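
The pooling and normalization steps can be illustrated with a minimal sketch in plain Python. The token vectors below are made-up stand-ins for the output of a transformer encoder, and mean pooling is only one of the pooling strategies a real model might use.

```python
import math

def mean_pool(token_vectors):
    """Average token-level vectors into one fixed-length vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length (the zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Hypothetical encoder output: three tokens, four dimensions each.
token_vectors = [
    [1.0, 0.0, 2.0, 0.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 0.0, 2.0, 3.0],
]

pooled = mean_pool(token_vectors)   # one vector for the whole text
embedding = l2_normalize(pooled)    # same direction, length 1
print(embedding)
```

After normalization the vector has length 1, which is what allows a plain dot product to stand in for cosine similarity.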

Measuring similarity

Once text is converted to vectors, similarity is computed with mathematical operations:

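  • Cosine similarity - Measures the angle between two vectors. Range: -1 (opposite direction) to 1 (same direction). Values above roughly 0.7 often indicate strong semantic similarity, but score distributions differ between models, so thresholds should be calibrated for the model in use.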
  • Euclidean distance - Measures the straight-line distance between vector endpoints. Smaller values mean closer vectors.
  • Dot product - For unit-length vectors, this is equivalent to cosine similarity. Fast to compute and widely used in vector databases.

import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# embed() is a placeholder for a call to your embedding model.
# The scores in the comments below are illustrative, not measured output.

# Two similar sentences produce vectors with high cosine similarity
vec1 = embed("The bank approved my loan")
vec2 = embed("I received approval for a mortgage from the bank")
print(cosine_similarity(vec1, vec2))
# e.g. 0.89 (high similarity)

# A different sense of "bank" produces a dissimilar vector
vec3 = embed("I sat by the river bank")
print(cosine_similarity(vec1, vec3))
# e.g. 0.42 (low similarity)
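
To make the relationship between the three measures concrete, here is a small standard-library sketch using two hand-picked unit-length vectors (toy values, not real embeddings). For unit vectors, the dot product equals the cosine similarity, and the squared Euclidean distance equals 2 - 2 × (dot product), so all three measures rank neighbors in the same order.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two unit-length toy vectors: 0.6^2 + 0.8^2 = 1.
a = [0.6, 0.8]
b = [0.8, 0.6]

print("dot product:       ", dot(a, b))
print("cosine similarity: ", cosine(a, b))
print("euclidean distance:", euclidean(a, b))
print("2 - 2 * dot:       ", 2 - 2 * dot(a, b), "vs distance squared:", euclidean(a, b) ** 2)
```

The dot product and the cosine similarity both come out at about 0.96, and the last line shows the two sides of the distance identity agreeing up to floating-point rounding.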

Dimensionality

Embedding vectors vary in length depending on the model:

| Model | Dimensions | Notes |
| --- | --- | --- |
| OpenAI text-embedding-3-small | 1,536 | Good balance of quality and cost |
| OpenAI text-embedding-3-large | 3,072 | Higher quality, more expensive |
| Cohere embed-english-v3 | 1,024 | Strong performance on English text |
| BGE-large-en-v1.5 | 1,024 | Open-source leader for retrieval tasks |
| E5-large-v2 | 1,024 | Microsoft model optimized for semantic search |
| GTE-large | 1,024 | General text embeddings, strong on benchmarks |

Higher dimensions can capture finer semantic distinctions, but storage and the cost of each similarity computation grow linearly with vector length. For most applications, 768-1,536 dimensions are sufficient.
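
The storage cost is easy to estimate. The sketch below assumes each value is stored as a 32-bit float (4 bytes) and counts only the raw vectors; index structures and metadata in a real vector database add overhead on top. The function name is illustrative.

```python
def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for the vectors alone, assuming float32 by default."""
    return num_vectors * dimensions * bytes_per_value

for dims in (384, 1024, 1536, 3072):
    gb = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{dims:>5} dims: {gb:.2f} GB per million vectors")
```

Under these assumptions, one million 1,536-dimensional vectors take about 6.14 GB, and doubling the dimension doubles that figure.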

Use cases

Semantic search

Instead of matching keywords, a query is embedded and the database returns the most similar document vectors. This handles synonyms, paraphrases, and conceptually related content that keyword search misses.
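
A brute-force version of this lookup can be sketched in a few lines. The document IDs and vectors are toy values standing in for real unit-length embeddings, and a production vector database would typically use an approximate nearest-neighbor index rather than scoring every document.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (score, doc_id) pairs with the highest dot product."""
    scored = ((dot(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored)

# Toy unit-length vectors; real ones would come from an embedding model.
doc_vecs = {
    "refund-policy": [1.0, 0.0, 0.0],
    "shipping-times": [0.0, 1.0, 0.0],
    "return-an-item": [0.8, 0.6, 0.0],
}
query_vec = [0.6, 0.8, 0.0]  # hypothetical embedding of the user's query

results = top_k(query_vec, doc_vecs)
for score, doc_id in results:
    print(f"{score:.2f}  {doc_id}")
```

Here the query vector is closest to "return-an-item", followed by "shipping-times", even though no keyword matching takes place.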

Retrieval-Augmented Generation (RAG)

A user question is embedded, relevant documents are retrieved from a vector database, and those documents are fed to a language model as context. The quality of retrieval depends heavily on the quality of the embeddings and of the text that was embedded.
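
The final step, handing retrieved text to the language model, might look like the sketch below. The chunk dictionaries, their field names, and the prompt wording are all assumptions made for illustration; retrieval is assumed to have already happened.

```python
def build_prompt(question, retrieved_chunks):
    """Assemble a prompt from retrieved chunks, labelling each with its source."""
    context = "\n\n".join(
        f"[{i}] {chunk['text']} (source: {chunk['url']})"
        for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical retrieval results, most similar first.
chunks = [
    {"text": "Refunds are issued within 14 days.", "url": "https://docs.example.com/refunds"},
    {"text": "Items must be unused to qualify.", "url": "https://docs.example.com/returns"},
]

prompt = build_prompt("How long do refunds take?", chunks)
print(prompt)
```

Numbering the chunks and including each source URL lets the model's answer be traced back to specific documents.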

Clustering and classification

Documents with similar embeddings can be grouped into clusters without predefined categories. A support ticket system might automatically cluster tickets by topic based on embedding similarity.
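
As a rough illustration, the sketch below groups vectors with a simple greedy "leader" scheme: each vector joins the first cluster whose first member is similar enough, otherwise it starts a new cluster. This is a deliberately simple stand-in for algorithms such as k-means or HDBSCAN, and the ticket vectors and the 0.8 threshold are toy assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def greedy_cluster(vectors, threshold=0.8):
    """Group unit-length vectors; each cluster is a list of indices, leader first."""
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            leader = vectors[cluster[0]]
            if dot(vec, leader) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append([i])
    return clusters

# Toy unit-length "ticket embeddings" pointing in two rough directions.
tickets = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.96, 0.28],
    [0.28, 0.96],
]

clusters = greedy_cluster(tickets)
print(clusters)
```

Tickets 0 and 2 end up in one group and tickets 1 and 3 in another, without any categories being defined in advance.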

Recommendation

Content recommendation engines embed items and user preferences, then suggest items whose embeddings are close to the user’s profile vector.
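
One simple way to build a profile vector is to average the embeddings of items the user liked and rank the remaining items against it. The sketch below assumes that approach; the catalog entries and their vectors are toy values, and real systems often weight recent or stronger signals more heavily.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(vector):
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]

def profile_vector(item_vecs):
    """Unit-length mean of the embeddings of the items a user liked."""
    dim = len(item_vecs[0])
    mean = [sum(vec[i] for vec in item_vecs) / len(item_vecs) for i in range(dim)]
    return normalize(mean)

def recommend(catalog, liked_ids, k=2):
    """Rank items the user has not liked yet by similarity to their profile."""
    profile = profile_vector([catalog[item_id] for item_id in liked_ids])
    candidates = [item_id for item_id in catalog if item_id not in liked_ids]
    candidates.sort(key=lambda item_id: dot(profile, catalog[item_id]), reverse=True)
    return candidates[:k]

# Toy unit-length item embeddings.
catalog = {
    "intro-to-rag": [1.0, 0.0],
    "sourdough-basics": [0.0, 1.0],
    "vector-databases": [0.8, 0.6],
    "kitchen-gadgets": [0.6, 0.8],
}

ranked = recommend(catalog, liked_ids=["intro-to-rag", "vector-databases"])
print(ranked)
```

With these toy vectors, "kitchen-gadgets" ranks above "sourdough-basics" because its vector lies closer to the averaged profile.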

Duplicate detection

Two documents with cosine similarity near 1.0 are likely duplicates or near-duplicates, even if their wording differs.
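
A pairwise check along these lines is sketched below. The 0.95 threshold and the page vectors are assumptions chosen for illustration; comparing every pair is quadratic in the number of documents, so large collections usually need a nearest-neighbor index instead.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_near_duplicates(doc_vecs, threshold=0.95):
    """Return ID pairs whose cosine similarity meets the threshold."""
    pairs = combinations(sorted(doc_vecs.items()), 2)
    return [
        (id_a, id_b)
        for (id_a, vec_a), (id_b, vec_b) in pairs
        if cosine(vec_a, vec_b) >= threshold
    ]

# Toy vectors; they need not be unit length because cosine() normalizes.
doc_vecs = {
    "page-a": [2.0, 0.0],
    "page-b": [0.0, 1.0],
    "page-a-copy": [0.96, 0.28],
}

duplicates = find_near_duplicates(doc_vecs)
print(duplicates)
```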

Choosing an embedding model

The right model depends on the task:

| Task | Recommended approach |
| --- | --- |
| General English search | OpenAI text-embedding-3-small or Cohere embed-v3 |
| High-precision retrieval | OpenAI text-embedding-3-large or BGE-large |
| Multilingual content | Cohere embed-multilingual-v3 or E5-multilingual |
| Cost-sensitive at scale | BGE-small or GTE-base (open-source, self-hosted) |
| Domain-specific (legal, medical) | Fine-tune a base model on domain text |

Key considerations:

  • Context length - Most embedding models handle 512-8,192 tokens. Long documents must be chunked before embedding.
  • Language coverage - Models trained primarily on English underperform on other languages
  • Task alignment - Some models are optimized for similarity (bi-encoders), others for classification. Match the model to the task
  • Latency and cost - API-hosted models charge per token. Self-hosted open-source models trade setup complexity for predictable costs
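
For the context-length constraint, a common workaround is to split long documents into overlapping windows. The sketch below operates on a list of tokens and uses whitespace splitting as a rough stand-in for the model's real tokenizer, so actual token counts will differ; the window and overlap sizes are illustrative.

```python
def chunk_tokens(tokens, max_tokens=512, overlap=64):
    """Split a token list into windows of max_tokens that overlap by `overlap`."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Whitespace splitting only approximates a real tokenizer.
text = "one two three four five six seven eight nine ten"
chunks = chunk_tokens(text.split(), max_tokens=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

The overlap repeats a little text at each boundary so that a sentence cut by one window still appears intact in a neighboring one.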

Common embedding pitfalls

  • Garbage in, garbage out - Noisy, poorly formatted, or irrelevant text produces low-quality vectors
  • Vocabulary mismatch - Technical jargon, brand names, and neologisms may not be well-represented in the model’s training data
  • Length truncation - Documents longer than the model’s context window are truncated, losing meaning from the end
  • Outdated knowledge - Models trained before a term was coined will produce weaker embeddings for that term

How crawler.sh improves embedding quality

crawler.sh produces text that embeds more accurately and retrieves more reliably:

  • Clean text - HTML-to-Markdown conversion removes navigation, ads, and boilerplate that pollute the semantic signal
  • Structured content - Preserved headings and paragraphs keep logical sections intact, so each chunk contains coherent meaning
  • Consistent formatting - Uniform Markdown structure means similar documents embed predictably
  • Metadata for context - Each chunk carries its source URL and page title, enabling the retrieval system to provide provenance
  • JavaScript rendering - Dynamically loaded content is captured, ensuring the text that gets embedded is complete

For a RAG pipeline, this means:

# 1. Crawl and export as Markdown
crawler crawl https://docs.example.com --render --output docs.zip
# 2. Chunk by Markdown headings
# 3. Embed each chunk
# 4. Store in vector database with source URL metadata

The result is a vector store where retrieved chunks are semantically coherent, traceable to their source, and free of the HTML noise that degrades embedding quality.
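
Step 2 of the pipeline, chunking by Markdown headings while keeping the source URL, could be sketched as follows. This assumes each page's Markdown and URL have already been loaded from the export (the export layout is not shown here), and the function and field names are illustrative. The simple regular expression would also treat a line starting with "#" inside a fenced code block as a heading.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$")

def chunk_by_headings(markdown, source_url):
    """Split Markdown into one chunk per heading, tagged with its source URL."""
    sections = []          # list of (heading, body_lines)
    current = (None, [])   # text before the first heading has no heading
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append(current)
            current = (match.group(1).strip(), [])
        else:
            current[1].append(line)
    sections.append(current)

    chunks = []
    for heading, body_lines in sections:
        text = "\n".join(body_lines).strip()
        if text:  # skip sections with no body text
            chunks.append({"heading": heading, "text": text, "url": source_url})
    return chunks

page = "# Install\n\nRun the installer.\n\n## Configure\n\nEdit the config file.\n"
page_chunks = chunk_by_headings(page, "https://docs.example.com/setup")
for chunk in page_chunks:
    print(chunk)
```

Each resulting chunk can then be embedded and stored together with its heading and URL, which covers steps 3 and 4.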
