What is RAG in AI Systems

Retrieval-Augmented Generation combines search and language models to produce factual, current answers grounded in source documents.

RAG, or Retrieval-Augmented Generation, is an architecture that combines information retrieval with text generation. Instead of relying solely on what a language model learned during training, a RAG system fetches relevant documents from an external knowledge base and feeds them to the model as context. This makes answers more likely to be factual and current, and lets each answer be traced to specific sources.

The core insight behind RAG is that language models have a knowledge cutoff: they do not know anything published after their training date. They also tend to hallucinate facts. RAG addresses both problems by grounding the model’s output in retrieved documents rather than latent memory, although it does not remove them entirely (see the failure modes below).

How RAG works

A RAG system has two stages:

1. Retrieval

When a user asks a question, the system:

  • Converts the question into an embedding vector using an embedding model
  • Searches a vector database for documents with similar embeddings
  • Returns the top-k most relevant chunks of text
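
# Example: OpenAI embeddings + Pinecone vector database
# Assumes OPENAI_API_KEY and PINECONE_API_KEY are set in the environment and
# that an index named "docs" (a placeholder) already holds chunks with their
# text stored under the "text" metadata key.
import os

import openai
from pinecone import Pinecone

index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("docs")

question = "What are the latest features in Python 3.13?"

# 1. Embed the question
response = openai.embeddings.create(
    model="text-embedding-3-small",
    input=question,
)
question_vector = response.data[0].embedding

# 2. Search the vector database
results = index.query(
    vector=question_vector,
    top_k=5,
    include_metadata=True,
)

# 3. Retrieved chunks become context
context = "\n\n".join(r.metadata["text"] for r in results.matches)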
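
To make the similarity search in the retrieval stage concrete, here is a brute-force sketch that needs no external service. The three-dimensional vectors and the function names `cosine_similarity` and `top_k` are made up for illustration; a real vector database typically uses an approximate nearest-neighbor index rather than scoring every document.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, documents, k=5):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings; real models emit hundreds or thousands of dimensions.
documents = [
    ("Python 3.13 adds an experimental JIT.", [0.9, 0.1, 0.0]),
    ("Sourdough needs a long fermentation.", [0.0, 0.2, 0.9]),
    ("The new REPL supports multiline editing.", [0.8, 0.3, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]

matches = top_k(query_vector, documents, k=2)
context = "\n\n".join(text for _, text in matches)
print(context)
```

With these toy vectors the two Python-related sentences rank above the unrelated one, so only they end up in `context`.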

2. Generation

The retrieved context is inserted into a prompt template, and the language model generates an answer based on that context:

You are a helpful assistant. Use the following context to answer the question.
If the answer is not in the context, say "I don't know."
Context:
{context}
Question: {question}
Answer:

The model sees both the question and the relevant documents, enabling it to cite specific information rather than making up facts.
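
Filling the template can be sketched in a few lines. The template text is the one shown above; the helper name `build_prompt` and the `[1]`, `[2]` numbering of chunks are assumptions added here so the model has something to refer to when it cites a chunk.

```python
PROMPT_TEMPLATE = """You are a helpful assistant. Use the following context to answer the question.
If the answer is not in the context, say "I don't know."
Context:
{context}
Question: {question}
Answer:"""

def build_prompt(question, chunks):
    """Number each retrieved chunk, then fill the template."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Hypothetical retrieved chunks for illustration.
chunks = [
    "Python 3.13 adds an experimental JIT.",
    "The new REPL supports multiline editing.",
]
prompt = build_prompt("What are the latest features in Python 3.13?", chunks)
print(prompt)
```

The resulting string is what gets sent to the generation model; which client library sends it is left open here.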

RAG vs fine-tuning

| RAG | Fine-tuning |
| --- | --- |
| Knowledge stays current (update the document store) | Knowledge is frozen at training time |
| Can cite sources and show provenance | Cannot show where an answer came from |
| Works with any model, no training required | Requires GPU time and dataset curation |
| Cheaper to update (just add documents) | Expensive to retrain when information changes |
| May retrieve irrelevant context | Consistent behavior across queries |
| Best for factual, time-sensitive Q&A | Best for style, tone, and task-specific behavior |

RAG and fine-tuning are not mutually exclusive. Many production systems fine-tune a model for conversational style and then use RAG to supply factual content.

Key components of a RAG system

  • Document store - The corpus of text that can be searched. This might be a company’s internal wiki, a crawled website, or a curated knowledge base.
  • Embedding model - Converts text into dense vectors that capture semantic meaning. Common choices include OpenAI’s text-embedding-3, Cohere embed, or open-source models like BGE and E5.
  • Vector database - Stores document embeddings and performs fast similarity search. Options include Pinecone, Weaviate, Qdrant, Chroma, and pgvector.
  • Chunking strategy - Documents are split into pieces small enough to fit in the context window while preserving coherence.
  • Reranking - A cross-encoder model re-scores the retrieved chunks to improve relevance beyond pure embedding similarity.
  • Generation model - The LLM that synthesizes the final answer from the retrieved context.
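
The reranking step can be sketched as a function that re-scores the retrieved chunks and keeps the best few. The word-overlap scorer below is only a runnable stand-in, not a cross-encoder; in practice `score_fn` would call a cross-encoder model that reads the question and the chunk together. All names here are hypothetical.

```python
def overlap_score(question, chunk):
    """Toy relevance score: fraction of question words that appear in the chunk."""
    question_words = set(question.lower().split())
    chunk_words = set(chunk.lower().split())
    if not question_words:
        return 0.0
    return len(question_words & chunk_words) / len(question_words)

def rerank(question, chunks, score_fn=overlap_score, keep=3):
    """Re-score retrieved chunks and keep the highest-scoring ones."""
    ranked = sorted(chunks, key=lambda chunk: score_fn(question, chunk), reverse=True)
    return ranked[:keep]

# Hypothetical chunks returned by the vector search.
retrieved = [
    "Billing is handled monthly.",
    "To reset your password open the settings page.",
    "Password rules require twelve characters.",
]
best = rerank("how do I reset my password", retrieved, keep=2)
print(best)
```

Keeping the scorer pluggable means the retrieval code does not change when the toy scorer is swapped for a real model.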

Chunking strategies for RAG

How you split documents matters:

  • Fixed-size chunks - Split every N tokens. Simple but may cut sentences in half.
  • Recursive splitting - Split on paragraphs, then sentences, then words. Preserves structure better.
  • Semantic splitting - Use an embedding model to find natural topic boundaries.
  • Heading-based splitting - Split at Markdown or HTML headings so each chunk is a self-contained section.
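
Two of these strategies are short enough to sketch directly. This is a minimal illustration, assuming whitespace-separated words as a rough stand-in for tokens and Markdown `#` headings as section boundaries; the function names are made up for this example.

```python
import re

def fixed_size_chunks(text, size):
    """Split every `size` words; words stand in for tokens here."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def heading_chunks(markdown):
    """Start a new chunk at every Markdown heading line (#, ##, ...)."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]

doc = "# Install\nRun the installer.\n\n## Configure\nEdit the config file.\n"
sections = heading_chunks(doc)
pieces = fixed_size_chunks("one two three four five", size=2)
print(sections)
print(pieces)
```

Note that `heading_chunks` would also split on a `# comment` line inside a code sample, so a real splitter needs to track fenced code blocks.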

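The retrieved chunks, taken together, must fit within the model’s context window while leaving room for the prompt template, the question, and the answer. For a 4,000-token model, chunks of 500-1,000 tokens are typical, which means only a handful of chunks fit into a single prompt.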
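
The budget is simple arithmetic, sketched below. The question, answer, and template sizes are hypothetical numbers chosen for illustration, and `max_chunks` is a made-up helper name.

```python
def max_chunks(context_window, chunk_tokens, question_tokens, answer_tokens, template_tokens=0):
    """How many chunks of a given size fit after reserving room for everything else."""
    available = context_window - question_tokens - answer_tokens - template_tokens
    return max(available // chunk_tokens, 0)

# Hypothetical budget: 4,000-token window, 100-token question,
# 500 tokens reserved for the answer, 50 tokens of template text.
for chunk_tokens in (500, 1000):
    print(chunk_tokens, max_chunks(4000, chunk_tokens, 100, 500, 50))
```

Under these assumptions six 500-token chunks or three 1,000-token chunks fit, so the top-k setting in the retrieval step has to be chosen together with the chunk size.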

Common RAG failure modes

  • Retrieval misses - The correct document is in the database but the embedding search does not find it. Often caused by vocabulary mismatch between the question and the document.
  • Context overload - Too many irrelevant chunks dilute the signal. The model gets confused by contradictory information.
  • Hallucination despite RAG - The model ignores the retrieved context and answers from its training data. This can be mitigated with prompt engineering and instruction tuning.
  • Citation errors - The model claims a fact is from a source that did not actually contain it.
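
Citation errors can be caught, at least crudely, after generation. The sketch below assumes the model was asked to cite chunks as `[1]`, `[2]`, and so on, and flags cited sentences that share few words with the chunk they cite. Word overlap is a rough heuristic chosen to keep the example self-contained: it will miss paraphrases and can pass false claims that reuse the source's vocabulary.

```python
import re

def check_citations(answer, chunks, min_overlap=0.5):
    """Flag cited sentences whose words mostly do not appear in the cited chunk."""
    problems = []
    for sentence, number in re.findall(r"([^.\[\]]+)\s*\[(\d+)\]", answer):
        index = int(number) - 1
        if not 0 <= index < len(chunks):
            problems.append((sentence.strip(), number, "no such source"))
            continue
        words = set(re.findall(r"\w+", sentence.lower()))
        source_words = set(re.findall(r"\w+", chunks[index].lower()))
        overlap = len(words & source_words) / len(words) if words else 0.0
        if overlap < min_overlap:
            problems.append((sentence.strip(), number, "low overlap with source"))
    return problems

# Hypothetical chunks and model answer.
chunks = [
    "The crawler renders JavaScript before extracting content.",
    "Crawls can run entirely on a local machine.",
]
answer = "The crawler renders JavaScript [1]. Pricing starts at ten dollars [2]."
problems = check_citations(answer, chunks)
print(problems)
```

Here the second sentence is flagged because nothing in chunk 2 mentions pricing.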

How crawler.sh powers RAG systems

crawler.sh is an ideal ingestion tool for RAG document stores:

  • Fresh content - Crawl any website to populate the document store with current information. When the source updates, re-crawl and update the embeddings.
  • Clean Markdown - The Markdown extraction produces text that chunks cleanly and embeds well. HTML boilerplate and navigation do not pollute the vector space.
  • Metadata preservation - Each chunk retains its source URL, title, and description, enabling the generation model to cite sources accurately.
  • JavaScript rendering - Dynamic sites that load content via JS are fully captured, ensuring the document store is complete.
  • Local operation - Sensitive internal wikis or documentation can be crawled and processed entirely on-premise before being embedded.

# Crawl a documentation site for RAG ingestion
crawler crawl https://docs.example.com --max-depth 5 --render --output docs.zip

# The extracted Markdown files are ready for chunking and embedding
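
As a starting point for the chunking and embedding step, the Markdown can be read straight out of the archive with the standard library. This sketch assumes `docs.zip` contains UTF-8 `.md` files; the actual archive layout and how metadata such as the source URL is stored are not specified here, so check the real output before relying on it. `load_markdown` is a hypothetical helper name.

```python
import zipfile

def load_markdown(archive_path):
    """Return {file name inside the archive: Markdown text} for every .md entry."""
    documents = {}
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            # Assumption: extracted pages are stored as UTF-8 Markdown files.
            if name.lower().endswith(".md"):
                documents[name] = archive.read(name).decode("utf-8")
    return documents
```

Calling `load_markdown("docs.zip")` would then give a dictionary whose values can be passed to whichever chunking function is in use.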