RAG, or Retrieval-Augmented Generation, is an architecture that combines information retrieval with text generation. Instead of relying solely on what a language model learned during training, a RAG system fetches relevant documents from an external knowledge base and feeds them to the model as context. This makes answers more likely to be factual and current, and lets them be traced to specific sources.
The core insight behind RAG is that language models have a knowledge cutoff: they know nothing published after their training data was collected. They also tend to hallucinate facts. RAG mitigates both problems by grounding the model’s output in retrieved documents rather than latent memory.
How RAG works
A RAG system has two stages:
1. Retrieval
When a user asks a question, the system:
- Converts the question into an embedding vector using an embedding model
- Searches a vector database for documents with similar embeddings
- Returns the top-k most relevant chunks of text
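To make the similarity search concrete, here is a minimal in-memory sketch of what a vector database does conceptually: score every stored chunk against the query vector and keep the best k. The function names and the toy 3-dimensional vectors are illustrative assumptions; production vector databases use approximate nearest-neighbor indexes rather than a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunks: list[tuple[list[float], str]], k: int = 5):
    """Return the k (score, text) pairs most similar to query_vec.

    `chunks` is a list of (embedding, text) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for vec, text in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings" for illustration; real ones have hundreds or thousands of dimensions.
toy_chunks = [
    ([1.0, 0.0, 0.0], "Python 3.13 release notes"),
    ([0.0, 1.0, 0.0], "Office parking policy"),
    ([0.9, 0.1, 0.0], "What's new in Python"),
]
best = top_k([1.0, 0.0, 0.0], toy_chunks, k=2)
```

The unrelated "Office parking policy" chunk points in a different direction from the query, so it scores lowest and is dropped.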
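```python
# Example: OpenAI embedding + Pinecone vector DB
import os

import openai
from pinecone import Pinecone

# Assumes OPENAI_API_KEY and PINECONE_API_KEY are set in the environment.
# "docs" is a placeholder name for an index that already holds your chunks.
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("docs")

question = "What are the latest features in Python 3.13?"

# 1. Embed the question
response = openai.embeddings.create(
    model="text-embedding-3-small",
    input=question,
)
question_vector = response.data[0].embedding

# 2. Search vector database
results = index.query(
    vector=question_vector,
    top_k=5,
    include_metadata=True,
)

# 3. Retrieved chunks become context
context = "\n\n".join(r.metadata["text"] for r in results.matches)
```

2. Generation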
The retrieved context is inserted into a prompt template, and the language model generates an answer based on that context:
```text
You are a helpful assistant. Use the following context to answer the question.
If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {question}
Answer:
```

The model sees both the question and the relevant documents, enabling it to cite specific information rather than making up facts.
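As a rough sketch, the template can be filled in with plain string formatting. The `build_prompt` helper and the `[1]`, `[2]` numbering of chunks are assumptions added here for illustration (numbering gives the model something concrete to cite); they are not part of any particular library.

```python
PROMPT_TEMPLATE = (
    "You are a helpful assistant. Use the following context to answer the question.\n"
    'If the answer is not in the context, say "I don\'t know."\n\n'
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question: str, chunks: list[str]) -> str:
    """Insert numbered chunks and the question into the prompt template."""
    # Numbering the chunks is an illustrative choice, not a requirement of RAG.
    numbered = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(context=numbered, question=question)


prompt = build_prompt(
    "What is RAG?",
    ["RAG combines retrieval with generation.", "Chunks are stored as embeddings."],
)
```

The resulting string is what gets sent to the generation model as its input.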
RAG vs fine-tuning
| RAG | Fine-tuning |
|---|---|
| Knowledge stays current (update the document store) | Knowledge is frozen at training time |
| Can cite sources and show provenance | Cannot show where an answer came from |
| Works with any model, no training required | Requires GPU time and dataset curation |
| Cheaper to update (just add documents) | Expensive to retrain when information changes |
| May retrieve irrelevant context | Consistent behavior across queries |
| Best for factual, time-sensitive Q&A | Best for style, tone, and task-specific behavior |
RAG and fine-tuning are not mutually exclusive. Many production systems fine-tune a model for conversational style and then use RAG to supply factual content.
Key components of a RAG system
- Document store - The corpus of text that can be searched. This might be a company’s internal wiki, a crawled website, or a curated knowledge base.
- Embedding model - Converts text into dense vectors that capture semantic meaning. Common choices include OpenAI’s text-embedding-3, Cohere embed, or open-source models like BGE and E5.
- Vector database - Stores document embeddings and performs fast similarity search. Options include Pinecone, Weaviate, Qdrant, Chroma, and pgvector.
- Chunking strategy - Documents are split into pieces small enough to fit in the context window while preserving coherence.
- Reranking - A cross-encoder model re-scores the retrieved chunks to improve relevance beyond pure embedding similarity.
- Generation model - The LLM that synthesizes the final answer from the retrieved context.
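The reranking component is the least self-explanatory of these, so here is a hedged sketch of where it sits in the pipeline. The scoring function below is a deliberately crude word-overlap stand-in, not a cross-encoder; in a real system you would pass in a function backed by a cross-encoder model instead. All names are hypothetical.

```python
from typing import Callable


def overlap_score(question: str, chunk: str) -> float:
    """Stand-in scorer: fraction of question words that also appear in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & c_words) / len(q_words)


def rerank(
    question: str,
    chunks: list[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    keep: int = 3,
) -> list[str]:
    """Re-score retrieved chunks against the question and keep the best `keep`."""
    ranked = sorted(chunks, key=lambda chunk: score_fn(question, chunk), reverse=True)
    return ranked[:keep]


candidates = [
    "The cafeteria menu changes weekly.",
    "Python 3.13 adds an experimental JIT compiler.",
    "New features in Python 3.13 include a better REPL.",
]
top = rerank("new features in python 3.13", candidates, keep=2)
```

A common pattern is to retrieve a larger candidate set from the vector database and let the reranker cut it down to the few chunks that actually reach the prompt.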
Chunking strategies for RAG
How you split documents matters:
- Fixed-size chunks - Split every N tokens. Simple but may cut sentences in half.
- Recursive splitting - Split on paragraphs, then sentences, then words. Preserves structure better.
- Semantic splitting - Use an embedding model to find natural topic boundaries.
- Heading-based splitting - Split at Markdown or HTML headings so each chunk is a self-contained section.
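Two of these strategies are simple enough to sketch directly. The code below is an illustration under stated assumptions: it counts whitespace-separated words as a rough proxy for tokens, the optional `overlap` parameter is an addition not described above, and the heading splitter only recognizes Markdown `#` headings.

```python
import re


def fixed_size_chunks(text: str, size: int = 200, overlap: int = 0) -> list[str]:
    """Split text into windows of `size` words, sharing `overlap` words between windows.

    Words stand in for tokens here; a real pipeline would count tokens with
    the tokenizer of its embedding model.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Matches lines that start with one to six '#' characters followed by text.
# Caveat: a '#' comment inside a fenced code block would also match.
HEADING = re.compile(r"^#{1,6}\s+\S", re.MULTILINE)


def split_on_headings(markdown: str) -> list[str]:
    """Split Markdown so that each chunk starts at a heading."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    bounds = starts + [len(markdown)]
    sections = (markdown[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [section for section in sections if section]


windows = fixed_size_chunks("a b c d e f g h i j", size=4, overlap=1)
sections = split_on_headings("Intro\n\n# A\ntext a\n\n## B\ntext b\n")
```

Fixed-size windows ignore sentence boundaries, which is the trade-off noted above; the heading splitter keeps sections whole but can produce chunks of very uneven length.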
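Chunks must be small enough that the top-k retrieved chunks together fit within the model’s context window, leaving room for the prompt template, the question, and the answer. For a model with a 4,000-token window, chunks of 500-1,000 tokens are typical, which means only a few chunks fit per query.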
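The arithmetic behind that budget can be sketched as follows. The helper name and the specific reservations for the question, answer, and template (100, 500, and 50 tokens) are illustrative assumptions, not recommendations.

```python
def max_chunks(
    context_window: int,
    chunk_tokens: int,
    question_tokens: int,
    answer_tokens: int,
    template_tokens: int = 0,
) -> int:
    """How many chunks of `chunk_tokens` fit after reserving room for everything else."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    available = context_window - question_tokens - answer_tokens - template_tokens
    return max(available // chunk_tokens, 0)


# 4,000-token window, reserving 100 for the question, 500 for the answer,
# and 50 for the prompt template (all assumed figures).
small = max_chunks(4000, 500, question_tokens=100, answer_tokens=500, template_tokens=50)
large = max_chunks(4000, 1000, question_tokens=100, answer_tokens=500, template_tokens=50)
```

Under these assumptions, six 500-token chunks fit but only three 1,000-token chunks do, so chunk size and top-k have to be chosen together.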
Common RAG failure modes
- Retrieval misses - The correct document is in the database but the embedding search does not find it. Often caused by vocabulary mismatch between the question and the document.
- Context overload - Too many irrelevant chunks dilute the signal. The model gets confused by contradictory information.
- Hallucination despite RAG - The model ignores the retrieved context and answers from its training data. This can be mitigated with prompt engineering and instruction tuning.
- Citation errors - The model claims a fact is from a source that did not actually contain it.
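Some citation errors can be caught mechanically. The sketch below assumes the prompt asked the model to cite sources as `[1]`, `[2]`, and so on (an assumption, since the citation format is not specified here) and flags any citation number that does not correspond to a retrieved chunk. It does not check whether a validly numbered source actually supports the claim; that needs a separate verification step.

```python
import re

# Matches bracketed integer citations such as [1] or [12].
CITATION = re.compile(r"\[(\d+)\]")


def invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return cited source numbers that do not match any retrieved chunk."""
    cited = {int(n) for n in CITATION.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)


answer = "Python 3.13 ships a new REPL [1] and an experimental JIT [4]."
bad = invalid_citations(answer, num_sources=3)
```

Here only three chunks were retrieved, so the reference to `[4]` is flagged.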
How crawler.sh powers RAG systems
crawler.sh is an ideal ingestion tool for RAG document stores:
- Fresh content - Crawl any website to populate the document store with current information. When the source updates, re-crawl and update the embeddings.
- Clean Markdown - The Markdown extraction produces text that chunks cleanly and embeds well. HTML boilerplate and navigation do not pollute the vector space.
- Metadata preservation - Each chunk retains its source URL, title, and description, enabling the generation model to cite sources accurately.
- JavaScript rendering - Dynamic sites that load content via JS are fully captured, ensuring the document store is complete.
- Local operation - Sensitive internal wikis or documentation can be crawled and processed entirely on-premise before being embedded.
```bash
# Crawl a documentation site for RAG ingestion
crawler crawl https://docs.example.com --max-depth 5 --render --output docs.zip

# The extracted Markdown files are ready for chunking and embedding
```
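A possible next step is loading the crawled Markdown into Python for chunking. This sketch assumes the output archive contains `.md` files; the exact layout of `docs.zip` is not documented here, so treat the file names below as hypothetical. The demo builds a tiny in-memory archive so the code runs without a real crawl.

```python
import io
import zipfile


def read_markdown_files(archive) -> dict[str, str]:
    """Map each `.md` entry in a zip archive to its decoded text.

    `archive` may be a filesystem path or a binary file-like object.
    """
    docs = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".md"):
                docs[name] = zf.read(name).decode("utf-8", errors="replace")
    return docs


# Demo archive with made-up entries; with a real crawl you would call
# read_markdown_files("docs.zip") instead.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("guide/intro.md", "# Intro\nHello")
    zf.writestr("assets/logo.png", b"\x89PNG")
buffer.seek(0)

docs = read_markdown_files(buffer)
```

Each value in `docs` can then be passed to a chunking function, with the entry name kept alongside the chunks as source metadata.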