A comprehensive guide to optimizing Retrieval-Augmented Generation (RAG) systems for production use.
Core Techniques
Chunking Strategy
Getting chunk size right is critical. Too small = loss of context. Too large = noisy retrieval.
Key Principles:
- Chunk based on logical units (sentences, paragraphs, sections) rather than arbitrary character counts
- Always add overlap between chunks (10-20%) to avoid losing important context at boundaries
- Consider token-based chunking using your model's tokenizer, but this locks you into that tokenizer
- Small chunks = more diverse results but higher computational overhead (more embeddings to store and search)
- Recursive chunking works well: start with large units, break down progressively until you fit context limits
Chunking Options:
- Word-based chunking
- Sentence-based chunking
- Paragraph-based chunking
- Section-based chunking
- Token-based chunking (using model's tokenizer)
Important Considerations:
- Chunk size should not exceed max context length of the generator or embedding model
- Small chunks provide more diverse information (can fit more chunks in context)
- However, small chunks can cause loss of information
- Small chunks increase computational overhead (need to generate and store more embedding vectors)
- If you use token-based chunking with a specific tokenizer, changing models later requires re-indexing all documents
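As a concrete illustration, here is a minimal sketch of sentence-based chunking with overlap; the regex splitter, 1000-character limit, and two-sentence overlap are illustrative assumptions, not recommended values.

# Minimal sketch: sentence-based chunking with a small sentence overlap
import re

def chunk_by_sentences(text, max_chars=1000, overlap_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        current.append(sentence)
        if sum(len(s) for s in current) >= max_chars:
            chunks.append(" ".join(current))
            # Carry the last few sentences over so boundary context isn't lost
            current = current[-overlap_sentences:]
    if current:
        chunks.append(" ".join(current))
    return chunks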
Query Rewriting
User queries are often incomplete, especially in multi-turn conversations.
The Problem:
Example conversation:
User: When was the last time John Doe bought something from us?
AI: John bought a Fruity Fedora on January 3, 2030.
User: How about Emily Doe?
The last query will fail retrieval because it lacks context. It should be rewritten to "When was the last time Emily Doe bought something from us?"
Solution:
- Use LLMs to rewrite queries as self-contained questions
- Each query should contain all necessary context, even in multi-turn conversations
- The rewritten query should be semantically complete and understandable without prior context
Important Warning:
- Be careful of hallucinations when using AI models for query rewriting
- Validate that rewrites make logical sense
- Consider implementing checks to ensure the rewritten query preserves the user's intent
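A minimal sketch of LLM-based rewriting is shown below; llm_complete is a hypothetical helper wrapping whatever chat model you use, and the prompt wording is only a starting point.

REWRITE_PROMPT = (
    "Given the conversation history and the latest user question, rewrite the question "
    "so it is fully self-contained. Preserve the user's intent and do not add new facts.\n\n"
    "History:\n{history}\n\nLatest question: {question}\n\nSelf-contained question:"
)

def rewrite_query(history, question):
    # llm_complete is a hypothetical wrapper around your LLM provider's API
    return llm_complete(REWRITE_PROMPT.format(history=history, question=question)).strip()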
Reranking
The two-stage retrieval paradigm: cheap retrieval → expensive reranking.
Why Reranking?
Especially useful when you want to:
- Reduce the number of retrieved documents
- Reduce the number of input tokens to your LLM
- Improve the precision of your final result set
Common Approach: Hybrid Retrieval System
- First stage: Fetch candidates with a cheap retriever
- Use BM25 + vector search to cast a wide net
- Term-based retrieval catches exact keyword matches
- Embedding-based retrieval handles semantic similarity
- Second stage: Rerank with a better model
- Use cross-encoder models for precision
- More computationally expensive but much more accurate
- Only applied to the smaller candidate set
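For the second stage, here is a minimal sketch using the sentence-transformers CrossEncoder API; the checkpoint name is one publicly available MS MARCO model, and the candidate list is assumed to come from the cheap first-stage retriever.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=5):
    # Score each (query, document) pair with the cross-encoder, keep the best ones
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]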
Time-Based Reranking:
Documents can also be reranked based on time, giving higher weight to more recent items.
This is critical for time-sensitive applications:
- Email systems
- Stock market data
- News aggregation
- Real-time monitoring systems
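A minimal sketch of recency weighting is shown below; the exponential decay and the 30-day half-life are arbitrary illustrative choices for blending freshness with relevance.

import math, time

def recency_weighted_score(relevance, doc_timestamp, half_life_days=30.0, recency_weight=0.3):
    # A document loses half of its recency boost every half_life_days
    age_days = (time.time() - doc_timestamp) / 86400
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    return (1 - recency_weight) * relevance + recency_weight * recency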
Benefits:
- Reduces input tokens to your LLM (significant cost savings!)
- Improves relevance of final result set
- Allows you to use cheaper retrievers for initial candidate generation
Contextual Retrieval
This one's a game-changer. Chunks often lack context needed for accurate retrieval.
The Problem:
A chunk about "Q3 revenue increased 15%" is useless without knowing:
- Which company?
- Which year?
- Which product line or division?
Anthropic's Solution:
- For each chunk, use an LLM to generate a brief context (50-100 tokens) explaining the chunk's relationship to the overall document
- Prepend this context to the chunk before embedding
- Now the chunk becomes self-contained and retrievable
Example Transformation:
Before: "Q3 revenue increased 15%"
After: "This chunk is from Acme Corp's 2024 annual report, discussing the cloud services division. Q3 revenue increased 15%"
Anthropic's Prompt Template:
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within
the overall document for the purposes of improving search retrieval.
Answer only with the succinct context and nothing else.
Additional Augmentation Strategies:
You can also augment chunks with:
- Metadata: tags, keywords, timestamps
- Product information: descriptions and reviews
- Media context: image/video captions and titles
- Expected user questions: what users might ask about this content
- Special identifiers: error codes, product IDs, reference numbers
- Chunk-specific terms: technical jargon or domain-specific terminology
The Process:
Document Corpus → Split into chunks →
Generate context for each chunk →
Prepend context to chunk →
Embed augmented chunks →
Store in vector database
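A minimal sketch of this pipeline, reusing Anthropic's prompt template; generate_context, embed, and store are hypothetical helpers wrapping your LLM, embedding model, and vector database.

CONTEXT_PROMPT = (
    "<document>\n{whole_document}\n</document>\n"
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>\n{chunk_content}\n</chunk>\n"
    "Please give a short succinct context to situate this chunk within the overall document "
    "for the purposes of improving search retrieval. Answer only with the succinct context and nothing else."
)

def index_with_context(document, chunks):
    for i, chunk in enumerate(chunks):
        # generate_context, embed, and store are hypothetical helpers
        context = generate_context(CONTEXT_PROMPT.format(whole_document=document, chunk_content=chunk))
        augmented = f"{context}\n{chunk}"
        store(doc_id=i, vector=embed(augmented), text=augmented)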
Benefits:
- Dramatically improves retrieval precision
- Makes chunks self-contained and understandable
- Reduces false positives from semantic search
- Particularly effective for technical documents, reports, and structured data
Figure: Contextual Retrieval Pipeline - Document chunks are augmented with context before embedding
Advanced Techniques
Hybrid Search
What it is:
Combining dense (vector-based) retrieval with sparse (keyword-based) retrieval. This is arguably the most impactful optimization missing from basic RAG implementations.
How it works:
- Dense retrieval (Vector search): Uses embedding models to find semantically similar content
- Great for: conceptual queries, synonyms, paraphrasing
- Weakness: misses exact keyword matches
- Sparse retrieval (BM25/keyword search): Traditional keyword matching algorithms
- Great for: exact terms, product codes, names, technical jargon
- Weakness: misses semantic similarity
- Hybrid approach: Combine both methods and merge results
- Each method scores documents independently
- Final ranking uses weighted combination of both scores
- Typically: 70% vector + 30% BM25, but tune for your use case
Step-by-step process:
- Chunk the knowledge base: Break down documents into smaller chunks (typically a few hundred tokens each)
- Create dual encodings: Generate both TF-IDF encodings and semantic embeddings for each chunk
- BM25 retrieval: Use BM25 algorithm to find top chunks based on exact keyword matches
- Vector retrieval: Use embeddings to find top chunks based on semantic similarity
- Rank fusion: Combine and deduplicate results from both methods using rank fusion techniques (e.g., Reciprocal Rank Fusion; see the sketch after this list)
- Context augmentation: Add the top-K chunks to the prompt to generate the final response
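A minimal sketch of Reciprocal Rank Fusion for the rank fusion step; k=60 is the constant commonly used in the RRF literature, and the inputs are assumed to be best-first lists of document IDs.

def reciprocal_rank_fusion(ranked_lists, k=60):
    # ranked_lists: e.g. [bm25_doc_ids, vector_doc_ids], each ordered best-first
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)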
Why it's critical:
Almost every production RAG system uses hybrid search because:
- Covers both semantic and lexical matching
- Handles edge cases better (product SKUs, error codes, exact names)
- More robust to different query types
- Empirically shows 15-30% improvement in retrieval accuracy
Implementation:
# Pseudocode: vector_search and bm25_search are assumed to return
# {doc_id: score} dicts with scores normalized to a comparable range
vector_results = vector_search(query_embedding, top_k=20)
bm25_results = bm25_search(query_text, top_k=20)

# Merge and rerank with weighted score fusion
def merge_with_weights(vector_results, bm25_results, vector_weight=0.7, bm25_weight=0.3):
    combined = {}
    for doc_id, score in vector_results.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + vector_weight * score
    for doc_id, score in bm25_results.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + bm25_weight * score
    return sorted(combined, key=combined.get, reverse=True)

combined_results = merge_with_weights(vector_results, bm25_results)
Metadata Filtering
What it is:
Pre-filtering documents by metadata attributes before performing vector search to narrow the search space.
How it works:
Instead of searching your entire vector database, first filter by metadata:
- Date ranges (e.g., "only documents from 2024")
- Source (e.g., "only from engineering documentation")
- Category (e.g., "only policy documents")
- Author, department, tags, etc.
Then perform vector search only on the filtered subset.
Example:
Query: "What's our refund policy for enterprise customers?"
Without metadata filtering:
- Search all 100,000 documents
- Get mixed results (consumer policies, internal docs, marketing material)
With metadata filtering:
- Filter: category="policy" AND customer_type="enterprise"
- Search only 500 relevant documents
- Much higher precision
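As a minimal sketch with a metadata-aware vector store (Chroma here; the collection name and metadata fields are assumptions taken from the example above):

import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("policies")

# Pre-filter by metadata, then run vector search only on the matching subset
results = collection.query(
    query_texts=["What's our refund policy for enterprise customers?"],
    n_results=5,
    where={"$and": [
        {"category": {"$eq": "policy"}},
        {"customer_type": {"$eq": "enterprise"}},
    ]},
)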
Benefits:
- Dramatically reduces search space
- Improves retrieval speed
- Increases precision by eliminating irrelevant documents
- Lower computational costs
When to use:
- Multi-tenant systems (filter by customer/organization)
- Time-sensitive data (filter by date)
- Multi-domain corpora (filter by domain/category)
- Access control requirements (filter by permissions)
Parent-Child / Hierarchical Retrieval
What it is:
Embed small chunks for precision, but return larger parent chunks to the LLM for better context.
The Problem:
- Small chunks = precise retrieval but insufficient context for generation
- Large chunks = good context but imprecise retrieval (noisy results)
The Solution:
- Create a hierarchy: large parent chunks and smaller child chunks
- Embed and index the small child chunks for retrieval
- When a child chunk is retrieved, return the parent chunk to the LLM
Example Structure:
Document: "Product Manual"
├── Parent Chunk 1: "Installation Guide" (2000 tokens)
│   ├── Child Chunk 1.1: "Prerequisites" (200 tokens) ← embed this
│   ├── Child Chunk 1.2: "Step-by-step instructions" (200 tokens) ← embed this
│   └── Child Chunk 1.3: "Troubleshooting" (200 tokens) ← embed this
└── Parent Chunk 2: "Configuration Guide" (2000 tokens)
    ├── Child Chunk 2.1: "Basic settings" (200 tokens) ← embed this
    └── Child Chunk 2.2: "Advanced settings" (200 tokens) ← embed this
Retrieval Flow:
- User query: "How do I fix installation errors?"
- Vector search finds Child Chunk 1.3: "Troubleshooting" (high precision)
- System returns Parent Chunk 1: "Installation Guide" to the LLM (full context)
- LLM has both the relevant section AND surrounding context
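A minimal sketch of the child-to-parent lookup, assuming each indexed child chunk carries a parent_id in its metadata and that vector_search and parent_store are your own retriever and document store.

def retrieve_with_parents(query, top_k=3):
    # 1. Search over the small, precisely scoped child chunks
    child_hits = vector_search(query, top_k=top_k)
    # 2. Map children back to their parents and deduplicate
    parent_ids = {hit.metadata["parent_id"] for hit in child_hits}
    # 3. Return the larger parent chunks to the LLM for full context
    return [parent_store[pid] for pid in parent_ids]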
Benefits:
- Best of both worlds: precise retrieval + rich context
- LLM gets proper context for better generation
- Reduces hallucination from insufficient context
- Maintains semantic coherence
HyDE (Hypothetical Document Embeddings)
What it is:
Generate a hypothetical answer first, then embed that for retrieval instead of the original query. This bridges the query-document semantic gap.
The Problem:
Questions and answers use different vocabulary and structure:
- Query: "How do I fix CUDA out of memory errors?"
- Documentation: "To resolve OOM issues, reduce batch size, enable gradient accumulation, use mixed precision training..."
These have different semantic representations despite being related.
How HyDE works:
- User submits query: "How do I fix CUDA out of memory errors?"
- Ask LLM to generate a hypothetical answer (even if it hallucinates):
"To fix CUDA OOM errors, you should reduce batch size, use gradient accumulation, enable mixed precision training with torch.cuda.amp, clear cache with torch.cuda.empty_cache()..." - Embed this hypothetical answer
- Search using the hypothetical answer's embedding
- The real documentation is semantically closer to this "answer" than to the original question
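A minimal sketch of HyDE, again with hypothetical llm_complete, embed, and vector_search_by_embedding helpers standing in for your LLM, embedding model, and vector store.

HYDE_PROMPT = "Write a short passage that directly answers the question:\n{question}\n\nPassage:"

def hyde_search(question, top_k=5):
    # 1. Let the LLM produce a plausible answer; factual accuracy doesn't matter here
    hypothetical_answer = llm_complete(HYDE_PROMPT.format(question=question))
    # 2. Embed the hypothetical answer instead of the raw question
    answer_embedding = embed(hypothetical_answer)
    # 3. Retrieve real documents that sit near the answer in embedding space
    return vector_search_by_embedding(answer_embedding, top_k=top_k)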
Why it works:
- Answers and documents exist in similar semantic space
- Questions exist in a different semantic space
- By converting query → hypothetical answer, we bridge the gap
- The hypothetical answer doesn't need to be factually correct; it just needs to be semantically similar to real answers
When to use:
- Technical documentation queries
- How-to questions
- Queries where the semantic gap between question and answer is large
- Domains with specialized vocabulary
Caution:
- Adds extra LLM call (latency + cost)
- Works best with high-quality instruction-following models
- May not help for factual lookup queries ("What is the capital of France?")
Multi-Query Retrieval
What it is:
Generate multiple variations of the user's query and retrieve documents for each variation, then merge the results.
How it works:
- Original query: "Best practices for API security"
- Generate variations:
- "How to secure REST APIs"
- "API authentication and authorization methods"
- "Preventing API vulnerabilities"
- "API security design patterns"
- Retrieve documents for each variation
- Merge and deduplicate results
- Optionally rerank the merged set
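A minimal sketch, using a hypothetical llm_complete to generate variations and vector_search for retrieval; deduplication assumes each result exposes an id.

VARIATION_PROMPT = (
    "Generate {n} different search queries that capture the intent of: {query}\n"
    "Return one query per line."
)

def multi_query_retrieve(query, n_variations=4, top_k=5):
    variations = llm_complete(VARIATION_PROMPT.format(n=n_variations, query=query)).splitlines()
    seen, merged = set(), []
    for q in [query] + [v.strip() for v in variations if v.strip()]:
        for doc in vector_search(q, top_k=top_k):
            if doc.id not in seen:  # deduplicate across variations
                seen.add(doc.id)
                merged.append(doc)
    return merged  # optionally rerank this merged set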
Difference from Query Rewriting:
- Query Rewriting: 1 query β 1 better query
- Multi-Query: 1 query β multiple query variations β merged results
Benefits:
- Catches documents that a single query phrasing might miss
- More comprehensive coverage
- Reduces dependency on exact query phrasing
- Handles ambiguous queries better
Implementation Strategies:
- Use LLM to generate variations
- Use pre-defined templates for common query types
- Generate variations based on synonyms and related terms
Trade-offs:
- More retrieval calls = higher latency
- More documents to process and rerank
- Diminishing returns after 3-5 query variations
Self-RAG / Corrective RAG
What it is:
The system evaluates whether retrieved documents are actually relevant to the query and can take corrective actions (re-retrieve, try different strategy, or fall back gracefully).
The Problem:
Traditional RAG blindly trusts retrieved documents:
- Retrieved docs might be irrelevant
- Retrieved docs might be outdated
- Retrieved docs might contradict each other
- No docs might be available
How Self-RAG works:
1. Retrieve documents
2. LLM evaluates each document:
- Is this relevant to the query?
- Does this support or contradict other documents?
- Is this information sufficient to answer?
3. Decision branches:
a. All docs relevant → Proceed to generate answer
b. Partially relevant → Retrieve more documents
c. Not relevant → Try different retrieval strategy or web search
d. No docs available → Answer from parametric knowledge or admit uncertainty
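A minimal sketch of the evaluation step (step 2), asking a hypothetical llm_complete helper for a yes/no relevance verdict per document; the decision branches above then operate on the two buckets.

GRADE_PROMPT = (
    "Query: {query}\nDocument: {document}\n"
    "Does this document contain information relevant to answering the query? Answer yes or no."
)

def grade_documents(query, documents):
    relevant, irrelevant = [], []
    for doc in documents:
        verdict = llm_complete(GRADE_PROMPT.format(query=query, document=doc)).strip().lower()
        (relevant if verdict.startswith("yes") else irrelevant).append(doc)
    return relevant, irrelevant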
Example Flow:
Query: "Latest features in Python 3.12"
Step 1: Initial retrieval returns docs about Python 3.10
Step 2: Self-RAG detects version mismatch
"Query asks for 3.12, but retrieved docs are about 3.10"
Step 3: Corrective action: Trigger web search for current information
Step 4: Verify new results match query requirements
Step 5: Generate answer with appropriate sources
Corrective RAG Variant:
Focuses specifically on correction strategies:
- If retrieval quality is low → reformulate query and retry
- If retrieved docs are contradictory → retrieve more for disambiguation
- If no relevant docs found → expand search scope or use external sources
Implementation Considerations:
- Add relevance scoring step after retrieval
- Define thresholds for "good enough" relevance
- Implement fallback strategies (web search, admit uncertainty)
- Track retrieval quality metrics
Benefits:
- Prevents hallucination from irrelevant context
- More robust and reliable RAG system
- Handles edge cases gracefully
- Improves user trust through transparency
When to implement:
- Production systems where accuracy is critical
- Domains where information changes frequently
- Multi-source retrieval systems
- Applications where wrong answers have consequences
Graph RAG
What it is:
Use knowledge graphs to capture entity relationships that vector search misses. Instead of treating documents as isolated chunks, build a graph of interconnected entities and relationships.
The Problem with Vector Search:
Vector search finds semantically similar text, but it misses explicit relationships:
- "John worked on Project X" (in document A)
- "Project X failed in 2023" (in document B)
- Vector search might miss the connection between John and the project failure
How Graph RAG works:
- Entity Extraction: Extract entities (people, places, organizations, concepts) from documents
- Relationship Mapping: Identify relationships between entities
- Graph Construction: Build knowledge graph with entities as nodes and relationships as edges
- Hybrid Retrieval:
- Use vector search for semantic similarity
- Use graph traversal for relationship queries
- Combine both for comprehensive retrieval
Example Graph Structure:
[John] --works_on--> [Project X] --failed_in--> [2023]
[Project X] --belongs_to--> [Cloud Division]
[Cloud Division] --part_of--> [Acme Corp]
Query Examples:
"What projects did John work on that failed?"
- Graph traversal: John → works_on → Project X → failed_in → 2023
- Returns: "Project X failed in 2023, John worked on it"
Benefits:
- Captures "who did what, when, where" relationships
- Handles multi-hop reasoning (John → Project → Division → Company)
- Answers relationship queries that vector search can't
- Provides explainable retrieval paths
Use Cases:
- Research papers (author → paper → cites → paper)
- Corporate knowledge (employee → project → department → company)
- Legal documents (case → cites → statute → applies_to → situation)
- Medical records (patient → condition → treatment → outcome)
Implementation:
- Use graph databases (Neo4j, Amazon Neptune)
- Extract entities with NER models
- Extract relationships with relation extraction models
- Combine graph traversal with vector search results
Challenges:
- More complex to implement and maintain
- Requires entity extraction and relationship mapping
- Graph quality depends on extraction accuracy
- Higher computational overhead
When to use:
- Complex domains with rich entity relationships
- When "how things connect" matters as much as "what's similar"
- Investigative or analytical use cases
- Knowledge management systems
Agentic RAG
What it is:
Treat RAG as a multi-step decision-making process where an LLM agent decides how to retrieve information, rather than using a single fixed retrieval pipeline.
The Problem:
Different queries need different retrieval strategies:
- "What were our Q3 sales?" β SQL database query
- "Explain our refund policy" β Vector search on policy documents
- "Latest news about competitor X" β Web search
- "Show me code examples for authentication" β Code search
One-size-fits-all retrieval doesn't work for diverse queries.
How Agentic RAG works:
- Query Analysis: Agent analyzes the user's query to understand intent
- Strategy Selection: Agent decides which tools/retrievers to use
- Multi-Step Retrieval: Agent can chain multiple retrievals
- Self-Evaluation: Agent evaluates if retrieved information is sufficient
- Adaptive Refinement: Agent can try different approaches if first attempt fails
Example Flow:
Query: "Compare our Q3 revenue vs competitors in the cloud market"
Agent reasoning:
1. "Need internal Q3 revenue" β Query SQL database
2. "Need competitor revenue" β Web search
3. "Need cloud market context" β Vector search internal reports
4. Combine all three sources β Generate comparative analysis
Common Patterns:
ReAct (Reasoning + Acting):
Thought: I need to find the user's order history
Action: query_database(user_id=123, table="orders")
Observation: Found 15 orders
Thought: User asked about recent orders, I should filter
Action: filter_orders(timeframe="last_30_days")
Observation: 3 orders in last 30 days
Thought: Now I can answer
Answer: You placed 3 orders in the last 30 days...
Tool Selection:
- Classify query intent (factual, analytical, procedural, transactional)
- Route to appropriate tools:
- Factual → Vector search
- Analytical → SQL + Vector search
- Procedural → Documentation search
- Transactional → API calls
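A minimal sketch of this intent-based routing; classify_intent and the individual tools (vector_search, sql_search, docs_search, call_api) are hypothetical helpers.

def route_query(query):
    # classify_intent is a hypothetical LLM-backed classifier returning one of the intents below
    intent = classify_intent(query)
    tools = {
        "factual": vector_search,
        "analytical": lambda q: sql_search(q) + vector_search(q),
        "procedural": docs_search,
        "transactional": call_api,
    }
    return tools.get(intent, vector_search)(query)  # default to vector search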
Implementation:
Often built with frameworks like:
- LangGraph (state machines for multi-step retrieval)
- LangChain Agents
- Custom agent loops with tool calling
Benefits:
- Handles complex, multi-faceted queries
- Adapts to query requirements dynamically
- Can combine multiple data sources intelligently
- More robust to diverse query types
Challenges:
- Higher latency (multiple LLM calls)
- More complex to debug
- Potential for infinite loops or excessive tool use
- Requires careful prompt engineering for agent reasoning
When to use:
- Complex applications with diverse data sources
- When queries require multi-step reasoning
- Enterprise systems with SQL, vector DB, web search, APIs
- Conversational interfaces that need context-aware retrieval
Fine-Tuning Embedding Models
What it is:
Train domain-specific embedding models on your specialized corpus instead of using general-purpose embeddings.
The Problem:
General-purpose embedding models (OpenAI, Cohere, etc.) are trained on broad internet data. They might not capture nuances in your specific domain:
- Medical terminology and relationships
- Legal language and precedents
- Corporate jargon and acronyms
- Technical documentation conventions
How it works:
- Collect training data:
- Positive pairs: (query, relevant document)
- Negative pairs: (query, irrelevant document)
- Can generate synthetically using LLMs
- Fine-tune embedding model:
- Start with pre-trained model (e.g., sentence-transformers)
- Continue training on domain-specific pairs
- Optimize for your retrieval task
- Deploy fine-tuned model:
- Use for encoding both queries and documents
- Store embeddings in vector database
Benefits:
- Significantly better retrieval accuracy in specialized domains
- Captures domain-specific semantic relationships
- Better handling of jargon and terminology
- Can optimize for your specific retrieval patterns
Trade-offs:
- Expensive: requires labeled data and training resources
- Maintenance: need to retrain as domain evolves
- Deployment: need to host custom model
- May lose some general knowledge from base model
When to use:
- Highly specialized domains (medical, legal, scientific)
- When retrieval accuracy is critical and worth the investment
- Large-scale systems where marginal improvements have big impact
- When you have sufficient training data (thousands of examples)
Data Generation Strategy:
Use LLMs to generate synthetic training data:
For each document chunk:
1. Generate 3-5 queries that this chunk should answer
2. Generate 3-5 queries that this chunk should NOT answer
3. Use as positive and negative training pairs
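A minimal sketch of the fine-tuning step with sentence-transformers and MultipleNegativesRankingLoss; the base model, batch size, and example pair are placeholders, and in-batch items serve as negatives.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# (query, relevant passage) pairs from your domain, possibly generated synthetically
train_examples = [
    InputExample(texts=["How do I reset my router?", "To reset the router, hold the button for 10 seconds."]),
    # ... thousands more pairs
]

model = SentenceTransformer("all-MiniLM-L6-v2")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("domain-embedding-model")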
Quick Reference
Start Here (Highest ROI):
- Hybrid Search (dense + sparse retrieval)
- Contextual Retrieval (augment chunks with context)
- Reranking (two-stage retrieval)
Next Level:
- Query Rewriting (handle multi-turn conversations)
- Metadata Filtering (narrow search space)
- Parent-Child Retrieval (precision + context)
Advanced (When Needed):
- HyDE (bridge query-document gap)
- Multi-Query Retrieval (comprehensive coverage)
- Self-RAG / Corrective RAG (robust production systems)
- Agentic RAG (complex multi-source queries)
- Graph RAG (relationship-heavy domains)
- Fine-tuned Embeddings (specialized domains)
Measuring Success
Always measure retrieval quality before optimizing generation:
Key Metrics:
- Recall@K: What % of relevant documents are in top K results?
- Precision@K: What % of top K results are actually relevant?
- MRR (Mean Reciprocal Rank): Average 1/rank of first relevant result
- NDCG (Normalized Discounted Cumulative Gain): Quality of ranking
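Minimal sketches of Recall@K and MRR, assuming relevant is the set of ground-truth document IDs for a query and retrieved is the ranked list of IDs your system returned.

def recall_at_k(retrieved, relevant, k):
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved, relevant):
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0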
Remember: Bad inputs = bad outputs, no matter how good your LLM is.
Resources
- Anthropic's Contextual Retrieval: https://www.anthropic.com/engineering/contextual-retrieval
- Chip Huyen's AI Engineering: Comprehensive coverage of RAG systems
- Microsoft's Graph RAG Paper: Deep dive into knowledge graph integration
- HyDE Paper: "Precise Zero-Shot Dense Retrieval without Relevance Labels"
Citation
If you found this guide helpful and would like to cite it:
Cited as:
Haseeb, Raja. (Feb 2026). "RAG Optimization Techniques". Personal Blog.
https://rajahaseeb.com/posts/rag-optimization-techniques/
Or in BibTeX format:
@article{pytholic2026ragoptim,
title = "RAG Optimization Techniques",
author = "Haseeb, Raja",
journal = "pytholic.github.io",
year = "2026",
month = "Feb",
url = "https://pytholic.github.io/posts/rag-optimization-techniques/"
}