
RAG Optimization Techniques

February 9, 2026

Tags: RAG, LLM, Machine Learning, Vector Search, Information Retrieval

A comprehensive guide to optimizing Retrieval-Augmented Generation (RAG) systems for production use.


Core Techniques

🪓 Chunking Strategy

Getting chunk size right is critical. Too small = loss of context. Too large = noisy retrieval.
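
As a rough illustration, here's a minimal fixed-size chunker with overlap; the 512-token size and 64-token overlap are arbitrary starting points, not recommendations:

# Sketch: fixed-size chunking with overlap (sizes are illustrative).
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    assert chunk_size > overlap, "overlap must be smaller than chunk_size"
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):  # last window already covers the tail
            break
    return chunks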

Key Principles:

Chunking Options:

Important Considerations:

🎯 Query Rewriting

User queries are often incomplete, especially in multi-turn conversations.

The Problem:

Example conversation:

User: When was the last time John Doe bought something from us?
AI: John bought a Fruity Fedora on January 3, 2030.
User: How about Emily Doe?

The last query will fail retrieval: it lacks context. It should be rewritten to "When was the last time Emily Doe bought something from us?"

Solution:
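
A common approach is to have an LLM rewrite the latest message into a standalone query using the conversation history. A minimal sketch, assuming a generic `llm_complete(prompt)` helper rather than any particular SDK:

# Sketch: rewrite a follow-up message into a standalone, retrievable query.
REWRITE_PROMPT = """Given the conversation below, rewrite the final user message
as a standalone question that contains all the context needed for retrieval.

Conversation:
{history}

Final user message: {query}

Standalone question:"""

def rewrite_query(history: list[str], query: str) -> str:
    prompt = REWRITE_PROMPT.format(history="\n".join(history), query=query)
    return llm_complete(prompt).strip()  # llm_complete is a placeholder LLM call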

Important Warning:

🔁 Reranking

The two-stage retrieval paradigm: cheap retrieval → expensive reranking.

Why Reranking?

Especially useful when you want to:

Common Approach: Hybrid Retrieval System

  1. First stage: Fetch candidates with a cheap retriever
    • Use BM25 + vector search to cast a wide net
    • Term-based retrieval catches exact keyword matches
    • Embedding-based retrieval handles semantic similarity
  2. Second stage: Rerank with a better model (see the sketch below)
    • Use cross-encoder models for precision
    • More computationally expensive but much more accurate
    • Only applied to the smaller candidate set
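
A minimal sketch of the second stage with a cross-encoder from the sentence-transformers library; the model name is one common off-the-shelf choice, not a requirement:

# Sketch: rerank a small candidate set with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Score every (query, candidate) pair; higher score = more relevant.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]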

Time-Based Reranking:

Documents can also be reranked based on time, giving higher weight to more recent items.
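
One simple way to implement this is to decay each relevance score by document age. A sketch using exponential decay; the 30-day half-life is just an example:

# Sketch: exponential time decay applied to a retrieval score.
from datetime import datetime, timezone

def time_weighted_score(score: float, doc_timestamp: datetime,
                        half_life_days: float = 30.0) -> float:
    # doc_timestamp is assumed to be timezone-aware (UTC).
    age_days = (datetime.now(timezone.utc) - doc_timestamp).total_seconds() / 86400
    return score * 0.5 ** (age_days / half_life_days)  # halves every half_life_days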

This is critical for time-sensitive applications:

Benefits:

📝 Contextual Retrieval

This one's a game-changer. Chunks often lack context needed for accurate retrieval.

The Problem:

A chunk about "Q3 revenue increased 15%" is useless without knowing:

Anthropic's Solution:

  1. For each chunk, use an LLM to generate a brief context (50-100 tokens) explaining the chunk's relationship to the overall document
  2. Prepend this context to the chunk before embedding
  3. Now the chunk becomes self-contained and retrievable

Example Transformation:

Before: "Q3 revenue increased 15%"

After: "This chunk is from Acme Corp's 2024 annual report, discussing the cloud services division. Q3 revenue increased 15%"

Anthropic's Prompt Template:

<document>
{{WHOLE_DOCUMENT}}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{{CHUNK_CONTENT}}
</chunk>

Please give a short succinct context to situate this chunk within
the overall document for the purposes of improving search retrieval.
Answer only with the succinct context and nothing else.

Additional Augmentation Strategies:

You can also augment chunks with:

The Process:

Document Corpus → Split into chunks →
Generate context for each chunk →
Prepend context to chunk →
Embed augmented chunks →
Store in vector database
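
A minimal sketch of this pipeline, reusing Anthropic's prompt from above; `llm_complete`, `embed`, and `vector_db` are placeholders for your own LLM client, embedding model, and vector store:

# Sketch: augment each chunk with LLM-generated context before embedding.
CONTEXT_PROMPT = """<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Please give a short succinct context to situate this chunk within
the overall document for the purposes of improving search retrieval.
Answer only with the succinct context and nothing else."""

def index_with_context(document: str, chunks: list[str]) -> None:
    for chunk in chunks:
        context = llm_complete(CONTEXT_PROMPT.format(document=document, chunk=chunk))
        augmented = f"{context.strip()}\n\n{chunk}"        # prepend context to chunk
        vector_db.add(text=augmented, embedding=embed(augmented))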

Benefits:

Figure: Contextual Retrieval Pipeline - Document chunks are augmented with context before embedding

Advanced Techniques

🔍 Hybrid Search

What it is:

Combining dense (vector-based) retrieval with sparse (keyword-based) retrieval. This is arguably the most impactful optimization missing from basic RAG implementations.

How it works:

Step-by-step process:

  1. Chunk the knowledge base: Break down documents into smaller chunks (typically a few hundred tokens each)
  2. Create dual encodings: Generate both TF-IDF encodings and semantic embeddings for each chunk
  3. BM25 retrieval: Use BM25 algorithm to find top chunks based on exact keyword matches
  4. Vector retrieval: Use embeddings to find top chunks based on semantic similarity
  5. Rank fusion: Combine and deduplicate results from both methods using rank fusion techniques (e.g., Reciprocal Rank Fusion)
  6. Context augmentation: Add the top-K chunks to the prompt to generate the final response

Why it's critical:

Almost every production RAG system uses hybrid search because:

Implementation:

# Pseudocode: two retrievers run in parallel, then their results are fused.
# vector_search / bm25_search / merge_with_weights stand in for your own helpers.
vector_results = vector_search(query_embedding, top_k=20)  # semantic candidates
bm25_results = bm25_search(query_text, top_k=20)           # keyword candidates

# Merge and rerank: blend normalized scores from both retrievers.
combined_results = merge_with_weights(
    vector_results,
    bm25_results,
    vector_weight=0.7,  # favor semantic similarity...
    bm25_weight=0.3,    # ...while still rewarding exact keyword matches
)
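
The rank fusion mentioned in step 5 above is often Reciprocal Rank Fusion; a minimal sketch of that fusion step (k=60 is the constant commonly used in the RRF literature):

# Sketch: Reciprocal Rank Fusion over ranked lists of document IDs.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # earlier rank => larger contribution
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused_ids = reciprocal_rank_fusion([bm25_ids, vector_ids])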

🏷️ Metadata Filtering

What it is:

Pre-filtering documents by metadata attributes before performing vector search to narrow the search space.

How it works:

Instead of searching your entire vector database, first filter by metadata:

Then perform vector search only on the filtered subset.

Example:

Query: "What's our refund policy for enterprise customers?"

Without metadata filtering:

With metadata filtering:
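
A minimal sketch of the filtered query using Chroma-style `where` syntax; the collection, field names, and values are made up for illustration, and most vector stores expose an equivalent filter argument:

# Sketch: restrict vector search to documents matching metadata filters.
# `collection` is a Chroma-style collection; `embed` is a placeholder embedder.
results = collection.query(
    query_embeddings=[embed("What's our refund policy for enterprise customers?")],
    n_results=5,
    where={"$and": [
        {"doc_type": "policy"},            # only policy documents
        {"customer_tier": "enterprise"},   # only enterprise-specific content
    ]},
)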

Benefits:

When to use:

🌳 Parent-Child / Hierarchical Retrieval

What it is:

Embed small chunks for precision, but return larger parent chunks to the LLM for better context.

The Problem:

The Solution:

  1. Create a hierarchy: large parent chunks and smaller child chunks
  2. Embed and index the small child chunks for retrieval
  3. When a child chunk is retrieved, return the parent chunk to the LLM

Example Structure:

Document: "Product Manual"
├── Parent Chunk 1: "Installation Guide" (2000 tokens)
│   ├── Child Chunk 1.1: "Prerequisites" (200 tokens) ← embed this
│   ├── Child Chunk 1.2: "Step-by-step instructions" (200 tokens) ← embed this
│   └── Child Chunk 1.3: "Troubleshooting" (200 tokens) ← embed this
└── Parent Chunk 2: "Configuration Guide" (2000 tokens)
    ├── Child Chunk 2.1: "Basic settings" (200 tokens) ← embed this
    └── Child Chunk 2.2: "Advanced settings" (200 tokens) ← embed this

Retrieval Flow:

  1. User query: "How do I fix installation errors?"
  2. Vector search finds Child Chunk 1.3: "Troubleshooting" (high precision)
  3. System returns Parent Chunk 1: "Installation Guide" to the LLM (full context)
  4. LLM has both the relevant section AND surrounding context
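
A minimal sketch of the child-to-parent lookup, assuming an in-memory mapping and a placeholder `vector_search` that returns child chunk IDs:

# Sketch: embed child chunks for precision, return parent chunks for context.
child_to_parent = {
    "child-1.3": "parent-1",  # "Troubleshooting" -> "Installation Guide"
    # ... one entry per child chunk
}
parent_chunks = {
    "parent-1": "Installation Guide: ... full 2000-token section ...",
}

def retrieve_with_parents(query: str, top_k: int = 3) -> list[str]:
    child_ids = vector_search(query, top_k=top_k)              # high-precision match
    parent_ids = {child_to_parent[cid] for cid in child_ids}   # dedupe shared parents
    return [parent_chunks[pid] for pid in parent_ids]          # full context for the LLM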

Benefits:

🎭 HyDE (Hypothetical Document Embeddings)

What it is:

Generate a hypothetical answer first, then embed that for retrieval instead of the original query. This bridges the query-document semantic gap.

The Problem:

Questions and answers use different vocabulary and structure:

These have different semantic representations despite being related.

How HyDE works:

  1. User submits query: "How do I fix CUDA out of memory errors?"
  2. Ask LLM to generate a hypothetical answer (even if it hallucinates):
    "To fix CUDA OOM errors, you should reduce batch size,
    use gradient accumulation, enable mixed precision training
    with torch.cuda.amp, clear cache with torch.cuda.empty_cache()..."
  3. Embed this hypothetical answer
  4. Search using the hypothetical answer's embedding
  5. The real documentation is semantically closer to this "answer" than to the original question
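
A minimal sketch of HyDE, assuming generic `llm_complete`, `embed`, and `vector_search_by_embedding` helpers:

# Sketch: HyDE - embed a hypothetical answer instead of the raw query.
def hyde_retrieve(query: str, top_k: int = 5) -> list[str]:
    # 1) Draft a plausible answer; factual errors are fine, only the
    #    vocabulary and structure of the passage matter for retrieval.
    hypothetical = llm_complete(
        f"Write a short passage that answers the question:\n{query}"
    )
    # 2) Embed the hypothetical answer and search with that vector.
    return vector_search_by_embedding(embed(hypothetical), top_k=top_k)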

Why it works:

When to use:

Caution:

🔀 Multi-Query Retrieval

What it is:

Generate multiple variations of the user's query and retrieve documents for each variation, then merge the results.

How it works:

  1. Original query: "Best practices for API security"
  2. Generate variations:
    • "How to secure REST APIs"
    • "API authentication and authorization methods"
    • "Preventing API vulnerabilities"
    • "API security design patterns"
  3. Retrieve documents for each variation
  4. Merge and deduplicate results
  5. Optionally rerank the merged set
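
A minimal sketch, again with placeholder `llm_complete` and `retrieve` helpers:

# Sketch: multi-query retrieval with simple deduplication.
def multi_query_retrieve(query: str, n_variations: int = 4, top_k: int = 5) -> list[str]:
    prompt = (f"Rewrite the following search query in {n_variations} different ways, "
              f"one per line:\n{query}")
    variations = [query] + [v.strip() for v in llm_complete(prompt).splitlines() if v.strip()]

    seen, merged = set(), []
    for variant in variations:
        for doc_id in retrieve(variant, top_k=top_k):
            if doc_id not in seen:         # deduplicate across variations
                seen.add(doc_id)
                merged.append(doc_id)
    return merged                          # optionally rerank this merged set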

Difference from Query Rewriting:

Benefits:

Implementation Strategies:

Trade-offs:

✅ Self-RAG / Corrective RAG

What it is:

The system evaluates whether retrieved documents are actually relevant to the query and can take corrective actions (re-retrieve, try different strategy, or fall back gracefully).

The Problem:

Traditional RAG blindly trusts retrieved documents:

How Self-RAG works:

1. Retrieve documents
2. LLM evaluates each document:
   - Is this relevant to the query?
   - Does this support or contradict other documents?
   - Is this information sufficient to answer?
3. Decision branches:
   a. All docs relevant → Proceed to generate answer
   b. Partially relevant → Retrieve more documents
   c. Not relevant → Try different retrieval strategy or web search
   d. No docs available → Answer from parametric knowledge or admit uncertainty
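
A minimal sketch of the evaluation step in the simpler corrective style (the grading prompt, `llm_complete`, and the `web_search` fallback are all placeholders; the original Self-RAG paper trains the model to emit reflection tokens rather than prompting):

# Sketch: grade retrieved docs for relevance, then correct or answer.
def grade_and_answer(query: str, docs: list[str]) -> str:
    relevant = []
    for doc in docs:
        verdict = llm_complete(
            f"Query: {query}\nDocument: {doc}\n"
            "Is this document relevant to the query? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            relevant.append(doc)

    if not relevant:                       # corrective branch: fall back to web search
        relevant = web_search(query)
    context = "\n\n".join(relevant)
    return llm_complete(f"Answer using only this context:\n{context}\n\nQuestion: {query}")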

Example Flow:

Query: "Latest features in Python 3.12"

Step 1: Initial retrieval returns docs about Python 3.10
Step 2: Self-RAG detects version mismatch
        "Query asks for 3.12, but retrieved docs are about 3.10"
Step 3: Corrective action: Trigger web search for current information
Step 4: Verify new results match query requirements
Step 5: Generate answer with appropriate sources

Corrective RAG Variant:

Focuses specifically on correction strategies:

Implementation Considerations:

Benefits:

When to implement:

🕸️ Graph RAG

What it is:

Use knowledge graphs to capture entity relationships that vector search misses. Instead of treating documents as isolated chunks, build a graph of interconnected entities and relationships.

The Problem with Vector Search:

Vector search finds semantically similar text, but it misses explicit relationships:

How Graph RAG works:

  1. Entity Extraction: Extract entities (people, places, organizations, concepts) from documents
  2. Relationship Mapping: Identify relationships between entities
  3. Graph Construction: Build knowledge graph with entities as nodes and relationships as edges
  4. Hybrid Retrieval:
    • Use vector search for semantic similarity
    • Use graph traversal for relationship queries
    • Combine both for comprehensive retrieval

Example Graph Structure:

[John] --works_on--> [Project X] --failed_in--> [2023]
[Project X] --belongs_to--> [Cloud Division]
[Cloud Division] --part_of--> [Acme Corp]

Query Examples:

"What projects did John work on that failed?"

Benefits:

Use Cases:

Implementation:

Challenges:

When to use:

🤖 Agentic RAG

What it is:

Treat RAG as a multi-step decision-making process where an LLM agent decides how to retrieve information, rather than using a single fixed retrieval pipeline.

The Problem:

Different queries need different retrieval strategies:

One-size-fits-all retrieval doesn't work for diverse queries.

How Agentic RAG works:

  1. Query Analysis: Agent analyzes the user's query to understand intent
  2. Strategy Selection: Agent decides which tools/retrievers to use
  3. Multi-Step Retrieval: Agent can chain multiple retrievals
  4. Self-Evaluation: Agent evaluates if retrieved information is sufficient
  5. Adaptive Refinement: Agent can try different approaches if first attempt fails

Example Flow:

Query: "Compare our Q3 revenue vs competitors in the cloud market"

Agent reasoning:
1. "Need internal Q3 revenue" β†’ Query SQL database
2. "Need competitor revenue" β†’ Web search
3. "Need cloud market context" β†’ Vector search internal reports
4. Combine all three sources β†’ Generate comparative analysis

Common Patterns:

ReAct (Reasoning + Acting):

Thought: I need to find the user's order history
Action: query_database(user_id=123, table="orders")
Observation: Found 15 orders
Thought: User asked about recent orders, I should filter
Action: filter_orders(timeframe="last_30_days")
Observation: 3 orders in last 30 days
Thought: Now I can answer
Answer: You placed 3 orders in the last 30 days...
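
A minimal sketch of a tool-selection loop in this ReAct style; the tool names, `llm_complete`, and the text protocol are placeholders rather than any particular framework's API:

# Sketch: a bare-bones ReAct-style loop where the LLM picks a retrieval tool.
TOOLS = {
    "vector_search": lambda q: vector_search(q, top_k=5),  # placeholder retrievers
    "sql_query": lambda q: run_sql(q),
    "web_search": lambda q: web_search(q),
}

def agentic_rag(query: str, max_steps: int = 5) -> str:
    transcript = f"Question: {query}\n"
    for _ in range(max_steps):
        step = llm_complete(
            transcript + "\nRespond with either 'Action: <tool>: <input>' "
            f"(tools: {', '.join(TOOLS)}) or 'Answer: <final answer>'."
        ).strip()
        if step.startswith("Answer:"):
            return step.removeprefix("Answer:").strip()
        _, tool, tool_input = [part.strip() for part in step.split(":", 2)]
        observation = TOOLS[tool](tool_input)              # run the chosen tool
        transcript += f"{step}\nObservation: {observation}\n"
    return "Unable to answer within the step limit."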

Tool Selection:

Implementation:

Often built with frameworks like:

Benefits:

Challenges:

When to use:

🎯 Fine-Tuning Embedding Models

What it is:

Train domain-specific embedding models on your specialized corpus instead of using general-purpose embeddings.

The Problem:

General-purpose embedding models (OpenAI, Cohere, etc.) are trained on broad internet data. They might not capture nuances in your specific domain:

How it works:

  1. Collect training data:
    • Positive pairs: (query, relevant document)
    • Negative pairs: (query, irrelevant document)
    • Can generate synthetically using LLMs
  2. Fine-tune embedding model (see the sketch after this list):
    • Start with a pre-trained model (e.g., sentence-transformers)
    • Continue training on domain-specific pairs
    • Optimize for your retrieval task
  3. Deploy fine-tuned model:
    • Use for encoding both queries and documents
    • Store embeddings in vector database
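
A minimal sketch of step 2 using the sentence-transformers training API; the base model, example pair, and hyperparameters are illustrative only:

# Sketch: fine-tune a base embedding model on (query, relevant document) pairs.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # any base model you already use

train_examples = [
    InputExample(texts=["how do I reset my API key?", "To rotate your API key, go to ..."]),
    # ... thousands of in-domain (query, relevant document) pairs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives: other documents in the batch serve as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("finetuned-domain-embedder")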

Benefits:

Trade-offs:

When to use:

Data Generation Strategy:

Use LLMs to generate synthetic training data:

For each document chunk:
1. Generate 3-5 queries that this chunk should answer
2. Generate 3-5 queries that this chunk should NOT answer
3. Use as positive and negative training pairs

Quick Reference

Start Here (Highest ROI):

  1. Hybrid Search (dense + sparse retrieval)
  2. Contextual Retrieval (augment chunks with context)
  3. Reranking (two-stage retrieval)

Next Level:

  1. Query Rewriting (handle multi-turn conversations)
  2. Metadata Filtering (narrow search space)
  3. Parent-Child Retrieval (precision + context)

Advanced (When Needed):

  1. HyDE (bridge query-document gap)
  2. Multi-Query Retrieval (comprehensive coverage)
  3. Self-RAG / Corrective RAG (robust production systems)
  4. Agentic RAG (complex multi-source queries)
  5. Graph RAG (relationship-heavy domains)
  6. Fine-tuned Embeddings (specialized domains)

Measuring Success

Always measure retrieval quality before optimizing generation:

Key Metrics:
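
Two standard retrieval metrics are recall@k and mean reciprocal rank (MRR). A minimal sketch of both, computed over labeled (retrieved IDs, relevant IDs) pairs:

# Sketch: recall@k and MRR over an evaluation set of labeled queries.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def mean_reciprocal_rank(results: list[tuple[list[str], set[str]]]) -> float:
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0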

Remember: Bad inputs = bad outputs, no matter how good your LLM is.

Resources


Citation

If you found this guide helpful and would like to cite it:

Cited as:

Haseeb, Raja. (Feb 2026). "RAG Optimization Techniques". Personal Blog.

https://rajahaseeb.com/posts/rag-optimization-techniques/

Or in BibTeX format:

@article{pytholic2026ragoptim,
    title   = "RAG Optimization Techniques",
    author  = "Haseeb, Raja",
    journal = "pytholic.github.io",
    year    = "2026",
    month   = "Feb",
    url     = "https://pytholic.github.io/posts/rag-optimization-techniques/"
}