You've probably heard the term "RAG" thrown around in every AI conversation lately. But here's the thing — most implementations I see in the wild are... well, they're not great. They retrieve irrelevant chunks, hallucinate despite having context, and generally disappoint users.
Let's fix that. In this deep dive, I'll show you the patterns that separate toy RAG demos from production-ready systems that actually work.
The RAG Architecture You Actually Need
RAG isn't just "embed documents, retrieve, generate." A production system needs:
- Smart chunking — Not just splitting on character count
- Hybrid search — Combining semantic and keyword search
- Re-ranking — Because first-pass retrieval is rarely optimal
- Query transformation — Rewriting user queries for better retrieval
Step 1: Intelligent Document Chunking
The biggest mistake? Chunking by character count. Your documents have structure — use it:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# Bad: Fixed-size chunks ignore document structure
bad_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

# Better: Semantic chunking groups related content
embeddings = OpenAIEmbeddings()
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
# Best: Custom chunking based on document structure
def smart_chunk(document):
    """Chunk based on headers, paragraphs, and semantic boundaries."""
    chunks = []
    current_chunk = []
    current_header = ""
    for line in document.split('\n'):
        # Detect headers and start new chunks
        if line.startswith('#'):
            if current_chunk:
                chunks.append({
                    'content': '\n'.join(current_chunk),
                    'header': current_header,
                    'type': 'section'
                })
            current_header = line
            current_chunk = [line]
        else:
            current_chunk.append(line)
    # Flush the final section so the last chunk isn't silently dropped
    if current_chunk:
        chunks.append({
            'content': '\n'.join(current_chunk),
            'header': current_header,
            'type': 'section'
        })
    return chunks
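If you want to feed these chunks into the retrievers below, wrap them as LangChain Document objects. A rough usage sketch (the file path is made up, and the metadata keys just mirror the dict above):

from langchain_core.documents import Document

# Hypothetical input: a Markdown file from your own docs
markdown_text = open("docs/api_guide.md").read()

documents = [
    Document(
        page_content=chunk['content'],
        metadata={'header': chunk['header'], 'type': chunk['type']}
    )
    for chunk in smart_chunk(markdown_text)
]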
Step 2: Hybrid Search with Reciprocal Rank Fusion
Vector search is great for semantic similarity, but sometimes users search for exact terms. Combine both:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
# Create vector store retriever
vectorstore = Chroma.from_documents(documents, embeddings)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# Create BM25 (keyword) retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10
# Combine with ensemble
hybrid_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # Favor semantic, but include keyword matches
)
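The ensemble merges the two ranked lists with Reciprocal Rank Fusion. If that scoring feels opaque, here is a simplified sketch of the idea (not LangChain's exact implementation): each document gets the weighted reciprocal of its rank in each list, and the combined totals decide the final order.

def reciprocal_rank_fusion(ranked_lists, weights, k=60):
    """Score each doc by its weighted reciprocal rank across the input lists."""
    scores = {}
    for ranked, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest combined score wins
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks well in both lists, so it ends up first
print(reciprocal_rank_fusion([["a", "b", "c"], ["b", "d", "a"]], [0.6, 0.4]))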
Step 3: Re-ranking for Precision
First-pass retrieval casts a wide net. Re-ranking narrows it down to what's actually relevant:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Initialize cross-encoder for re-ranking
model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
compressor = CrossEncoderReranker(model=model, top_n=5)
# Wrap your retriever with re-ranking
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=hybrid_retriever
)
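Using it looks like using any other retriever. A quick sketch (the query here is made up):

# Hybrid search returns up to 20 candidates (10 per retriever); the cross-encoder keeps the best 5
docs = reranking_retriever.invoke("How do I rotate my API keys?")
for doc in docs:
    print(doc.page_content[:80])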
Step 4: Query Transformation
Users don't always phrase queries optimally. Transform them:
from langchain.retrievers import MultiQueryRetriever
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
# Generate multiple query variations
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=reranking_retriever,
    llm=llm
)
# Or use HyDE (Hypothetical Document Embeddings)
from langchain.chains import HypotheticalDocumentEmbedder
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=embeddings,
    prompt_key="web_search"  # Generate hypothetical answer, then search
)
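HyDE changes only the query side: the LLM drafts a hypothetical answer, and that draft is what gets embedded and searched, while your documents are still indexed with the base embeddings. As I understand the LangChain API, the HyDE embedder can be dropped in wherever an embeddings object is expected; a sketch, reusing the same documents as above:

# Index with base embeddings; queries get expanded into hypothetical answers first
hyde_vectorstore = Chroma.from_documents(documents, hyde_embeddings)
hyde_retriever = hyde_vectorstore.as_retriever(search_kwargs={"k": 10})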
The Complete Pipeline
Here's how it all comes together:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Define the RAG prompt
rag_prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the following context. If you cannot
answer from the context, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer:""")
# Build the chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
    {"context": multi_query_retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
# Use it
response = rag_chain.invoke("What are the key features of the new API?")
print(response)

Grounding responses in your actual data is what moves answer accuracy from the roughly 70% you might see without retrieval to the 95%+ a well-built RAG system can reach.
The difference between a demo and production RAG is in these details. Start with the basics, measure your retrieval quality, and iterate. Your users will thank you.
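A concrete way to start measuring: keep a small labeled set of questions paired with the source each one should pull back, then track hit rate as you tune chunking, weights, and re-ranking. A minimal sketch (the questions, paths, and the "source" metadata key are all illustrative):

eval_set = [
    {"question": "How do I authenticate?", "expected_source": "docs/auth.md"},
    {"question": "What are the rate limits?", "expected_source": "docs/limits.md"},
]

def hit_rate_at_k(retriever, eval_set, k=5):
    """Fraction of questions whose expected source appears in the top-k results."""
    hits = 0
    for item in eval_set:
        docs = retriever.invoke(item["question"])[:k]
        if any(d.metadata.get("source") == item["expected_source"] for d in docs):
            hits += 1
    return hits / len(eval_set)

print(f"hit@5: {hit_rate_at_k(reranking_retriever, eval_set):.2f}")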