AI · Machine Learning · Deep Learning

RAG Done Right: Production Patterns That Actually Work

Most RAG implementations fail in production. Learn the advanced patterns that make the difference: semantic chunking, hybrid search, re-ranking, and query transformation. Complete with production-ready code.


You've probably heard the term "RAG" (retrieval-augmented generation) thrown around in every AI conversation lately. But here's the thing: most implementations I see in the wild are... well, they're not great. They retrieve irrelevant chunks, hallucinate despite having context, and generally disappoint users.

Let's fix that. In this deep dive, I'll show you the patterns that separate toy RAG demos from production-ready systems that actually work.

The RAG Architecture You Actually Need

RAG isn't just "embed documents, retrieve, generate." A production system needs:

  • Smart chunking — Not just splitting on character count
  • Hybrid search — Combining semantic and keyword search
  • Re-ranking — Because first-pass retrieval is rarely optimal
  • Query transformation — Rewriting user queries for better retrieval

Step 1: Intelligent Document Chunking

The biggest mistake? Chunking by character count. Your documents have structure — use it:

Python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Bad: Fixed-size chunks ignore document structure
bad_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

# Better: Semantic chunking groups related content
embeddings = OpenAIEmbeddings()
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

# Best: Custom chunking based on document structure
def smart_chunk(document):
    """Chunk based on headers, paragraphs, and semantic boundaries."""
    chunks = []
    current_chunk = []
    current_header = ""
    
    for line in document.split('\n'):
        # Detect headers and start new chunks
        if line.startswith('#'):
            if current_chunk:
                chunks.append({
                    'content': '\n'.join(current_chunk),
                    'header': current_header,
                    'type': 'section'
                })
            current_header = line
            current_chunk = [line]
        else:
            current_chunk.append(line)
    
    # Flush the final section; the last chunk isn't followed by a header
    if current_chunk:
        chunks.append({
            'content': '\n'.join(current_chunk),
            'header': current_header,
            'type': 'section'
        })
    
    return chunks
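
To feed these chunks into the retrievers below, wrap them in LangChain Document objects and carry the header along as metadata. A minimal sketch, assuming your source is a Markdown string loaded from a hypothetical docs/api_guide.md:

Python
from langchain_core.documents import Document

# Hypothetical source file; swap in whatever you're actually indexing
with open("docs/api_guide.md") as f:
    raw_text = f.read()

documents = [
    Document(
        page_content=chunk["content"],
        metadata={"header": chunk["header"], "type": chunk["type"]},
    )
    for chunk in smart_chunk(raw_text)
]

The header metadata pays off later: you can filter by section, or show users exactly where an answer came from.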

Step 2: Hybrid Search with Reciprocal Rank Fusion

Vector search is great for semantic similarity, but sometimes users search for exact terms. Combine both:

Python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma

# Create vector store retriever
vectorstore = Chroma.from_documents(documents, embeddings)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Create BM25 (keyword) retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Combine with ensemble
hybrid_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # Favor semantic, but include keyword matches
)
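
A quick way to sanity-check the fusion is to run a query that mixes a paraphrase with an exact token: BM25 catches the literal term while the vector side catches the rewording. The query below is just an example:

Python
# Exact tokens like "429" favor BM25; the paraphrase favors the vector side
results = hybrid_retriever.invoke("why does the API keep returning 429 errors?")
for doc in results[:3]:
    print(doc.metadata.get("header"), "->", doc.page_content[:80])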

Step 3: Re-ranking for Precision

First-pass retrieval casts a wide net. Re-ranking narrows it down to what's actually relevant:

Python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Initialize cross-encoder for re-ranking
model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
compressor = CrossEncoderReranker(model=model, top_n=5)

# Wrap your retriever with re-ranking
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=hybrid_retriever
)
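
With top_n=5, the up-to-20 candidates from the hybrid retriever get scored as (query, chunk) pairs by the cross-encoder and only the five best survive. A quick check, using an example query:

Python
# The cross-encoder scores every (query, chunk) pair and keeps the top 5
docs = reranking_retriever.invoke("how do I rotate an expired API key?")
print(len(docs))  # at most 5
for doc in docs:
    print(doc.metadata.get("header"))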

Step 4: Query Transformation

Users don't always phrase queries optimally. Transform them:

Python
from langchain.retrievers import MultiQueryRetriever
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)

# Generate multiple query variations
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=reranking_retriever,
    llm=llm
)

# Or use HyDE (Hypothetical Document Embeddings)
from langchain.chains import HypotheticalDocumentEmbedder

hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=embeddings,
    prompt_key="web_search"  # Generate hypothetical answer, then search
)
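
HyDE drops in anywhere an embeddings object is accepted: documents are indexed as usual, and at query time the LLM first writes a hypothetical answer, which is what actually gets embedded and searched. A sketch of wiring it into its own Chroma index (the collection name is illustrative):

Python
# Documents are embedded normally; queries are expanded into hypothetical
# answers by the LLM before being embedded for the similarity search
hyde_vectorstore = Chroma.from_documents(
    documents,
    hyde_embeddings,
    collection_name="hyde_index",  # illustrative name
)
hyde_retriever = hyde_vectorstore.as_retriever(search_kwargs={"k": 10})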

The Complete Pipeline

Here's how it all comes together:

Python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Define the RAG prompt
rag_prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the following context. If you cannot 
answer from the context, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:""")

# Build the chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": multi_query_retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

# Use it
response = rag_chain.invoke("What are the key features of the new API?")
print(response)
"RAG improves accuracy from 70% to 95%+ by grounding responses in your actual data."

The difference between a demo and production RAG is in these details. Start with the basics, measure your retrieval quality, and iterate. Your users will thank you.
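
"Measure your retrieval quality" is worth making concrete. Even a crude hit-rate check over a handful of hand-labeled question/source pairs will tell you whether a change to chunking or fusion weights actually helped. A minimal sketch with made-up evaluation pairs:

Python
# Tiny retrieval eval: does the section we know holds the answer
# appear in the re-ranked results? Replace with your own labeled pairs.
eval_set = [
    {"question": "How do I authenticate requests?", "expected_header": "# Authentication"},
    {"question": "What are the rate limits?", "expected_header": "# Rate Limits"},
]

hits = 0
for case in eval_set:
    retrieved = reranking_retriever.invoke(case["question"])
    if any(doc.metadata.get("header") == case["expected_header"] for doc in retrieved):
        hits += 1

print(f"Hit rate: {hits / len(eval_set):.0%}")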

Written by

Amanuel Garomsa

Machine Learning Engineer & Full Stack Developer. Writing about AI, software development, and technology.
