You've probably heard the term "RAG" thrown around in every AI conversation lately. But here's the thing — most implementations I see in the wild are... well, they're not great. They retrieve irrelevant chunks, hallucinate despite having context, and generally disappoint users.
Let's fix that. In this deep dive, I'll show you the patterns that separate toy RAG demos from production-ready systems that actually work.
The RAG Architecture You Actually Need
RAG isn't just "embed documents, retrieve, generate." A production system needs:
- Smart chunking — Not just splitting on character count
- Hybrid search — Combining semantic and keyword search
- Re-ranking — Because first-pass retrieval is rarely optimal
- Query transformation — Rewriting user queries for better retrieval
Step 1: Intelligent Document Chunking
The biggest mistake? Chunking by character count. Your documents have structure — use it:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# Bad: Fixed-size chunks ignore document structure
bad_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

# Better: Semantic chunking groups related content
embeddings = OpenAIEmbeddings()
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
# Best: Custom chunking based on document structure
def smart_chunk(document):
    """Chunk based on headers, paragraphs, and semantic boundaries."""
    chunks = []
    current_chunk = []
    current_header = ""
    for line in document.split('\n'):
        # Detect headers and start new chunks
        if line.startswith('#'):
            if current_chunk:
                chunks.append({
                    'content': '\n'.join(current_chunk),
                    'header': current_header,
                    'type': 'section'
                })
            current_header = line
            current_chunk = [line]
        else:
            current_chunk.append(line)
    # Flush the final section so the last chunk isn't silently dropped
    if current_chunk:
        chunks.append({
            'content': '\n'.join(current_chunk),
            'header': current_header,
            'type': 'section'
        })
    return chunks
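If you want to feed these chunks into the retrievers below, wrap them as LangChain Document objects. A rough usage sketch (the file path is made up, and the metadata keys just mirror the dict above):

from langchain_core.documents import Document

# Hypothetical input: a Markdown file from your own docs
markdown_text = open("docs/api_guide.md").read()

documents = [
    Document(
        page_content=chunk['content'],
        metadata={'header': chunk['header'], 'type': chunk['type']}
    )
    for chunk in smart_chunk(markdown_text)
]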
Step 2: Hybrid Search with Reciprocal Rank Fusion
Vector search is great for semantic similarity, but sometimes users search for exact terms. Combine both:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
# Create vector store retriever
vectorstore = Chroma.from_documents(documents, embeddings)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# Create BM25 (keyword) retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10
# Combine with ensemble
hybrid_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # Favor semantic, but include keyword matches
)
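The ensemble merges the two ranked lists with Reciprocal Rank Fusion. If that scoring feels opaque, here is a simplified sketch of the idea (not LangChain's exact implementation): each document gets the weighted reciprocal of its rank in each list, and the combined totals decide the final order.

def reciprocal_rank_fusion(ranked_lists, weights, k=60):
    """Score each doc by its weighted reciprocal rank across the input lists."""
    scores = {}
    for ranked, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest combined score wins
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks well in both lists, so it ends up first
print(reciprocal_rank_fusion([["a", "b", "c"], ["b", "d", "a"]], [0.6, 0.4]))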
Step 3: Re-ranking for Precision
First-pass retrieval casts a wide net. Re-ranking narrows it down to what's actually relevant:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Initialize cross-encoder for re-ranking
model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
compressor = CrossEncoderReranker(model=model, top_n=5)
# Wrap your retriever with re-ranking
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=hybrid_retriever
)
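Using it looks like using any other retriever. A quick sketch (the query here is made up):

# Hybrid search returns up to 20 candidates (10 per retriever); the cross-encoder keeps the best 5
docs = reranking_retriever.invoke("How do I rotate my API keys?")
for doc in docs:
    print(doc.page_content[:80])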
Step 4: Query Transformation
Users don't always phrase queries optimally. Transform them:
from langchain.retrievers import MultiQueryRetriever
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
# Generate multiple query variations
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=reranking_retriever,
    llm=llm
)
# Or use HyDE (Hypothetical Document Embeddings)
from langchain.chains import HypotheticalDocumentEmbedder
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=embeddings,
    prompt_key="web_search"  # Generate hypothetical answer, then search
)
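HyDE changes only the query side: the LLM drafts a hypothetical answer, and that draft is what gets embedded and searched, while your documents are still indexed with the base embeddings. As I understand the LangChain API, the HyDE embedder can be dropped in wherever an embeddings object is expected; a sketch, reusing the same documents as above:

# Index with base embeddings; queries get expanded into hypothetical answers first
hyde_vectorstore = Chroma.from_documents(documents, hyde_embeddings)
hyde_retriever = hyde_vectorstore.as_retriever(search_kwargs={"k": 10})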
The Complete Pipeline
Here's how it all comes together:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Define the RAG prompt
rag_prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the following context. If you cannot
answer from the context, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer:""")
# Build the chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
    {"context": multi_query_retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
# Use it
response = rag_chain.invoke("What are the key features of the new API?")
print(response)

Grounding responses in your actual data is what moves answer accuracy from the roughly 70% you might see without retrieval to the 95%+ a well-built RAG system can reach.
The difference between a demo and production RAG is in these details. Start with the basics, measure your retrieval quality, and iterate. Your users will thank you.
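A concrete way to start measuring: keep a small labeled set of questions paired with the source each one should pull back, then track hit rate as you tune chunking, weights, and re-ranking. A minimal sketch (the questions, paths, and the "source" metadata key are all illustrative):

eval_set = [
    {"question": "How do I authenticate?", "expected_source": "docs/auth.md"},
    {"question": "What are the rate limits?", "expected_source": "docs/limits.md"},
]

def hit_rate_at_k(retriever, eval_set, k=5):
    """Fraction of questions whose expected source appears in the top-k results."""
    hits = 0
    for item in eval_set:
        docs = retriever.invoke(item["question"])[:k]
        if any(d.metadata.get("source") == item["expected_source"] for d in docs):
            hits += 1
    return hits / len(eval_set)

print(f"hit@5: {hit_rate_at_k(reranking_retriever, eval_set):.2f}")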