Tech Stack
Architecture
Document Ingestion -> Chunking (semantic, not fixed-size) -> Embedding (text-embedding-3-large) -> Vector Store (pgvector or Pinecone) -> Hybrid Retrieval (semantic + keyword) -> Reranker (Cohere) -> LLM Generation with citations.

I've built RAG systems for 3 production products. The tutorials make it look easy — embed, store, retrieve, generate. Reality is messier.
Why Most RAG Systems Fail: Fixed-size chunking destroys context. Naive similarity search returns irrelevant results. The LLM generates confident answers from tangentially related chunks. Users lose trust after the first wrong answer.
Chunking Strategy That Works: Forget fixed 500-token chunks. Use semantic chunking — split on paragraph boundaries, headers, and topic shifts. Each chunk should be a self-contained thought. For code documentation, chunk by function/class. For legal documents, chunk by clause. The chunking strategy matters more than the embedding model.
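A minimal sketch of the idea: split on headers and blank lines, then merge neighbours up to a size cap. The regex boundaries and the max_chars cap are illustrative assumptions, not the exact production splitter.

```python
import re

def semantic_chunks(text: str, max_chars: int = 2000) -> list[str]:
    """Split on headers and blank lines, then merge small pieces up to a size cap.

    The goal is that each chunk stays a self-contained thought instead of a
    fixed token window.
    """
    # Treat markdown-style headers and blank lines as topic boundaries.
    pieces = re.split(r"\n(?=#{1,6} )|\n\s*\n", text)
    chunks, current = [], ""
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = f"{current}\n\n{piece}" if current else piece
    if current:
        chunks.append(current)
    return chunks
```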
The Embedding Model Choice: OpenAI's text-embedding-3-large is the safe default — best price/performance ratio. For cost-sensitive applications, text-embedding-3-small works surprisingly well. For multilingual content (Hindi + English), Cohere's embed-multilingual-v3 is better. Always benchmark on YOUR data, not MTEB leaderboards.
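For reference, a batch embedding call with the official OpenAI Python SDK (v1+) looks roughly like this; it assumes OPENAI_API_KEY is set in the environment, and the model string is the only thing you change when benchmarking alternatives.

```python
from openai import OpenAI  # official SDK, v1+; reads OPENAI_API_KEY from the environment

client = OpenAI()

def embed(texts: list[str], model: str = "text-embedding-3-large") -> list[list[float]]:
    """Embed a batch of chunks; swap the model string when benchmarking on your data."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]
```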
Hybrid Retrieval is Non-Negotiable: Pure vector similarity misses exact keyword matches. Pure keyword search misses semantic meaning. Use both. pgvector for semantic search + PostgreSQL full-text search for keywords. Combine results with Reciprocal Rank Fusion. This single change improved our retrieval accuracy by 35%.
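Reciprocal Rank Fusion itself is a few lines. The sketch below fuses any number of ranked ID lists (for us, the pgvector nearest-neighbour results and the PostgreSQL full-text results), using the commonly cited k=60 constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists (e.g. vector search + full-text search) with RRF.

    Each document scores sum(1 / (k + rank)) across every list it appears in,
    so items ranked highly by both retrievers float to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The nice property of RRF over score-based blending is that it only uses ranks, so you never have to normalise cosine distances against full-text relevance scores.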
The Reranking Secret: After initial retrieval (top 20 results), run them through a cross-encoder reranker (Cohere Rerank or a local model). Rerankers are 10x more accurate than bi-encoder similarity because they see the query and document together. This cut our hallucination rate by 60%.
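A minimal reranking pass with the Cohere Python SDK might look like the sketch below; the model name and top_n are illustrative, and the API key is read explicitly from the environment.

```python
import os
import cohere  # official Cohere SDK

co = cohere.Client(api_key=os.environ["CO_API_KEY"])

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Rerank the top-20 retrieval candidates with a cross-encoder before generation."""
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    # Each result carries the index of the original candidate plus a relevance score.
    return [candidates[r.index] for r in response.results]
```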
Multimodal RAG: Text is just the beginning. My Multimodal RAG system handles PDFs (with table extraction), images (Gemini vision descriptions), audio (Whisper transcription), and video (frame extraction + audio). Every modality gets embedded into the same vector space.
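Schematically, the dispatch can be as simple as the sketch below: the per-modality extractors (PDF/table parser, vision captioner, Whisper transcriber) are passed in as plain callables, and everything funnels into the same chunk-and-embed path. The callable names are placeholders, not a fixed API.

```python
from typing import Callable

Extractor = Callable[[str], str]  # maps a file path to extracted plain text

def ingest(path: str, modality: str, extractors: dict[str, Extractor]) -> list[str]:
    """Normalize any modality to text, then reuse the same chunking path.

    Whatever the source (PDF, image, audio, video), it ends up as text chunks
    destined for the one shared vector space; only the extractor differs.
    """
    if modality not in extractors:
        raise ValueError(f"unsupported modality: {modality}")
    text = extractors[modality](path)
    # semantic_chunks is the splitter from the chunking sketch above;
    # the resulting chunks go through embed() exactly like plain text.
    return semantic_chunks(text)
```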
The Citation Pattern: Every generated answer must include source references. Map each claim to the chunk it came from. Display chunk metadata (document name, page number, timestamp) alongside the answer. If the LLM can't find relevant chunks, it should say "I don't have information about that" instead of guessing.
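One way to wire this in is to number the sources in the prompt and force the refusal wording. A sketch, where the chunk metadata keys ("document", "page", "text") are assumptions about your schema:

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble the generation prompt so every claim can be traced to a chunk.

    Each chunk dict carries the metadata later shown to users (document name,
    page number). The refusal instruction is what keeps the model from guessing.
    """
    context = "\n\n".join(
        f"[{i}] ({c['document']}, p.{c['page']}): {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the sources below and cite them as [1], [2], ...\n"
        "If the sources do not contain the answer, reply exactly: "
        "\"I don't have information about that.\"\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
```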
Production Lessons: Index 100K+ documents? Use Pinecone or Qdrant — pgvector slows down past 500K vectors without careful indexing. Cache frequent queries in Redis. Monitor retrieval quality with weekly sampling — RAG systems degrade as new documents shift the embedding space.
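The query cache is a thin wrapper. A sketch with the redis-py client, where the key is a hash of the normalised query and the TTL is a tunable assumption:

```python
import hashlib
import json
import redis  # assumes a Redis instance reachable on localhost:6379

cache = redis.Redis()

def cached_retrieval(query: str, retrieve, ttl_seconds: int = 3600):
    """Serve frequent queries from Redis; `retrieve` is the hybrid pipeline above."""
    key = "rag:" + hashlib.sha256(query.lower().strip().encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    results = retrieve(query)
    cache.setex(key, ttl_seconds, json.dumps(results))
    return results
```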
Want to build something like this?
I architect and deploy end-to-end AI systems — from MVP to revenue.
Let's Talk