Deep-dives into AI, ML systems, data engineering, and full-stack development — written from the trenches.
Learn how Retrieval-Augmented Generation (RAG) works from basics to production, including embeddings, vector databases, and real-world architecture.
RAG (Retrieval-Augmented Generation) is a technique that combines two components: retrieval of relevant external data, and text generation by a large language model.
Instead of relying only on model knowledge, RAG fetches relevant data first.
LLMs alone have real limitations: their knowledge is frozen at training time, they can hallucinate facts, and they cannot see your private data. RAG solves this by grounding responses in real data.
1. Embed: text → vector representation
2. Search: find similar chunks using cosine similarity
3. Select: the Top-K most relevant chunks
4. Generate: the LLM uses the retrieved context to produce an accurate response
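The retrieval steps above can be sketched in plain Python. Here `embed` is a toy stand-in (a character-frequency vector) for a real embedding model, and the function names are illustrative, not from any specific library:

```python
import math

def embed(text):
    # Toy stand-in for a real embedding model: a character-frequency
    # vector over a-z. Real systems call an embedding model's API here.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine_similarity(a, b):
    # Standard cosine similarity: dot(a, b) / (|a| * |b|).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    # Score every chunk against the query, keep the Top-K most similar.
    q = embed(query)
    scored = [(cosine_similarity(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

chunks = ["cats are mammals", "rust is a language", "dogs are mammals"]
print(retrieve("cats are mammals", chunks, k=2))
```

In production the brute-force scan over all chunks is replaced by a vector database's approximate nearest-neighbor index, but the scoring logic is the same.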
Chunking is the process of splitting documents into smaller pieces before storing them.
✅ Optimal size: 300–800 tokens per chunk
👉 This directly impacts retrieval quality.
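A minimal chunker with overlap might look like this. It approximates tokens with whitespace-split words (a real pipeline would use the model's tokenizer), and the parameter names are illustrative:

```python
def chunk_text(text, max_tokens=400, overlap=50):
    # Tokens approximated as whitespace-separated words; swap in the
    # embedding model's tokenizer for accurate counts.
    words = text.split()
    chunks = []
    step = max_tokens - overlap  # each chunk repeats `overlap` words
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both sides, at the cost of storing some text twice.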
After searching the vector database, we select the Top-K most relevant chunks.
👉 Balance is key: too small a K misses context, too large a K buries the answer in noise.
Initial retrieval may not return the most relevant results first.
Re-ranking improves this by scoring each (query, chunk) pair with a stronger model — typically a cross-encoder — and reordering the candidates before they reach the LLM.
✅ Result: Better relevance → better answers
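A re-ranking pass can be sketched like this. The `word_overlap` scorer is a toy stand-in for a real cross-encoder, and all names here are illustrative assumptions:

```python
def rerank(query, candidates, score_fn, top_n=3):
    # score_fn stands in for a cross-encoder that scores each
    # (query, chunk) pair jointly; any relevance scorer works here.
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_n]

def word_overlap(query, chunk):
    # Toy relevance score: fraction of query words found in the chunk.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

candidates = ["pricing page for plans", "refund policy details", "refund request form"]
print(rerank("how do I get a refund", candidates, word_overlap, top_n=2))
```

Because the cross-encoder reads the query and the chunk together, it is far more accurate than embedding similarity, but also too slow to run over the whole corpus — hence the retrieve-then-rerank pattern.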
Hybrid search combines keyword search (exact term matching, BM25-style) with vector search (semantic similarity).
Why this matters: keyword search catches exact strings — IDs, product names, error codes — that embeddings can miss, while vector search catches paraphrases that keywords miss.
✅ Best practice for production systems
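One common way to merge the two result lists is reciprocal rank fusion (RRF) — an assumed fusion choice here, shown as a minimal sketch with made-up document ids:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: ranked lists of doc ids, e.g. one from keyword search
    # and one from vector search. Standard RRF formula:
    #   score(d) = sum over rankings of 1 / (k + rank(d))
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]
vector_hits = ["doc1", "doc5", "doc3"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

RRF only needs ranks, not raw scores, so it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.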
👉 RAG performance is NOT just about the LLM
👉 It heavily depends on retrieval quality
Popular options:
| Use Case | DB |
|---|---|
| Already using Postgres | pgvector |
| Large scale | Pinecone |
| Prototyping | FAISS |
| Advanced search | Weaviate |
Ingestion: Docs → Chunk → Embed → Store
Query: User → Query → Search → LLM → Answer
👉 RAG = "Open-book exam for LLMs"
It’s not magic — it’s smart retrieval + reasoning.