A large language model only knows what was in its training data, frozen at a cutoff date, and it can confidently invent facts. Retrieval-Augmented Generation (RAG) mitigates both problems: at query time it fetches relevant passages from an external knowledge base and places them in the prompt, so the model can answer from current, authoritative sources rather than from memory alone.
Working principle
RAG has two phases. Indexing (offline): documents are split into chunks, each converted to a vector embedding and stored in a vector database. Retrieval + generation (online): the user's question is embedded, the most similar chunks are retrieved by nearest-neighbour search, and these are concatenated with the question into a prompt. The LLM then generates an answer grounded in the retrieved context, ideally with citations.
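The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the `embed` function is a toy bag-of-words counter standing in for a real embedding model, the "vector database" is a plain list searched by brute force, and all function names, the sample documents, and the prompt wording are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Naive fixed-size chunking: split text into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): sparse word counts instead of a learned dense vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents: list[str], max_words: int = 40) -> list[tuple[str, Counter]]:
    """Indexing (offline): chunk every document and store (chunk, vector) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc, max_words)]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Retrieval (online): embed the question, return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate numbered context passages with the question."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the numbered context below and cite the passages you use.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical knowledge base for illustration only.
docs = [
    "The refund window is 30 days from delivery. Refunds are issued to the original payment method.",
    "Standard shipping takes 3 to 5 business days. Express shipping takes 1 business day.",
]
index = build_index(docs, max_words=12)
question = "How long is the refund window?"
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)  # this string is what would be sent to the LLM
```

In a real system the brute-force `sorted` call would typically be replaced by an approximate nearest-neighbour index, and `embed` by a trained embedding model; the overall shape of the pipeline stays the same.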
| Approach | Freshness | Best for |
|---|---|---|
| Plain LLM | Frozen at cutoff | General reasoning |
| Fine-tuning | Baked in at training time | Style, format, skills |
| RAG | Live — update the index | Facts, private/changing data |
Key insight: RAG quality is dominated by retrieval quality, not the LLM. Poor chunking or embeddings mean the model never sees the right context. Hybrid (keyword + vector) search and re-ranking are common upgrades.
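One common way to combine keyword and vector results in a hybrid search is reciprocal rank fusion (RRF), sketched below. The text above does not prescribe a particular fusion method, so RRF is an assumption chosen for illustration; the document IDs and the two input rankings are hypothetical, and the constant `k = 60` is a frequently used default rather than a tuned value. The sketch covers only the fusion step, not model-based re-ranking.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1, so items ranked well by several retrievers rise.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical ranked outputs from two separate retrievers.
keyword_hits = ["doc_b", "doc_a", "doc_d"]  # e.g. from a keyword search
vector_hits = ["doc_a", "doc_c", "doc_b"]   # e.g. from a vector search
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```

Because RRF uses only rank positions, it does not need the keyword and vector scores to be on a comparable scale, which is one reason it is a convenient starting point.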
Applications
- Enterprise Q&A over internal docs, wikis and tickets
- Customer-support assistants citing the knowledge base
- Research and legal/medical assistants needing verifiable sources
References & further reading
- Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” NeurIPS 2020.
- Karpukhin et al., “Dense Passage Retrieval for Open-Domain QA,” EMNLP 2020.
- Gao et al., “Retrieval-Augmented Generation for Large Language Models: A Survey,” 2024.