Retrieval-Augmented Generation (RAG) Systems

A large language model knows only what was in its training data, which is frozen at a cutoff date, and it can state invented facts with confidence. Retrieval-Augmented Generation (RAG) mitigates both problems: at query time it fetches relevant passages from an external knowledge base and places them in the prompt, so the model can answer from current, authoritative sources rather than from memory alone.

Working principle

RAG has two phases. Indexing (offline): documents are split into chunks, each converted to a vector embedding and stored in a vector database. Retrieval + generation (online): the user's question is embedded, the most similar chunks are retrieved by nearest-neighbour search, and these are concatenated with the question into a prompt. The LLM then generates an answer grounded in the retrieved context, ideally with citations.
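The two phases can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production design: the names chunk, embed and ToyVectorStore are hypothetical, the bag-of-words "embedding" stands in for a real embedding model, and the linear scan stands in for the approximate nearest-neighbour index a vector database would normally provide.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split a document into fixed-size word windows (a deliberately naive chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Assumption: a sparse bag-of-words vector replaces a learned dense embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: a list of (chunk text, vector) pairs."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, document: str) -> None:
        # Indexing phase (offline): chunk, embed, store.
        for piece in chunk(document):
            self.items.append((piece, embed(piece)))

    def search(self, query: str, k: int = 2) -> list[str]:
        # Retrieval phase (online): embed the query, rank chunks by similarity.
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add("The refund window is 30 days from delivery.")
store.add("Our office is closed on public holidays.")
store.add("Shipping takes 3 to 5 business days.")

top = store.search("How many days is the refund window?", k=1)
print(top[0])
```

In a real system the embed function would call an embedding model and the search method would query a vector index, but the shape of the pipeline stays the same.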

Figure 1. Retrieved passages are injected into the prompt before generation, anchoring the answer to real sources and enabling citations.
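The injection step in the caption above might look like the following sketch. The Passage type, the build_prompt helper, the file names and the instruction wording are all assumptions made for illustration; real systems tune the prompt template to their model.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # hypothetical source identifier, e.g. a file name or URL
    text: str


def build_prompt(question: str, passages: list[Passage]) -> str:
    """Number each retrieved passage so the model can cite it as [n]."""
    context = "\n".join(
        f"[{i}] ({p.source}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources as [n]. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "How long is the refund window?",
    [
        Passage("policy.md", "The refund window is 30 days from delivery."),
        Passage("faq.md", "Refunds are issued to the original payment method."),
    ],
)
print(prompt)
```

Numbering the passages gives the model a stable handle for citations, and the instruction to admit when the sources lack an answer is one common way to discourage the model from falling back on memory.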

Table 1. Plain LLM vs. RAG vs. fine-tuning
Approach	Freshness	Best for
Plain LLM	Frozen at cutoff	General reasoning
Fine-tuning	Baked in at training time	Style, format, skills
RAG	Live — update the index	Facts, private/changing data

Key insight: RAG quality is dominated by retrieval quality, not the LLM. With poor chunking or embeddings, the model never sees the right context. Hybrid (keyword + vector) search and re-ranking are common upgrades.
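One widely used way to combine keyword and vector results is reciprocal rank fusion (RRF), which scores each document by summing 1 / (k + rank) over the ranked lists it appears in. The sketch below assumes two already-ranked lists of document IDs; the IDs are made up, and k = 60 is a conventional default rather than a requirement. The text above does not prescribe RRF specifically, so treat this as one possible fusion method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute more to the total.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword search and a vector search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF uses only ranks, it avoids having to normalise keyword scores and vector similarities onto a common scale. A re-ranker, if used, would then rescore the top of the fused list.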

Applications

Enterprise Q&A over internal docs, wikis and tickets
Customer-support assistants citing the knowledge base
Research and legal/medical assistants needing verifiable sources
