Introduction
Retrieval-Augmented Generation (RAG) is an advanced AI framework that enhances the response accuracy of generative models by integrating information retrieval mechanisms. Unlike standalone Large Language Models (LLMs), which generate responses based solely on their pre-trained knowledge, RAG applications fetch and incorporate relevant external data dynamically. This makes RAG highly effective for applications requiring up-to-date, factual, and domain-specific responses.
Architecture & Workflow
A RAG system consists of several components, each playing a critical role in fetching, processing, and generating contextually enriched responses.
Workflow
Below is a typical workflow of a RAG-based application; a short end-to-end sketch follows the list:
1. User Query Input: The user submits a query through an interface.
2. Embedding Model: Converts the query into a vector representation.
3. Vector Database (VectorDB): Stores document embeddings and retrieves similar entries.
4. Retriever: Fetches the most relevant documents based on query similarity.
5. Context Manager: Selects and formats retrieved data.
6. LLM (Large Language Model): Uses the retrieved context to generate a response.
7. Response Delivery: The generated response is sent back to the user.
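The component sections below implement each of these steps. As a preview, here is a minimal sketch of the full pipeline; it assumes the helper functions defined later in this article (get_embedding, retrieve_similar_documents, format_context, generate_response) and a FAISS index already populated with document embeddings are in scope.
Example: End-to-End Pipeline (Preview Sketch)
import numpy as np

def answer_query(user_query, index, documents):
    # 1. Embed the query (Embedding Model section)
    query_vector = np.array([get_embedding(user_query)], dtype='float32')
    # 2. Retrieve the most similar document indices (VectorDB + Retriever sections)
    indices = retrieve_similar_documents(query_vector, index, k=3)
    retrieved_texts = [documents[i] for i in indices[0]]
    # 3. Select and format the retrieved text (Context Manager section)
    context = format_context(retrieved_texts)
    # 4. Generate the grounded answer (LLM section)
    return generate_response(user_query, context)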
Components of a RAG Application
1. User Query Interface
This is the entry point where users input their queries.
Can be a web application, chatbot, or API.
Example: Simple User Query Input (Python Flask API)
from flask import Flask, request, jsonify

app = Flask(__name__)

# Accept a user query via POST and acknowledge receipt
@app.route('/query', methods=['POST'])
def get_query():
    user_query = request.json['query']
    return jsonify({"query_received": user_query})

if __name__ == '__main__':
    app.run(debug=True)
2. Embedding Model
Converts user queries and documents into numerical vectors.
Popular choices: OpenAI’s text-embedding-ada-002, SentenceTransformers, BERT.
Example: Converting Text to Vector Using OpenAI Embeddings
import openai

# Uses the legacy (pre-1.0) OpenAI Python SDK interface
def get_embedding(text):
    response = openai.Embedding.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response['data'][0]['embedding']

query_embedding = get_embedding("What is RAG in AI?")
print(query_embedding)
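If you prefer a local, open-source model over an API call, the sentence-transformers library is a common alternative. The sketch below assumes the widely used all-MiniLM-L6-v2 checkpoint, which produces 384-dimensional vectors, so your index dimension must match.
Example: Converting Text to Vector Using SentenceTransformers (Local Alternative)
from sentence_transformers import SentenceTransformer

# Load a local embedding model (the checkpoint is downloaded on first use)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode the query into a 384-dimensional vector
query_embedding = model.encode("What is RAG in AI?")
print(query_embedding[:5])  # preview the first few values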
3. Vector Database (VectorDB)
Stores document embeddings and enables efficient retrieval.
Examples: FAISS, Pinecone, Weaviate, Chroma, Milvus.
Example: Storing and Searching in FAISS
import faiss
import numpy as np
# Initialize FAISS index
d = 1536 # Dimension of OpenAI embeddings
index = faiss.IndexFlatL2(d)
# Add example document vectors (random placeholders for real embeddings)
document_vectors = np.random.random((10, d)).astype('float32')
index.add(document_vectors)
# Search for the 5 nearest vectors to a query vector
query_vector = np.random.random((1, d)).astype('float32')
distances, indices = index.search(query_vector, 5)
print(indices)
4. Retriever
Searches the VectorDB for similar documents.
Uses techniques like k-NN (k-Nearest Neighbors), Approximate Nearest Neighbors (ANN).
Example: Retrieving Similar Documents from FAISS
def retrieve_similar_documents(query_vector, index, k=3):
    # Return the indices of the k nearest document vectors in the index
    distances, indices = index.search(query_vector, k)
    return indices

similar_docs = retrieve_similar_documents(query_vector, index, k=3)
print(f"Top 3 similar documents: {similar_docs}")
5. Re-Ranker (Optional)
Re-ranks retrieved documents based on relevance scores.
Common choices include BM25, Cohere Rerank, and cross-encoder models (a cross-encoder sketch follows the BM25 example below).
Example: Using BM25 for Re-Ranking
from rank_bm25 import BM25Okapi
documents = ["AI is transforming industries", "RAG improves LLM accuracy", "Vector databases store embeddings"]
tokenized_docs = [doc.split(" ") for doc in documents]
bm25 = BM25Okapi(tokenized_docs)
query = "How does RAG work?"
tokenized_query = query.split(" ")
scores = bm25.get_scores(tokenized_query)
print(scores)
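Lexical BM25 scoring can miss semantic matches; a cross-encoder scores each (query, document) pair jointly and usually ranks more accurately, at higher compute cost. The sketch below assumes the sentence-transformers library and its publicly available ms-marco-MiniLM-L-6-v2 cross-encoder checkpoint.
Example: Re-Ranking with a Cross-Encoder
from sentence_transformers import CrossEncoder

# Load a cross-encoder trained for passage re-ranking
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

documents = ["AI is transforming industries", "RAG improves LLM accuracy", "Vector databases store embeddings"]
query = "How does RAG work?"

# Score each (query, document) pair and sort documents by score, highest first
scores = reranker.predict([(query, doc) for doc in documents])
reranked = [doc for _, doc in sorted(zip(scores, documents), reverse=True)]
print(reranked)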
6. Context Manager
Selects relevant retrieved documents and formats them for LLM input.
Example: Formatting Retrieved Text for LLM
def format_context(retrieved_texts):
    # Join the retrieved passages into a single block for the LLM prompt
    return "\n".join(retrieved_texts)

context = format_context(["RAG fetches external knowledge", "It improves response accuracy"])
print(context)
7. Large Language Model (LLM)
Generates responses using both query and retrieved documents.
Examples: GPT-4, Llama, Claude, Mistral.
Example: Generating a Response with OpenAI GPT-4
def generate_response(query, context):
    # Uses the legacy (pre-1.0) OpenAI Python SDK interface (see the embedding example above)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Use the given context to answer accurately."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )
    return response["choices"][0]["message"]["content"]

response = generate_response("What is RAG?", context)
print(response)
8. Response Generation & Post-Processing
Enhances response quality through formatting, summarization, or citation.
Can involve grounding checks and hallucination detection, as sketched below.
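A production system typically uses a dedicated groundedness or NLI model for this. As a simple illustration only, the sketch below flags answer sentences that share little vocabulary with the retrieved context; the 0.3 threshold and the helper name are arbitrary choices for this example.
Example: Naive Word-Overlap Grounding Check
def flag_unsupported_sentences(answer, context, min_overlap=0.3):
    # Flag answer sentences whose word overlap with the context falls below the threshold
    context_words = set(context.lower().split())
    flagged = []
    for sentence in answer.split(". "):
        words = set(sentence.lower().split())
        if not words:
            continue
        overlap = len(words & context_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged

context = "RAG fetches external knowledge\nIt improves response accuracy"
answer = "RAG fetches external knowledge to improve accuracy. It was invented in 1975."
print(flag_unsupported_sentences(answer, context))  # -> ['It was invented in 1975.']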
Conclusion
RAG applications combine the power of generative AI with external data retrieval to create fact-based, relevant, and contextually rich responses. By integrating embedding models, vector databases, and re-ranking mechanisms, RAG grounds LLM output in retrieved evidence, reducing hallucinations and keeping answers current.
Key Takeaways:
- LLMs alone can hallucinate – RAG provides real-time external knowledge.
- Embedding models & VectorDBs enable retrieval of similar context.
- Retrieval + Re-ranking ensures relevance in responses.
- By following the code examples, you can build your own end-to-end RAG system for more accurate AI-driven applications.
**Ready to build your own RAG application? Start coding today!**