Yet another post about RAG: Building a Retrieval-Augmented Generation (RAG) Pipeline

Stan Chen
4 min read · Jun 11, 2024

If you’ve been around the AI and NLP world lately, you’ve heard everyone buzzing about Retrieval-Augmented Generation (RAG). It’s the latest craze, combining information retrieval with natural language generation to create smarter, more context-aware systems.

This post covers the basics: what a RAG pipeline is, the core concepts behind it, how to build one, and some tips for deploying it in production (ops).

What is a RAG Pipeline?

A RAG pipeline integrates two core components:

  1. Retrieval: This part is like a high-tech librarian that searches a huge corpus of documents to find the most relevant ones in response to a query.
  2. Generation: This part acts like a creative writer who uses the retrieved documents to generate a coherent and contextually relevant answer to the query.
If you prefer more hands-on practice, here’s a handy notebook for you :)

Basic Concepts

  • Data Collection and Preprocessing: Gathering and preparing a bunch of documents for our high-tech librarian.
  • Embedding Generation: Transforming these documents into vector representations, like turning words into magic numbers, using models like BERT.
  • Information Retrieval: Using similarity search techniques to find relevant documents — think of it as matchmaking, but for documents.
  • Natural Language Generation: Using models like GPT to weave these retrieved nuggets of information into beautiful prose.

Building a RAG Pipeline

Step 1: Data Collection and Preprocessing

First, we need to collect a bunch of documents and preprocess them so they’re ready for our pipeline. Here are some dummy documents to get started (a minimal preprocessing sketch follows the snippet):

documents = [
    "Document 1: This is the content of the first document...",
    "Document 2: This document talks about data science...",
    "Document 3: In this document, we discuss AI history...",
    "Document 4: Overview of natural language processing...",
    "Document 5: Deep learning in image recognition..."
]
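The post leaves preprocessing open-ended, so here is a minimal, illustrative sketch (the preprocess helper and its chunk size are assumptions, not part of the original pipeline): normalize whitespace and split long documents into fixed-size chunks so they fit the embedding model’s input limits.

import re

def preprocess(text, chunk_size=500):
    # Collapse runs of whitespace so formatting noise doesn't pollute embeddings
    text = re.sub(r'\s+', ' ', text).strip()
    # Naive fixed-width chunking; real pipelines often split on sentences or tokens
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Flatten every document's chunks into a single corpus list
corpus = [chunk for doc in documents for chunk in preprocess(doc)]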

Step 2: Embedding Generation

Next, we generate embeddings for these documents using a pre-trained model. Think of embeddings as the secret sauce that lets our system understand the content.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

def generate_embeddings(documents):
    # Tokenize the batch; padding/truncation gives every sequence the same length
    inputs = tokenizer(documents, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into one vector per document
    return outputs.last_hidden_state.mean(dim=1)

document_embeddings = generate_embeddings(documents)
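One caveat: the mean pooling above averages over padding tokens too. The all-MiniLM-L6-v2 model card recommends mask-aware mean pooling, so here is a hedged variant (generate_embeddings_masked is an illustrative name) that zeroes out padding positions before averaging:

def generate_embeddings_masked(documents):
    inputs = tokenizer(documents, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # Zero out padding positions so pad tokens don't dilute the average
    mask = inputs['attention_mask'].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts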

Step 3: Information Retrieval

Now, let’s retrieve the most relevant documents based on a query. This is where our librarian really shines.

def retrieve_documents(query, document_embeddings, top_k=3):
    # Embed the query with the same model used for the documents
    query_embedding = generate_embeddings([query])
    # Cosine similarity between the query vector and every document vector
    similarities = torch.nn.functional.cosine_similarity(query_embedding, document_embeddings)
    # Indices of the top_k most similar documents
    top_k_indices = torch.topk(similarities, k=top_k).indices
    return [documents[i] for i in top_k_indices.tolist()]

query = "Tell me about natural language processing."
retrieved_docs = retrieve_documents(query, document_embeddings)

Step 4: Natural Language Generation

Finally, we generate a response based on the query and the retrieved documents. Our creative writer takes over here.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

generator_tokenizer = AutoTokenizer.from_pretrained('t5-small')
generator_model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

def generate_answer(query, retrieved_docs):
    # Concatenate the query with the retrieved documents as context
    input_text = query + " " + " ".join(retrieved_docs)
    inputs = generator_tokenizer.encode(input_text, return_tensors='pt', truncation=True)
    # Beam search tends to give more fluent output than greedy decoding
    outputs = generator_model.generate(inputs, max_length=150, num_beams=5, early_stopping=True)
    return generator_tokenizer.decode(outputs[0], skip_special_tokens=True)

answer = generate_answer(query, retrieved_docs)
print("Answer:", answer)

Production Discussion

So, you’ve got your RAG notebook up and running, but now you need to make it production-ready. Here are some ideas:

1. Use Robust Frameworks

  • Elasticsearch: It has been around for a while and remains a robust engine for full-text search and analytics, still very effective in production RAG pipelines.
  • Transformers: Developed by Hugging Face, this framework is highly recommended due to its comprehensive ecosystem and strong community support.

The following frameworks are also recommended and frequently mentioned in the context of RAG pipelines:

  • Haystack: Known as a versatile tool for RAG pipelines, providing extensive features for document retrieval and question answering.
  • FAISS: Renowned for its high-speed similarity search capabilities (see the sketch after this list).
  • Milvus: A widely used open-source vector database for efficient vector data management.
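To make the FAISS bullet concrete, here is a minimal sketch that swaps the brute-force cosine loop from Step 3 for an exact FAISS index. It assumes the documents, document_embeddings, and generate_embeddings objects defined earlier; retrieve_with_faiss is an illustrative name.

import faiss

# FAISS expects float32 arrays; normalize so inner product equals cosine similarity
embeddings = document_embeddings.numpy().astype('float32')
faiss.normalize_L2(embeddings)

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product index
index.add(embeddings)

def retrieve_with_faiss(query, top_k=3):
    q = generate_embeddings([query]).numpy().astype('float32')
    faiss.normalize_L2(q)
    scores, indices = index.search(q, top_k)
    return [documents[i] for i in indices[0]]

For a five-document demo this buys nothing over the cosine loop, but the same index scales to millions of vectors, and FAISS offers approximate index types when exact search gets too slow.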

2. Automate with Workflow Orchestration Tools

Because who wants to do everything manually? These tools help you automate pipeline stages like data collection, preprocessing, embedding generation, and model serving. Here are some ideas for production:

  • Data Collection and Preprocessing: Automate the collection of new documents and preprocessing tasks to ensure your corpus is always up-to-date and clean. Use scheduled tasks to scrape data, clean it, and store it in a structured format.
  • Embedding Generation: Set up regular jobs to update document embeddings, especially if your document corpus is frequently updated. This ensures your retrieval step is always using the most relevant data.
  • Information Retrieval and Generation: Automate the retrieval and generation processes to handle incoming queries efficiently. This can be crucial for maintaining performance and scalability.
  • Model Serving: Use a model serving platform to deploy your retrieval and generation models. This lets you handle requests in real time and keeps your models easy to update and maintain (a minimal serving sketch follows this list).
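For the model-serving bullet, here is a minimal sketch of one option among many: wrapping the notebook’s retrieve_documents and generate_answer functions behind a FastAPI endpoint (the route and module names are illustrative):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    text: str

@app.post("/ask")
def ask(query: Query):
    # Reuse the retrieval and generation functions from the notebook
    docs = retrieve_documents(query.text, document_embeddings)
    answer = generate_answer(query.text, docs)
    return {"answer": answer, "sources": docs}

# Run with, e.g.: uvicorn app:app --port 8000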

Some Sample Code to Help You Get Started

Apache Airflow: The granddaddy of workflow automation.

from airflow import DAG
from airflow.operators.python import PythonOperator  # modern import path (Airflow 2.x)
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
    'retries': 1
}

dag = DAG('rag_pipeline', default_args=default_args, schedule_interval='@daily')

def run_rag_pipeline():
    # Place your RAG pipeline code from the notebook here
    pass

run_pipeline = PythonOperator(
    task_id='run_rag_pipeline',
    python_callable=run_rag_pipeline,
    dag=dag
)

Dagster: The new kid on the block, with fancy features like type checking and data versioning.

from dagster import job, op  # Dagster 1.x replaced `solid`/`pipeline` with `op`/`job`

@op
def run_rag_pipeline():
    # Place your RAG pipeline code from the notebook here
    pass

@job
def rag_pipeline():
    run_rag_pipeline()

# For a quick local check: rag_pipeline.execute_in_process()

Conclusion

Building a RAG pipeline is to today’s AI engineering landscape what “data structures and algorithms” are to classic software engineering: a foundational skill. It draws on several disciplines, including data management, machine learning, and working with large language models (LLMs).

However, it’s important not to blindly dive into trends and buzzwords. Instead, focus on creating value and understanding how these tools can benefit your projects. Hopefully, this post has provided some valuable context and insights into what a RAG pipeline is all about. Enjoy exploring this exciting field!
