
Basic to Advanced RAG: A Beginner’s Deep-Dive Made Easy (2024)

RAG and Advanced RAG Simplified

RAG (Retrieval-Augmented Generation): This is a new era, so why talk about wearing rags? No way! We are talking about RAG, Retrieval-Augmented Generation, as well as advanced RAG. The name may sound old, but the technology is new and is taking everyone by surprise. It can take chatbots like ChatGPT to the next level! Really? But how?

By the end of this blog post, you will understand how RAG works and how you can build advanced RAG using only a few different prompt templates (a link to the code containing the templates is shared in the conclusion) that improve the retrieval process. This blog focuses on how RAG works and on the prompts that advance it.

This blog post is well researched to equip you with the knowledge of RAG and advanced RAG. Please read it to the end, and I promise you will gain that knowledge in only a few minutes.

Overcoming RAG’s Challenges: The Rise of New and Improved Models

RAG was first introduced in a 2020 research paper by Patrick Lewis and his team at Facebook AI Research. Later, frameworks like LangChain, built to make developing generative AI-powered apps and working with LLMs easier, started offering more functionality and tools to support RAG.

RAG helped us overcome the problem of AI hallucinations and overly generic answers. Though RAG felt like it was solving the problem, just like every system, it had some limitations. Now, what were those limitations, and how did the newer advanced RAG methods solve them? Keep reading to find out, but to get the answer, let's first look at how RAG works.

You Are Already Familiar With RAG: Understanding RAG Through ChatGPT

Yes, you are already familiar with RAG if you have used ChatGPT. ChatGPT uses LLMs like GPT or Davinci. The way they work is simple: you send a prompt in the form of instructions or a query, the model finds the right information according to the semantic meaning of your words, which it holds internally in the form of vectors, and it gives you the output by converting that information back into text.

But when it does not find the information, it says, 'I am sorry. According to my last update, I don't have information on that.' In simple words, RAG retrieves information from the document or any other data source you attach, augments it with your prompt, and sends it to the LLM. This is how it works; this is the whole idea, and it is why it is called Retrieval-Augmented Generation.

RAG allows you to connect external data to an LLM, which solves the problem of that apology from your LLM. RAG typically operates in three stages: indexing, retrieval, and generation. Let's discuss them one by one, with an example, to make RAG super easy to grasp.

Basic Workings of RAG: Indexing, Retrieval, and Generation

Indexing:

Indexing is the organization and storage of documents or any other data in a structured format that facilitates efficient retrieval. In the RAG framework, the corpus built from your data is indexed to enable quick and accurate retrieval during query processing. This involves creating indexes that map each document to relevant keywords, phrases, or vectors, allowing rapid lookup and retrieval based on user queries.
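To make the indexing stage concrete, here is a minimal sketch in Python. It assumes LangChain with OpenAI embeddings and a FAISS vector store, a placeholder file called my_data.txt, and an OpenAI API key in the environment; exact import paths can vary slightly between LangChain versions.

```python
# Indexing sketch: load a document, split it into chunks, embed each chunk,
# and store the embeddings in a vector store for later retrieval.
# Assumes langchain, langchain-community, langchain-openai, and faiss-cpu are
# installed and an OPENAI_API_KEY is set; "my_data.txt" is a placeholder file.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

docs = TextLoader("my_data.txt").load()                          # 1. load the raw document
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)                          # 2. break it into chunks

vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())   # 3. embed and index the chunks
```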

Retrieval:

Retrieval is the process of finding and fetching relevant information from your data. In the RAG framework, retrieval is driven by the user's query and relies on the indexes built during the indexing phase.
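Continuing the same sketch, retrieval is simply a similarity search of the user's query against the vector store built above (the vectorstore variable and the example query below are illustrative and carried over from the indexing sketch).

```python
# Retrieval sketch: embed the user's query and fetch the most similar chunks
# from the vector store built in the indexing step above.
query = "Why was LLM introduced to our company?"
relevant_chunks = vectorstore.similarity_search(query, k=3)   # top-3 most similar chunks
for chunk in relevant_chunks:
    print(chunk.page_content[:200])
```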

Generation:

In the generation phase, the retrieved data is combined with the user's prompt and sent to the LLM, which then generates an output based on the combined information and returns it in human-readable text form.
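And here is the generation step of the same sketch: the retrieved chunks are pasted into the prompt together with the user's question and sent to the LLM. The model name and prompt wording are illustrative, not prescriptive.

```python
# Generation sketch: augment the prompt with the retrieved chunks, then call the LLM.
from langchain_openai import ChatOpenAI

context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
llm = ChatOpenAI(model="gpt-3.5-turbo")
answer = llm.invoke(prompt)          # returns a chat message
print(answer.content)
```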

An Example of How Typical RAG Works

Let us say that you have a text document containing data and you use LangChain to implement RAG. Your text will be broken down into smaller chunks, each chunk will be converted into a vector (a numerical representation of your text), and the vectors will be stored in a vector store.

The chunks are converted into vectors so that they can be retrieved easily by performing a similarity search when a user gives a prompt. When a user gives a prompt, the prompt is also converted into a vector, and then a similarity search is performed.

[Figure: Basic RAG in LangChain, breaking a document down into chunks]

These vectors carry the semantic meaning of the text, encoded as numerical values. Suppose your vector embeddings are projected as points in a 3D space. The location of each point is determined by the semantic meaning and content of that chunk of the document.

Documents at similar locations in that space contain similar information. When a query is converted into a vector, a similarity search is performed between the query's vector and the stored information vectors, and the closest matches are retrieved.

To get a better idea of semantic meaning, remember that in a dictionary, foot and food might be found on the same page, but here food, burger, and salad will be found at nearby locations, or, you could say, on the same page of the vector store.
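As a rough illustration of what "nearby locations in vector space" means, here is a toy cosine-similarity example with made-up 3-dimensional vectors; real embeddings have hundreds or thousands of dimensions.

```python
# Toy example: cosine similarity between a query vector and stored chunk vectors.
# The 3-D vectors are made up purely for illustration; real embeddings are far larger.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chunk_vectors = {
    "burger recipe":  np.array([0.9, 0.1, 0.0]),
    "salad recipe":   np.array([0.8, 0.2, 0.1]),
    "football rules": np.array([0.0, 0.1, 0.9]),
}
query_vector = np.array([0.85, 0.15, 0.05])   # pretend this encodes "healthy food ideas"

# Rank the chunks by similarity to the query; the food-related chunks score highest.
for name, vec in sorted(chunk_vectors.items(),
                        key=lambda kv: cosine_similarity(query_vector, kv[1]),
                        reverse=True):
    print(f"{name}: {cosine_similarity(query_vector, vec):.3f}")
```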

[Figure: Symbolic representation of basic RAG working in LangChain]

After retrieval, the information is sent to the LLM, which then generates a response that answers the query. To further improve the retrieval process, additional steps and methodologies have been proposed, which you can call advanced RAG, as discussed below.

Query Translation: Advanced RAG Methods to Improve Retrieval

These advanced RAG techniques are not fundamentally different from the original; they are simply methods to improve information retrieval from a given document or source, depending on the situation. Our main aim is to take an input query and modify it in a way that improves retrieval from the given document, so the retrieved content can later be sent to the LLM to generate a response.

1. Multi-Query Approach

In this advanced RAG approach, when a user gives an input query or instructions, there is a high chance that the query does not align well with the locations in the vector space where the required information lives, so in the retrieval step of RAG you add a default prompt.

[Figure: Advanced RAG, symbolic representation of the vector store]

This default prompt generates similar but differently worded versions of the question; by rewriting the query in a few different ways, we try to reach and retrieve the document that we want. Here is an example of a prompt template.

[Figure: Symbolic representation of the vector store]


Prompt: “You are an AI language model assistant. To obtain pertinent documents from a vector database, you must create five variations of the provided user inquiry. Your goal is to assist the user in overcoming some of the drawbacks of the distance-based similarity search by providing other viewpoints on the user question. Provide these substitute queries, separating them with newlines.”

The user gives a query, which is then sent with this prompt template, and more questions are generated, which are then used to retrieve the right information. Here is an example of the original question from the user and the newly worded questions generated with the template.
Original question: Why was LLM introduced to our company?

New questions:

  1. “What motivated the introduction of LLM to our company?”
  2. “What led to the decision to implement LLM in our company?”
  3. “What factors influenced the adoption of LLM within our company?”
  4. “Why was LLM chosen as a solution for our company’s needs?”
  5. “What specific objectives or challenges prompted the integration of LLM into our company’s operations?”

Now all of these questions are used to retrieve the information, which is then sent to the LLM to generate a response.
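Here is a minimal sketch of the multi-query approach, assuming an OpenAI chat model and the vectorstore from the earlier sketches; the prompt text paraphrases the template above, and every name in the snippet is illustrative.

```python
# Multi-query sketch: rewrite the user's question in several ways,
# retrieve with each rewrite, and merge the unique results.
from langchain_openai import ChatOpenAI

MULTI_QUERY_PROMPT = (
    "You are an AI language model assistant. Generate five variations of the "
    "provided user question so that relevant documents can be retrieved from a "
    "vector database despite the limitations of distance-based similarity search. "
    "Separate the alternative questions with newlines.\n\nQuestion: {question}"
)

llm = ChatOpenAI(model="gpt-3.5-turbo")
question = "Why was LLM introduced to our company?"

# Generate the alternative questions and keep the original one as well.
rewrites = llm.invoke(MULTI_QUERY_PROMPT.format(question=question)).content.splitlines()
queries = [question] + [q.strip() for q in rewrites if q.strip()]

# Retrieve with every query and deduplicate the chunks by their content.
unique_chunks = {}
for q in queries:
    for chunk in vectorstore.similarity_search(q, k=3):
        unique_chunks[chunk.page_content] = chunk

print(f"{len(unique_chunks)} unique chunks retrieved from {len(queries)} queries")
```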

2. RAG Fusion

RAG Fusion works the same way the multi-query approach works; the only difference is that after generating the multiple queries and retrieving documents with each of them, every retrieved document is given a rank based on its relevance to the user input, and the ranked results are then fused to produce a better output.

[Figure: Symbolic representation of the retrieved documents]

This is how RAG Fusion calculates the rankings.

Suppose the user wants to visit Italy, so he gives the query, "The best places to visit in Italy?" Now, the two alternative questions generated could be

  1. “What are the top tourist destinations in Italy?”
  2. “Which cities or attractions should one consider visiting while in Italy?”

Now, after retrieval, this is how the retrieved documents might be ranked:

Document       Query 1   Query 2         Query 3   Score Calculation
A (Rome)       1st       2nd             3rd       (1 / (1 + 1)) + (1 / (2 + 1)) + (1 / (3 + 1)) ≈ 0.50 + 0.33 + 0.25 ≈ 1.08
B (Florence)   2nd       Not Retrieved   1st       (1 / (2 + 1)) + 0 + (1 / (1 + 1)) ≈ 0.33 + 0.50 ≈ 0.83
C (Tuscany)    3rd       Not Retrieved   2nd       (1 / (3 + 1)) + 0 + (1 / (2 + 1)) ≈ 0.25 + 0.33 ≈ 0.58

In the above calculation, the formula used is: Reciprocal Rank Score = 1 / (rank + k), where k is a small constant value (often 1).
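The fusion step itself can be written as a small plain-Python function that applies this same formula, 1 / (rank + k) with k = 1 and ranks counted from 1, to the Italy example above.

```python
# Reciprocal Rank Fusion sketch: combine per-query rankings into one fused score.
def reciprocal_rank_fusion(rank_lists, k=1):
    """rank_lists: one dict per query, mapping document -> its rank (1 = best)."""
    scores = {}
    for ranks in rank_lists:
        for doc, rank in ranks.items():
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (rank + k)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Ranks from the Italy example table; documents missing from a dict were not retrieved.
rank_lists = [
    {"Rome": 1, "Florence": 2, "Tuscany": 3},   # query 1
    {"Rome": 2},                                # query 2
    {"Rome": 3, "Florence": 1, "Tuscany": 2},   # query 3
]
print(reciprocal_rank_fusion(rank_lists))
# [('Rome', 1.083...), ('Florence', 0.833...), ('Tuscany', 0.583...)]
```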

3. Decomposition Technique

This is another type of advanced RAG that became popular. As the name suggests, we break the input question down into sub-questions, or, you could say, take an input question, split it into a set of sub-problems, and then solve them separately.

Alternatively, IR-CoT (interleaved retrieval with chain of thought) can be implemented to solve the set of sub-problems: we take a sub-question, retrieve relevant information, and then use that retrieved information to answer the next sub-question. This approach is also known as multi-step decomposition.

You can understand both of these with the examples below.
Single-step decomposition: The user gives an input question: "What are the causes and effects of climate change?" This question might be broken down into two parts.

  1. What are the main causes of climate change?
  2. What are the different types of effects caused by climate change?

The above two questions will be used to retrieve information independently, and later they can either be concatenated or used by the LLM to generate a well-structured answer using the retrieved information from both questions.
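Here is a minimal sketch of single-step decomposition, again assuming an OpenAI chat model and the vectorstore from the earlier sketches; the decomposition prompt wording is illustrative.

```python
# Single-step decomposition sketch: split the question into sub-questions,
# retrieve for each one independently, then answer with the combined context.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo")
question = "What are the causes and effects of climate change?"

# 1. Ask the LLM to decompose the question into simpler sub-questions.
decomposition = llm.invoke(
    "Break the following question into two or three simpler sub-questions, "
    f"one per line:\n\n{question}"
).content
sub_questions = [q.strip() for q in decomposition.splitlines() if q.strip()]

# 2. Retrieve context for every sub-question independently.
context = "\n\n".join(
    chunk.page_content
    for sq in sub_questions
    for chunk in vectorstore.similarity_search(sq, k=2)
)

# 3. Answer the original question with the combined context.
answer = llm.invoke(
    f"Using only the context below, answer: {question}\n\nContext:\n{context}"
)
print(answer.content)
```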

Multi-Step Decomposition Approach: IR-CoT Combined.

User's question: "How has the internet impacted communication in the 21st century?" Initial answer generated from the user's original query: "The internet has revolutionised communication by providing various online tools like email, social media, and video chat."

Now, on analysing the initial answer, a newly generated question is: "How has the internet impacted face-to-face communication?"

The answer to this newly generated question will update the query and retrieve another piece of information, and so forth. Eventually, the information will be used to generate the most satisfactory answer.
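The multi-step variant can be sketched as a loop that interleaves retrieval with follow-up questions. This is a simplified illustration of the idea, not the exact IR-CoT algorithm from the paper, and it reuses the vectorstore from the earlier sketches.

```python
# Multi-step decomposition sketch: interleave retrieval with follow-up questions.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo")
question = "How has the internet impacted communication in the 21st century?"
context, current_query = "", question

for step in range(3):  # a fixed number of retrieval hops, for simplicity
    retrieved = vectorstore.similarity_search(current_query, k=2)
    context += "\n\n" + "\n\n".join(chunk.page_content for chunk in retrieved)
    # Ask the LLM for one follow-up question based on what we know so far.
    current_query = llm.invoke(
        "Given the question and the context so far, write ONE follow-up question "
        f"that would help answer it.\n\nQuestion: {question}\n\nContext:\n{context}"
    ).content.strip()

final = llm.invoke(
    f"Answer the question using the context.\n\nQuestion: {question}\n\nContext:\n{context}"
)
print(final.content)
```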

4. Step-Back Prompting

In this advanced RAG technique, we add a default prompt so that whenever a user gives a query, the model steps back and rephrases the query in a higher-level, more generic form. Here is a prompt template saved as the default, along with a few few-shot examples of prompting.

Prompt: “You are an expert in world knowledge. Your task is to step back and paraphrase a question into a more generic step-back question, which is easy to answer.”
A few few-shot examples are then sent along with the prompt template, as can be seen below:

Input: "Barack Obama was born in what country?"
Output: "What is the history of Barack Obama?"
Input: "Could the member of the police perform a lawful arrest?"
Output: "What can the members of the police do?"

Now if the user asks the question "Who is Bill Gates?" then, according to our default prompt template, the model steps back and generates a higher-level question that is used to retrieve the data along with the original question. The new higher-level question could be: "What is Bill Gates known for?"

Remember, both of the questions will be used to retrieve the information; no query is discarded. In practice, this will be applied to the data you want RAG to work on, like your company data or your favorite book, so Bill Gates could be replaced with Rajesh, your company colleague, or, if it's your favorite fiction, with Harry Potter or whoever your favorite fictional character is.
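Here is a sketch of step-back prompting that uses the template and few-shot examples above; both the original and the step-back question are used for retrieval, and the model name, prompt wording, and vectorstore are illustrative assumptions carried over from the earlier sketches.

```python
# Step-back prompting sketch: generate a more generic "step-back" question,
# then retrieve with both the original and the step-back question.
from langchain_openai import ChatOpenAI

STEP_BACK_PROMPT = (
    "You are an expert in world knowledge. Step back and paraphrase the question "
    "into a more generic step-back question that is easier to answer.\n\n"
    "Input: Barack Obama was born in what country?\n"
    "Output: What is the history of Barack Obama?\n"
    "Input: Could the member of the police perform a lawful arrest?\n"
    "Output: What can the members of the police do?\n"
    "Input: {question}\nOutput:"
)

llm = ChatOpenAI(model="gpt-3.5-turbo")
question = "Who is Bill Gates?"
step_back_question = llm.invoke(STEP_BACK_PROMPT.format(question=question)).content.strip()

# Retrieve with both questions; neither query is discarded.
chunks = (vectorstore.similarity_search(question, k=3)
          + vectorstore.similarity_search(step_back_question, k=3))
```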

5. HyDE (Hypothetical Document Embedding)

HyDE is another powerful advanced RAG technique, used to retrieve data by creating a hypothetical document from the user's query. Because the hypothetical document's embedding tends to land closer to the required document embeddings, it works as a map for the retrieval process.

Users' questions are usually short, so their embeddings carry little context, yet those embeddings are what we use to find the required information among large document embeddings in a high-dimensional vector space. A hypothetical document created from your question increases the likelihood of retrieving the right content.

Here too, we give a prompt template to generate a hypothetical document, which is a great addition to improve retrieval in RAG.
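Finally, a sketch of HyDE: the LLM first writes a short hypothetical answer passage, and that passage, rather than the raw question, is used for the similarity search. As before, the model, prompt wording, and vectorstore are assumptions carried over from the earlier sketches, not a fixed recipe.

```python
# HyDE sketch: generate a hypothetical answer document, then retrieve with it.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo")
question = "Why was LLM introduced to our company?"

# 1. Generate a hypothetical document that plausibly answers the question.
hypothetical_doc = llm.invoke(
    "Write a short passage that plausibly answers the question below, even if you "
    f"have to invent details; it will only be used for retrieval.\n\nQuestion: {question}"
).content

# 2. Use the hypothetical passage, not the raw question, for the similarity search.
#    Its embedding is richer, so it usually lands closer to the real answer chunks.
chunks = vectorstore.similarity_search(hypothetical_doc, k=3)
```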

Conclusion:

So we discussed RAG and its limitations, looked at how RAG and advanced RAG work, and covered techniques that can be used to improve RAG and overcome those limitations. I hope you enjoyed reading my blog post on RAG.

Stay tuned for our upcoming post on how to implement RAG using LangChain.

To access the code with the prompt templates discussed and explained above, visit this LangChain link.

Feel free to contact us for any suggestions and feedback, and do read our other posts.
