What is RAG?

Open at Intel host Katherine Druckman talks to fellow Intel Open Source Evangelist Ezequiel “Eze” Lanza about retrieval augmented generation, or, more simply, RAG. What is it? And how can you use it to build and deploy better AI applications? Read on to find out.

“Let's try to keep it simple. RAG is basically an approach. The main idea of that approach is to help the LLM or the model to provide you with answers on topics it wasn't trained on.”

— Ezequiel Lanza, Open Source AI Evangelist, Intel  

 

Katherine Druckman: We have a really exciting episode today, in my opinion, because today it is just me and my favorite colleague, Eze, and we are going to break down a concept in AI development, because we keep throwing it out there. We've defined it here and there, but we really want to go a little deeper so that people truly understand what it is we keep bringing up. And that is… what is this RAG concept we keep talking about? 

So, Eze, before we get into that, introduce yourself for the one or two people who've never heard an episode and may not yet know who you are. 

Ezequiel Lanza: Right. Yes. Great. Thank you for having me. It's a pleasure to be here. So, I am your teammate; we work together in Open at Intel. My role is open source AI evangelist, and what I basically do is work with the open source AI communities, doing a kind of DevRel. I try to explain how things work, how Intel is contributing to open source, and what we can do with the open source tools we have in the ecosystem, right? 

Katherine Druckman: Awesome. Since you are the skilled data scientist between the two of us, can you tell us, what is RAG in the context of AI app development? 

RAG is an Approach

Ezequiel Lanza: Yes. I mean, the answer can be very complex, right? Let's try to keep it simple. RAG is basically an approach. The main idea of that approach is to help the LLM, the model, provide you with answers on topics it wasn't trained on. Suppose you have a robot, or even think of yourself: when you don't know something, you Google it. You use external sources to feed yourself that information so you can get the answer. Basically, RAG is an approach, a technique, with multiple tools that work together. The main concept is that you have an LLM that was pre-trained on a fixed amount of data, and you would like to ask that model to give you an answer about something that wasn't part of that big amount of training data. And we can go as deep as you want. I mean, there are multiple pieces. 
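To make that concrete, here is a minimal sketch of the flow Eze describes: retrieve external information, augment the prompt, then generate. The helper functions search_knowledge_base() and generate() are hypothetical placeholders standing in for a real vector store lookup and a real LLM call, not any particular library's API.

```python
# Minimal RAG flow: retrieve -> augment -> generate.
# search_knowledge_base() and generate() are hypothetical placeholders
# for a real vector store lookup and a real LLM call.

def answer_with_rag(question: str) -> str:
    # 1. Retrieval: find passages related to the question in an external source.
    passages = search_knowledge_base(question, top_k=3)

    # 2. Augmentation: build a new prompt that includes the retrieved context.
    context = "\n".join(passages)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    # 3. Generation: the LLM answers from the supplied context,
    #    not only from what it saw during training.
    return generate(prompt)
```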

Understanding Retrieval

Katherine Druckman: Yeah, let's talk about those pieces. This is how we learn. Let's break it down. Let's just start with retrieval, right? It's an acronym, “retrieval augmented generation.” Tell us about how the retrieval step works. I know there are a lot of moving parts, even just within those parts. 

Ezequiel Lanza: Yeah, the RAG architecture has a few main parts you can't skip. You will need the retriever, you will need the embedding model, and you will need the augmentation part, where you change the prompt.  

When we talk about retrieving documents, let's suppose you're asking a model for some information, and you're sending a prompt with a question. We always use the same example question. It could be, "Who is my mom?", or something the model doesn't have the information to answer. That will be the prompt. You send it, and, of course, the model will say, "I have no idea what you're talking about."  

So, how could you do it manually, without RAG, just in your head? What if you change the prompt? There was actually a paper published in 2020, I think, from Meta, where they realized that if you change the prompt and put information in the prompt, the LLM can pick up that information and give you an answer. 

And that's where the retriever part gets very interesting. It would be easy if you could just change the prompt yourself, because you already have the information in your mind. But if you want to build a system, you need to find, in your sources, which information is related to the question so it can be added to the new prompt you are building.  

Basically, when we talk about retrievers, we start talking about knowledge bases, where we have our information. Let's suppose you have a PDF in your knowledge base, and you'd like to ask a question about something inside that PDF. 

The retriever will look at your question and, based on that question, go to the vector database, extract the pieces of text that are most similar to it, and create a new prompt. That's the prompt that will be sent to the LLM. So, basically, the retriever has to do this comparison, something called a similarity search: it looks at how similar the stored documents are to the initial question in order to build the new prompt. 
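As a rough illustration of that similarity search, here is a hedged sketch that uses the sentence-transformers library as the embedding model and a plain Python list in place of a vector database (the model name is just one common choice, and a real system would store the vectors in something like Milvus).

```python
# A toy retriever: embed the question, compare it against pre-embedded
# chunks, and build a new prompt from the closest matches.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [                                         # illustrative chunks from a document
    "RAG augments a prompt with retrieved context before generation.",
    "Milvus is an open source vector database.",
    "Chunk size controls how much text is embedded as one vector.",
]
chunk_vectors = model.encode(chunks)               # done once, at indexing time

def retrieve(question: str, top_k: int = 2) -> list[str]:
    q_vector = model.encode([question])
    scores = util.cos_sim(q_vector, chunk_vectors)[0]    # similarity to every chunk
    best = scores.argsort(descending=True)[:top_k]       # most similar first
    return [chunks[int(i)] for i in best]

question = "What does RAG do to the prompt?"
context = "\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nQuestion: {question}"  # the augmented prompt
```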

What are Embeddings?

Katherine Druckman: Let's break it down. Tell us how embeddings work. 

Ezequiel Lanza: Embedding is actually a very old concept. It started with early NLP. An embedding is a way to represent words as numbers. We are working with machines, with algorithms, and usually they don't work with words or letters; they need numbers. So for a phrase, a letter, or a word to be used by a model or a machine, it has to be represented by a number, a vector. Text embeddings go one step beyond that. Let's suppose you have 5,000 words in English; I don't know how many words actually exist in the language. You need to find numbers to represent each word. Cat, woman, man, king, whatever, each has to become an embedding, a vector, a way to represent that word. 

But you can do it in multiple ways. You could say, "Okay, cat is number one, dog is number two," and start numbering all the words in a dictionary. But there is a problem with that. Since these are machines, at some point you will need to do a similarity search to see, for instance, how similar "cat" is to "dog." You need a meaningful way to represent those words. So there are embedding models that were trained on text, and they have this ability to convert, or transform, a word into a vector. And that's really important because, as I mentioned, the PDF, or the document, is stored in our vector database; the way it's stored is that it's converted from text to vectors, and those vectors are stored in the vector database, or the knowledge base, let's say. 

There's a very common example you'll see everywhere with a model called word2vec. It uses "king," "queen," "woman," and "man." If you take the words "king" and "queen" and measure the distance between them, how similar they are, that distance will be about the same as the distance between "man" and "woman." That's what we mean by meaning: all the fruits have similar representations, all the companies have similar representations, and so on. That's basically what an embedding is. 
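For readers who want to try the word2vec example themselves, a few lines of gensim reproduce it; the pretrained model name and download step below are assumptions, and any pretrained word2vec vectors behave the same way.

```python
# Word embeddings capture meaning: related words sit close together,
# and "king" - "man" + "woman" lands near "queen".
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # pretrained word2vec vectors (large download)

# The famous analogy: king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Similarity is just distance between vectors:
# "cat" scores closer to "dog" than to "car".
print(vectors.similarity("cat", "dog"))
print(vectors.similarity("cat", "car"))
```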

Where Does Chunking Fit in All This?

Katherine Druckman: Where does chunking come in? I know we've translated words into numbers. 

Ezequiel Lanza: Yeah. 

Katherine Druckman: Let's talk about where chunking comes into this equation. 

Ezequiel Lanza: Chunking comes in when... Let's take the same example: we have a PDF, and chunking is part of how we convert that PDF so it can be stored in our knowledge base. Chunking is a parameter you set when you are moving from documents to vectors. The chunk size is normally the number of words, or the piece of a paragraph, that we convert into a single vector. For instance, you have a PDF with 2,000 words, and you would like to use chunks of 500 words. So you say, "I would like to divide this PDF into four different parts." Each of those parts is what we call a chunk. We take a chunk, those 500 words, feed it to the embedding model, and get one vector for the entire chunk. That's stored in the vector database, and when you need to do the similarity search, you'll be extracting, or retrieving, that chunk from the vector database. 

That's when things start to get a bit complicated, because how do you decide what a good chunk size is? You are not just embedding single words, as I explained. 

Katherine Druckman: They need a relationship and context, right? 

Ezequiel Lanza: And you need to find a chunk that can capture the information. Chunking by five words is not enough; 1,000 could be too much. So trying to find the right point, the balance, the right chunk size, that's another challenge, and that's when things start to get more interesting. 
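A minimal chunking sketch along those lines might look like the following; the sizes are purely illustrative, since picking good values is exactly the challenge Eze describes.

```python
# Split a long document into overlapping word chunks.
# Each chunk later gets its own embedding and its own entry in the vector database.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap          # overlap keeps context from being cut mid-thought
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + chunk_size])
        if piece:
            chunks.append(piece)
    return chunks

# A 2,000-word PDF with 500-word chunks and a 50-word overlap yields a
# handful of chunks, each of which is embedded as a single vector.
```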

Reranking. How Does It Work? And Why Do We Care?

Katherine Druckman: Okay. So, we have augmented the model's knowledge: we have supplied additional information that the model can now retrieve, and it is retrieving chunks of text that have been converted into numbers and stored in the vector database. Let's talk about another concept: reranking. How does that work, and why do we care? 

Ezequiel Lanza: Well, reranking is… A basic RAG pipeline has some very naive parts. We can do it very simply: we have embeddings, a vector database, retrieval. We take the top five or top three results, build a new prompt, and send it to the LLM. But when we start doing that, we would probably like to check whether the quality of the retrieved documents is good enough. I mean, are we really going to trust the similarity search we just did? Or would we like to, for instance, extract the 10 documents most similar to the initial question and then filter them? Reranking is like an extra filter: another text model that evaluates the quality of the documents coming out of the retriever against the initial question. So it's more like checking, or double-checking, the quality of the retrieved documents, because it's not always so simple. 

For instance, the documents that are retrieved might not actually be related to the question you're asking. You need to detect that, and the way you detect it, or try to improve it, is with this extra step. For instance, you start from the 10 initial documents that you retrieved, the ones the model thinks are similar. Once you have those documents, you can say to the reranker, "Hey, just give me the two documents that you think are most relevant to the question." Those two documents are the ones you put in your final prompt, and that prompt is the one that goes to the LLM. And we can keep adding layers and layers to check, to double-check, or to make it safer, like guardrails. I mean, there are a lot of things that can be added just to verify quality; there's a lot that has to be verified.
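Here is a hedged sketch of that two-stage idea using a cross-encoder from the sentence-transformers library (the model name is one common public choice, not a requirement): retrieve broadly, then keep only the chunks the reranker scores as most relevant to the question.

```python
# Rerank the chunks that came back from the retriever: score every
# (question, chunk) pair and keep only the top few for the final prompt.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, retrieved_chunks: list[str], keep: int = 2) -> list[str]:
    pairs = [(question, chunk) for chunk in retrieved_chunks]
    scores = reranker.predict(pairs)                     # higher = more relevant
    ranked = sorted(zip(scores, retrieved_chunks),
                    key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:keep]]

# e.g. start from the 10 most similar chunks, keep the best 2 for the prompt:
# final_chunks = rerank(question, retrieve(question, top_k=10), keep=2)
```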

Real-World Deployment: Avoiding Vendor Lock-in with OPEA

Katherine Druckman: I appreciate that you mentioned guardrails. That's a necessary conversation too. Okay, so we've learned so much about how the RAG, or retrieval augmented generation, technique works. Tell me, are there some common frameworks you might use to implement this technique? 

Ezequiel Lanza: Well, I think the theory is great, but you can imagine it's a big, complicated thing because we have multiple moving pieces, not just one part. We have the retriever, we have the embedding model, we have the models themselves, and they all have to talk to each other. In terms of software, you can use LangChain; they have a framework, or an API, that helps you build this final pipeline. At the end, when you need to put all the documents together to build a new prompt, LangChain provides a great API, or you can use LlamaIndex to do the same thing. And you also have Hugging Face, of course, because the LLM, the model you'll be using, will live on Hugging Face and you'll be using their API. So there are a lot of parts. What I still see as a big challenge is that we need to deploy all of that; we'd like to make it real. We would like to have these applications. 

Katherine Druckman: Yeah, we want a real app here. 

Ezequiel Lanza: Yeah, it's fine when you have it in a Jupyter Notebook, once you have your prototype, but then you want to start thinking about microservices, cloud native, how to create blocks for those particular pieces, how they talk to each other, what the APIs are. And I think there are great projects like OPEA, which launched about four months ago, and that is its main goal: to make it simple to use those building blocks and to have an end-to-end application, from the retriever and the embedding model to the prompt and the UI for the end user. Having this flexibility… Another thing that is very important: we mentioned vector databases. You can use Milvus, you can use tons of vector databases, and sometimes they have different APIs or different ways to retrieve the documents. If you can treat each of those blocks as an independent part, you can easily switch from one vector database to another, or from one model to another. 

That's also very... and this is what I love about this project…it avoids vendor lock-in, right? Something that we... 

Katherine Druckman: Music to my ears. 

Ezequiel Lanza: We really love to... We try to do that also. 

Katherine Druckman: As open source nerds. Yes, we do appreciate avoiding vendor lock-in. 

Ezequiel Lanza: Exactly. So that part is very interesting. It helps with deployment and gives you the flexibility to use basically whatever you want. We talked about LangChain and LlamaIndex. If you're familiar with LlamaIndex, you can use a block built with LlamaIndex. If you're familiar with LangChain, you can build your own block on LangChain, and they can all talk to each other. You can have the retriever on one platform and the embedding models on another, use Hugging Face, or even use an external API if you want. And that matters because, even six months ago, this was a challenge. Most AI applications are great to prototype, and it's great that we have Jupyter Notebooks, but that step of starting to talk about APIs, about cloud native microservices, about how to actually deploy it… that's the big challenge for AI to really explode in its benefits. 
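As one small example of the kind of building block Eze mentions, here is a hedged sketch of assembling the augmented prompt with LangChain's PromptTemplate; import paths have moved between LangChain releases, so treat it as illustrative, and LlamaIndex offers an equivalent flow with its own abstractions.

```python
# Glue the retrieved chunks and the user's question into the final prompt.
from langchain_core.prompts import PromptTemplate

template = PromptTemplate.from_template(
    "Use the context below to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)

retrieved_chunks = ["...chunk one...", "...chunk two..."]   # from the retriever/reranker
prompt = template.format(context="\n\n".join(retrieved_chunks),
                         question="Who is my mom?")
# `prompt` is what finally gets sent to the LLM, whichever model or serving
# stack (Hugging Face, an external API, a local block) you choose.
```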

So Many Ways to RAG

Katherine Druckman: Fantastic. Yeah, it is. Explode is a good word here. The whole field is kind of explosive these days, which is exciting but also presents hurdles. Thank you. I think people will have learned a few things here. Do you have any parting words for those of us who want to learn more about RAG techniques? 

Ezequiel Lanza: Yes. Great. I do. We talked about similarity search, but the goal of RAG is really to augment the prompt, and you can augment it in very different ways. If you'd like to know more, there are other techniques, such as text-to-SQL. Instead of retrieving documents from a vector database, something in the middle can generate a SQL query, go to the SQL database, extract that data, and create a new prompt. And that's also RAG. So, text-to-SQL is one. Another is RAFT, which is a mix between fine-tuning and retrieval. There are tons of things. And the good thing is that we can find something new almost every week. I don't want to say every day. But… 

Katherine Druckman: Yeah. It's wild, isn't it? Yeah. 

Ezequiel Lanza: ... every week we have something different. And... 

Katherine Druckman: It's exciting. It's hard to keep up with. 

Ezequiel Lanza: Exciting. It's very exciting. Yeah. 

Katherine Druckman: Yeah, it's really hard to keep up with. And to that end, I have to plug: you should follow Eze on LinkedIn. That's how I keep up with all of this, frankly. Eze shares all the good stuff. I try to share some security stuff, but if you're really trying to keep up with all the excitement around AI and AI development, that's where I go for my information. 

Ezequiel Lanza: Yeah, I try to do what I can. I like to share what is new and what is hot. Sometimes it moves so fast that what we post one day is old the next. But the main thing, the good thing, about the community is that people share what others are doing instead of reinventing the wheel. And that's really different from 10 years ago, or five years ago, when I started out; we had nothing. Now we have tons of papers, a lot of information to learn from, and people working on all of this. So it's a great moment for AI, I think. 

Katherine Druckman: Yeah, it's an interesting part of the growth cycle, right? We're starting, or starting to solidify, the process of really establishing what the best practices are and what everyone else is doing. How are we doing this? What's the best way? So yeah, exciting times. Well, cool. Thank you so much. And obviously, go find Eze on the internet; I'm sure you're available to answer questions if people have them, right? 

Ezequiel Lanza: Absolutely. Yeah, absolutely. Yes, reach out; I'm more than happy to help or give some guidance. 

Katherine Druckman: Awesome. Until next time, we will keep this conversation going. 

About the Guest

Ezequiel Lanza, Open Source AI Evangelist, Intel

Ezequiel Lanza is an Intel open source AI evangelist, passionate about helping people discover the exciting world of AI. He's also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools. He holds an MS in data science. Find him on X and LinkedIn. 

About the Host

Katherine Druckman, Open Source Security Evangelist, Intel

Katherine Druckman, an Intel open source security evangelist, hosts the podcasts Open at Intel, Reality 2.0, and FLOSS Weekly. A security and privacy advocate, software engineer, and former digital director of Linux Journal, she's a long-time champion of open source and open standards. She is a software engineer and content creator with over a decade of experience in engineering, content strategy, product management, user experience, and technology evangelism. Find her on LinkedIn.