
October 28, 2025 • 45 mins

Join the Tool Use Discord: https://discord.gg/PnEGyXpjaX


Is your RAG (Retrieval-Augmented Generation) pipeline optimized, or is it even necessary? We all know context is king for LLMs, but large context windows might not be the answer. On Episode 63 of Tool Use, we deep dive into RAG with Apurva Misra, founder of Sentick. We explore the entire RAG workflow, from creating and optimizing embeddings to choosing the right vector DB (like Postgres with pgvector). Apurva explains the critical role of re-rankers, the power of hybrid search (combining semantic and keyword search), and when to consider agentic RAG. We also cover the essential steps for taking your RAG system to production, including data quality, feedback loops, and safety guardrails.


Connect with Apurva Misra:

LinkedIn: https://www.linkedin.com/in/misraapurva/

Consulting: https://www.sentick.com/

Website: https://apurvamisra.com/


Connect with us

https://x.com/ToolUsePodcast

https://x.com/MikeBirdTech


00:00:00 - Intro

00:01:18 - What is RAG (Retrieval-Augmented Generation)?

00:03:41 - Is RAG Dead? Large Context Windows vs. RAG

00:06:40 - The RAG Workflow: Embeddings, Chunking & Vector DBs

00:28:00 - Do You Need a Re-Ranker?

00:41:24 - Production RAG: Safety, Guardrails & Efficiency


Subscribe for more insights on AI tools, productivity, and RAG.


Tool Use is a weekly conversation with the top AI experts, brought to you by ToolHive.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
For that you have to narrow it down, and you give a bigger LLM
or an orchestrator access to these tools or these sub-agents.
And then based on the user query, it's going to decide
what to call, when to call, how many times to call, and when to
stop and give the answer back to the user.
That's where semantic similarity doesn't work that
well, and you would want to use
keyword search; a combination of both is best.

(00:21):
Always look at the data, figure out what the issue is.
And that is where I was saying, whatever you build is not
going to be the final solution, because you will have to keep
working on it. It created a nice feedback loop
for them to keep improving their documentation and make it more
diverse. It's obviously not trained on
your proprietary information that's within your company, so
you would want to use a RAG system there.
Are you using RAG? And if so, are you using it
correctly? We all know that context is king

(00:44):
and RAG is one vital component of optimizing your context.
But are you using it right? Do you even need to use it at
all? On episode 63 of Tool Use,
brought to you by ToolHive, we're joined by Apurva Misra.
She's the founder of Sentick, an AI consultancy that helps
companies of all sizes with their AI strategy,
along with building AI solutions. We're going to deep
dive into RAG. Do you need a re-ranker?
Do you need a cache? What are some non-technical

(01:05):
solutions to allow everyone to benefit from this?
So please enjoy this conversation with Apurva.
I'm hoping everyone has used ChatGPT.
That's why AI is hyped up right now.
Everyone knows about AI thanks to ChatGPT.
So if you want to use AI in your system, and if you're thinking of
building a customer support solution, for example, you have

(01:29):
the HubSpot documentation, you have the HubSpot knowledge base,
you have made a bunch of YouTube videos, and you want to
provide a layer on top of that that people can come and
ask questions to. And it could look up from
these sources and answer the people. That is a RAG system. Or
basically, any sort of sources that you have

(01:50):
inside your company where you want to build a layer on top so
that users can communicate with it in natural language — that
I would call a RAG system. And it's not just
communication, it's not just conversation.
A RAG system could be just Q&A: asking a question,
getting an answer, or filling out an RFP document.
So basically, a RAG system is: you

(02:13):
have a lot of documentation, you have a lot of videos, you have a
lot of knowledge — proprietary knowledge that the LLM is not
trained on — but you want the LLM to utilize that to answer.
And that's where you build a RAG system.
In simple terms, if you think about it, you can go and
attach a PDF document in ChatGPT and ask questions, ask it to

(02:36):
summarize it, ask it to do a lot of things with that PDF
document. But what if that PDF document is
really long, or you have a lot of PDF documents and you want to do
the same thing with them? That's where you would use a
RAG system. So what it does is it breaks
down your documentation into chunks — pieces — and then it
would go embed them and save them in a vector DB.

(02:57):
And then when you ask something, it would go and find the
relevant chunks and then run the LLM layer on top of those so
that it's able to summarize them or answer a question out of
them. So that's what a simple RAG
system does. And this is why you need a RAG
system. Put simply,
a RAG system is basically providing the LLM the

(03:17):
information that it doesn't have, so that it can answer your
questions or do a task for you based on this information.
But because there's something called a context window, there's a
limit on how much knowledge it can go over at one time.
You would want to use a RAG system so that it
breaks the knowledge down and embeds it, and does a search over it
and gets you the relevant chunks when they're needed to answer your

(03:40):
query. Excellent.
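The simple flow just described — chunk, embed, store, then find the relevant chunks for a query and hand only those to the LLM — can be sketched in a few lines. This is a toy illustration, not the system from the episode: `embed()` here just counts words so the example runs without a real embedding model or vector DB.

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words vector.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    # Similarity between two sparse word-count vectors.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Vector DB": each chunk stored alongside its vector at ingestion time.
chunks = [
    "To reset your password, open account settings.",
    "Invoices are emailed on the first of each month.",
    "Our API rate limit is 100 requests per minute.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, top_k=2):
    # Embed the query, rank stored chunks by similarity, keep the best.
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# The retrieved chunks become the focused context for the LLM call.
relevant = retrieve("how do I reset my password?")
prompt = "Answer using only this context:\n" + "\n".join(relevant)
```

A production system would swap `embed()` for a real embedding model and the list for a vector database; the shape of the loop stays the same.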
Yeah. And there's a few things I want
to go a bit deeper into. But one thing that comes up on
the Internet a lot and you hear the controversy around it is,
oh, RAG is dead. We have million-token context windows.
What are your thoughts on the ability for us to just cram
everything into the context and let the AI figure it out, versus a
RAG system which is going to more granularly select data to

(04:00):
inject into the context? So there's a paper which has
come out, and in there there's a graph which shows that even though
the context window is going up, if you look at the performance
of these models with these bigger context windows, after
a certain length, like 200K, the performance
plateaus. So it doesn't improve.

(04:20):
So I guess it's going to keep
increasing — the context window is going to keep increasing.
But with how the current LLM architecture is set up, it's
still not going to perform as well even though the context
window is like a million tokens long. If you put anything
more than 200K tokens, it's not going to perform as well

(04:41):
because it's not able to focus. LLMs are — I keep
telling my audiences — like they have ADHD.
If you put too much in the context, they get confused.
And with any enterprise company, you know,
there's so much that you have to explain in the context.
For example, if you're a trucking company, you will
explain what a DOT number is — explain anything
that's proprietary to your company that it should know, any

(05:01):
of those. And then you would
pull the relevant chunks from the documents.
Then you put in some sort of cautionary prompt:
don't answer these kinds of queries.
And so the prompt gets so long, and there's so much that the LLM is
doing, that it's not able to focus.
So it's very important to just give it the important
chunks instead of giving everything.
Because when you start building a production-ready system,

(05:22):
you would not think it, but the prompt gets really big.
It becomes really big when you're trying to explain
everything, trying to do a complex task, giving it
examples — few-shot learning. So it grows really big.
So it's always important to keep the context
focused on what you're trying to do.
You would want to pull only the relevant stuff instead of
giving everything in there. And to your previous

(05:44):
question that you asked — why would business leaders
use RAG — the other piece, which I
mentioned (it is a slide in a presentation that I have),
is that you want to use a RAG system because of these LLM
models. If you think about the recent model, I think it got
released by Anthropic — it was trained on data till
January 2025. So if you are, I don't know, a

(06:06):
news company or something and you want the latest information,
you would want to use a RAG system, because it doesn't know
what happened last week. So it doesn't have that — unless
you provide it the search tool, it has to go search.
Otherwise it doesn't have that knowledge within the model.
So that is where a RAG system comes in handy.
The other is your proprietary knowledge, you know —

(06:27):
so these are two important things.
It's not trained on the latest information —
you would have to provide it the tools, like web search.
And it's obviously not trained on your proprietary
information that's within your company.
So you would want to use a RAG system there.
One thing I want to touch on that you brought up is
embeddings. So I have a collection of PDFs,
from company policies to user interviews.

(06:47):
Just a hodgepodge of different things.
If I take those and want to embed them, what's the workflow
like? Maybe give some examples for if
you have an engineering team, so you can build something in-house,
or for the less technical, an online or a digital service.
How do I take my PDFs and turn them into this magical format
that LLMs can ingest? So embeddings are basically
numbers. So if you think about it, it's

(07:09):
basically a vector of numbers, so an array, and you have
numbers inside that. So it could be like generally
models are like 1536 length long.
It depends on the embedding model that you're using.
So how long of a vector it givesout.
So you can give it like a chunk.So embedding models also have a
context window. So you can give it a chunk of
like a certain limit and then itwould give you a vector for that

(07:32):
chunk. So the pipeline for this
ingestion would basically be you ingesting your PDF documents and
then breaking them into smaller pieces.
In some cases, you don't even have to break them into smaller
pieces. For example, when I
was working with a startup to build a customer support bot for
them — their HubSpot knowledge base — they were

(07:53):
answering questions that the users were having by writing
documentation for them. So for one question, they
have maybe a one-pager or a two-pager, and you
don't have to break it into pieces.
You can just put that into an embedding model and
get a vector out of it easily. So it depends on your use case
and your kind of documentation — how you want to chunk
it — but you would want to chunk it so that it makes

(08:15):
sense. For example, if the PDF is small
enough, please don't chunk it. If the PDF is really big
and you have to chunk it, it's better if you overlap the
chunks. For example, the PDF is this
long and you're chunking it — might as well overlap it a
bit. You have like a 50-token
overlap, so that this chunk has the context of the

(08:36):
previous chunk. And also, you might cut it
midway through whatever it's trying to say,
so maybe do the breaking at a new line or where there's a
paragraph change — something like
that. Chunk it thoughtfully.
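The overlap idea above can be sketched as a simple word-window chunker. A sketch only: real pipelines usually count tokens rather than words, and prefer breaking at paragraph or sentence boundaries as described.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into windows of chunk_size words, each sharing
    `overlap` words with the previous chunk so context carries over."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

With a 200-word window and a 50-word overlap, every chunk repeats the last 50 words of the one before it, so a sentence cut mid-thought still appears whole in at least one chunk.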
The other thing: there is something from a
blog post that got released by Anthropic.

(08:57):
It's called contextual retrieval.
So if you are in a domain which is more complex — for example,
healthcare — you can also use an LLM to summarize what the whole
document was about and put that at the top of each chunk:
this chunk is number 2 in this PDF, and this document was
talking about Alzheimer's. And then you do the

(09:20):
embedding, so that it has some context in there of the whole
document and it's not lost.
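That contextual-retrieval step can be sketched like this; `summarize()` is a hypothetical stand-in for the LLM call that would write the document-level summary.

```python
def summarize(document):
    # Hypothetical stand-in: a real system would ask an LLM for a short
    # summary of the whole document here.
    return document.splitlines()[0][:100]

def contextualize_chunks(document, chunks):
    """Prefix each chunk with document-level context before embedding,
    so no chunk is embedded without knowing what the document is about."""
    context = summarize(document)
    return [
        f"[Document context: {context}] [Chunk {i + 1}] {chunk}"
        for i, chunk in enumerate(chunks)
    ]
```

The embedding model then sees "this is chunk 2 of a document about Alzheimer's" rather than an orphaned paragraph.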
What I'm trying to get at is, it's very simple.
You chunk it, you put it in an embedding model
and create a vector out of it, and you store that vector in a
vector DB. That's what an ingestion
pipeline is, simply. But it can be very complex when
you're working in a production system, and in whichever domain

(09:40):
you're working in, you would have to play around
with these parameters and figure out what works best for you.
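The ingestion pipeline summarized here — chunk, embed, store — fits in a few lines. A sketch only: `embed_chunk()` fakes the embedding call, and the "vector DB" is a plain list of records.

```python
def embed_chunk(text):
    # Stand-in for a real embedding model call; returns a tiny fake vector.
    return [float(len(text)), float(text.count(" "))]

def ingest(documents, chunk_size=200):
    """documents: {doc_id: full text}. Returns one record per chunk,
    each carrying its vector plus metadata about where it came from."""
    db = []
    for doc_id, text in documents.items():
        words = text.split()
        for i in range(0, len(words), chunk_size):
            chunk = " ".join(words[i:i + chunk_size])
            db.append({
                "doc_id": doc_id,             # metadata: source document
                "chunk_no": i // chunk_size,  # metadata: position in it
                "text": chunk,
                "embedding": embed_chunk(chunk),
            })
    return db

db = ingest({"policy.pdf": "word " * 450})  # 450 words -> 3 chunks
```

The parameters worth playing with are exactly the ones in the signature: chunk size, what goes into each record, and which embedding call sits behind `embed_chunk()`.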
The next thing is the embedding model.
Does the embedding model make sense for your use case?
For example, you can use an embedding model —
OpenAI has a bunch; other companies also have a
bunch of them. There are open-source embedding
models, like the Nomic ones, and also there are multimodal

(10:03):
embedding models. So if your PDF has images, you
can embed them in the same space, or you might want to do this:
take the image, create text out of it, and
then embed that. So you can go multiple
ways. But the other thing is the
embedding model that you're using. For example, if you're
building a system which is focused on customer churn, for

(10:24):
that system, a sentence like "I want to subscribe to your
service" and "I don't want to subscribe to your service" are
two very separate sentiments, and you want them to be far apart
in the embedding space so you can filter by it.
So that's where you might want to train your own embedding
model — you can fine-tune your own embedding model as well,

(10:47):
based on your requirements. But otherwise, mostly I've seen
these embedding models work fine; just sometimes there's a
need to fine-tune your own embedding model.
You mentioned some of the metadata, like the document
that the chunk was found in and whatnot.
Is there a risk of putting too much metadata when you're
working with some embeddings? Or is it ideal to try to
add that extra context so the model knows where this chunk
came from? I would say it's never too much.

(11:09):
So one is the contextual retrieval piece that I was
describing: you put the summarized
stuff at the top of the chunk and then you do the embedding.
The other thing is the metadata. Metadata is things like: what was
the file name this chunk is from? When was it last modified —
so date modified or date created? That could be one
thing. And then, what's the page number?

(11:29):
Metadata mostly depends on
your use case. For example, at the healthcare
company that I'm working with recently, the queries
they are getting are of two
kinds. One is the Q&A — that is like your
typical RAG. People have a question, and
you can find answers inside the documents, so you have to
do the retrieval: you have to pull the
relevant chunks and then answer the query.

(11:50):
The other kind of questions that they are getting is file-search
questions: in this folder, how many documents do I
have? How many HR documents do I
have? Can you find me this vendor's
document? So that's where you
figure out you need more metadata.
So having more metadata is not a bad thing.
It lets you filter first based on the metadata, and then

(12:11):
you do the retrieval. In that case — when we do the
ingestion pipeline, see, now it's getting more complex —
when you focus on the ingestion pipeline, you do an
LLM call and ask it: based on this document, what do you
think — is this an HR document, or is
this an IT document? And that would be a
metadata field as well: it's an HR document.
What page number? So it depends on the use
case, the kind of metadata you want to save.

(12:31):
But more metadata is not a bad thing.
So when you're doing the retrieval, what you would do is,
based on the query that the user has, maybe you do
metadata filtering first. OK, the user wants to get
documents from last month. So I'll filter by date to get
those documents, and then I will
do my semantic search — that is, embedding matching — and
get those relevant chunks.
So it actually helps you with latency as well,

(12:54):
because you're doing semantic similarity on a smaller number of
documents, and it also helps you improve the results.
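The filter-then-search order just described can be sketched in plain Python. (With Postgres and pgvector, the same idea is a `WHERE` clause on the metadata columns plus an `ORDER BY` on the vector distance; the records and dates below are made up for illustration.)

```python
from datetime import date

def search(records, query_vec, since, top_k=3):
    # Metadata filter first: fewer rows to score means lower latency.
    candidates = [r for r in records if r["modified"] >= since]
    # Then semantic similarity (dot product as a stand-in for cosine).
    def score(r):
        return sum(a * b for a, b in zip(r["embedding"], query_vec))
    return sorted(candidates, key=score, reverse=True)[:top_k]

records = [
    {"text": "old doc",   "modified": date(2024, 1, 5), "embedding": [1.0, 0.0]},
    {"text": "new doc A", "modified": date(2025, 9, 2), "embedding": [0.9, 0.1]},
    {"text": "new doc B", "modified": date(2025, 9, 9), "embedding": [0.1, 0.9]},
]
hits = search(records, query_vec=[1.0, 0.0], since=date(2025, 9, 1))
```

The old document never gets scored at all, which is exactly where the latency win comes from.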
I really like that, and I think we'll come back to it, because
the idea of optimizing these pipelines is super important.
But one thing I want to bring up before I forget is, you mentioned
the potential for fine-tuning an embedding model, and I actually
haven't interacted with anyone that's gotten to the point where
they need it. But just for evaluating the
existing embedding models, do you have any strategies for

(13:15):
testing to see how this actually works?
Do I need to explore other ones? How do you go about
evaluating an embedding model? You would evaluate the embedding
model based on the retrieval: is it doing the retrieval how
I expect it to for my use case? For example, I was giving the
example of subscription versus not-a-subscription.
So what I would do is choose an embedding model —

(13:38):
there's the MTEB ranking on Hugging Face.
You can find that, and you can filter it — they have
nice filters, like if you're doing classification, how
the embedding models are ranked, which language you're focused
on, which are multimodal. So it's a pretty nice
leaderboard. You can find an embedding
model from there. And once you get that embedding
model, what you would want to do is embed your stuff —

(14:00):
embed a few documents — and then you will want to type
a query against that and see how it's retrieving from the vector DB
based on those embeddings. And if those embeddings do not
make sense, you might want to play around with the
chunks and such. But if it does not make sense
for your domain — how you want the embeddings to look —
you can obviously plot it as well in 2D;

(14:24):
you are basically compressing the high-dimensional thing into
two dimensions. If it doesn't make sense for
your use case, you might want to fine-tune or go check
out some other embedding models. So
visualization is one thing; the other thing is honestly testing it
out. And then the leaderboard
is really helpful for figuring out which embedding
model to use. For the vector DB that you

(14:46):
mentioned, there are a few big providers.
What do you recommend people explore when they're looking for
a vector DB? Are there certain criteria you
have for different use cases? Or how can people accelerate
finding the right vector database for their use case?
So whenever I'm working with a company, the first thing I do is
look at the current tech stack, and if they're using
Postgres, might as well just add the pgvector extension and use

(15:08):
that as the vector DB. You wouldn't want to start
using a new technology unless it's needed.
And generally I've noticed that's enough —
using Postgres with pgvector is enough.
But I have realized Pinecone
has a nice UI. So for anyone who is new and
working in this technology, Pinecone is amazing that way.
Very nice UI. It helps you understand:

(15:30):
in the UI itself, you can do top-K — put in the ID of
whichever chunk you want to find similar chunks to, and you can
do a top-K search. Very nice UI there.
Then you have a bunch of other vector DBs as well.
But what I've realized is, unless you have millions
and millions of documents and latency is becoming a big

(15:50):
deal, you don't have to look for specialized solutions.
Postgres with pgvector is good enough.
And honestly, if you're using any of these technologies
already, might as well stick to it.
For example, if someone is using Postgres, I would
just add pgvector and start using it.
I would not go and find a new vector DB and try to
add that into their tech stack. And I'd echo that — I use Postgres

(16:11):
as my database at box 1, and we just set up pgvector.
So it works. It's right there.
You don't have a new stack. It's just reducing complexity.
And especially since these things do sound complex,
I did want to give a quick shout-out:
we had a previous guest on, Kirk Marple.
He has a company called Graphlit that kind of abstracts all this
with a nice drag-and-drop interface.
And it's a nice way to get started, but you do lose a
lot of control if you're not pulling it in-house.

(16:32):
But different needs, different use cases.
Do you have an idea or a heuristic around when people
should go for one of those self-hosted — or sorry, hosted,
managed — services versus bringing things in-house?
It depends on how much money you have and how many resources
you have in-house. If you have a lot of engineers
in-house and you can utilize them, you can do it in

(16:55):
house. Otherwise, if you have a lot
of money, get a managed service. Managed services are always
easier, obviously, and — right now, Pinecone
introduced serverless, for example — it's quick and cheaper as well.
So it's a trade-off between how much money
you have and the resources you have in-house.
Otherwise, I feel like Postgres with pgvector, since

(17:17):
you're already maintaining Postgres, is just an
add-on onto it. So it's not a lot of effort in-
house either. One thing that comes up a lot is
the quality of data — garbage in, garbage out.
It's very important that you try to add the context where
possible. Do you recommend people doing
their first RAG pipeline just dump the data in?
Like just copy and paste all the PDFs into their workflow?

(17:40):
Or should there be some degree of preprocessing, and if so,
what should they look for to try
to improve the quality of their data?
So yeah, you have to do a bunch of things there, but all my
answers are going to be "it depends."
It depends on your use case. But for example,
deduplication is one thing. So when you're working with
enterprises, deduplication is the most important.

(18:02):
The other is also RBAC — who has access to what?
You want to save that as well, because you don't want to refer
to documents that this person doesn't have access to while
answering their query. So that's an important piece as
well. And the preprocessing —
you were saying garbage in, garbage out — the
preprocessing piece is a lot of the typical NLP
pipeline where you do cleanup before it goes in, but not a lot

(18:26):
of it either. You wouldn't want to do
stemming, because you're breaking down words —
these were techniques we used to use earlier.
But now these models are able to understand that the words
"playing" and "play" mean the same thing.
It would completely depend on the use case, the kind
of preprocessing it needs.
But generally I've noticed it's not a lot —
the cleanup in the text is not a lot.
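The deduplication step mentioned above can be sketched by hashing normalized text. A sketch only: this catches exact duplicates (modulo whitespace and case); near-duplicate detection across an enterprise corpus needs heavier tools like MinHash.

```python
import hashlib

def dedupe(chunks):
    """Keep the first copy of each chunk, dropping exact duplicates."""
    seen, unique = set(), []
    for chunk in chunks:
        # Normalize whitespace and case so trivially different copies match.
        normalized = " ".join(chunk.lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```

Running this before embedding keeps the same paragraph from showing up three times in the top-k results.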

(18:46):
It's always nice when things are easier, but say people wanted to
adjust their strategy moving forward.
You'd mentioned the company that documented their interactions —
like customer support interactions — to create
documents out of that, and that can be fed in, whether as a
playbook or just as general knowledge.
Do you have any other advice for people who are looking to
improve their data collection or their data structure for these

(19:08):
systems? Is there anything they can do to
update their current
processes, or things to keep
in mind to make sure that the
data they're putting into the system is optimized?
It's like the meme: they want to
build AI agents, but then they go in and look at the
data and it's a mess. So, the customer support
example that I was talking about earlier that you mentioned —
the way it started, the thing was, the

(19:30):
CEO there. What he noticed was,
anytime a new question comes in, someone writes an e-mail
back and that's it. None of that
is saved anywhere, and a
similar question would come in and they'd
do the same process again:
someone goes and looks for something, figures out
the answer, and sends it back. So he was like, OK, you'd better
write a document and send a document link instead of
writing a new answer every single time in the e-mail.

(19:53):
So that's how it all started. So when I got in —
they started with YouTube videos initially, and then they realized
it's very hard to update the YouTube videos,
because with startups, what happens is the interface
is changing so quickly: every three days their
interface looks different.
A new button got added, a drop-down thing changed, and so those
screenshots and everything — if they made a video, you

(20:14):
would have to delete the video and make it again.
But with a document, it's easier to update —
you can change the screenshots and whatever makes
sense. So they had a bunch of
YouTube videos, some of them outdated,
and then they started putting stuff in the HubSpot knowledge
base. So when I came in, they had a
very nice selection of like 500-odd documents in the HubSpot

(20:34):
knowledge base and a bunch of YouTube videos.
So, typical ingestion pipeline — I built a solution, and then what
we realized is a conversation is not something that's useful
here. No one wants to talk to a chatbot.
So, a Q&A system: people can
ask a question and they would get an answer, and if the answer
is not suitable, there was a button where they could just say
"connect to support" and it would automatically send an

(20:56):
e-mail, so a human on the support
team would be looking at it. Also, what we added was a thumbs
up / thumbs down, which everyone has seen in all these
interfaces. What it did was: if a
question comes in and we send an answer back, and someone says
"connect to support" or puts a thumbs down, we would
take that into account and try to figure out what's up with the

(21:17):
document. Was the answer correct?
What was the confidence score?
Why were they connected to a human for this?
And with that, we could improve the documentation for
them. So this whole flow was a
feedback loop for improving their documentation.
Like, we figured: oh, it did pull the right document,
the confidence level was high, but the answer was still not
correct. That means the document was
expired — the source reference was wrong.

(21:40):
So we need to go work on this document.
So it created a nice feedback loop for them to keep
improving their documentation and make it more diverse.
Sometimes there were no matching documents,
so it was a "sorry," and that's why they connected to human
support. And the feedback loop is exactly
what you mentioned with the iteration.
You're not going to nail it the first time.
There are way too many edge cases. Do you have any advice on how
people should try to implement that?

(22:01):
Like if they click the thumbs
down, do you just inject that
into a Google Sheet? Is there custom software written
that helps collect and aggregate all of the incorrect use cases?
Or how does that part of the flow work?
In this case I had built custom
software, because the customer support team wasn't that
technical. So we wanted it to be easy for

(22:21):
them. They could see the query, they
could see the answer, they could see what the confidence
score was. And if they want, they can click
and see the chunks that
were retrieved. They could write their own
comments in there and put in
the link of a new document. So that was the whole interface
that I had built for them. But it could be as simple as an
Excel sheet, honestly — just keep tracking what's happening
with the system, put it in an Excel sheet, and please look at your

(22:43):
data. If you can look at your data,
that's the best.
Always look at the data, figure
out what the issue is. And that is where I was saying:
whatever you build is not going to be the final solution,
because you will have to keep working on it.
Especially with AI, because it's so reliant on
the data. And the thing with data is, it

(23:06):
gets outdated. So you need to keep working on
keeping the data up to date so that the AI can rely on it.
So we find an incorrect answer and update the documentation.
How then do you update the embeddings?
Do you delete embeddings related to the old document?
Do you replace them with a new one? Do you just throw it in and
say re-embed this? How does that part work?
That's a big question. So it depends, but sometimes

(23:30):
the easiest thing to do is delete the embeddings and just
create new ones. It's costly,
obviously. In this case, the useful thing
was that the HubSpot knowledge base gives you date modified.
So every — I think every
three days or every week — the ingestion pipeline was running,
and it would go and see whether this old document was modified or not.

(23:53):
If it was, we were recreating the embeddings.
So that made our life easier. It would depend — in your
case, whatever the sources are, if they have something
like this, you can utilize that, and you don't have to re-create the
embeddings for the documents which did not get modified.
Sometimes what happens is you need to create all new
embeddings. Sometimes it gets more complex:
when it's very costly for your system, then you would

(24:15):
have to figure out: OK, there were these changes in this
particular document, so I have to go look at all
the chunks and do the embedding only for those chunks,
or I'll just do the embedding for that one chunk.
But then the structure of the document changes.
For my personal uses, what I
have found is, I have tried to embed all the chunks.
If there is a marker like this available — date modified — from the

(24:38):
source, that's very useful. Nice.
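The date-modified trick described above amounts to diffing the source's timestamps against what was recorded at the last ingestion run. A sketch with illustrative field names:

```python
def docs_to_reembed(source_docs, stored_dates):
    """source_docs: {doc_id: date_modified reported by the source
    (e.g. a knowledge base)}; stored_dates: {doc_id: date_modified
    recorded at the last ingestion run}. Returns the doc ids whose
    old vectors should be deleted and re-created."""
    changed = []
    for doc_id, modified in source_docs.items():
        if stored_dates.get(doc_id) != modified:
            changed.append(doc_id)  # new or modified since last run
    return changed
```

Everything not in the returned list keeps its existing embeddings, which is where the cost saving comes from on each scheduled run.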
Yeah, just another reason to
make sure you include all the
metadata. Is there a risk — and this is
just general curiosity, going off track a little bit —
with embedding models? We all know with LLMs, they're non-
deterministic, and sometimes the same input results in different
output depending on how things are processed.
Do we risk the same with embeddings, where if I re-embed

(24:59):
vectors out, or is there a risk that it could change?
No, you won't get the exact same vectors out.
It's probabilistic, but they would pretty much mean the same
thing — they would be
around the same space.
That's why, generally, the measure that we
use for semantic similarity is something called cosine
similarity. What it does is it checks

(25:19):
the angle between two vectors in that high-dimensional space.
Picture a much simpler 2D space:
if you have these two vectors, it will check the angle
between them. So even if one moves a
little, the angle is still pretty much the same —
they are similar documents, so it would still be able to match them.
For the cosine similarity, are there best practices for people
to determine what the correct value is for the use case?

(25:40):
Or is it just trial and error — to say, for this given query we want
to get X number of documents that are relevant, so we'll just
keep tweaking it till we get the right number?
Or is there a good default setting? How do you
approach cosine similarity? So it's actually a bit
of trial and error there. For example —
we didn't talk much about the re-ranking module — but you pull

(26:03):
like 20 documents or 50 documents from the chunk.
It depends like how your chunks are, how many you want to pull.
Like if the chunks are very tiny, you want to pull more documents. And then you send it to a re
ranking module. And what that would do is like
re rank them based on like how relevant they are to this
particular query. And then you can choose like the

(26:24):
top 10 from there or top 20 from there based on the re ranking
score. The other thing also is like
sometimes what we do is like instead of just pulling like top
50 documents, how we do the pull is very different.
Like if it hit this chunk and found it was similar to the query with more than 0.8 as a confidence score, we might want

(26:44):
to pull the chunks underneath it and above it so that we get the context.
So it depends how the chunking was done, how the embedding was done, what you want to pull, and then you would run it by the reranking module and see how it's performing.
Are you getting the relevant chunks or not?
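That pull-the-neighbors idea could be sketched like this; the chunk list, the scores, and the 0.8 threshold are all illustrative:

```python
def expand_with_neighbors(chunks: list[str], hits: list[tuple[int, float]],
                          threshold: float = 0.8, window: int = 1) -> list[int]:
    """For each hit above the confidence threshold, also pull the chunks
    directly above and below it so the LLM gets the surrounding context."""
    selected: set[int] = set()
    for idx, score in hits:
        if score >= threshold:
            lo = max(0, idx - window)
            hi = min(len(chunks) - 1, idx + window)
            selected.update(range(lo, hi + 1))
        else:
            selected.add(idx)  # low-confidence hit: keep just the chunk itself
    return sorted(selected)

chunks = ["intro", "setup", "key finding", "caveats", "conclusion"]
hits = [(2, 0.91), (4, 0.55)]  # (chunk index, similarity score), made up
print(expand_with_neighbors(chunks, hits))  # [1, 2, 3, 4]
```

The high-confidence hit drags its neighbors in; the weaker hit comes in alone.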
For anything you do here where I'm saying you want to do trial and error, you would want to begin with a

(27:08):
golden data set like these are the queries, these are the
correct answers. These are the documents you want
to pull these correct answers from.
And then when you're doing a top ten, are you checking these three documents?
Were they pulled in the top 10 or not?
You know, and what was the confidence score like?
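That top-10 check against a golden set is usually measured as recall@k; a minimal sketch, where the file names are made up:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the gold documents that appear in the top-k results."""
    found = sum(1 for doc in retrieved[:k] if doc in relevant)
    return found / len(relevant)

# Hypothetical golden entry: the documents a correct answer needs.
gold_docs = {"policy.pdf", "faq.md"}
retrieved = ["policy.pdf", "pricing.md", "faq.md", "blog.html"]

print(recall_at_k(retrieved, gold_docs, k=3))  # 1.0: both gold docs in the top 3
print(recall_at_k(retrieved, gold_docs, k=1))  # 0.5: only one of them at rank 1
```

Averaging this over the whole golden query set is what tells you whether a threshold or top-k change actually helped.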
So you want to do this for like 100 queries and then you would

(27:31):
you would figure out what the right parameters are for your use case.
Another way to give AI systems
access to your data is by using MCP, and that can be a little
bit scary, but I've been using Tool Hive.
Tool Hive makes it simple and secure to use MCP servers.
It includes a registry of trusted MCP servers.
It allows you to containerize any other server with a single
command. You can install it in a client in seconds, and secret protection and network isolation are built

(27:52):
in. You can try Tool Hive today.
It's free and it's open source, and you can learn more at toolhive.dev.
Now back to the conversation with Apurva.
Could we go a bit deeper into reranking? I'm kind of curious.
Are rerankers kind of mandatory for a successful RAG pipeline?
How do they actually influence the data flow?
Do they change the embeddings, or how do rerankers work?

(28:13):
So I would suggest if you're building a RAG system from
scratch, don't use a reranker, just keep it very simple to
begin with. But generally like in production
systems based on the domain, we have to use a re ranker.
So basically what a re ranker is.
So think about the embedding ingestion pipeline that we discussed till now.
You know, you have a query... no, you don't have a query yet, you have documents, and you are

(28:36):
embedding a document using an embedding model and you're
saving the embeddings in a vector DB, right?
And when a query comes in, you're using the same embedding
model, creating a vector and trying to find similar vectors
in the vector DB, pulling the right chunks based on the
similarity and using that to answer the query right.
With re ranking, what we're doing is it's again an embedding

(28:57):
model. But what it does is it takes the query as well as the documents, the chunks, the embeddings that have come in, and then tries to re-rank them based on this query: which document is relevant.
So what the change here is: in our previous system, the documents were going separately, getting the embeddings created

(29:17):
and getting saved. The query was going separately
into the embedding model, getting its embedding created.
In a reranker, the query and a bunch of documents go together
and they get reranked based on like how similar they are to the
query, like how relevant they are to the query.
And then that's the result that you get out.
So this is the cross-encoder versus bi-encoder distinction.
And that's the difference between doing the embedding

(29:40):
separately and then using a reranker.
So a reranker is costly. Why is it costly?
Once you get a new query from a user, you cannot run a reranker on all the documents.
That's why you run a retrieval step before, to narrow it down to like 20 documents, and then you run a reranker on top of it because
it's costly and time consuming. And then you get like the
reranked documents and you can like do a filtering there and

(30:03):
send it to the LLM. Does it make sense?
So you have the retrieval first narrow down the documents and
then you send it to a reranker. It would rerank based on the
query, and then you would filter it again and send it to the LLM. OK.
So it's kind of like take top 20, rerank it, get top three and
then you get a higher likelihood of success.
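That take-top-20-then-rerank funnel could be sketched like this; the two scoring functions are toy stand-ins (a real system would use an embedding model for stage one and a cross-encoder reranker model for stage two):

```python
def cheap_score(query: str, doc: str) -> float:
    """Stage 1 stand-in for bi-encoder similarity: fraction of query words in the doc."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)

def expensive_score(query: str, doc: str) -> float:
    """Stage 2 stand-in for a cross-encoder, which reads query and doc together."""
    phrase_bonus = 0.5 if query.lower() in doc.lower() else 0.0
    return cheap_score(query, doc) + phrase_bonus

def retrieve_then_rerank(query: str, docs: list[str],
                         retrieve_k: int = 20, final_k: int = 3) -> list[str]:
    # Cheap retrieval over ALL documents narrows the candidate set...
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:retrieve_k]
    # ...so the costly reranker only runs on that short list.
    return sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)[:final_k]

docs = [
    "refund policy for annual plans",
    "the refund policy applies within 30 days",
    "shipping times for europe",
    "careers at the company",
]
top = retrieve_then_rerank("refund policy", docs, retrieve_k=3, final_k=2)
print(top)
```

The expensive scorer never sees the documents the cheap stage already filtered out, which is exactly why the funnel keeps latency and cost bounded.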
That does sound like it adds a fair bit of latency to the

(30:24):
process though. I guess in what situations, from the user's perspective, would you want to engage with the system with the reranker versus just regular RAG?
Are they OK with latency or not? Do they care about the correct answer more than waiting?
So it's a trade off again, in your domain, what are the kind
of users there? You know, this is where the user

(30:46):
interface and user experience come in.
Like what are they OK with? For example, if you think
about deep research in ChatGPT, a lot of these chat products have deep research now.
Now you're waiting for like 15 minutes and it's doing the
stuff. So it's doing it in the
background. So if you're OK with like
writing a query and then going and doing your stuff and it

(31:06):
would give you an answer.
So this is a trade-off between quality and latency.
The quality improves quite a bit if I can throw more compute and more time at it, and the quality goes down if I have to make it really quick.
I can make the answer really quick, but it would not
be a good answer, right? Yeah.
Are there strategies around persisting the results of a RE

(31:29):
ranker, where if someone asks another similar query, maybe not verbatim the same phrasing but something on the same topic, it'll already know that the ones that have been ranked higher in a previous query should be referenced first?
Yeah, that is where we come into caching.
There are a bunch of types of caching.
So one is the caching you

(31:49):
would have seen in software engineering: you have Redis and you are doing exact matching.
You have this key coming in, you'll find the value, and then you'll send it back when you get the same key again. So it could be like someone
asked, what is the capital of France?
And someone asks again what is the capital of France, it'll
find like the answer is Paris and give it back to the user.
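That exact-match flavor is essentially a dictionary keyed on the query string; a rough sketch, where `fake_llm` stands in for the real RAG pipeline:

```python
answer_cache: dict[str, str] = {}

def answer_with_cache(query: str, generate) -> str:
    """Exact-match cache: an identical query string skips the pipeline entirely."""
    key = query.strip().lower()   # normalize lightly so trivial variants match
    if key in answer_cache:
        return answer_cache[key]
    result = generate(query)      # the expensive LLM / RAG call
    answer_cache[key] = result
    return result

calls = []
def fake_llm(query: str) -> str:  # stand-in for the real pipeline
    calls.append(query)
    return "Paris"

answer_with_cache("What is the capital of France?", fake_llm)
answer_with_cache("What is the capital of France?", fake_llm)
print(len(calls))  # 1: the second request was served from the cache
```

In production the dictionary would be an external store like Redis so the cache survives restarts and is shared across workers.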

(32:11):
The other kind of caching would be caching a result from a piece of the pipeline.
Like you were saying, whatever is time-consuming in your pipeline, you might want to cache it.
For example, you have a system, say a CRM assistant, and what you noticed is a lot of people today are asking about a particular client, and you have, I don't

(32:32):
know, you have to pull data from Snowflake.
Your data sources could be anything, right?
You can pull data from Snowflake that you have to analyze. If you're building
an agentic system, you're pulling something from
Snowflake, you're pulling something from Salesforce and
you're finding like the revenue or whatever and you can save it.
And now if some other user in that team is asking about the

(32:53):
same client, you already had the revenue, you can use it for
calculating whatever they're asking you to calculate.
So you have last year's revenue, and there's a step that gets added on top of that to answer this query.
You have something saved. So this is like partially
saving the results when the pipeline is running in your
agent system. So that is one thing.

(33:13):
The third variety is like doing semantic similarity.
So if I say what is the capital of France?
And if I say I want to know France's capital, it would be
able to figure out these two are very similar queries.
So the answer would be the same.
But this thing is not very reliable. Semantic similarity in the caching area is not very reliable, because I can say find

(33:38):
me restaurants, find me good restaurants in New York and find
me nice restaurants and that's it.
These two queries could be very similar.
The only difference is New York,but like in the embedding space,
they might be very similar and you will have to figure out what
the threshold should be. So it's very touch and go, you know, not very reliable to use semantic similarity

(33:58):
in caching. Are there any other
considerations we should make between semantic searching and
keyword searching? This is where we get into hybrid search.
This is us going back to the retrieval piece again.
OK, so we are done with
our ingestion and we have saved our vectors in the vector DB,
the embeddings in the vector DB. And now if I want to do a search on top of that, a query comes in and I have to do a search on top

(34:21):
of that, I would do cosine similarity, find the top 10.
But what if you are in a field like healthcare where there are
terminologies which are not used in our day-to-day life?
I don't know, maybe there's a word... I cannot think of anything. Like Alzheimer's, but that's very common. A disease name which we don't normally use, or an acronym

(34:42):
there. That's where you would want to
use keyword search. So a user asks, tell me more
about whatever the acronym is and I can go and find the
documents which have that acronym.
And then what I can do is do vector search on top of that.
So I filtered out from like 10,000 documents.
I filtered like 100 documents having that keyword.
And then I can do semantic similarity based on like what

(35:05):
the user was asking. Like the user was asking about
how many patients. So whichever document is talking about patients.
So it can do a semantic
similarity on top of that. And then those top 100 would
reduce to like top 50. And then you use that to send it
to the reranker. So the thing with keyword matching is it's more exact: if it has the keyword, it would pull those documents out. With

(35:27):
semantic similarity, it's a bit hit or miss: it would work sometimes and not work sometimes.
It would pull the top 20, but it's not necessarily always going to pull those top 20, you know.
So it's always good to use a combination of them. And that's what is called hybrid
search. So with hybrid search, what
you're doing is you're using a combination of keyword search as

(35:47):
well as semantic similarity. So you understand the sentiment
of the query like the user was asking about patients or asking
about summarizing the patient numbers or something like that.
You're understanding the semantics of it.
With keyword search, you're also understanding this name, whatever it is, HRMC, as it doesn't really have a

(36:07):
semantic meaning to it. But with keyword search, you will be able to see in which documents a matching keyword exists. So you're able to do a
combination of those two kinds of searches and be able to find
the relevant doc. Not every word has a semantic
meaning to it. That's where like semantic
similarity doesn't work that well and you would want to use
keyword search, and a combination of both is the best.
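One common way to combine the two is a weighted fusion of the scores; everything below (the synonym table, the weights, the documents) is illustrative, not a real embedding model:

```python
def keyword_score(query: str, doc: str) -> float:
    """Exact term matching: rewards tokens like acronyms appearing verbatim."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)

def semantic_score(query: str, doc: str) -> float:
    """Toy stand-in for dense-embedding cosine similarity: synonym-aware overlap."""
    synonyms = {"patients": {"patients", "cases", "subjects"}}  # illustrative table
    groups = [synonyms.get(w, {w}) for w in query.lower().split()]
    doc_words = set(doc.lower().split())
    return sum(1 for g in groups if g & doc_words) / len(groups)

def hybrid_score(query: str, doc: str, w_semantic: float = 0.6) -> float:
    """Weighted fusion, e.g. 60% semantic / 40% keyword."""
    return w_semantic * semantic_score(query, doc) + (1 - w_semantic) * keyword_score(query, doc)

docs = ["HRMC subjects summary", "general hospital report"]
ranked = sorted(docs, key=lambda d: hybrid_score("HRMC patients", d), reverse=True)
print(ranked[0])  # the doc matching both the acronym and the meaning wins
```

The keyword term catches the acronym verbatim while the semantic term catches the paraphrase ("subjects" for "patients"), which is the whole point of fusing them.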
I love the idea of hybrid search.

(36:28):
How do we go about implementing it?
Do we pass the query to an LLM and say extract the
keywords with low semantic meaning?
Or how do we, from a user's query, understand what we should
route towards keyword and what we should route towards
semantic? So these vector DBs have that capability.
For example, if you look at Qdrant as a vector DB, it has that capability.

(36:48):
You just define it as a hybrid search and you can be like I
want to give 40% weightage to keyword search and 60% weightage to semantic similarity.
So it would use a combination of that and get you the top 10 chunks based on that. A lot of these vector DBs have
that capability already, so you don't have to dive into it.
But generally, like based on your use case, you would want to

(37:12):
decide what you want to do. Or you could first start with semantic search and see how it's doing.
And then like, OK, there are these queries with acronyms
which are not working. Or there are these queries with
like these disease names which are not working.
And you might want to use keyword search.
And that's where you would want to make it hybrid.
And that's where, when you're saving your embeddings, for example in Pinecone, you create an index.

(37:36):
It's like a container. And you tell it like, I'm going
to do cosine similarity. And then what you can do is you
can also create these sparse vectors.
So now it's getting very technical.
So when you do semantic similarity, you're using these embedding models, and vectors come out of them.
Vector and embedding are the same thing, an embedding or a vector that comes out of it.
It's called a dense vector. Whereas when you want to do

(37:58):
keyword search, keyword search would be something like you have
a dictionary already, a dictionary with like 10,000 words, however many words there are in English.
And then if your document has like 10 words, it would put a 1 wherever those words are in the dictionary.
So it's called a sparse vector
because it's pretty much empty except whatever words were there
in the document. So what it would do is it would

(38:19):
create the dense vector as well as the sparse vector and save
it. And like when you come into a
hybrid search, it will do a matching on both those things.
So it would do a keyword match: based on what keywords are there in the query, it would see which documents have the keywords, wherever there is a 1 for that.
And it will also do semantic similarity using cosine similarity, and find you the top K based on cosine similarity

(38:41):
and dense vectors. Yeah.
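The sparse side of that pairing can be sketched with a toy vocabulary (a real dictionary would have tens of thousands of entries, stored alongside the dense vector):

```python
vocabulary = ["patient", "revenue", "hrmc", "summary", "report"]  # toy stand-in dictionary

def sparse_vector(doc: str, vocab: list[str]) -> list[int]:
    """Put a 1 at every vocabulary position whose word appears in the document;
    almost every position stays 0, which is why it is called sparse."""
    words = set(doc.lower().split())
    return [1 if word in words else 0 for word in vocab]

vec = sparse_vector("HRMC patient summary", vocabulary)
print(vec)  # [1, 0, 1, 1, 0]
```

Hybrid search then matches the query's sparse vector against these positions for keywords, and the dense vector via cosine similarity for meaning.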
I appreciate going into the detail, getting a deeper understanding of these things. I think it's really important.
One other area I'd like to make sure we cover, whether it's
contrasting against hybrid search or just the general
approach, is agentic RAG. It feels kind of buzzwordy.
It comes up more and more. I'd love your experience.
Have you used it? How does it compare?
When should people consider it?

(39:02):
Any insight there? The thing with the agents is
everyone calls everything an agent now, and it's getting very
confusing. So what I think is an agent is
when you give it some capabilityto make its own decisions.
So there is a while loop, it has access to tools and it can make
its decision. Like I have to call this tool
again because I haven't reached my answer, you know, and I'm

(39:23):
going to call the other tool because this one tool is not
good. But I'm talking about agentic RAG.
I mentioned deep research a lot there because with deep research, what is happening is it's going and doing a search, and then it's like, oh, this is not enough.
Let me go do the search again. So it's like accessing a tool
and trying to figure out how many times I should search or
the kind of queries I'm going to use in the search.
So that's what I would call agentic.
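That while-loop-with-tools definition might be sketched like this; the tool and the scripted decision function are hypothetical stand-ins for real tools and an orchestrator LLM:

```python
def run_agent(query: str, tools: dict, decide, max_steps: int = 5) -> str:
    """Minimal agentic loop: a decision step picks a tool (or stops) each turn."""
    context: list[str] = []
    for _ in range(max_steps):
        action, arg = decide(query, context)  # stand-in for the orchestrator LLM
        if action == "answer":
            return arg                        # the agent decided it is done
        context.append(tools[action](arg))    # call the chosen tool, keep the result
    return "gave up after max_steps"

# Hypothetical tool set and a scripted "decision" function for illustration.
tools = {"search": lambda q: f"results for {q}"}

def scripted_decide(query: str, context: list[str]):
    if not context:                           # nothing gathered yet: search first
        return ("search", query)
    return ("answer", f"answer based on {context[-1]}")

print(run_agent("capital of France", tools, scripted_decide))
```

A real deep-research agent just swaps `scripted_decide` for an LLM call and keeps looping until the model judges the gathered context is enough.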

(39:46):
I've also worked on systems where we have tried to make it
agentic in the sense based on the user query.
So we talked about the RAG system, but we didn't talk about
how it would vary based on the user query.
So sometimes the user query that comes in is they want an answer from the document, right? And you do your typical RAG

(40:06):
system. Sometimes what they want is the
file search: I want the 10 documents, find me the 10 vendor documents in this folder.
So it's actually a file search query.
So you can use a classifier, and then based on what the classifier classifies this query to be, oh, this is a Q&A query or this is a file search query, I'm gonna do the next step.
You can define that or what you can do is use this bigger

(40:29):
orchestrator LLM and then give it access to these tools like a
file search tool, an answer retrieval tool.
So based on the query, it's able to decide which tool
to call. And maybe the query is, find me the vendor
docs in this folder and then summarize the revenue or
something. So it needs a combination.
I'm doing the file search tool first, and then based on the documents

(40:52):
retrieved, I'm gonna do the answer retrieval tool.
So based on these documents I'm getting an answer.
So that's where we get into agentic RAG.
Like you build tools for very narrowed-down use cases because they have to work. So for that you have to narrow
it down and you give a bigger LLM or an orchestrator access to

(41:13):
these tools or these sub agents.And then based on the user
query, it's going to decide likewhat to call, when to call, how
many times to call, and when to stop and give the answer back to
the user. When people go from a development RAG system to production, what type of things
should they keep in mind, both in terms of safety as well as
efficiency? What are the gotchas?

(41:35):
When you have a proof of concept that works great and that you
want it to go live, what should you look out for?
So you would want to add a lot of guard rails to your system.
That's where like safety comes in.
For example, you would want to like host these models yourself.
Or maybe you're OK with cloud hosted models, you don't want to
call the OpenAI API directly. So that's one thing: trying

(41:55):
to figure out your model gateway, like how would you call
these models? The other thing is like what
would go inside the models? You would want to remove PII
information before it goes into LLMs.
Whatever comes out of these models, do you want to send it
to the user directly? Maybe not.
You would want to clean that up, check that it's not abusing the user, it's not talking about the competitor or something. So you would want to clean

(42:17):
that up before you send it back to the user.
So that's where the guardrail piece comes in.
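A minimal input-side guardrail along those lines could look like the sketch below; real deployments use dedicated PII-detection services, and these two regexes only catch the most obvious patterns:

```python
import re

# Illustrative-only patterns: strip obvious PII before text reaches the LLM.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"), "[PHONE]"),
]

def redact_pii(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(redact_pii("Contact jane.doe@example.com or 555-123-4567 about the invoice."))
```

The same function can run on model output before it goes back to the user, covering both directions of the guardrail.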
The other thing is, I mentioned this before too: trace everything, observe everything.
The question coming in, after each step what's happening, and obviously the timestamp, the parameters that are there in the system and the

(42:37):
output that's going to the user, which model was accessed.
So everything that was inside the system at that point of
time. So that like if there's
something that's going wrong, you can go back, see like what
was the system state like, like which model was getting used
that day? Maybe you're using an LLM gateway, you know, it switches between Anthropic and OpenAI based on the latency. Or maybe you're using a model

(42:57):
router, like the query is not complex.
I'm gonna use a simpler model. So you would want to know the state of the system at that point in time, based on the user query, and all the outputs from each of the pieces, so that you can figure out where the problem was and go and actually solve it. So that's something I see a lot
of teams like skipping on. The third thing I would say is

(43:19):
the user feedback. Since this technology is very
new and the users are trying to still learn how to use it and
even we are trying to learn how to build with it and make it
more reliable. You would want to like get user
feedback as much as possible andlike be more innovative with the
user experience as well. Not everything has to be a chat
conversation, right? So like having thumbs up, thumbs
down, making it easier, and maybe pulling implicit

(43:41):
feedback from the user. For example, if you're using
GitHub Copilot or cursor, like if you don't press tab or don't
accept their auto answer, or maybe like you accept the code
and then you make changes to it.I'm pretty sure they're taking
this as implicit feedback and working on it like what we
generated did not work for the user.
So doing that, like any sort of feedback, implicit or explicit,
that you can utilize to improve your system would be great.

(44:02):
This was awesome, I learned a lot.
I'm sure you did as well. Really appreciate the master class and you sharing your expertise.
Before I let you go, is there anything you'd like the audience
to know? Yeah, you can connect with me,
reach out to me on LinkedIn. I'm always happy to have
conversation. My website is
www.apurvamisra.com and my company is Sentick.
I help out startups as well as like mid size companies with AI

(44:26):
strategy as well as building solutions for them.
And I also do speaking and run workshops.
So if you need that let me know and you can connect with me on
LinkedIn. My name is Apurva Misra, and I think my handle is misraapurva.
Thank you for listening to this conversation with Apurva Misra.
I really learned a lot about RAG. I hope you did too.
And for the implementation, I hope you just take away: it depends. It depends how you implement it.

(44:47):
It depends on your data, your use cases.
You're going to have to really consider the end use case for
making your decisions through this process.
But hope this gave some insight as to things you should
consider. I'd love to know, do you use RAG
currently? What difficulties do you have
implementing it? What systems?
What services do you use? Please let me know down below
and if you could like and subscribe, it would really mean
a lot. And I want to give a quick shout

(45:07):
out to Tool Hive for supporting the show so I can have
conversations like this, and I'll see you next week.