Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Tobias Macey (00:11):
Hello, and welcome to the data engineering podcast, the show about modern data management.
If you lead a data team, you know this pain. Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one off tools instead of doing actual data work.
Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure.
(00:37):
Type a prompt like build me a self-service reporting tool that lets teams query customer metrics from Databricks, and they get a production ready app with the permissions and governance built in.
They can self serve, and you get your time back. It's data democratization
without the chaos.
Check out Retool at dataengineeringpodcast.com
slash Retool today, that's r e t o o l, and see how other data teams are scaling self-service.
(01:01):
Because let's be honest, we all need to retool how we handle data requests.
Your host is Tobias Macey, and today I'm interviewing Vasilije Markovich about agentic memory architectures and their applications. So, Vasilije, can you start by introducing yourself?
Vasilije Markovich (01:16):
Sure. Thanks for having me, Tobias. So I'm Vas, originally from Montenegro, and I've been living in Berlin for ten-plus years. My background's in business. I became a data analyst, then a data engineer and a data product manager, mostly building in the modern data stack for a couple of German unicorns. And then, after a few consulting gigs here and there, where I was solving problems for people who needed to move data from place A to place B, I ended up back at uni studying cognitive science.
(01:47):
I'm here now actually in Belgrade, doing my exam in family and family dynamics, which I miserably failed because I didn't prepare well enough for it. But that aside, based on those studies, I got the inspiration to start my own company around two and a half years ago, to build some type of knowledge engineering memory engine on top of the vector stores. We built a Python SDK,
(02:10):
scaled that, and did fundraises, with the latest one being from an asset-based investor. We are now a team of 15 people based in Berlin, building in the memory space. So more than happy to chat about that.
Tobias Macey (02:23):
You mentioned that you got into the overall data space by virtue of working in business. I'm wondering if you can just talk through a little bit about how that helped shape the way that you think about the space and directed you down this path of focusing on cognitive sciences and how they interact with our computing infrastructure.
Vasilije Markovich (02:43):
Yeah, sure. So back in even earlier days, I used to do anthropology, and in anthropology we had this concept of glasses you can put on, these cultural glasses, where you fit into different types of communities with their sets of social norms, or different types of, let's say, nations or social structures.
And I had the same experience when I started working with data. I was in company one, then company two, then company three, and all of them had data, but they looked at that data differently and modeled it differently; with almost the same data, they tried to represent it in different ways with different outcomes. So a session could mean different things; a KPI
(03:23):
like revenue would be a different concept. And most of the work often enough entailed this data modeling: can we actually represent what the business needs, how they want to communicate to investors, what's the most appropriate thing to optimize revenue for? And over the years, I ended up redoing data modeling, everything from Kimball's dimensional modeling to effectively evolving past the standard
(03:49):
practices towards, let's say, more modern stacks, where we saw reinterpretation of the same concepts and terms. And as I got involved in that, and I did that many times, I moved into this cognitive science space, and it didn't seem connected in the first iteration. But as the LLMs appeared and we started using them more and more, I saw that we no longer have a static set of rules we can apply to a business, have the KPIs defined in a certain way, and have this determinism that existed before. With these non-deterministic engines, which are more similar to how humans encode, map, and decide on rules and schemas, we would need a new type of solution. And that's where I went from, you know, creating some type of fact and dimension tables towards asking how I can actually create a dynamic schema that continuously evolves
(04:37):
and that was more inspired by the cognitive science piece.
Tobias Macey (04:42):
And now digging into this overall idea of memory
in the AI and agent context,
can you just start by unpacking some of the different elements and concepts that are involved in actually
building a solution to address that use case?
Vasilije Markovich (04:59):
Sure. I think memory is a loaded term right now. It means a lot of things, but it really means nothing, and that's the problem with it. It can be a vector store, it can be a layer on top of a vector store with some graphs, it can be ingestion and structured outputs. How I define it: I'll start by saying that agents by themselves are stateless, right? Transformers as an architecture are not designed to preserve any type of state. So as you use agents, you want to give them some type of state representation so they know where they left things off, they can effectively continue,
(05:37):
they can exchange data, they have some type of protocol and the ability to move things forward rather than starting from scratch all the time. That was the base problem. And the data that they needed to access, in the, let's say, earlier days, was some company data or some data they couldn't ingest through their context window, because the context window of the agent was relatively small. This is where we did RAG: put the data in the vector store and retrieve it. Now, as we move forward with the implementations and the agentic use cases,
(06:08):
we've seen agents having an extended context window, so data storage became less of a problem. But what became a problem was that the agents were simply not accurate enough. Let's say I try to retrieve all the spells from a Harry Potter book; because the LLM the agent is using is trained on the Harry Potter books, if I add two new spells to this book or customize it, it's never gonna find them, right? So the problem we ended up with was: can we actually encode our own data and give it to the agent in a way that the agent can retrieve it and represent it the right way, and not get overwhelmed by its training and the general corpus it's been trained on? And in the sense of, let's say, agentic memory,
(06:50):
one of the first things the agent needs to be able to do is actually retrieve the right context, and maybe now we can talk a bit about the architecture of where the agent needs to get that from. That's what we call the permanent memory in our case. So with Cognee, you define this permanent memory as a graph vector layer of different subgraphs in which you can effectively store different types of representations of your business rules, ontologies,
(07:14):
time representations, and context, with appropriate weights that signal their significance, inserted into this area of the graph vector store. And then when the agent retrieves it, it retrieves the right thing. So that's the first thing we saw. The second piece of the agentic memory infrastructure for us, as we moved on and talked to different stakeholders, was that the agents themselves, as they do this reasoning and these sequences of steps to try and solve a certain problem, generate a lot of outputs.
(07:43):
These outputs are relatively useful, although not often, but they can be. And not only do they do the reasoning steps, they also do a lot of tool calls. So these tool calls need to be stored somewhere, and these reasoning steps need to be stored somewhere, because it's not just about retrieving correct data from an external data store; you also need to be able to retrieve your own memories of your own reasoning from time to time, or share them. So what we built is this concept of a session memory. People often call this short-term memory in the agentic space, referring, wrongly, to the classical models of short-term and long-term memory in humans. But the problem with that is that short-term memory for humans is, like, what, six seconds for an N number of items that then get distilled into the permanent. For session memory in this context, what we did is we created
(08:30):
the typical concept of a session from the web analytics landscape, which starts when a user defines it and ends when the user defines it, and there we store all the reasoning traces, all the tool calls, and these can then be distilled into the permanent memories as some type of trace representation of what happened and what steps the agent took. So the next time the agent comes to the system and tries to do something, we can say that in attempt X it did action Y, and we can continue from there or do something else. But we can also connect things: if it mentioned Elise
(09:03):
in one of these attempts, we can connect that to some publicly available domain knowledge, and then we can get more context about Elise: maybe she's a manager at McDonald's or something like that, which we can then give back to the agent. So as you see, the architecture started evolving and getting more complex, and we did different types of, let's say, representations
(09:24):
of knowledge and memory as we moved towards the customer use cases, because we were asked for this more and more: time, entity decomposition, session, permanent store, and then going past the episodic/semantic and short-term/long-term concepts, which were where everyone started.
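The session/permanent split described above can be sketched in a few lines of Python. The class and method names here are illustrative stand-ins, not Cognee's actual SDK: a session accumulates reasoning traces and tool calls, and a consolidation step distills them into a durable store.

```python
from dataclasses import dataclass, field
import time

@dataclass
class SessionMemory:
    """Short-lived store for one agent run: reasoning traces and tool calls."""
    session_id: str
    events: list = field(default_factory=list)

    def record(self, kind: str, payload) -> None:
        # Every reasoning step or tool call is timestamped and appended.
        self.events.append({"kind": kind, "payload": payload, "ts": time.time()})

@dataclass
class PermanentMemory:
    """Long-lived store distilled from sessions (in Cognee this would be a graph+vector layer)."""
    facts: dict = field(default_factory=dict)

    def consolidate(self, session: SessionMemory) -> None:
        # Naive distillation: keep only tool-call outcomes as durable facts.
        for e in session.events:
            if e["kind"] == "tool_call":
                self.facts[e["payload"]["name"]] = e["payload"]["result"]

sess = SessionMemory("run-42")
sess.record("reasoning", "need user's region before querying revenue")
sess.record("tool_call", {"name": "lookup_region", "result": "EU"})
perm = PermanentMemory()
perm.consolidate(sess)
print(perm.facts)  # {'lookup_region': 'EU'}
```

The real distillation step would be far richer (entity extraction, graph linking), but the shape, record into a session, consolidate into a permanent store, is the same.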
Tobias Macey (09:44):
When we're talking about these different gradations of memory,
do you see that it's often the case where you need to
have all of them present for a given use case, or are there situations where you would only use maybe the session memory
or situations where you don't care about the session memory, you just care about these long term memories that maybe store information about the person that you're having the interaction with to feed into things like user personalization?
Vasilije Markovich (10:13):
That's a good question. So we've seen different types of use cases. One we call the data silo use case, where people come with different types of datasets that are disconnected, and there is no way to actually merge them because there are no identifiers that they share. But with the LLMs, you can kind of figure out that apples belong with apples and oranges with oranges, as a naive type of classifier, right? So with that type of use case, you don't really need session memory, because your agent is not gonna be doing that through a session; you're just gonna create
(10:42):
the connections automatically and a data pipeline of sorts. So it's data engineering in a sense, but a bit more advanced, right? And then combined with what was NLP back in the day. So I would say this use case doesn't require session memory. As soon as you plug in agents that are doing some type of transformation or learning work, the problem is that you can't really do it the same way, because now suddenly latency becomes a problem. I need fast responses; I need the agent to get the information quickly. What we've seen is that waiting four seconds to get data from a permanent store is not the latency anyone wants. So what we did is we built the session memory: we store transformed embedding triplets, with the graphs inside of them, inside a Redis vector store, and we can search those quickly, right? And then effectively
(11:31):
give them almost the same quality as searching the permanent memory, with the hot memory there that can be reconstructed from the permanent store and synced when needed, effectively giving them the right path of access. The agentic cases often enough expect low latency; they're expecting to spend tokens and a lot of money on these agents, and they just want the result, they don't really care. It's like,
(11:54):
it's surprising how it scales. I talked to one company that I won't name now, but they had an issue where they paid for inference for a whole year, and the AI agents used up all the inference in three months, right? Everything they bought for a year was gone in three months, because these things scale exponentially. And you just don't predict that, because even our SDK runs: we started with 2,000 runs in November,
(12:18):
and now we run like 650,000
times a month. Some top users are running us, I don't know, 20,000 times a month. So on Lambdas and whatnot, these agents are triggering memory generations on the fly with small datasets; we didn't expect the patterns to be that way. So what I'm trying to say is that when you need sessions, you really need them. But if you have this type of, let's say, transformation
(12:40):
layer just for like, let's say more permanent storage that can be accessed infrequently,
then you're pretty much okay with the permanent store and that's that
data silo case. And then the third one that is starting to appear and becoming more relevant now is the edge case, right? So let's say I have the session memory and I have the permanent memory, but I have my mobile phone. There is no way right now that developers can use any type of tooling on mobile to give their mobile applications a shared memory layer, partly due to iOS and Android restrictions, but also simply because there is no tooling for it. So we built the first POC now that allows us to have a memory layer on the phone that talks to a memory vector store on the phone, which is a Qdrant store; they just released the vector
(13:25):
store on edge. And then eventually you would have these devices and apps communicating with each other and exchanging, let's say, data in this way, and also within the apps. So these are the three major, let's say, approaches, which vary from each other a lot, but within them there are also subgroups.
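The low-latency session lookup described above, embedded graph triplets searched by similarity, can be illustrated with a toy in-memory version. A real deployment would use a Redis or Qdrant vector index with learned embeddings; the three-dimensional vectors here are made up for illustration.

```python
import math

# Toy stand-in for a session store of graph triplets, each keyed to a
# (hand-written) embedding vector and searched by cosine similarity.
triplets = {
    ("user", "lives_in", "Berlin"): [0.9, 0.1, 0.0],
    ("user", "works_at", "Bayer"): [0.1, 0.8, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, k=1):
    # Rank every stored triplet against the query embedding, return top-k.
    scored = sorted(triplets, key=lambda t: cosine(triplets[t], query_vec), reverse=True)
    return scored[:k]

print(search([1.0, 0.0, 0.0]))  # [('user', 'lives_in', 'Berlin')]
```

A vector database replaces the linear scan with an approximate index, which is what makes the sub-second session lookups described above feasible at scale.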
Tobias Macey (13:43):
Talking through some of those storage layers, you mentioned things like the Redis vector store and Qdrant, and those are definitely very specialized engines for these vectorized embeddings.
I'm wondering if you can talk to some of the other aspects of the physical manifestation of these memories and some of the storage media that are
(14:05):
in use across some of these different
memory implementations
or even something where maybe I, as a developer, am just going to hack together
some sort of memory store. Maybe I can just use markdown documents with, for instance, things like the Copilot command line or Claude Code, where I have a particular coding session that I'm involved in. I just use that session ID
(14:30):
as the identifier for that memory, and maybe I retrieve that memory the next time I restart that session, or maybe I want to update the AGENTS.md to include some useful information that I learned in the process of solving a particular problem,
etcetera.
Vasilije Markovich (14:46):
Sure. Let's maybe take a step back, right? So 90% of users are gonna be using some type of prompt with some type of Jinja templating or whatever you put inside of that, so you can actually just put in the variables you need that you're gonna get the answers for. And then you might also tell the agent to write to an MD file or to update the MD file. What happened to me with Claude the other day: I was trying to make some plans for the SDK improvements, right? I ended up with nine MD files of different versions of the improvements I was trying to make. And then I was trying to reconcile some of these plans and find where
(15:22):
they ended up and what happened. And I lost some of them because the Claude session got terminated. So I needed to try and find the session from Claude, recreate the reasoning process of the agent, and recreate the actual part of the plan. All of that, Claude, for example, does out of the box: creates a session, can restart the session, pull the reasoning, pull it into the MD file, take stuff from the MD file. But the problem is that you're gonna end up with unversioned files like I did, you're gonna end up with a lot of redundancy,
(15:51):
and the updates might be hard to manage or, let's say, roll back, and to know where you left off after the session is done. So that's kind of the limit. But for a lot of use cases it's gonna work, and you probably don't need to do anything more than that, and you probably don't need to move towards more complex systems.
But in cases where you have to reconcile
(16:14):
the information that is not intuitive. So if we're talking about cases like, you know, summer fashion in Brazil is winter fashion in Europe, and these types of business rules where the LLM might not intuitively connect the dots, then you can start, for example, with a vector store. So Weaviate is now releasing a memory product, I think; there is Qdrant, there is Milvus, there is LanceDB. And I think LanceDB is quite easy to use because it effectively allows you to store the
(16:43):
embeddings in a flat file or connect to S3 buckets, so you can effectively just move fast with that. If you want more complex relationships, Kuzu, which was recently acquired but is still available now as a new offshoot called Ladybug, a graph database that stores data in files, or Neo4j, can help on that side. There are other tools, such as Mem0 and
(17:05):
Graphiti, which are similar to what we do, but maybe with more of a focus on chatbots in Mem0's case, where you need chatbot memory; Graphiti is focused on time representation; and then there are others focused on certain, let's say, memory domains in certain industries or verticals, some focusing on legal, others elsewhere. Those can be a good starting point: start playing around and see how your data would start looking and getting represented. What we try to do is effectively add this data engineering layer where we decompose
(17:34):
and store the data, but also deduplicate it, manage it, have a meta store, are able to rerun the jobs and then retrieve the data reliably.
Because as soon as you start with the files, you still need to build this type of indexing layer. If you just start with the pure database, you have the index, but you still need to build the ingestion. So for us, we're kind of somewhere in between, but there are definitely many options and many various use cases. And I think if you really don't need it, and if you don't have a lot of data and you don't have to reconcile complex
(18:03):
relationships,
you can definitely start with the MD files and then make your way forward as you evolve from there.
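The markdown-file starting point mentioned above can be as small as this. The directory layout and session-ID file naming are assumptions for illustration, not a standard convention.

```python
from pathlib import Path
import tempfile

# Minimal file-based memory keyed by session ID: one markdown file per
# session, one bullet per memory. Illustrative only, not a real tool's API.
def append_memory(root: Path, session_id: str, note: str) -> None:
    f = root / f"{session_id}.md"
    with f.open("a") as fh:
        fh.write(f"- {note}\n")

def load_memory(root: Path, session_id: str) -> list:
    f = root / f"{session_id}.md"
    if not f.exists():
        return []
    # Strip the "- " bullet prefix from each stored line.
    return [line[2:].strip() for line in f.read_text().splitlines() if line.startswith("- ")]

root = Path(tempfile.mkdtemp())
append_memory(root, "sess-1", "prefer LanceDB for flat-file embeddings")
print(load_memory(root, "sess-1"))  # ['prefer LanceDB for flat-file embeddings']
```

This captures both the appeal (trivial to build, human-readable) and the limits discussed above: no versioning, no deduplication, and no indexing beyond the session ID.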
Tobias Macey (18:11):
Now digging into some of the
structure and format of the actual
memory information,
what are some of the types of attributes that you would typically
embed or model around the actual storage data so that you're able to filter and retrieve it
and some of the ways that that changes based on the use case or the size of the data or whether it's an agentic use case versus more of a chatbot
(18:44):
style interaction?
Vasilije Markovich (18:46):
Sure. So we're talking now about, let's say, two types of data we can store in two places, right? One is the metadata on the graph side that we can store and use to filter on. And the second one is like, what can we store in the embeddings
and how can we actually optimize
the things on the embedding side? So for us, for example, we are working with Bayer, the big pharma company, and for them effectively
(19:11):
where we're mostly operating is the embedding layer. So we are taking scientific hypotheses, decomposing them into two elements, the thesis and the prediction, then trying to match those predictions on the embedding layer, predicting new hypotheses, and having a mechanism that can consolidate and add those. And the data that goes into those is
relatively granular
(19:32):
information
about the scientific papers, about the hypotheses, and the metadata that allows us to trace the information back to the scientific paper, to the original source.
So all of that's gonna be in those collections, and then the graph is gonna represent the source object it came from, but most of the operations are gonna happen there. So that's for this data silo use case; in a lot of cases, it's some type of transformation that needs to happen on both the graph and the vector store, mostly on the vector embedding lookup, and then surfacing up to the graph, where you're just operating with a non-changing standard set of metadata. If we talk about the agentic case, what we do there is we create different graphs for different agents.
(20:16):
So even physically isolated databases, where an agent is gonna have its own memory, and within its own memory it's gonna have different groups of concepts. So maybe it's gonna have, I don't know, yesterday, today; it's gonna have a representation of the personality of the agent or something else that is relevant for it. And that we do by creating this concept of node sets,
(20:39):
super nodes controlling that part of the subgraph; the node set is what we call the group of nodes around a certain node. And that is a metadata object that is effectively a label of sorts. There, a lot of things are done with labels, so we can have soft isolation between certain components and entities. And then, where the value add is, we have this post-processing pipeline that goes and connects the dots, adding links or edges between different nodes. So let's say yesterday I mentioned eggs, and today I talked about omelets; I can connect
(21:13):
those, and then on the traversals I would get the information back. In terms of the traversals, all of this metadata, so the node sets, the main concepts, the main, let's say, data modeling elements, the timestamps for time awareness, and everything else, is structured in a more standard format, whereas you can custom-add your own pieces. And then we have a lot of these retrievers
(21:39):
where they are effectively built in such a way that if you are providing the timestamps by default, or if you're providing these node sets in the SDK, which is part of the SDK functionality, we will be able to retrieve those, or filter on those, or filter on what's provided. The framework is quite flexible, so you can add those custom yourself.
So what I did, let's say, over the weekend is I built a new visualization that can tell me which part of the pipeline produced which type of data in the graph, and then I can visualize and color those. I released that to dev this weekend, and it meant I'm just adding new filters inside of this data model,
(22:15):
and then I'm just retrieving those later on, and there was no destructive change. So pretty much you're just changing the data contract. And what we've done is we've added a couple of those, like these node sets, and we added the time representation pieces, but it's pretty flexible and modular, and people can add their own. We have the feedback one where you can, you know, feed back into the system and change the weights of the
(22:37):
node representation so the search is gonna act differently, and others, basically.
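The node-set labeling and post-processing linking described above might look roughly like this toy version. The field names and the linking heuristic are illustrative, not Cognee's actual data model.

```python
# Toy graph where each node carries "node set" labels; retrieval filters by
# label, mirroring the soft-isolation-by-label idea described above.
nodes = [
    {"id": "eggs", "node_sets": {"yesterday", "food"}, "ts": 1},
    {"id": "omelet", "node_sets": {"today", "food"}, "ts": 2},
    {"id": "standup", "node_sets": {"today", "work"}, "ts": 2},
]
edges = []

def connect_by_shared_set(label):
    # Post-processing pass: chain together nodes that share a node set,
    # e.g. linking yesterday's "eggs" to today's "omelet" via "food".
    members = [n for n in nodes if label in n["node_sets"]]
    for a, b in zip(members, members[1:]):
        edges.append((a["id"], b["id"], label))

def retrieve(label):
    # Filtered retrieval: only nodes carrying the requested label.
    return [n["id"] for n in nodes if label in n["node_sets"]]

connect_by_shared_set("food")
print(retrieve("today"))  # ['omelet', 'standup']
print(edges)              # [('eggs', 'omelet', 'food')]
```

The point of the label-as-metadata design is exactly what the example shows: new filters (a node set, a timestamp, a pipeline tag) extend the data contract without destructive changes to the graph itself.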
Tobias Macey (22:44):
That aspect of timestamping the memories and changing the behavior based on certain filters brings up the aspect of
relevancy and the potential for the decay of utility for given memories over time, either because
six months ago, I told you what my address was. You stored that as a memory.
(23:06):
Three months later, I moved to a different location, so that memory is now outdated.
But there hasn't necessarily been a signal that that change happened through my conversation with the chatbot or with the agent. So I'm curious how that temporal aspect factors into some of the ways that you need to think about modeling those memory artifacts
(23:28):
and how that changes based on the content or type of memory.
Vasilije Markovich (23:34):
Yeah. So we also started thinking about that a while back, in terms of the retrieval mechanisms: having recency, frequency, and time representation, then decaying that. And it made intuitive sense at the time to start with that and effectively, you know, have an auto-decay feature where, you know, there is an increment of sorts,
(23:58):
like a Unix timestamp that just grows, and then you're prioritizing the highest numbers, going down to the lowest ones, and retrieving the most recent data, then also overwriting certain weights in certain nodes and adding feedback there. It turned out that was quite naive,
because
what we've seen is that agents write the information back
(24:19):
and recalculate the weights and change the priority of the information they need to retrieve, and there is no consistency guarantee that they will always do it properly, and no good benchmark of what is absolute truth in that context. So, an agent might say, hey, I like this, and then tomorrow it doesn't like it, but it's still a true statement, right? So, effectively,
(24:43):
where we ended up with that is we decided to take a bit more of a consolidated approach that combines this neuroscience concept of traces, which represent certain groups of information gathered around a certain domain or entity, and then to model those traces in such a way that we calculate certain types of, let's say, graph metrics, like the centrality of the graph and others, and then use those to calculate scores for the centrality and importance of that information, based on a certain dataset that we know is deterministic and can only be true or false, which effectively gives us a starting point to
(25:22):
start optimizing the returns
and then kind of prime the memory that way. And second, introduce a bit more of the reinforcement learning techniques that can, with more traditional algorithms, start changing these graph structures back. That's what we're currently working on; this should be live relatively soon. Then we can hopefully move a bit away from pure importance or discounted weighting, which we added and which worked, but I would say it's probably not gonna scale too well over time. So I think this is kind of the next step, and we hope to publish a paper this year on that, and I can give more details then. I'm not implementing this myself; we have a cool group of researchers doing that. So, you know, please take it with a grain of salt that I might not be explaining it the best way, but when the paper is there, hopefully it's gonna be clear.
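The "naive" decay-based weighting described above as a starting point can be sketched as follows. The formula and parameters are illustrative, not the actual implementation: an exponentially decaying recency term scaled by access frequency.

```python
import math

# Toy recency/frequency scoring with exponential decay: the baseline
# discussed above, before graph-centrality-based consolidation.
def score(base_weight, last_access_ts, access_count, now, half_life=3600.0):
    # The recency term halves every `half_life` seconds since last access.
    decay = 0.5 ** ((now - last_access_ts) / half_life)
    # log1p dampens frequency so heavily accessed memories don't dominate.
    return base_weight * decay * math.log1p(access_count)

now = 10_000.0
fresh = score(1.0, now - 600, 5, now)     # accessed 10 minutes ago
stale = score(1.0, now - 36_000, 5, now)  # accessed 10 hours ago
assert fresh > stale  # recent memories outrank stale ones at equal frequency
```

The failure mode Vasilije describes falls out of this shape: when agents rewrite weights themselves, nothing guarantees the inputs to a formula like this stay consistent, which is what motivates the trace/centrality approach.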
Tobias Macey (26:10):
Now as somebody who is
building an agent
or
an LLM powered chatbot
or even a simple command line utility that happens to invoke an LLM as part of it, what are some of the ways that I should be thinking about the triggers or tool calls
that are going to actually store
(26:32):
and retrieve memories and what are some of the ways that those triggers differ between that storage and retrieval aspect?
Vasilije Markovich (26:39):
Yeah. So the only pattern that repeats that we've seen is tool calling: having some type of tool call for storage and some type of tool call for retrieval. There are a million implementations out there of different agentic frameworks and the ways they do things, which are almost all the same if you ask me, but I do respect our framework brothers and sisters, and I hope they continue fighting the good fight. But my point here is, unfortunately or fortunately, tool calling has become the standard. So what you will see is
(27:08):
an ability to add data and an ability to search data. We added options to add data to session memory and to permanent memory in the SDK. We added the ability to add feedback to session memory, because it belongs to the short term and should then be synced to the long term. And we added the ability to search; with search, we have 15 types of search, looking for temporal data, looking in different ways through the graph and through the vector store, depending on what you need. But all of those we are trying to abstract out into just search, so no one would need to worry about that. And effectively,
(27:43):
the tool calls in themselves are still a bit of a rough category, because they always assume you're adding strings; you're not adding any type of structured data or anything like that. So where we are looking at evolving this is: can we have some type of pseudo-SQL tool calling, where you are sending a pseudo-SQL insert into a table with some type of structure that then decomposes on our end into something that has more structure and a more rigid format, versus,
(28:13):
you know, just having a string that you're sending somewhere, or a basic text search that you're sending somewhere. So you're doing maybe a select star from a table where a condition holds, and we already enable a lot of this filtering. It's probably not gonna be a correct or true SQL engine; it's just gonna look like one. But for us, maybe that's the next step where we see this potentially evolving, at least in how we imagine the SDKs of the future running, because LLMs should be able to generate SQL statements that are not overly complex.
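The add/search tool-calling pattern described above might be sketched in an OpenAI-style function-calling schema like this. The tool names, fields, and dispatcher are illustrative assumptions, not Cognee's SDK.

```python
import json

# Two canonical memory tools exposed to an LLM: one to store, one to retrieve.
# Schemas follow the common JSON-Schema-based function-calling format.
MEMORY_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "memory_add",
            "description": "Store text in session or permanent memory.",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "scope": {"type": "string", "enum": ["session", "permanent"]},
                },
                "required": ["text", "scope"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "memory_search",
            "description": "Retrieve memories relevant to a query.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

store = {"session": [], "permanent": []}

def dispatch(name, arguments):
    # The model emits (name, JSON arguments); the host routes to the store.
    args = json.loads(arguments)
    if name == "memory_add":
        store[args["scope"]].append(args["text"])
        return "stored"
    if name == "memory_search":
        # Naive substring search standing in for vector/graph retrieval.
        return [t for s in store.values() for t in s if args["query"].lower() in t.lower()]

dispatch("memory_add", '{"text": "User prefers SQL examples", "scope": "permanent"}')
print(dispatch("memory_search", '{"query": "sql"}'))  # ['User prefers SQL examples']
```

The pseudo-SQL idea discussed above would replace the bare `text` string with a structured statement the backend can decompose, rather than changing this tool-calling shape.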
Tobias Macey (28:46):
Once you do have your agent encoded with the appropriate triggers for
when and how to store and retrieve memories
and you have it deployed,
you've figured out the storage media for how I'm actually going to structure the memory persistence.
One of the other interesting aspects that comes in is the case where maybe you need to
(29:11):
use that same information that's stored in the memory across two different
agentic use cases where maybe I have an architecture where I've got the orchestrator agent that invokes sub agents to perform specialized tasks. They all need to have the same
context of historical
patterns,
or maybe I'm using some form of an A2A-enabled agent where I have an agent that's communicating
(29:37):
from my organizational context with an agent that exists and is owned by a different business,
and maybe they need to be able to have access to some of those same historical patterns, and just some of the complexities that come up as you go across boundaries
of
particular
applications and agents.
Vasilije Markovich (29:58):
Yeah. So back around seven, eight months ago, we talked about that because
we had this issue of the memory isolation.
So for one agent, if we put a lot of data inside the same graph vector store, there was no good way to isolate it, especially on the graph side, without security risks. So what we did is we created a multi-tenancy system that allows each agent to have its own
(30:23):
memory graph store and vector store, basically, that are physically isolated from each other. On top of that, what we have as a system is effectively an ability for you to give permissions to another agent to search your memory, or take those permissions away, or have an admin that has the ability to search every worker's memory, in a sense. So all of that's configurable and relatively easy to use. And that's
(30:48):
what we saw people really get excited about, because then the agents cannot pollute each other's memory, but they can
share memories, and you can create what we call public memories. These public memories
are something we define as a type of memory concept that lives in a public space that every agent can access and share,
(31:10):
but on top of that, have its own personal individual memory. One of the main design elements that we didn't imagine at that time was that effectively
we will have a company memory. So the data from a data store, like a data warehouse,
schema representations,
ontologies,
factory floor drawings, we heard all kinds of use cases here. And then these would live in a public memory, agents would have their own memory with some relevant information,
(31:36):
but we would need to keep all of that isolated from the traditional data stack of the company, because you don't want to give these agents unrestricted access to your company's data; they might wreak havoc on it. So effectively this
isolation layer, this, let's say,
layering between your agent and your data stack became a very valuable thing for
(31:59):
various types of use cases and users that came to see what we are building. So I would say for us, to repeat, the answer was:
let's build this isolated,
let's keep the multi-tenancy and the permission layer,
let's enable public knowledge concepts and everything else the user wants to create. And people are using that a lot.
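The isolation-plus-permissions model described here can be sketched in a few lines. This is an illustrative toy, not the Cognee API: the class names, the grant/revoke methods, and the admin role are all invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """One physically separate store per agent (tenant)."""
    owner: str
    records: list = field(default_factory=list)

class MemoryRegistry:
    """Toy multi-tenancy layer: isolated per-agent stores, explicit
    cross-agent grants, a shared public space, and an admin that may
    search every store. Names are hypothetical, not a real SDK."""

    def __init__(self, admin="admin"):
        self.admin = admin
        self.stores = {}          # agent name -> AgentMemory
        self.public = []          # "public memories" every agent can read
        self.grants = {}          # owner -> set of agents allowed to read

    def store(self, agent):
        return self.stores.setdefault(agent, AgentMemory(owner=agent))

    def grant(self, owner, reader):
        self.grants.setdefault(owner, set()).add(reader)

    def revoke(self, owner, reader):
        self.grants.get(owner, set()).discard(reader)

    def search(self, requester, owner, term):
        # Permission check first: self, admin, or explicitly granted.
        allowed = (
            requester == owner
            or requester == self.admin
            or requester in self.grants.get(owner, set())
        )
        if not allowed:
            raise PermissionError(f"{requester} may not read {owner}'s memory")
        pool = self.store(owner).records + self.public
        return [r for r in pool if term in r]
```

The point of the sketch is that isolation is the default and sharing is an explicit, revocable grant, which is what keeps agents from polluting each other's memory.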
Tobias Macey (32:19):
The other interesting aspect of this
as a question around persistence
and modeling of data is the question of schema evolution
where maybe the
number of memories that I want to store and retrieve changes and maybe that adds complexity in the retrieval path to make sure that I'm doing some sort of reranking and pruning,
(32:41):
or maybe I need to change the structure of the content of the memory because I've discovered, oh, I actually need to
add this extra attribute in order for it to be useful for the agent to understand what the next tool call should be, or I need to shrink the amount of data that's stored in the memory so that I don't explode the context budget because I'm now retrieving more memories,
(33:03):
or I need to do some other sort of evolution
of the actual structure of that data.
How does that also introduce additional complexity in terms of the ongoing maintenance and development of these systems that rely on the memory layer?
Vasilije Markovich (33:20):
Yeah. So what we try to say is less is more in this context. So I don't need to ingest a whole relational database with all the fields. I might just need to ingest the schema with a sample. And then if I need to go back to the relational store, I can go back to a SQL retrieval and bring that back into the system. So these types of approaches are something that is definitely
(33:42):
how we try to position it. I think the problem of context
compaction
and summarization that you see in Claude Code and other types of tools is that it hasn't been designed properly. You have a 1,000,000-token
context window and your context is just growing, and then they say, oh, let's summarize this. You're asking a probabilistic system to summarize probabilistic output, and you're ending up with something that might work or might not. Our approach is: if you're putting in the data, we'll keep that data, and we'll have the full lineage of what happened where. Databases have been designed to store a lot of data, and we can store a lot of data. We can manage the data, clean it as we want to clean it, always have the full trace back, have the replicas, and approach it in a, let's say, more standard, grown-up way, the way you want to deal with data, not just summarize something, forget about the rest, and hope for the best, which is how most of these systems tend to work nowadays. So I think for us,
(34:40):
store everything, have full traceability,
have the processes that do the compaction,
but then know what's happening with the rest, and still have it available if you need to reproduce the steps. Over time, remove the redundant stuff, but when you remove it, put it somewhere you can also retrieve it from if you need to. Often enough, we've seen that approach works well. But I would say that even on our side there is still more work to be done, especially on the data contracting side, which is a notoriously difficult thing to do, right? Because when you say data contract, everyone thinks different things. So we added the ability to have versions of the data schemas
(35:19):
you are ingesting,
but we need the mechanisms to evolve from schema one to schema n. It's something I've seen a couple of players do well; Snowplow did it with Iglu a while back, and others. But that's more traditional,
or, let's say, more deterministic or relational-type data, which is easier to evolve or change. So, yeah, still work to be done, but I would say we're taking a bit of a different approach.
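The "ingest the schema with a sample, fall back to SQL" idea can be sketched like this. SQLite stands in for whatever relational store you actually have, and the plain dict stands in for a real memory layer; the function names and the version field are invented for the sketch.

```python
import sqlite3

def ingest_schema_sample(conn, table, memory, n=3, version=1):
    """Store only the table's schema plus a few sample rows in the
    agent's memory, instead of ingesting the whole table. The version
    tag is a nod to schema evolution: re-ingesting bumps the version
    rather than silently overwriting. `table` is assumed trusted here;
    never interpolate untrusted names into SQL."""
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    sample = conn.execute(f"SELECT * FROM {table} LIMIT {n}").fetchall()
    memory[table] = {"version": version, "columns": cols, "sample": sample}

def fetch_detail(conn, table, where, params):
    """Fallback path: go back to the relational store with real SQL
    when the schema-plus-sample summary isn't enough."""
    return conn.execute(f"SELECT * FROM {table} WHERE {where}", params).fetchall()
```

The memory layer stays small (columns, a sample, a version number), and anything heavier is answered by a live query against the source of truth.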
Tobias Macey (35:45):
One of the things I'm curious about is how you think about the differentiation
between
memory
as something that is the responsibility
of the LLM and its tool calls or its own decision making to be responsible for as far as whether and when and how to store that information
(36:06):
versus
what I'm gonna call the RAG approach: basically having some sort of data source that I'm going to curate, structure, populate, and make available for retrieval.
Vasilije Markovich (36:18):
Yeah, I mean, since I was doing my exams just now, pretty unsuccessfully:
it's always the same debate, starting in the eighteenth century, of nature versus nurture. Were we born with all these genetic predispositions
that just develop into the traits and characteristics we have as humans, or did our parents have some type of influence on us? I think that same debate is happening in AI land these days. Is the agent gonna automatically infer and learn, just read the empty file, get super smart, and have this innate capability to solve all of my issues? Or am I gonna add some reinforcement learning concepts, some external data, and kind of push
(36:57):
it in the right direction? I think it's both, right? Our approach, and where we think the transformer models got to, is:
you have a child, and that child is able to do something at a four-year-old level, but it's still lacking what developmental psychology calls the logical thinking step, the
(37:19):
decentering of self, where the LLM only sees itself; it doesn't have the context of the world, a world model of sorts. What we need to give it is this world model, these
structures and rules: the president is a role that exists in a state, the state belongs to a continent, and the continent belongs to the planet. These types of representations the model itself won't have and won't understand, and we need to encode them. So I think it's a two-way thing that needs to end up merging together to give us the right context and the right information, and that's why we are building what we call a knowledge engine, because knowledge is the
(37:56):
key piece we need to give to these systems.
Tobias Macey (37:58):
Keying on what you were talking about with your reference to reinforcement
learning also brings up the question of the role of memory,
particularly in the cases where the memory is something that is extracted from the interactions that the LLM is executing:
how does that factor into this overall question of self-improvement,
(38:20):
particularly for agentic use cases where the agent is able to learn and improve its capabilities
as it continues to operate in the context of its deployed system?
And what are some of the ways that things like model fine tuning need to factor into that self improvement loop?
Vasilije Markovich (38:39):
I think as soon as you go into fine-tuning, you're starting to spend money. And I haven't seen that be effective enough for the majority of use cases yet. Maybe it's gonna start moving, but I know a small model-tuning company, and they give you, I don't know, a thousand dollars of credits, which tells me that's gonna be very pricey if you wanna do something at scale. So my point on fine-tuning is that, because of GPU prices
(39:05):
these days, you might wanna do it, and it might work if you do it one-off. But if you constantly let the agents fine-tune, they are still trying to get the general model a bit more specific on a context that they are again providing back to the model, so it's feeding a feedback loop that is not the right feedback loop. What I've seen work, unfortunately, well enough is humans:
(39:28):
you need someone to kind of control
the machine and give it feedback, and that is actually useful. On top of that human in the loop,
LLM-as-a-judge can help at times and can do some
level of improvement, but you're gonna hit a wall relatively quickly. What usually also helps is just data that's structured well enough, or datasets structured well enough to guide the LLM's behavior, or, in addition to LLMs, having deterministic
(39:57):
algorithms that can
actually operate at scale
and produce relatively consistent input. So I would say at this point it would be a combination of a couple of those, but you still cannot fully run away from the human in the loop at this stage, unless it's a relatively simple use case. I'm sure that's gonna change in the next years.
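A toy version of that layered feedback gate (deterministic rules first, then an LLM-as-a-judge score, then a human queue for the uncertain middle) might look like the following. The threshold, the blocklist, and the function name are arbitrary choices for the example, and the judge score is assumed to come from a separate judge-model call that isn't shown.

```python
def review_memory_update(update, judge_score, human_queue,
                         deterministic_blocklist=("password", "api_key")):
    """Gate a candidate memory update with three layers of feedback:
    1. deterministic rules (cheap, consistent, no model involved),
    2. an LLM-as-a-judge confidence score,
    3. a human-in-the-loop queue for everything in between."""
    text = update.lower()
    if any(term in text for term in deterministic_blocklist):
        return "rejected"            # hard rule fires, no model needed
    if judge_score >= 0.8:
        return "accepted"            # judge is confident enough
    human_queue.append(update)       # uncertain: escalate to a human
    return "pending"
```

The design choice mirrors the point above: deterministic checks run first because they are consistent at scale, and the human stays in the loop for exactly the cases the probabilistic layers can't settle.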
Tobias Macey (40:19):
When you're working with teams who are
starting to think through this aspect of adding memory
to their system,
what are some of the characteristics
of the system as it exists at the point that they want to add memory and what are some of the ways that the introduction of memory changes the way that they think about the overall
(40:40):
design and purpose of that system?
Vasilije Markovich (40:43):
Yeah, so I usually try to make an analogy to the traditional data world. Let's say you are a pre-seed startup, right? You work for a small company, five people; you have some investors or you want to report some numbers: you're going to use Excel. You don't really need any type of complex data pipeline,
you don't need to optimize for millions in marketing spend and whatnot people do with all the data they have. Then you get to a seed stage: you have, I don't know, 20 employees, you're spending money here and there, you need to report on burn rates and analyze marketing spend, and you have a data warehouse.
(41:18):
I would say that most of the people we see in the agentic space are in this one-to-five-people category,
building something, trying to make something work and run. And in that case, you
effectively
don't really need the memory; you just need to get things working to 80% accuracy, and then you can tune or improve things later, unless it's a really specific, deep-dive type of problem that you just can't solve with the existing tools, as in the case of Bayer predicting scientific hypotheses,
(41:49):
the case of an agricultural company in Ireland understanding that if you plow one field, you cannot plant certain types of crops next to it, and there are all kinds of rules that are not obvious in terms of how agriculture works, or the case of an oil company in Brazil where, if you have certain types of manuals that are very proprietary to the company, based on certain types of incidents on the oil fields, you need to suggest certain types of actions. All of these are very specific; if you have a general enough use case, you don't need it. Now, when you get to the case where you have your own specificity in the company, your own rules, your own ways of doing things, and much more in-depth use cases,
(42:30):
then you plug in the memory, and usually where it helps is: okay, suddenly all of these sales calls and transcripts and information,
all of that I can finally access from different places. So this data silo piece gets unlocked, because suddenly my agents can see things that previously lived in different databases. That's the beginning. Then the agents start to generate their own information.
(42:53):
They start to add value to the system, the data layer starts to change based on the feedback they are providing to it, certain things surface, certain things don't, and then the value creation starts accruing. We are seeing, hey, I'm using this, I'm understanding
better what was relevant in a certain type of call or a certain type of agentic action for a certain group of agents versus other groups, and all of them will have this tuned for themselves. That's, I think, where we usually see people having an aha moment: okay, now I get it. The first one is, I can connect my data; the second one is, oh, now it's doing things I didn't think were possible, similar to what people did with OpenClaw, right? I could suddenly connect WhatsApp and Telegram, give it access to my calendar, and it's just gonna go and connect all of the dots and then suddenly act on all of those. That's the wow moment.
Tobias Macey (43:54):
As you have been working in this space and building Cognee and evolving its capabilities,
what are some of the most interesting or innovative or unexpected ways that you've seen this concept of agentic memory applied?
Vasilije Markovich (44:01):
Yeah. I think we had a Catholic priest come in, like, seven months ago, and he was like, I want to connect different theologies
and find common concepts in different faiths. I never thought of the Catholic Church as a potential client, but then I thought twice; maybe it's a good idea, I don't know. That could be an interesting one. So people tend to connect the dots across various fields
(44:24):
in different creative ways that we didn't expect, and I think the priest was a cool one. But on top of that, we've seen cases we couldn't really imagine, especially in logistics,
where there is a set of hard constraints
in different types of, let's say, factories, what they can produce, how the merchandise is transported and all that, where they have these what they call control towers that can observe the whole path of certain items and
(44:52):
logistic paths, where they saw a big need; they had already started implementing this themselves to a degree around two years ago, so that was a surprising one. We've seen cases,
which I still don't fully get, with geodata, where effectively you have to understand, position, and
store the geodata in a certain way to reference certain objects, entities, and positions
(45:15):
on the map. I just got a hunch of how interesting that one is, but we haven't dug too deep into it. Cybersecurity
is another one. It turns out that in China, researchers are publishing the latest
exploits, and then these exploits can be very relevant in a month but forgotten in three because they got patched. So suddenly you need to maintain, for a certain industry,
(45:36):
a certain subfield, a certain set of exploits that always need to be updated, connected, and cross-referenced, because a combination of exploits can actually impact things. That became a whole topic in itself; you need to manage that, and people are connecting the dots there. What else?
Yeah, many, many. E-commerce is surprisingly boring, but there is a surprising necessity there to optimize funnels, people jumping on certain products and getting the right recommendation sets, and it's still not a fully solved problem. On top of that, what's another interesting one? Let me think. Yeah, I think
(46:13):
I didn't imagine how much in the finance space there is a need and urgency to connect
the dots between different types of, you know, SEC filings and the reports that are appearing left and right from different places. One of the major surprising ones was that there's a whole industry of people going and searching websites manually, putting data into spreadsheets that are then sold to other people who go and actually try to acquire these companies; the people who found and entered the data get a percentage commission on these sales, which is a huge amount of money if it happens. So there are a lot of inefficiencies in these types of industries that you can really easily automate and store, and we have this parser now that can pull a whole web page and structure all that information to see who's the founder or what's happening. That's something that is
(47:04):
a relatively surprising thing that still happens manually. So yeah, just a few.
Tobias Macey (47:06):
As you have been working in this space and building a business that enables
these memory capabilities,
what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
Vasilije Markovich (47:22):
Most of my assumptions are wrong. I think I started with a
healthy set of assumptions and
self-confidence about how I'm going to solve this.
And the more I built, the more I learned that I cannot be too opinionated about what's going to work, because the surprising thing is that someone's going to come on a call and tell me how they interpret things, and I might just
(47:43):
change the direction again. So that led to really trying to build in a very modular way, so there are not too many opinions; and yet still things
sneak through, and you need to be careful about what kind of decisions you make. The second piece, I think the major learning and the major challenge,
is
how to explain to people what you're doing. Because it's so fast-moving and so evolving, we talked about memory being everything, from the vector store to ingestion to an MD file; it can be whatever you want it to be. We worked really hard to figure out how to position what we built and how to iterate on that. Even tonight we had a session where we were discussing things like, this is not clear enough, we need to be clearer. And I think one of the major things is you can never explain it too simply. You just kind of always need to aim for this very basic,
(48:33):
very naive
way of talking about it, which is surprisingly
difficult. And then finally, I
think that the landscape is continuously changing, but what I've learned is to ignore the noise. You go to X and you're gonna see people misrepresenting the truth, talking about things that don't exist or look too good to be true. And when you work with these systems, you know what they're capable of. They can do a lot, but it's not
(49:01):
AGI.
As soon as I see an AGI post: you know, it's just a probabilistic token machine; it's definitely not gonna change my life. But on the other hand, it does change the lives of a lot of people. And there is definitely
emotional dependence on the personality of the chatbot people are interacting with. I think the current climate is such that people don't get much warmth and kindness from other people, so they're defaulting often enough to chatbots to get what they need in life. And things are becoming a bit scary as
(49:32):
you go deeper and you see that you end up using the chatbot as your right hand, relying
on it for questions you should never ask a probabilistic engine to guide you on. There were all these articles and posts. But I think for me one big learning was that I cut myself off from ChatGPT this summer for two months, not because I needed to, but because I read this article in the Washington Post where a guy thought he was gonna be a genius who invents his own,
(49:59):
I don't know, some type of mathematical language of sorts, and
ended up in psychosis. So I decided to cut myself off, and ended up two months not using it, or using it very minimally.
And I actually felt
a need to use it the first few days, and I felt I was a bit addicted after using it for a year and a half. I think for me the biggest learning is to just treat these systems as tools, not as an answer; that's where I kind of slipped into relying on them a bit too much. So maybe that's a word of caution from my side. Again, maybe not too much of a data engineering topic, more a psychology one, but it's an interesting one.
Tobias Macey (50:37):
And for people who are
designing or evolving
some LLM powered system, whether agentic or otherwise, what are the cases where investing in a dedicated memory layer might be the wrong choice?
Vasilije Markovich (50:51):
Yeah. Listen, I mean, if you can do it without the memory layer, or if you can do it with
a prompt, do it with a prompt, if it solves your problem. If you're trying to build something because it's cool, you probably don't need it. It's like blockchain:
it's great on paper, but what can I actually use it for? I still don't know, except for transferring money between different parts of the world if I really need to, but most of the time I don't. So I would say typical structured
(51:15):
outputs are gonna do great if you need to structure some unstructured data.
If you need to store data, you can always use Postgres. It's a great database; it's been there for a while. You probably don't need anything more complex than that. If you do have agents,
if your agents are running 24/7, if they're querying a lot of stores, if you need to connect to an existing data layer, if you have custom business rules, yeah, then think about memory for your agents. But in a lot of cases you can get started with relatively simple boilerplate code that's just gonna do the job well enough for the beginning, and then think about memory as your use case gets more complex.
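For the "structured outputs plus Postgres" baseline, a minimal sketch could look like this. SQLite is used only so the example is self-contained; in practice this would be Postgres, and the table shape and field names are invented.

```python
import json
import sqlite3

def store_structured_output(conn, raw_llm_json):
    """The 'no memory layer' path: take the model's structured (JSON)
    output, validate that the required fields are present, and write
    it to a plain relational table. No vector store, no graph."""
    record = json.loads(raw_llm_json)
    required = {"company", "founder"}
    missing = required - record.keys()
    if missing:
        # Reject malformed model output instead of storing garbage.
        raise ValueError(f"structured output missing fields: {missing}")
    conn.execute(
        "INSERT INTO companies (company, founder) VALUES (?, ?)",
        (record["company"], record["founder"]),
    )
    return record
```

A validation step plus an ordinary table is often all the "memory" a simple use case needs, which is exactly the point being made above.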
Tobias Macey (51:53):
And as you continue to build and execute on the work that you're doing at Cognee, what are some of the things you have planned for the near to medium term, or any particular
projects or problem areas that you're excited to explore?
Vasilije Markovich (52:05):
Yeah, our research team is doing a great job. They are working quite actively on this piece that would give us decision traces,
these subgraph groups that we can both deterministically
evaluate and, on the reinforcement learning side,
continuously improve and change the weights of. So this piece is something I'm quite excited about. We are refactoring our session and long-term memory and the mechanisms
(52:31):
there, which should be live in around ten days. So that's gonna be version
two, let's say, of what we built, because we learn through the iterations;
the SDK layer won't change, but the internal structures will. On top of that, we will be working on different mechanisms
for memory transformations and storage in different types of memory domains and planes over the next period, where
(52:55):
we see a lot of,
let's say, use cases in different verticals that we wanna capture better and represent,
and we'll continue working on that, because things are again differentiating,
but we're trying to still have a general solution and tackle a couple of things here and there. And yeah, time representation: we solved it, but we probably want to improve it over time.
Tobias Macey (53:25):
Are there any other aspects of the work that you're doing on Cognee, this overall space of agentic memory,
or the ways that we as engineers should be thinking about the appropriate
types and structures of memory to be relying on, that we didn't discuss yet that you'd like to cover before we close out the show?
Vasilije Markovich (53:39):
I think at this point,
what we would be talking about over a period of time is more, on our side, how we actually handle and store the embeddings, and graphs inside of the embeddings, and have, let's say, more complex structures
with weights in different types of vector stores, in different vector store collections, that will do 90% of the lifting, with the graph structure being just the guidance
(54:03):
and the
embeddings
being a data layer of sorts in the data contract. I'll talk about this more in the future
as we iterate on it and make the first prototypes.
But right now, that's where I see the structures going: it's going to be, let's say, complex embeddings, and then graphs as kind of the guiding force there, and that's going to be our data schema representation.
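One way to read "embeddings do most of the lifting, the graph is just guidance" is a retrieval pass where cosine similarity ranks everything and a small adjacency structure nudges the neighbors of the best hit. Everything here, the toy vectors, the boost weight, the function name, is a made-up illustration, not Cognee's implementation.

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def graph_guided_search(query_vec, embeddings, edges, boost=0.15):
    """Rank nodes by embedding similarity (the '90% of the lifting'),
    then add a small bonus to graph neighbors of the top hit (the
    'guidance'). Returns node names, best first."""
    scores = {n: cosine(query_vec, v) for n, v in embeddings.items()}
    best = max(scores, key=scores.get)
    for neighbor in edges.get(best, []):
        scores[neighbor] += boost      # graph edge nudges the ranking
    return sorted(scores, key=scores.get, reverse=True)
```

The design point is that the graph never replaces the vector search; it only reorders results that similarity already surfaced.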
Tobias Macey (54:30):
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology or human training that's available for data and AI engineering today.
Vasilije Markovich (54:50):
One thing everyone's talking about these days and finally starting to tackle, and there is this guy called Josh from Sundara AIs
who recently released a product: if I want to run
remove in
Linux, like
rm -rf, that's gonna cause mayhem. Can I actually have a policy that stops agents from running these types of commands? Can I have some type of policy logic engine that's gonna prohibit certain types of actions on the terminal or wherever else? And I think these
(55:21):
are quite
new and quite exciting for me, because we can also apply them to data
and to data access as we move forward. I think that's something I'm excited about. And I think agentic systems,
multi agent systems that communicate between each other and things like OpenClaw and these
(55:44):
bots kind of discussing things. It's a fine precursor of what we're probably going to see more and more of. And the fun thing is, when I have a meeting in my Granola and I speak my native language with some of the people I work with, the transcript I get, I cannot read it; it looks like a mix of languages. But then
it makes a really good summary of what we actually discussed. So I think we are going to end up with some type of
(56:09):
pseudo-language that is gonna probably evolve over time, some agent-to-agent language with some types of constraints on it, that might be much more efficient than English or whatever we are using to communicate now. So I think those two areas are kind of exciting, although I probably wouldn't go into them myself, but I'm sure some will.
Tobias Macey (56:13):
All right. Well, thank you very much for taking the time today to join me and help me explore this overall space of memory in the context of LLMs and agents. It's definitely a very interesting problem area, one that has seen a lot of evolution over the past couple of years, and one where it is hard to get a good handle on which pieces of signal I should be paying attention to and how much is just noise. So I appreciate you helping to unpack that, and I hope you enjoy the rest of your day.
Vasilije Markovich:
Yeah. Thanks for having me. It was a pleasure, and always happy to chat.
(57:07):
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__
covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com
(57:31):
with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.