Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Tobias Macey (00:11):
Hello, and welcome to the data engineering podcast, the show about modern data management.
If you lead a data team, you know this pain. Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one off tools instead of doing actual data work.
Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure.
(00:37):
Type a prompt like build me a self-service reporting tool that lets teams query customer metrics from Databricks, and they get a production ready app with the permissions and governance built in.
They can self serve, and you get your time back. It's data democratization
without the chaos.
Check out Retool at dataengineeringpodcast.com
slash Retool today, that's r e t o o l, and see how other data teams are scaling self-service.
(01:01):
Because let's be honest, we all need to retool how we handle data requests.
Composable data infrastructure is great until you spend all of your time gluing it back together.
Bruin is an open source framework driven from the command line that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement.
(01:25):
Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster.
Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform.
Go to dataengineeringpodcast.com/bruin
today to get started. And for dbt cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud.
(01:50):
Your host is Tobias Macey, and today I'm interviewing Kostas Pardalis about Fenic, an opinionated, PySpark-inspired dataframe framework for building AI and agentic applications.
So Kostas, can you start by introducing yourself?
Kostas Pardalis (02:03):
Of course, and thank you so much for having me here to talk about myself and Fenic. So, about me: I've been building data infrastructure for the past eleven years now, starting with Blendo, which was my first company; I was a founder there. It was a cloud ETL solution that I started around the same time as Fivetran. Then, through an acquisition,
(02:28):
by RudderStack, I joined the leadership team there. RudderStack, for anyone who doesn't know, is again in the ingestion layer, but focuses more on capturing event data and ingesting that data into data infrastructure, data warehouses and data lakes. And then after that, I worked together with the creators of Trino at Starburst Data, where I learned a lot about query engines,
(02:54):
and I worked on expanding Trino and the products that Starburst had into batch workloads. For people who might not be aware, Trino has traditionally been used primarily for online analytics; there was some incredible engineering work that added fault tolerance to the engine at some point,
(03:15):
and that allowed the system to expand into the batch workloads that Spark had traditionally been, let's say, dominating.
And for the past year and a half, I've been working on a new project. I'm the founder of a company called Typedef, where we build what we call agentic data infrastructure. So it's data infrastructure primarily focused on data engineering use cases.
(03:41):
And we're trying to tackle the problems there using agents as part of the mix of the infrastructure. Part of that work also allowed us to open source a project called Fenic, which we'll talk more about. But yeah, that's about me.
Tobias Macey (04:00):
And do you remember how you first got started working in the data space and why it has kept your attention for this long?
Kostas Pardalis (04:07):
That's a good question. Somehow
I was always gravitating
through that even like during my,
let's say
academic years. I remember there was a time where I was trying to decide if I wanted to pursue
more of like a PhD
or I wanted to get into the industry. I was like an engineer.
(04:30):
At that time I was working with some research labs, and somehow I ended up working with knowledge representation and data modelling. Back then there was a thing called the semantic web. Things that are kind of returning today, now that we've started talking about graphs again, and RDF and these kinds of technologies.
(04:53):
Somehow, like I started with that. And then the next step when I decided that, okay, like academic world is not my thing.
Let's try the exact opposite, which is start building a company
and building products.
I ended up like building Blendo
and getting into like the data pipelining world and like the ingestion. And I don't know, I always,
(05:16):
there's something about data that makes things a little bit more complicated, but that also has, in my mind at least, a very direct connection with value, because data is something we can always associate with something we're trying to do, right? That keeps fascinating me and keeps me coming back and working on it. So I guess that's the reason. I don't know. We'll see. There's more to learn about myself, I guess. If you keep asking these types of questions, new things might come up that surprise even me. Absolutely.
Tobias Macey (05:50):
Yeah. Well, I mean, that's the point of life, right? The journey of self discovery.
Kostas Pardalis (05:54):
A 100%. Through data. Through data. Of course. Quantified self.
Tobias Macey (06:00):
And so you mentioned the Fenic project that you open sourced, and I'm wondering if you can just describe some of what it is, the story behind how it came to be, and some of the core problems that you're trying to solve with it.
Kostas Pardalis (06:14):
So when we started with Typedef, the problem we were trying to solve was about, okay, we have the data infrastructure that we are using today, which was pretty much defined twelve, thirteen years ago, when the world was very, very different, right? If we think about OLAP data warehousing and
(06:36):
moving all these things like on the cloud,
the modern data stack as was like called and like all that stuff. Pretty much we are talking about taking some
ideas that existed for a long time
and making them accessible like to more people
and
making them like scale beyond,
(06:58):
let's say, the scale that we had like before 2010.
Spark, Trino, Snowflake, Redshift, BigQuery: all these solutions that are, let's say, the foundation of the infrastructure we are using today were primarily built with two strong assumptions. One, that the dominant use case is analytics.
(07:19):
So it's like BI.
And the second one is that you have some people to run these
systems that they have pretty good technical skills, right? Anyone who has been like, let's say, in part of like a platform team where they have to babysit EMR Spark, for example, they know exactly what I'm talking about, right? Now this
(07:40):
creates a problem; let's say it makes it hard for the adoption of these solutions to become more broadly available.
There were some attempts. I think one interesting approach was when dbt came out; it tried to create this new type of engineer, the analytics engineer, which is more of, okay, we can get someone who is
(08:04):
not necessarily an expert in, I don't know, distributed computing and consistency and what it means to run fault-tolerant systems across hundreds or thousands of nodes and all that stuff, and kind of turn them into someone who can scale the infrastructure that we need. But still, the problem was there. We couldn't see some escape velocity developing in the industry where more and more people could use this technology.
(08:33):
So that was the starting point: how we can make, let's say, data infrastructure more accessible to more engineers out there, and kind of abstract away some of the really hard parts of these systems, parts that are not necessarily there because of the data, but because of the distributed nature of these systems.
(08:55):
So in order to do that, we wanted to start from scratch in a way, instead of, let's say, taking Spark or Trino and trying to make it serverless or abstracted. So we were like, okay, let's see what we can build if we start from the query engine itself, and what this query engine should look like, especially if we are focusing on
(09:17):
the workloads that, let's say, data engineers and data platform people are working on. And we started thinking about what that would look like, what the API should be, what people like to use as an interface, and what new elements we should put there, right? The world, again, is not 2010 anymore. There is more out there. We have to work with new types of data, and most importantly, because of AI, with a new type of compute, which is inference, right? So the core of what we were building was this new query engine that we wanted to
(09:50):
make easier for people to use: familiar for people who have been working on the pipelining side of things, making this new type of compute, inference, a first-class citizen, and designed in a way that makes it efficient now that the nature of the workloads has changed. We're not talking about CPU-bound workloads anymore, which is what we were optimizing for so far with columnar systems. Now we are more, I would say, IO bound, because most of the time we have to go and wait for an LLM to respond and for a GPU to start streaming tokens back. So that's, let's say, how Fenic came into life, how we started working on it. It is a query engine, and to make it accessible to both humans
(10:41):
and machines,
we wanted to use an API that is already familiar to people. So that's why we decided to use PySpark. Dataframes, in my opinion, are important because they give this really beautiful API of mixing imperative with declarative, getting the best of both worlds. Because when you work with data outside of the purely analytical use cases, purely declarative is not necessarily
(11:06):
the best way to do things. And we also added inference as a first-class citizen, as we said, by enriching the existing API that people know. So instead of building UDFs, where you call an LLM and just think of it as an API that you call (you send something, you get something back, but then the optimizer can't do much because it's just a black box), we asked: what can we do to add, let's say, these
(11:34):
relational-operator-inspired operators in the API, to make the composition and the API more familiar for using inference together with classic data processing, and, most importantly, to expose that to the optimizer, because now the optimizer can go and reason about how to reorder operations
(11:56):
and do it in a very efficient way by knowing that, oh, right now I'll need at some point to go and do inference, so maybe I should optimise the plan differently. Of course, we wouldn't have come up with these ideas without getting inspiration from some really smart people out there who have tried to do some of these things. There
(12:16):
is this great team at Berkeley that came out with the Lotus system and the papers where they were calling these semantic operators, let's say, as part of pandas. We kind of took that, but we also wanted to focus not only on that, but more on the engineering side of, okay, how do we put these into a more production-grade
(12:39):
kind of system, right? Something where you don't only prove the feasibility of these operators, but also put them into something that people can actually use regardless of how much data they have. And that's how Fenic was designed. I'll pause here, so if you have any questions about that, I can get into more technical details. Because it's not only that part; there's a lot of work that has been supported by using things like Polars, for example, for the processing, and some other things that we can talk about more.
Tobias Macey (13:11):
Yeah, I'm definitely interested in digging into some of the guts of FENIC and the ecosystem integration.
Before we go too far down that road, I think it's also worth exploring
a little bit more of the utility of having the dataframe API be the predominant design consideration for Fenic, where tabular data obviously is very predominant, with data warehouses and the number of dataframe libraries and implementations that exist. But in particular, when we start to incorporate more unstructured data sources, you might have a higher dimensionality than just the x-y axes that a dataframe will likely constrain you to.
(13:53):
And with the focus of Fenic being on incorporating some of these LLM and generative AI capabilities into that data engineering flow, and generating some structure from that, I'm curious how you see some of the trade-offs of that tabular representation, or that 2D constraint,
(14:14):
factoring into some of the design and usage considerations for Fenic.
Kostas Pardalis (14:19):
Yeah, that's a great question. And I have to add something here. There is a very, I would say, important assumption that has influenced a lot, let's say, the design and the goals of Fenic in the end. It's an assumption about how, let's say, LLMs can be used in a way that is
(14:42):
more predictable in the end, right? And we start from, as you said, there being a lot of unstructured information out there. But this unstructured information has a lot of structure that is implicit. Now, instead of saying,
hey, I know that this information is there, let's say I have like conversations
(15:03):
with customer support, right? I know that this information is about, let's say, my product, and it is about specific features, obviously, because people reach out when they have problems or questions about specific features. Instead of asking the LLM, every time I have a question, to come back and give me the answer of which feature this is about, what we can do is make this information
(15:28):
explicit and part of the schema. And that's where, let's say, the tabular nature of dataframes comes in, right? So instead of, again, querying the raw data using LLMs, what if we go and say, hey, I know what this data is about. I know that in order to process this data I need to, let's say, extract some entities. Why not extract these entities once,
(15:52):
add them as columns in there, right? And then either continue using that information with LLMs or with standard ways of doing processing, which you do want to do, right? Because when you try to scale, anything where you can substitute the LLM with, let's say, a deterministic approach not only adds more reliability but, most importantly, scales much better the more data you have. So that's what the operators,
(16:19):
the semantic operators, are about, right? When you use something like, let's say, a semantic GROUP BY, which is kind of the equivalent of doing clustering, or a semantic extract that returns a structure with a schema that you define, with data types and a struct in place, the goal is to go from this implicit structure that exists in your unstructured data into a more structured, explicit schema
(16:49):
that you can track, you can update, you can reason about, and that the model can also reason about, right? And you can process it in a continuous manner, as we data engineers have always been doing, by building pipelines.
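The pattern described here, extract the implicit structure once, then process deterministically downstream, can be sketched in plain Python. This is only an illustration of the idea, not Fenic's API: the `extract_feature` stub stands in for an LLM-backed semantic extract, and all names are made up for the example.

```python
# Illustrative only: a stub stands in for an LLM-backed semantic extract.
# The point is the shape of the pipeline, not the extraction itself.

support_tickets = [
    {"id": 1, "text": "The export to CSV button crashes on large files"},
    {"id": 2, "text": "CSV export silently drops the header row"},
    {"id": 3, "text": "Dashboard charts do not refresh after login"},
]

def extract_feature(text: str) -> str:
    """Stand-in for a semantic extract; a real system would call an LLM
    once per row and validate the result against a declared schema."""
    known_features = ["export", "dashboard", "login"]
    for feature in known_features:
        if feature in text.lower():
            return feature
    return "unknown"

# Step 1: run the (expensive, non-deterministic) extraction once,
# materializing the implicit structure as an explicit column.
enriched = [{**t, "feature": extract_feature(t["text"])} for t in support_tickets]

# Step 2: everything downstream is ordinary, deterministic processing.
counts: dict[str, int] = {}
for row in enriched:
    counts[row["feature"]] = counts.get(row["feature"], 0) + 1

print(counts)  # {'export': 2, 'dashboard': 1}
```

Once the entity lives in a column with a known type, the group-by, filters, and quality checks downstream never have to touch the model again.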
So that's like an important, like, let's say, assumption here. The goal should be using the LLMs to make sense
(17:10):
and add structure and order to the data that we have, so we can reason more efficiently, both using LLMs and the relational algebra that we've been using until today. And that's what we think you should do if you use it; that's the goal, right? So that's how we approached the problem, and why we use dataframes. And this tabular representation, in the end, I think is really important: we shouldn't let it go away. We should make it more powerful and more expressive by having LLMs in there, instead of saying, oh, we don't need
(17:45):
schemas, we don't need types, we don't need any of that stuff, because we can just throw the raw data at the models and get a response. No, it doesn't work like that.
Tobias Macey (17:55):
And so now, digging more into Fenic itself,
can you give a bit of an overview about some of the implementation
details and some of the core architecture of the project that you settled on to be able to take advantage of some of these data frame primitives while bringing in the LLM
operators?
Kostas Pardalis (18:13):
So one important, let's say, characteristic of the dataframe concept, or of the API, is laziness. It's something that you compose, and when you are ready, you can go and evaluate it. What that allows you to do is apply, let's say, an optimizer
(18:34):
to go and optimize the plan: to go from the dataframes you write, the AST that has been created from your code, to a logical plan, with the logical plan being optimized and then turned into a physical plan that you then go and execute. Now, why this is really helpful is that, from the user perspective,
(18:54):
obviously it helps with composability, right? Users can write really composable code, something that SQL was always struggling with until, let's say, something like dbt came in and tried to add that by using templating. But most importantly, you can now use your optimizer to reason, and use it more like a compiler, to go and figure out the optimal way of executing things.
(19:21):
And while things have been solved for the columnar data that we've been processing so far, that's not true when inference also comes into the equation, right? So the way people have been doing it so far is by building UDFs: you pretty much treat inference as an API call that's part of your code. The problem with that, and traditionally it has always been a problem with UDFs for all systems, including Spark, is that a UDF is pretty much a black box. The optimizer itself cannot reason about what this thing is trying to do. So
(20:00):
by taking the LLM interactions and creating some specific operators that are part of the logical plan, we achieve two things. One is that now the optimizer understands that, oh, I'm going to process the data using LLMs, but also for a specific task. So it's not just a generic thing where it's tokens in, tokens out; it knows that, oh, this is a semantic predicate, which means that this is going to be
(20:30):
about deciding true or false based on some input, and this allows for specific, let's say, optimizations that can be done. Or this is an extraction task, which means that now I can use different types of optimizations. Or a semantic join, which is another operator we have. With all of these, the optimizer itself is aware of what we are trying to achieve using LLMs, and by also considering what inference
(20:58):
is as a type of compute, it can incorporate that information to create and execute the plan more optimally. The reason this is important, especially for working with LLMs, is that LLMs are expensive. And when I say expensive, it's not only in dollar amounts; it's expensive in terms of the time it takes to work with these things, right? So the last thing you want is something failing because, I don't know, maybe you went over your quota, or maybe you sent more tokens than are allowed in the context window that you have, right? All these things that we as developers are aware of, the optimizer, when you are using just a UDF, is not aware of. So we make all this information explicit and part of the logical plan. So now you have a system where, let's say, you take some data, you do some string matching or string splitting or whatever at the beginning, and then you're like, okay, now I want to go and extract,
(22:01):
let's say, some information using LLMs. Fenic will be able to execute this in the most optimal way, considering also, let's say, the parameters of the inference engine that you have, right? So it will make sure that it's fault tolerant; it won't, let's say, throttle your APIs, and if something gets throttled for whatever reason, it will back off. And it will respect, let's say, all the limitations
(22:28):
that the LLMs have, with the end goal obviously being more efficiency and more fault tolerance when you're incorporating these LLM operations into your data processing. From the user perspective, right, the developer who's writing the code, it's now much easier to maintain and reason about the code, because you don't just have a chaos of UDFs that you also have to go and figure out and reason about what they're trying to do. You have specific operators that you can compose, as you would a regular join or a regular filter,
(23:03):
right, or a GROUP BY. But now you apply these things with semantics as well. And you do it in a way where you know end to end what the types are going to be, and these are guaranteed by the engine. So that's, let's say, the high-level approach that Fenic takes. As for the way we've implemented it, we built our own, let's say, again, PySpark-inspired
(23:26):
API. You write code that is very familiar to anyone who has written PySpark code, with some extensions, obviously, because we have semantic operators there. That gets parsed, creates a logical plan, and this logical plan gets optimised. And then when we move into the physical plan, this is actually executed
(23:48):
by Polars. So we translate it into Polars, and Polars executes it. And of course, anything that has to do with the LLMs is handled by Fenic itself. We also support SQL; what happens there is that, using, obviously, the magic of Arrow, we have interoperation with DuckDB. So you can write SQL,
(24:08):
which is DuckDB SQL, and it can be executed by DuckDB. But all these things are transparent at the edges, right? You see Fenic, you write Fenic, and Fenic does all the hard work of making sure that everything is translated correctly between DuckDB, when it has to, and Polars, and back into Fenic again.
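The reordering benefit of a lazy plan can be sketched with a toy in plain Python. This is not Fenic's optimizer, only the principle: because operations are recorded rather than executed, a trivial "optimizer" can push a cheap deterministic filter ahead of the expensive inference step, so fewer rows ever reach the model. (The reorder is safe here because the filter reads only columns that exist before inference.)

```python
# Toy lazy plan: operations are recorded, not executed, until collect().
# A trivial optimizer pass moves deterministic filters ahead of inference
# steps so the expensive step sees fewer rows. Illustrative, not Fenic.

class LazyPlan:
    def __init__(self, rows):
        self.rows = rows
        self.steps = []  # list of ("filter" | "inference", fn)

    def filter(self, fn):
        self.steps.append(("filter", fn))
        return self

    def semantic_map(self, fn):
        self.steps.append(("inference", fn))
        return self

    def collect(self):
        # Optimization pass: stable-sort filters before inference steps.
        optimized = sorted(self.steps, key=lambda s: s[0] != "filter")
        rows = self.rows
        for kind, fn in optimized:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows

inference_calls = 0

def fake_llm_label(row):
    global inference_calls
    inference_calls += 1  # count how many rows hit the "model"
    return {**row, "label": "long" if len(row["text"]) > 10 else "short"}

rows = [{"text": "hi"}, {"text": "a much longer document"}, {"text": "ok"}]

result = (
    LazyPlan(rows)
    .semantic_map(fake_llm_label)          # written first by the user...
    .filter(lambda r: len(r["text"]) > 3)  # ...but the filter runs first
    .collect()
)

print(inference_calls)  # 1: only the row surviving the filter was "inferred"
```

Without the reorder, all three rows would have paid for an inference call; with it, only one does, which is exactly the kind of saving that matters when each call costs real money and time.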
Tobias Macey (24:30):
And as you have been building and iterating on this project, what are some of the ways that the core principles and scope of the project have evolved from your first experiments with it?
Kostas Pardalis (24:38):
There are some interesting things. One of the things that I found very interesting is that, at some point,
we added support for MCP in Fenic. So now you can pretty much write any code you want over your data using dataframes,
(25:03):
and you can turn this into what we call a tool. So Fenic has a catalogue, an internal catalogue. And just as you have views and tables, right, as the standard components of your catalogue, we introduced a new one, which is the tool. So now you can create a catalogue
(25:23):
tool, which is associated with some code that you've written. The same way you would create, let's say, a view, you create a tool, and that is exposed by Fenic through MCP. So you make it available for any model to go and interact with. And Fenic takes care of all the plumbing, all the boilerplate, and everything that is needed there, in terms of both building and serving the
(25:51):
MCP server. So for me, there is a great opportunity there to create tooling for LLMs to interact with data. If we think about MCP, traditionally it's been about connecting with services, primarily, right? But we haven't yet figured out what MCP should look like for data. And our attempt to address that is by using, let's say, the dataframes, and introducing
(26:20):
a new type of catalog object, which is the tool. And just as you would create parameterized views in a database, you create parameterized tools in the catalog, and these tools can then be exposed, through MCP, to any model.
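A parameterized tool in a catalog can be pictured as a named, typed query template that a model invokes by name with arguments. The sketch below uses invented names (`Catalog`, `register_tool`) purely to illustrate the shape of the idea; Fenic's actual catalog and MCP wiring handle schemas, serving, and plumbing that this toy omits.

```python
# Illustrative sketch of "parameterized tools in a catalog". All names
# here (Catalog, register_tool) are made up for the example; a real
# catalog plus MCP server handles schemas, transport, and governance.

from typing import Callable

class Catalog:
    def __init__(self):
        self.tools: dict[str, Callable] = {}

    def register_tool(self, name: str, fn: Callable) -> None:
        """Like registering a parameterized view, but callable by a model."""
        self.tools[name] = fn

    def call(self, name: str, **params):
        # An MCP server would expose this over the wire; here it's local.
        return self.tools[name](**params)

orders = [
    {"customer": "acme", "total": 120},
    {"customer": "acme", "total": 30},
    {"customer": "globex", "total": 75},
]

catalog = Catalog()
catalog.register_tool(
    "customer_spend",
    lambda customer: sum(o["total"] for o in orders if o["customer"] == customer),
)

# A model (via MCP) would invoke the tool by name with parameters:
print(catalog.call("customer_spend", customer="acme"))  # 150
```

The key design point is that the model never sees the raw table or writes its own query; it only fills in the parameters of a vetted, pre-built data operation.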
Tobias Macey (26:39):
You're a developer who wants to innovate.
Instead, you're stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It's a flexible, unified platform that's built for developers by developers.
MongoDB is ACID compliant, enterprise ready with the capabilities you need to ship AI apps fast. That's why so many of the Fortune 500 trust MongoDB with their most critical workloads.
(27:01):
Ready to think outside rows and columns? Start building at mongodb.com/build
today.
And so one of the key positionings that you're doing with Fenic is that it helps to bring reliability and rigor, in terms of data engineering principles,
(27:21):
to LLM-powered transformations. Obviously, there are varying elements of probabilistic outputs and unpredictable results that you might get. And obviously, since you're dealing with various API calls, there's the potential for failures.
What are some of the anti-patterns that teams starting to build with Fenic should be aware of, and some of the ways that they might inadvertently shoot themselves in the foot, because of the way that they either design
(27:52):
their transformation flows or some of the ways that they structure the prompts or even restrictions in terms of the size and scale of data that they should be thinking about operating on within a certain batch of requests?
Kostas Pardalis (28:05):
Yeah. First of all, I have to say that, in my opinion, data engineers are kind of the right people to solve these non-deterministic problems that LLMs have. There's a lot of non-determinism when you work with data anyway, right? It's not a new thing. That's, let's say, what data engineers have always been fighting with: how we can create more determinism
(28:30):
out of the systems that we have. And why do I say that? First of all, even with traditional ML, right, there are processing steps that are by nature non-deterministic. Pipelines that are incremental are not necessarily deterministic. Things can fail at any point, right, and at any time. So that's also part of what adds non-determinism to the system. At the end of the day, data engineers are trying to fight against entropy in a way, right? That's their job. So if there are some people out there who can deal with these things, I think it's data engineers. The principles that they used before LLMs actually apply
(29:12):
to working with LLMs too. What I would say in general, as some best practices or guidelines: the most important thing, in my opinion, is to think in an incremental manner when you work with LLMs and data. Because people tend to go directly to, oh, the LLM is going to give me the answer.
(29:36):
Instead of doing that, think in terms of a pipeline, right? I have this data that, in its raw form, looks like this. What's the minimum number of steps to extract information and structure my information in a way that, downstream, we
(29:56):
have guarantees about what we are doing? So in the same way that, let's say, we start by getting the raw data and doing some cleaning, right, to get our data into shape, that's where I think LLMs should start being employed: how can we take the unstructured data that we have and start creating some structure? And from this structure, how can we apply the tools that we have to reason about the quality of this data, and decide whether we should push it further downstream or not? This is the
(30:32):
way I would approach it, and the way I would suggest anyone do it: just follow the engineering principles that have been working all these years, but apply them using a new type of compute, right? It's not going to be exactly the same thing, obviously; there are some other parameters that you have to take into account. But at the end of the day, if you architect the system, if you think in a more abstract way, you should be applying the same principles. The other thing, when it comes to,
(31:03):
let's say, LLMs, and this is more of a systems thing, actually, and I don't necessarily think that data engineers should have to care about this stuff, is that adding async capabilities to query engines is becoming more and more important, right? This is something that we also had to implement in Fenic; we added async
(31:23):
UDFs. You need that, and you need it because there are many steps that you have to take and process where, if you don't make the calls asynchronously as part of your plan, you are going to heavily underutilize your resources. But again, this is more for the systems engineers who build the tooling than it is for the data engineers themselves. But I think it's becoming a much, much more important, let's say, technique that needs to be incorporated into the tooling that we have and use.
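The point about async execution can be sketched with plain `asyncio`. This is not Fenic's async-UDF machinery, just the underlying idea: each inference call is mostly waiting on the network, so issuing calls concurrently, with a semaphore to stay under a provider's rate limit, overlaps those waits instead of serializing them.

```python
# Plain-asyncio sketch of why async matters for inference-heavy pipelines:
# the "LLM call" is pure waiting, so concurrent calls overlap their waits.
# A semaphore caps in-flight requests, mimicking a provider rate limit.
import asyncio

async def fake_llm_call(text: str) -> str:
    await asyncio.sleep(0.01)  # stands in for network + token streaming
    return text.upper()

async def run_pipeline(rows, max_in_flight=4):
    sem = asyncio.Semaphore(max_in_flight)

    async def one(row):
        async with sem:  # back off instead of hammering the provider
            return await fake_llm_call(row)

    # All calls are issued together; gather preserves input order.
    return await asyncio.gather(*(one(r) for r in rows))

rows = ["alpha", "beta", "gamma", "delta", "epsilon"]
results = asyncio.run(run_pipeline(rows))
print(results)  # ['ALPHA', 'BETA', 'GAMMA', 'DELTA', 'EPSILON']
```

With five rows and four slots, four waits overlap immediately and the fifth starts as soon as a slot frees up, rather than all five running back to back.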
Tobias Macey (31:57):
For people who are adopting Fenic and starting to take advantage of some of these semantic operators that are built into the dataframe, what are some of the common tool chains that they might be coming from, and what are some of the common first steps that they're going to take as they start to adopt some of these more semantic operators beyond just the
(32:22):
traditional
transformations
of maybe doing projections or converting
from one type to another within a particular column, etcetera?
Kostas Pardalis (32:32):
Yeah. I think for anyone who has used pandas, or PySpark, or any dataframe API before, Fenic should feel very, very familiar. As for the semantic operators, there's a small set of them. What we tried to do there is create,
(32:52):
let's say, a semantic equivalent of the standard operators, right? So: what does it mean to filter something semantically? That's the semantic filter that you can do with Fenic. Or you extract information, which is more like a projection. Then you have joins, semantic joins, which you can think of just like a regular join. The difference is that a regular join requires exact matches between the columns. Now, instead of exact matches,
(33:22):
you have not just like fuzzy matching, but you match actually meaning, right? So you can go and match,
do like a join, but instead of expecting that the values are like exact, values are going, the matching is going to happen like through like, based on like the meaning that the two columns have. So it should feel like familiar
(33:43):
to them to use. It is obviously a little bit different, in the sense that you
have to think, oh, I am joining on meaning, not on syntax,
right? But that makes you focus more on the data and on understanding what
it means for your data than on how to implement it, right? Which is what we are trying to remove from the equation
(34:07):
for people
out there. The other thing, one of the interesting things, I think, is that you can see the impact that
new technologies have by observing the data types that data systems support,
right?
(34:27):
So
in 2010,
we didn't have that many,
probably we didn't have any OLAP system at all that supported something like
JSON, right? Today, we take it for granted that you have structs and more flexible variant types, for example, that we can use. That wasn't true even with
(34:50):
Redshift, for example. It was a big problem: okay, we get all this data from the APIs that we can now pull data out of, but it's JSON,
we cannot query it efficiently, we have to do a lot of ugly wrangling of the data to make it work.
So there are new data types, in my opinion, that we also need now that we are working with LLMs.
And one of the things that someone can see when investigating
(35:13):
Fenic is that we are experimenting with that too. So we have a new data type called markdown.
It's kind of funny, because markdown was never intended to be
something to define structure,
right? It was intended to define
styling,
but we humans ended up using it to define structure too. And LLMs rely a lot on that structure,
(35:38):
right? So we introduced new data types like markdown
that you can use in a similar way that you could use
JSON: you can turn them into proper structs and then go and do the analysis there as you would with JSON. So in my opinion, the best way for someone to work with Fenic is to
(36:00):
use the documentation MCP server we have, which, by the way, is built using Fenic itself. We have a pipeline that goes through the code base whenever we have a new release and creates
a data set, using the semantic operators together with the normal operators, to create artifacts and tools that an LLM can use to understand how to use
(36:25):
Fenic correctly.
Connect it to something like Claude Code and just ask it to build something using Fenic,
or even ask for a tutorial. That's, I would say, the easiest way to get familiar with Fenic and use it. And hopefully it's a little bit of a glimpse into the future of how I believe we will be
writing code.
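The markdown-as-a-data-type idea can be sketched in plain Python (a toy parser, not Fenic's actual markdown type): heading levels become nested struct fields, and the text under each heading becomes a value you can then query the way you would query JSON.

```python
def markdown_to_struct(md: str) -> dict:
    """Toy parser: treat '#' heading levels as nested struct fields,
    with the prose under each heading collected into a 'text' field."""
    root: dict = {}
    stack = [(0, root)]  # (heading level, container dict)
    for line in md.splitlines():
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            title = line.lstrip("#").strip()
            # Pop back up to the nearest ancestor heading.
            while stack[-1][0] >= level:
                stack.pop()
            node: dict = {"text": ""}
            stack[-1][1][title] = node
            stack.append((level, node))
        elif line.strip() and stack[-1][0] > 0:
            node = stack[-1][1]
            node["text"] = (node["text"] + " " + line.strip()).strip()
    return root
```

Once a document is a struct like this, standard projections and filters apply to it, which is the point being made: the structure humans (and LLMs) read into markdown becomes queryable data.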
(36:46):
And hopefully it will also change
how hard it is for people in the data space to adopt new paradigms
and APIs
and architectures.
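To make the semantic join described earlier in this answer concrete, here is a dependency-free sketch. A hard-coded concept table stands in for the model's judgment of meaning (a real semantic join would ask an LLM or compare embeddings), but the shape of the operation — join on meaning, not on exact values — is the same.

```python
# Toy stand-in for matching on meaning: a lookup mapping surface forms
# to a shared concept. This table is illustrative, not part of any API.
CONCEPTS = {
    "nyc": "new york", "new york city": "new york", "new york": "new york",
    "sf": "san francisco", "san francisco": "san francisco",
}

def concept(value: str) -> str:
    return CONCEPTS.get(value.lower(), value.lower())

def semantic_join(left, right, on_left, on_right):
    """Join two lists of row-dicts where the join keys share a meaning,
    not necessarily an exact string match."""
    out = []
    for l in left:
        for r in right:
            if concept(l[on_left]) == concept(r[on_right]):
                out.append({**l, **r})
    return out
```

An exact-match join between "NYC" and "New York City" would produce nothing; the meaning-based version pairs them, which is the behavior the semantic join operator provides declaratively.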
Tobias Macey (36:56):
And so now digging more into the ecosystem
integration,
you already mentioned
the built-in compatibility with Polars and DuckDB.
And
I'm wondering if you can just talk to some of the ways that you see Fenic fitting into a typical
data transformation
pipeline.
(37:16):
And in particular, as people are using a lot of those pipelines to feed context into
AI and agentic systems,
the role that Fenic plays in that context engineering workflow?
Kostas Pardalis (37:30):
Yeah, 100%.
So the way that
we are also using Fenic, and I think the best way to use it, is as a system that's supposed to run locally, right? It's not a substitute for a full-blown Spark cluster. That's not what Fenic is for. It's something that you can run on
(37:51):
one instance that you have. Now, what Fenic has is some primitives
that
make it ideal for
working on your context without
your main agent having to spend tokens and context
on that. So if we think about context engineering, at the end of the day it's primarily state management, but state management under very specific constraints,
(38:21):
right? You have the constraints:
you have short-term context management that you have to do, long-term context management that you have to do. And you also have to work with tools, which is a new kind of API.
And
also,
you have to
work with semantics a lot, right? So a key-value store on its own wouldn't be enough
(38:46):
to manipulate the context.
And that's where I think Fenic really shines. It has the expressivity,
both in terms of
applying any traditional,
standard,
classic, or whatever we want to call it, data processing, right, by using dataframes in a declarative way, so focusing on what your context should look like, not how to create it,
(39:12):
and in terms of using inference,
offloading inference from the main agent to
Fenic to go and create the artifacts that you need for your context, and retrieving them through the tooling that Fenic provides for that. So
the way I think people should think about it is as a library that you can add to any agentic framework to build memory modules or context management modules,
(39:39):
either short-term or longer-term, because you can persist the data too. And most importantly,
offload
inference
from the main agentic loop to Fenic to do part of the context management. So if you want to summarize as part of your compaction, don't do that in your main agentic loop; offload it to Fenic and the semantic operators to go and do it, and store it in a way that gives you full lineage of it, so you can roll back if you want, right? You pretty much have a complete OLAP system that can fit as a companion to your agent, and you use that to manage
(40:19):
and serve the context for your agent. So it doesn't replace the agentic frameworks,
right? It's not a harness to run agents. It should be thought of as something that can enhance whatever
agentic framework you use, like Pydantic AI or LangGraph
or anything else, where you build the state management, the context management, using Fenic.
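The pattern described here — compaction offloaded from the agent loop, with lineage so you can roll back — can be sketched as a toy context store. This is plain Python, not Fenic's API; the summarizer is a stub standing in for a batch semantic operator.

```python
class ContextStore:
    """Toy context store: compaction happens outside the agent loop,
    and every change keeps the prior version for lineage/rollback."""

    def __init__(self, summarize):
        self.summarize = summarize  # pluggable, e.g. a batch LLM job
        self.versions = [[]]        # version 0: empty context

    @property
    def current(self):
        return self.versions[-1]

    def append(self, message: str):
        self.versions.append(self.current + [message])

    def compact(self, keep_last: int = 2):
        # Summarize everything except the most recent messages.
        old, tail = self.current[:-keep_last], self.current[-keep_last:]
        summary = self.summarize(old)
        self.versions.append([summary] + tail)

    def rollback(self):
        # Lineage makes this trivial: drop the latest version.
        self.versions.pop()
```

The agent itself stays stateless: it only ever reads `current`, while summarization and persistence happen off to the side, which mirrors the separation being argued for.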
Tobias Macey (40:44):
And then the other interesting aspect
of integrating with the broader ecosystem,
particularly as we talk about agentic use cases,
is the role that it plays
in conjunction with some of these agent frameworks such as LangChain, Pydantic AI, and the list is far longer than I am aware of at this point.
Kostas Pardalis (41:11):
Yeah, a 100%. I think there is a difference. I mean, the agentic framework is, and it should be, all about, let's say,
the agentic loop, right? That's what gives you the primitives to define, let's say, the broader behavior of your agents, keeping in mind that agents,
and LLMs in general,
right, are stateless things. You put tokens in; you can't offload the state management
to the agent itself. It's not going to work. So make this clear in your mind as an engineer: okay, my agent is, let's say, the logic that runs there, and it's stateless. Anyone who's coming from functional programming, I think, will find it pretty natural, right? It's a function: it gets some tokens in and spits some tokens out, and that's it. No side effects, nothing.
(41:53):
There's nothing else there. So the question then is, okay, because this context heavily affects the behavior and the quality and the performance of this thing, how are we going to manage that? And I do think that it should be separated
from the framework itself, and portable also.
It should be something that you can even use across different agentic frameworks, right? And that's kind of another value that you get with something like Fenic. You can build this with Pydantic AI.
(42:24):
At the end of the day, your context is going to be stored in, you know, either some Parquet files
or a DuckDB file.
You can use the same library in another
agentic framework and still access the same
content, right? And if you use MCP, that holds even among frameworks that are not only in Python; it can be any framework you want.
Tobias Macey (42:48):
And as you have been building Fenic and
working with the community and with some of your customers at Typedef, what are some of the most interesting or innovative or unexpected ways that you're seeing it used?
Kostas Pardalis (43:01):
I would say, most interesting,
I'll say two. One is kind of an internal one: the documentation
use case, where
you go directly from the code base to
actionable documentation
for an agent, without having to write the documentation itself. That's one. The other one
(43:23):
is how it's being used. We have some use cases where
it has been used to build some very interesting analytical
pipelines
that I just can't talk about, because sometimes it will come out from the people who used it, and I don't want to
say it before they do themselves, because it's their content. But they took content from many years of posting
(43:47):
on newsletters and some other platforms.
And they might not use it to create analytics
at the end, but they are very rich in extremely unstructured
data. So we are talking about pretty much, let's say, blog posts and audio files, and somehow they get all that and create
(44:08):
literally a star schema at the end, on which you can put
a semantic layer and even use an agentic analytics
framework to go and get really quantifiable
results.
And
all that coming from completely,
completely unstructured information. It was pretty fascinating to see, and most importantly, how you needed both LLMs
(44:32):
and also standard OLAP operations
in order to make it work at the end.
Tobias Macey (44:42):
And in your work of building this framework and building a business with that as one of the core technologies,
what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of bringing these semantic operators into this data frame primitive?
Kostas Pardalis (44:56):
I would say two things: one
positive, and one that is a bit of a challenge.
The positive one is that by having access to
LLMs,
you can accelerate
a lot the development of features that in the past wouldn't
have been possible for a startup. Especially in
(45:18):
the data infrastructure world. I'll give an example. Let's say you're trying to support SQL.
SQL has many different dialects, and many different dialects is not just a syntactical issue; it's also a semantic issue, right? The semantics are slightly different too. Now, supporting all the different dialects out there is like an impossible feat. It's super, super hard to do. LLMs allow you to
(45:44):
accelerate this a lot. You can get code, let's say from,
I don't know, let's say Teradata,
right? And I'm using Teradata just because it's something where it's not necessarily
easy to go and find code out there to
work on, or to get a parser or open source tooling for it. And still, you can use the LLMs, if you are careful, to do a lot of processing
(46:10):
that,
deterministically,
would take a lot of very, very hard work and a lot of time to make happen. Now, you still want to get to the deterministic
version of the code, but it allows you to go and
prove the value really fast, and then gradually start adding more and more determinism
(46:30):
as you go. That's one of the big benefits I see there. And I hope that we will see a lot of the problems in the data world actually being solved because of
having access to these systems and allowing data engineers, and engineers in general, to align on the different semantics of all the systems much, much faster and more reliably than they could before.
(46:54):
On the other hand, you have to change the way that you build. Building software and systems is
almost a shocking experience when you have to do it with LLMs
in the mix.
For me, because I've been exposed a lot to the product management
world, it also kind of reminds me of how you
(47:16):
do product development with A/B testing.
You
need to build all these evals and
benchmark frameworks,
and
you do that not just because you're trying to improve a metric;
it's because you are trying to understand the behavior of the model, with the tools that you have, against the problem that you have, and
(47:39):
adapt the software around it in a way that the agent will be
better at what it's doing, which is kind of what product development
with A/B testing is, right? Should we expose this way of doing the workflow to humans, or another way? There's not necessarily a proof, right, that will take you through to figure it out. You have to test it. And it's
(48:04):
hard, especially for
systems engineers who, you know, think in terms of invariants and proofs and all that stuff, to change the way you build software.
But it's fascinating. I don't know, I find it really, really fascinating.
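The evals-as-A/B-testing idea boils down to a loop like this minimal sketch. The model here is a stub standing in for an LLM call; the point is the shape of the harness: run labeled cases, measure, iterate, rather than prove.

```python
def run_eval(model, cases):
    """Score a model function against labeled cases.

    Like A/B testing, you don't prove behavior up front; you measure it
    against concrete cases and adapt the software around the results.
    """
    results = []
    for prompt, expected in cases:
        got = model(prompt)
        results.append({"prompt": prompt, "expected": expected,
                        "got": got, "pass": got == expected})
    passed = sum(r["pass"] for r in results)
    return {"pass_rate": passed / len(results), "results": results}
```

In practice the harness is rerun whenever the prompt, model, or tools change, and the pass rate (plus the failing cases) tells you whether a change actually helped.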
Tobias Macey (48:19):
And so for people who
want to take advantage of some of these LLM capabilities
in their data workflows, what are the situations where Fenic is the wrong choice?
Kostas Pardalis (48:32):
Don't use Fenic to build a chat with an LLM. You can do it, but it will be painful and not efficient. That's
not the goal. The goal of using Fenic is to
process data using
LLMs and
to do it in a structured way, in a reliable way, and at some scale, right? If you just want to process one PDF,
(48:55):
throw the PDF at ChatGPT; just copy-paste it there. You don't need to load it into a dataframe and go and process it. Now, if you are continuously going to be working with PDFs as they come in, and you want to be able to identify where issues in your pipeline come up
and when they come up, then yeah, use a structured way to approach
(49:17):
data processing and use something
like Fenic.
Tobias Macey (49:21):
And you mentioned earlier that Fenic is largely focused on
single machine compute.
I'm wondering what you see as some of the opportunity for integrating with some of the engines such as Ray or Spark or Dask for being able to do scale out capabilities
across multiple machines.
Kostas Pardalis (49:41):
Yeah, I think there's
definitely opportunity there. I think, in the same way that someone would use, let's say, LangChain,
people would primarily use it through creating UDFs.
I think with something like Fenic, and that's something that we should invest
some time in, to make it more performant and easier to integrate with those systems,
(50:04):
it's a much, much more natural fit to
integrate with either Ray or Spark and run these jobs there.
Tobias Macey (50:13):
And as you continue to build and iterate on Fenic, what are some of the things you have planned for the near to medium term, or maybe some of the ways that you're thinking about
the decision points where it makes sense to bring in additional semantic operators to the core?
Kostas Pardalis (50:30):
Yeah, so far, I mean, I think the way that the API is defined right now,
you have different layers of expressivity with the semantic layers,
with semantic operators, sorry. So
at the end, you can just do a map, a semantic map, which is almost like
a direct call to the LLM to generate some output. And you can either
(50:54):
have an output schema defined or not. So it's quite open-ended.
I think if we need to introduce new operators, that should come from
seeing
when people reach for map and what patterns there can be abstracted.
Outside of that, I would say that right now,
(51:14):
and I feel, from the literature also, from what's been published out there, that the coverage of what you can do with the semantic operators is pretty good. I think it's more about focusing on the
performance and optimization
side of things: how we can add new passes to the optimizer,
specifically
for LLMs, to make the whole thing much more efficient. And add fault tolerance into the mix. Because those are, I would say, at scale at least, when you're working with batches, the biggest problems
(51:48):
to solve, and also the most interesting problems from a systems engineering perspective. There are some very interesting things that can be done there. For example, depending on when the inference happens,
you can sort
your columns in a way that caching is much more efficient on the inference engine side.
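That sorting idea can be illustrated with a toy model of a prefix cache (not the optimizer's actual pass, just the intuition): if only the most recently seen prompt prefix is cached, then grouping rows that share a prefix turns repeated cache misses into hits.

```python
def cache_misses(prompts, prefix_len=10):
    """Count prefix-cache misses with a cache of size 1: only the most
    recent prefix is kept, like a KV-cache reused across a batch."""
    misses, last = 0, None
    for p in prompts:
        prefix = p[:prefix_len]
        if prefix != last:
            misses += 1
            last = prefix
    return misses
```

Interleaved tasks thrash the cache, while sorting the batch groups identical instruction prefixes together, so the inference engine can reuse the cached prefix computation across consecutive rows.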
What I'd love to see
(52:09):
as an engineer
is
to actually bridge the optimizer
with the inference engine too. So
make the optimizer aware of the inference engine and use that information
to optimize the plans in a way that will also optimize the batch inference that happens
there, not just on the CPU side.
Tobias Macey (52:30):
Are there any other aspects of the work that you're doing on Fenic, the use cases that it supports, or the overall engineering effort involved that we didn't discuss yet that you'd like to cover before we close out the show?
Kostas Pardalis (52:44):
No, for me, I would like to think of Fenic as an experiment at a very interesting
point in time. I'd love to hear from people how they think about it and in what new ways they can either make it work or break it, and
help all of us, the open source community and also obviously us at Typedef, learn and shape this new world of inference and traditional data processing and how these will fuse together. I think it's a great opportunity.
(53:17):
And hopefully there's a lot to be brought
into this space from all the practitioners in the data world, especially data engineers, as I said. And I hope that Fenic can be a platform to do that.
Tobias Macey (53:32):
Alright. Well, for anybody who wants to follow along with you and get in touch, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Kostas Pardalis (53:47):
I think that's a big conversation. We could probably have a full episode talking about that. I think if we want to be accurate here, we have to make a distinction between the enterprise world and the rest of the world. The reason I'm saying that is because the problems manifest a little bit differently between the two. The main
(54:07):
problem, in my opinion, is fragmentation
in terms of the tooling. It's just that, let's say, the enterprise
experiences not just fragmentation,
but fragmentation
with a lot of legacy systems too. So there is a question there about systems that have been around for twenty-five years. Literally, you talk with people in these big enterprises and they say people retire
(54:31):
and we
lose institutional knowledge about these systems.
And you cannot just rip them out. And, you know, people joke about COBOL, for example, in the programming language space, but in the data space the problem is even harder.
But this fragmentation,
if you go to, let's say, companies that are more digital native, or that grew up inside, let's say, the modern data stack, whatever we want to call it, exists there too. There is a lot of fragmentation,
(55:03):
and this fragmentation does create problems.
So the question is how we can bridge these tools together and how we can make the data practitioners actually increase their bandwidth instead of drowning in more work. Because now we don't only have humans
doing BI; we will also have a thousand agents making queries that make no sense. And these people will be responsible for maintaining the infrastructure in a way that can keep driving business decisions, right? Maybe LLMs can help. I don't know. It's definitely a problem, and a big opportunity in my opinion, in this space.
Tobias Macey (55:40):
Absolutely.
Well, thank you very much for taking the time today to join me and share the work that you're doing on Fenic and introducing
these semantic primitives into the data frame abstraction.
It's definitely a very interesting project. It's great to see the work that you've put into that and the ability to meld these two different worlds and functionalities into a single interface. So I appreciate all the time that you've put into that. I hope you enjoy the rest of your day.
Kostas Pardalis (56:06):
Thank you so much, and thank you to your audience for listening and,
hopefully, giving an opportunity to Fenic and giving some feedback to us.
Tobias Macey (56:16):
You're a developer who wants to innovate. Instead, you're stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It's a flexible, unified platform that's built for developers by developers.
MongoDB is ACID compliant and enterprise ready, with the capabilities you need to ship AI apps fast. That's why so many of the Fortune 500 trust MongoDB with their most critical workloads.
(56:36):
Ready to think outside rows and columns? Start building at mongodb.com/build
today.