Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Tobias Macey (00:11):
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
This episode is sponsored by datadriven.io, the free data engineering interview prep platform built by data engineers for data engineers.
Have you ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill separate from the job.
(00:33):
Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions.
Unlike SQL-only or Python-only practice, datadriven.io
covers the full interview loop. Star schemas, slowly changing dimensions, grain and fact table design,
idempotency, watermarks, dead letter queues, change data capture, and backpressure.
(00:56):
Every question comes from real data engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb.
Go to dataengineeringpodcast.com/datadriven today to start practicing.
Your host is Tobias Macey, and today, I'm interviewing Weimo Liu about the engineering behind PuppyGraph's zero copy ETL for querying your lakehouse as a graph. So Weimo, can you start by introducing yourself?
Weimo Liu (01:20):
Hello, everyone. This is Weimo, co-founder of PuppyGraph.
The name sounds like self driving. And
before that, I worked at a graph database startup called TigerGraph and also on the Google F1 team. F1 is a unified SQL query engine inside Google. It can query all the data across Google without ETL,
and it serves billions of queries per day. Yeah. So that's me.
Tobias Macey (01:44):
And do you remember how you first got started working in the data space?
Weimo Liu (01:48):
Oh, it's a long story. When I was in college, I was working on some
research projects on an open source spatial database.
And after that, I went to GW, the George Washington University, for my PhD degree
on database sampling techniques.
After that, I joined TigerGraph as it was closing its Series A, because the CTO and co-founder is a good friend of my PhD advisor,
(02:13):
because he was a professor in the database area as well. And later, I joined Google, working on the SQL query engine. And
finally, I
convinced my friend to found
PuppyGraph.
Tobias Macey (02:30):
And so digging now into PuppyGraph, can you give a bit of an overview about what it is that you're building and some of the story behind how it got started and why you decided that this is where you want to spend your time and energy?
Weimo Liu (02:40):
Yeah. So PuppyGraph is a federated graph engine on your tables or even Mongo and other data sources. You don't need to load your data somewhere else; you just connect PuppyGraph
and run graph queries, graph patterns, and graph algorithms on top of it. The story is that
since 2022,
(03:01):
ChatGPT was
becoming
popular, and some of my friends are
founders of
big large language model projects.
And they shared with me that
no one will write SQL or any other query in the future, and the agent will do everything. And we felt that, oh, this is a big opportunity.
(03:21):
So we're trying to build an engine for agents.
And then we thought about
what the agent needs. And we tried to follow first principles,
and then
we started to build PuppyGraph. Yeah. Recently, there are a lot of buzzwords
like agent harness, but at that time we didn't know the term yet. But we believed agents need something like that, so we tried to follow that principle.
(03:47):
And currently,
we have collected a lot of customer feedback as well. And we believe there are three
hard requirements to have a successful
data agent.
One is to
process unlimited data. The second is sub-second, real-time performance.
The third is the
so-called agent harness.
(04:08):
And we ourselves picked a graph query language to meet this,
to
enable
ontology enforcement,
which is why we believe we are pretty unique in this area.
Tobias Macey (04:22):
As you mentioned, the introduction
of these agentic capabilities
and the graph based knowledge retrieval that a lot of them benefit from has caused a pretty substantial resurgence in the overall interest in graph databases and graph query engines overall.
And I know that fundamentally
(04:44):
graph data, particularly if you're using a graph native storage layer,
is very challenging to scale horizontally because of the constraints of graph topologies, difficulties of figuring out where and how to shard, the impact of super nodes within the overall graph structure.
And I'm wondering if you can talk to some of the ways that the architecture of PuppyGraph
(05:07):
helps to address some of that or just some of the overall challenges in dealing with graph data, particularly as you move to larger volumes and more complex queries.
Weimo Liu (05:19):
Yes, yes. So this is a long-standing challenge in the graph space. I think the first project that solved the graph scalability issue is Pregel from Google,
and Google published a paper about that. It's called
"Pregel: A System for Large-Scale Graph Processing."
And
TigerGraph is kind of based on that paper. But internally,
(05:43):
we evolved it, because it's kind of iterative, run by run, and we felt it was too static and too time-consuming, like MapReduce. And then we made it more flexible.
And when I was at TigerGraph, people liked it, but big banks, for example, spent eighteen months loading data into it. And our
(06:07):
customers complained
about
this a lot. And after I joined Google, I figured out what was wrong. When I was at Google, we queried everything.
For example, if you have logs,
because logs
follow a certain format, we just define a table over them. We don't need to load them or rewrite them; we just query them as a table. And then I thought about whether a graph can do that as well. And
(06:31):
in this case, if we can just query, for example, your data lake like Iceberg,
then scalability for storage is no longer a problem, and the problem becomes how we can scale the computation.
And
we do something different. Most graph databases are sharded by
nodes, so each partition holds a different set of nodes. But then it's highly dependent
(06:56):
on the distribution of the graph. We instead
shard the data by edges.
In this case, even if you have a super node like Justin Bieber,
who may have too many followers,
we can just shard on the edges. For example, if he has 3,000,000
followers, we can split them into three partitions.
(07:18):
And for some others like me, I don't have a lot of followers.
But,
for example, for me and all my teammates,
all the edges between
our teammates and their followers can be in the same partition.
In this case, the partition sizes can be similar to each other, and we can just shard by the edges to solve
(07:42):
the hub node situation. And also, we make graph traversal much easier by
doing this. Before that, for example, if you did a 10-hop neighbor traversal,
with every additional hop the
complexity increased
exponentially.
But now it's kind of a linear increase.
And we can solve some
(08:05):
10-hop network queries in
one or two seconds with a cluster because
we can shard the edges and also shard the computation.
During the hops,
during the graph traversal, we can do the shuffling between different machines. So this is one way to solve the problem.
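Illustration (not part of the recording): the edge-based sharding described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not PuppyGraph's implementation; the function names, the SHA-256 hash, and the synthetic follower data are all hypothetical. It only contrasts how partition sizes behave when edges, rather than nodes, are the unit of sharding.

```python
import hashlib
from collections import Counter


def _stable_hash(text: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def edge_partition(src: str, dst: str, num_partitions: int) -> int:
    """Hypothetical edge sharding: the partition depends on both endpoints."""
    return _stable_hash(f"{src}->{dst}") % num_partitions


def node_partition(node: str, num_partitions: int) -> int:
    """Hypothetical node sharding: every edge of a node lands with that node."""
    return _stable_hash(node) % num_partitions


# Synthetic data: one hub account with many followers, plus a small team.
edges = [(f"fan{i}", "celebrity") for i in range(30_000)]
edges += [(f"teammate{i}", "me") for i in range(10)]

# Sharding by edge spreads the hub's followers across all partitions.
by_edge = Counter(edge_partition(s, d, 3) for s, d in edges)

# Sharding by the followed node puts every follower edge of the hub together.
by_node = Counter(node_partition(d, 3) for s, d in edges)

print("edge-sharded partition sizes:", dict(by_edge))
print("node-sharded partition sizes:", dict(by_node))
```

With node sharding, whichever partition owns the hub carries nearly all of the edges; with edge sharding, the three partitions end up close to the same size.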
Tobias Macey (08:24):
You mentioned too that one of the features or one of the capabilities
that you're aiming for with PuppyGraph is to have low latency,
and lakehouse architectures in particular struggle with that broadly just because of the architectural
fundamentals of it, and there's a lot of work going on at all the different layers to help mitigate that.
(08:46):
But I'm wondering,
particularly when you're in a data exploration
phase or if you are using
agent generated
graph traversal queries, which might be very complex or require multiple hops, how you
mitigate some of those challenges of being able to cut down on latency
while still being able to allow for exploratory
(09:10):
and discoverable
connections?
Weimo Liu (09:14):
Yeah. First, there is a certain overhead if we read data from a lakehouse. Like, we don't optimize for ten-millisecond or twenty-millisecond queries at all. So
for that case, there are a lot of all-in-memory solutions,
and we just
kind of give up on it.
What
(09:34):
we are good at is sub-second or single-digit-second queries. In this case, we still have the overhead, but the overhead is usually
maybe fifteen milliseconds or one hundred milliseconds to fetch from
S3, for example.
But at the same time, we optimize for the computation.
And we also have vectorized evaluation
(09:58):
and also MPP. In this case, we can handle more nodes and edges at the same time.
And also,
we can scale out: more machines, better performance. In this case, we can try to optimize for sub-second queries. And at the same time, because Iceberg has metadata, we can do active caching or
adaptive caching. And then we can just
(10:20):
gather data and
store the cache in memory and also on
local disk. And the next time, after we
read the
metadata
and we see that, oh, this Parquet file was not updated
in the last several minutes, so it
is kind of a cache hit, then we can just load from local disk or even memory, and then the
performance will be much better.
Tobias Macey (10:44):
There have been other attempts at being able to add a graph traversal layer on top of other storage. I think the GraphX package from Apache Spark is probably the most notable one. Some of the recent
attempts at that are the KuzuDB project, which I think was aiming for that, and which I think has been taken over by the LadybugDB
(11:09):
fork. And then maybe the most notable recent entry is the LanceGraph package for being able to do graph traversals on top of the underlying Lance table format.
And I'm wondering what are some of the areas of inspiration or comparison that you would like to highlight between what you're doing with PuppyGraph and some of those other technologies?
Weimo Liu (11:31):
Yeah. I think our project is kind of inspired by GraphX and GraphFrames.
But the tricky part is that Spark is not optimized for graphs at all.
And also, we saw some friends
build on top of other SQL query engines like Trino. And in this case, you highly depend on the compute framework itself. And usually, it's not designed for graphs. For example, Spark is optimized
(11:57):
for
Spark jobs and also Spark SQL, and Trino is optimized for SQL. In this case, the engine is optimized for something
else first,
and then on top of it, it's optimized for graphs. The
directions are not
aligned, so the performance is a big bottleneck.
(12:20):
And we also see Lance Graph and Kuzu. I think they are very great products,
but they're more for
small data. Maybe the storage can be scalable because they can just try to read Iceberg.
But I think we proposed this first. And later, both of them added support for
reading from object stores.
(12:42):
But the issue is that if the data is big, or even if the data is not very big, like 200
gigabytes, because the computation
on a graph is very heavy and the data is highly connected,
shuffle is a necessary feature for this kind of workload. And we are very good at this kind of stuff. And we saw a lot of all-in-memory solutions like Memgraph,
(13:06):
and they're very good at small data because
all the data is loaded into memory first and then they do the computation.
And I think Kuzu and Lance Graph
are also kind of single-machine ones. And this is a long-term problem
in the graph world. Since graph data is highly connected, no one wants to do the shuffling.
(13:27):
And
then the bottleneck is that on small data, everyone publishes benchmarks and is very fast.
But when it really scales to industry
data sizes, it becomes a problem. And we also see most of our customers coming to us for this because their data is too big, because they're already leveraging Databricks,
(13:52):
Iceberg, or Trino or other things. The data is already there, and it's super big. And they're trying to have a graph solution on top of it,
and I
think we are very unique in this position. That's why we are a small company, but we have a lot of big logos
as our customers.
Tobias Macey (14:12):
Now digging into some of the data modeling questions,
your focus is on that zero ETL aspect of you don't have to move your data into a different layer,
but a lot of data maybe doesn't necessarily
have that natural graph topology
or you need to do some explicit modeling of it. And, also, there are a few different flavors of graph
(14:35):
definitions,
whether it's a labeled property graph or the RDF triples.
And I'm wondering if you can talk to some of the ways that you approach some of that data discovery and data modeling aspect of being able to take the existing data
as it is in its natural, probably tabular structure,
and be able to represent that as a graph and manage the evolution, particularly as the underlying schemas evolve?
Weimo Liu (15:01):
Yeah. Yeah. You definitely have
a deep expertise
in this area. Yeah. So this is a typical question. And first, let me talk about the ideal case. The ideal case is that all the tables are normalized, and then it's natural for them to be a graph. Like you have a customer table,
you have a product table, you have order history.
(15:25):
Actually, order history is an edge between customer and
product; it means customer A bought product B.
Those are perfect. And at the same time, in database 101
there are some principles,
like when everybody
does the data modeling, please create the tables in a normalized way, and it saves a lot of storage cost. And
(15:48):
also,
we provide
a very easy and straightforward data modeling for this situation. But of course, this is the ideal case, and some customers are okay because they already normalize
some tables.
Denormalizing is actually for
pre-joining
something to get better performance. But after they reach out to us, they realize, oh, maybe we don't need to do pre-joining; we can just run a graph pattern. And it's also very fast and in real time. And
(16:21):
so this is the ideal case.
And some customers either have normalized tables or they are okay with normalizing their current tables, because before that, the denormalized table was for single-table, wide-table performance.
And if they can normalize it and still have
very fast performance on graph patterns, it means they can still do joins but without the slow performance of joins. So they are pretty happy. Another case is that, because a table is already in production for some other use case, they don't want to change it at all. That is the tricky part for us. One way is that we have a logical view and then define the graph on the logical view. Another possibility
(17:06):
is that we have a very flexible mapping. For example, say you already joined the customer profile and
product profile
into a wide table for order history.
In this case, we can define one column, like ZIP code, as a node. So it's not a node table, it's just an attribute column in a wide table. But we will dedupe it for you logically,
(17:29):
and then you can have a flexible mapping from the graph schema to your tables.
So this is another way we do it. And of course, for the same 100 tables, for example, we can create a lot of
different graph schemas.
And different graph schemas have advantages and disadvantages.
(17:53):
And usually, they highly depend on the use case and the queries being run. And then we can
suggest the best graph schema. But usually,
our customers,
because they want to build some
customer-facing agent system,
are pretty familiar with their tables. In this case, we will discuss together and
(18:15):
build a graph schema that's best for their use case. Yeah, but of course, sometimes it's not that good, but
if it works, it's fine, and there is always room to optimize it.
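Illustration (not part of the recording): the flexible mapping described in this answer, where one column of an already-joined wide table is treated as a node and deduplicated logically, can be sketched as follows. This is a minimal sketch under stated assumptions; the table contents, column names, and helper functions are hypothetical and do not reflect PuppyGraph's actual schema format.

```python
# Hypothetical wide "order history" table with customer attributes already joined in.
orders = [
    {"order_id": 1, "customer_id": "c1", "customer_zip": "94105", "product_id": "p1"},
    {"order_id": 2, "customer_id": "c1", "customer_zip": "94105", "product_id": "p2"},
    {"order_id": 3, "customer_id": "c2", "customer_zip": "10001", "product_id": "p1"},
]


def logical_nodes(rows, column):
    """Distinct values of one column become nodes; no separate node table exists."""
    return sorted({row[column] for row in rows})


def logical_edges(rows, src_column, dst_column):
    """Each distinct (src, dst) pair found in the wide table becomes one edge."""
    return sorted({(row[src_column], row[dst_column]) for row in rows})


# ZIP code is only an attribute column, but it is exposed as a deduplicated node set.
zip_nodes = logical_nodes(orders, "customer_zip")

# Edges are derived from the same rows without rewriting or copying the table.
lives_in = logical_edges(orders, "customer_id", "customer_zip")
bought = logical_edges(orders, "customer_id", "product_id")

print(zip_nodes)
print(lives_in)
print(bought)
```

The underlying rows are never modified; the node and edge sets are just views computed over the columns that the graph schema points at.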
Tobias Macey (18:34):
Beyond the relational structures,
there are also potentially
document models. You mentioned that you're able to execute across MongoDB
as a storage layer.
And also, increasingly, we're looking to unstructured
data sources and doing some transformation of that into
semantically
enriched deep data or extracting structured data from free text. And I'm wondering how you're addressing some of those as well and being able to map that into a graph structure and maybe some of the interesting use cases that you unlock because of the fact that you're able to work across these different storage layers.
Weimo Liu (19:13):
Yeah. So for
Mongo,
it really depends on how unstructured the data is. And for Mongo, it's actually pretty
structured already, since most documents in a collection follow the same pattern,
something like a JSON file. And even though they are not flattened into a table, logically
(19:37):
you can
have a flattened table. Like,
when I was at Google, we also did a lot of this work. If there's a nested field, like a.b.c,
we can just use SQL to select a.b.c from, like, collection A, something like that. And
in this case, we can just connect to MongoDB
(19:57):
with a JDBC interface. Maybe it's a surprise to some of the audience that MongoDB can be accessed by JDBC.
And in this case, we can just run queries similar to the table ones. And also for Mongo collections, there is still something like a foreign key. And then we can use the keys to link documents to each other to form a graph. And for even more unstructured data, like, for example, PDFs,
(20:25):
we have
two partners. One is from the Trino team, one is my old friend at Google. They were building the
index of
Google Search. So what they are doing is that they take a bunch of documents like PDFs,
and they do the entity extraction
and store it in, for example, Iceberg tables. This is an interesting part because for a lot of unstructured data like PDFs,
(20:50):
take Bank of America bank statements: even though they are unstructured,
the unstructured data itself has
some potential
structure inside. All the PDFs follow the same pattern: they have a bank
account number,
they have the home address,
(21:10):
they have an account holder name, and they also have transaction tables. So there are certain rules, and our partners help us do the extraction
from the PDFs,
to markdown, and then to tables.
And
since both of our partners work
in this area
(21:31):
for many years, like the Google Search team and also the Trino team, we just
partner with them closely.
And what we are doing is just you already have tables,
and then you can have a
graph query, and also you can have
agent system to query it.
And
(21:52):
we already have a lot of joint customers, and
it's pretty smooth. And it's even better than,
for example, just making chunks and then doing the embedding.
Since
when you do chunks and embeddings, you actually don't leverage the
potential structure
inside your documents. For example, in the case that every PDF has a
(22:17):
different pattern,
then maybe
embedding is better. But if you have like 1,000,000 PDFs and all of them follow the
exact same pattern,
it's actually structured data,
a structured
representation.
So in our experience, if we can flatten those into tables, it'll have a lot of benefits.
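Illustration (not part of the recording): the idea earlier in this answer, that a nested document field such as a.b.c can be addressed as if it were a column of a flattened table, can be sketched in Python. This is a minimal sketch under stated assumptions; the `flatten` helper and the sample document are hypothetical and are not how PuppyGraph or any JDBC driver actually does it.

```python
def flatten(doc, prefix=""):
    """Turn nested dictionaries into one flat row keyed by dotted paths."""
    row = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so that {"a": {"b": {"c": 1}}} yields the column "a.b.c".
            row.update(flatten(value, path))
        else:
            row[path] = value
    return row


# A hypothetical document from a collection whose documents share one shape.
document = {
    "account_id": "acct-1",
    "holder": {"name": "Ada", "address": {"zip": "94105"}},
}

row = flatten(document)
print(row)
```

Once every document in a collection flattens to the same set of dotted columns, key-like columns (here `account_id`) can play the role of foreign keys that link rows into nodes and edges.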
Tobias Macey (22:41):
Digging into
PuppyGraph itself, I'm wondering if you can give a bit more detail on the architecture,
some of the technology choices that you're investing in to enable this use case and some of the core ecosystem primitives that you're leaning on to be able to manage the complexity of the space that you're working in?
Weimo Liu (23:05):
Yeah. So since the beginning, because we wanted to build a system for agents, we considered different ecosystems
and which
parts of the system and which components to pick.
The first is that, I think three years ago, even before that, after
the Iceberg team left Netflix,
(23:26):
we believed Iceberg would be the standard for OLAP in the very near future. So we pinged the founders and
showed them how we can run a three-hop graph query on Iceberg without any change. And it was much faster than most graph databases on the market, and they were also surprised because Iceberg is not optimized for graphs. And
(23:49):
we work with them closely.
And at the beginning, we only supported Iceberg, but my teammates asked, what if Iceberg
won't be popular or doesn't become
popular fast enough? Then we would just go bankrupt. So we support other data sources too, but our favorite is Iceberg. And another thing we were waiting for is for agents to be capable enough to generate all the queries automatically without a human in the loop. And for this case, we needed to pick an interface.
(24:24):
And we considered SQL at the beginning, but we feel that
when a human writes SQL, there is a lot of context in their mind. When they write something wrong, they know it because they know the business logic. Like, a student shouldn't be joined with a teacher's salary table; otherwise the student will appear to have 100
in income last year or something like that, without any error being thrown. And
(24:49):
then, since I worked at TigerGraph, I think the graph is the best because the graph itself contains not just data
but also the ontology. And if people are building an ontology,
and the ontology resides in the graph, why don't we just query the graph rather than use the graph to generate better SQL? That would be against first principles.
(25:11):
And then we support Cypher and Gremlin at the same time because they are popular and there is enough public data on GitHub for large models to understand and generate the queries.
And
also,
we only provide a Docker image to our customers because then they can just use Kubernetes to deploy it more easily.
(25:33):
And so this is basically the interface layer, the graph query languages, like Cypher and Gremlin. The computation part is what we do ourselves. And for the storage layer, we just leverage the table format. And the best one, of course, for us is
Iceberg. We designed the system this way so that it's easy to leverage by different communities.
(25:57):
Like we connect the community of graph work and the community of common
data engineering work. In
the
last ten years, I think the graph community
has been separate
from the data engineering community. The data engineering community has evolved a lot, but the graph community has not changed much in the last ten years. And I think if we can bring the capabilities together,
(26:22):
it benefits not just the agents we designed for, but also human users as well.
Tobias Macey (26:30):
One of the other use cases for graphs,
particularly in the data engineering community that I've come across a few times is for master data management and being able to do things like named entity reconciliation
to be able to say these two documents that are talking about slightly differently worded
(26:50):
entities are actually the same thing and being able to do some of that resolution
there. And I'm wondering what you're seeing as far as applications of PuppyGraph in that more I'm gonna use air quotes and say traditional
data warehousing use cases in addition to these more agentic workloads and maybe some of the cases where those two coincide of being able to use agents to do some of that master data management and entity linking.
Weimo Liu (27:18):
Yeah. So we saw some
users using graph databases to do entity resolution,
and there are also some other ways in SQL.
And
we're trying to see, since now,
you know, in our
theory,
every data warehouse and data lake now supports graph. So we're trying to apply the entity
(27:40):
resolution
solutions
of graph databases to the
SQL table ecosystem.
And then the user can just apply the existing solution
to the SQL one. And also, we can write the results back to tables. In this case, for example, Iceberg will be the bus of the
(28:03):
data pipeline. Like a bus, we
write the results back, and others read the results from Iceberg. Other engines like Trino and Spark SQL read tables and write tables. In this case, we don't need to talk with each other and do data loading. Everybody just reads from Iceberg and writes to Iceberg. And then your output can be our input, and our output can be others' input. And we're trying to make the data pipeline easier.
Tobias Macey (28:30):
Now circling back around to what you were commenting with some of these other
more point solution graph engines that are very efficient on smaller scales and volumes of data.
What are some of the ways that you're seeing people maybe
use both in concert where they've got a dedicated graph engine for
(28:51):
their data that needs to be low latency and in the hot path of a certain workload,
but then using PuppyGraph for more of that scale out across larger volumes of data and maybe being able to transfer data to and from the hot path and into the more warm path at the iceberg layer.
Weimo Liu (29:11):
Yes. Exactly. So this is what we're expecting. Like, in the SQL world, it's pretty common. Like, you have PostgreSQL,
and also you have Snowflake, Databricks, Trino.
And then you have Postgres to handle the transactional CRUD with ACID.
And for the large streaming data or batch
data, you use Trino or Snowflake or Databricks.
(29:36):
But before that, in the graph world, it seems everything is in a position similar to
PostgreSQL's.
And no one cares about the OLAP
one. And we're trying to be the OLAP one. And at the same time, for graph databases, people won't store all the data in a graph database. Usually, they handle the hot data or transactional data.
(29:57):
And in this case, in the SQL world, it's just CDC or some data loading tool, like Airbyte; you're wearing the Airbyte jacket, right? So you load the data from PostgreSQL to Iceberg, for example, and then do the OLAP there. I think, hopefully, in the graph world, this can be the common practice as well. Like, you don't want to store all the historical
(30:20):
data in a graph database. That's too expensive. And also, it affects your
transactional QPS
when you run
some heavy analytical query.
And also it's very slow; we see a lot of timeouts and out-of-memory errors
when running
that kind of query. But you can still use a graph database for transactional updates and then
(30:45):
load the data
into Iceberg and then use, for example, PuppyGraph to run the OLAP queries.
And then it will have a lot of benefits, which has been proven very well in the SQL world. So hopefully this can be the common practice. And also we help some customers stay in the graph world. Before that, they were trying to migrate away from
(31:06):
Neo4j to
PostgreSQL
because of ecosystem problems. And we said that you don't have to do it now. Graph has OLAP as well. And then they just do the data loading or CDC from Neo4j to Iceberg, and we
run queries on top of Iceberg.
Tobias Macey (31:25):
I'm interested in digging a little bit more into some of that translation layer and the data modeling and representation
of graph structures
in the Iceberg ecosystem
because Iceberg was designed
primarily
with tabular structures in mind. And I know that, for instance, Kuzu DB actually uses a columnar representation under the hood, but I'm just curious what are some of the points of impedance mismatch between a graph native representation
(31:54):
on disk, for instance, from something like a Memgraph or a Neo4j
and how to actually do that translation
into Iceberg and mapping to and from the structural
semantics?
Weimo Liu (32:07):
Yeah. So
in
our design, we decoupled the computation
and the storage completely. In this case, the computation is still in
graph mode. But for the storage,
what we're doing is that we define the node operator and the edge operator. And the operators'
inputs and outputs are collections of nodes and edges. And in this case, we're assuming
(32:32):
every graph query, graph pattern, or graph algorithm can be a combination of node operators and edge operators. Then we can do cost-based optimization. And for a single operator, because the input and output are collections, we can do MPP and also vectorized evaluation. And of course, in this case, because it's a collection, column-based storage is really, really important. And then at the final stage, we still need to fetch the data. In this case, we just run the Parquet file reader, read the Parquet file, and translate it into collections,
(33:07):
and then pass them to the computation.
And I think this is an interesting part. Before that, all the graph databases were trying to speed up to support
complex queries, but they're still close to row-based, because they need to support high-QPS transactional updates. But in the SQL world, everybody knows that OLAP and OLTP
(33:28):
need different storage: OLTP
needs a row-based one and OLAP needs a column-based one. And
with the
column-based one, it's much more memory efficient.
And
in this case, we can handle much larger data.
And at the same time, the query complexity is no longer a problem since, for example,
(33:49):
one CPU instruction can handle a vector of nodes and edges. And at the same time, we only access the necessary attributes. Even if one node or edge has 100 attributes, maybe for a single query only three or four are relevant. If it's column-based, we can just leave
all the other
97 or 96
(34:10):
attributes on disk. In this case, it's much more memory efficient.
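Illustration (not part of the recording): the operator model described in this answer, where an edge operator takes a collection of nodes and returns a collection, and only the needed columns are read, can be sketched as follows. This is a minimal sketch under stated assumptions, not PuppyGraph's engine; all names and data are hypothetical, plain Python lists stand in for Parquet column chunks, and a set comprehension stands in for real vectorized (SIMD) evaluation.

```python
def prune_columns(table, needed):
    """Column pruning: materialize only the attributes a query touches."""
    return {name: table[name] for name in needed}


def edge_operator(frontier, src_col, dst_col):
    """Collection in, collection out: all out-neighbors of the frontier nodes."""
    return {dst for src, dst in zip(src_col, dst_col) if src in frontier}


def k_hop(start, src_col, dst_col, hops):
    """A multi-hop traversal is just the edge operator applied repeatedly."""
    frontier = set(start)
    for _ in range(hops):
        frontier = edge_operator(frontier, src_col, dst_col)
    return frontier


# Hypothetical columnar edge table; "since" and "weight" stand in for the many
# attributes that a given query never needs to load.
edge_table = {
    "src": [1, 1, 2, 3],
    "dst": [2, 3, 3, 4],
    "since": [2019, 2020, 2021, 2022],
    "weight": [0.5, 0.1, 0.9, 0.3],
}

cols = prune_columns(edge_table, ["src", "dst"])
two_hop = k_hop({1}, cols["src"], cols["dst"], hops=2)
print(two_hop)
```

Because each operator works on whole collections rather than one node at a time, the same shape lends itself to splitting the input across machines and shuffling the intermediate frontier between hops.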
Tobias Macey (34:14):
One of the other challenges
for a product like PuppyGraph
is that graph engines,
as we mentioned before, have been somewhat niche for a while. They're not as broadly adopted
as a Postgres or a Snowflake.
And I'm wondering what are some of the areas of education that you've had to invest in to help people understand
(34:36):
the power and benefits of having that native graph structure and graph traversal capability
available to the underlying data that they're already investing in?
Weimo Liu (34:47):
Well, I think this is kind of a chicken-and-egg problem, since before that, the investment before you run the first graph query was too heavy.
So even for some perfect graph use cases, people still want to, for example,
write SQL or write a Spark job to do it. Because even though it's slow and complicated,
you don't need to have another copy of the data and another pipeline.
(35:13):
And in this case, I think it is hard for the user to adopt a native
graph engine.
And at the same time, we feel that there are actually real requirements,
and a lot of users just give up on the use case after they try different ways. Like
when they have very complex things,
they try complex SQL, but
(35:35):
it very soon
becomes
no longer human readable.
There are hundreds of lines of SQL,
or
it's too slow.
Or a lot of customers have 1,000 tables, but in daily work, only 20 or 30 tables are used. All the others are just left there and no one accesses them at all. But now,
(35:59):
because we show the possibility to a lot of our customers,
they see that, for example, they can write a very complex query
in a short way, like
ten lines of query with the expressive power of more than 100 lines of SQL.
And the interesting part is that when they feel that, oh, this works and can just return the result,
(36:24):
they will keep
trying our
capability
and write more complex queries.
And this is more
common when they are using an agent, since agents don't care how complex the query is, because
people just ask some question and assign a task to the agent, and the agent will decompose it into subtasks.
(36:48):
And each subtask can be
more and more complex.
And in this case,
sometimes they send the logs to us so we can help debug.
We find that even a graph query, at a hundred lines, is no longer readable. But the agent can just write the correct one, which was a surprise for us. And we feel that if we show stronger capability
(37:14):
and a more complex query can be handled in a short time
with a real-time response, then
people will want to issue more complex queries
and
assign the agent more complex tasks.
So we believe usage will grow larger and larger. Before, maybe it was limited because of the limitations
(37:39):
of the tools, so people had to give up some wonderful ideas. But now they can just try, and
the agent can also help them try.
So the cost is pretty low now, and they can try some fancy ideas without heavy investment.
Tobias Macey (38:00):
Another element of the overall graph ecosystem
that has
varying levels of support depending on the underlying engine or the language that you're working within is the
core graph query and traversal,
and then a lot of engines will add another layer of
out of the box graph algorithms or graph machine learning or data science capabilities
(38:22):
such as betweenness scores and centrality scores, etcetera.
And I'm wondering what the
capabilities are around PuppyGraph for being able to do some of those more native graph feature extraction and discovery.
Weimo Liu (38:37):
Yeah. I think this is where PuppyGraph differs from other graph solutions. In most graph solutions, the query engine and the graph algorithms are implemented
separately.
They have a query engine, and they also have an independent implementation of each graph algorithm, one by one.
But for us, we try to build everything on our engine, as I mentioned, with the node operator and the edge operator. So when we implement a new graph algorithm,
(39:06):
we don't need to start from zero,
like how to process the data in parallel or how to share the data. We just need to
implement the algorithm like the code in a textbook.
After that, we can deliver a new algorithm within a week or so. Some customers even
implement
their own graph algorithms in the query language.
(39:29):
Since, you know,
Gremlin is Turing complete.
So
before, people didn't do it just because implementing an algorithm through the
query engine would be too slow. For example,
JanusGraph is a single-threaded engine,
so if you implement on it, the algorithm will only run on a single thread, which is a potential performance issue. But for us, some of our customers just customize their algorithms
(39:55):
based on the query language, and that is another option. And also,
we don't charge extra for that. People are just charged for their usage of the engine; whether it's an algorithm, a Cypher query, or a Gremlin query, we charge the same. We don't have an additional,
like, enterprise feature charge for that.
Tobias Macey (40:14):
And so for somebody who is interested
in
using
the
capabilities
of graph engines for doing some of that graph traversal discovery,
semantic capture,
and ontological
representation.
What are some of the guiding questions that you would ask them to help them determine whether PuppyGraph
(40:38):
or another solution is the appropriate
solution
or maybe even just say just use NetworkX and Python because it's a one off type of use case?
Weimo Liu (40:49):
Usually,
if our customers already have data in a data warehouse,
a data lake, or even a database,
we recommend they use PuppyGraph, since it makes the pipeline much shorter and the system complexity
much
lower. So in this case, they can just have,
for example, all the Iceberg tables,
(41:11):
define a graph on top of them, and then run graph queries and graph algorithms.
The results can also be written back to Iceberg
and then used by Spark
or some other tools, even PyTorch,
that kind of thing.
But some users, for example some data scientists, don't care about data warehousing.
(41:34):
They use Python all the way, and all their data is already in CSV files.
We also support that, but it seems easier to use an embedded tool: they can just use Python to read the CSV files with something like DuckDB
and then run NetworkX on top of it. And
(41:56):
then we feel that people who use a data lake or data warehouse
like us, because
the reason they use a data lake or
data warehouse is that their data is big, and they don't want to handle the
distributed-systems
work themselves. So
we are naturally a good fit. But if people just use DuckDB
(42:18):
and have a small data set, they can use DuckDB to handle things; it's embedded in Python, and everything can run on a laptop. We have tried recommending our product to some of these users, but I
don't think we can convince them because,
frankly speaking,
just the Python ecosystem with DuckDB and NetworkX
(42:41):
is better and more convenient than PuppyGraph.
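As a rough illustration of the small-data path described here, the sketch below reads an edge list from CSV text and runs a breadth-first traversal entirely in one Python process. It is only a sketch under stated assumptions: it uses the standard library as a stand-in for DuckDB and NetworkX so that it runs anywhere, and the CSV contents, the column names `src` and `dst`, and the helper names are made up for the example.

```python
import csv
import io
from collections import defaultdict, deque

# Hypothetical edge list; in practice this would be a CSV file on disk.
EDGES_CSV = """src,dst
alice,bob
bob,carol
carol,dave
erin,frank
"""


def load_graph(csv_text):
    """Build an undirected adjacency map from 'src,dst' rows."""
    graph = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        graph[row["src"]].add(row["dst"])
        graph[row["dst"]].add(row["src"])
    return graph


def hops_from(graph, start):
    """Breadth-first search: hop distance to every node reachable from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


graph = load_graph(EDGES_CSV)
distances = hops_from(graph, "alice")
print(distances)
```

With a few thousand rows, everything here fits in memory on a laptop, which is the situation where an embedded toolchain needs no separate pipeline or service.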
Tobias Macey (42:45):
One of the other interesting
aspects of where we are right now is the proliferation
of
vector embeddings for
some of these unstructured
sources.
And one of the patterns that I'm seeing is using a vector query to determine the starting point into a graph and then doing traversal from there. And I'm wondering how you're seeing people deal with that, particularly if they're using Iceberg as the underlying storage given that Iceberg doesn't really have native vector indexing capabilities.
Weimo Liu (43:20):
Yeah. On our side, we can query an Iceberg array as a vector. Also,
I've heard from the Iceberg community that
they will support it very soon. And I think
the LanceDB folks are also actively working with them. So hopefully, Iceberg will
have a vector type very soon.
(43:43):
But currently,
we just query the array type in Iceberg
as a vector.
And I think it works fine because
it's more like an index on top of an array type, and then we can run vector search on top of it.
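The pattern discussed in this exchange, a vector lookup to pick an entry point followed by a graph traversal, can be sketched roughly as follows. This is an illustration under assumptions, not PuppyGraph's implementation: the embeddings are plain Python lists standing in for an array column, the search is a brute-force cosine scan rather than an index, and every node name, edge, and function name is hypothetical.

```python
import math
from collections import deque

# Hypothetical nodes: each carries an embedding stored as a plain array.
NODES = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}

# Hypothetical directed edges between the same nodes.
EDGES = {
    "doc_a": ["doc_c"],
    "doc_b": [],
    "doc_c": ["doc_b"],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_node(query):
    """Brute-force vector search: the node whose embedding is closest to query."""
    return max(NODES, key=lambda name: cosine(NODES[name], query))


def traverse(start, max_hops):
    """Breadth-first traversal from start, limited to max_hops edges."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in EDGES.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen


entry = nearest_node([0.9, 0.1, 0.0])    # step 1: vector query picks the entry point
reachable = traverse(entry, max_hops=1)  # step 2: graph traversal from that node
print(entry, reachable)
```

A real deployment would replace the linear scan with an approximate nearest-neighbor index once the number of vectors grows.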
Tobias Macey (44:00):
And as you have been building PuppyGraph
and helping your customers get up to speed with it and understand its applications, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
Weimo Liu (44:13):
One customer is Palo Alto Networks.
They have several teams using our product. Some teams are using it
for posture management, which is a customer-facing project. But the
interesting part is the security research team: they
have all their logs in Iceberg
and use PuppyGraph to connect to all their logs and look back, just to visualize all the data.
(44:37):
And then
they found a botnet.
Its author had already been arrested, so
people believed that the
malware was gone,
and no one was running detection for it anymore. But
after they used PuppyGraph to look back at the logs,
they saw a lot of botnet
(44:59):
activity still attacking systems.
So the botnet was still active
even though the
attacker had been arrested.
They were surprised, but they also felt this was very helpful.
We didn't expect it either.
In some cases the usage is more like Splunk, a more complex Splunk.
(45:22):
They can just use PuppyGraph as the log reader
and then see what happened in the past.
Tobias Macey (45:30):
And in your experience of building this product and platform and investing in this zero ETL capability for graph traversals and graph exploration,
what are some of the most interesting
unexpected or challenging lessons that you've learned in the process?
Weimo Liu (45:47):
So one case is that at the beginning, we only supported Iceberg, and later
Delta Lake, Hudi, and Hive. But then some customers wanted us to support databases. We tried to do it in a similar way, except that
instead of a Parquet file reader it was a projection with filters, like select attribute one from table A with some filter. But in reality,
(46:11):
the feedback from customers was that if we read too much from the transactional database, its QPS is affected.
So what we do is add a cache layer and cache all the data from the database. The lucky thing is that
the data in a database is usually not very big. So we just take a snapshot of the database data and then apply CDC, which is very different from our initial design. But I think with design partners and early customers,
(46:41):
their feedback is very, very important, since what they care about
at
this stage is not whether the design is pure; it's how we can fit into their production and how they can leverage the PuppyGraph technology.
So I think it's not just that we design something
and then everybody follows our pattern, but also that after our early
(47:05):
adopters try the product, they provide feedback, and
we try to
follow their requests
and then have a different design for different use cases. We feel this is very valuable. Also, with
the cybersecurity folks:
a year and a half ago, we were total outsiders to cybersecurity.
(47:26):
But then several
leading cybersecurity companies
reached out to us and taught us how the cybersecurity
industry can leverage
PuppyGraph.
They also pointed us to other
companies that might use our product, gave us names, and let us reach out to them. Their insight and terminology were very helpful for us. The engine alone is not that useful, but because we talk with users a lot, we know what their pain points are and how we can address them.
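The cache layer mentioned earlier in this answer, a one-time snapshot of the database kept current by change data capture so that graph reads never hit the transactional system, might look roughly like the sketch below. It is a minimal sketch under assumptions: the `ChangeEvent` shape, the `SnapshotCache` class, and the sample rows are all hypothetical, and a real CDC feed would come from the database's log rather than a Python list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeEvent:
    """One hypothetical CDC record: an insert, update, or delete for a key."""
    op: str
    key: int
    row: Optional[dict] = None


class SnapshotCache:
    """In-memory copy of a table: loaded once, then kept current from CDC."""

    def __init__(self, snapshot_rows):
        # One bulk read of the source table; later reads are served from here.
        self.rows = {row["id"]: dict(row) for row in snapshot_rows}

    def apply(self, event):
        """Apply a single change event to the cached copy."""
        if event.op == "delete":
            self.rows.pop(event.key, None)
        elif event.op in ("insert", "update"):
            self.rows[event.key] = dict(event.row)
        else:
            raise ValueError(f"unknown op: {event.op}")

    def select(self, column, predicate):
        """Projection with a filter, answered without touching the source."""
        return [row[column] for row in self.rows.values() if predicate(row)]


snapshot = [
    {"id": 1, "name": "a", "risk": 3},
    {"id": 2, "name": "b", "risk": 9},
]
cache = SnapshotCache(snapshot)

changes = [
    ChangeEvent("update", 1, {"id": 1, "name": "a", "risk": 8}),
    ChangeEvent("delete", 2),
    ChangeEvent("insert", 3, {"id": 3, "name": "c", "risk": 10}),
]
for event in changes:
    cache.apply(event)

high_risk = cache.select("name", lambda row: row["risk"] > 5)
print(high_risk)
```

The trade-off is freshness for isolation: reads see data only as current as the last applied change event, but the source database's query load stays flat.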
Tobias Macey (48:02):
And what are the cases where PuppyGraph
is the wrong choice and either the problem is just not a good fit for graph data generally or you'd be better served with a different graph engine?
Weimo Liu (48:17):
Yeah. So one
typical case is that, for example, you want to have some
personal AI memory storage.
In this case, PuppyGraph is not a good choice since, really,
personal AI memory is not big. An
embedded
solution
like maybe Kuzu or some others is better. You can just run it on your laptop, and at the same time, you can support transactional updates.
(48:46):
And
a single data stack is good enough, and everything can be embedded in a
Python program, for example. In this case, I think it's a better solution. We have a lot of similar cases.
When we make the judgment, I think the data size is very important. If the data size is small, I think a graph database is much better, because you are writing data into it and reading data from it, and you don't need a data pipeline at all. In this case, especially an embedded graph database like Kuzu is very good. You don't even need to run a service. You just have,
(49:27):
for example, a Python program with Kuzu embedded in it, and then you have all the functionality you need. So this is one typical case.
Tobias Macey (49:36):
And as you continue to build and iterate on PuppyGraph and its capabilities,
what are some of the areas of improvement
or new features or projects or problem areas that you're looking to dig into in the near to medium term?
Weimo Liu (49:51):
One is definitely the enterprise features, because
most of our customers are very big: either our customers are big or our customers' customers are big. So
we are adding enterprise features, such as single sign-on and role-based access, and all of that is in preview already.
Another thing is that
we want to have better support for data warehouses. For a data lake, because it's an open format, we can have access
(50:19):
to all the metadata,
table
stats,
and all the related information, and then do cost-based optimization and so on. But with warehouses, some of them are pretty closed,
so we need to collect all that information ourselves and then do better pushdown and cost-based optimization.
(50:40):
But another piece of good news for us is that currently
the founders of
Parquet and Apache Arrow are working on a
project called Columnar,
and
their project is ADBC.
A lot of the warehouses support ADBC now, so we can read data through Apache Arrow.
(51:01):
You can see that it is
much faster than pure JDBC.
So I think because
the ecosystem
evolves a lot, sometimes
we don't need to implement what we need ourselves; we just wait,
and the
feature we need is coming.
Also,
Columnar is a good partner of ours.
And after
And after
(51:22):
they shared the project with us, we felt, oh, it's amazing. There's a Chinese saying that goes something like: when you're trying to sleep, you get a pillow.
Tobias Macey (51:33):
Are there any other aspects of the work that you're doing on PuppyGraph
or this overall zero copy ETL
graph traversal capability that we didn't discuss yet that you'd like to cover before we close out the show?
Weimo Liu (51:46):
I think this
covered everything. You are a real expert, and you asked a lot of questions
that we didn't even know about before we engaged with our customers and they raised them. You definitely have
long experience in this area. So I think it covered all the questions.
Tobias Macey (52:07):
Thank you. And so for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data and AI management today.
Weimo Liu (52:26):
Yeah. I think we want to have a better agentic framework.
Currently, it's working, and we have the confidence to put it in production with a lot of customers already.
I think the improvement needed is to make it
easier to use. And also,
we want to
make it easier to connect with fine-tuning and reinforcement
(52:50):
learning things and the tools. And then we can have a better framework to embrace the ecosystem.
Tobias Macey (52:57):
All right. Well, thank you very much for taking the time today to join me and share all the work that you're doing on PuppyGraph and the different use cases that it enables and some of the technological and architectural challenges of being able to act as that zero copy representation
on top of customers' underlying data. It's definitely a very interesting project and problem space, and I appreciate all of the work that you're doing to make graphs more available and accessible to a broader variety of use cases. So thank you again for that, and I hope you enjoy the rest of your day. Yeah. Thank you so much for the opportunity. Yeah. Have a good one.
(53:43):
Podcast.__init__
covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com
(54:04):
with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.