
April 1, 2025 77 mins

Alok Pareek, Co-founder and EVP of Products at Striim, joins What’s New in Data to dive into the game-changing innovations in Striim’s latest release. We explore how real-time data streaming is transforming analytics, operations, and decision-making across industries. Alok breaks down the challenges of building reliable, low-latency data pipelines and shares how Striim’s newest advancements help businesses process and act on data faster than ever. From cloud adoption to AI-driven insights, we discuss what’s next for streaming-first architectures and why the shift to real-time data is more critical than ever.

Learn more about our latest release on Striim's Release Highlight page.

What's New In Data is a data thought leadership series hosted by John Kutay, who leads data and products at Striim. What's New In Data hosts industry practitioners to discuss the latest trends, common patterns for real-world data architectures, and analytics success stories.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
John Kutay (00:05):
Welcome back to What's New in Data.
I'm your host, John Kutay.
In this episode, I'm thrilled to be joined by Alok Pareek, co-founder and EVP of Products at Striim, the first platform to truly unify change data capture and distributed stream processing into a single powerful solution.
Alok is a longtime database industry veteran, starting his

(00:27):
career at Oracle, where he worked on core database internals for disaster recovery.
He then became CTO at GoldenGate, pioneering CDC before it was even mainstream.
Now, at Striim, he's redefining how data moves, with strong consistency, sub-second latency and cloud-native scale.
We get into the future of real-time data, innovations in

(00:49):
streaming architecture, and clever approaches to handling sensitive data and AI workflows.
Let's dive right in.

Alok Pareek (00:58):
Doing terrific.
Thank you for having me, John, it's a pleasure.
I've been meaning to come on one of these famous podcasts that you do, so I'm excited to be here today.

John Kutay (01:07):
Yeah, absolutely, and I'm equally excited.
You know, I've always wanted to catch up with you, and in this forum, which is just incredible, and super interesting, all the collective knowledge you've gained throughout the decades

(01:29):
and you're really one of the leading minds on this topic.
But I want to open it up to you to just tell the listeners about yourself.

Alok Pareek (01:37):
Sure, so I can get started, and of course that shows my age too, that I've been around the database industry a while.
So I started my career in the Oracle Database team, which was a fascinating experience.
So this was right after my grad school. I did some work at Stanford, in fact;

(01:59):
Hector, God bless him, I worked with him as his TA, and so he was sort of a heavyweight in transaction processing, taught me a lot, told me to keep things simple, always told me about some of the hard problems in industry, which I'm hoping to share with you.
Yeah, so I started my career at Oracle, which was a fascinating

(02:21):
journey for 10 years, and it was a really good program.
So they sort of had this amazing program where I got to be part of different teams.
Believe it or not, I've actually got up at five in the morning and listened to Oracle support calls with some really, you know, amazing Oracle people, like

(02:44):
Sherry Yamaguchi and these guys who were responsible for some of the core practical ultra-large database implementations in the world, on the Maximum Availability Architecture.
Then a little bit of consulting, as part of still being sort of

(03:04):
in the database side.
A little bit of consulting, where there were always these firefights between, you know, sometimes support would have an issue, and then development would take some time to kind of navigate that.
And so there was this kind of rapid team called REACT, I don't know if I remember, but I think it was called the RDBMS Escalation

(03:25):
and Activation Team or something like that.
And so there were, you know, these people who actually also were programmers and could understand support issues and respond to customers quickly.
So I got to experience that, and I learned that, you know, at that time, kind of the hardest area that a lot of

(03:46):
people used to face challenges in was, you know, the database would crash, or they would lose a file, or they'd have some sort of a disk corruption.
So that kind of attracted me to recovery and self-healing systems and things like that.
And that's when I sort of, you know, went into the recovery team, spent close to 10 years there.

(04:07):
So this is kind of the core area of, you know, crash recovery and media recovery in databases.
So that's kind of like how I got my career started.
And then I went on to, you know, other things like GoldenGate and eventually on to Striim.

John Kutay (04:21):
Nowadays, Oracle is a household name in data and AI.
You can't go to CNBC.com without seeing an article about Oracle stock or Larry Ellison, and even the new stuff they're doing with AI and collaborating with OpenAI through their latest projects.
But in the generation you started there, it was still very much a competitive and open market, so the work you were

(04:44):
doing really led to their market dominance, which is pretty incredible to hear about, how you worked back then.
And I wanted to get into one aspect that you mentioned to me, which is, you know, now we have this definition of big data, which is terabytes and petabytes of data.
You were working on one of the early big data projects with, let's say, roughly a terabyte of data in a transaction

(05:08):
processing system.
We'd love to hear about that.

Alok Pareek (05:11):
Sure, sure, yeah, that's a fun story.
So, you know, when I joined, I think it was probably early versions of 6, I think late 5, early 6.
And, you know, obviously, the way the database development team was working, we were working on 7 also, because that's usually a few

(05:31):
years ahead, and at that time there was an initiative to try and sort of have, like, a very large, one-terabyte database between HP and EMC and Oracle, and so it was interesting.
So there's a bunch of engineers from HP and Oracle, and we all actually flew to EMC, which was in Massachusetts, and I mean, it

(05:55):
was.
I still remember it very vividly.
It was a huge kind of a warehouse type of a setup, and there were just literally disk drives, like, sprawled all over.
At that time disks were big, and there were cables all around, and so a number of folks that went from Oracle and also from HP, we literally just went in there and had to hook up the servers

(06:19):
and disks and the cables and prepared the Oracle database.
Now, what was interesting about that was, at that time, loading one terabyte was kind of a crazy concept, right?
So we had to figure out things like, you know, how are we going to load these things in a reasonable amount of time, and so that led to a bunch of work, including, you know,

(06:41):
direct path loading, where you could bypass some of the logging concepts, et cetera, efficient indexing around that.
We also had to lay out the data on literally different disks so there's no contention. This stuff is classic now: there's no contention between your actual data blocks and your index blocks, literally just physically at a file level, at a

(07:05):
disk level.
And yeah, and then we published that.
I don't know, this was pre-internet, I think, so I know that that white paper is available somewhere, but I don't know if it's publicly available or not.
But yeah, it was very exciting to kind of declare victory on kind of, like, a one-terabyte database collaboration between

(07:26):
EMC, HP and Oracle.

John Kutay (07:29):
Yeah, that's super cool, and you know, you mentioned some of the contention between the indexing versus the storage level and the read level.
Can you get into some of the technical trade-offs you had to make there?

Alok Pareek (07:42):
Yeah, so, you know, I mean, typically, you know, at that time, and now this technology has also advanced, though it still continues to bottleneck to a certain degree.
But at that time, you know, when we talk about just hard disk drives, I mean, ultimately there's an arm, right, that's got to move, and, you know, so your seek time, sort of,

(08:03):
you know, you have to pay that penalty.
So if, let's say, you're trying to go through the optimizer and you try to look up something, you've got to, like, you know, go through your B-tree or whatever the variant might be, maybe a hash index, so you have to go fetch the root blocks there and the index block and then the data block.
So if they're all on the same drive, then you can't quite,

(08:25):
you know, to a certain degree, there's a queuing that gets involved there.
So to try to actually just keep them on different disks was kind of a classic technique that was used, right?
We would say, okay, let's separate. Even our, like, you know, log files would be on a separate disk than the data files, for example, and then, between the data files,

(08:45):
you know, if you had different applications, trying to actually also keep those separate at a, you know, broader sort of, maybe a container level, like a tablespace, for example.
So those were some of the techniques that we used to kind of really be gentle to the optimizer, so to speak.
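The queuing effect Alok describes can be sketched with a toy model. The numbers and the function below are illustrative assumptions, not measurements from the original system: the point is only that a single disk arm serves random reads one at a time, so putting index and data blocks on separate spindles lets the two request streams drain in parallel.

```python
# Toy model of disk queuing: why separating index blocks and data
# blocks onto different spindles helped. All numbers are assumed.

SEEK_MS = 9.0  # assumed average seek + rotational latency per random read

def lookup_time_ms(concurrent_lookups: int, shared_spindle: bool) -> float:
    """Each lookup needs one index-block read and one data-block read.

    With everything on one disk, all reads queue behind a single arm.
    With index and data on separate disks, the two streams drain in parallel.
    """
    index_reads = concurrent_lookups
    data_reads = concurrent_lookups
    if shared_spindle:
        # one arm serves every read: the queue holds 2N requests
        return (index_reads + data_reads) * SEEK_MS
    # two arms, each serving its own stream of N reads
    return max(index_reads, data_reads) * SEEK_MS

shared = lookup_time_ms(100, shared_spindle=True)
separate = lookup_time_ms(100, shared_spindle=False)
print(f"shared spindle:    {shared:.0f} ms to drain the queue")
print(f"separate spindles: {separate:.0f} ms to drain the queue")
```

Under these assumptions the separated layout halves the drain time; the same reasoning motivated keeping log files on yet another disk, since log writes are sequential and suffer most from a stolen arm.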

John Kutay (09:02):
Yeah, absolutely, and this was very much early days for those types of workloads, so ingesting a terabyte of data at that time was considered on kind of the extreme end of performance requirements.
So have you seen any adoption of those techniques in, like, more recent storage engines?

Alok Pareek (09:24):
Yeah, I mean, nowadays, you know, again, like, you know, so there's several ways. See, ultimately, the problem of just sort of doing fast loading comes down to: can you partition the data set?
Number one, right?
So if you can partition the data set, then it's just a number of, like, how many, you know, threads, so to speak, how many cores you want to throw at that problem, and you

(09:45):
can actually go load that super fast.
Now, what's interesting there is, you know, when you do these massive loads, what happens to also kind of the indexing, right?
So it's kind of inefficient to build the index as you're going along loading the data as well; your tree gets skewed and whatnot, right? And I think that's pretty much common now,

(10:08):
right, where you could actually do things like direct loads and then build an index after the load itself, so that now you can take a pass at all the data, so your keys are sort of more well-organized and you're not sort of doing random seeks all over the place.
So yeah, those are the early techniques that we had, and I

(10:29):
can just specifically tell you, in Oracle, for sure, we had this client utility called SQL*Loader, and in SQL*Loader, and we're still running into some of these issues, by the way, even today, we evolved it, I mean, a number of developers were involved.
But to try and say, hey, I'm going to actually do sort of an

(10:51):
insert directly in place.
So let's say, if you have a table which has a bunch of blocks, the blocks are on some free list.
So you grab a free list and say, okay, I'm going to insert it in there, so that's an in-place insert.
But a faster way was, well, let's just go and see what is the high watermark of this table at a segment level, and then just go ahead and just start shunting all the bytes there.

(11:11):
The advantage there, John, was, then, if you died in between, you could just do one simple undo of the whole thing, right?
But imagine if you actually did an in-place insert.
Imagine that I have 10,000 blocks to write, hypothetically, and if I'm going into different areas on disk, then if I have to undo it, I've got to go to each one of those blocks all over again and undo that, right?
So, all of a sudden, then, there's implications in terms of

(11:34):
not just the forward-going load performance, but also, because if it's a direct path load, you could write it contiguously, but if you have to go in place, then you have to go specifically to locations, and that's inefficient. And then, not to mention, if something goes wrong, to recover from a direct path load would be super fast, right?
So those are some of the techniques which I think are now mainstream, although I would still say I still get surprised

(11:58):
that not all systems have it.
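The two load strategies Alok contrasts can be sketched in a few lines. This is a pure-Python toy, not Oracle code: a sorted list stands in for a B-tree, row-at-a-time inserts maintain the "index" as they go, and the direct-path variant appends everything first and builds the index in one pass at the end.

```python
# Sketch of in-place loading (index maintained per insert) versus a
# direct-path load (append past the high watermark, index afterwards).
# Toy illustration; the sorted list stands in for a B-tree index.
import bisect
import random

rows = [random.randrange(1_000_000) for _ in range(5_000)]

def inplace_load(rows):
    """Maintain a sorted key list on every insert (index kept hot)."""
    index = []
    for key in rows:
        bisect.insort(index, key)  # O(n) shift per insert; random key order skews real trees
    return index

def direct_path_load(rows):
    """Append everything first, then build the index in one sorted pass."""
    segment = list(rows)           # contiguous append, no index maintenance
    return sorted(segment)         # one pass over all keys at the end

# Both strategies end with the same index; the direct-path route just
# avoids per-row index maintenance and random seeks during the load.
assert inplace_load(rows) == direct_path_load(rows)
```

The undo argument from the transcript follows the same shape: the appended segment is one contiguous region, so abandoning a failed direct-path load is a single truncation rather than a visit to every touched block.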

John Kutay (12:01):
Yeah, and you know, that's some of the fun work we do here at Striim: we're constantly evaluating the new popular up-and-coming architectures, you know, whether it's the hyperscaler cloud storage engines, which have their own approach, and, you know, for instance, in Iceberg, you have to pretty much manage the write process yourself.
I mean, it's great for reads and managing,

(12:32):
you know, tables at scale across object storage.
But, you know, it's some of the surprising stuff that we come across, where you sort of have to build this stuff again, right, and take a pass at it as if it's a, you know, nascent area to do development.
But, you know, one of the things that you touched on in your description was the recovery process and the logging process.

(12:54):
So you've always had a lot of experience in that area.
I'd love to hear kind of your perspectives on, you know, your experience with recovery and how it's evolved over the years.

Alok Pareek (13:03):
Yeah, yeah.
So recovery is definitely close to my heart.
I mean, I spent a lot of years in database recovery.
So in Oracle, earlier on, I used to own, there's one component called LogWriter, and at the time I joined Oracle, there were only, I want to say, like, only five of these main background processes.

(13:24):
It's just, you know, this is a C code base, and one of them was LogWriter, along with, like, DatabaseWriter and Checkpointer and, you know, SystemMonitor and ProcessMonitor, et cetera.
So when I took that over, the interesting thing there was, you know, logging was this one thing which made things really

(13:45):
fast, right?
Like, I mean, if you just imagine, if you didn't have logging, then obviously, for every commit, you'd have to go in and really update the in-place table records, right?
Just imagine, if I, like, you know, touch 10 records, then I have to go in and make sure that, you know, as part of my commit, I,

(14:06):
you know, flush all those blocks to disk. And so logging is sort of like this, and some of this, you know, has been around for a while since the ARIES system.
But, you know, you're doing sequential write-ahead logging to kind of just effectively be super fast and efficient in your commit path, right?
So I go touch a bunch of stuff.
It allows me to go and, you know, dirty a bunch of blocks

(14:31):
and then, in one operation in the log, I can go in and commit or undo my work.
It also allows you to, you know, steal dirty buffers before commit, so that alleviates the memory pressure.
So this area was really interesting, and partly, you know, I was actually more interested in recovery because customers would really struggle with this thing, right?

(14:52):
So just to give you an idea, John, imagine that you have, like, I don't know, 200 files that make up your database, right, and they're sitting on, like, you know, 10 disks.
So you have all these files, then you have your logs, and then you have, like, the metadata, which is your catalog and your dictionary, and you'd get into these funky scenarios where someone would be like, oh, my database is not coming up, right,

(15:16):
and, you know, you're getting some errors.
So there is an amazing dance between these files and the log and kind of the metadata, which was kept in what was called the control file, where we would, you know, keep the appropriate metadata to know what the state of this thing is.
Right, so, very early on, I used to just love that,

(15:38):
you know, you could manipulate, like, you know, a few bytes here and there in just, like, one file in its header, and sometimes it would just get corrupted because of, you know, just magnets, right? I mean, disks are ultimately magnetic, and things will go wrong.
So what would happen is, you know, the DBAs would try something. They would, like, try to recover this thing by saying, okay, I think that file is corrupted, so let me go to my

(16:00):
backup that I made two weeks ago, slap on that file, and then, you know, it would still not work sometimes.
So then it almost became this amazing thing where these recovery developers were like brain surgeons trying to figure out what is going on with this entire recovery, and you would find out that, during the backup, the backup was taken perhaps inappropriately, and so it's, like, not a proper backup,

(16:22):
where you have to go through some proper steps and so forth.
And so I learned that, you know, going forward in recovery, you have to throw off enough stuff so that when things really go bad, you're able to recreate a picture of what's going on.
So looking at a state and then saying, how did I get there?

(16:43):
And so I've always, in my head, always maintained that, even during, you know, my career in recovery at Oracle, and then moving on to replication and streaming and then AI and so forth, right, these things are sort of all connected.
The log is sort of an interesting thing that kind of is a unifying framework to me across a

(17:04):
bunch of this work that I've done till now.
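The "the log knows everything" idea that runs through this answer can be shown in miniature: harden every change in an append-only log before mutating the data in place, and after a crash the state can be rebuilt by replaying the log in order. This is a toy illustration, not the ARIES algorithm in full; the record format is invented for the sketch.

```python
# Minimal sketch of log-based crash recovery: change vectors are
# appended to a sequential log first, so state can always be
# reconstructed by replay. Toy code; not a real recovery manager.

def apply(state: dict, record: tuple) -> None:
    """Apply one change vector (op, key, value) to the table state."""
    op, key, value = record
    if op == "put":
        state[key] = value
    elif op == "delete":
        state.pop(key, None)

log = []    # the write-ahead log: an append-only sequence of change vectors
table = {}  # in-memory stand-in for the on-disk data files

for record in [("put", "a", 1), ("put", "b", 2), ("delete", "a", None), ("put", "c", 3)]:
    log.append(record)    # harden the change in the log first...
    apply(table, record)  # ...then mutate the data in place

# Simulate a crash that loses the table but not the log,
# then recover by replaying every change vector in order.
recovered = {}
for record in log:
    apply(recovered, record)

assert recovered == table == {"b": 2, "c": 3}
```

Replaying from an older checkpoint plus the log tail is the same move Alok's "brain surgeons" were making by hand: given a state and the log, you can always answer "how did I get there?"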

John Kutay (17:08):
Yeah, absolutely, and, you know, we can touch on all the reasons why it's technically crucial to have logging in your database for recovery purposes.
You know, I'm just going to jump to the other end of the spectrum: what happens if you just don't run your database with logging?
You know, where are the risks, how will it operate?
You know, what does it actually mean for the business users of the database?

Alok Pareek (17:26):
Yeah, so, I mean, in fact, you know, depending on the definition of the database, there are many databases that, in fact, don't run with logging, right?
I mean, if you take a look at even, especially, analytic systems, right, logging might be a big overhead, just because you're doing massive parallel loads, and there's no point in maybe logging all of that stuff.

(17:46):
So what you do there is, like, you know, you run this thing and then you just take another backup, right?
So, in fact, what the log does is it spreads out your state evolution, right?
Each change vector can take you from one state to another.
And then if you sort of, you know, bypass logging, now you're going from one jump of a state to another

(18:07):
jump of a state.
So you just be methodical about that and be aware that the state to which you can recover is not going to be the most recent state.
You might have work that's lost since the last time you took that backup, which may or may not be okay for your specific workload. And also, like I mentioned,

(18:30):
If it's an operational database, then I think, if you bypass logging, you will find that your actual I/O is going to start interfering, right?
You won't be able to get as much throughput. Like I mentioned earlier, if I go in, and for every single transaction, let's say my transaction has four actions and it's touching

(18:51):
different tables, that means that I need to go now write to four different specific locations on disk, and now, depending again on the disk type and whatnot, I may have to, like, pay that, you know, my time for random access to that specific block on disk, which is where logging obviously is good.
Right, I have a sequential point, I just have to, kind

(19:12):
of, I've already sought to the specific sector, and now, basically, you know, seek time and rotational latency are hardly a factor. So it really does have impacts on the operational side of the database, not just recovery.
Absolutely, absolutely.
Absolutely, absolutely.
Yeah, I mean, I think, just imagine, right, like, if you're doing work and I want to commit, and let's say

(19:33):
I just touched, just hypothetically speaking, I just updated every single record in my table.
And let's say I'm a customer, like a large financial application, let's say PayPal or Square, right, I might have like 100 million users.
That means I need to go in and update 100 million locations on disk, right?
So what I want to do is I want to be able to actually go in and, in memory, dirty or change all these buffers.

(19:54):
But then, at the time of hitting commit, instead of me as a client just waiting around for flushing all those blocks, I just have, like, one write to the log, right, and I'm done.
Now, you know, in the database, the flushing can happen lazily, you know, based on your LRU schemes and whatnot, right?
So that's kind of the optimization there.
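The commit-path optimization just described can be sketched as a toy class: dirty many buffers in memory, make the transaction durable with a single sequential log append, and let a background writer flush pages lazily. The class and page structures below are invented for illustration; they are not how any real engine lays out its buffer pool.

```python
# Toy sketch of the commit path: one sequential log write at commit,
# lazy background flushing of dirty pages afterwards. Illustrative only.

class ToyDatabase:
    def __init__(self):
        self.buffer_pool = {}  # page_id -> value, modified in memory
        self.dirty = set()     # pages changed since last flush
        self.log = []          # append-only write-ahead log
        self.disk_pages = {}   # stand-in for the data files
        self.disk_writes = 0   # count of random page writes

    def update(self, page_id, value):
        self.buffer_pool[page_id] = value
        self.dirty.add(page_id)
        self.log.append(("update", page_id, value))  # sequential append

    def commit(self):
        # ONE log record hardens the whole transaction; no page flushes here.
        self.log.append(("commit",))

    def lazy_flush(self):
        # Background writer pushes dirty pages to disk at its leisure.
        for page_id in self.dirty:
            self.disk_pages[page_id] = self.buffer_pool[page_id]
            self.disk_writes += 1
        self.dirty.clear()

db = ToyDatabase()
for i in range(100):
    db.update(i, i * 10)  # dirty 100 pages in memory
db.commit()               # durable after a single commit record

assert db.disk_writes == 0  # no random page writes on the commit path
db.lazy_flush()
assert db.disk_writes == 100 and db.disk_pages[7] == 70
```

The 100-million-user example scales the same way: the client waits for one sequential log write rather than 100 million random page writes, and the LRU-driven flusher settles the data files in the background.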

John Kutay (20:16):
Yeah, and your work obviously evolved from working on the LogWriter of the database to GoldenGate.
And, you know, just for a bit of context for the listeners, GoldenGate is a, I'll let you describe GoldenGate, but logging in general, I mean, you know, for instance, it's received a lot of attention from the leaders in the industry and those who actually adopt and run these massive big data

(20:39):
processing and distributed systems products.
You know, for instance, Jay Kreps, the co-founder of Confluent, was famously infatuated with your work at GoldenGate.
He wrote about it even before he started Confluent.
But that's just one example.
I mean, there's many examples of how logging has been deployed as this, like you said, unifying concept for kind

(21:02):
of journaling the state of an application.
So I touched on GoldenGate.
Would love to hear about your experience there.

Alok Pareek (21:14):
So, you know, Jim Gray famously used to say, the log knows everything.
In fact, I did meet Jim one time at a VLDB, but that's a separate story.
So let me kind of tell you how I got interested in GoldenGate.
So if you take a look at the use of a log, it moved very quickly from backup recovery to certain replication techniques,

(21:36):
you know, and there's been a lot of research in this area over the years, over the decades. But very famously, you know, Pat Helland and Jim Gray talked about some of the dangers of replication very early on, and they used to talk about, like, you know, one of the classic techniques, obviously, is lazy replication.
And in lazy replication, you can use the log to make it

(22:00):
available to, you know, not just the node that's generating the transaction but also to other nodes, and you can catch them up using the log.
So rather than do eager replication, where you're replaying the same action at multiple sites, you just say, hey, I'm going to harden the commit on one

(22:21):
and then I'm going to asynchronously propagate it lazily.
So that was a classic technique in replication, and, you know, at that time, when I was at Oracle, this was also used for data protection, for things like standby databases, and led eventually to products like Data Guard and whatnot.
But, you know, the evolution at GoldenGate that was sort of very

(22:48):
interesting was the heterogeneous application of logging onto just a completely separate system.
And so the idea there was that, you know, I may be originating my transaction on maybe a specific database vendor, MySQL, Postgres, Oracle, but I want to replay that transaction

(23:10):
logically onto just a completely separate vendor stack, and so that's the problem that GoldenGate very successfully solved, and it became, you know, obviously sort of state-of-the-art.
And so the application of it went to, number one, just availability.

(23:30):
So for ATM networks, you know, there may be banking applications that might be running against an ATM, and so you want to do high availability for the ATM.
So replication based on GoldenGate became, like, a really interesting technique to solve that problem.
And the second one was performance, to scale out a workload. And, you know, with the dawn of the internet era, there

(23:52):
was a lot of work around, hey, how do we scale these systems?
Because the number of subscribers or users that were coming onto the system were orders of magnitude higher, right?
I mean, earlier, for example, you might have maybe hundreds of travel agents doing something.
Now, with you and me logging on and trying to search for fares, you know, that number just became, you know, a million or

(24:15):
higher.
So the choices there for an airline reservation system are, okay, I've got to scale this thing.
So if you scale it vertically, that's a huge cost function.
So, you know, at that time, technologies like GoldenGate sort of solved it rather elegantly by saying, well, let's go ahead and sort of create one master and N slaves in sort of a replication architecture, from a

(24:38):
logical perspective, to sort of separate out our writes from our reads, and so they cleverly used that distribution for scale and performance.
So that was the idea behind GoldenGate.
And obviously then, you know, as we made that successful, there was a number of interesting problems that emerged during GoldenGate, and that's what led us to Striim and trying to address

(25:00):
some of the problems that we were facing back in the 2000 to 2010 timeframe.
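The lazy, log-based replication pattern described above can be sketched in a few lines: the primary hardens each commit in its own log without waiting for anyone, and a replica catches up asynchronously by replaying the log from wherever it left off. Toy code; the class names and keys are invented for illustration.

```python
# Minimal sketch of lazy (asynchronous) log-shipping replication:
# commit on one node, propagate lazily via the log. Illustrative only.

class Primary:
    def __init__(self):
        self.state = {}
        self.log = []  # durable, append-only change log

    def commit(self, key, value):
        self.log.append((key, value))  # harden on the primary only
        self.state[key] = value        # apply locally, no replica round-trip

class LazyReplica:
    def __init__(self, primary: Primary):
        self.primary = primary
        self.state = {}
        self.applied = 0  # position of the last replayed log record

    def catch_up(self):
        # Replay any log records we have not yet seen (async, pull-based).
        for key, value in self.primary.log[self.applied:]:
            self.state[key] = value
        self.applied = len(self.primary.log)

primary = Primary()
replica = LazyReplica(primary)

primary.commit("balance:alice", 100)
primary.commit("balance:bob", 250)
assert replica.state == {}  # replica lags: commits did not wait for it

replica.catch_up()          # lazy propagation closes the gap
assert replica.state == primary.state
```

Pointing the replayer at a different vendor's store, translating each change vector on the way, is essentially the heterogeneous step GoldenGate added on top of this classic pattern.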

John Kutay (25:06):
Yeah, like I said, it's really foundational software that really inspired the big data processing products that have come out since then.
And if you look at GoldenGate now, it's under Oracle, of course, and you rejoined Oracle after GoldenGate was acquired there.
Now, you know, GoldenGate,

(25:27):
if you want to research that, you know, you can go look at the Magic Quadrant for data integration products, and, you know, it's at the top there, and it's widely recognized as sort of, you know, foundational software, like I said, for many of the replication and database processing workloads.
And now, you know, which is kind of a good segue into Striim.

(25:48):
What inspired you to start Striim?
Is it just another GoldenGate, or is it something else?

Alok Pareek (25:55):
Yeah, no, that's a great question, and I get that question quite a bit, John.
Obviously, it's not the same thing, right, I think.
So, if I go back, right, there were two distinct things that we were trying to address that GoldenGate did not address, and that sort of converged with the advent of some of the work coming out from Hadoop and Spark, the Berkeley guys and

(26:17):
AMPLab guys at that time, with this whole approach to big data through, you know, parallel paradigms like MapReduce, et cetera, right?
So one was the focus on not just structured data but also on semi-structured data, or maybe unstructured data.
So addressing sort of the breadth of these sources was

(26:39):
kind of interesting and important, because people were beginning to ask for that.
I'll give you an example.
So GoldenGate was focused more on database-to-database stuff.
But people would bring those problems up, like, hey, we also want to go apply some of these changes onto a non-database. So, yeah, some of our teams went in, and we kind

(27:00):
of did something kludgy to make it work, right?
But there was a pattern there, and I don't think that pattern was solved very elegantly.
Right, it was more of a band-aid add-on, right, and some customers used it. But that was kind of, like, one idea: that you could just generally solve this problem to truly account for the heterogeneity, the structural mismatches, the

(27:23):
syntactical mismatches between different systems, right, databases, NoSQL systems, messaging systems, storage systems, and you can see the gamut of these things, right?
And, you know, it's interesting.
Right, I mean, one of the core drivers behind some of the

(27:44):
things we are doing today, you know, with the advent of the AI or generative AI era, if you will, right, is the fact that even to enable my AI agents, I need to go in and retrieve information from somewhere and then process it and maybe allow someone to take an automated action on it.
So it was important to address the various sources.

(28:07):
Okay, so that was one of the problems.
The second was there were a lot of databases that were becoming very, very large, so the scale part of it was coming in.
And when you have the scale come in, you know, sometimes between the different points of, like, a distributed

(28:27):
system. Let's call that, you know, people can call it, like, you know, maybe a mesh or maybe just a fabric or however you want, but fundamentally it's a distributed system with multiple nodes, and if you push one of these nodes super hard in terms of how much traffic is coming through, you're going to build up latency as the data is going to one of the additional points in your

(28:49):
distributed ecosystem.
So there was an open challenge there, John, which was, you know, there's this lag building up, so what is going on here?
And that actually meant something, ultimately, to the businesses.
So they would say, you know, typically we see that the propagation latency, the message propagation latency or

(29:11):
replication latency, however you want to call it, or lag between two of my systems, is maybe a few hundred milliseconds, but it's like 35 minutes right now.
So that means that there's a spike going on.
And there was this general problem of how do you go in and literally peek into the spike and observe that and analyze

(29:32):
that, to figure out what does this really mean to the business?
And I think that was where these two, so the part about accounting for kind of the newer sources of data, and then the scale part of it, as it started hitting and these latencies started going up, the ability to observe that and try to explain that necessarily, you know, warranted some sort of an

(29:55):
interesting engine.
And that's where, I think, Striim is, you know, sort of a GoldenGate++ in that sense.
Right, it's not just about the data movement piece of it, it's also data movement applied generically.
But then, on top of that, the ability to literally have transparency and observability on the moving data, to try and

(30:16):
make sense out of that, and to try to express declarative queries on that, to try to express, you know, any AI automation on that.
So we've kind of evolved that system now to this point where, you know, you can actually do a lot of smart intelligence as you are dealing with this distributed system, and that's kind of one of the very powerful capabilities here.
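The "peek into the spike" idea can be sketched as a continuous query running over a stream of change events: compute a sliding-window average of replication lag and flag spikes like the 35-minute scenario above. The event fields, window size, and threshold below are invented for illustration; this is not Striim's query language.

```python
# Toy continuous query over a change stream: sliding-window average
# of replication lag, flagging spikes. All fields and thresholds are
# assumptions made for the sketch.
from collections import deque

def lag_monitor(events, window=5, spike_ms=1_000):
    """Continuously yield (avg_lag_ms, is_spike) over the last `window` events."""
    recent = deque(maxlen=window)
    for event in events:
        recent.append(event["commit_to_apply_ms"])  # per-event propagation lag
        avg = sum(recent) / len(recent)
        yield avg, avg > spike_ms

# Normal sub-second traffic, then a spike on the last event.
stream = [{"commit_to_apply_ms": ms} for ms in [200, 250, 180, 220, 2_100_000]]

for avg, spike in lag_monitor(stream):
    if spike:
        print(f"lag spike: average {avg:.0f} ms over window")
```

Because the monitor is a generator over the event stream, it produces a result per event rather than per batch, which is the essential shape of a continuous query: the query stands still and the data moves through it.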

Speaker 3 (30:38):
So GoldenGate was really change data capture to solve the 1990s, early-2000s generation of problems, which was replication for heterogeneous database setups, for high availability: if you had, you know, multiple database nodes that you wanted to keep in sync. Which is a great, great

(30:59):
use case, and still applies in a lot of areas. Now, when we look at Striim: high-speed, low-impact change data capture, which goes against the logs, is one of the core features. But it's also natively tied into a stream processor, and it's a horizontally scalable stream processing engine. So what made you look at?

(31:21):
You know, obviously GoldenGate didn't do stream processing. It was just... actually, I'll ask you for clarification: was GoldenGate an in-memory stream processor? How did it actually replicate changes?

Alok Pareek (31:32):
Yeah, so GoldenGate first of all didn't have any stream engine within sort of its architecture, right. So the idea there was more of a, you know, process- and component-based capture, distribute and apply type of architecture, right? So there was a CDC process that would, you know, take the

(31:55):
changes and push them onto a queue, and there was another component that would read the queue and then, you know, serve as a client to a database and deliver the changes. In Striim, the architecture is very, very different. So CDC is also a part of it, but not only CDC from databases: also getting changes from, for example, NoSQL systems like MongoDB, getting them from the oplog, or from Snowflake from their change stream, or, you know, using our delta identification

(32:19):
techniques on things like BigQuery and Redshift. So we are able to universally identify what's changing and move that, right. So that's the extension part of it.
But to your point now, within the stream engine itself that

(32:44):
Striim has, we have very high-speed in-memory constructs that allow data to go over sockets. You don't have to just do that; you could also leverage Kafka as a persistence layer underneath the covers to, you know, free up the publisher and the subscriber so that they can move at their own speed. So both of those are available within the platform. But I think

(33:07):
the big novelty there was the presence of an actual engine, which is the continuous query processor. Right, and the continuous query processor is, you know, the part, John, I was talking about, where you can observe the data and you can try to analyze the data. So oftentimes, you know, earlier on, this may have been characterized as stream analytics, right, where,

(33:30):
like, you know, when Twitter and Google and these guys came in, or when, you know, let's say, the concept of Meta was introduced, they would count up things super fast. So how do you do that? That means that you've got to constantly go in and take your metrics and aggregate them on the fly. You don't have the luxury of pushing them onto disks first and

(33:51):
then doing it. So that piece of it is where this capability comes into Striim. And this was, by the way, I mean, our ideas were patented in early 2014, and we, you know, got that granted. Flink was invented after that, by the way, right, and at that time, you know, Twitter was using Apache Storm,

(34:13):
which was just ridiculously slow. It was like an order of magnitude less than what Striim could do. So what we've done now is made sure that you could express these declarative queries that can be used on the moving data, and you could apply windowing functions there. You could apply, you know, interesting agents which allow

(34:36):
you to do AI or machine learning on the data that's moving.
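The windowing idea behind a continuous query processor can be sketched in a few lines of plain Python. This is a generic illustration of a time-based sliding-window aggregate over moving events, not Striim's actual query language or API:

```python
from collections import deque

class SlidingWindowAvg:
    """Continuous-query analogue: maintains a time-based sliding
    window over an event stream and emits a fresh aggregate on
    every event, without landing the data on disk first."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def on_event(self, ts, value):
        # Evict events that fell out of the window [ts - window, ts].
        while self.events and self.events[0][0] < ts - self.window:
            _, old_val = self.events.popleft()
            self.total -= old_val
        self.events.append((ts, value))
        self.total += value
        return self.total / len(self.events)   # current aggregate

win = SlidingWindowAvg(window_seconds=60)
print(win.on_event(0, 10.0))    # 10.0
print(win.on_event(30, 20.0))   # 15.0
print(win.on_event(90, 30.0))   # first event evicted → 25.0
```

In Striim itself this would be written declaratively as a continuous query over a window; the sketch only shows the key property, that the aggregate is maintained incrementally, per event, on the fly.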
So, in other words, we're coming into this era of hyper-automation, right, where earlier I needed to do things manually: okay, let me grab data from here, let me push data there. Now what we are doing is we're opening the world up to this class of applications where the data is moving

(34:58):
and the system is actually reacting in an automated way to say: hmm, I think that there's a pattern here, someone needs to take an action on it. I've taken the human out of that. Now, these are very interesting ideas and advanced concepts that we are enabling, but they are going to be super interesting with some of these newer patterns we are seeing

(35:20):
with agentic AI and hyper-automation. That's where you truly do need to observe the data that's moving, on the fly, in real time, because otherwise you're as stale as the last time you trained on some data. And that is kind of game-changing, in my view.

Speaker 3 (35:35):
Absolutely. And you know, in terms of applications of this, ABC's Good Morning America did this great segment on how UPS is using AI to battle porch pirates, people who basically steal packages from your porch. It has huge business impact, because this is a very obvious issue to consumers who don't want their

(35:58):
packages stolen, and they're looking for ways to mitigate the risk there. And UPS has deployed this really amazing next-generation cloud-based AI stack, which does include using Striim to get that operational data into the AI engine. So that's one real-world deployment of Striim for AI, and

(36:20):
it can also be broadly used for analytics use cases. It can even be used for replication. I'll get into how, you know, we use Striim for our own internal workloads. But I wanted to talk to you about, you know: what are the larger use cases for tying Striim with AI?

Alok Pareek (36:38):
Yeah, so great question, right. So I think we've just begun to sort of get into that area. So, if you just take a look at AI in general, if you sort of look at its own evolution, all the way to now generative AI and the applications in kind of natural language processing and,

(36:59):
you know, text summarization and so forth: we introduced AI in Striim from a machine learning perspective very early on. And I think we published some of this work at one of the VLDBs, I think 2019 or 2018, I forget the year, where we talked about, you know, an online machine learning model that we were keeping fresh and

(37:24):
up to date for predicting what, you know, network traffic patterns will be, based on the historical and the real-time stream that we are seeing, right.
So, you know, where I see sort of, you know, Striim and AI converge is, number one: everybody understands, and you've probably heard

(37:45):
this a hundred times, that AI is only as good as the data. Now, it's remarkable that at Striim, we tend to get a lot of this data from highly accurate and high-fidelity systems like databases, where we have bothered to curate the data and push that into a knowledge base. So, to a large degree, despite some minor cleansing issues and

(38:07):
whatnot, it's serving as the operational tier, and it's high-quality data. So as that data is getting updated in real time, making that data available to an AI model in real time, right, that is one area where this is a very nice convergence and, I would say, an interesting use case, because

(38:31):
it still requires you to do change identification, right? If I'm a large retailer, or if I'm a large logistics carrier and I'm trying to track my shipments and so forth: as newer things are happening, how do you make sure that an AI model is made aware that things are actually happening around the model, right? So intelligence is not about being stale and

(38:55):
dumb, right? Intelligence is about actually being super aware and learning on the fly. An intelligent person is not somebody who can just answer your questions. An intelligent person is also somebody who's taking real-time feedback and dynamically going ahead and constantly learning. And so that's where I see these worlds kind of

(39:16):
converge, right. So far, the lion's share of the attention has been on: hey, I train the model and I do inference from that model. Now, slowly, we are beginning to see, through, you know, maybe a retrieval-augmented generation pattern, that, hey, the model can actually be enriched, perhaps with better context, which, you know, typically translates into the form

(39:38):
of vector embeddings and whatnot. So how do you create these vector embeddings so that your fabulous new Gen AI application can take advantage of them? Well, it needs to get real-time data from somewhere to create that vector.
That's where I see the convergence of the Striim capabilities and the AI capabilities. And, in fact,

(39:58):
we've introduced our own agents into the Striim platform, where we are doing some fascinating things which traditionally have been super hard to do, like identifying, you know, any leaks of, you know, personally identifiable information, for example. Or: is there any sensitive data going through my pipeline? And not just databases:

(40:20):
it could be, you know, a communication tool like Slack, right. If I cut and paste something into a Slack channel, you know, who's monitoring whether or not I'm allowed to share this with whoever's on that channel? So you want to scrub these things. And I think, as AI unfolds itself, some of these things, number one, the ability to do real-time ingestion into the

(40:43):
model, the ability to actually make real-time data available to the model for the purposes of RAG, for the purposes of fidelity, for the purposes of trust: I think these are the kinds of things where I see Striim and AI kind of converge.

Speaker 3 (40:55):
Yeah, and especially for these inference-time workloads. You know, I'll just ask you a blunt question, which is: okay, I have all these AI agents. Why don't they just go talk to my database, or go into my Slack and hit the API there, or go fetch the data from the source at the time it's needed, versus replicating the data into the

(41:15):
model?

Alok Pareek (41:16):
Yeah, that's a great question, right. And I think that in maybe some workloads that is absolutely the right thing to do. However, what we really see in enterprises and in our customer base is that they're not leaning on one system, right? Typically, these systems evolve, and that's kind of what I was saying earlier. Typically, the pattern is that I have this interesting network of distributed nodes, and there are things that are happening in

(41:41):
this distributed network at different... you know, it's like a time-space thing, almost, right. So if something happens, the interesting thing is: how do you get notified? In your example, the database of record, how does it get notified that something else externally has changed, but the model needs to be aware of it? So that's where change identification is a problem.

(42:03):
And the second part is the agent: it's operating on some model that was trained as of a certain time. So how do you then, you know, make your inference smarter? How do you give it advantages to say that, hey, if something has changed, and this agent is running maybe, you know, some sort of vector search here,

(42:23):
right, whatever it is, how do you get the most recent representation of that semantic information to the actual vector search, so that you can actually further qualify your specific response? And now you're fine-tuned on the fly.
So that, John, I think, is where I would draw the

(42:44):
line, right. Obviously, we complement some of this fascinating work that others like OpenAI or Gemini and Cohere, all these guys, are doing, right. So their strength comes in where they're saying: hey, look, we have a zoo or a garden of these models. So, depending on which multimodal aspect you're talking

(43:07):
about, image or audio or video, they're focused on model selection. What we're saying is that, hey, as events are moving, as real-time transactions are occurring in the real world, there are changes there that need to interact with the model itself.

(43:27):
And this model, if it's only operating in isolation, is not going to have the most recent change information available, either in its raw form or in its vector form. That's what I think Striim is adding to that mix.

Speaker 3 (43:42):
Absolutely. And so many of the fundamentals and best practices of data management apply in AI as well. It's unpredictable how many AI agents and workloads can be running at the same time, and are you really going to let them go pull your production operational systems? It's the same question as: why do we have OLAP? You know, earlier in the season we had Andy Pavlo from CMU,

(44:06):
from Carnegie Mellon, and I asked him the same question. And, you know, basically, I mean, we wouldn't have engines like Snowflake and Databricks, which looked at the analytics platforms and really centralized data for BI and data science purposes, right? The operational teams are running the database, and your analytics teams are generally running your

(44:27):
data warehouse. Then you'll have AI teams, whether they fall under engineering or analytics. I think that's, you know, every company's own adventure that they need to figure out, but they'll still need their own data store that has fresh, curated, governed data specifically for AI workloads. And just like the analytics team, which might have hundreds to thousands of analysts and data scientists who can be running workloads at any time and don't want to hit the operational database, they want to work on their data lake, AI agents will very practically run into the same issues. You'll have any number of AI agents, and they'll need their own sort of layer of indirection, their own kind of buffer zone

(45:08):
to use as the context window for their data. So, absolutely, replicating and streaming data into the models absolutely makes sense there. And I want to bring this back to some of the work we did around data governance for AI, specifically in the latest

(45:28):
release of Striim, you know, what we call the fifth generation of Striim.

Alok Pareek (45:33):
Yeah, you know, I'm really excited about some of the work that we have done, because it is very novel. To my knowledge, and we're in February 2025 now, right, I don't think anybody else is doing this yet. So the idea was that, hey, when I'm setting up my data

(45:54):
pipelines, could I make sure that I'm using some of these foundational models to identify things that are private and sensitive? And so we've actually applied the model there. So we introduced two new agents into Striim 5.0, which is our fifth generation, as you said. One of them is called the Sherlock agent, and the other is the Sentinel agent, and we thought about it, you know, very thoroughly and from a holistic perspective. So there's a design-time part of this thing where, before I'm setting up my pipelines, I want to just know: am I going to encounter any sensitive data? So we designed one specific agent for that purpose, to say:

(46:36):
go investigate, do a bunch of snooping around on my systems where my events are going to come from, and tell me whether I'm going to encounter any sensitive data. And, you know, we've optimized it; we made sure that we have the right frequency and sampling there, so we don't disrupt the external world. So that's one powerful capability there, before you actually go in and start moving the data.

(46:58):
Now, the second part of it is Sentinel. And Sentinel is like this incredible agent. It, you know, literally sits in the pipeline and, you know, reminds me of some of these sci-fi movies where, you know, you can peek into, like, a bus and look at the people and say: okay, that's John right there. And so Sentinel actually is running,

(47:19):
you know, a lot of the sensitive data detection on the various attributes, and then, you know, just taking a look at that and, for governance reasons, identifying that something might be a Social Security number, or it might be a Singaporean national identity number, or it could be, you know, just an address somewhere in Tunisia. And it's very accurate and it's very powerful,

(47:41):
but we didn't stop there. So once you identify it, the question is, you know, for governance: what do you do next with it? And that's where there's a gamut of policies and actions that we have added into this thing. So, as a policy, I can, in Striim, go in and say: deal with this sensitive information categorically, by either masking

(48:02):
the entire thing, or masking a portion of it, or encrypting it, and/or tagging it. And I think tagging is a very powerful technique.
become Striim, because tagginghas a lot to do with this.
Trusting in AI, right, the factthat I'm able to go in and tag a

(48:26):
piece of data using my model togive you more metadata around
it.
You know, so that you know,right from you know,
understanding maybe, what is thelineage of this data?
Can I trust it?
Did my observers, like allthese agents that are observing
it, what did they think aboutthis at the time that it was

(48:48):
moving from one location toanother or one node to another?
So you can see that you know,as this is evolving, we are
putting in these breadcrumbswhich kind of brings me all the
way down to the top of recoveryright, that there's a pattern
here from a logging perspective,that you know, if I want to
evolve something, I need to keeptrack of stuff in an efficient

(49:11):
way.
And I think the same conceptsthat we learned all the way back
from saying, hey, there arethese changes or there's these
breadcrumbs that I need to tieinto events so that I can,
either for backup reasons,replication reasons,
heterogeneous logicalreplication reasons, Striim
processing reasons and now AIreasons.
Go ahead and leverage some ofthese artifacts so really

(49:33):
excited about these capabilitiesin Striim 5.0.

Speaker 3 (49:37):
Yeah, absolutely. And this has been discussed even with other guests here on What's New in Data. The quality of the data influences the quality of the AI. So having inaccurate data, data that doesn't make sense: this will all lead to negative outcomes with AI, and hallucination, and just things that generally don't work

(49:58):
and make people say: okay, this is another form of AI slop, no one's going to invest in it. Then the other side of it is making sure that it's governed right, making sure you're not sending customers' Social Security numbers and things like that into a workload that, you know, becomes a customer-facing chat experience. And: hey, me, John, I can look up Alok's Social Security number

(50:18):
by asking. I mean, these are the types of data leakage that can realistically happen.
Alok Pareek (50:22):
Right, and it happens all the time, John. Like, we know, we talk to customers all the time, and they're like: oh yeah, such and such happened. And it's always post facto, and I always tell people, you know, it's surprising to me the cost that people pay in a reactive manner, rather than just proactively thinking about this

(50:44):
as a preventable problem. Right, you may have, like, 100 security products out there, but you need to have a horizontal view across these things, to see, at the corners where these things are glued to each other, right: is there a correlation happening there? Is my surface area of attack being tracked in real time? And those are some of the interesting things where I do think that now, with these security agents and so forth on

(51:06):
this moving data, we're going to venture into that area soon as well.

Speaker 3 (51:11):
Yeah, absolutely. And there are tons of use cases for Striim. For instance, we have our serverless Striim developer platform, where anyone can go in and try these features for, you know, retrieval-augmented generation, vector embeddings, replication; there we're basically offering a multi-node Striim cluster, right. And, you know, that's just in

(51:33):
and of itself a use case of Striim: offering a Striim cluster as a service, data streaming as a service. And then, even in the backend, the way we ensure high availability in the case of regional or availability-zone outages: we're actually using Striim to replicate to a backup database and then restarting the cluster with the same metadata.

(51:53):
Then, on top of that, we say: okay, well, what do people actually get out of these Striim pipelines? What is the number of weekly active users? How far do they get into the pipelines? Again, you know, that's another use case of Striim, because we copy that data into our analytical warehouse, where we use it for reporting.

(52:14):
A lot of people have this misconception that streaming is super complicated. And then I actually look at all the streaming guides and implementations: oh well, you've got to set up your open-source CDC platform, you hook that up to your Kafka cluster, you've got to set up your ZooKeeper (now it's KRaft), and you've got to run this Kafka cluster.

(52:35):
You have to think about topics and brokers and, you know, the replication factor and the cost implications of that. And then, finally, once you get into interfacing with analytical systems and AI systems, okay, now you have to think about what each system actually does with an event, what the payload of that event is, and how it is actionable, right? So that in itself is an engineering problem.

(52:58):
But I think, especially with the next generation of work, especially getting the business value, the time to value, as fast as possible, you know, just having a product that abstracts that for you end to end makes it a lot simpler, as a managed service or something you can deploy in your environment. So, you know, that's super exciting, super exciting for

(53:21):
the work you're doing here. But I also want to back up: you know, we were having a really fun conversation about databases. What's a general fun fact about databases that most people don't know?

Alok Pareek (53:33):
Good question. Well, I mean, I can talk about one specific database that I worked on, and I think it's still true, but maybe a lot of people may not know that you could configure, like, maybe an odd block size in a database. So usually you think, like, okay, well, maybe 2K or 4K or 8K or

(53:55):
16K or 32K or 64K. But I think it's true that you could probably have it as any multiple of 512 bytes, so you could probably get away with a three-and-a-half-K block size. I don't know why you would do that, but I think it's a fun fact. So I think that used to be true, at least as of the time that I was dealing with it, but I'm not sure if that's still the case

(54:17):
now or not.
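The fun fact, that any multiple of 512 bytes could be a legal block size, is easy to sanity-check with a toy validator (illustrative only, not any database's real configuration logic; the 64K cap is an assumption taken from the sizes mentioned above):

```python
def valid_block_size(size_bytes, sector=512, max_size=64 * 1024):
    """A block size qualifies if it is a positive multiple of the
    512-byte sector size, so 3.5K = 3584 passes, not just the
    usual powers of two."""
    return 0 < size_bytes <= max_size and size_bytes % sector == 0

for size in (2048, 3584, 4096, 5000):
    print(size, valid_block_size(size))
# 3584 (3.5K) passes; 5000 fails, since it is not a multiple of 512
```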

Speaker 3 (54:18):
Yeah, and, you know, there's a lot of interesting stuff there, especially when you get into the bare-metal aspect of databases: how databases kind of compete with the operating system for resources, and even the way the database assumes that certain resources might be available. For instance, the operating system:

(54:39):
if you take an operating systems class, you'll know that operating systems have their own buffer cache, and then databases, on top of that, have their own buffer cache implementation, which is a bit of a different approach.

Alok Pareek (54:51):
Yeah, so a database can actually manage the buffer cache directly in the database layer and bypass that completely. So you can, even in interfaces, let's say, when I'm

(55:11):
just doing I/O to a disk, right, I can just bypass the operating system. I can indicate to the operating system: I don't want this cached, right? So then it doesn't store it, and now you don't have to pay the penalty of kind of a multi-layer caching scheme there; the database itself will directly just say: I'll manage this thing. So at least in the advanced databases in the world, that should just be de facto, in my view. But it's not; I don't think everybody does that.
Speaker 3 (55:38):
Are there any open source databases you can?

Alok Pareek (55:40):
I'm honestly not sure. I haven't caught up, John; I've been busy with Striim, so I haven't caught up with what maybe some of the open-source guys have been doing lately or not. But, you know, I mean, it's 2025; I think it should be something now. Whether or not everybody leverages it is a separate question, but in code, I think it's doable.

(56:02):
I think in most of the popular ones, like Postgres for sure, I think you can do that.

Yeah, absolutely. And you also had a fun project with running Oracle on the Mac.

Oh yeah, that's probably one of my biggest regrets, by the way.

(56:23):
Well, let me clarify that. So back in the day, we ended up doing this

(56:47):
project which kind of allowed you to take, you know, some objects and plug them into a Windows- or a Mac-based... okay, let's not talk about Mac, just the Windows-based database. Right, prior to that, you had to, John, go through and do this massive thing where you would unload all the data and reload it through the SQL engine.
So the problem there was that it was kind of dependent on the

(57:07):
size of the data. So if you had 100 gigs versus a terabyte, right, all of a sudden it's ten times the penalty that you're paying. But if you're at disk-copy speeds, you could do this much faster, right? Okay, so we wanted to do that. This was when I was still at Oracle; I think it was Oracle 8 or 9, I forget exactly the version right now. So it was called cross-platform transportable tablespaces, and as part of that, we needed to figure out, okay, what are the various platforms that you could take this to.

(57:30):
And Oracle had a lot of ports at that time, right. It used to run on, like, Siemens and HP-UX and Mac and, you know, all kinds of flavors of, like, you know, AT&T SVR5; there were a bunch of operating systems, DEC VMS.

(57:51):
So we had to manage that, make sure that we kind of restricted the number of platforms that we wanted to port the Oracle database on. And as part of that, we did some analysis, and I guess maybe one of the regrets is that Oracle used to have a port on Mac OS,

(58:12):
and we nuked it at that time. Mac OS X didn't make it to that list, because Apple was kind of going down at that time, in the 90s, right before Steve Jobs rejoined. Yeah, so it was never considered. Like, we were like: who would run an Oracle database on a Mac? But now, let me tell you, you know, I love the Mac and I'm

(58:35):
a developer on the Mac, and I wish I could just run the Oracle database on my own Mac. So that's one of the regrets that I have. So that was a fun fact about kind of the porting aspect of Mac OS X. So now there's no Oracle port on macOS.

Speaker 3 (58:50):
Oh wow, your team didn't want to think different
at the time.

Alok Pareek (58:55):
I thought we were thinking differently.
I thought we were like you knowwho's going to run the database
, like it's not a seriousoperating system where you'd
want to run an Oracle databaseon.
But I mean, ultimately I thinkwe've kind of converged towards
kind of the Linux flavors anyway.

Speaker 3 (59:12):
So I think in retrospect, yeah, Dockerized Oracle on a Mac is totally acceptable now. Exactly, yeah. But Oracle famously has over 25 million lines of code, and porting it to every operating system, at the level that Oracle had to support it, must have been a massive undertaking. So of course it made sense that you had to prioritize the platforms that, you know, people were realistically deploying on

(59:34):
at scale.

Alok Pareek (59:34):
Yeah, and then you've got to test these combinations, and there were genuinely some ports that were... let me... well, I don't want to get into anything that's not documented publicly, but let me just say that, you know, some of the platforms kind of, you know, they have, like, this sort of header which is special,

(59:55):
and not everybody had that header. So imagine that, you know, you're the database manager, right, as in, like, code, and someone says: go start up the database. You've got to go examine all these files and the headers to say, you know: do these belong to me? Is this kosher? This is kind of like the dance between the headers, the metadata, that I was talking about earlier, right? Is this safe for me to open?

(01:00:15):
And if you didn't have that header, it was tough to know: is this corrupted, or is it actually, you know, something that is coming in from some other platform that someone has shipped to me? In other words, if you're doing this trick to plug this data from one database into another, I have to ship it, right? So this database manager now, when it opens that file, it

(01:00:39):
needs to recognize: what am I dealing with? And because of this non-standard... you know, we used to call this the block zero or the OS block header... not everybody had it, so that made it complicated. So that's why we had to kind of trim down and make sure that the platform list was kind of manageable and we didn't have to

(01:01:00):
have this, you know, sort of N-by-N problem, right. We kind of restricted it to just a small set of ports; I think we brought it down to maybe 18 or 20. I don't even know if there's 18 ports anymore.

Speaker 3 (01:01:13):
It's probably a smaller number, yeah. You know, database development is a fascinating area. It's becoming more popular than ever. Like I said, we had Andy Pavlo earlier in the season, and Marc Brooker from AWS, and, you know, it's just incredible to see that it has, you know, almost a cult following as an audience. You know, thousands of people follow

(01:01:34):
the trends in database development. And I think ultimately a lot of people ask: well, you know, now I know how a database is built, why don't I just build my own? But then you look at real-world implementations of databases and see: okay, Oracle's 25 million lines of code, and, you know, it's not slowing down. I mean, you know, Larry Ellison was just up on stage with Sam Altman and our president, you know, talking about,

(01:01:55):
you know, the $500 billion that's going to be invested there, right? And then you look at some of the popular open-source databases like MySQL and Postgres, and then even the growing NoSQL object stores like MongoDB and things along those lines. So, you know, there's a lot of work out there.

(01:02:17):
There's a lot of opportunity to stand on the shoulders of giants. So, especially for, you know, data engineers who want to really do impactful work that relates to an operational database, whether they're scaling it or trying to get data out of the database for analytics and AI use cases, it's really important to be aware of these types of fundamentals.

Alok Pareek (01:02:37):
Absolutely, and you're right, John. Like, I think, you know, a lot of people have taken a stab at kind of writing a database from scratch. It can be done, but I think the evolution of it, and, you know, addressing sort of a generic class of workloads, that's not an easy problem. Like, I mean, you mentioned Oracle's lines of code, right. I mean, if you take a look at some of the more recent stuff,

(01:03:00):
right, from, you know, model extensions to vector representations, trying to do, like, you know, open neural network extensibility through PL/SQL functions, you know, there's a lot of work in that, right. So when you go in and start embarking on some of these things, you know, in a narrow domain you could always build,

(01:03:20):
like, maybe something that's super specialized, and I think, you know, we've seen that, right. I mean, a bunch of folks have done that. But to try to sustain it across hybrid workloads and still keep up with sort of the cutting-edge technology, it requires a lot of R&D budget; it requires, like, a proper investment in that area. And I think that's where you'll fall

(01:03:41):
behind very rapidly, because you don't know where to focus at that point. So I do think that it's not an easy problem. But if you want to restrict it to a specialized domain, yeah, that'd be fun, to build another database, I guess. Okay, that's our next project.

Speaker 3 (01:03:55):
Let's do it.
Okay, John?

Alok Pareek (01:03:58):
That was good.

Speaker 3 (01:03:59):
I mean, a lot of people ask us, is Striim a
database? Because we have a SQL engine, you could
technically store data.
Yeah, but we've always been pretty firm in our stance that
it's not a database.

Alok Pareek (01:04:13):
Yeah, I mean, and that's deliberately so, right?
Because, I think, the whole idea here was that, you
know, there's data in motion and there's data at rest, and
combining these two things.
There's been a lot of architectures proposed, but I
don't think anyone's cleanly solved that problem, in my view
at least.

(01:04:33):
We're definitely not a database.
However, you know, there is a storage tier in Striim and, in
fact, by default, we back it up with a distributed Elasticsearch
cluster.

Speaker 3 (01:04:47):
Yeah, my first project here at Striim, by the
way. So, just fun.

Alok Pareek (01:04:50):
Yeah, I remember that, and so, you know,
you could actually, you know, build these fast aggregates,
right, let's say, over moving data.
So let's say that, you know, I have store sales being reported
from, you know, every single, I don't know, Lululemon store or
something.
I don't know if I'm allowed to talk about specific entities,
but that's an example.

(01:05:12):
You could actually see, like, you know, hey, what are
the 15-minute sales per store?
You can partition by the store ID, and you can actually go
ahead and, you know, use the Striim store to push this into
Elasticsearch, and now it gets indexed for you automatically,
and you can actually build an actual application directly
right on Striim.
Now, right, so that's absolutely possible.
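The windowed aggregate Alok describes (15-minute sales per store, keyed by store ID, with the results pushed downstream for indexing) can be sketched in plain Python. This is a minimal, hypothetical illustration of the idea, not Striim's actual TQL or API; the event shape and field names are assumptions.

```python
from collections import defaultdict

WINDOW_SECS = 15 * 60  # 15-minute tumbling window

def window_start(ts: int) -> int:
    """Align a Unix timestamp to the start of its 15-minute window."""
    return ts - (ts % WINDOW_SECS)

def aggregate_sales(events):
    """Sum sales per (store_id, window) -- the kind of rolling aggregate
    that would be pushed to the storage tier / Elasticsearch for indexing."""
    totals = defaultdict(float)
    for e in events:  # e = {"store_id": ..., "ts": ..., "amount": ...}
        key = (e["store_id"], window_start(e["ts"]))
        totals[key] += e["amount"]
    return dict(totals)

events = [
    {"store_id": "S1", "ts": 0,   "amount": 10.0},
    {"store_id": "S1", "ts": 120, "amount": 5.0},
    {"store_id": "S1", "ts": 900, "amount": 7.0},  # falls in the next window
    {"store_id": "S2", "ts": 60,  "amount": 3.0},
]
print(aggregate_sales(events))
```

In a real streaming engine this runs continuously over the moving data rather than over a finished list, but the partition-by-key plus time-window logic is the same.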

(01:05:33):
In fact, you know, because Elasticsearch also has
capabilities to, you know, do things like vector
search, there is a possibility from an AI-workload aspect for
us, but I don't want to speak about futures in this thing
right now.
But that's an active area of R&D for Striim, trying to get us
into that realm where you could just take a brand-new agent, and

(01:05:59):
you could look at it as an automation problem between
different data pipelines, and there are pieces external to
Striim where agents can go in and grab the data and then
process it.
But then, you know, there's another agent that's actually
leveraging it, from an application perspective, directly on the
Striim platform itself, and doing things like, you know, maybe

(01:06:21):
having the RAG pattern within that application directly, and
so that's an area that we want to definitely get into in the
future.

Speaker 3 (01:06:28):
Yeah, absolutely, and, you know, my take on that is,
you know, a lot of companies are doing CDC today, but the
compromises they made in the process suggest to me that it's
not a solved problem for analytics or AI, because you have
these teams that say, yeah, we have, you know, this CDC
thing running, but we don't care about latency.
You know, we just ship reports out and we're okay with

(01:06:51):
24 hours, or, you know, even two hours, and that's fine.
You know, it's not our role to tell everybody that
real time is always required.
It's honestly not even required; we have Striim
customers that run reports on an hourly basis.
And then you look at other compromises made, like, okay, I
mean, how do you know that your reports are accurate, with

(01:07:12):
transactionally consistent data?
That's a dicey question, because most analytics teams don't
even want to touch that area, because once you even raise the
question, it kind of opens up a can of worms that they're not
ready to surface with their own management.
So usually we hear about it the other way around, from a data
executive or an AI executive or a CIO who definitely wants to be

(01:07:33):
able to go to the CEO with confidence. And a CEO,
first thing they will say is, well, is this report real-time?
Can I trust it?
Can we actually make insights from it, right?
So my message, on just your comment about the storage tier
and using AI to collect automatic insights, is that this

(01:07:54):
is the next leap that teams will have to make with change
data capture and replication and Striim in general, to make
sure that they have fast time to value, fast time to business
insights.
That's enabled, you know, by AI, and, like we said, we have
really awesome large-scale customers like UPS, who have
done great work here, and, you know, we're excited to work
(01:08:15):
with, you know, hundreds of other customers that are doing
this, into the thousands, and in the future.
It's really, uh, an exciting time.

Alok Pareek (01:08:26):
Yeah, absolutely, and I think those are great points, right.
I mean, I think, you know, I hear the
term CDC kind of casually being thrown around a lot, and there's
an unusual hardship, right, when you are trying to support
change data capture.

(01:08:49):
record available in my change stream.
If I have a wide record with 318 attributes, for example, and
I go update only one of those attributes, a lot of the
frameworks make the assumption that, sure, in my change stream
I'm going to have all 318 attributes available.
Now, okay, and that's the part where you're saying that

(01:09:11):
people make kind of compromises or assumptions. If you can make
that assumption, okay, you can live with it.
But real-world applications don't behave like that, right?
That's a lot of overhead, it's inefficient, right, and it
involves logging on the system that is generating that change

(01:09:33):
and makes assumptions about that system.
That's not true in the real world.
That's not true in the realworld.
So how does Striim solve that?
Well, Striim supports compressed updates.
So in many of the sources, right, we can actually just take
an update with just the key and the before/after image and
basically a bitmap that tells me what columns have changed, and
we carry that information along with the metadata, and we react

(01:09:54):
to that.
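A compressed update as Alok describes it, just the key, a bitmap of changed columns, and the before/after values, can be sketched as follows. The record shape, column names, and bit ordering here are hypothetical illustrations, not Striim's actual wire format.

```python
COLUMNS = ["id", "name", "email", "status"]  # ordinal positions 0..3

def apply_compressed_update(row: dict, update: dict) -> dict:
    """Merge a compressed change record into the current row image.

    Only columns whose bit is set in the bitmap are touched; the
    other 300-odd attributes of a wide record never travel on the wire.
    """
    merged = dict(row)
    for i, col in enumerate(COLUMNS):
        if update["changed_bitmap"] & (1 << i):  # bit i set => column i changed
            merged[col] = update["after"][col]
    return merged

row = {"id": 7, "name": "Ada", "email": "ada@example.com", "status": "active"}
update = {
    "key": 7,
    "changed_bitmap": 0b1000,          # only column 3 ("status") changed
    "before": {"status": "active"},    # before-image, e.g. for conflict checks
    "after": {"status": "inactive"},
}
print(apply_compressed_update(row, update))
```

The before-image travels alongside the after-image so a consumer can validate or reconcile the change, which is exactly what a full-record assumption throws away.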
And that's where the problem becomes interesting.
Because, to your point, John, now if you take a
look at sort of a data engineer, right, in the
first case, right, where they had the entire record, okay,
as a client I can go push this entire thing into my data lake.
But if I have partial images coming in from a very

(01:10:14):
mission-critical application, I'm lost.
I have no idea what to do with this thing now.
You're not conforming to this.
And even with open formats now that are coming in to represent
logical change records, it's not an easy problem.
You've got to manage the schema, you have to do conversions
within the decision-making pipeline.

(01:10:38):
You want to say that, like we talked about, I want to maybe
mask something, and these are the kinds of things, as we are
moving towards event-driven, especially real-world
event-driven, applications, that systems have to pay attention
to.
So, you know, I always sort of think that people who
think this is just a CDC problem are looking at the world in a

(01:11:00):
very, very narrow domain.
I think that problem was solved probably 35 years ago, in my
view.
Right, you could write a trigger and do change data
capture, and now I have everything.
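The decades-old trigger approach Alok mentions can be sketched against SQLite from Python: an AFTER UPDATE trigger copies each change into an audit table. The table and column names are hypothetical, and, as he points out next, the catch in practice is that operators often won't allow triggers on a production system at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE accounts_audit (
    id INTEGER, old_balance REAL, new_balance REAL
);
-- The "change data capture" part: fire on every update and record
-- the before/after images in the audit table.
CREATE TRIGGER accounts_cdc AFTER UPDATE ON accounts
BEGIN
    INSERT INTO accounts_audit VALUES (OLD.id, OLD.balance, NEW.balance);
END;
""")
conn.execute("INSERT INTO accounts VALUES (1, 100.0)")
conn.execute("UPDATE accounts SET balance = 75.0 WHERE id = 1")
changes = conn.execute("SELECT * FROM accounts_audit").fetchall()
print(changes)  # each row is one captured change
```

Log-based CDC gets the same change records by reading the database's own transaction log instead, which avoids adding trigger overhead to every write on the source system.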
But it begs the question, like, okay, are
the operators of that system going to allow me to
write triggers?
And if they say no, all of a sudden all bets are off.
So I think those are the kinds of things that move sort of this

(01:11:22):
argument from a developer-oriented mindset to an
enterprise-oriented mindset, and I think that the choice is
really, you know, up to you, right?
Do you want to spend your time with the stuff that you really
know about and excel at, or do you want to get into the
nitty-gritty of saying that, hey, I want to get into some of
the core logging concepts, transactional concepts, for

(01:11:44):
consistency reasons, etc., or the accuracy aspects that
you talked about?
Who's going to deal with this?
Who's going to deal with event processing and guarantees?
That's where the fun actually starts, and I don't think that a
lot of teams are very equipped, especially if they're not
database guys, especially if they're application-level guys.
It's a tough problem to crack, right?
Very few get it right, in my view.

Speaker 3 (01:12:07):
Absolutely. And there's so much work to be done right now in
really deploying next-gen analytics and AI use cases.
I think they go hand in hand, and, as we're seeing more teams
adopt AI in real-world use cases, it does seem like a function
of a data team, right? Because it's not quite software
engineering, it's not quite a reporting or business

(01:12:30):
intelligence role, but it is the team that's sort of bringing in
both operational and business systems, marketing systems, CRMs,
ERPs that, you know, the internal teams use, and then the
customer-facing systems, and figuring out how do we really
make production applications out of this.
So really allowing you to innovate there and build IP

(01:12:51):
for your company, uh, that brings the value and shows the
ROI.
You know, like I said, we work with some incredible
teams there.
There's, you know, no debating it.
You know, we talk about UPS because it's public and, you know,
it was so successful.
They went on Good Morning America.
It was a cool segment.
So there's just so much fun work to do.

(01:13:12):
I think it's an awesome time to be in data and it's an awesome
time to be working with these products, and I can't remember
this level of excitement since the first iOS App Store blew up
and we saw the early days of things like Uber and, you know,

(01:13:33):
all the various mobile applications that blew up, and,
you know, Instagram and whatnot.
Now we have this kind of new wave, you know, a generational
platform shift into AI.

Alok Pareek (01:13:46):
Absolutely, absolutely. It feels like it's just
a brand-new world, and, uh, I absolutely share your excitement
and enthusiasm there.
You know, uh, I think each year is going to show something
very, very different, uh, and I'm pretty excited to
see what's there, and we have some great things to
help, uh, you know, really help the world with some of the

(01:14:09):
cool, novel things that we're introducing in the Striim
platform.

Speaker 3 (01:14:14):
Yeah, absolutely, and, you know, we're super excited
about that.
Alok Pareek, co-founder of Striim.
You used to run data integration products at Oracle.
You were CTO of GoldenGate, and before that you worked on
Oracle's logging system.
What's your advice to folks getting into the industry?

Alok Pareek (01:14:35):
Oh, wow.
Well, you know, my sense is that, A, you know, make sure that
you understand the value of abstraction.
You know, whoever's coming in, I think we're going to observe a

(01:14:55):
shift in terms of how we are writing code, how we are putting
together different types of tasks, how reasoning is morphing,
and so, I think, you know, I would say, pay attention to sort
of what is the abstraction around this. And what I mean by
that is, there's a way in which these new, you know, AI-based,

(01:15:22):
you know, workflows, AI-based automation, AI-based agents,
they're going to radically change how systems and humans
and processes evolve.
And having a good understanding of that, and being at ease with
that. So do as much as possible to kind of get comfortable with

(01:15:44):
it and adopt it in your daily life.
I think that's really... and I myself don't have the magical
answer, right, but on a day-to-day basis, that's what I
do, right?
If somebody comes in and says, hey, here's a way in which
AI is going to change this, I'm not afraid of that, right? It's
another powerful way to incorporate that towards what I

(01:16:09):
truly believe is the highest form of intelligence, and that's
human intelligence.

Speaker 3 (01:16:14):
Absolutely, and I wanted to ask you that because,
you know, you're an expert in your area, but you're still, I
mean, I see it every day, you know, you're still adopting the
latest and greatest, you know, I want to call it, personal life
hacks and ways to be more productive and efficient and
really moving things forward.
So thank you for answering that question and thanks for doing

(01:16:35):
this episode of What's New in Data.
It was super fun.
Love talking about databases.
It's always a good time.

Alok Pareek (01:16:41):
Absolutely, John. It was a pleasure, and great questions, and
I love being on.
I've been wanting to come here for a while, so this was
fantastic.
And say hi to Andy, too.
I know Andy well, and hopefully we get to do one of these
things again in the future.

Speaker 3 (01:16:55):
Yeah, absolutely.
We'll have to do it again soon, and thank you to all the
listeners for tuning in.