Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:11):
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale.
Datafold's AI-powered migration agent changes all that.
Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches.
(00:35):
And they're so confident in their solution, they'll actually guarantee your timeline in writing.
Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafolds today for the details.
Your host is Tobias Macey. And today, I'm talking to Tulika Bhatt about her experiences working on large scale data processing at Netflix and her insights on the future trajectory of the supporting technologies. So, Tulika, can you start by introducing yourself?
(01:01):
Sure. Hey, everyone. I'm Tulika, and I'm currently a senior software engineer at Netflix. I work in the impressions space, basically creating data services and datasets that power the impressions work at Netflix. Before that, I used to work at BlackRock on a lot of mission-critical financial applications.
(01:25):
Towards the end there, I was working on a data science platform. And I also spent some time at Verizon building applications.
And do you remember how you first got started working in data and what it is about the space that keeps you interested?
I would say that I kind of accidentally
(01:46):
ventured into data. It wasn't a very intentional start, but every role sort of brought me closer and closer to data engineering. For example, at BlackRock I worked in a variety of different teams and roles, and towards the end I started working on
(02:06):
data science platform tools: creating customized Jupyter notebooks and cron jobs for data scientists, and building libraries so they could access data easily without having to figure out the behind-the-scenes infrastructure issues. Then, when we looked at the project at BlackRock and decided to sell it externally,
(02:31):
there was this idea of creating a usage-based billing system, which would create events, and you would process and crunch those events to generate bills. I felt that was moving more towards data: modeling those events, creating them, and then crunching them to generate bills.
(02:55):
I enjoyed my work, but I wanted to do something with really large scale systems. I got the opportunity to work at Netflix, and I jumped at it. And, yeah, I'm here now, working in the impressions space.
With the focus on impressions, obviously, that's a very specific category of data. Netflix,
(03:16):
as an organization,
is very large and has a very disparate set of data processing systems, with various requirements around those different systems.
Wondering if you can give some framing about the characteristics of the platforms that you're building and some of the requirements as far as latency, uptime, etcetera, and how that frames the ways that you think about the work to be done.
(03:40):
Sure. So I'll first of all define impressions so that we're on the same page. When you log in to Netflix, you see a home page with a bunch of images on it. We call those images impressions, and they are the gateway to you discovering the product, interacting with it, and eventually that leads to plays.
(04:07):
So as you can see, impressions are really fundamental for discovery; it's a really important dataset at Netflix. We use this piece of data in a variety of forms. We use it for personalization: to see which content,
(04:27):
or which impressions, you are interacting with and which of them lead to plays. That gives us a signal that these are the titles you like. Then we also use it for actually constructing the home page, so it serves a variety of business use cases around how the home page is created.
So as you can hear from these use cases,
(04:48):
it's fairly clear that you need this data both in the batch world as well as in online services. If you're creating home pages, obviously the latency has to be as low as possible; it has to be real time. We also create aggregated datasets on impressions for a variety
(05:11):
of personalization and model training use cases.
Working on large scale systems, there are numerous technologies that have been built specifically for high speed and high volume. I'm thinking in particular about the Kafkas and Flinks of the world.
(05:32):
I'm wondering, with the requirements around the type of data that you're working with, the speed at which it's coming through, and the latencies at which you're trying to deliver actionable insights to the downstream consumers, how much of the technology that you're working with you are able to pull off the shelf from
(05:56):
either open source projects or commercial projects, versus having to think about it from greenfield architecture design principles: this is what I need to do, these are the primitives that I need to be able to build from, and because I have a very bespoke need, I need to build a custom system. And what do the
(06:17):
gradations look like along that axis from pulling something off the shelf to building it from whole cloth?
Yeah. So whenever we are evaluating technology, the first decision is whether you're going to get something from open source or whether you need to
(06:37):
build it yourself. The first step is to evaluate whether there is already a solution available. For most purposes, Spark, Flink, or Kafka are very good at what they do, so we don't go and reinvent those things. We just use them
(07:00):
for their intended purposes. But there are certain use cases where we hit the boundaries, and then we need to think of a customized solution in order to achieve what we need.
For example, there was this one use case where I was crunching something
(07:20):
using Spark and Iceberg, and then I wanted to populate the result into an online data store like Cassandra. There was just no tooling available for us to do that directly. You could write a script that goes through all the rows, batches them, and writes them to Cassandra row by row, but it was
(07:49):
basically impossible to tune and scale. It took a lot of time, it would overwhelm Cassandra, and it wasn't working. So we worked with the platform team, and they built this custom solution of taking the data in Iceberg, converting it to SST files, and directly
(08:12):
bulk loading that. Actually, we didn't end up sending it to Cassandra; we ended up copying it into RocksDB and serving the online use case from there. But that was a case where there was no direct tooling available, so we had to be a little creative about what needed to be done to solve that particular problem.
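As an illustration of the general pattern being described, here is a minimal sketch, not Netflix's actual tooling, of how a Spark job might lay out an Iceberg table for a downstream bulk loader instead of writing row by row. The table name, key column, and staging path are hypothetical.

```python
# A minimal sketch (not Netflix's actual pipeline): read the aggregated Iceberg
# table, range-partition and sort by the online store's primary key, and write
# sorted files that a separate bulk loader could convert into SST files.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-to-kv-bulkload-prep")
    .getOrCreate()
)

# Hypothetical Iceberg table of precomputed aggregates.
df = spark.table("prod.impressions.yearly_aggregates")

# Each output file covers a contiguous key range, the layout SST-style
# bulk ingestion expects, so the online store never sees row-by-row writes.
(
    df.repartitionByRange(512, "profile_id")
      .sortWithinPartitions("profile_id")
      .write
      .mode("overwrite")
      .parquet("s3://example-bucket/bulkload-staging/impressions/")
)
# A downstream loader (not shown) would turn these sorted files into SST files
# and hand them to the online store's bulk-ingest API.
```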
(08:34):
One of the benefits that you get by leaning on existing tools such as Spark, Flink, etcetera, is that they have had a lot of investment from the broader community, particularly around things like deployment,
observability,
being able to pull together different components that are designed for various use cases.
(08:55):
Whereas the benefit that you get from building custom is that you're able to design it specifically to suit your need. You don't have to worry about bringing in a whole bunch of extra functionality that's just dead weight and extra complexity for the problem you're trying to solve.
And,
obviously, when you're at a company like Netflix, you have enough resources to be able to invest in custom development, but you also don't have infinite time. And I'm wondering,
(09:19):
at least within your team, what the general heuristic is as far as build versus buy, and how far you're willing to push one of these off-the-shelf systems before you decide that you actually need to split off from it and build something on your own, or build some additional component that solves the need you have within that framework?
(09:40):
So I think this is a really interesting question. There's always this talk about build versus buy. Like I said, for most use cases we first evaluate how much we can push the original solution that's out there, the open source one. We do have a flavor of Spark internally, but it's not exactly open source Spark: there's a platform team that wraps our requirements
(10:10):
and needs over it, solving authentication and other problems so you don't have to do everything from scratch. Then, I think I'm extremely lucky working at Netflix because it's a place where you can always innovate. There's a
(10:34):
very open innovation culture. So if you feel there is a better solution out there and you believe in it, at Netflix we are very eager to actually experiment and build something on our own if we feel it suits our need.
(10:55):
And this, I feel, is different because in a lot of other companies there's not that much freedom; in a way, you kind of have to stay within company-sanctioned technologies. So I don't know if I answered your question.
(11:16):
No. That's definitely helpful. And another aspect of that problem is that, in order to even make an informed decision about when and whether to break away from the existing frameworks, you have to have enough domain knowledge about the problem space, about the technologies, and about the
(11:36):
core primitives and software principles about how these systems work, which obviously requires a lot of
experience and on the job learning as well as whatever education you have coming into it. And while it's definitely
very
possible to become an expert in Spark or in Flink because of the resources that are available,
(11:58):
it's much more complicated to get that breadth of knowledge, and it's not something that you necessarily have a road map for, to say, okay, these are all the things I need to know and this is how I obtain all of them. It's usually a very ad hoc experience. And from your career path and your experiences of working at BlackRock and now at Netflix, and Verizon prior to BlackRock,
(12:20):
what do you see as the opportunities for that learning that have been most useful and some of the ways that you've had to
stretch your
knowledge and gain new skills in order to be able to tackle these problems as they arise, and just some of the challenges that poses as an engineer trying to have that breadth and depth of understanding?
(12:41):
Yeah, definitely. I feel like working in the software world is just immense; it's an ocean of knowledge. You cannot claim to be an expert in everything because there's just so much out there that you don't know. So, for example, while I was working at BlackRock and experimenting with Kubernetes,
(13:05):
I realized that having a good breadth of foundational knowledge is what matters. For example, if you're evaluating which cloud data technology to use or which database to use, it's having the foundational knowledge of how that database works, what its positives and negatives are,
(13:26):
what your particular use case is, whether it's write heavy or read heavy, what the latency requirement is, how you're going to partition it. All of those fundamentals are really important. And from there, you go about looking at the different options available online and comparing them.
And I do think you can choose to
(13:49):
be an expert in certain tools. For example, you can choose to be an expert in, say, Flink or Spark or a certain database. But I wouldn't expect anybody to have the whole breadth of knowledge as well as expertise in everything; that's, I guess, more or less impossible.
(14:12):
In terms of the team that you're working with, what are some of the ways that you as a group lean on each other to
accommodate the knowledge gaps in understanding that each of you have, and ways that you're able to work together to understand the entire breadth of the problem space and help level each other up, particularly if somebody comes in saying, I got hired on because I'm an expert in Spark, but now we're dealing with Iceberg tables or with Flink stream processing, and I'm way out of my depth here.
(14:43):
We definitely lean on each other for that sort of help. We also have dedicated platform teams who are experts on certain things, and their job is literally to keep an eye on the market and keep evaluating what new technologies
(15:07):
are out there. They also bring up reports where they evaluate something that's out there and say, this is something good and we can think about adopting this technology internally, or they conclude that it might not suit our company anymore. I'm very lucky to have that resource. They do the first line of groundwork for us when we are evaluating whether we need to go after something else,
(15:36):
and we lean on them and get their expertise. If we do not agree with them, or if we feel our use case is very niche, we can also put forward a counterargument: hey, I think this works for us and suits us. As long as you have enough good arguments and data
(15:58):
points backing your use case, you can go and use anything you want here.
And then, particularly along the axes of reliability, observability, and data quality management, when you are building custom components or extensions to some of those off-the-shelf frameworks,
(16:21):
what are some of the principles or core libraries that you've developed to manage visibility into those custom implementations, so that you don't lose context or lose visibility into the overall flow and quality of the information that you're processing?
(16:41):
This is interesting. So I think observability and reliability are, I would say, in a reasonably good place for us, but data quality is something where there isn't a straightforward, solved answer. We have a lot of custom tooling.
(17:02):
But to be honest, I don't think we have something comprehensive that takes care of everything. We have a bunch of discrete tools that are each pretty good at one particular thing, and we have to use a combination of everything to make sure the data quality is fine.
(17:24):
So that's an interesting and unsolved space for us right now. For example, within data quality I feel we have solved schema issues: you can have schema checkers and all of that, and it works fine. We have volume-based auditing and everything, and that also works pretty well for us.
(17:48):
You can gauge when something is not going the way it's supposed to. But one thing that's missing in our tooling is semantic auditing.
So, for example,
if some event happens and you log something,
(18:10):
your log output might be correct according to the schema, and the volume might be consistent with the trends you're seeing, but maybe you logged the wrong thing. And there isn't a really good way to actually catch those kinds of errors.
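To make the distinction concrete, here is a minimal sketch, not Netflix's tooling, of the two kinds of checks described as solved: a schema check and a volume-based audit for a daily partition. The table, partition column, and thresholds are hypothetical.

```python
# A minimal sketch of schema and volume auditing for one date partition.
from pyspark.sql import SparkSession
from pyspark.sql import types as T

spark = SparkSession.builder.appName("impressions-audit").getOrCreate()

EXPECTED_SCHEMA = T.StructType([
    T.StructField("profile_id", T.LongType(), False),
    T.StructField("title_id", T.LongType(), False),
    T.StructField("impression_ts", T.TimestampType(), False),
])

def audit_partition(table: str, ds: str, expected_min_rows: int) -> list[str]:
    """Return a list of human-readable audit failures (empty means pass)."""
    failures = []
    df = spark.table(table).where(f"ds = '{ds}'")  # 'ds' partition is hypothetical

    # 1. Schema audit: every expected column must exist with the expected type.
    actual = {f.name: f.dataType for f in df.schema.fields}
    for field in EXPECTED_SCHEMA.fields:
        if actual.get(field.name) != field.dataType:
            failures.append(f"schema mismatch on column '{field.name}'")

    # 2. Volume audit: row count must stay above a floor derived from history.
    row_count = df.count()
    if row_count < expected_min_rows:
        failures.append(f"volume too low: {row_count} < {expected_min_rows}")

    # Note: neither check catches a well-formed but semantically wrong event,
    # which is exactly the gap described in the conversation.
    return failures
```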
Another element of
data observability
(18:31):
is that it will also typically be at least somewhat correlated with
observability of the operational infrastructure where
the data didn't get delivered because one of the nodes crashed, and we have a split brain situation or we lost quorum, and so we're not able to actually move forward.
And I'm wondering,
(18:52):
what are some of the techniques that you have found helpful to be able to
thread together that operational visibility,
the operational characteristics of the underlying platforms with the actual data observability
and data quality management
to be able to
more quickly understand what is the actual root cause, particularly when it's something operational and not a logical bug?
(19:16):
Operational bugs, at least at Netflix, are easier to detect. For example, we have a robust workflow orchestrator called Maestro, which I think is an open source product too. If something goes wrong, it's integrated with our
(19:42):
alerting system, PagerDuty, so you automatically get alerted that, hey, there's something wrong with this workflow; it threw a Spark error or something. Now, if you're not the direct owner of the workflow, if you're some sort of consumer, we have different kinds of alerts. For example, we have a time-to-complete alert.
(20:05):
We can set alerts like: this workflow normally takes three hours to complete, but if on some day it takes four or five hours, it automatically fires, and we can easily go upstream and check what happened
(20:26):
in the entire processing pipeline: was there some failure or something?
Then we have freshness alerts. We call it the valid-to timestamp, VTTS. Whenever data is written, audited, and processed,
(20:46):
we release a flag saying this data is fresh and correct up to this particular timestamp. So you can have those alerts too: if this data hasn't been freshened up for a while, say it's been five hours and we didn't get any fresh data from that particular source, it automatically triggers another alert, and then
(21:10):
it is routed to wherever you set it up, Slack, PagerDuty, or whatever, and you get automatically notified that there's something wrong there and you can go and investigate.
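Here is a minimal sketch, assuming the valid-to-timestamp idea described above, of what such a freshness check might look like. The metadata lookup and notifier are hypothetical stand-ins, not Netflix's actual system.

```python
# Compare the dataset's valid-to timestamp (VTTS) against the current time and
# alert when it falls too far behind the freshness SLA.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=5)

def check_freshness(dataset: str, vtts: datetime, notify) -> bool:
    """Return True if the dataset is fresh; otherwise fire an alert."""
    lag = datetime.now(timezone.utc) - vtts
    if lag > FRESHNESS_SLA:
        # Route to Slack / PagerDuty / etc. via whatever notifier is configured.
        notify(f"{dataset} is stale: no fresh data for {lag}, SLA is {FRESHNESS_SLA}")
        return False
    return True

# Example usage with a print-based notifier:
check_freshness(
    "impressions_aggregates",
    vtts=datetime.now(timezone.utc) - timedelta(hours=7),
    notify=print,
)
```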
So I would say that, operationally, I think we're in an okay situation right now. We're also trying
(21:32):
a little bit of self-resolving alerts: if we were supposed to receive, say, 10,000 records and we only received 8,000, and there might be a late-arriving data issue, then there are self-healing pipelines where an auto backfill kicks off and automatically backfills the data.
(21:56):
We're slowly incorporating that. It's not a hundred percent there, but that's also one way we sort out operational issues.
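A minimal sketch of that self-healing pattern, under the assumption that you have hooks into your own orchestrator; the count and backfill functions here are hypothetical placeholders, not Netflix's pipelines.

```python
# When the observed record count for a partition falls below expectation,
# assume late-arriving data and trigger a backfill instead of paging someone.
SHORTFALL_TOLERANCE = 0.95  # accept up to a 5% shortfall before reacting

def reconcile_partition(ds: str, expected: int, count_records, trigger_backfill) -> str:
    """Compare actual vs expected volume for a date partition and self-heal."""
    actual = count_records(ds)
    if actual >= expected * SHORTFALL_TOLERANCE:
        return "ok"
    # Likely late-arriving data: kick off an automatic backfill for this partition.
    trigger_backfill(ds)
    return f"backfill triggered: got {actual}, expected ~{expected}"

# Example usage with stubbed hooks:
print(reconcile_partition(
    "2024-05-01",
    expected=10_000,
    count_records=lambda ds: 8_000,
    trigger_backfill=lambda ds: None,
))
```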
And one of the aspects of operating
large, high uptime systems is that they require
(22:17):
automation to be able to actually
sustain them.
Whereas if you're running a smaller scale, you're running lower volumes of data, you can maybe get away with manual processes for error detection, error correction.
And I'm curious what you see as
lessons from working at these high uptime, high scale
(22:40):
environment
that are
translatable to some of those smaller scales, where you can say, oh, it's easy to automate this thing, so it should just be part of the standard operating procedure no matter what your scale, and what are the aspects of
large scale
automation principles
that don't translate as well to medium to small scale systems.
(23:04):
I know that in smaller scale systems you can definitely get away with having a lot of manual processes for alerting, backfilling, or whatever. But I do think it's important to keep these things in mind while you are designing these kinds of systems, because you never know when, in the blink of an eye, your small scale system ends up in a place where you're suddenly getting
(23:32):
many more events than you originally planned for. For example, even though we have a lot of things automated at Netflix, having a good alerting strategy is still painful. We either under-alert or over-
(23:52):
alert, and both of those lead to their own set of problems.
So, at least in my experience, whenever I'm designing a new data pipeline, I always think about: okay, what are we measuring?
(24:14):
Who is my consumer? What's the impact? What should the alerting strategy be? That's a good thing to do even if you're starting with a smaller pipeline. And have a good runbook, even just for practice, for when something goes wrong: this is my runbook, these are all the details in there.
(24:37):
Do I need anything outside of my runbook to solve this particular problem? The answer should be no; your runbook should be complete. That's a good exercise to do. It sounds easy, but I'll tell you it's more often than not not easy. But just getting into that habit right from the very beginning, I think, is a useful thing to do.
(25:02):
And, obviously,
the
topic that has taken everybody's attention for the past couple of years is AI and all of its various applications.
As data engineers, software engineers, and engineers working on technical systems, there are definitely ways that we can use
(25:22):
generative AI to automate and accelerate our work.
But there is also a challenge of being able to feed it the right context, feed it the right understanding
about the problem that you're trying to solve.
And I'm wondering how you're seeing that
factored into the work that you're doing at Netflix, and in particular,
(25:43):
given the
size and complexity of the systems that you're operating,
being able to feed enough of that
architectural
knowledge into the
AI systems to be able to get them to give you useful outputs, without just having to fight with it and go through an untold number of rounds of prompting.
(26:06):
To be honest, I would say that we have had mixed results at Netflix. For example, we do have an internal version of a search tool that is trained on our internal documentation, and because that bot has an overall
(26:28):
view of the documentation, it's definitely doing better. We can also, as fun projects, create our own Slack bots, train them on our support questions or our runbooks, and have them help answer questions. I think those have had mixed results, not super great. Sometimes it feels like there's more effort in training the bot and making sure it answers the right thing than in just going and searching for the answer myself.
(26:57):
But we definitely use gen AI tools for developer productivity. We use them for auto-completion in notebooks, which is, I guess, fancy autocomplete. And we definitely
(27:18):
integrate it with our pull requests to give suggestions on code quality and all of that. I would say six out of ten times it gives pretty okay comments; otherwise it sometimes just doesn't work.
(27:38):
I think one other use case that's pretty good is that we integrate it with our build boards and our workflow log boards. If any error happens, it automatically reads the log and bubbles it up, and most of the time it gets it right. It saves clicks, I would say: instead of you going through Spark or going through Jenkins, pulling the logs, and seeing what happened, it bubbles that up, and that makes things faster.
(28:10):
What else? Yeah, I think that's, I guess, the breadth of use cases we have for LLMs right now.
And
as you have been
building systems,
working at Netflix,
tackling the various data challenges that come with the scale at which you're operating,
(28:31):
What are some of the ways that you
foresee or would like to see some of the off the shelf technologies
or some of the
adopted best practices
evolve
to incorporate some more of this AI automation and
intelligence
into the actual processing layer to allow
(28:52):
human operators to move even faster and not have to worry as much about the low level minutiae of the bits and bytes and work at a higher level of problem solving?
I think maybe, as things get better with LLMs, it could be more helpful in the actual
(29:16):
prompting for code completion, writing code, and things like that. Actually, let me take a step back. When I'm designing any large scale system here, I think the first step is actually designing the data model: coming up with it, the criteria for when the event is going to fire, how that event looks on different client devices, and all of that. For that, I don't know if there's a good way to
(29:45):
get it solved by LLMs. It just needs so much context, and it's going to be very difficult to provide all of that context and expect great answers. It can be a good starting step, but there's just so much work, so I don't envision it being useful in that aspect. But once we have,
(30:11):
say, a data contract or something, we could probably integrate it with these bots and generate code from that. It can also help a little bit with planning on the architecture side. For example, if I need a Flink job, I could feed it: I'm expecting this much volume and each event is going to be this size; can you propose
(30:41):
a starting cluster size I can kick off with to accommodate all of this? If it could take all of that input and create the first initial cluster for me, I think that would be great, if I could just get to that point.
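To make the kind of input being described concrete, here is a rough back-of-the-envelope sketch of a sizing estimate from event rate and event size. The per-task throughput figure is a made-up planning assumption, not a Flink benchmark; the point is only to show what such a helper would need to be fed.

```python
# Estimate a starting parallelism from expected ingress volume.
def estimate_parallelism(events_per_sec: float,
                         avg_event_bytes: float,
                         per_task_mb_per_sec: float = 10.0,
                         headroom: float = 2.0) -> dict:
    ingress_mb_per_sec = events_per_sec * avg_event_bytes / 1_000_000
    parallelism = max(1, round(ingress_mb_per_sec / per_task_mb_per_sec * headroom))
    return {
        "ingress_mb_per_sec": round(ingress_mb_per_sec, 2),
        "suggested_parallelism": parallelism,
    }

# Example: 200k events/sec at ~1 KB each with 2x headroom.
print(estimate_parallelism(events_per_sec=200_000, avg_event_bytes=1_000))
```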
And another aspect of the ecosystem
(31:04):
is that a lot of the tools that have become
the widely adopted standards
are getting to be old in technological
terms where they've been around for a decade
plus. Spark in particular
was built in response
to Hadoop and the challenges that people were facing with Hadoop. There have been various generations of successive technologies that are taking aim at Spark and Flink. I'm wondering
(31:31):
what you see as the
forward looking viability
of Spark and Flink as the primary contenders in the ecosystem
now that there has been enough
understanding of their
characteristics,
the benefits that they provide, and the shortcomings in terms of their technical architectures
and some of the ways that newer systems are designed to address some of those challenges.
(31:56):
I don't see something that's going to be a replacement for Spark or Flink as of now. I do think there are a lot of things that need to get better with Spark and Flink, though. For example, I think we just started autoscaling on Flink. I don't know what version it was released in, but we only recently started with Flink autoscaling internally. This has been one of the bigger problems we have in the data engineering world: we didn't have systems that would automatically scale, unlike in software engineering, where a service is
(32:36):
kind of stateless, and if there's more traffic you can set up something like traffic guards and autoscalers and they scale it automatically. For a Flink job, we always had to scale it manually whenever we saw a lot of events coming in and consumer lag increasing. Even for stateless jobs, we had to do it. So I think they've started with that, but I don't think we have a solution for stateful
(32:59):
jobs as of yet, so that's going to be an interesting problem to solve. As things become more and more real time, you obviously don't want a human in there operating your process sizing; it would be great if these things were taken care of.
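As an illustration of the manual decision being described, here is a minimal sketch, not Flink's actual autoscaler, of a lag-driven parallelism choice. The metric source and the drain-rate figure are hypothetical stand-ins for your own monitoring and capacity numbers.

```python
# Watch consumer lag on the input topic and decide whether the job's
# parallelism should be bumped to drain the backlog within a target window.
def decide_parallelism(current_parallelism: int,
                       lag_records: int,
                       records_per_task_per_sec: float,
                       catchup_target_sec: float = 600.0) -> int:
    """Return the parallelism needed to drain the current lag within the target."""
    drain_rate_needed = lag_records / catchup_target_sec      # records/sec required
    needed = drain_rate_needed / records_per_task_per_sec     # tasks required
    # Never scale below the current setting automatically; scale-down is riskier.
    return max(current_parallelism, int(needed) + 1)

# Example: 6M records of lag, each task drains ~2k records/sec beyond new input.
print(decide_parallelism(current_parallelism=4,
                         lag_records=6_000_000,
                         records_per_task_per_sec=2_000))
```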
Also, for example, with Spark: even though we have optimizations
(33:21):
in Spark, they don't actually work all the time. If there is a hot partition or skew or something, it requires a lot of manual intervention, I would say: the regular parameters stop working, the job just fails, and then you have to retune it and rerun it. I think that is also another big
(33:48):
problem that has not been solved in Spark. So even though these technologies have been here for a long time, there's still a lot of improvement that needs to happen so we can keep up with a future where more and more real time data needs will arise.
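For context on the skew problem mentioned above, here is a minimal sketch of one common manual fix, salting a hot join key so a single key's rows spread across many tasks. This is a generic technique, not Netflix's tooling; the table and column names are made up.

```python
# Salt the skewed side of a join and explode the dimension side so every
# (title_id, salt) combination still finds a match.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-join").getOrCreate()
SALT_BUCKETS = 32

impressions = spark.table("prod.impressions.events")   # large table, skewed on title_id
titles = spark.table("prod.catalog.titles")             # smaller dimension table

salted_impressions = impressions.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int")
)
salted_titles = titles.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

# Join on the composite key, then drop the helper column.
joined = salted_impressions.join(salted_titles, on=["title_id", "salt"]).drop("salt")
```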
And in your experience
(34:08):
of
working in this space of high volume,
high speed data, and high uptime requirements at Netflix, what are some of the most interesting or innovative or unexpected ways that you have seen either your team or adjacent teams
address
the design and implementation
of solutions to those large scale data management problems?
(34:31):
This is an interesting question. The first one that comes to my mind is from the impressions project, so it immediately springs to mind. There was this use case of providing a year's worth of impressions for certain models. Providing a year's worth of a person's impressions in real time is just an impossible ask, because I think one person would see maybe a hundred thousand or even more
(34:59):
impressions in a year; it's just a really big ask. So we were thinking about how we could achieve that. With the raw data it was impossible to serve that in real time at a reasonable latency. So we thought: what if we can aggregate it, what if we can reduce it somehow? That's how we came to using something called EMEs:
(35:23):
taking impressions and converting them into numbers using a formula. So from objects we went to numbers, but still, one year is a lot, and you've got to have some job that processes one year of impressions and converts it into numbers. So we came up with a Spark job that would do the crunching and store it in an Iceberg table. Now, all is good and fine, we have data in Iceberg, but we cannot expose Iceberg
(35:51):
to a gRPC service, so from there we needed something else, an online data store. This is the project I told you about, where we had to come up with a new technique for taking this entire gigantic universe of data and uploading it into an online data store.
(36:16):
So we devised a clever technique where we upload this data weekly, and then we have another real-time service that takes these impressions, does some online crunching for the current week, takes the precomputed week-old data, combines the two, and provides a real-time, one-year view of impressions. I felt that was an interesting use case because, normally, we have used Spark only for serving analytics
(36:53):
purposes, after the fact. But this time we used Spark and Iceberg in the actual operational flow, powering a use case in a real-time system, a gRPC service, with Spark and Iceberg in the equation. So I felt this was an interesting
(37:14):
project, where we used both software engineering and data engineering, combined them, and built one product. There are other examples too; whatever really fulfills our requirement, we're always open to innovating and going beyond the usual approaches to get it done.
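Here is a minimal sketch of the serving pattern just described: a bulk job precomputes aggregates up to the end of last week, and at request time the service combines that snapshot with a small real-time aggregate for the days since. The two lookup functions are hypothetical stand-ins for the online store and the streaming-side state, not Netflix's actual service.

```python
# Combine a precomputed weekly snapshot with recent real-time counts to answer
# "one year of impressions" for a profile with low latency.
def year_of_impressions(profile_id: int,
                        load_weekly_snapshot,   # precomputed, covers ~1 year up to last week
                        load_recent_counts) -> dict:
    snapshot = load_weekly_snapshot(profile_id)  # e.g. {title_id: aggregate_value}
    recent = load_recent_counts(profile_id)      # e.g. {title_id: count since snapshot}

    combined = dict(snapshot)
    for title_id, count in recent.items():
        combined[title_id] = combined.get(title_id, 0) + count
    return combined

# Example usage with in-memory stand-ins:
print(year_of_impressions(
    42,
    load_weekly_snapshot=lambda pid: {101: 37, 202: 12},
    load_recent_counts=lambda pid: {101: 3, 303: 5},
))
```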
(37:34):
And in your work of operating in such a high-demand environment and learning more about the various primitives involved in building these data systems and maintaining reliability, what are some of the most interesting or challenging lessons that you have learned personally?
To be honest, I feel that, at least in this world, technical problems are easier to solve. It's more the organizational
(38:00):
problems that are harder to navigate, and that's been my lesson. For example, when you're dealing with impressions, those impressions are going to be created on client devices, and by client devices I mean the web, TVs, your phones, and so on. That's where they're generated. Now, all of these client devices have their own limitations
(38:25):
on how they can generate that event: how much logging capability that particular client has on the device, how much it can capture, and what capabilities the manufacturer provides. All of these limitations become really important to know when you're designing an event,
(38:50):
because now you have all of these constraints to keep in mind that maybe earlier you didn't need to know; you just needed to know your API or whatever. Now I need to know: what are the constraints of their systems? What is their release cycle? How are they testing their logging artifacts? How are they doing canaries? How are they doing A/B testing? There's just so much context
(39:17):
required to do work in this space. So that has been an interesting observation for me personally.
And as you continue to work in this space,
invest in the reliability and capabilities of the data systems that you're responsible for, what are some of the lessons that you're looking to grow in, some of the resources that you rely on to stay up to date and add to your understanding of the space, and just general advice that you have for folks who want to be able to move into a similar position?
(39:52):
For me, personally, I am really invested in learning what's happening with data quality: can we finally get one solution that fits all and can solve most of my problems? That is what I'm personally invested in. Like I said, my current project is to go to the producers, sit with them, and understand their use cases: what constraints they're operating under in order to produce an event, and how the whole
(40:21):
dev life cycle goes. I think this is a good exercise, and it's something I would encourage others to do too. I don't know if it's just me, but I feel it's easier to understand how data is being used by consumers, whereas the producer aspect of it becomes more of a black box. So just digging in over there and understanding
(40:45):
what's happening in that world can help you have a better strategy for your data quality. You can actually stop bad data from getting in if you are more plugged in with how your producers are doing their testing and their whole release life cycle. So that would be one piece of advice: be more plugged in with the data production process.
(41:09):
And regarding how I keep up, I think I keep up with things the way any other folks in the field do: reading the tech blogs of other companies, newsletters, podcasts, and even conferences, just listening to what's happening. I know there is some work going on in the data contract space; there was an open source project, so I'm curious what that will lead to. And I think LinkedIn has an open source data quality platform.
(41:43):
I'm following that too, and seeing if it can be something that we can actually adopt and use for our use cases.
Are there any other aspects of the work that you're doing, the lessons that you've learned working on large scale data processing systems,
or the predictions
or wishes that you have for the future of data systems that we didn't discuss yet that you'd like to cover before we close out the show? I think, having worked in the small scale world as well as moving to the large scale world, one thing that has become clear to me is that you definitely need good fundamentals.
(42:19):
In the small scale world you can probably get away without thinking that much about the architecture, about reliability, about alerting and so on. If there are inefficiencies, you can just cover them up by manually fixing things. But all of that really
(42:41):
explodes when you go to a large scale system. If you don't think about the design carefully, even minutes of downtime can mean millions of events are gone and can have an impact. So it becomes very imperative, when you're designing any system, to think about all the challenges.
(43:02):
And not only reliability, but also think about it from a cost standpoint: okay, I'm going with this technology, how much is it going to cost? Also really negotiate the amount of data you need to have and to store: do you really need that data? Because all of it, again, is going to cost you to process and store.
(43:29):
All of these decisions just become really important as your scale increases. That would be my advice: don't forget the fundamentals when designing data applications.
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
(44:00):
Yeah, definitely. I think I already talked about some of this. We need a better solution for data quality, or for semantic checks on data quality; that's one. We definitely need more reactive tooling for real-time purposes, so, for example, more autoscalers
(44:20):
for Flink stateful jobs. And then we definitely need more investment in performance optimization for Spark, with less manual tuning and manual feedback. I think that is another unsolved space that we have for now.
(44:42):
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experiences
of working at Netflix and being responsible for high speed, high volume,
and high uptime data systems. It's definitely a very interesting problem space to be working in, and it has a lot of valuable lessons to be learned from it. So thank you again for taking the time to share that with us, and I hope you enjoy the rest of your day. Yeah, thank you so much. It was really nice talking with you. And, yeah, we had a great conversation
(45:11):
and a lot of questions that I will keep thinking about later in my day. Thank you so much for having me.
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__
covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.
(45:38):
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com
with your story.
Just to help other people find the show, please leave a review on Apple Podcasts
and tell your friends and coworkers.