Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Unknown (00:13):
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode.
With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform,
(00:34):
including simple pricing, node balancers,
40 gigabit networking,
dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode,
that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
(00:56):
Your host as usual is Tobias Macey. And today, I'm interviewing Smit Shah and Sayan Chakraborty about Luminaire, a machine learning based package for anomaly detection on time series data. So, Smit, can you start by introducing yourself?
Hi. My name is Smit Shah, and I've been working as a senior data engineer at Zillow for almost 4 years.
And I'm mostly involved in building data related products at Zillow.
(01:20):
And, Sayan, how about you?
Hi. My name is Sayan Chakraborty.
I work as a senior applied scientist on the Zillow AI team.
I mostly work on, like, anomaly detection and help build the anomaly detection and, in turn, the machine learning methods
inside Luminaire.
And going back to you, Smit, do you remember how you first got introduced to Python?
(01:41):
Actually, when I joined Zillow in 2016,
that was the very first time
I got into Python as most of the data teams were using Python. So
I started, like, learning more about Python and using it. And it's a really
easy and convenient language to use, and I really love it.
Just before Python, I was more into Java and Objective C programming.
(02:05):
And, Sayan, do you remember how you first got introduced to Python?
Yeah. So I'm from a stats background. So I used to code in R. And I remember, like, during my 3rd year of my
PhD program, I decided to learn a new language, and Python was picking up, and I thought it would be a good candidate.
I'm full on Python now. I think I
(02:26):
haven't written any code in R in the last 2 years.
Yeah. That's definitely 1 of the interesting sort of ongoing holy wars in tech, Python versus R, when it comes to stats.
I'll leave that at the sidelines for now, but I think it's funny that you started in R and have since come over to Python.
And so I'm wondering if you can just start by giving a bit of an overview about what the Luminaire project is and some of the origin story of how it got started and why you decided that you needed to build out this library from scratch?
(02:54):
So Luminaire is a tool for detecting anomalies,
specifically for time series data,
and we do it for both batch and streaming use cases.
The place where Luminaire
stands out is, like, automating the whole modeling process and somehow,
like, democratizing anomaly detection across the community, because, like, anomaly detection is a hard problem, and not everyone comes with the domain expertise or, like, the ML expertise needed for doing anomaly detection. That's where Luminaire comes in to automate the process.
(03:26):
And what's the story behind the name?
So the dictionary definition of Luminaire is a complete electric unit of light.
And 1 of Zillow's core values is turn on the lights. So as Sayan mentioned, we wanted to, like, democratize
anomaly detection,
and we also wanted to bring visibility to the teams
about various anomalies or data health issues
(03:50):
in their data.
And that's how we came up with the word Luminaire.
And so there are a number of other projects that are intended for working with time series data even just within the Python ecosystem,
1 of the most notable ones recently being Prophet. But I'm wondering if you can just give a bit of an overview about
where Luminaire fits within that ecosystem of time series frameworks and libraries, and what are the use cases that it's well suited for versus when you might want to use something else?
(04:19):
Luminaire is built for anomaly detection, so it's focused on
classifying anomalies.
So when you are talking about, like, other forecasting packages, those kind of depend on how much signal that you're getting from the data. But
anomalies are anomalies. So they can show up either you have signal in the data or you don't have any good signal in the data.
(04:42):
So that's where Luminaire comes in. We have built the library in such a way that it's focused on detecting anomalies for time series data and brings automation
for when the user wants automation.
And whoever wants, like, to play with the model, we also
open up the configuration
for, like, doing all the changes during the modeling. So it's kind of also
(05:07):
works well in terms of, like,
building an explainable model for
anomaly detection. And specifically, like, for forecasting,
we actually have seen, like, whenever you have good signal, Luminaire can be used well as a forecasting tool, and it performs pretty well whenever you have good signal in the time series data. We actually recently submitted a paper to the IEEE
(05:31):
Big Data conference this year, which got accepted. And we
have shown, like, a few
benchmarks where we have shown, like, Luminaire
is outperforming many other
competing
forecasting or anomaly detection methods in different scenarios.
That ranges from, like, Luminol, ADTK,
(05:51):
even Auto Remind, also Prophet as well.
Because of the fact that there are so many other libraries out there for time series,
what was the motivation for actually starting an entirely new project versus just adding new capabilities onto
an existing library or just wrapping one of them?
(06:16):
worlds. Like, either we wanted to have
more control over the
model building process, or we wanted to build an anomaly detection tool
for those users who want more automation.
And Luminaire works well in both worlds, and we have seen, like, in many others,
(06:37):
so when we were building this anomaly detection tool,
we looked for
several existing solutions, and we found, like, there is no such tool that is
solving this problem in a robust and reliable way because, as I mentioned before, most of the tools which are more sophisticated for dealing with time series data are focused on forecasting, not on anomaly detection. And those tools which are focused on anomaly detection have very basic
(07:05)
modeling capabilities. So that is where Luminaire comes in, which is kind of
combining these 2 capabilities
into, like, a more powerful, more sophisticated anomaly detection platform.
Yeah. Let me explain, like, how this whole Luminaire project actually got started from inception.
So at Zillow,
(07:25):
since we are a data company, the company identified
there was a need for
having a centralized data quality team. And that's where our core team got formed. And what we observed also was there were no standard or formal processes
that everyone was following in general to
detect general data health issues for their metrics.
(07:47):
And that's where we started
creating
standards
and processes to help these teams out. So that's where the very first utility function in Python
was created internally,
which was helping teams to
generate data health metrics like volume, availability, completeness, or even comparison.
(08:08):
And later on, teams were interested in how they could get alerts on top of these metrics. So we started building deterministic
anomaly
alerts on top of this. But later on, we also found that we had a lot of time series use cases within
a lot of our teams. And that's where the need for doing a time series anomaly detection came in.
(08:30):
And that was the first place where we built this, where we added this utility function
within our core package to detect time series anomalies. And we started with the off the shelf ARIMA model.
But from that point onwards, we then
saw, like, okay, these are, like, 2 different use cases. And that's where we split the project
into, like, Luminaire deterministic
(08:53)
checks and this Luminaire, which we have open sourced right now, which is the core time series anomaly detection.
And from there and onwards, we started adding
more models, more sophisticated models to support just time series anomaly detection
and, overall, trying to democratize
self serving,
(09:13):
anomaly detection at scale.
And over this period, like, our core team, which I would like to also mention, was not just the 2 of us. It was also, like, Anna Swigert, she is our manager, and then Rui Yang,
Kumar Sultani, and Kyle Buckingham
who were there from the initial days of building these packages.
(09:36):
So I'd like to also thank them. And in terms of the ways that it's being used now, you mentioned the data quality aspects of maintaining your data pipeline. And, Sayan, you also mentioned actually using it for some of the forecasting capabilities. I'm wondering if you can just
discuss some of the different ways that it has grown to be used throughout Zillow within the data team, but also maybe some use cases outside of just the specific data pipeline and data analysis life cycle?
(10:03):
Yeah. Sure. So within Zillow, like, we currently don't use it as a forecasting tool, but use it as an anomaly detection platform.
So
in general, like, we have lots of, like, internal and external services, and we process
an enormous amount of data every day. So there are data producers who generate, like, batch or streaming data that gets consumed by the different downstream services. And the producers
(10:31)
want to make sure the data they are generating is good quality data, as well as the downstream teams who are consuming this data
and creating maybe business metrics or
generating features for their ML systems. Like, for example, like Zestimates, Zillow Offers, the
recommender system we have at Zillow. All of their teams, they want to make sure, like, the data they are ingesting is good enough. So that's where Luminaire comes in. So Luminaire intervenes at different parts of this pipeline and makes sure
(11:01):
the data flow
and the data that is going from 1 place to another is good quality and does not have anomalies.
So we do different checks, like checks for volumes, like availability,
nulls, completeness,
and so on and so forth, to make sure the data within Zillow that's being ingested into the services is healthy enough.
(11:24):
Just wanted to add, like, that's where we were saying we were trying to bring these standards or processes
at Zillow and trying to guide them,
like, what data health means, what quality means.
So that's where we started
to encourage teams
to focus not just on building the data pipelines. It's not their end goal.
(11:47):
But it is also making sure
within your pipeline process,
you are
outputting,
healthy data.
And that's where, as Sayan mentioned, like, we act as an intermediate intervention process where teams leverage
this tool.
Particularly for things like data quality, or if you're in an operational use case and you're doing anomaly detection on maybe system metrics,
(12:14):
it can be very easy to accidentally get to the point where you're
generating too much noise because there's a, you know, certain
variance in the signal. And so it can be hard to determine if something is meaningful or not. And I'm wondering if you can maybe dig into some of the complexities that are inherent to anomaly detection that are not obvious at first glance and that are difficult to overcome or that are important for avoiding the case of creating too much noise that people will start to ignore the types of anomalies that are being detected?
(12:46):
Anomaly detection is a challenging problem indeed. Specifically, like, if you are building an anomaly
detection model for a given problem, like a given dataset,
you can keep on optimizing that because you can ingest more data, more information about the data, and you can keep on optimizing your model so that it works best for that dataset.
But from a tool or from a service perspective, when you're making, like, an anomaly detection service,
(13:12):
so that is a challenging problem because
that is something anomaly
detection is, like, an unsupervised problem.
So
anomaly detection is, like, an unsupervised problem,
so it does not really come with labeled data. So that is, like, 1 tricky problem to understand, like, the performance of the anomaly detector, whether how good it is performing versus, like, how bad it is performing.
(13:41):
And, also, like, since Luminaire is a time series anomaly detection tool, it struggles with the problem of handling nonstationarity, which is, like, a never ending problem for time series data. And, also,
for batch and streaming time series anomaly detection, those are 2 very different problems. We have observed, like, from our past experience, like,
when you start aggregating the data or, like, you start seeing the time series data over different
(14:07):
frequencies,
the behavior of the data changes a lot. So these are
the different issues that we have to keep in mind when you're building
time series anomaly detection or anomaly detection
in general as a service.
And from the actionability
point of view, that is also, like, a very important
(14:28):
aspect because, like,
anomaly detection comes with an error rate because it's a probabilistic solution.
So you have to understand, like, if your model misses
or fails to send an alert, or there is some issue in the model or in the pipeline
and that alert does not reach the end user, what is the cost of that? So that is a very important problem
(14:51):
to handle. And, also, like, the time sensitivity, because in many cases, mostly in the streaming use cases,
detecting anomalies in time is a very important
problem. And so in terms of the actual design of Luminaire, how are you addressing some of those problems of being able to solve for the general case while also being able to provide some
(15:13):
escape hatches or tuning capabilities for being able to identify some of these special cases or
make it fit a particular use case and just managing the flexibility and the breadth of the overall problem space?
Yeah. So Luminaire is an anomaly detection tool, which is supposed to work for a wide range of problems. So we take different measures.
(15:36)
And since, internally, it uses machine learning, we take standard
techniques followed in the machine learning literature that kind of process the data, model the data, and
then use it for training. So from the beginning, like, we start with data cleaning. We check for nonstationarity and do all the adjustments that you need to do for
(15:58):
modeling or dealing with the time series data.
And then we get signals from the data itself. And as I mentioned before,
since we have less control over the externalities,
we use the history of the data in order to incorporate all the information we need in order to build the
model for anomaly detection. We check, for example, temporal correlations,
(16:19):
like periodic patterns,
or sometimes we have seen in use cases where data has, like, local nonlinearity.
Those are the things that are incorporated during the model building process.
And finally, like, all the steps
require some actions
and some decisions that the users need to make.
(16:41):
That is where Luminaire stands out from the other systems, and Luminaire
has a built in
configuration optimization
feature as well where the user
just comes in and says, okay. Like, I would like to monitor my time series. And for a given problem, we optimize the configuration
(17:02):
based on the dataset,
and that actually
brings a lot of automation during the process.
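To make that concrete, here is a minimal sketch of that configuration optimization step. The class, argument, and column names follow the pattern shown in the Luminaire documentation, but treat them as assumptions to verify against the version you install:

```python
import pandas as pd
from luminaire.optimization.hyperparameter_optimization import HyperparameterOptimization

# Assumed input shape: a timestamp-indexed DataFrame with one 'raw' value column.
data = pd.read_csv("daily_metric.csv", index_col="index", parse_dates=True)

# Search for the configuration (model choice, preprocessing options, and so on)
# that fits this particular series best, instead of hand-tuning it.
hopt = HyperparameterOptimization(freq="D")
hyper_params = hopt.run(data=data)

print(hyper_params)  # reuse this suggested configuration for training later
```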
This is, like, 1 side of the thing. And another side is, like, how to deal with
streaming solutions. Because in streaming anomaly detection, or specifically for streaming data as I mentioned, like, it behaves differently, so we have a different solution. Like, instead of, like, doing a predictive or uncertainty based modeling, we do some sort of, like, baseline matching or density matching,
(17:32)
where we do, like, data checks over sequential windows in order to see, like, whether
there is an anomaly or not.
So, yeah, these are the typical steps or measures we take to build Luminaire as a successful anomaly detection tool.
And on the subject of the windowing,
that can be another challenging optimization problem, determining what are appropriate sizes for those windows for ensuring that you're actually,
(17:59):
you know, determining what is the proper bucket for being able to determine whether something is anomalous and, you know, how much information do I need for being able to compute that. And then also cases such as seasonality
where
something might be anomalous
within the previous time window on the order of days or weeks, but on an annual basis, it's actually entirely normal. And I'm curious how you handle some of those types of problems.
(18:23):
For the windowing
aspect, this is a very important question because
sometimes, like,
anomaly detection,
specifically, like, this kind of problem is, like, context based. So sometimes, let's say, in the middle of the night, you have less traffic or some data showing very low
volume, high variability.
So we make sure we
(18:44)
consider the seasonalities
of this pattern. Like, if it is a repeating pattern, then we incorporate that into the model.
And, also,
in terms of determining the right size of the window,
we
take measures of setting, like, either the user can pick the size of the window, if the user knows what would be the optimal size for their problem,
(19:07):
or we are actually working
on, like, bringing the automation of
measuring the same problem over different window sizes. That means you're basically seeing the problem over different contexts.
We have seen in many use cases that that
seems to work pretty well.
It's not really implemented inside Luminaire right now, like, changing the window size for a given solution, but we are planning to build it in the future.
(19:35):
And, also, like, that is definitely a tough problem. And
right now, what we have seen is, like, if your data, in terms of streaming, if
you're receiving data, let's say, every second or every minute,
right now, we kind of expect
the
users,
(19:56)
as of what we provide right now, to give us that kind of information.
And, yeah, that definitely is challenging. But at least for the batch style of processing,
that becomes much easier. Like, if the frequency of your data is, like, every day or every week or every month or even, like, every hour, that is where it becomes a little bit easier to automate that process.
(20:21):
And in terms of the actual design and implementation
of Luminaire, can you dig into some of the internals of the project and some of the ways that the design and goals of the project have evolved since you first began working on it? Yeah. So it initially started from, like, just implementing a basic type of model.
And we just had the training data, doing some basic cleaning, and we were processing that, doing anomaly detection. And we've seen, like, there are many caveats in dealing with,
(20:52)
like, anomaly detection in time series data.
Specifically, you have to deal with if you have missing data or you have some change points in the data, which is a very
serious problem in time series modeling.
So we take different measures of detecting those and making the data ready before it goes to the model building process.
Luminaire has 3 main components,
(21:15):
which can be used independently or sequentially
in order to
perform a complete end to end anomaly detection. So that starts with, like, the data preparation and profiling. So in the preparation and profiling part, you can
prepare the data for being ready for modeling.
And you can also do some
(21:36):
exploration where you can see what the historical patterns are and what has changed in the data.
So processing in the sense, like, that Luminaire detects change points and also, like, Luminaire detects trend changes, which are very useful and, like, sometimes interesting to see in many use cases.
Data imputation, if there is missing data or, like, doing any other adjustment, if there is any change point observed.
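As a rough illustration of that preparation and profiling component, here is a sketch along the lines of the project's documented interface; the exact class names, arguments, and return values are assumptions and may differ between releases:

```python
import pandas as pd
from luminaire.exploration.data_exploration import DataExploration

# Same assumed input shape as before: timestamp index plus a single 'raw' column.
data = pd.read_csv("daily_metric.csv", index_col="index", parse_dates=True)

# Profiling cleans the series (imputation, change point adjustments, and so on)
# and returns both the prepared data and a summary of what it found.
de = DataExploration(freq="D", fill_rate=0.9)
training_data, profile = de.profile(data)

print(profile)  # e.g. whether change points or trend changes were detected
```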
(21:59):
And in terms of,
the modeling,
internally in Luminaire, like, we have different types of models.
Some models focus on the forecasting capabilities, compared to some models
focusing on, like, the variational and uncertainty patterns where the data has very little signal.
And on top of all of this, we have an optimizer
(22:22):
that can optimize
the choices
given a problem. So if you have a dataset, the
Luminaire optimizer
can run different scenarios and check which
specific configuration
fits best for a given problem,
and it can suggest
the best optimized
(22:43):
configuration to you. And on top of that, like, what we do and also, like, whoever uses Luminaire,
they can run a scheduling engine,
like, for training, like, in terms of scheduling the training and the scoring process, because specifically for time series modeling,
you need a very periodic structure of training and scoring, specifically because
(23:05):
you don't want to use a very old model
in order to
score
newer time points. So you have to make sure you are always generating new
and most recent
models
at a specific frequency.
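A minimal sketch of that modeling and retraining flow, again modeled on the batch API shown in the Luminaire README; the structural model class, the return values, and the scoring output used here are assumptions to double-check against the docs:

```python
import pandas as pd
from luminaire.exploration.data_exploration import DataExploration
from luminaire.model.lad_structural import LADStructuralModel

def train_fresh_model(history: pd.DataFrame, hyper_params: dict, freq: str = "D"):
    """Profile the history and train a new model from it. Meant to run on a
    schedule (for example from Airflow) so scoring never relies on a stale model."""
    de = DataExploration(freq=freq)
    training_data, profile = de.profile(history)
    model = LADStructuralModel(hyper_params=hyper_params, freq=freq)
    success, model_timestamp, trained_model = model.train(data=training_data, **profile)
    return trained_model

# hyper_params would come from the optimizer sketched earlier (or be hand-written).
# trained_model = train_fresh_model(history, hyper_params)
# result = trained_model.score(2000.0, pd.Timestamp("2020-06-09"))
# result is a dict that includes an anomaly probability for the scored point.
```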
And for streaming use cases,
it is like a trade off between
efficiency versus speed because in streaming use cases,
(23:28):
you want to process the data fast and you want to
send the result to the user in a timely fashion.
So
as I mentioned before, we do, like, a baseline comparison,
like a volume comparison or distribution comparison approach
where we
compare different time windows and we take a baseline time window and we do the processing. And,
(23:50):
similarly, we have a training scoring schedule for that. And the scoring process is very lightweight,
where the most recent model can be used to pull
the relevant baseline and can be used to score that specific window to see whether any problem is there or not.
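To illustrate the idea behind that baseline or distribution comparison, here is a generic sketch that is not Luminaire's actual window density implementation: it simply compares sequential windows of a stream and flags when their distributions diverge.

```python
import numpy as np
from scipy import stats

def window_divergence_score(baseline: np.ndarray, current: np.ndarray) -> float:
    """Compare the current window against a baseline window; small p-values
    suggest the two windows come from different distributions."""
    _, p_value = stats.ks_2samp(baseline, current)
    return p_value

# Example: score each new 60-point window against the previous one.
rng = np.random.default_rng(0)
series = rng.normal(100, 5, size=300)
series[240:] += 40  # inject a level shift into the last window

windows = series.reshape(-1, 60)
for previous, current in zip(windows, windows[1:]):
    print(round(window_divergence_score(previous, current), 4))
```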
And I'm wondering what the motivation was for releasing the project as open source and any extra effort that was necessary
(24:14):
to maybe remove any
assumptions about the way that Luminaire was being deployed or used based on how it was being employed at Zillow
and how to make it more general purpose and accessible for people outside of the company?
So we actually looked at several solutions out there, like, first, when we were trying to solve this problem.
(24:36)
And there was no tool we found that solves the problem the way we want.
Because as you mentioned before, actually, the time series model comes with several
challenges dealing with the seasonalities
or dealing with streaming versus batch data or, like, solving the problem of.
(24:57):
So that encouraged us to build our own solution.
And we wanted to open source Luminaire because, like, whatever we have built,
we wanted to contribute back to the community because, like, since
we have already invented the wheel, we didn't want the wheel to be reinvented,
because we are solving a very common problem, because anomaly detection is not a problem only within Zillow. But even outside,
(25:22):
many people are trying to solve the same problem.
And, basically, like, this is an industry standard as well. Whenever you open source something,
instead of different people working on the same problem independently, people can
build something on top of each other, on top of your solution,
so that we get incremental
(25:43)
improvement and, overall, the whole industry benefits from it.
And, finally, like, open sourcing helps incorporate a lot of brains. And if we want to have, like, a high quality solution, this is, like, a good way to go.
Also, 1 of the problems that we found in the initial time was,
when we started providing these utility functions to the teams,
(26:05)
initially, they had to select what models to run from the suite that we were providing.
And within that same model, they also had to specify
what parameters they have to set. So this was 1 of the bigger challenges that we found
for not just, like, even for the ML teams, but even for the data teams. Because
(26:26):
we wanted even, like, any generalized data teams to benefit from
all the sophisticated
ML systems.
So that's where we started
making our models, like, bringing more models to the suite and making it more sophisticated.
And on top of that, we also added the layer of AutoML. So where
(26:46)
users now don't even have to figure out what models they have to select,
that kind of becomes 1 of the parameters to Luminaire.
So I would say, like, that is, like, 1 of the key things
that has helped a lot of the teams
at Zillow right now
in solving a lot of the time series
problems.
(27:06):
Because
teams are not just onboarding 1 time series that they are interested in. They are onboarding, like, 100 or thousands of metrics that they care about.
And imagine the time
a team has to spend figuring out for this 1 metric, what model should I select and what parameters should I
set, and scaling that to thousands of metrics. So that's where we are, like, trying to
(27:31):
solve that bigger problem.
And that was the main motivation
also to make it open source, because as Sayan also mentioned, like, nothing we saw was available to solve that use case.
Python has become the default language for working with data, whether as a data scientist,
data engineer, data analyst, or machine learning engineer.
(27:55):
Springboard has launched their School of Data to help you get a career in the field through a comprehensive set of programs that are 100%
online and tailored to fit your busy schedule.
With a network of expert mentors who are available to coach you during weekly 1 on 1 video calls, a tuition back guarantee that means you don't pay until you get a job,
resume preparation, and interview assistance. There's no reason to wait.
(28:18):
Springboard is offering up to 20 scholarships of $500
towards the tuition cost exclusively to listeners of this show.
Go to pythonpodcast.com/springboard
today to learn more and give your career a boost to the next level.
For individuals or teams who are adopting Luminaire, can you talk through the overall workflow of introducing it to an ecosystem or a particular application
(28:44):
and what's involved in actually getting it set up and training the model and getting it deployed? So,
ideally, what this open source package is providing you is the brains, which is the models.
Now for teams, if they wanna leverage it, so their
main goal is, first of all, the metrics that they care about. And each metric basically involves, like, let's say, 2 main columns. 1 is your time column and 1 is your actual metric column. So that kind of becomes the initial input to Luminaire.
(29:16):
And
then you train it using that historical data. So make sure you provide enough history so you get a properly
trained model. And now you have this trained model
which is getting outputted. Now you wanna store that model somewhere,
and that can be any of your desired storage format. You can either go with any kind of file storage or even you can store the object in a database if you like that. So that was the main reason of, like, decoupling
(29:46):
the model with the storage and everything around it. Once you have this model,
now the time comes for scoring. So let's say if you have a metric which is generated every day or every hour,
so you can have your scheduling system
that can
run at that interval
and pulls that trained model
to score that specific
(30:07):
metric and determine
what is the score of that metric. And
that's not where the process ends for the user because
scoring will output a few metrics about the scored results.
From there, you also have to figure out,
like like, for your stakeholders,
what is the sensitivity
(30:28):
that they are interested in. So if it is, like, highly sensitive,
like, if it is, like, highly
anomalous,
then you only wanna get alerted. Or if it is, like, say, 95%
anomalous, or 99.9%
anomalous.
So, like, anomalous is different for different users. So that's where users just have to set up what threshold
(30:50):
makes more sense to them.
And that's how you leverage the output to figure out if this point
that we just scored right now is an anomaly or not. So that's how you design
your
basic training,
deploying of your model, and the scoring process.
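As a rough sketch of that train, store, and score loop, assuming the trained model object can be pickled like most Python objects and that the scoring output carries an anomaly probability (both assumptions should be checked against the Luminaire docs):

```python
import pickle
import pandas as pd

MODEL_PATH = "page_views_model.pkl"  # any file or object store would work
ALERT_THRESHOLD = 0.999              # per-stakeholder sensitivity

def store_model(trained_model) -> None:
    # Persist the trained model; a database blob column would do equally well.
    with open(MODEL_PATH, "wb") as f:
        pickle.dump(trained_model, f)

def score_latest(value: float, timestamp: pd.Timestamp) -> bool:
    # Runs on a schedule matching the metric's frequency (hourly, daily, ...).
    with open(MODEL_PATH, "rb") as f:
        trained_model = pickle.load(f)
    result = trained_model.score(value, timestamp)
    # 'AnomalyProbability' is the field name used in Luminaire's docs; treat it
    # as an assumption and inspect the scoring output for your version.
    return result.get("AnomalyProbability", 0.0) >= ALERT_THRESHOLD

# if score_latest(10250.0, pd.Timestamp("2020-11-01")):
#     send_alert("page_views looks anomalous")  # hypothetical alerting hook
```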
And are there any particular
common patterns that you've seen either internal to Zillow or from users of the open source project as to ways that it's being integrated and some of the sort of main use cases that people have been able to benefit from?
(31:23):
So this system,
this model that we provided and the example that I shared,
was just for 1 metric. Okay? And in a larger system, and even within Zillow as well, we have a lot of teams who want to leverage this. And they don't wanna leverage it just for 1 metric. They wanna leverage it for multiple of their metrics.
(31:44):
So 1 thing that we have developed, like, internally, and what we recommend to the listeners as well,
is creating a process where
your input is not just 1 metric, but it can be multiple metrics. Let's say, for example,
you have some, like, page views by devices
of your website. Okay? And you can have multiple devices over there. So you wanna create a process where you view
(32:09):
that query or that dataset as an input, and you wanna, like, divide each of these metrics into its own single mini process
and train them and store the model object, and then do the same process even for scoring.
So that's where we recommend users to create some kind of mapping mechanism.
(32:31):
So for example,
1 of the dimensions might be a device,
like, iOS,
and 1 of your metrics is page views. So you can create a mapping for that
specific metric and assign some kind of a unique key,
and that's how you can scale your system.
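A tiny sketch of that mapping idea, purely illustrative and not part of Luminaire itself:

```python
import hashlib

def metric_key(dimensions: dict) -> str:
    """Build a stable unique key for one metric slice, e.g. page views on iOS."""
    canonical = "|".join(f"{k}={dimensions[k]}" for k in sorted(dimensions))
    return hashlib.sha1(canonical.encode()).hexdigest()[:12]

# One entry per metric slice; each key maps to its own trained model object.
slices = [
    {"metric": "page_views", "device": "ios"},
    {"metric": "page_views", "device": "android"},
    {"metric": "page_views", "device": "web"},
]
model_registry = {metric_key(s): None for s in slices}  # filled in after training
print(model_registry)
```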
So that's 1 of the aspects of
scaling. And another thing that we have also seen is
(32:54):
the code that is written is pure Python.
So if you have a lot of these metrics that you wanna run in parallel,
so you can leverage some kind of distributed
processes.
So, internally, we are using this core package, like, this package with Spark.
So that helps us to train
all these individual metrics in your dataset much faster, and score them
(33:19)
faster. So that is highly recommended, to leverage those kinds of distributed processing.
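Since the library is pure Python, distributing the per-metric work is mostly an orchestration concern. A hypothetical PySpark sketch of that pattern follows; the training function and data layout are placeholders rather than Zillow's actual pipeline:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("anomaly-batch-training").getOrCreate()

def train_one(metric_row):
    # metric_row carries a metric key plus its historical points. A real job
    # would call the training flow sketched earlier and return the serialized
    # model alongside the key; here we just return a placeholder result.
    key, history = metric_row
    return key, f"trained-on-{len(history)}-points"

metrics = [
    ("page_views|ios", list(range(400))),
    ("page_views|android", list(range(380))),
    ("page_views|web", list(range(410))),
]

# Each metric trains independently, so the work is embarrassingly parallel.
results = spark.sparkContext.parallelize(metrics).map(train_one).collect()
print(results)
```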
And another thing that we have seen is
if you have a metric or a dataset
that you have, like, generated,
but there might not be just 1 team that is interested in that metric. We have seen a lot of the other product users or business users are interested in
(33:46):
that same metric. Let's say, again, take the example of, let's say, page views by some devices. Okay.
So there might be a lot of business stakeholders who care about that metric, and they wanna
be notified if there is some kind of anomaly.
So instead of,
like, each team
doing that same training and scoring process for the same metric, you are kind of duplicating
(34:08):
all of those resources.
So what we have done is
we have just, like, created a process where a user can come in and say, okay. This is that job which is running, and this is that metric that I'm also interested in. And I would like to get alerted
if the sensitivity
reaches a certain level.
(34:29):
So that way, what we are doing is
we just have 1 process which does the training and scoring, but we have another process
which maps the result of the scored value
to the subscribers
and notifies them accordingly. And we have seen
that has helped a lot of our teams and stakeholders.
(34:49):
So definitely something we would recommend.
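A minimal illustration of that subscription layer, again just a sketch of the pattern rather than anything shipped with Luminaire:

```python
# Each metric is trained and scored once; subscribers choose their own sensitivity.
subscriptions = {
    "page_views|ios": [
        {"team": "growth", "threshold": 0.999},
        {"team": "platform", "threshold": 0.95},
    ],
}

def notify_subscribers(metric_key: str, anomaly_probability: float) -> None:
    for sub in subscriptions.get(metric_key, []):
        if anomaly_probability >= sub["threshold"]:
            # Hypothetical notification hook (email, Slack, pager, ...).
            print(f"alerting {sub['team']} about {metric_key} "
                  f"(p={anomaly_probability:.4f})")

notify_subscribers("page_views|ios", 0.997)  # alerts platform, but not growth
```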
And 1 last thing is, when we initially started Luminaire,
in order to onboard jobs,
it was more, like, config driven. Like, people had to
specify, like, creating some kind of config form,
which they push to our repo. And a lot of the nontechnical
(35:10):
users were finding it difficult. So what we started is we started creating a self-service
UI,
and we integrated that with our data catalog system.
So where,
like, a user can easily come and onboard any kind of job process they wanna do
and also easily specify
the alerting thresholds.
And we automatically just take care of all the downstream
(35:34):
processes, like all the scheduling processes,
all the orchestration around it, Airflow,
running the jobs in PySpark.
So decoupling all this process and
making it self serve has definitely
benefited
Zillow and a lot of teams
at Zillow, and it's something we would recommend
(35:54):
all our listeners to also
try to do that.
1 of the other interesting aspects of things like anomaly detection and working with time series data is situations where you have some protracted
anomalies, such as the current situation with the pandemic where
everything is thrown out of whack and things that might have worked well for your steady state are now
(36:17):
difficult to predict because of the constantly changing environment.
And I'm wondering
how that affects the work of people who are using Luminaire or who are trying to identify
sources of the outliers in the anomalies that are detected and just sort of the overall impact of things like the current pandemic on people who are trying to use time series for
(36:39):
building meaningful signals that instruct the way that they want to drive their business or manage their data pipelines or things like that?
I'll repeat what I said before. So
anomaly detection is, like, a very contextual problem.
So in the context of the current situation, like in the pandemic, like,
we are kind of living in an anomalous
(37:00):
state right now.
But the nice thing about, like, the time series models, whether you are using
Luminaire or any other tools, is, like,
you can tweak them
to bring
more contextual information
into the models. So for example, you can work on narrower time windows, which is, like, the case for many metrics. Like, for example,
(37:22)
like, many operational metrics, they don't
depend on, like, long term history, and you can treat them and you can detect any problems
in that, or, like, on a local history, you can take them in and detect any problems in that. But
in general, like, the business metrics or other metrics which have
a longer
(37:43)
context of longer history, those kind of
get more impacted.
So in general, like, time series models are very fast to adapt. And
when you want to have
an anomaly in the context of a local outlier, like you're trying to find, like, something that happened today, which is independent of the overall
context of this pandemic.
(38:05):
And time series models are pretty good at that. But in general, like, when you are dealing with
such different,
like, data of such a different pattern,
it's ideal to observe it from different perspectives, involving, like, varying time windows and understanding the relationship between different time series,
(38:25)
or somehow
correlating the time series or correlating the outliers, which sometimes helps. And, also, like, doing some multivariate processing
of the time series where you can relate 1
problem with another 1, that helps a lot.
So even though you have such
big impactful
externalities such as COVID
(38:45):
or like this pandemic,
I would say, like, yeah, I mean, taking these measures helps a lot.
And in terms of the
specific projects that you've seen built with Luminaire,
either internally at Zillow or out in the community, what are some of the most interesting or innovative or unexpected ways that you're seeing it used?
So in general, any anomaly detection tools are
(39:08):
designed to work as, like, detecting anomalies in the data.
Like, even internally, like, we have seen use cases where teams are using Luminaire to detect,
like, the different quality metrics related to modeling. That kind of
takes Luminaire
from just being, like, a data
quality tool to,
(39:28)
like, a tool that
monitors, like, a whole ML data product.
So, for example, like, we have teams who are using
this for tracking model drifts
or any
changes in the code that might have introduced bugs which have changed the outcome of the model radically.
(39:50):
So
that's where Luminaire comes in, not only tracking the input side of the thing, but also
tracking the output side of the thing. And, also, like, there are situations where Luminaire has been used for
tracking slow drift, where you see, like, model performance is going down for
(40:10)
some specific reason, some slow temporal change in the data, which causes, like, slow drift, not a steady jump or drop. So these are, like, some interesting cases where you don't usually see an anomaly detection tool being used, but
Luminaire is being used in these cases.
In terms of the external
(40:30):
users, it's been pretty recent that we open sourced the project.
So we have not, like, got much, like,
feedback from the outside world, like, how they have been benefited yet. That's where we are, like, trying to spread the word and
see if someone has used it and if they found anything. Yeah. That is something where we are happy to learn more from them as well.
(40:53):
And in terms of your own experience of working on the Luminaire project
and using it internally for your own projects, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
There are
several aspects to answering this question. Like, first, from a system point of view, like,
as I discussed before, like, 1 of the challenges of dealing with
(41:17)
not only an anomaly detection problem, but specifically time series problems, is that there is, like,
temporal dependency. Like, time is itself a factor in it. So it is very important to understand this time factor not only from a statistical perspective, but also from a system perspective
because
that's where we
(41:37):
make sure, like, we are processing the right anomalies, and we are sending the optimal
number of alerts to the user.
So from the system perspective, it is very important to find the sweet spot of
scheduling the training process because
you want to train your model frequently,
but not so frequently that it will increase
(41:59)
the cost of resources. But, also, like, you don't want to train it
so infrequently that the model starts degrading.
And, also, we have a concept
of model TTL, like time to live, where
we publish a model
that comes with an expiry.
That means, like, if you have published a model, it would expire at some point,
(42:22):
that is deliberately done
because we don't want to use a very old model
to score a recent data point.
So these are the different things from the system side of things you need to understand when you're dealing with time series data.
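A generic sketch of that time-to-live idea; the expiry check below is only an illustration of the concept, not Luminaire's API:

```python
from datetime import datetime, timedelta
from typing import Optional

MODEL_TTL = timedelta(days=7)  # assumed retraining cadence for a daily metric

def is_model_expired(trained_at: datetime, now: Optional[datetime] = None) -> bool:
    """Refuse to score with a model that is older than its time to live."""
    now = now or datetime.utcnow()
    return now - trained_at > MODEL_TTL

if is_model_expired(datetime(2020, 10, 1), now=datetime(2020, 10, 20)):
    print("model expired: retrain before scoring new points")
```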
Now from the alerting side of the things, like, from the user's point of view,
we have seen challenges in, like,
(42:43):
setting up the post processing, like, how do we process the output from Luminaire?
So in the context of a user,
sometimes we send an alert
which the user finds kind of obvious, because the context
the model is trained in
is different from the context the user is looking into the model. For example, like, you are seeing, like,
(43:06):
continuously
a metric to be at 100
for, like, the last few months, and suddenly
you have seen that to be 101.
And that time, like, you will get an alert.
But from the user's perspective, the user might think, okay. I don't really care about 100 becoming 101. I want to know, like, when it goes to 100,000. So these are the different
(43:26):
challenges we have seen, like, for someone coming from a non ML background, to see this kind of problem from a different angle. And from a probabilistic
perspective, like, anomaly detection is a probabilistic
classification,
which comes with an error rate. So it's very important to understand as well, like,
what
metrics do you want to monitor? As Smit mentioned before, like, there are users, like, who come up with a task with lots of metrics, and it is very important to understand which are the important metrics to monitor. Because
(43:57):
if you
onboard
very noisy metrics or very
unimportant metrics that
usually generate noise, you tend to get more alerts from those metrics
rather than from the stable ones, which you think to be more
important to monitor.
So for this reason, we, like,
(44:17):
internally, like, have smart alerting processes
where we do some alert throttling, muting, and grouping so that we make sure, like, we send the right amount of alerts,
so that it does not really create alert fatigue where the user starts ignoring the alerts, but also, like, it does not suppress an important alert.
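For illustration, a toy version of that throttling idea, not Luminaire's implementation, which sends at most one alert per metric inside a cool-down window:

```python
from datetime import datetime, timedelta

COOL_DOWN = timedelta(hours=6)
_last_alerted = {}  # metric key -> time the last alert was sent

def should_alert(metric_key: str, now: datetime) -> bool:
    """Suppress repeat alerts for the same metric within the cool-down window."""
    last = _last_alerted.get(metric_key)
    if last is not None and now - last < COOL_DOWN:
        return False  # throttled: an alert for this metric went out recently
    _last_alerted[metric_key] = now
    return True

print(should_alert("page_views|ios", datetime(2020, 11, 1, 8)))   # True
print(should_alert("page_views|ios", datetime(2020, 11, 1, 10)))  # False, throttled
```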
(44:38):
I would just like to add 1 thing over here that we learned from the initial stages was,
initially, we were just sending out alerts for that 1 point itself.
So what was happening is, like, users were not always having
the context of, like, why is this point treated as an anomaly. So they always wanted to know
(45:00):
the previous trend along with it. So that was, like, 1 of the interesting and challenging things. So what we started to do is, we
started showing
not just the recent data point that we scored, but we also started showing them,
like, the last couple of
historical points
(45:20)
of what was observed.
So showing that context and showing the anomaly
along with it to the user helped them a lot to immediately make the decision. Oh, okay. This is why it is an anomaly.
Like, sometimes,
the significant
dip or drop might not be obvious, but then it becomes very obvious when you show them
(45:41):
the current point and some history with it.
For people who are evaluating
Luminaire or trying to understand if anomaly detection is the right tool for their particular problem, what are the cases where Luminaire is the wrong choice?
When I introduced Luminaire, we mentioned, like, Luminaire is an anomaly detection tool which works for a wide range of time series data, and it also brings automation. That means we
(46:05):
take historical information patterns or, like, different structures in the time series to understand what is an anomaly versus what is not. But if a user has
more
information about the data, like, more information about the externals of the data or, like, more context on the data, then
building a more feature based model will make more sense.
(46:27):
So in that context, like, for example,
like, if the user knows, like, there will be, like, a major release, which might increase the
traffic by March, then incorporating this information would reduce
an alert, which is a good thing, because the user already knows if the alert is coming, why the alert is coming. This is, like, a very important aspect. Like, if you have more information
(46:50)
on the data, like, then building a feature based model makes sense.
And, also, it's important to understand whether you have the resources to build a feature based model. If you have the resources and if you have more context on the data, I would say it's a
better option to go beyond Luminaire.
This also kind of ties with the challenging
part of it
(47:11):
as well. Like, sometimes
people, if they are alerted, they want to get more context on
why something is an anomaly,
And that's where, like, their teams have to later on dig into it. And that is more, like, outside of the scope of the package right now, but that is something teams should
think more about. Like, if there's something flagged as an anomaly,
(47:32)
trying to give them
more context, if possible, on why we think this is an anomaly. Like, not just specific to the models
or to the characteristics, but are there any external reasons that might have caused it? And that's 1 of the very challenging problems.
As you look to the future of the project, what are some of the plans that you have for the near to medium term or any areas of contribution that you're looking for help from the community?
(47:58):
Mostly, we are now, like, working on the streaming anomaly detection model, and at this moment, it's at a very initial stage. We have open sourced the first version of it, but we would like to build and bring more automation and bring more sophisticated features into it to bring more end to end processing,
like, where the user would need less configuration.
(48:20):
Also, like, from the context of where Luminaire is a wrong choice, as Smit mentioned, giving the context of
an alert would bring more
insight on why someone is getting an alert. So from that perspective,
diving in and doing some root cause detection, like doing some data driven
(48:41)
context extraction,
is a very important part. And we actually published a paper last year about root cause detection and how we are planning to do that inside Luminaire.
So that is 1 key part. And in terms of
improving the existing
anomaly detection model, we are planning to
incorporate more sophisticated models into the system. And, currently, we do
(49:05):
optimization, but we are planning to do some voting or, like, some sort of bagging approach
in order to identify if something is an anomaly versus nonanomalous, because, like, having an ensemble approach of dealing with
the classification model would bring more reliability.
So that is something we are planning for future.
(49:25):
Are there any other aspects of the Luminaire project or the problem space of anomaly detection and dealing with time series data that we didn't discuss yet that you'd like to cover before we close out the show?
So right now, what we internally kind of do is, like, having a fixed
frequency
when the training runs. Okay?
But that is also not a scalable solution. You actually wanna trigger tuning
(49:49):
when,
actually, your model is degrading.
So you wanna tune your
model based on your past
scoring results
and taking that into effect. And only if that degrades would you retune it. Yeah.
So, basically, this is like an internal tool that is not in the open source project right now. So what we do is
(50:13):
so we talked about the automation part, but this is more of a
self awareness
part where it reduces the maintenance cost as well. Because when you onboard an anomaly detection model and when you want to monitor something, you have to continuously
check, like, whether the model degrades and when you want to retune. Because you don't want to retune at every stage as well because that kind of
(50:37):
makes things less reliable and also, like, increases the computation cost, because, like, the tuning part is pretty expensive. So what we do is, like, whenever we score,
we store
some model performance metrics. And every time when the retraining schedule is
there, we check whether the model performance starts degrading
(50:58):
with different voting methods. Like, we take different measures of
measuring a model. And if we see, like, okay. This model should be retuned,
we trigger a retuning. So that means, like, it's kind of a complete loop of a fully automated
ML system.
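As a rough sketch of that self-aware retraining loop, mirroring the idea described here rather than the internal tool itself:

```python
import statistics

def should_retune(recent_anomaly_probabilities, alert_rate_ceiling=0.05):
    """Retune only when scoring history suggests the model is degrading, for
    example when it flags far more points than the expected false-alert rate."""
    if not recent_anomaly_probabilities:
        return False
    alert_rate = statistics.mean(p >= 0.999 for p in recent_anomaly_probabilities)
    return alert_rate > alert_rate_ceiling

scores = [0.2, 0.4, 0.9995, 0.9999, 0.9991, 0.3, 0.9997, 0.1, 0.9999, 0.9996]
if should_retune(scores):
    print("model looks degraded: trigger a hyperparameter re-optimization")
```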
That's very cool.
So for anybody who wants to follow along with either of you and get in touch, I'll have you each add your preferred contact information to the show notes. And so with that, I'm going to move us into the picks. And this week, I'm going to choose a tool called FlakeHell, which is a wrapper around the flake8 linting utility, which gives you the ability
(51:33)
to maintain your configuration in the pyproject.toml,
as well
as better maintenance of plugins and determining
which errors you want to have reported and in which contexts or for particular paths. So just a great convenience utility on top of flake8. So definitely recommend checking that out for maintaining code quality. And so with that, I'll pass it to you, Smit. Do you have any picks this week?
(51:57):
Like, over, like, the last couple of months,
I am, like, looking into this tool called Apache Ranger, which is an open source tool.
And it kind of provides
you a way to
manage or control
data authorization.
And the reason I find it very interesting is because
as
companies are becoming more data driven companies,
(52:20)
a lot of data is getting generated
every day, like, every minute, actually.
So
now a lot of teams also need access to this data.
So how do
you control that? And that's 1 of the interesting projects I found, and that will be yeah. That will be my pick. And, Sayan, do you have any picks this week?
(52:40):
So I would like to pick a book I read recently, that's Prediction Machines: The Simple Economics of Artificial Intelligence.
So I found this book interesting because, like, it talks about the broader aspects of
machine learning and AI and predictive modeling. So specifically for those who are
(53:00)
working on the very technical side of machine learning, like the machine learning practitioners or even the data engineers,
this book should be very interesting because it gives a broader picture, like, from a business and strategy point of view. So I highly recommend everyone read this book.
Well, thank you both very much for taking the time today to join me and discuss the work that you've done on the Luminaire project. It's definitely a very interesting tool and 1 that I plan to take a look at myself. So I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thank you, Tobias, for having us on your show.
(53:35):
Thanks, Tobias, for having us in the show and for all your time. Thank you.
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com
to subscribe to the show, sign up for the mailing list, and read the show notes.
(53:57):
And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com
with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.