September 6, 2025 45 mins
We welcome guest Ang Li and dive into the immense challenge of observability at scale, where some customers are generating petabytes of data per day. Ang explains that instead of building a database from scratch—a decision he says went "against all the instincts" of a founding engineer—Observe chose to build its platform on top of Snowflake, leveraging its separation of compute and storage on EC2 and S3.

The discussion delves into the technical stack and architectural decisions, including the use of Kafka to absorb large bursts of incoming customer data and smooth them out for Snowflake's batch-based engine. Ang notes this choice was also strategic, avoiding tight coupling to a single cloud provider's services, like AWS Kinesis, which would hinder future multi-cloud deployments on GCP or Azure. The discussion also covers their unique pricing model, which avoids surprising customers with high bills by charging a lower cost for data ingestion and then using a usage-based model for queries. This is contrasted with Warren's experience with his company's user-based pricing, which can lead to negative customer experiences when limits are exceeded.

The episode also explores Observe’s "love-hate relationship" with Snowflake, as Observe's usage accounts for over 2% of Snowflake's compute, which has helped them discover a lot of bugs but also caused sleepless nights for Snowflake's on-call engineers. Ang discusses hedging their bets for the future by leveraging open data formats like Iceberg, which can be stored directly in customer S3 buckets to enable true data ownership and portability. The episode concludes with a deep dive into the security challenges of providing multi-account access to customer data using IAM trust policies, and a look at the personal picks from the hosts.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:08):
Welcome back everyone to another episode of Adventures in DevOps.
And because I'm here all by myself, I've brought as
a small consolation another fact.

Speaker 2 (00:17):
So this one's security related.

Speaker 1 (00:19):
Google itself is being used to spoof login attacks, and
this is because they made some mistakes with their custom domains.
If you're running a website, you can actually run it
on a Google domain. And so I'm here to say, like,
if you're not already offering passkeys with your solution,
it's now more critical than ever, realistically, and if that's
something that anyone needs help with, I guess you

(00:40):
know where to find me. So today I've brought on
Ang from Observe Inc. He's the head of engineering
and was previously a founding engineer for the company. We're
hoping to really get into something interesting, which is not
just all about observability, but really how they built
a platform to scale.

Speaker 3 (00:58):
So welcome on. Thanks for having me, Warren.

Speaker 1 (01:00):
We've actually had a bunch of experts in observability in
the past to focus on different things. So there was
a previous episode with Adriana Villela all about OTel,
and building self-healing systems with Sylvain Kalache. So if
anyone is interested in deep dives on those topics, you know,
we've sort of investigated them. But I'm really interested in

(01:21):
the observability at scale perspective because one of the things
that's always really caught me off guard is just how
much data comes in from the customer systems into a
third party tool. So if you're doing logging or collecting metrics,
there's just a lot of data being
passed over the wire.

Speaker 3 (01:39):
Right, and you know, machines are very good at generating
a ton of data. You have seen news that, for instance,
OpenAI is generating more than one petabyte of data per day,
and that's, you know, not that rare. You know, we have seen,
among our customers, you know, multiple cases where people
are sending, you know, hundreds of terabytes of data per day,

(02:01):
even like a petabyte. So, you know, how do you
even store those gigantic amounts of
data, you know, at scale, cheaply? And also how do
you process them?

Speaker 1 (02:13):
And how do you... so, you know, how do you buy a megabyte?
You just throw it in S3, right? Problem solved?

Speaker 3 (02:19):
Yes. So the cloud, definitely, like, the modern cloud is very
magical in terms of that. Even if you consider, like,
just dumping everything into S3, right, there is still
the challenge of how to organize the data, right? Like,
do you want to do, like, a columnar store,
do you want to do a row store? Like, and if
you put stuff into S3, like, how do you

(02:41):
index it? Or do you even index it?

Speaker 4 (02:44):
Right?

Speaker 3 (02:44):
So many of the modern data warehouses, like, you know,
like Snowflake or Databricks, they use a different scheme, right?
They don't index their data. Rather,
they partition their data and then they cluster their data.
As an observability vendor like ourselves, we would definitely, you know,
do a lot of this kind of custom clustering and,
you know, custom data modeling to make sure that, you know,

(03:05):
people can access this data at scale.
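As a rough illustration of the partition-and-cluster idea Ang describes, here is a minimal Python sketch: if data is laid out under time-based S3 prefixes, a time-bounded query only touches the prefixes it needs, with no index at all. The bucket name and layout are hypothetical, not Observe's actual design.

```python
# Toy sketch of partition pruning: data lives under date-based S3 prefixes,
# so a time-bounded query lists only the prefixes it needs instead of
# scanning (or indexing) the whole dataset. Names are made up.
from datetime import date, timedelta

def partition_prefixes(bucket: str, start: date, end: date) -> list[str]:
    """Return only the S3 prefixes whose partition overlaps [start, end]."""
    prefixes = []
    day = start
    while day <= end:
        prefixes.append(f"s3://{bucket}/logs/date={day.isoformat()}/")
        day += timedelta(days=1)
    return prefixes

# A query over two days touches two prefixes, not a year of data.
print(partition_prefixes("obs-data", date(2025, 9, 1), date(2025, 9, 2)))
```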

Speaker 1 (03:08):
So, like, maybe, maybe spill some secrets here. Like, what's
Datadog and Databricks and Observe doing? Like, how
are you, how are you storing the data? Are you
using S3 for some part of this, or is it you have
your own, like, virtual machines where you're
running your own sort of custom database, or some
database provider that you're managing there too, at

(03:30):
the end, like, on your own infrastructure, to actually store it?

Speaker 3 (03:33):
Yeah, very good question. So, you know, when we
got started, that was around two thousand eighteen.
The basic idea that we had was, hey, you know,
around that time, you know, data warehousing was kind of booming, right?
You have, like, Snowflake and Databricks, and you look
at them and you thought: the kind of generic
relational execution engine they built on top of, you

(03:56):
know, S3, essentially, and EC2, is actually a pretty ingenious
separation of compute and storage, right? So you can store
your data in, you know, S3, but then you can
basically, essentially spin up EC2 clusters to process that data
on demand. If you just want to store the data,
if you don't want to run a query, you don't
have to pay for all the EC2 machines to

(04:16):
be online all the time. Observability vendors, you know,
like Datadog, they usually have their own kind of,
you know, proprietary execution engine. Being a startup engineer, you
feel like, hey, we probably don't want to spend the
first four or five years of the startup building a
database from scratch. You know, we more wanted to

(04:38):
provide value to our customers, like real use cases.

Speaker 1 (04:43):
I mean, it's interesting bringing that up, because I actually
think that this is something that Charity has spoken
a lot about in the past. Actually, when she built Honeycomb,
they did actually roll their own database.

Speaker 2 (04:54):
The thing that kept coming back is like, no,
that was definitely a mistake.

Speaker 3 (04:57):
Like, yeah, it went against all the instincts of me
as an engineer, right? Like, I would love to build
a database from scratch. But if you really want to build
a solid database from the ground up that can support
all kinds of functions, yeah, it's going to take years.

Speaker 1 (05:14):
It's going to be a challenge to do it better
than the ones that are out there today already. Like
you pretty much would have to first employ database architects
who have done this many times before, and then you
spend all of your time and effort building up something
that isn't even going to be really the core competency
or competitive advantage of what you're selling

(05:35):
as a product.

Speaker 2 (05:36):
So what did you go with off the shelf, then?

Speaker 3 (05:38):
Yeah, so we went with Snowflake, you know.
There are a bunch of kind of non-technical reasons
for going with Snowflake, but, you know, we
use a lot of the mechanisms they provide,
and Snowflake allows doing auto-clustering of your data. You know,

(05:58):
Snowflake obviously has excellent semi-structured data support, so that's
also a thing that we leverage a lot in our solution.
And Snowflake, obviously, you know, runs on top of
S3 and EC2, so, you know, you can
say we kind of also consider that as, you know,
building on top of S3 and EC2, and we

(06:19):
can just ingest data into Snowflake. Pretty much, you know,
for ingest in general, you're probably looking at something like
thirty seconds of latency or so. That is acceptable, right,
because you're dealing with, like, you know, terabytes or petabytes
of data, right? But, you know, for some of the
use cases you probably want better, right? Like, if you
want to do, like, you know, real-time monitoring, or

(06:42):
you want to do something like very real-time troubleshooting, right,
you will want something that's seconds or even milliseconds
of latency.

Speaker 1 (06:49):
Okay, so did you build, like, a proxy
layer that sits, like, on top of your
own Snowflake account, that has the data from customers being
filtered and then routed to the appropriate tenant inside there?
And is it just passing the data off? Is there
something happening in your architecture before the

(07:12):
data is actually entering?

Speaker 3 (07:13):
You know, we obviously use, you know, Kafka and
stuff to queue the data before it gets into Snowflake.
We are doing a little bit of processing in
that part of the system, not a whole
lot right now. I think it's mostly about filtering the
data and doing some kind of data shaping, but we
also do authentication, and there are different endpoints as well.

Speaker 1 (07:36):
So I've never found a good reason to use Kafka
in my experience, and the one
bad time that someone wanted to use it in my proximity,
I had vetoed it, because at the time it still
required running ZooKeeper, you know, a third-party, another
tool, in order just to manage it. I think that's
finally gone away. But I'm actually curious what the need

(07:56):
would be at the edge in order to process.

Speaker 3 (07:59):
I think Kafka is mostly just for queuing, because,
you know, Snowflake, it is fundamentally a batch-based
engine, right. So, well, I think right now
it's getting better; they have some kind of native support
for, like, stream-ingesting data, but I think back then
they only supported ingesting data in batches. So basically you

(08:20):
need to upload the data to S3, and then you
have to call Snowflake, and then Snowflake would, you know,
find that data and upload that batch of data into
their own S3, essentially. But then, you know, we have
to handle streaming data from the customer, right? So we
need something in the middle to convert the data from
streaming to batch, and to absorb any kind of,

(08:43):
you know, burstiness from the customer. Because, you know,
with Snowflake, uploading stuff is a pretty much constant throughput, right?
But when the customers are, you know, throwing data, it,
you know, highly depends on what the customer is doing.
Like, we have this instance, like, a video
intelligence customer who's doing kind of video analytics. Yeah,

(09:05):
fun story: they were doing, like, video analytics stuff,
and back then it was the World
Cup season, and whenever there's a game showing, you could
see the customer's traffic spike. At that point,
we were not super good at handling some of those spikes,
and we'd always have an incident or something that

(09:25):
we needed to deal with on Sunday. I'm like, ah yeah,
it's definitely, like, one of those World Cup incidents that
this customer is jamming us up with. So yeah, so
we kind of need Kafka to absorb those
large bursts of data and to smooth them out.
If you asked me, like, do I love
Kafka? The answer is obviously no.
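To make the streaming-to-batch conversion Ang describes concrete, here is a minimal sketch: buffer Kafka records, flush them to S3 in batches, and ask Snowflake to load each batch with COPY INTO. The topic, bucket, stage, and table names are hypothetical placeholders, and error handling and delivery guarantees are omitted; this illustrates the pattern, not Observe's actual pipeline.

```python
# Minimal sketch of a streaming-to-batch shim between Kafka and Snowflake.
# All names (topic, bucket, stage, table) are hypothetical placeholders.
import time
import uuid

import boto3
from kafka import KafkaConsumer          # pip install kafka-python
import snowflake.connector               # pip install snowflake-connector-python

BATCH_BYTES = 64 * 1024 * 1024   # flush once the buffer reaches ~64 MB...
BATCH_SECONDS = 30               # ...or every 30 s, matching batch-load latency

consumer = KafkaConsumer("ingest", bootstrap_servers="broker:9092")
s3 = boto3.client("s3")
sf = snowflake.connector.connect(account="...", user="...", password="...")

buf: list[bytes] = []
buf_bytes, started = 0, time.time()
for msg in consumer:                      # Kafka absorbs bursty customer traffic
    buf.append(msg.value)
    buf_bytes += len(msg.value)
    if buf_bytes >= BATCH_BYTES or time.time() - started >= BATCH_SECONDS:
        key = f"batches/{uuid.uuid4()}.ndjson"
        s3.put_object(Bucket="ingest-staging", Key=key, Body=b"\n".join(buf))
        # Constant-throughput side: tell Snowflake to pull the batch in.
        sf.cursor().execute(
            f"COPY INTO raw_events FROM @ingest_stage/{key} "
            "FILE_FORMAT = (TYPE = JSON)"
        )
        buf, buf_bytes, started = [], 0, time.time()
```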

Speaker 1 (09:49):
Yeah. So it sounds like, you know, you're pretty much
having that run on the virtual machines at AWS that
you control, for

Speaker 2 (09:55):
handling the load.

Speaker 1 (09:56):
Now I'm curious then, like, was there ever a consideration
to use something like Kinesis?

Speaker 3 (10:01):
Yes, so we definitely have thought about it. But I
think eventually we decided to not get super tied to
the AWS ecosystem, just because of this pretty practical concern, right:
because, you know, people will ask us to provide a
GCP or, like, Azure deployment at some point in the future.

(10:25):
I think right now we already have kind of a
GCP deployment, and kind of porting that between different clouds
is going to be a challenge.

Speaker 1 (10:34):
Interesting. I know, what we do in those circumstances is, our
main workloads are still running on AWS, but
when we need to have connectors into other clouds, we
will use, like, GCP's Eventarc, and I forget the equivalent
thing in Azure, to connect in and stream the analytics
that we see inside our product back to our
customers' implementations.

Speaker 2 (10:54):
And I can see the opposite happening. I don't know the.

Speaker 1 (10:56):
corresponding things to Kinesis are. I mean, that's been
a much better sell internally for us, because we could
better utilize the advancements that the cloud providers are making,
rather than trying to manage all the non-serverless aspects
of spinning up, really, like, EC2 machines, and then
deploying complicated technology on top that no one internally is

(11:17):
an expert on, just to deal with the load. I mean,
I always find Kinesis interesting. Like, I've always hated it
as a product. If you're not in AWS, you know, anyone
that's listening: there's two forms of Kinesis. There's, like,
Firehose, and then, like, Streams, and I can never
remember the difference between what these two
things are. One of them does sharding and is for,

(11:38):
like, pulling data, and the other one's, like, for pushing
data and

Speaker 2 (11:40):
Piping it somewhere else.

Speaker 1 (11:41):
But it's always really interesting to me, because it seems
like even at small volumes of data, it may work fine.
And then I get the questions like, well, how much
data can you actually funnel at it? And I'm sort
of curious, like, do you actually know, like,
how much total customer data is coming in
through your platform?

Speaker 3 (11:56):
I don't have a concrete number at the top of my head,
but I can say probably multiple petabytes per day of
data coming in. Probably more than ten petabytes of

Speaker 1 (12:08):
Data per day, you know, I honestly have no idea
if that's a lot.

Speaker 3 (12:12):
Yeah, I mean, it kind of blows my
mind, you know, how much, you know, crap people generate.
I guess I shouldn't say that as an observability vendor, but the
sheer amount of data that comes in, and you
get these, like, billions of rows of logs, you know,
every, you know, few seconds. Do people actually need

(12:33):
so much data coming from their systems?
But the truth is, I think if you kind of
filter it down, and if you have kind of a
pretty flexible pipeline, right, you allow people to kind of
branch off the data and clean it up and to
transform it into a shape that they can query more easily,
that actually brings a lot of value for these customers, right?

(12:55):
And it also has kind of a little bit of a
benefit of, hey, you know, I don't have to drop data,
so I don't have to pre-select, pre-choose what
data might be valuable to us, you know, down the
road or not.

Speaker 1 (13:07):
And I say recently, and I really mean, I guess,
the last six years or so, I did see a
bunch of startups that popped up whose only goal was
to basically figure out and drop data before it entered
Datadog, to avoid the cost hit. And I'm
wondering, like, how you've managed to overcome that.
Is it a matter of you're not charging so

(13:30):
much? Like, you're basically charging, like, a lower
flat rate that's closer to the infrastructure cost for storing
the data, and then it's some sort of usage-based
pricing off of the queries, or something else? Like,
how do you make that trade-off
financially effective?

Speaker 3 (13:46):
Yeah, that's a great question. So there
are multiple aspects to this. So, like, for
one, you're definitely right that, you know, we
don't try to make a lot of money off, like,
storage cost, and I can say the same is true for, you know,
Snowflake: they basically just pass down the infrastructure cost of S3.
The other part of this: we don't use indexes and

(14:09):
all that for a lot of our data, so ingestion,
you know, comes pretty cheap for us. And we are
also very good at kind of dynamically kind of scaling
up and scaling down the computational resources. Usage-based
pricing, right, so that by itself is a very
interesting topic. You would think, you know, usage-based pricing
is a good idea. I think it worked very well

(14:30):
for the business intelligence use cases; like, you know, Snowflake's
business does very well with its usage-based pricing. But
it actually didn't work very well for our customers, because,
you know, for observability, you really, you know, work
with a budget and you hate surprises. You hate that, hey,
why is my usage bill this month, you know, through

(14:53):
the roof? You want a vendor to say, hey,
you know, I'm only going to charge you so much.
So the nice thing is, you know, as a vendor,
we also have some legal room of saying that, hey,
if you pay us so much, we can start to
kind of throttle you if you go way over, you know,
in terms of your usage. It's kind of like the
mobile phone data plan model, right?

(15:14):
So if you're over your limit, you can still use
your phone; like, you can still query data, but it
won't be, like, super fast, right? So you can still
get your job done. So that offers a much
better kind of off-ramp.
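As a rough illustration of that data-plan-style off-ramp, here is a minimal sketch of soft throttling: over-quota customers get slower queries, never hard failures. The quota numbers, the delay, and the in-memory bookkeeping are all invented for illustration, not anything Observe actually runs.

```python
# Toy sketch of the "mobile data plan" model: over-quota customers are
# slowed down, never cut off. Quota numbers and names are made up.
import time

MONTHLY_QUOTA_CREDITS = 1_000        # hypothetical per-customer query budget
OVER_QUOTA_DELAY_SECONDS = 5.0       # soft penalty instead of a hard error

usage: dict[str, float] = {}         # customer_id -> credits used this month

def run_query(customer_id: str, sql: str, execute):
    """Run a query, degrading latency (not availability) once over quota."""
    if usage.get(customer_id, 0.0) >= MONTHLY_QUOTA_CREDITS:
        time.sleep(OVER_QUOTA_DELAY_SECONDS)   # slow lane, still served
    result, credits = execute(sql)   # execute() stands in for the query engine
    usage[customer_id] = usage.get(customer_id, 0.0) + credits
    return result
```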

Speaker 1 (15:28):
That's really interesting, because we actually saw the complete opposite
thing when it came to our domain. So we offer,
like, login and access control for our customers; we
generate JWTs for the applications that our customers write. And
the ones that have come to us have said, we
don't want usage-based pricing, we want to pay, like,
per number of users or whatever. And I'm like, what

(15:51):
do you want to happen when you go over that number?
Option number one: users can't log in. Option number two:
you may say you want consistent charges, but that means
you are waiting until the end of the year, basically,
or end of the month, depending on the subscription plan,
to get the charges anyway. So is it better to

(16:11):
wait till, you know, six months or seven months,
to the end, whatever the end of the year is, to
actually feel that hit, or better to get it, you know,

Speaker 2 (16:18):
per month, during the subscription?

Speaker 1 (16:21):
And realistically, they're like, oh yeah, no, actually, that's bad.
Like, you know, yeah, obviously, yeah, we don't want to
wait to pay, and we can't have you degrading the
experience for our users. And I'm like, okay, well, now
you know why our infrastructure thing charges in this way.
I mean, I can imagine you're not necessarily in the
critical path of the end users' applications, although I can also,

(16:45):
on the flip side, see the value that you're providing.
So you mentioned, like, really driving some business-specific use cases,
so not necessarily just having the logs available, or the
metrics on, like, whether or not there's uptime,

Speaker 2 (16:57):
et cetera. But I'm really curious what

Speaker 1 (16:59):
you've seen there from, like, a competitive advantage for your
customers, who now have access to basically all of the
data from their platform that, if they had used something else,
they would not have retained.

Speaker 3 (17:12):
There's a lot of interesting stuff that we
have seen our customers doing, right? So, for instance,
you know, many customers would, you know,
you know, using their logs and, you know, tracing data,
build kind of higher-level kind of customer journeys,
you know, by doing some kind of aggregation, right? So
you have, you know, these different events that belong to the

(17:34):
same user, right? We want to aggregate them
and, you know, form a session, you know, similar
to the things that you would do with, like, a
web analytics tool, like, you know, Google Analytics and
so on, but you can kind of customize,
and you can decide what are the events that belong
to the same session; you can aggregate them, right?
So that's very common; like, people do that all

(17:56):
the time. And, you know, in many cases people can also
join in with their business data, right? Like, hey, I
have this log message, right? It references, you know,
customer one, two, three, four, five. That's just an opaque ID.
How do I know, you know, who customer one, two, three, four,
five is? What kind of, what customers are affected by
this particular incident, and, you know, by how much?

(18:18):
The margin of the customer for this month, right? We
can actually calculate that all the way just from log data. Yeah,
we also kind of ingest a lot of the financial
data into, you know, our system as well, so that
we can, you know, have, like, the sales data and
all that in one place. When something happens, we can exactly
pinpoint, like, for instance, the one incident that caused us

(18:40):
to, you know, have to burn so much computation for
that customer, that would explain this ten percent, you know,
margin loss.

Speaker 1 (18:46):
That's interesting. I mean, if I understand you correctly, there are
actually scenarios where they're potentially using Observe almost
as their customer database or CRM, and importing the
data from wherever their other systems are, hopefully not
Excel spreadsheets, but could be, rather than funneling the

(19:08):
customer-specific data, like, you know,
metrics or usage patterns for the whole tenant or customer
account, out into, say, whatever the sales or customer success
organization is using. So this could be, like, unfortunately, like,
a Salesforce or something like that. So it is interesting
to hear that some people are actually utilizing the

(19:32):
ability inside your tool to make that happen, because, from
my experience, these things, like, happen in the Tableaus of
the world, and anyone who is technical in any way
always hates these tools.

Speaker 2 (19:45):
Like, it was never the

Speaker 1 (19:46):
time where I saw an engineer jumping
into Tableau or a Looker and was like, wow, this
is, like, the best experience ever. That never happened.

Speaker 3 (19:54):
Yeah. Yeah, and for us, you know, our
internal use cases are definitely kind of biased,
you know, for obvious reasons.

Speaker 2 (20:05):
No, no, I don't think so.

Speaker 3 (20:08):
Yeah, but I think one thing I would point out is,
you know, when you build an observability platform, right,
essentially you are trying to build, uh, you know,
from a technical perspective, a pretty
good streaming engine, right? Because the data is always kind
of streaming, you know. Compare that with a traditional ETL
kind of system, right? With ETL you have to specify, hey,

(20:31):
you know, how often do I want to run this pipeline?
You know, is it by day or by month? You
have to think about that, right? So it, like, forces
you to think in a batch kind of mode.
And you also have to think about, oh, what if
this job failed? I'll have to retry and, like,
stitch the data together, and so on and so forth, right?
But if you build a kind of streaming-

(20:53):
first system, right, the system kind of takes care of
that for you, right? So you don't have to think about, hey,
how often do I want to run this job?
The system kind of decides based on the incoming volume:
do I chop it up into, like, one-minute fragments
to process, or do I chop it up into, like,
ten-minute segments to handle? So, if you have this

(21:15):
kind of capable streaming system, you naturally want to kind
of do more use cases on top of it, because
it's sometimes easier to express what you want to do.
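To sketch the idea of the system picking its own batch window from incoming volume, here is a toy example; the target batch size and the clamp bounds are invented for illustration, not Observe's actual heuristics.

```python
# Toy sketch of adaptive micro-batching: the system, not the user, picks the
# processing window based on incoming volume. Thresholds are made up.
TARGET_BATCH_ROWS = 1_000_000   # aim for roughly this many rows per batch

def choose_window_seconds(rows_per_second: float) -> int:
    """Pick a processing window so each batch lands near the target size."""
    if rows_per_second <= 0:
        return 600                             # idle stream: fall back to 10 min
    window = TARGET_BATCH_ROWS / rows_per_second
    return max(60, min(600, int(window)))      # clamp between 1 and 10 minutes

assert choose_window_seconds(50_000) == 60    # hot stream -> 1-minute fragments
assert choose_window_seconds(2_000) == 500    # quiet stream -> larger segments
```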

Speaker 1 (21:26):
So if there was someone out there that thought they
had a similar need to handle such large amounts of
incoming data over the API... I mean, you're obviously using
Kafka here to handle some of the load before the
systems, which aren't designed to be either idempotent or
to have the amount of scale that you need. I'm just interested,
like, what the whole interaction is here.

Speaker 2 (21:46):
Like, are you

Speaker 1 (21:47):
providing custom agents for your customers to run on their
virtual machines? Or is it an SDK? Or is it
just a matter of exporting OTel-compliant data out? And,
on your side, are you using something to handle the
streaming, or is this, like, custom technology that you had
to build, because, I don't know, X, Y, Z just

(22:08):
didn't cut it, as far as both having the functionality
and also the reliability that you would need?

Speaker 3 (22:13):
On the ingest side, we basically do pretty standard stuff.
We do have our own agent that you can install,
and it will collect the data. And we also have,
you know, standard endpoints, like, you know, like,
like I said earlier, you know, OTel and all that.
You know, if you have OTel instrumentation already
done, you can just point your, you know, OTel

(22:36):
library to us and then we'll, we'll take
it very easily from there. On our back end, we
do build a lot of custom things for stream processing,
you know, because Snowflake itself is not a streaming engine, right?
Snowflake only offers storage and compute, right? And Snowflake
does offer some form of materialized view support,

(22:59):
but we don't use those, mostly because the pipeline that
we're building is kind of interesting. You know, we want
to have a lot of flexibility for the user to
do streaming data transformation for the observability use case. You can
imagine a lot of cases where I have already collected
a lot of historical data, right? I would want to,
you know, reprocess that historical data using my new pipeline, right?

(23:24):
So that's one of the main reasons why we built
our own kind of streaming system sitting on top of Snowflake, right?
And, you know, just to add kind of another little
aspect of this: we also didn't use the existing
materialized view solution provided by Snowflake, to get something
that allows you to control backfill. Because, you know,

(23:45):
you can just start a new view and you let
it materialize, but then it's going to materialize the whole thing, right?
And imagine the case where you have already collected historical
data for, like, a year, right? And, you know, just
reprocessing that whole year's worth of data is going to
be very expensive and very time-consuming.
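Here is a toy sketch of the backfill-controlled incremental materialization just described: a derived view keeps a watermark and, when recreated, starts from a chosen point instead of re-deriving a year of history. All names and the mechanism are illustrative assumptions, not Observe's actual system.

```python
# Toy incremental pipeline with controlled backfill: nothing before the
# user-chosen start point is ever reprocessed, unlike a from-scratch
# materialized view that re-derives all history.
from datetime import datetime, timedelta

class IncrementalPipeline:
    def __init__(self, transform, start_from: datetime):
        self.transform = transform          # user-defined shaping/filter logic
        self.watermark = start_from         # nothing before this is reprocessed

    def refresh(self, read_raw, now: datetime):
        """Process only the slice (watermark, now]; cheap even on old tables."""
        new_rows = read_raw(since=self.watermark, until=now)
        out = [self.transform(r) for r in new_rows]
        self.watermark = now                # advance, never rewind past start
        return out

# Recreating the pipeline with a recent start date avoids materializing a
# whole year of history.
pipeline = IncrementalPipeline(transform=lambda r: r,
                               start_from=datetime.now() - timedelta(days=7))
```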

Speaker 1 (24:04):
So, you know, this has actually made me think, because
this may be the first time I've actually heard someone
utilizing Snowflake like that, as a core component of
the architecture. Is this, like, an expected use case that
Snowflake supports? Or is it, I mean, is there,
like, a concern here that Snowflake will be like, no,
I'm sorry, you can't, like, embed this as a, as

(24:26):
a, you know, part of a fundamental product, if you're building
something that sort of competes with it in any way?
Is that a concern at all? Or is it just,
like, no, this is, like, sort of supported, there are
actually lots of companies doing that?

Speaker 3 (24:37):
The answer to this is, kind of, there are different,
different layers to this. So I think, from the business standpoint,
I think Snowflake wants to become kind of a platform
player, with people building applications on top of it. I
think to that end, you know, I think the businesses
are pretty aligned, so they would like people, like
us, to have a deeply integrated application

(24:58):
running on top of Snowflake. With that said, you know,
we are also pretty unique in the way we
use Snowflake, because, you know, your usual data
application on top of Snowflake would be, hey, I offer
some customer UI to, you know, query the data that
has already been stored in Snowflake. And because we

(25:18):
run a data pipeline, like, in a streaming fashion inside Snowflake,
we are accountable for, like, over two percent of Snowflake
queries at this point.

Speaker 2 (25:28):
Wow, that's quite the impact there.

Speaker 3 (25:30):
Yeah. So we helped discover a lot of bugs, obviously,
in their, in their system. So we have kind
of, like, a love-hate relationship with their engineers.
So on one hand, they like us because, you know,
we helped kind of stress test a lot of aspects
of their infrastructure. On the other hand, I can only
guess that we caused some, you know, sleepless nights for

(25:53):
their on-call engineers.

Speaker 1 (25:55):
You're lucky in this way. It's good to evaluate
the platforms that you're utilizing before making a
decision on how you're relying on them, because if the
fundamental direction of the particular product you're using,
or DB engine, or whatever third party, is different from
their expected usage, you could run into, I mean, not
just downtime, but really just fundamental limitations at some point

(26:16):
where they're like, I'm sorry, we can't help you.

Speaker 2 (26:18):
So I guess in that way, you're lucky.

Speaker 1 (26:20):
But I can also understand the flip side of it
being so critical, not just for a single customer, but
basically your whole business, now when there's an issue with Snowflake.
And I don't mean, like, an incident, but just realistically
finding an unknown edge case that isn't supported. Like, oh yeah,
you know, we're hitting the limits of the throughput that
is available here, because, I don't know, Snowflake never imagined

(26:42):
that there'd be so many queries coming in through, like, a
single account. Or another problem that we see often is,
like, the customer-of-a-customer-of-a-customer issue,
where it's easy for you as a customer
of a product to add multi-tenancy, and the product
you're using may support multi-tenancy, but if your customers
are also multi-tenant solutions and they want to put

(27:02):
tenants in your solution, then you're passing, you know, tenants
of tenants into your provider.

Speaker 2 (27:07):
It's like, you know, turtles all the

Speaker 1 (27:08):
way down. And I can imagine not all those things
may have been built out effectively. So do you see
a future that is just, like, well, you know, as
we grow in either scale, volume, amount of queries, or
monetary value, where Snowflake no longer is the critical backbone
for how you're storing and managing that?

Speaker 3 (27:31):
We definitely took kind of a leap of faith at
the beginning of the company. There could, you know, be
a case, or a different alternative reality, where,
you know, Snowflake just couldn't handle whatever we're throwing
at it. The way I'm thinking about the future is
really that, you know, now we have, like,
you know, Iceberg and all those kinds of open formats

(27:54):
for storage, right? So you kind of can see
the trend where storage is getting essentially commoditized. Basically
there's not a whole lot of money to be made
on storage, and also there's not a lot of useful
kind of proprietary stuff in terms of storage formats, right?
So everybody is kind of converging onto these open formats.

(28:16):
So that's, you know, definitely good news for us, right?
Because if we can leverage those open formats, then that means,
you know, we have a choice of different execution engines
we can run on top, right? Like, Snowflake may be one
of them. Like, Snowflake is very good at doing, like,
large-scale, you know, joins and that kind of stuff, right?
But then if I just want to do something simple, right,

(28:38):
something like scan a terabyte of data and do, like, a
quick aggregation, right, maybe a different execution engine would be
cheaper to run and would be more efficient, right? So
I think that's kind of, like, the way of kind
of leveraging, and kind of hedging our bets a little
bit across different execution platforms.

Speaker 1 (29:00):
That makes a lot of sense. I mean, and now
you've reached sort of the edge of my
experience here. So there's the Parquet format, which is, like, a
columnar format that actually stores optimized data, so you don't
have to repeat the property values, and it's easy to
filter append-

Speaker 2 (29:14):
only logs, for instance.

Speaker 1 (29:15):
And this is, like, one of the most common ways
in which even AWS is storing data within S3.
And then you mentioned Apache Iceberg. I know there isn't,
like, full support in AWS yet, but I am sort
of curious, like, what's the fundamental difference between these? And
was the decision to use that, like, one particular format
versus another something like a one-way door, where you
sort of made the decision early on, like, that's
the way you're going, at least not coupled to a

(29:37):
proprietary format? Or is it something that you think is,
like, pretty easy to switch out down the road, because
it's, like, an internal detail?

Speaker 3 (29:44):
Oh, let's just say Parquet is one of the formats
that Iceberg tables can be stored as. Parquet, if you
think about it, is just a bunch of S3 files;
there's no concept of a table, right? So, from a
database perspective, to query a table, I have
to know, you know, what the schema of the table is, right,
what columns it has. And I need some

(30:07):
metadata to tell me, hey, in order to query this,
you know, this table, from row, like, one hundred to
one thousand, which Parquet file should I scan? Right? That's
essentially what Iceberg provides. So Iceberg provides this metadata to
connect the concept of a table with the raw Parquet files
that you store on S3. If, if you simplify,

(30:29):
really, Apache Iceberg is nothing more than just
a bunch of metadata files that you store alongside
your Parquet files, right? And I think that the good
thing about Iceberg is, you know, it offers the
capability of doing what's called pruning on your data. Right?
So, for instance, if this Parquet file's values are

(30:52):
between one hundred and, you know, two hundred, right, and
if I have a query saying, hey, I only want,
I want to filter down the data to anything
with a value greater than two hundred, then you
know that this Parquet file will never contain
any rows that would satisfy this filter predicate, right? So
you can choose to not scan that Parquet file at all. Right?

(31:13):
So Iceberg basically offers these metadata for the
query engines to proactively prune out these useless Parquet files
and to make the query efficient. So that's also a
pretty kind of interesting kind of enterprise use case, because
many of the enterprise customers, they are already thinking about

(31:36):
kind of building a data lake, right, using these open
formats like Iceberg, because they also don't want to be
locked, locked in by all those vendors. Right? So when
we come in, you know, we are also kind of
yet another data vendor, right? And they will be like, hey,
could you also expose, you know, all those observability data
(31:56):
I have ingested into you through Iceberg, so, you know,
they can actually own those data and they
can, you know, do more things with those data? And
that's also kind of like an opportunity for us as well,
because, you know, if we build our stack on top
of Iceberg, then exposing them to the customer also becomes
a lot easier.
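Ang's example maps directly onto a toy version of min/max pruning. The sketch below uses a simplified stand-in for Iceberg's manifest metadata (the real format tracks per-column bounds per data file in manifest files); the paths and values are hypothetical.

```python
# Toy illustration of the min/max pruning Iceberg's metadata enables: the
# planner skips any Parquet file whose column range can't satisfy the filter.
files = [
    {"path": "s3://lake/t/a.parquet", "value_min": 100, "value_max": 200},
    {"path": "s3://lake/t/b.parquet", "value_min": 201, "value_max": 900},
]

def files_to_scan(predicate_lower_bound: int) -> list[str]:
    """Keep only files that could contain rows with value > the bound."""
    return [f["path"] for f in files if f["value_max"] > predicate_lower_bound]

# For "value > 200", a.parquet (max 200) is pruned without ever being read.
assert files_to_scan(200) == ["s3://lake/t/b.parquet"]
```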

Speaker 1 (32:15):
So wait, like, ingress into S3 is, like, free, right?
But then if they're actually exporting the data from your platform,
then that's got to, especially, like, at a large sum, that's
got to build up some costs, right? And...

Speaker 3 (32:25):
Yeah, so the point is they don't, they don't have to, right?
So they can basically choose to give us their S3
bucket, right? So we basically directly ingest data into
their S3 bucket, in the Iceberg, in the Iceberg format.
So in that way, they truly own the data, right?
So the data, all the data, is in their bucket.

(32:46):
They have access to it, but we are basically, you know,
one of the applications that sits on top and operates
on those data.

Speaker 1 (32:53):
How do you go about securing access to a whole
bunch of customers' S3 buckets?

Speaker 3 (32:58):
Honestly, we haven't completely thought out that story yet. I
think everybody is kind of trying to figure out how
to do, like, essentially, storage integration. Yeah, so I think
right now we just use standard AWS

Speaker 2 (33:10):
Policies and you mean like I am or like just.

Speaker 3 (33:12):
Bucket, Yeah, I am. And then there's a specific way
of doing this kind of cross.

Speaker 1 (33:17):
Yeah, the trust the trust policies between the trust policy. Yeah,
I mean it's always an interesting question for me because like,
if you're just doing this thing sort of one off
or I think the canonicals, like you're in some sort
of ABS management account and you're logging into a whole
bunch of organizational accounts, it's like straightforward. But as soon
as you're in this weird position where you need to
start accessing customer accounts, the concept of how do I

(33:40):
actually secure this process starts getting more and more complicated,
and like how do we actually store the data for
doing the access not even the actual data and making
sure like customers can set that up correctly because there
is not like a straight like I still want this,
And if anyone from ABS is listening to this, like
where is AWS oh off access to a WS accounts,

(34:02):
because you know that's that's really what I want here, Uh,
that would be like solve a lot of problems with
getting the trust policy right getting the customers to actually
enter it correctly. It's interesting because we actually do have
a sort of uh for for authors, we have this
oth hack that we have an application in the serveralist
application repository in AWS, so are a WUTH. Instead of

(34:26):
them logging in, is they go and deploy this application
of their account and it like automatically configures everything and
the first time is sort of this weird edge case
where we don't know which account it is. They haven't
maybe not logged in, and so it doesn't necessarily work
out of the box, which would be really nice if
they streamline. But yeah, the cross account access with the
external trust policies, like, it's not the most customer friendly thing.
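For readers who haven't set this up, here is a minimal sketch of the cross-account pattern under discussion: the customer attaches a trust policy with an ExternalId condition (which guards against the confused-deputy problem), and the vendor assumes the role via STS. The account IDs, role name, and ExternalId are hypothetical placeholders.

```python
# Sketch of cross-account S3 access via an IAM trust policy and STS.
# All identifiers below are hypothetical placeholders.
import json
import boto3

# Trust policy the *customer* attaches to a role in their own account.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111111111111:root"},  # vendor acct
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "cust-42-external-id"}},
    }],
}
print(json.dumps(trust_policy, indent=2))

# The *vendor* side: assume the customer's role to reach their S3 bucket.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::222222222222:role/VendorIngestRole",
    RoleSessionName="iceberg-ingest",
    ExternalId="cust-42-external-id",   # must match the trust policy condition
)["Credentials"]
customer_s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```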

Speaker 3 (34:47):
Yeah, it must be in their incentive to solve this problem, because,
you know, they are, they are the ultimate platform, right,
that wants...

Speaker 2 (34:57):
Like, I have a lot of thoughts here, you know.

Speaker 1 (35:00):
The one that I keep coming back to, actually, is:
I wonder if they're actually disincentivized to do this,
because it opens a new target for attack, for malicious
attackers to come in and fabricate applications. Like, you know,
take your company, for instance. You know, what if someone
threw up a fake Observe page and said, oh yeah, you
need to just send someone an email, say, hey,

(35:20):
your data, whatever, is going to expire, your connection is
going to expire, you need to go click on this
link and click approve in AWS? And then they go,
they get redirected to AWS. There's a screen there that says, hey,
Observe wants to access your data.

Speaker 2 (35:33):
Do you click approve or not?

Speaker 1 (35:34):
And you click approve, you don't look at the permissions there,
and the permissions say, like, full admin access to the
entire AWS account, and it's just good to go,
because people aren't reviewing what shows up in those screens.
And that's actually the fact that I threw out at
the beginning of the episode; it's literally the same problem.
Someone's making a phishing page, and people can be
phished this way. So I don't know that there's a solution there, honestly,

(35:57):
and that's why I can sort of understand not wanting
to do this. However, you know, my retort is going to
be: there's already CloudFormation templates, and there's already the
Serverless Application Repository, which means there's already ways
to phish people into giving, giving out full access to
their account by clicking a bunch of buttons, without
actually needing to enter their user or a password, or,

(36:19):
I mean, god forbid, they're actually having a password for
AWS in the first place, using their passkey to log
in and, you know, authenticating that way, and then clicking,
you know, deploy on the CloudFormation template. It's still a problem,
so I'm not buying it. Yeah, I agree. AWS,
the platform to rule them all: you would think OAuth
support would be something that it has, and it
just doesn't.

Speaker 3 (36:39):
Yeah, I think, I think they also kind of understand it.
But I think they are, like, with, like, Iceberg
and stuff, right, I think they're going to become the
de facto data platform for everyone at some point, right?
Like, if everybody is putting their data on
S3 directly through the Iceberg format, then you will
have all those use cases of, like, sharing it around,

(37:02):
like sharing it with other applications, and all the kinds
of things, right? So hopefully they can get their stuff
together and actually do something.

Speaker 2 (37:11):
They just came out with.

Speaker 1 (37:12):
I think it was the last year or so, they
came out with S3 Tables, which, it's nice to see
that there are still some improvements being
made to S3 over time, because it's not the best
user experience. It's not even the best developer experience, so,
you know, there are things left to be
desired there. I mean, my, my personal pick, if I
had to say one, is just, like, all those bad

(37:34):
security features: like, just default remove them from the console.
Like, no one should know about being able to make
a bucket public. Like, there's never a reason in this day
and age to ever have a public bucket. And,
you know, on that same token, like, I want, I
want to be able to name my bucket whatever I
want, and not, like, have to worry about conflicts with
other things.

Speaker 3 (37:56):
[inaudible]

Speaker 1 (37:57):
Like, I don't know what it's going to take here,
but no AWS, just come up with like another replacement service,
call it you know as far as or and just
and just you know, honestly blob private, blob storage.

Speaker 2 (38:10):
That's really all I need.

Speaker 1 (38:11):
And make it hot-swappable too, like literally
the exact same API. The only thing that has to
be different is just cut out all those things that
are these tons of foot guns, right? Like, especially bucket
name squatting is a huge one. Anyway, you can see
that I have a whole rant on this, so maybe
I'll stop there.

Speaker 3 (38:28):
You know, we obviously found out Snowflake is pretty
good at some things, but also pretty, pretty bad at
other things.

Speaker 2 (38:35):
What's, like, the number one bad thing?

Speaker 3 (38:38):
So, we have this, like, constant
battle with, with Snowflake in terms of how, like, how,
how their optimizer optimizes the query versus how our kind
of optimizer optimizes the query. Like, many cases, you know,
if you do, like, a join, right, like, you know,

(38:58):
the ordering of the join matters a lot, you know:
which side do you put at the build side, which side
do you put as the probe side. And in many cases
it's like, oh, Snowflake, could you just, like, let
us decide what the join order is? Because we
know more about the data than you, right? But
then you thought Snowflake would join it one way,
but then Snowflake would actually swap it and,

(39:20):
you know, ruin your, your entire efficiency. So we
would always wish, like, Snowflake would just provide us with
an API for us to just, you know, send them
the query plan: just don't change it, just
run it as is. But obviously that's not something that
they're willing to do.
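To see why the speakers care about join order, here is a toy hash join: an engine builds its in-memory hash table on the smaller side and streams (probes) the larger one, so an optimizer that swaps the sides can blow up memory and run time. The tables are invented for illustration.

```python
# Toy hash join illustrating build vs. probe side. The build side is hashed
# into memory; the probe side is streamed against it.
def hash_join(build_rows, probe_rows, key):
    """Join two row lists on `key`, hashing the build side."""
    table: dict = {}
    for row in build_rows:                       # small side -> hash table
        table.setdefault(row[key], []).append(row)
    for row in probe_rows:                       # large side -> streamed probe
        for match in table.get(row[key], []):
            yield {**match, **row}

users = [{"uid": 1, "name": "ada"}]              # small dimension table
logs = [{"uid": 1, "msg": "login"}] * 1_000_000  # large fact table
# Correct order: build on `users`, probe with `logs`. An optimizer that swaps
# them would try to hash a million log rows instead of one user row.
joined = hash_join(users, logs, "uid")
print(next(joined))
```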

Speaker 1 (39:38):
That's really surprising, actually, because I feel like, even as
far back as I remember, which isn't that far, with
MS SQL, there was always, like, you could pass
hints to the engine.

Speaker 3 (39:47):
Yeah, so now they're getting better at that. So recently,
very recently, they released a new feature. I think I
would like to think that we pushed for it.
But they recently released the directed join feature,
which is kind of, like, similar to, you know, join
hints, where you can, you can just say, hey,
use this as the left side, use this as the right,

(40:08):
don't change it. But we have waited for so long
for, for such a feature.

Speaker 2 (40:14):
Well, I guess your, your two percent usage

Speaker 1 (40:16):
only counts for something. Yeah, well, then I guess I'll
take this opportunity, then, to move over to picks.

Speaker 2 (40:22):
So what did you bring for us today?

Speaker 3 (40:24):
Yeah? So I bought this gadget for my kind of
summer traveling.

Speaker 4 (40:30):
It's, it's just, like, 3D. I can actually
show it, show you it. Yeah, so it's good. It's
just kind of like AR kind of glasses, you know,
not the fancy ones like the Vision Pro or whatnot. Basically,
what it does is it has, like,

Speaker 3 (40:48):
an HDMI, you know, connector to whatever devices you have,
and it has this, like, tiny kind of little projector
at the top of the glasses, just projecting
into a lens so you can kind of see. It
worked pretty surprisingly well. It does, like,
head tracking. I use it for work during travel,

(41:08):
because I, you know, obviously can't bring a large
monitor with me, and plugging that in, I can actually get,
like, you know, a thirty-inch monitor, like, floating in
front of my eyes.

Speaker 2 (41:18):
So you have to plug it in, though? It has
to be physically connected?

Speaker 3 (41:22):
Yes, that's one drawback, but I would say,
because it doesn't have any battery inside, it's powered purely
through the HDMI. Yeah, so I don't have to charge it.
So actually the experience is better than, like, having something
where I have to bring another charger.

Speaker 2 (41:40):
But do laptops support, uh, power over HDMI?

Speaker 3 (41:44):
Yes, surprisingly. It will even work with
my phone. Like, it drains the phone battery, really, yes,
but at least it works.

Speaker 2 (41:56):
Wow, that's amazing. So like what is the company?

Speaker 3 (41:59):
What is it called? XREAL, X-R-E-A-L. I
think the one that I got is called the XREAL
One Pro, or something like that.

Speaker 2 (42:07):
Okay, so I should just like get rid of my monitors.

Speaker 3 (42:10):
How much is it? Like, five hundred, six
hundred bucks after tax. Yeah, way cheaper than, like, a
Vision Pro.

Speaker 1 (42:15):
Yeah, I see. So I should just get rid of
my monitors, switch it over to this, and, like,
walk around with, like, an HDMI cable?

Speaker 3 (42:24):
It's not the most kind of like you know, socially
acceptable attire out there, but you know.

Speaker 1 (42:29):
The interesting thing is, like, I'm surprised that they
didn't have, like, a battery pack, honestly, connected with,
like, rechargeable batteries that you could, like, have in your
pocket, and then have a remote HDMI, because I feel
like that would allow you to walk around with it.
But having to take the glasses off in order to
perform other actions? I don't know.

Speaker 3 (42:49):
Yeah, I think with this, I kind of see, you know,
there's another way of doing kind of AR that's not
the Apple way, and yeah, it also works. Yeah.

Speaker 1 (43:00):
So this is... I'm going, I'm going to speak at
a conference. Instead of bringing my laptop, I'm bringing my
phone and my AR goggles. Okay. So
for me, what I brought is, uh, there's a movie
that I just really like, and every once in a
while I'll rewatch it, and so recently I had a
rewatch of The Shadow. It's the nineteen ninety-four one with

(43:21):
Alec Baldwin, and it was originally made after the radio
show from the thirties, and I don't know,
there's just, like, so many great quotable things in it.
And actually, if you haven't seen it... have
you seen it? It's the original inspiration for Batman. So,
it's like, Batman didn't come from nowhere. It was

(43:41):
actually The Shadow. I mean, not with Alec Baldwin,
that was much later. But it's great. It's like,
just imagine Batman with guns. And also in the movie,
he learns, like, telepathy and telekinesis. So, I mean,
it's just, honestly, it's like a much better Batman. It's
like everything Batman wishes he could be.

Speaker 3 (44:02):
Yeah, like, I never understood why Batman's not allowed to
just use guns.

Speaker 2 (44:05):
But yes, I mean, you know, it's interesting.

Speaker 1 (44:10):
There was actually a really long documentary on the making
of Batman: The Animated Series, which came out in, like,
the nineties, and how it was, like, so dark. And
at the time it was, like, the only dark animated
television show; everything else that's even remotely animated is, like,
sunshine and flowers. Like, getting everyone to sign off on

(44:31):
like, how, like, almost deeply depressing some of these episodes are,
or, like, the dark drama that is now just everywhere
on television, is, like, quite unexpected. So I can see,
like, even the upgraded guns is, like, just something else.
I mean, there's a lot of, there's a lot
of history about, like, the gun usage, even just
in canon.

Speaker 2 (44:51):
I don't know what the non-canon reasons are.

Speaker 1 (44:53):
But I suppose if you wanted guns, you know, just
go watch The Shadow.

Speaker 2 (44:57):
It's fantastic.

Speaker 1 (44:58):
I mean, it's, like, a bad movie, but I mean,
I actually really like it.

Speaker 3 (45:03):
I will, I will. Yeah, thanks for the, thanks for
the suggestion. Yeah, okay, well.

Speaker 2 (45:07):
Then that's the end of our episode.

Speaker 1 (45:09):
Thank you, Ang, for coming and talking about Observe,
and how to build an observability platform that has gotten quite
far, using such a high amount of compute, and
Snowflake as a database.

Speaker 2 (45:19):
But that's been really interesting and I

Speaker 1 (45:21):
hope we'll be back next week with another episode.