Episode Transcript
Speaker 1 (00:00):
Welcome to the Deep Dive. Today we're diving into, well,
a really fundamental challenge in computing these days: handling that
absolute flood of data, you know, the high-speed stuff.
Speaker 2 (00:10):
Yeah, it's relentless, I think. Social media, IoT sensors everywhere,
web clicks, it just keeps coming.
Speaker 1 (00:17):
And it's not just the sheer volume, right? It's how
bursty it is. You get these sudden peaks.
Speaker 2 (00:22):
Exactly. That burstiness is the real killer. Our sources show
that building systems that don't just fall over when you
get a sudden rush, like, say, everyone trying to check
out a bike in a smart city right at five p.m.,
that needs special thinking. Traditional approaches just won't...
Speaker 1 (00:37):
...cut it, right. So our mission today is basically to
give you a solid shortcut to understanding these kinds of architectures.
We'll use the Amazon Kinesis family, you know, Data Streams,
Firehose, Data Analytics, Video Streams, kind of as our
case study.
Speaker 2 (00:50):
Yeah. The goal is, by the end, you'll really get
the difference between these tools and, more importantly, the core
ideas behind making these systems scalable and, crucially, fast.
Speaker 1 (01:00):
Okay, let's kick off with the big why. Why is
speed so important in data analysis now? Why the big rush?
Speaker 2 (01:05):
Well, it's become a strategic must-have, really. We've moved
way beyond just looking backwards, like those old batch
jobs running overnight giving you a report...
Speaker 1 (01:14):
...on yesterday. Right, a snapshot of the past?
Speaker 2 (01:17):
Exactly. Now it's all about immediate insights. The freshest possible data
lets you make the best decisions right now. Perishable insights, basically.
Speaker 1 (01:25):
I've heard about this thing, the OODA loop: observe, orient, decide, act.
It comes from military strategy, I think. How does real-time
data fit in there?
Speaker 2 (01:36):
It fits perfectly. Real-time analytics fundamentally shrinks that observe window.
Something happens, you know about...
Speaker 1 (01:42):
...it instantly, which means you can orient, decide, and act
much faster.
Speaker 2 (01:46):
Precisely. It gives you a huge edge. Think about fraud detection,
or optimizing ad placements on the fly, or, in that
smart city example, rerouting maintenance crews instantly based on real-time
bike usage patterns.
Speaker 1 (02:00):
Okay, so that's the why. Now the how. A single
server obviously can't cope, so we use distributed systems, lots
of servers networked together. But doesn't that just explode the
complexity? Failure modes everywhere.
Speaker 2 (02:10):
Oh, absolutely. It gets complex fast, but engineers have figured
out ways to manage it. Two main patterns really help
tame that initial complexity...
Speaker 1 (02:19):
...beast. Okay, what are they?
Speaker 2 (02:21):
First, standardized interfaces, things like APIs. This really paved the
way for microservice architectures.
Speaker 1 (02:28):
Ah, so breaking the big system down into smaller, independent
services.
Speaker 2 (02:33):
Exactly. It lets different teams own their piece and, you know,
iterate faster without stepping on each other's toes.
Speaker 1 (02:39):
It sounds a bit like Conway's Law in action, the
system design mirroring the org structure.
Speaker 2 (02:45):
That's a great way to put it. Yeah, small independent
teams tend to build small independent services. Those standardized interfaces
are what make it work smoothly.
Speaker 1 (02:53):
Okay, interfaces help services talk. But what about failures? If
one microservice crashes while it's processing data, how do
you stop it from dragging down everything else that depends
on it?
Speaker 2 (03:03):
Right, that's the cascading failure problem. And that's where the
second pattern comes in: decoupling. You use an asynchronous message broker.
Speaker 1 (03:10):
A message broker? Like a middleman for data?
Speaker 2 (03:14):
Sort of, yeah. It acts as a stable buffer, an invariant in
the system. If a downstream service, a consumer, fails, the
broker just holds onto the incoming messages. It builds up
a backlog.
Speaker 1 (03:25):
So the producer can keep sending data unaware of the
downstream problem.
Speaker 2 (03:30):
Exactly. The failure is contained. Once the consumer service comes
back online, it just starts processing the backlog from the broker.
No data lost, no cascading crash.
Speaker 1 (03:39):
That makes sense. The broker isolates the fault. But if
that backlog keeps growing, that sounds like trouble brewing. What
metrics do we watch?
Speaker 2 (03:47):
The key capacity metric is transactions per second, TPS. That's
usually limited by either the number of records per second
or the total data size, like megabytes per second.
Speaker 1 (03:56):
Okay, TPS. But what's the danger signal in the real data?
Speaker 2 (04:00):
Your signal is back pressure. That's when your producers are
sending data faster than your consumers can process it: input
TPS is consistently higher than output TPS.
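To make that concrete, here's a minimal Python sketch of the back-pressure signal just described: compare input TPS against output TPS and flag when a backlog is building. The numbers and function name are purely illustrative.

```python
# Minimal back-pressure check: producers' input rate vs. consumers' output rate.
# The metric values here are made up for illustration.

def backpressure_ratio(input_tps: float, output_tps: float) -> float:
    """Return input/output ratio; anything above 1.0 means a backlog is building."""
    return input_tps / max(output_tps, 1e-9)

# Example: bikes report 12,000 records/s but consumers only handle 9,000 records/s.
ratio = backpressure_ratio(input_tps=12_000, output_tps=9_000)
if ratio > 1.0:
    print(f"Back pressure (ratio {ratio:.2f}): throttle producers or scale consumers")
```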
Speaker 1 (04:08):
So back to the bike sharing: rush hour hits, every
bike starts pinging its location like crazy. How do you
handle that surge, that back pressure, without the system just drowning?
Speaker 2 (04:18):
You've got a few strategies, ranging from, let's say, gentle
to pretty aggressive. First, you can throttle the producers: tell
the bikes, hey, maybe only report your location every minute
instead of every second during peak times.
Speaker 1 (04:30):
Reduce the input rate. Makes...
Speaker 2 (04:32):
...sense. Or you can scale the consumers. If you're in
the cloud, you might automatically spin up more instances of
your processing application to handle the load.
Speaker 1 (04:40):
Add more workers, right.
Speaker 2 (04:42):
You can also use bigger buffers in the message broker
itself to just absorb temporary spikes. But the most drastic
option is to just drop messages.
Speaker 1 (04:52):
Drop data? That sounds risky.
Speaker 2 (04:53):
It is. You'd only ever do that for non-critical data.
Like, maybe dropping some routine sensor readings is okay if
the system's overloaded, but you'd never drop something like a
customer order completion message. Never.
Speaker 1 (05:04):
Okay, got it. Critical versus non-critical. Let's dig into
the stream itself. What are the absolute basic parts of
any streaming system, the four main components?
Speaker 2 (05:14):
Yeah, you can break it down pretty simply. First you
have the producers. These are the apps sending the data,
our bike sensors, the user's mobile app checking out a bike,
that kind of thing.
Speaker 1 (05:23):
Okay, producers send data.
Speaker 2 (05:24):
Then you have the messages, or records. That's the actual
data payload, usually small, maybe a few kilobytes,
often capped around a megabyte. Each record has the data
itself and a header, usually with a unique ID the
broker assigns.
Speaker 1 (05:38):
Got it, producers messages.
Speaker 2 (05:40):
Third is the stream, or the broker. That's the component
doing the buffering, like Kinesis itself...
Speaker 2 (05:45):
...or Kafka, the fault isolator we talked about. Right. And
finally you have the consumers, the applications that pull data
from the stream and actually do something with it: analysis, storage, triggering alerts...
Speaker 1 (05:55):
...whatever. Producers, messages, stream, consumers. Okay, now latency. You mentioned
speed is critical, and the sources stress we need to
be really precise here. It's not just "latency"; there are two
specific measures.
Speaker 2 (06:09):
Yes, an absolutely critical distinction. There's propagation delay: that's simply the
time it takes from the moment a producer writes a
message to the moment a consumer reads it. Raw transmission speed...
Speaker 1 (06:21):
...basically. Okay, write-to-read time. What's the other one?
Speaker 2 (06:24):
The other and often more telling metric is the age
of the message. This measures how long a message has
actually been sitting in the stream before a consumer picked
it up.
Speaker 1 (06:35):
Ah, so propagation delay could be low, but the message
might still be old if the consumers are lagging.
Speaker 2 (06:41):
Exactly. If that average message age starts creeping up, that's your
big warning sign. It means your consumers can't keep up,
the backlog is growing, and performance problems are right around
the corner, even if the network itself is fast.
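As a rough illustration of "age of message" versus raw propagation delay, here's a small boto3 sketch that reads a batch of records and computes how long each one sat in the stream, using the arrival timestamp Kinesis attaches to every record. The stream name and shard ID are hypothetical.

```python
# Sketch: measuring message age for a Kinesis Data Streams consumer.
import boto3
from datetime import datetime, timezone

kinesis = boto3.client("kinesis")

iterator = kinesis.get_shard_iterator(
    StreamName="bike-telemetry",              # hypothetical stream
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
now = datetime.now(timezone.utc)
for record in resp["Records"]:
    # Age = how long the record waited in the stream before this read.
    age_s = (now - record["ApproximateArrivalTimestamp"]).total_seconds()
    print(f"sequence {record['SequenceNumber']}: age {age_s:.1f}s")
```

CloudWatch surfaces the same idea for KDS consumers as the GetRecords.IteratorAgeMilliseconds metric; a steadily climbing value is the "consumers can't keep up" warning sign.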
Speaker 1 (06:52):
That's a really important distinction. Now, failure modes: distributed systems,
retries, network glitches. It means data streams generally have
this at-least-once delivery guarantee, right? A message might
show up more than once.
Speaker 2 (07:04):
That's the standard reality, yes, because ensuring exactly-once delivery
across a distributed system is incredibly hard. Most brokers guarantee
at least once. The producer might retry sending if it
doesn't get confirmation; the consumer might process a message and then crash
before confirming. Duplicates happen, and that...
Speaker 1 (07:23):
Sounds potentially disastrous. If you're, say, processing payments or updating
inventory accounts, double counting.
Speaker 2 (07:30):
It would be disastrous, which is why the responsibility for
handling duplicates shifts downstream to the consumer or the final
destination system. Those systems must be designed to be idempotent,
meaning processing the same message multiple times has the
exact same effect as processing it just once.
Speaker 1 (07:48):
How do you achieve that?
Speaker 2 (07:50):
Two main ways. Either the operation itself is naturally idempotent,
like setting a value as opposed to adding to a count,
or you use that unique ID that's typically in the
message to explicitly check if you've already processed this
specific message before. De-duplication, basically.
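Here's a minimal sketch of that idempotent-consumer idea, assuming each record carries a unique ID. The in-memory set stands in for whatever durable de-duplication store (a database table, a conditional write) a real system would use; the field names are hypothetical.

```python
# Sketch: de-duplicating consumer so reprocessing a message has no extra effect.
processed_ids: set[str] = set()   # stand-in for a durable de-dup store

def apply_business_logic(record: dict) -> None:
    # Naturally idempotent example: *set* the bike's last known location,
    # rather than incrementing a counter (which would double-count on retries).
    print("updating location for", record["bike_id"])

def handle_record(record: dict) -> None:
    record_id = record["id"]          # unique ID assumed to travel with the message
    if record_id in processed_ids:
        return                        # duplicate delivery: same effect as processing once
    apply_business_logic(record)
    processed_ids.add(record_id)
```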
Speaker 1 (08:04):
Okay, so the consuming application needs to be smart
about duplicates. What about errors within a message? If a
consumer gets a record it just can't process?
Speaker 2 (08:15):
That leads to a particularly nasty problem called the poison pill.
Speaker 1 (08:19):
Poison pill? Sounds bad.
Speaker 2 (08:22):
It is, because message streams usually guarantee the order of messages, at
least within a certain context, like all messages for a
specific bike. You need to process the unlock event before
the return event.
Speaker 1 (08:34):
Makes sense. Order matters.
Speaker 2 (08:35):
But if one message in that sequence causes an error
in the consumer, maybe it's malformed, bad data, and the
consumer keeps retrying and failing on that specific message, it
gets stuck. Completely stuck. That single poison pill
record blocks the processing of all the perfectly valid records
sitting behind it in the stream for that same bike,
or whatever the ordering context is. The whole sequence grinds
to a halt because of one bad message.
Speaker 1 (09:00):
Wow, so a single bit of bad data could effectively
block a whole partition of the stream, causing the latency
for everything behind it to skyrocket.
Speaker 2 (09:09):
It really highlights why robust error handling and potentially dead
letter cues for problematic messages are so crucial in consumer design.
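To show what that consumer-side safety valve might look like, here's a hedged sketch: retry a failing record a few times, then park it in a dead-letter queue (an SQS queue here, with a hypothetical URL) so the valid records behind it keep flowing.

```python
# Sketch: bounded retries plus a dead-letter queue to neutralise poison pills.
import json
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/bike-events-dlq"  # hypothetical
MAX_ATTEMPTS = 3

def handle_record(record: dict) -> None:
    ...  # normal processing; raises an exception on malformed data

def process_with_dlq(record: dict) -> None:
    last_error = ""
    for _attempt in range(MAX_ATTEMPTS):
        try:
            handle_record(record)
            return
        except Exception as exc:      # broad catch is deliberate in this sketch
            last_error = str(exc)
    # Give up on this one record instead of blocking the whole shard behind it.
    sqs.send_message(
        QueueUrl=DLQ_URL,
        MessageBody=json.dumps({"record": record, "error": last_error}),
    )
```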
Speaker 1 (09:18):
Okay, that sets the stage really well. Let's move on
to the specific tools Amazon offers with Kinesis to handle
all this at scale, starting with the foundation, Kinesis Data
Streams, or KDS. When do you reach for this?
Speaker 2 (09:30):
KDS is your go-to when you need maximum control,
high durability, and really low latency. We're talking sub-second processing
for your own custom applications that read directly from the stream.
Speaker 1 (09:42):
And the core unit of KDS is the shard, right? What does that actually do?
what does that actually do?
Speaker 2 (09:46):
The shard is the fundamental unit of capacity and parallelism
in KDS. Each shard has fixed limits on how much
data it can ingest, typically one megabyte per second or
one thousand records per second, whichever comes first, and also
on how much data can be read out.
Speaker 1 (10:00):
If I need more capacity, I just add more shards?
pretty much.
Speaker 2 (10:03):
Yeah, you scale the stream by scaling the number of shards.
Speaker 1 (10:05):
Okay, and if we're tracking millions of bike journeys, how
does KDS decide which shard a specific bike's data goes
to, and keep it in order?
Speaker 2 (10:13):
That's determined by the partition key. When your producer application
sends a record to KDS, it includes a partition key.
This could be the bike ID for instance.
Speaker 1 (10:22):
And KDS uses that key to route...
Speaker 2 (10:25):
...the data. Exactly. KDS hashes the partition key and uses
the result to assign the record to a specific shard.
All records with the same partition key always go to
the same...
Speaker 1 (10:35):
...shard. Ah, and that's how it guarantees order within that key:
all data for bike one twenty-three lands on, say,
shard five, in the correct sequence?
Speaker 2 (10:45):
Precisely. But this also highlights a potential pitfall, the hot partition
key, or hot shard. If one partition key, like a
super popular bike or a single central sensor, sends way
more data than others, its shard can get overwhelmed while
others are idle.
Speaker 1 (10:58):
So choosing a good partition key that distributes data
evenly is critical for performance, maybe not just the bike ID
if usage is very uneven.
Speaker 2 (11:06):
Right, sometimes using a more random key or combining fields
is necessary to spread the load effectively.
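Here's what that looks like from the producer side, as a hedged boto3 sketch: the partition key is the bike ID, so every event for one bike lands on the same shard in order. The stream name and payload fields are hypothetical.

```python
# Sketch: producing to Kinesis Data Streams with a partition key.
import json
import boto3

kinesis = boto3.client("kinesis")

def send_bike_event(bike_id: str, lat: float, lon: float) -> None:
    kinesis.put_record(
        StreamName="bike-telemetry",                       # hypothetical stream
        Data=json.dumps({"bike_id": bike_id, "lat": lat, "lon": lon}).encode(),
        PartitionKey=bike_id,    # same key -> same shard -> per-bike ordering
    )

send_bike_event("bike-123", 53.349, -6.260)

# If one key turns "hot", a common mitigation is to spread it, e.g.
# PartitionKey=f"{bike_id}#{suffix}", accepting that per-key ordering is lost.
```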
Speaker 1 (11:12):
Now, reading the data. Let's say we have multiple apps
wanting that bike data, one for maintenance alerts, one for
route analysis. How do they read efficiently? You mentioned enhanced
fan-out, EFO.
Speaker 2 (11:21):
Yes. So standard KDS consumers work on a pull model:
they poll the shard for data, but each shard has
a total read capacity limit, about two megabytes per second,
that all standard consumers share.
Speaker 1 (11:32):
So if you have lots of consumers polling the same shard,
they start slowing each other down?
Speaker 2 (11:36):
Exactly, they compete for that shared bandwidth. Enhanced fan-out, EFO,
solves this. With EFO, each registered consumer gets its own
dedicated throughput limit per shard, delivered via a push model from Kinesis.
Speaker 1 (11:49):
So EFO consumers don't interfere with each other.
Speaker 2 (11:52):
Correct. It allows multiple real time applications to consume from
the same stream at high speed independently. If you need
scale on the consumer side, EFO is often essential.
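As a small illustration, registering an EFO consumer is a one-call operation; here's a hedged boto3 sketch with a hypothetical stream and consumer name. The actual record delivery then happens over the SubscribeToShard push API, usually handled by the Kinesis Client Library or a managed integration rather than hand-rolled code.

```python
# Sketch: registering an enhanced fan-out consumer on a KDS stream.
import boto3

kinesis = boto3.client("kinesis")

stream_arn = kinesis.describe_stream_summary(StreamName="bike-telemetry")[
    "StreamDescriptionSummary"
]["StreamARN"]

consumer = kinesis.register_stream_consumer(
    StreamARN=stream_arn,
    ConsumerName="maintenance-alerts",   # one registration per consuming application
)
print(consumer["Consumer"]["ConsumerARN"])  # used when subscribing to shards
```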
Speaker 1 (12:02):
Okay, KDS for flexible, low-latency custom apps. What about
Kinesis Data Firehose, KDF? People often call it the
easy loader.
Speaker 2 (12:10):
Why? Because it is. KDF is fully managed, totally serverless,
and its whole purpose is to make it super simple
to capture streaming data and load it into specific destinations.
Think S3 data lakes, Redshift data warehouses, Elasticsearch
for search, that kind of thing.
Speaker 1 (12:25):
How easy is easy?
Speaker 2 (12:26):
Often zero code is needed for the delivery part. You
configure Firehose in the console, point it at
your stream source, which could even be KDS, tell it
where to send the data, and it handles the scaling, buffering, delivery,
and retries automatically.
Speaker 1 (12:40):
Does it do anything else like transform the data?
Speaker 2 (12:43):
Yes, you can. It has built-in capabilities for data
format conversion, like JSON to Parquet, and it can also
invoke an AWS Lambda function to perform custom inline transformations,
basic ETL, before the data lands in the destination.
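For a feel of that inline transformation hook, here's a minimal sketch of a Firehose transformation Lambda: Firehose passes in a batch of base64-encoded records, and the function returns each one marked Ok, Dropped, or ProcessingFailed. The enrichment field added here is hypothetical.

```python
# Sketch: AWS Lambda handler used as a Kinesis Data Firehose transformation.
import base64
import json

def lambda_handler(event, context):
    output = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        payload["source"] = "bike-fleet"            # trivial inline enrichment
        output.append({
            "recordId": rec["recordId"],
            "result": "Ok",                         # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode((json.dumps(payload) + "\n").encode()).decode(),
        })
    return {"records": output}
```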
Speaker 1 (12:56):
Okay, KDS gives us speed and flexibility, KDF gives us
ease of use for loading data sinks. What's the catch
with KDF? What's the trade-off?
Speaker 2 (13:05):
The main trade-off is latency, because Firehose buffers
data internally, often for several seconds or even minutes, to
batch it efficiently for delivery to the destination.
Speaker 1 (13:14):
Ah, it's not truly real time like KDS...
Speaker 2 (13:16):
...can be, exactly. It optimizes for delivery throughput and cost
effectiveness by batching, not for sub-second end-to-end latency.
So if you need that immediate sub-second processing, KDS is
the choice. If your goal is reliable, easy loading into
S3 or Redshift and near real time is good enough,
KDF is often way simpler.
Speaker 1 (13:34):
Makes sense, speed versus simplicity. Now, analysis: Kinesis Data Analytics, KDA.
How does this fit in, analyzing the stream as it flows?
Speaker 2 (13:44):
Precisely. KDA is also serverless, and it lets you run
continuous queries or applications against your streaming data, either from
KDS or KDF. It offers two different...
Speaker 1 (13:53):
...engines for this. Okay, what are the engine choices?
Speaker 2 (13:55):
First is a SQL engine. If you're familiar with standard SQL,
you can use ANSI SQL to query the stream. KDA
presents the stream data within windows, like tumbling windows over the
last five minutes, or sliding windows. So...
Speaker 1 (14:08):
...you could write a query like "select the count from the bike
stream where status is in use, group by location district"
over a rolling time...
Speaker 2 (14:16):
...window. Exactly, that kind of thing. It treats the incoming
stream like a continuously updating table you can query. Very
powerful for many common analytics tasks, and accessible if you
know SQL.
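To give a flavour of that, here's roughly what a five-minute tumbling-window count looks like in the KDA SQL dialect; the stream and column names are hypothetical, and in practice the SQL is pasted into the Data Analytics console rather than run from Python. It's wrapped in a Python string only to keep one language across the examples in this transcript.

```python
# Illustrative KDA-style SQL: count in-use bikes per district over 5-minute windows.
KDA_TUMBLING_WINDOW_SQL = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    district      VARCHAR(32),
    in_use_count  INTEGER
);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM "district", COUNT(*) AS in_use_count
  FROM "SOURCE_SQL_STREAM_001"
  WHERE "status" = 'IN_USE'
  GROUP BY "district",
           STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '5' MINUTE);
"""
print(KDA_TUMBLING_WINDOW_SQL)
```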
Speaker 1 (14:27):
And the second engine? You mentioned two.
Speaker 2 (14:29):
The second is based on Apache Flink. This gives
you much more power and flexibility. You typically write your
applications in Java or Scala.
Speaker 1 (14:35):
What can Flink do that the SQL engine can't?
Speaker 2 (14:38):
Well, Flink apps can handle much more complex logic, maintain
sophisticated state across events, and, importantly, can connect to data
sources and sinks outside of just Kinesis and AWS, like
maybe an on-premises Kafka cluster or a custom database.
Speaker 1 (14:53):
And I remember reading Flink is key for achieving exactly-once
processing semantics. How does it manage that?
Speaker 2 (14:59):
Yes, that's a major advantage. Flink achieves exactly once by
carefully managing the application's internal state and periodically saving checkpoints
of that state to durable storage, typically S3, using a
state backend like RocksDB. If there's a failure, it can
restore from the last successful checkpoint and resume processing without
missing data or processing duplicates.
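As a tiny sketch of the checkpointing idea (using PyFlink rather than Java or Scala, just to keep one language across these examples): you tell the environment how often to snapshot state, and exactly-once is the default checkpointing mode. On Kinesis Data Analytics for Apache Flink much of this is managed by the service, so treat this as illustrative.

```python
# Sketch: enabling periodic checkpoints in a Flink application (PyFlink).
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Snapshot application state every 60 seconds; on failure the job restarts
# from the last completed checkpoint instead of reprocessing from scratch.
env.enable_checkpointing(60_000)
```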
Speaker 1 (15:19):
So SQL for simpler windowed queries, Flink for complex, stateful,
potentially exactly-once processing and connecting anywhere.
Speaker 2 (15:26):
That's a good summary.
Speaker 1 (15:26):
Yeah. Okay, one more service: Kinesis Video Streams, KVS. This
seems more specialized, video and audio.
Speaker 2 (15:32):
Yeah, KVS is specifically designed for handling time-encoded data streams,
primarily video and audio, but potentially other time series data too,
like radar or lidar feeds.
Speaker 1 (15:42):
How does that play out in our smart city bike scenario?
Speaker 2 (15:44):
KVS actually addresses two pretty distinct use cases there. The first
is real-time, low-latency interaction. Imagine a user wanting
to see a live video feed of a bike stand
on their phone to check if bikes...
Speaker 1 (15:58):
...are actually available. Okay, live streaming.
Speaker 2 (16:00):
KVS provides capabilities, often using WebRTC standards, for that
kind of low-latency, peer-to-peer or small-group
video streaming. It manages the signaling channels and things like
STUN and TURN servers needed to establish the connections. So
that's KVS WebRTC for live use.
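For the signaling side, here's a hedged boto3 sketch of looking up the WebRTC endpoints for a KVS signaling channel (the channel name is hypothetical); the browser or mobile app then uses these, plus STUN/TURN, to negotiate the actual media connection.

```python
# Sketch: fetching WebRTC signaling endpoints for a KVS signaling channel.
import boto3

kvs = boto3.client("kinesisvideo")

channel_arn = kvs.describe_signaling_channel(ChannelName="bike-stand-42")[
    "ChannelInfo"
]["ChannelARN"]

endpoints = kvs.get_signaling_channel_endpoint(
    ChannelARN=channel_arn,
    SingleMasterChannelEndpointConfiguration={
        "Protocols": ["WSS", "HTTPS"],
        "Role": "VIEWER",   # e.g. the phone app viewing the live bike-stand feed
    },
)
for ep in endpoints["ResourceEndpointList"]:
    print(ep["Protocol"], ep["ResourceEndpoint"])
```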
Speaker 1 (16:15):
Got it. And the second use case?
Speaker 2 (16:18):
The second is more about storage, playback, and analysis. This
is KVS storage. You stream video from, say, security cameras
at the bike stands into KVS for durable storage.
Speaker 1 (16:27):
And then what, just store it?
Speaker 2 (16:29):
You can store it for compliance or later review, but
the real power comes from integrating it with AI and
ML services. For instance, you could feed that stored video
stream into Amazon Rekognition.
Speaker 1 (16:38):
Ah, for analysis of what?
Speaker 2 (16:39):
Like running facial recognition to detect known vandals near
the bike stands in real time, or maybe object detection
to count bikes automatically or spot obstructions. Rekognition analyzes the
frames ingested via KVS and can trigger alerts or other actions.
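A hedged sketch of that wiring: Rekognition Video reads frames from a KVS stream via a stream processor and writes face-search matches to a Kinesis data stream you consume downstream. Every name, ARN, and the face collection here is hypothetical.

```python
# Sketch: connecting a KVS stream to Amazon Rekognition Video face search.
import boto3

rekognition = boto3.client("rekognition")

rekognition.create_stream_processor(
    Name="bike-stand-face-search",
    Input={"KinesisVideoStream": {"Arn": "arn:aws:kinesisvideo:eu-west-1:123456789012:stream/bike-stand-42/1700000000000"}},
    Output={"KinesisDataStream": {"Arn": "arn:aws:kinesis:eu-west-1:123456789012:stream/rekognition-matches"}},
    RoleArn="arn:aws:iam::123456789012:role/RekognitionKvsAccess",
    Settings={"FaceSearch": {"CollectionId": "known-vandals", "FaceMatchThreshold": 90.0}},
)
rekognition.start_stream_processor(Name="bike-stand-face-search")
```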
Speaker 1 (16:54):
So KVS handles both the live interaction piece and the
ingest-for-later-analysis piece for video...
Speaker 2 (16:59):
Data correct two sides of the video coin.
Speaker 1 (17:02):
That's a really helpful tour of the whole Kinesis family.
Let's try to quickly recap the core job of each
one for the listener.
Speaker 2 (17:07):
Okay. KDS, that's your foundational low-latency, high-throughput stream
for custom apps needing maximum flexibility, think sub-second speed.
Right, then KDF, the easy serverless loader, great for getting streaming
data into S3, Redshift, or Elasticsearch without much code,
but it sacrifices that sub-second latency for batching efficiency. KDA, the
real-time analytics engine: use its SQL interface for simpler
windowed queries, or the Apache Flink engine for complex stateful
processing, exactly-once semantics, and broader connectivity.
Speaker 1 (17:39):
And finally, KVS.
Speaker 2 (17:41):
The specialist for video and time-encoded data. It handles both
low-latency WebRTC streaming for live interaction and durable
ingestion for storage, playback, and AI/ML analysis like Rekognition.
Speaker 1 (17:53):
Perfect summary. Now, before we wrap up, there's a crucial
security point that underpins all of this distributed complexity: how
to secure all these producers and consumers talking to Kinesis.
Speaker 2 (18:04):
Is super important. The absolute standard best practice is use
IMA roles. Do not embed long term access keys or
secrets directly into your producer or consumer applications. Why roles specifically,
because IM roles provide temporary, short lived security credentials that
are automatically generated and rotated by AWS. Your application running
(18:24):
on EC two or Lambda or wherever assumes a role
gets temporary permissions, does its work, and those credentials expire.
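A small sketch of what "no long-term keys in code" means in practice: on EC2, Lambda, ECS, and so on, boto3 picks up the attached role's temporary credentials automatically, and the explicit STS call below only makes the mechanics visible. The role ARN is hypothetical.

```python
# Sketch: relying on IAM roles and short-lived credentials instead of static keys.
import boto3

# Preferred: no credentials anywhere in code; the instance/function role is used.
kinesis = boto3.client("kinesis")

# What happens under the hood: short-lived credentials that expire and rotate.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/BikeTelemetryProducer",  # hypothetical
    RoleSessionName="producer-session",
)["Credentials"]
print("temporary credentials expire at:", creds["Expiration"])
```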
Speaker 1 (18:31):
So even if an application instance gets compromised somehow, the
attacker doesn't get permanent keys.
Speaker 2 (18:36):
Exactly, the potential blast radius is dramatically reduced compared to static,
long lived keys being compromised. It's all about implementing least
privileged access using temporary role based credentials. It's fundamental to
secure cloud architecture.
Speaker 1 (18:53):
That's a critical takeaway. Okay, let's finish with that provocative
thought. We talked about at-least-once delivery and the
need for idempotency. How should listeners think about that?
Speaker 2 (19:02):
Right. We established that getting duplicates is just a fact
of life in most high-performance, resilient streaming systems, like
those built on Kinesis Data Streams. You have to design
for it.
Speaker 1 (19:11):
So the provocative thought...
Speaker 2 (19:12):
...is to think about how profoundly that changes how you have
to design your applications. You can never assume an
input, an event, a message, will arrive only once. It
might arrive twice or even more times, though that's rare. So
how does that constraint, the absolute certainty that you might
get duplicates, ripple through your entire design? How does it
affect your database schemas, your transaction logic, your state management?
Realizing that you must build idempotent consumers isn't just a detail;
it's a fundamental shift in mindset needed to truly master
streaming architectures.
Speaker 1 (19:44):
A great point to ponder: designing for repeats, not just requests.
Thanks for breaking all that down.
Speaker 2 (19:49):
My pleasure. It's complex, but hopefully a bit clearer now.