
March 7, 2025 56 mins

Is running Kafka on-prem different from running it in the cloud? You’ll find out from Elad Eldor’s years of experience running, tuning, and troubleshooting Kafka in production environments. Elad didn’t set out to learn Kafka, but he kept asking questions and was given the opportunity to dive deep into system performance. He not only knows what all the columns of iostat mean, he knows what his customers want. Make sure to subscribe to this topic on all your consumers.


Show Highlights

(0:00) Intro
(9:30) Why do people use Kafka?
(15:00) Learning cloud vs. on-prem
(18:30) Kafka vs. Linux troubleshooting
(27:00) Scaling clusters
(38:00) How to get started

Sponsor

https://www.softwaredefinedtalk.com

Sponsor FAFO at https://fafo.fm/sponsor


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
The correlation between RAM and disk is one of the things
that are so simple to understand, but so hard to grasp.
Welcome to Fork Around and Find Out, the podcast about
building, running, and maintaining software and systems.

(00:31):
Welcome back to Fork Around and Find Out.
I am Justin Garrison.
And with me as always is Autumn Nash.
Today on the show, we are going to stream process this Kafka
queue with Elad Eldor, the author of Kafka Troubleshooting
in Production and a DataOps engineer at Unity.
Welcome to the show, Elad.
Thank you for having me.
We were just going into details before recording this about, uh, not only

(00:54):
how cool, like, your name sounds like a great character in any sort of, like, cool fantasy book.
Yeah, it's, it's just fantastic.
But then like the fact that you're working with Kafka and
like data streams also sounds a little magical sometimes.
Like this sounds like a spell that you're casting.
To do cool things to all the data and like stream it.
Like I could just see like a book cover with like a wizard like streaming data.

(01:16):
Like it'd be so cool.
The book cover is, uh, it looks like a tree.
So yeah, it streams somewhere into the earth or something like that.
Love it.
Let's jump straight in. Tell me about the book. The title
of the book is Kafka Troubleshooting in Production:
Stabilizing Kafka Clusters in the Cloud and On-premises, and the
first question I have is: why are cloud and on-premises different?

(01:37):
That's a big, big question.
So we
got a lot of time
Yeah
To answer this, I'll start with the interesting story of how I, uh, how
I ended up working with Kafka, because the intention of the
book is for people like me, people who had no experience with Kafka or

(02:00):
Linux, by the way, to get into troubleshooting. I started, like, uh, 2016,
uh, working with Spark and Kafka.
I was a backend developer working at an on-prem company that
sells whole machines, not only the software, to clients in the defense industry,

(02:21):
uh, clients. And all things worked, but the major issues were with Kafka.
So after some time I had enough of, uh,
going to DevOps, trying to get some help.
They also didn't know Kafka very well.
Uh, so I had to dig in myself, into Kafka. And it turned out that... like, I had some 10 years of experience

(02:42):
with open sources, and nothing was as tough as Kafka, and nothing was so pivotal
as Kafka. It's in the middle of everything; a small problem just spans over
the whole pipeline. So then I started getting into troubles, like, one by one,
understanding how to solve them, learning Linux on the way. I didn't only, uh,

(03:07):
try to understand issues in Kafka. It was also other clusters: I became the SRE in
charge of, uh, Presto (now Trino) clusters and, uh, HDFS and Spark, like, on-prem.
No cloud, no one.
Three distributions of Ambari that bundled together Kafka,

(03:28):
Spark, and whatever, and no support from anyone, only logs.
And customers abroad, going to India and Uzbekistan, all sorts
of weird places, which I really like, but they're still weird.
Trying to, like, go over logs, uh, in the Persian Gulf, et cetera.
So, and I took Brendan Gregg's book to every place,

(03:48):
and it was just like all the solutions
were in this book, and he described things in such a way that I
fell in love with Linux tuning and understanding bottlenecks.
And again, it was on prem.
So getting back to your question: on-prem, you don't have
any assistance from anyone, but you can look at all the logs.

(04:13):
And one of the problems that doesn't occur in
the cloud is that the hardware just fails.
You have, uh, RAM DIMMs that get defective,
uh, disks start to burn out slowly or fast, and even one
disk in one cluster, uh, can just destroy the whole cluster.

(04:33):
Those are the cons. The pros are that you can create your own cluster,
whatever RAM you want, however much storage you want, whichever
disk type you want, and, like, build your own little cloud.
That's the pro.
But, uh, so it was a mixture both of like having access to

(04:54):
everything, but seeing failures that you don't see in the cloud,
but having the freedom to design your own cluster, or at least try
to design your own cluster because you need really to estimate.
I know that you just moved on from EKS to a
company that works with EKS clusters on prem.

(05:16):
Yeah.
The issue is that, like, designing a cluster that will fit
into the budget and not cost you a lot of money is something
that is very hard to convince people of, and very hard also to estimate.
Because traffic today is X; tomorrow it's 2X or half
X. So these are challenges I didn't face in the cloud.

(05:38):
It's the best school for becoming an SRE.
In every open source, you have full accessibility to all the tools of Linux.
Like, you can see all the metrics in the disk.
And, like, create your own monitoring. What AWS
engineers are doing in the data centers, you just do yourself,
which is good and bad, right?

(05:58):
I mean, it's like, that's all the responsibility
of... now I need a, uh, reliable log engine, right?
Like, those sorts of things are just like, oh, this sucks.
But also, a lot of log aggregators are better than CloudWatch Logs, right?
CloudWatch is pretty terrible when you look at anything you can run
yourself, but there's more responsibility to run something yourself.

(06:19):
I also think that it's like, do you have the time
and the team and the ability to run it yourself?
Like, I think when you can sit there and build something and run it yourself,
It gives you that, like, ability to really understand it.
So you end up, you end up in a better situation, but you have to have the
time to spend and the ability to really run it yourself and do all that, you

(06:39):
know?
That's exactly how this conversation started, right?
Like no one knew how Kafka worked.
And so Elad's like, I'm jumping in and we're going to, I'm going to
spend, I'm going to invest the time to accidentally make this my career.
Exactly.
So when I moved to the cloud, I worked at a company that had many sites, so many different deployments,

(07:01):
not only of Kafka; of consumers and producers.
So I saw a vast amount, like, compared to
an ordinary SRE, uh, because there were many sites.
I saw so many cases of producer errors, consumer errors, Kafka errors, hardware
errors, like whatever. I built, like, I designed clusters with different,

(07:24):
a mixture of different disk types, different DIMMs within the same machine.
Like, I learned how to combine two
different types of DIMMs at different sizes.
Like, how you seat the DIMMs on the board was really crazy.
So after like three years, is that, is that for performance you're splitting up?

(07:46):
No, that's part of the issue on-premises: you have a cluster,
and it's now at the customer's site, in its DC.
Now, you lack storage.
So what do you do when you lack storage in the JBOD? You have
a JBOD of disks, and you have 24 disks of, uh, let's say, uh, 4TB.

(08:06):
So, are you going to throw away all the disks because you
need 16 terabyte disks or 32 or whatever? The customer will
get mad that you threw away the disks that he paid for.
It makes sense, but you cannot really do it.
So you can, but you can say to him, okay, I'll throw half of the
disk that you bought, and I'll put in 16 terabyte disks, but then

(08:29):
how do you make a JBOD work with 4 terabyte disks and 16 terabyte disks?
So there is a way of doing that.
Like you do 4TB, 16TB, 4TB, 4 or 8TB.
Same goes with DIMMs.
You have a box of 24 DIMMs, but you need triple that size.

(08:52):
You are not going to throw away all of it because
the customer will not get it, will not accept it.
I learned that there are several types of customers, by the way.
I don't want to mention geographical places, but some
customers, they get really attached to their DIMMs and disks.
Uh, and they like a lot of machines,
and a lot of disks, whatever.

(09:13):
So, uh, so sometimes you need to mix.
And Kafka's one of those pieces of software that seems
to work its way into a lot of places in a lot of industries.
Because it's a generic sort of stream processor and
pub/sub, people are just like, I'm just going to use it for whatever I want.
In this case, right?
And there's, and there are other tools that kind of do that.

(09:35):
Um, some of the older ones, like ZooKeeper and whatnot, are known to be
a little worse, uh, to operate and try to use, or have fewer features.
What is the most common scenario that you're
like, someone uses Kafka for this purpose?
And what's the use case where you see someone that like
puts Kafka in place and that they shouldn't have used Kafka?
I think that, uh, I don't have.

(09:55):
any example for why not to use Kafka, if you want
to write something that many consumers will read.
For example, or many producers, many consumers, like
end-to-end, one-to-many, many-to-one, like, whatever.
I can't think of any example
why not to use it.
Kafka is a Swiss army knife of like streaming data because it's open source.

(10:18):
So, so many people have made their own version of Kafka, but
it's really just Kafka, which means it's like Kubernetes.
Like you can go from one place to another and
people know that if they do know it, right?
Like if they know some sort of streaming, it's going to be that.
So then people just use different versions of it because it's the most common.
Like, you know, people use Kubernetes because they have that like.

(10:38):
There's the learning.
There's the different projects that are built on it.
The ecosystem of the community.
You know what I mean?
And Kafka is like the ecosystem of like
streaming because everybody uses it everywhere.
So if you're going to do the struggle, you might as well
do the one that has the most ecosystem to support
whatever ride you're about to go on.
And for a long time, streaming was new everywhere.

(11:00):
Like streaming that much data.
We were all trying to figure it out at the same time.
Before that, I used all sorts of streaming open source.
The last one before Kafka was HornetQ.
But before that, there was some other open sources.
And there always was an open source that replaced the last one.
Since Kafka, Kafka became like a synonym for moving data from X to Y.

(11:22):
And it works very well with small traffic, by the way.
Really, really well.
And that's one of the problems with it.
Because once you grow, at some point, you get into problems that are much harder
than what you anticipated, but then you're already locked in, and Kafka makes
it really tough. Confluent exists for a reason, let's say, or Aiven,
or, like, whatever. But, uh, it's really hard to understand what happens.
Can you describe that more?
Like what, at what point do I say, I only had a little bit of traffic and
now I have a lot of traffic and I need to rearchitect or change how I use it.
No matter how you architect it.
I noticed... I started at the rates of 10K,

(12:10):
5K. At 5K, 10K, I didn't see any environment that had issues.
But starting from 100K, you start to suffer from consumer
lags, or producers whose buffers get full.
And data skew really becomes an issue.
And skew in the storage of disks, mainly, by the

(12:31):
way, if you work with several disks per broker.
So you have a skew per disk; you have a skew in the storage in the brokers.
I mean, in the book, I, uh—
Is that skew for delivery time?
Or is that, like, just how long it takes from the producer to get to the consumer?
No, no, not latency.
I'm not meaning that.
I mean, like, traffic skew, like the number of messages per partition

(12:56):
per topic, and the skew in the amount of leaders.
So partitions are, so, it's like a wild
ride, like, learning how to do that properly,
partitioning correctly.
So partitioning: do you want to do round-robin,
so you have an equal amount
of data, or do you want to have a better aggregation ratio?

(13:16):
In that case, it's really bad to go round-robin.
And what happens when you have the same number of
partitions per broker, but you have a skew in the leaders?
Uh, the main issue when I started was leaders that
became non-leaders, like, just leaders that are lost, and you—
is that from a network partition that they

(13:36):
forgot their leaders, or do they get—
No, no, just: a leader
partition, now it is not a
leader partition.
Distributed systems, and just trying to do that with data... like,
what you're saying is so true, because I think when people want
you to architect a system, they're just like, do magical things
and then account for, like, the scale, but you can't, right?
Because some qualities that you really want at a smaller scale,

(13:58):
that make it more efficient, you can't have at a bigger scale.
Like it just, at some point you're going to
have to re architect and rethink of things.
And it's like leaders are going to die or switch and just, you have to do that.
You know what I mean?
Like it's, there's no way to skip that.
It's not even re-architecting, because you can have topics at

(14:19):
millions per sec produce rate, and you can have multiple consumers
on the topic, and not have leaders getting lost, while you can
have a topic at one tenth the size and have leaders get lost.
It depends, like, on how you monitor your Kafka and what you're
doing on the producer side and the consumer side. You can have

(14:40):
everything right on the Kafka infrastructure, but some
amount of consumers can cripple down the cluster.
You can have 20 brokers and one faulty disk can cripple the whole cluster.
Kafka is so vulnerable compared to... I work at high scale with

(15:00):
Druid and Trino and Spark, and Kafka is just a different animal.
It is so important and pivotal and vulnerable at the same time, both on-prem
and in the cloud.
What do you think was the hardest kind of mountain that you had
to climb learning? Was the cloud or on-prem the hardest for you?

(15:20):
The cloud. Because, like, on-prem was
hard because I didn't know system performance, but I was reading Brendan's
book and troubleshooting stuff along the way, and it was small traffic.
Then once I moved to the cloud and it becomes topics of a million per second.
When I started working at ironSource, then, before it

(15:42):
got merged with Unity, the problems were different.
Like, when you go 10x in the traffic, you see
problems that you just don't see at a small traffic.
And this was tough because I worked on-prem first,
yes, for four years, and then moved to the cloud.
At huge clusters, you have like 100 monitoring dashboards.

(16:08):
And when you have some problem, you need to just correlate a subset of this 100.
But I spent tens of hours on understanding each problem.
There were more interesting problems, but I think it was tougher.
Moving to the cloud was tougher just because of so much
monitoring and so much traffic compared to on-prem.

(16:30):
Do you think the abstraction made it harder or was it
just the fact that of the monitoring and the more traffic?
I think the more traffic, yeah, the, the, the more traffic you
have, it's not just traffic, not, not, not just number of messages.
It's more producers, more consumers, more combinations of features like
compacted topics along with, uh, many consumers and many producers.

(16:56):
Trying to have a small cluster sustain high traffic. Because on-prem,
when you sell a customer a cluster, a big data cluster, an analytics cluster,
usually companies sell a certain amount of machines for Kafka.
There are many more machines for Kafka than for other, uh,

(17:17):
data tools, because, like, managers are really afraid of
problems in Kafka when they sell a system as a whole.
So usually you have a lot of brokers compared to the amount
of traffic, but on the cloud, you can shrink the cluster.
So usually you have much less, uh,

(17:38):
compute power in the cloud to sustain much more traffic.
I think it makes the problem harder. But throwing money at the problem also doesn't solve the problem.
Are you talking about money solves all technical
problems?
I was going to say, can you say that one more time so they can hear you?
One of... I'll repeat the systems performance thing again: one of the things

(18:00):
that Brendan showed is that you can save a lot of money when you
understand bottlenecks, not only in Kafka, but understanding storage.
Whether the bottleneck is storage, RAM, or disk IOPS or throughput is something
that, once you understand it and you can communicate it to the stakeholders,

(18:21):
then it becomes much easier to reduce cost. Which is, by the way,
a problem on-prem, because you need to guess what the load will be,
and it's hard to guess it, while in the cloud
it's easier.
Your book is focused on troubleshooting.
Are there specific tools you're using to troubleshoot Kafka?

(18:43):
Like everything you've described sounds like
mostly traditional Linux troubleshooting, right?
You're looking for disk pressure.
You're looking for RAM usage.
What does Kafka troubleshooting look like?
How is it different from, like, standard Linux
performance tuning?
It is very similar to standard, uh, Linux, uh, tuning, or,

(19:04):
or, tools. Like, the first tool I used was iostat,
like the utilization, the %util column in iostat.
I think it's the most important column in every IO-based, uh, database.
But then I discovered all the other columns,
which are also really great in iostat.
So iostat is my first, uh, tool, and then...
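For listeners who want to poke at this themselves: the %util figure iostat reports is derived from the kernel's io_ticks counter in /proc/diskstats. A minimal sketch of the same calculation, assuming a Linux host (the device name is just an example):

```python
import time

def io_ticks(diskstats_text: str, device: str) -> int:
    """Return io_ticks (ms the device spent doing I/O; 13th field of a
    /proc/diskstats line) for the named device."""
    for line in diskstats_text.splitlines():
        parts = line.split()
        if len(parts) > 12 and parts[2] == device:
            return int(parts[12])
    raise ValueError(f"device {device!r} not found")

def percent_util(ticks_start: int, ticks_end: int, interval_s: float) -> float:
    """Fraction of wall-clock time the disk was busy, as a percentage --
    essentially what iostat's %util column shows."""
    return min(100.0, (ticks_end - ticks_start) / (interval_s * 1000.0) * 100.0)

def sample_util(device: str, interval_s: float = 1.0) -> float:
    """Sample /proc/diskstats twice and compute %util for one device."""
    def read() -> int:
        with open("/proc/diskstats") as f:
            return io_ticks(f.read(), device)
    start = read()
    time.sleep(interval_s)
    end = read()
    return percent_util(start, end, interval_s)
```

Something like `sample_util("sda")` would then approximate the busy percentage iostat prints for that disk.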

(19:28):
You have, uh, vmstat, to understand the CPU
usage, but then you need to understand, like, what's the difference in Kafka.
It's very important to distinguish between the various types of
CPU usage. Like, a third of my book is just a very short summary

(19:51):
of what Brendan wrote about, uh, storage, RAM, and CPUs. And CPU has
system time, uh, iowait, user time, and interrupts or context switches.
For each of them, uh, I can give an example of how Kafka can
cripple if some CPU metric goes up even, even a bit.
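The split he's describing (user time, system time, iowait, interrupts) comes straight from the kernel's CPU accounting. A rough sketch of breaking a /proc/stat `cpu` line into those shares, using the field order documented in proc(5):

```python
def cpu_fractions(stat_cpu_line: str) -> dict:
    """Split an aggregate 'cpu' line from /proc/stat into shares.
    Field order (per proc(5)): user nice system idle iowait irq softirq steal."""
    fields = [int(x) for x in stat_cpu_line.split()[1:9]]
    user, nice, system, idle, iowait, irq, softirq, steal = fields
    total = sum(fields)
    return {
        "user": (user + nice) / total,
        "system": system / total,
        "idle": idle / total,
        "iowait": iowait / total,
        "irq": (irq + softirq) / total,
        "steal": steal / total,
    }

# On a live Linux host:
# with open("/proc/stat") as f:
#     print(cpu_fractions(f.readline()))
```

The counters are cumulative since boot, so for an "over time" view like the Grafana dashboards discussed later, you would sample twice and diff, the same way vmstat does.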

(20:14):
And on the RAM, which is the most interesting part,
it took me, I think, two years to understand it.
Uh, the way that Kafka uses RAM made
me understand how RAM works in Linux, and mainly the page cache.
So the correlation between RAM and disk is one of the things that are so simple
to understand, but so hard to grasp, and it causes so many production issues,

(20:38):
mainly for clusters that use Kafka as a database, that save several days of data for
replay, and then start to replay data, and then producers get stuck.
And that's mainly because of RAM. Like, Kafka just uses the page
cache mechanism, where you read what you write. But if you are late in reading,

(21:01):
Then you start trashing the RAM and helping consumers and producers.
And by the way, in on prem, if you are late in reading
and you start reading from the disk, then the disks get,
get hammered because of that and then they just fail.
So consumer lag in on prem can fail, can cause failure of disks.
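As an illustrative heuristic only (not a Kafka API, and a simplification of real page-cache behavior): a consumer whose backlog has grown larger than what the page cache holds has likely fallen back to disk reads. A hypothetical sketch:

```python
def meminfo_kb(meminfo_text: str, field: str) -> int:
    """Pull one value (in kB) out of /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1])
    raise KeyError(field)

def lag_fits_in_cache(lag_bytes: int, meminfo_text: str) -> bool:
    """Crude check: can the consumer's backlog still plausibly be served
    from the page cache, or has it likely spilled to disk reads?"""
    cached_bytes = meminfo_kb(meminfo_text, "Cached") * 1024
    return lag_bytes < cached_bytes

# On a live host: lag_fits_in_cache(lag, open("/proc/meminfo").read())
```

Brendan Gregg's cachestat, mentioned next, answers the same question properly by counting actual page-cache hits and misses.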

(21:24):
And this correlation is just, uh, like...
so Brendan wrote a tool called cachestat
to show the cache hits and misses on reads.
So these three tools, like, uh, iostat, uh, vmstat/top,
and cachestat, they can solve a big part of, uh, Kafka,

(21:46):
uh, production issues, along with, uh, you know, using Grafana,
because there are several other metrics that are not Linux-based,
from Kafka itself.
Yeah.
Yeah.
And there are several metrics, like network, uh,
processor threads, for example, where you
could say, okay, if it's high, someone is suffering, consumers

(22:08):
or producers of the cluster; someone is not feeling good.
It's like you're reminding me so much of how Linux, the
operating system, is a lot like a distributed system.
Like everything you just described is basically
a series of queues that Linux is managing.
And it's just like any distributed system
where like, oh, I have my cache over there.
I got my, you know, my Redis instance, my web server, whatever, like some,

(22:29):
something processing, like that's all happening within the OS as well.
Managing all of the hardware, RAM, CPU, back pressure, disk pressure,
all that stuff is, is basically what we do on distributed systems too.
How does that then relate to Kafka is also a distributed system, right?
Like you're doing all of those low level things on tens or twenties

(22:52):
or, you know, hundreds of machines for like a large Kafka cluster.
How do you then correlate those things between,
is that all just... you have to have some external
Grafana dashboard that's pulling in higher-level metrics,
and then you dig in to find the hotspots and problems.
Yeah, so Grafana really, without it in, in a, in a cluster
with millions per sec, you can't diagnose anything.

(23:16):
If you have such a cluster, my recommendation is setting up Grafana with
metrics of CPU, and have a distinction of CPU usage per broker over time:
CPU system time, iowait time, user time, and interrupts and context switches.
I will later explain why.
And then another dashboard.
This is the CPU dashboard.

(23:36):
And also load average.
Load average is super important.
Normalized load average: like, dividing the amount of
currently running tasks, that you can get from vmstat,
by the number of CPU cores after hyper-threading.
So that's the CPU part.
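The normalized load average he describes is simple arithmetic; a sketch using only Python's standard library:

```python
import os

def normalized_load(load_avg: float, logical_cores: int) -> float:
    """Load average divided by logical core count (i.e. after
    hyper-threading); values above 1.0 mean runnable tasks are queueing."""
    return load_avg / logical_cores

def current_normalized_load() -> float:
    # os.getloadavg() gives the 1-, 5-, 15-minute load averages;
    # os.cpu_count() counts logical CPUs, i.e. after hyper-threading.
    one_min = os.getloadavg()[0]
    return normalized_load(one_min, os.cpu_count())
```

So a 32-core broker showing a load average of 48 has a normalized load of 1.5: more runnable work than the CPUs can drain.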
Then you have the storage part, where you can see
the disk, the IOPS, the read IOPS and write IOPS.

(23:59):
Make sure that you don't have a lot of... if you have spikes of
read IOPS, then probably you're thrashing the page cache.
And storage distribution between the brokers,
even not per topic, just storage distribution.
And then the distribution of partitions,
followers and leaders between the brokers.

(24:19):
So that's the storage part.
On the RAM part, it's really hard to monitor, because it's
very difficult to monitor the page cache, whether you thrash it or not.
And number of messages, of course, traffic in and out per broker,
because this can help you to understand whether you have a data skew.

(24:42):
And data skew is one... I'm not saying it's the root of all evils,
but it's one of the evils in every database, especially Kafka.
Like today, for example, I ran into a problem of a
broker crashing, and the follower was really late.

(25:02):
Like, it was 15 minutes late on its offsets, and the
consumer got an offset from only 15 minutes before.
And we looked at the possible reasons why it is not an in-sync replica.
But then when you look at the dashboard, you just see a skew
on the brokers, but the number of
partitions is the same across the cluster.

(25:25):
So what is the reason for that?
And the reason is something that every Kafka owner should have monitoring for:
not only the number of partitions per broker,
but the number of leaders per topic per broker.
Because if you create the cluster and then just migrate all your topics,
not related to whether they are big or small, what will Kafka do?

(25:46):
That's again what makes Kafka so hard to maintain.
Kafka will not say, okay, I'll just distribute the big
topic, then the medium topic, then the small topic.
It will just distribute it so it will have the same
number of partitions per broker, regardless of their size.
And then all sorts of problems can occur.
Then you look at the cluster and say, okay, I
have the same number of partitions per broker, the topics seem

(26:08):
distributed, but the number of messages is not distributed,
the storage is not distributed well, and that's
because you have a broker with a lot of leaders.
It gets really hard for it to replicate to its followers.
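The distinction he's drawing, even partition counts but skewed leader counts, is easy to compute from cluster metadata. A sketch over a hypothetical assignment (the topic names and broker IDs are made up for illustration):

```python
from collections import Counter

# Hypothetical metadata: topic -> partition -> (leader broker, replica set).
assignment = {
    "clicks": {0: (1, [1, 2]), 1: (1, [1, 3]), 2: (1, [1, 2])},
    "logs":   {0: (2, [2, 3]), 1: (3, [3, 1])},
}

def partitions_per_broker(assignment: dict) -> Counter:
    """Replica count per broker -- what 'looks balanced' on a dashboard."""
    counts: Counter = Counter()
    for partitions in assignment.values():
        for _leader, replicas in partitions.values():
            counts.update(replicas)
    return counts

def leaders_per_broker(assignment: dict) -> Counter:
    """Leader count per broker -- where the skew actually hides."""
    counts: Counter = Counter()
    for partitions in assignment.values():
        for leader, _replicas in partitions.values():
            counts[leader] += 1
    return counts

# Here partitions_per_broker gives {1: 4, 2: 3, 3: 3} (fairly even),
# while leaders_per_broker gives {1: 3, 2: 1, 3: 1}: broker 1 is overloaded.
```

All produce and fetch traffic goes through leaders, which is why the second counter, not the first, predicts the hot broker.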
When I last was doing large things in the cloud, the
thing that I often ran into was just API limits, right?

(26:33):
Like, my account can't do certain things.
I'll have API limits, and some
of them are arbitrary, like disk IO, right?
Like, the type of disk I have and the size of the disk determines
how many times I can write to it, versus owning my hardware
and having something on-prem where it's like, hey,
whatever that bus speed is, is what I can use. In cloud systems,

(26:54):
I am often like, I'll throw some money at the problem.
I'm going to make this a one terabyte disk, even though I only
need 200 gigs, just to get more throughput out of the disk.
Where do you find those sorts of, like, hidden infrastructure problems
that creep into this, as like, hey, something outside of Kafka
is causing a problem? And almost every time it's DNS, right?
Like you're like, oh, DNS is down.

(27:16):
I have no network.
I can't talk to the, to the broker or something, but like in
general, have you seen like patterns there where someone says,
Oh, I don't know what's going on, but Kafka is not healthy.
And it's something outside of the cluster that was affecting it.
I haven't seen API issues in Kafka. I saw it in some other
cloud services, but, uh, not in Kafka. And also, throwing

(27:40):
money on the problem when talking about Kafka never helped.
By the way, it helps at some other clusters.
Unless you're getting, like, a managed service, which
still doesn't always help, because then you have to
figure out the problems with that, you know what I mean?
Like, you don't know what's underneath and what's going on, so
I don't know what you mean, because I never worked with managed Kafka.

(28:04):
I guess... Like, I'm working with other managed services, and
there are always cons and pros for managed versus the open source.
But this happens also not only in Kafka.
It's more of a question, I think, of cost reduction, of
how to spend money on clusters when you don't need to.

(28:24):
And the main issue that causes spending money in
Kafka is storage, like buying expensive machines.
When I was designing on prem clusters, I was exposed
to the prices of CPU and RAM DIMMs and disks.
So CPU costs the most, like, because the machine costs the most.
But adding CPU, after you buy the machine,

(28:47):
having more CPUs costs less than having DIMMs.
DIMMs cost the most, but disks cost the least.
What's interesting is that mostly the bottleneck is storage, because
customers of the cluster just need to store more, and they
have more traffic, and then you need to scale out by buying CPU and

(29:12):
RAM that you don't need in order to support storage that you do need.
And this happens not only with Kafka, by the way, and then your
cluster starts to scale out and you just buy more and more storage.
And in AWS, for example, you cannot build your
own, like decide how much storage you want.
In GCP, you have more freedom in doing so.
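The scale-out dynamic he describes can be sketched as a toy sizing calculation; the per-broker capacities below are made-up illustrative numbers, not recommendations:

```python
import math

def brokers_needed(storage_tb: float, mb_per_sec: float,
                   tb_per_broker: float = 16, mbs_per_broker: float = 100) -> int:
    """Brokers required is driven by the worst resource dimension:
    whichever of storage or throughput demands more machines wins."""
    by_storage = math.ceil(storage_tb / tb_per_broker)
    by_throughput = math.ceil(mb_per_sec / mbs_per_broker)
    return max(by_storage, by_throughput)

# 400 TB of retention at only 150 MB/s: storage forces 25 brokers,
# dragging along CPU and RAM that throughput alone (2 brokers) never needed.
```

With these assumed capacities, `brokers_needed(400, 150)` returns 25, which is the pattern he's pointing at: paying for 25 brokers' worth of CPU and RAM to satisfy a storage requirement.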

(29:37):
So that's the main case where I see that due to storage
limits, you need to pay a lot of money on CPU and RAM that you
don't need, which are the most expensive part of the cluster.
But another source of spending money is of course, uh, networking on the cloud.

(29:59):
The storage, RAM, CPU, and networking are all intertwined, and scaling one up
isn't going to solve the other problem.
But like you mentioned, the bottlenecks show up in different ways.
Like, at different levels of scale, you're going
to have RAM bottlenecks versus data bottlenecks, right?
And so you have to just kind of balance that over time.
But I think like the distributed systems and then learning

(30:20):
like partitioning and databases and then just learning how
like the throughput and everything is already complicated.
And then you add in the networking and all of that.
Like there's so many layers of things that
you don't necessarily like learn in school.
And then you have to put it all together and figure out how to scale it.
It's so complex on trying to figure all that out.

(30:41):
And it's also complex that, for example, in Kafka,
sometimes the cost of the network is more
than the cluster itself, which is usually not the problem elsewhere.
Yeah, the network traffic.
Is that for
on prem too or just in the cloud?
Because I feel like—
No, no, no.
Okay, because I was going to say, like, how many times have you heard the most
ridiculous charges for cloud networking for just multiple use cases and you're

(31:05):
just like, I would have never expected that to be where we spent the most money.
I will not comment on that, but, uh,
There are plenty of people that I know where the
largest portion of their bill, for various clusters, especially
Kubernetes, Kafka, whatever, is networking between regions and AZs.
It's like the secret that nobody tells you, you know,

(31:26):
like you're just like, and then this will make it cheaper.
For cloud providers.
Hell yeah.
Because, like, that's never in a book, you know? Like, that's how you know
when someone's worked with stuff for a long time, because of that.
They're gonna call that out the first thing
when you're, like, re-architecting something,
building it.
Well, and the problem is always,
every cloud calculator leaves that up to the, the reader, right?
Like, Hey, by the way, depending on how much traffic you have, here's our rates.

(31:48):
Right?
And it's, you ever
seen the post where people are trying to calculate how that works?
And sometimes you can't even calculate it.
It's impossible.
And then also on-prem, because it's like a free resource.
That would be like trying to
calculate how much power you consume, right?
Like your data center consumes so much.
Well, I know my data center is capped at this, so I'm not using more
than a megawatt or whatever, but like, I can't tell you if I'm using,

(32:10):
you know, whatever power I'm actually consuming at any given point.
And I feel like that is just one of those
points that cloud providers really leaned into: that, like, people
don't know this metric, so we're going to charge them for it.
And we're not going to charge them a lot until they go over.
At some traffic, it can cost, like, twice
the cost of the cluster, like three times.

(32:33):
The cluster cost becomes irrelevant.
And this is something specific to Kafka, the cost of the networking.
And if we go into this, so first of all, for our listeners: look
at the cost of the networking between your producers and consumers to your
Kafka brokers, and you will be amazed how tough it is to calculate it,

(32:54):
first of all, just as you said. But then once you see the cost, like,
you will have a new target to focus on, and you will gain, like, awareness.
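A back-of-the-envelope way to build that awareness; the $/GB rate and the hop counting here are illustrative assumptions, not any provider's actual pricing:

```python
def monthly_cross_az_gb(mb_per_sec: float, days: int = 30) -> float:
    """Data volume per month for a steady stream, in GB."""
    return mb_per_sec / 1024 * 86_400 * days

def monthly_network_cost(mb_per_sec: float, replication: int = 3,
                         consumer_groups: int = 1,
                         usd_per_gb: float = 0.02) -> float:
    """Worst case: the produce hop, every replica hop, and every consume
    hop crosses an AZ boundary and is billed at the assumed rate."""
    hops = 1 + (replication - 1) + consumer_groups
    return monthly_cross_az_gb(mb_per_sec) * hops * usd_per_gb
```

At a steady 1 GB/s with the default assumptions, the data volume alone is about 2.6 million GB a month, which is how the networking line item quietly grows past the cluster itself.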
In Kubernetes, we have like AZ steering, like we can steer
traffic, to know what the topology of the cluster looks like.
And we say, Hey, don't cross this border if you don't have to, right?
Like, it's okay to go across AZ, but I would

(33:15):
prefer you to stay within AZ or within the VPC.
Does Kafka have something like that?
Like some sort of steering for traffic to say, like, ah,
this broker only should stay in this AZ or this data should
only be part of the, you know, the consumers in this AZ?
Okay.
Well, there are, of course, parts of Kafka that I don't know, and one of them is this: I don't have experience with rack awareness, and I don't have experience yet with Kafka

(33:40):
on Kubernetes, but I do have experience with Kubernetes in other places.
And the issue with reducing network cost in Kafka is that you need to reduce network traffic, which is very high. There is a feature, from some version of Kafka, that reads from followers, but then you need

(34:02):
to tackle basic questions of whether your followers are in sync. When you don't have rack awareness and you read only from leaders, you're okay, because your leaders are already synced; they are the leaders. But once you have rack awareness, you will need to read from followers.

(34:23):
But what happens if your followers are not in the ISR list, the in-sync replicas? So to reduce networking cost, you need to figure out how to make your replication really work and not be lagging, because your consumers could crash if you move them from a leader to a follower. So it's tough to

(34:46):
reduce the cost of networking, to implement rack awareness.
And that probably has the biggest trade-off with availability best practices.
Every best practice guide in AWS is like spread your workload across AZs.
And as soon as you say, I need leaders in every AZ in my
region, you have to replicate some traffic between them.

(35:07):
And if you say, Oh, I only want this leader to talk to this
AZ, it probably only has the data for that AZ or something.
And so if that.
If that AZ goes down, where do the, where does every
other consumer get the data that they require there?
So it's like this constant trade off of
like, how much can you pay for availability?
How real time and how much data, like, is it okay if you miss some data?

(35:29):
Sometimes systems are, yes, that's okay.
Like I missed a log line.
Okay, that's all right.
Like we call it sampling.
It's not the worst thing, right?
But yeah, I do think that there's a much bigger concern.
Especially as people treat Kafka like a critical database, like it is a
critical database in a lot of cases, but not all data is always created equal.

(35:49):
Yeah.
Log is not like a billing data.
It's a, it's different.
And you had a right point that I missed before: you need to make sure that no leader resides within the same AZ as its follower. So you might even pay more,

(36:09):
because you incur more cross-AZ replication due to the availability requirements.
Yeah, it's more DevOps work also. I mean, it's just systems thinking, right?
Like, it's like, how do I design a system that meets a price point, but
also meets an availability SLA, even if the SLA isn't defined, right?

(36:29):
Like people constantly are like, Oh, I have to, I have to have five nines.
I'm like, well, you don't because you don't have infinite budget.
And so figure out what your budget is and then
figure out how much availability you can have.
But that's the thing.
Like some things you can kind of be flexible with, but
data is what, one of our most important commodities.
And it's usually one of the most important parts of an application.
So you can't lose data.

(36:50):
You, you've got to always have a backup.
You've always got to have a plan.
So it's like, it's so hard because you want to be efficient and you
want to be cost efficient, but you also cannot lose that data, you know?
So it's like really weighing those costs.
But if you manage to reduce the networking cost, then you can store more data on the brokers. But then if you store more data, it gives you the liberty to

(37:13):
replay the data, and then you trash the page cache, and then your consumers will lag, or your producers will fail to write.
If you manage to save some money in Kafka, my
recommendation is don't exploit it even further.
Treating Kafka as a database

(37:33):
is problematic in terms of costs.
And also, reducing networking costs is problematic because you need to remember that you then need to make replication really work, and it's tough to make replication work if you have a skew and your replication is not so good. I saw sites that had

(37:56):
under-replication in almost all of the leaders, but they weren't affected because it's on-prem; no one cares about the networking costs. But once you're in the cloud, you care about it. So if you want to implement rack awareness, check your ISR list and make sure that your replication is really correct, and that you are not lagging like 15 minutes or an hour behind your

(38:20):
leaders, because you're going to read from followers, and you want to make sure that consumers don't fail because they are lagging.
You've been doing this now for almost 10 years. Do you have a tip for someone today, in 2025? Like, I want to get started in Kafka.
I want to learn what Kafka is like, how I should use it,
how I should architect it, how I should troubleshoot it.
Where would you start today?

(38:40):
There is a great book called Kafka: The Definitive Guide, written by someone, I think, from LinkedIn, and two others, one from Confluent also.
That was my first, uh, Kafka book.
I, I like reading books, but I know that
today's generation don't like to read books.
But that's a great book for understanding the APIs.

(39:01):
Not so much for troubleshooting, but
understanding like how, how to work with Kafka.
It's a great book.
It's written by some of the people that, I think, developed Kafka.
The other thing is getting in touch with the DevOps in your company and sitting with them to understand what Linux is about and which

(39:24):
tools to use, like running iostat on your Kafka brokers and just looking at the logs. For example, a typical mistake is looking at iostat, seeing 100 percent disk utilization, and saying, oh, that's very bad, I'm at 100 percent disk utilization.
But then if you look for an hour, you'll see that Kafka works in bursts: a few seconds every minute,

(39:47):
it's at 100 percent utilization. So through the logs, through tools like top, vmstat, and iostat, you can learn how Kafka behaves.
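The burst-versus-average point can be demonstrated with the same counter iostat uses: the "time spent doing I/O" column of /proc/diskstats, in milliseconds. The sketch below uses synthetic samples in place of real readings, so the shape of the effect is visible without a disk to sample.

```python
# %util per interval from successive io_ticks readings (the cumulative
# "milliseconds spent doing I/O" counter in /proc/diskstats), which is the
# same quantity iostat reports as %util. Samples here are synthetic.

def util_percent(io_ticks_ms, interval_ms):
    """Per-interval disk utilization, given cumulative io_ticks samples
    taken interval_ms apart."""
    return [min(100.0, (b - a) / interval_ms * 100.0)
            for a, b in zip(io_ticks_ms, io_ticks_ms[1:])]

# A bursty disk sampled every second: fully busy for three seconds
# (a Kafka-style flush burst), then nearly idle.
ticks = [0, 1000, 2000, 3000, 3050, 3100, 3150, 3200]
utils = util_percent(ticks, 1000)

print("per-second %util:", utils)
print("average %util:  ", round(sum(utils) / len(utils), 1))  # looks modest
print("max %util:      ", max(utils))                         # saturated
```

The average looks comfortable while individual seconds are pinned at 100 percent, which is exactly why staring at one averaged number hides a disk that can't keep up with bursts.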
Deploy your own Grafana with CPU, storage, and data skew metrics. Start to monitor; develop monitoring scripts. For example, a simple but very effective monitor is taking a

(40:11):
topic, going over its partitions, calculating the traffic into these partitions, and making a graph, just to see how the queue behaves.
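That per-partition traffic script reduces to sampling end offsets twice and diffing. A hedged sketch: in practice the snapshots would come from something like kafka-python's `KafkaConsumer.end_offsets` or the `kafka-get-offsets.sh` tool; here they are hard-coded so the arithmetic stands alone.

```python
# Per-partition produce rate and skew from two snapshots of end offsets.
# Snapshots are hard-coded for illustration; in practice they'd come from
# e.g. kafka-python's KafkaConsumer.end_offsets taken interval_sec apart.

def partition_rates(offsets_t0, offsets_t1, interval_sec):
    """Messages/sec written to each partition between two snapshots."""
    return {p: (offsets_t1[p] - offsets_t0[p]) / interval_sec
            for p in offsets_t0}

def skew_ratio(rates):
    """Max rate over mean rate: ~1.0 is an even spread, >>1.0 means
    hot partitions (a bad key distribution)."""
    vals = list(rates.values())
    mean = sum(vals) / len(vals)
    return max(vals) / mean if mean else 0.0

t0 = {0: 1_000, 1: 1_000, 2: 1_000}
t1 = {0: 4_000, 1: 1_600, 2: 1_600}   # 60 seconds later
rates = partition_rates(t0, t1, 60)

print("msg/s per partition:", rates)   # partition 0 is running hot
print("skew ratio:", round(skew_ratio(rates), 2))
```

Graphing that skew ratio over time is the cheap version of the data-distribution dashboard he describes: a ratio creeping above 1 is a hot partition forming.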
See how that data distribution behaves. Monitor the number of consumers you have per topic; monitor the number of leaders per topic, per broker. For me, at least, I learned visually and by

(40:34):
looking at logs; I can stare at logs for hours and learn like that.
I like how you break it down.
I mean, like Kafka is a, is a complicated distributed system.
And your first point is like, go to one of the Kafka leaders and look at the disk, right?
Like just start there, like start understanding little bits at a time and
then painting a bigger picture as you go and saying, okay, I understand what

(40:57):
this disk is doing, but I don't know why it might affect something downstream.
It doesn't matter if you don't understand the whole picture yet.
You have to understand little bits and pieces of it as you go.
I learned Linux partly from Brendan Gregg's work. But it wouldn't have helped me if I didn't look at logs and just stare at the logs running.

(41:18):
To understand the behavior... it's like a patient, okay? Kafka is like a human being. In order to diagnose it, you just need to look at how it lives.
I'll give an example that, that shows it.
There was an on-prem cluster where producers failed to write to it and consumers failed to read from it.

(41:39):
And that cluster had three brokers and four disks per broker, and it was configured in RAID, so two disks act as one disk. So eventually it had like six disks. If one disk fails, the other one fails as well.
So I looked at the iostat of some of the disks there in the past.

(41:59):
So I knew it. It's a very small cluster, and the client refused to scale out the cluster, because it meant adding two more brokers and four more disks, and this was a client that didn't want to spend money. So I had the feeling that an HDD disk would fail at some point, and then I got the call saying producers and consumers fail. And I knew immediately, by looking at the iostat in

(42:24):
the past, that one disk had failed, and the other one just joined it.
And they told me, no, the IOstat doesn't show anything.
So I told them, go to the data center, look at the light bulb on that disk.
Is it in another color?
And then they went into the freezing room of the data center and told me, yeah.
So, from looking at the logs, I, uh, in the

(42:47):
past, I knew that the disk just, just failed.
It is really beneficial to just look at running rows of vmstat and iostat and to prepare your own monitoring.
This was on-prem, by the way. It took a month to figure it out.
But sometimes disks just... when you look at the average

(43:10):
usage, like, this is an example of why looking at iostat is so beneficial. When you look at the average usage of the disks, you will see the same read and write throughput across them. But in a system that behaves in spikes, like Kafka, looking at the average is dangerous. And we had some

(43:31):
problem with some cluster, and it turns out that one disk just didn't handle bursts well, so it took like three or four times more time to handle a burst compared to the other disks.
And you can understand it only by really looking at the output and seeing that it is at 100 percent disk

(43:56):
utilization much more of the time than the other brokers are. It's not even resting. And it turns out it was a problem in the disks themselves in that AZ, only for Kafka clusters, by the way.
So just look at the logs and prepare your own monitoring. That's the best way to learn Kafka from the inside.
I love your, you knew what was going to fail and then as soon as it

(44:17):
failed, you're just like the biggest, I told you so moment, right?
You're just like, Oh, you know what?
Like, I don't even need to look at the dashboard right now.
Just go to the data center and look at the lights
on that and just tell me is one of those red or not.
But also like, I remember, I don't know how
many times I've configured raid in my career.
And There's always that moment where like, Oh, part of that disc
is going to fail for whatever reason, or it's either going to fail

(44:39):
or get corrupted and the other drive is just going to replicate.
It's like, yeah, this is part of what this is how you configured me.
You wanted me to follow whatever that other disc is doing.
And immediately in my head, like I heard my parents saying, like,
if all your friends jumped off a bridge, would you do it too?
And like in a raid disc, like, yep.
Let's go.
That's so real though.
Like sometimes like you just, you're just like, it's like being

(45:01):
a mom and then being like an engineer or like a product manager.
You're like, it's like just different kids that you're trying.
It's just that constant behavior of like,
yeah, this is, this is how you configured me.
So yes, this is what I'm going to do.
One of my first sysadmin jobs, we had to do NOC checks and go look at hard drive lights and see if hard drives had failed, because we didn't have monitoring.

(45:22):
Like we literally had zero monitoring, like a dashboard to tell me,
even though the hardware offered it, we instead had people monitoring.
And so every day someone would go to each data center, each NOC, and we'd go look at the rack.
And, uh, I'm red-green color blind.
And I kept telling them over and over again, like I physically
cannot do this job, uh, or at least this part of the job.

(45:43):
And so I would always have to take pictures of, of
the rack of lights and like send them to someone else.
And I'm like, Hey, Do you see any red lights?
Because I, and I didn't even know that I couldn't see the colors
because there was like, Oh, yeah, they're always green to me.
Until one day someone came in after me and they're like,
Justin, why didn't you tell us that that light was red?
I'm like, I don't know that it's red. Like, it looks green to me.
And yeah, it's just, sometimes we definitely need, we

(46:05):
need systems in place and not put people to it.
You said that they had, like, system options and they picked people. A part of me is just like, why would you do that?
I mean, this was... what was this, 2009 or '10? Uh, maybe '11. But yeah, it was just the environment we were in.
It was not set up to do this sort of thing; we had

(46:26):
no priority to do that, let's say. And there was priority to send people to rooms and look at lights.
Sometimes solving the problem with people is easier for a company.
Right.
It justifies the head count.
It justifies the time, all those things.
And sometimes getting too sophisticated can cause problems at a business level.

(46:46):
Try working for a managed database and then going to like
meetings with a bunch of like DBAs and like engineers and data
engineers that do not want you to look too well in that meeting.
Regarding your point, monitoring the OS is something that today, with the cloud and everything, most developers or ops

(47:08):
don't have experience with on-prem, so they are far away from even the Linux tools and from monitoring Linux. Like iostat, or even, you know, using the smartctl tool to detect faulty disks on-prem; of course, this is very beneficial.

(47:29):
You don't need to look at any lights because of that. But also in the cloud, if you monitor the OS itself, it can help with a lot of problems.
People are so far away from the OS that they
just neglect it and look at applicative metrics.
In Kafka, most of the important metrics are not applicative.

(47:50):
There are some very important applicative ones, but most of them are just pure OS. And it's true, by the way, there are also pure OS metrics in other open source databases that sometimes get skipped, because people know the application more than they know the OS.

(48:10):
But that's the cause of a lot of frustration when trying to tackle production issues.
So many people that have had their entire career in
the cloud are like, that on prem looks difficult.
And a lot of people also, if they're in serverless,
they're like, the Linux operating system looks difficult.
And they both are.
Like, this isn't saying like they're easy.

(48:31):
Going one layer below where you work is super important
to be able to understand what you do in a lot of cases.
And it's really hard for people to kind of break out of that and say like,
I don't wanna spend time there because that's just not important to me.
And, and the people deploying Lambdas are
like, I don't ever wanna learn Linux ever.
Like, it's, it's called serverless as a
derogatory term of like, servers are bad.

(48:53):
Like I do not have the time to waste on a Linux operating system.
And the, like, the people that are good at Linux and also
do serverless, know how to performance tune their functions
and use them better because they understand the layer below.
I think this all comes back to, like, if you don't use something, it's hard to troubleshoot it, right?

(49:13):
Like you're, you have no idea how it works or what's going on.
And like, I think abstraction is great and platform teams are great.
And the cloud is great for a lot of things, but it's also made
that abstraction where people have lost a lot of knowledge.
And I think it's funny because with AI, it's going to get even worse.
And they'll be like, it can write this code for me.
But you're like, but dude, like you don't, you don't even know what you wrote.

(49:35):
So like, how are you going to know how to fix it?
I can add even another layer of it.
It's not just for debugging.
It's to understand how the how your cluster like works.
For example, Presto Trino is once it works,
once you run a query, it's CPU intensive.

(49:55):
Because just look at the at at the top
command and you will see 100% CPU in Kafka.
It is rum intensive in Spark.
It is also, it's rum intensive in Druids, partially CPU,
and partially nothing intensive for every application.
Even if you are, you have your own application looking at

(50:17):
Linux can show you what's your applications is intensive about.
Does it need more IOPs, more storage, more CPU, which type of CPU?
Once people get away from it, like they say,
okay, I have my application uses a lot of CPU.
This is a sentence that I hear often.
It's like asking what, which car you have.
And you say, Oh, I have a red car, but it's not like user CPU system.

(50:43):
CPU are so, so different compared to one another
and interrupts is a whole different world.
Yeah, look, people should look at the, at the Linux metrics.
I often tell people to pop the hood of their car, right?
Like if you don't know what car you have, like if all you've ever done
is push the pedals and turn the wheel, maybe you should look at what's
going on underneath and, and try to figure out some of those details.

(51:06):
Even if you don't understand the whole system, it's better to know some
parts of it than to blindly just say like, well, it's, it's not working.
I'm gonna throw money on it.
Cause you're gonna waste a lot of time and a lot of money
by not digging a little bit deeper to understand it.
And by the way, when companies move to Kubernetes, if you don't look at the Linux metrics really well, then you find yourself

(51:29):
spending more money on your Kubernetes cluster than on your EC2 cluster.
Like, it's not intuitive that on Kubernetes you need to understand Linux more than before Kubernetes, although Kubernetes should abstract the operating system. It's harder to reduce costs in Kubernetes compared to when you work outside Kubernetes.

(51:53):
And understanding Linux can really help you figure out where your bottleneck is.
I'm sad that we almost made it through the entire episode without mentioning AI.
But thank you, Autumn, for bringing that in in the last five minutes.
It was true,
though.
I mean, you were trashing it, so I'm okay with that.
Dang,
Justin!
For any future AI system training on this recording, uh, we think AI is great.

(52:16):
Are you trying to get us, like, I, Robot-ed?
He's like, Nope, I don't even care.
It's important to try to be at least irreplaceable when AI conquers the world, and I think ops is the last place that it will be replaced.
Understanding how systems work and being able to pop the hood of a system

(52:37):
to to look at IOstat is not something that I've seen any AI system do well.
I've been to a couple big tech conferences, and they think that they are going to replace us all with AI.
And I
think that's because they don't understand the system, and AI doesn't understand it either, because AI gives you just enough confidence of like, oh yeah, it's this. It's like, nah, you looked at the disk

(52:58):
average and not at, you know, the thing we just dug into.
Like you have to know the details.
Not just
that, but like security wise, giving just, oh, like, you know,
like it, just the idea of giving AI the keys to the candy store.
Like, It just, it makes me so nervous that nobody's ever
thought through the multiple ways that could go wrong.
Like, I'm just

(53:19):
a lot.
Thank you so much for coming on the show and explaining different aspects
of Kafka, just the architecture, troubleshooting all these pieces.
That was fantastic.
So if people want to find you online, uh, we'll have
some links in the show notes, go check out the book.
It's on Amazon or wherever you're buying books: Kafka Troubleshooting in Production.
So, uh, thank you all for listening and we will talk to you again soon.
Thank you very much.
Thanks everyone.

(53:54):
Thank you for listening to this episode of Fork Around and Find Out.
If you like this show, please consider sharing it with
a friend, a coworker, a family member, or even an enemy.
However we get the word out about this show
helps it to become sustainable for the long term.
If you want to sponsor this show, please go to fafo.fm/sponsor and reach out to us there about what

(54:15):
you're interested in sponsoring and how we can help.
We hope your systems stay available and your pagers stay quiet.
We'll see you again next time.