
November 11, 2025 20 mins

We break down how a small DNS error inside AWS rippled into a global outage, and why it affected everything from uploads to streaming to... beds? Andy Szoke, Developer at Punchmark, joins us to explain cloud basics, Lambda bottlenecks, and redundancy.




Send feedback or learn more about the podcast: punchmark.com/loupe
Learn about Punchmark's website platform: punchmark.com

Inquire about sponsoring In the Loupe and showcase your business on our next episode: podcast@punchmark.com


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
SPEAKER_02 (00:01):
Welcome to In the Loupe.
What is up, everybody?
My name is Michael Burpo.
Thanks again for listening to In the Loupe.
This week I'm joined by backend developer at Punchmark and my friend, Andy Szoke.
And we're talking about, in case you didn't hear about it, the AWS outage that took place about two weeks ago.

(00:22):
This was extremely disruptive to the entire internet.
And you're probably thinking, oh, how can one company's outage affect the entire internet?
Well, when you realize how much of the internet is connected, and also relies on these certain tools that are provided by a very small number of companies... it took down everything from

(00:43):
Netflix to Airbnb to BMW, lots of healthcare services.
Even Punchmark had some of our internal tools taken out, because we were relying on Atlassian, which I did an entire episode about, and they were relying on AWS.
And it's like, oh my gosh, this is way wider than you'd expect.

(01:03):
And it took them a pretty substantial amount of time to correct it.
So I wanted to talk about it so that maybe we can all understand it more.
And I think it's interesting to sort of start to understand how large the internet is, but also how small.
So please enjoy.

SPEAKER_00 (01:25):
This episode is brought to you by Punchmark, the jewelry industry's favorite website platform and digital growth agency.
Our mission reaches way beyond technology.
With decades of experience and long-lasting industry relationships, Punchmark enables jewelry businesses to flourish in any marketplace.
We consider our clients our friends, as many of them have

(01:46):
been friends way before becoming clients.
Punchmark's own success comes from the fact that we have a much deeper need and obligation to help our friends succeed.
Whether you're looking for better e-commerce performance, business growth, or campaigns that drive traffic and sales, Punchmark's website and marketing services were made just for you.
It's never too late to transform your business and stitch

(02:08):
together your digital and physical worlds in a way that achieves tremendous growth and results.
Schedule a guided demo today at punchmark.com slash go.
And now back to the show.

SPEAKER_02 (02:26):
What is up everybody?
I'm joined by my good buddy Andy Szoke, backend developer at Punchmark.
How are you doing today, Ando?

SPEAKER_01 (02:34):
Uh, not too bad.
There's, uh, no servers crashing anywhere, so it's, uh, a good day to be a developer.

SPEAKER_02 (02:39):
You ever see that joke?
It's like, every, uh, back-end developer waking up on the day of the AWS outage, and they're just like, oh God, and they pour, like, a double shot of espresso, and they're like, ah.

SPEAKER_01 (02:51):
Yeah, there are, uh, some people whose day it definitely ruined, for sure.
Um, I know that we got kind of lucky, I guess we can get into it, but, uh, yeah, some people had outages stretching into the afternoon, and the internet was chaos for that day.

SPEAKER_02 (03:04):
Yeah.
So I set it up in the intro, but for all of you listening, the AWS outage took place on October 20th.
I mean, it goes out very sporadically, but, uh, to kind of just set up, um, what this was: AWS is Amazon Web Services, and this is, like, a cloud computing service that Amazon offers.

(03:29):
But you're probably like, oh, you know, Amazon, like the company that does, like, the shopping stuff?
It's like, yeah, the same one that does Whole Foods, and also, uh... but AWS is such a massive part of their company.
I think at one point it was like, if Amazon spun AWS off into another company, it would be a top 10 most valuable company.

(03:51):
It's massive.
But then when you start digging into it a little bit more, you realize that cloud services are pretty much split between only three companies.
There's AWS; Microsoft Azure, which is Microsoft, yes, the company that does, you know, Xbox; and then Google Cloud, like the company that does all the other Google stuff.

(04:12):
So it's three companies that pretty much dominate the entire thing.
There are a couple of other ones, like Oracle and IBM and Alibaba, but, um, for all intents and purposes, they refer to them as the big three.
So, Andy, for the people at home, and me, can you explain what, like, cloud computing, or a cloud service, is?

SPEAKER_01 (04:30):
Sure, yeah.
So cloud computing, it sounds really fancy.
Uh, it's basically just, literally, someone else's computers, right?
So if we want to, uh, say, store images to AWS, then we literally just store them on one of their servers somewhere, and they're presented to us in a way that we can interact with easily from

(04:51):
wherever we are.
And, uh, the benefit to that is AWS has servers all across the world.
They have different regions, uh, they have what are called availability zones.
And the idea is, as Punchmark, you know, we're one company on the east coast of the United States, which is great, but, uh, we have clients that are all over the place.

(05:12):
So we need a service that can put computers physically close to where people live, and then copy all of the assets that we're storing for these clients and push them out to all these different locations, so that no matter where you're coming to Punchmark from, you're using an AWS server that's relatively close to where you are.
And then on top of that, everything that we can do on our

(05:35):
servers, like, uh, creating code or, uh, running logic or anything like that, we can store on AWS's servers, and they just have a bunch of different services, called, you know, Amazon Web Services, uh, that basically just mesh all those together and allow for basically seamless operation of a production environment, uh, distributed nationwide.

(05:56):
So when it works great, it's awesome.
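To make "someone else's computers" concrete: storing an image on AWS from code can be as small as the following Python sketch, using the boto3 SDK. The bucket and file names here are hypothetical stand-ins, not Punchmark's actual setup.

```python
import boto3  # the standard AWS SDK for Python

# Credentials are read from the environment or the local AWS config.
s3 = boto3.client("s3")

# Copy a local image onto "someone else's computer" (an S3 bucket).
# The bucket and key names below are made up for illustration.
s3.upload_file(
    Filename="ring-photo.jpg",
    Bucket="example-client-assets",
    Key="clients/example-jeweler/ring-photo.jpg",
)
```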

SPEAKER_02 (05:58):
Yeah.
But what's really interesting about it is, so, I had this conversation with my mom.
I was like, yeah, AWS went out, it was a real pain in the neck, because, you know, we've got clients mad because their websites are, like, kind of broken in certain instances, or, uh, lots of services around the internet were out.
Like, lots of services around the internet.

(06:18):
Everything... I was just joking about this, dude: some people's smart beds were broken.
Isn't that incredible?
Like, what a dumb timeline we live in, where one company goes down and suddenly everything from your car to your bed won't even work.
And you just, uh, you don't realize that there are these hubs, like you were saying; there's a hub on the

(06:39):
east coast, in the central, on the west coast, you know, and, I mean, Europe has one, and the proximity to those can also reduce your lag.
That's a big part of, for example, gaming: you want as little lag as possible.
But from what I understand, one of the main, uh, hubs went down, and that's what caused this outage.
Is that right?

SPEAKER_01 (07:00):
Yeah, so what basically happened with the outage is kind of a cascade of issues that were all on AWS's side, on Amazon's side.
So they had a service that was trying to write, uh, something called a DNS record, which is basically how your, uh, computer, when you type in a website, knows where to send the request.
That all goes through something called DNS routing.

(07:22):
It tried to write one of those entries for its database service.
Uh, there's an internal AWS service called DynamoDB that, uh, some companies use extensively for database operations.
We, uh, have our own systems for the most part, but it wrote that entry, uh, empty, and that got picked up and, uh,

(07:43):
propagated across all the different cache points that I mentioned, all across the world.
And so AWS found that issue and tried to fix it relatively quickly, but all of a sudden you have this cascade where, uh, the DNS lookups for the database aren't going through, which means that no site that uses AWS for data storage can do anything, uh, including AWS.

(08:05):
Obviously, they use their own services for, uh, all their own internal tools.
So basically any tool on their side that tried to access data, which is a lot of them, uh, got affected by this, and it spread out to a service called Lambda, which handles internal functionality.
Uh, it spread out, uh, basically all over the place.
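To see what a broken DNS record means in practice, here's a minimal Python sketch that resolves the kind of regional DynamoDB endpoint involved in the outage. The endpoint name is a real AWS hostname, but the check itself is just an illustration of the lookup every client performs before it can send a single request.

```python
import socket

# The kind of regional endpoint whose DNS record came up empty.
endpoint = "dynamodb.us-east-1.amazonaws.com"

try:
    # Ask DNS which IP addresses sit behind the hostname.
    infos = socket.getaddrinfo(endpoint, 443)
    addresses = sorted({info[4][0] for info in infos})
    print(f"{endpoint} resolves to {addresses}")
except socket.gaierror as err:
    # With an empty record there is nothing to connect to, so every
    # API call against this endpoint fails before it even starts.
    print(f"DNS lookup failed: {err}")
```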

SPEAKER_02 (08:26):
Yeah, and the way I'm trying to, like, picture it, and I'm a very visual guy, uh, it's kind of like a game of telephone, but instead of it just being one game of telephone, there are many, many, many billions of games of telephone.
And what happens if you take, like, the first person, or maybe even the second person, and you, you know, uh, remove them from the

(08:49):
situation and you, uh, you knock them out?
Then suddenly many people down the line are not receiving the game of telephone; only some are.
But what if some of the people further down the line are responsible for fact-checking the information from the other people?
Well, if they didn't get it, suddenly the other people aren't

(09:10):
able to fact-check with the other people, and they might shut down as well.
And it has this crazy cascading, um, knock-on effect.
They use that term a lot, knock-on effect.
But it's been really, uh, illuminating to learn about what ends up happening.
Can you explain a little bit about Lambda, especially?

(09:31):
Because that's the one I was learning was most important for Punchmark.
Can you talk about that?

SPEAKER_01 (09:37):
Sure.
Yeah.
So to fit in with your telephone analogy, which I like, you have all of that as the issue, and then, uh, you have a separate issue, which is that the people trying to communicate through the telephone still need to send their message across, right?
You can't just have the telephone cut and then be unable to relay all this critical information.
You have basically a bunch of calls that were getting stacked

(09:58):
up, and none of them are going out, and so, uh, this huge backlog was playing into it as well.
So even when AWS recovered their own internal tools, they still had to throttle the throughput way down, so that their own, you know, internal critical stuff could catch up and not get swamped, and then, uh, they started, uh, partitioning out from there.
So, uh, it really was just, uh, a big mess.
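That backlog is why AWS's own guidance tells clients to retry with exponential backoff and jitter instead of hammering a recovering service. Here's a minimal sketch of that pattern; the upload_image call in the usage comment is a made-up stand-in, not a real API.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5):
    """Retry an operation, waiting longer after each failure.

    Random jitter spreads the retries out, so thousands of clients
    don't all slam a recovering service at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller handle it
            # Wait 1s, 2s, 4s, ... plus up to 1s of random jitter.
            time.sleep(2 ** attempt + random.random())

# Hypothetical usage with a stand-in operation:
# call_with_backoff(lambda: upload_image("ring-photo.jpg"))
```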

(10:21):
And like you said, Lambda was one of the, uh, services that was affected the longest.
I'm not sure exactly what specific way the DNS entry messed with Lambdas so hard.
I think it was just because it's so logic-based that, uh, they had such a hard time untangling that web.
But, uh, the way Punchmark uses Lambdas is for, uh, image

(10:41):
ingestion.
So if we have a client or a vendor that has some new jewelry items and they want to upload some pictures for them, uh, we use Lambdas in our own internal pipeline that basically takes those pictures on our server, creates a copy over on AWS, and, uh, does some pre-processing on them before we store them on S3.

(11:02):
So what was happening for Punchmark is, while Lambda is down, none of those images are making it through the pipeline, so they're just getting held up.
So from a client's perspective, that might look like: I uploaded these jewelry images several hours ago and they still aren't uploaded.
Uh, what's the deal with these new items?
But as far as we can tell, looking through what happened, that's more or less the extent of how bad it was for us.

(11:26):
Uh, obviously, uh, like I mentioned, some sites on the internet were just down for a long time, because they depended more heavily on some of those other services.
Um, but for us, we got off relatively light.
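For a rough picture of an image-ingestion Lambda, here's a minimal sketch of a handler that pulls an uploaded image from S3, resizes it, and writes the processed copy back. The event shape and key names are made up, this is not Punchmark's actual pipeline, and it assumes the Pillow imaging library is bundled with the function.

```python
import io

import boto3
from PIL import Image  # Pillow, assumed to be packaged with the function

s3 = boto3.client("s3")

def handler(event, context):
    # Made-up event shape: the caller tells us where the upload landed.
    bucket, key = event["bucket"], event["key"]

    # Fetch the original upload from S3.
    original = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    # Pre-process: normalize to RGB and cap the longest side at 1600px.
    image = Image.open(io.BytesIO(original)).convert("RGB")
    image.thumbnail((1600, 1600))

    # Store the processed copy under a separate prefix.
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG")
    processed_key = f"processed/{key}"
    s3.put_object(Bucket=bucket, Key=processed_key, Body=buffer.getvalue())

    return {"status": "ok", "key": processed_key}
```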

SPEAKER_02 (11:38):
Yeah, it's really interesting, because on a surface level... like, I don't even understand how the internet works, but from a very surface level, even more surface level than me, uh, you might interact with your Punchmark website and be like, oh, this image is taking forever to upload.

(11:58):
What the heck, Punchmark?
And then it's like, well, you see, it's because of... and then you go into your explanation.
And then, you know, maybe they're, like, trying to upload this image, and they're like, oh, I'm gonna just kill some time and I'm gonna watch Netflix.
Uh, Netflix: also out.
It's like, oh, I'm gonna kill some time and I'm gonna watch Twitch: also out.
And I'm gonna go for a drive and use my BMW

(12:21):
car, and it's like, uh, also out.
It's like, you know what? I'm going to bed. Also out.
And it's like, oh my gosh, this is the world in 2025.
So the one that was really interesting, that I didn't realize how much it, like, not hamstrung us, but more, like, um, gave us a real pain in the neck, was that it affected Atlassian, which, uh, runs, you know, Jira and Trello and Confluence and a

(12:46):
whole bunch of other ones.
Oh, and, um, Bitbucket too.
Um, that one was the one where I noticed all the devs being like, Bitbucket's down still.
Gosh, this is annoying.
So, how did that affect your job on that day?

SPEAKER_01 (12:58):
Yeah, so Atlassian is what, uh, us devs use to more or less, uh, organize all of the work that we're trying to do.
So, uh, all the tickets that come to the dev team to work, that's where those live.
Uh, if we want to push new code, that's where our source code lives.
So it wasn't affected in the sense that none of the websites could get to

(13:21):
the code, because that would mean that just none of them would load at all.
Uh, it just meant that for that day, we couldn't push any new work to our, uh, develop branch, which we use to test stuff before we push it out to, uh, the real world.
Um, so yeah, it was kind of annoying not being able to, you know, share work with the rest of the dev team.

(13:41):
Uh, but, uh, we just kind of got through it by more or less operating independently for a couple hours, and then it was resolved close to end of day.
So yeah, got it back working.

SPEAKER_02 (13:51):
Yeah, it just makes you think sometimes about, like, over-reliance.
You know, are we over-relying on one specific company?
And I think that the smartest thing you can do is kind of diversify a lot of your tech across a lot of different spaces.
That's, uh, everybody knows that's what you're supposed to be

(14:11):
doing, but sometimes it really is just, uh, your hand is forced, you know.
Some of these companies are so big, and that's why you, like, you almost scratch your head.
For example, do you remember... it kind of gave me big vibes of, um, do you remember the Ever Given, the ship in the Suez Canal?

(14:31):
Yep.
And that was one of those news stories that I was, like, following so closely, because I'm like, the internet, or the world, is not this dumb, is it?
And it's like, no. No, it is.
It is this dumb.
We had a boat turned sideways by mistake, and it blocked one of the major trade routes, and that's why all of your

(14:52):
deliveries are not getting to where you want them to be.
And it's like, wow, we really do rely on, like, a couple things a lot, don't we?

SPEAKER_01 (15:02):
It's crazy how little it takes, too.
Like you mentioned, yeah, we have one of the world's busiest shipping routes, and then, oopsie, a ship blocked it for the better part of a week.
Or, you know, like this example: um, you know, Amazon just had an automated system that was trying to update a DNS record, and, oopsie, it accidentally, uh, said it's empty, and then the entire internet goes down.

SPEAKER_02 (15:22):
So, like, another one of those... at this point, the comparisons write themselves.
The funniest one is, uh, do you remember when Facebook login was, like, really making a real push?
They were trying to make it so everyone would log in with their Facebook.
It's kind of like, uh, Gmail login.
And, uh, Facebook had something go wrong at the root level in

(15:43):
their servers, and it had something to do with their, uh, security level.
So basically, you needed to, um... like, a server malfunctioned or broke or something like that.
But what was so funny, not funny, but catastrophic about it, was it impacted Facebook login, and Facebook, um,

(16:04):
was arrogant enough to make it so that everything was contingent on Facebook login.
So then they couldn't access the server point, because all of that security was down, and it defaults to: you can't access this at all.
So they had to, like, I remember they had to, like, break into their own server system with, like, a blowtorch,

(16:27):
essentially, and just, like, cut the doors off the hinges, because one server went down.
And there's just something very human about that.
There's, like, a real, uh, metaphor in there, but I don't know what it is.

SPEAKER_01 (16:41):
There's, uh, there's a saying, right?
Uh, an ounce of prevention is worth a pound of cure, right?
That's the kind of circumstance where, if you just look for one more second at your deployment practices and make sure that there's absolutely nothing that could go wrong with this massive, massive international launch you're trying to do for this whole new security system, maybe you don't have to blowtorch one of your servers.

SPEAKER_02 (17:05):
Super interesting.
Now, Andy, did you guys kind of do a postmortem on this at all?
Did you guys, like, talk about how we could, uh, you know, better handle this?
It wasn't on us; that's, like, the thing that I took from it.
Uh, the way I was handling it is, we had a thread going in our community, and I was doing, uh, pretty much hourly updates, and

(17:26):
you were passing me information and I was posting in there, and we had a bunch of, um, a bunch of clients that were, like, following that as their source of information.
Did you guys talk about it amongst the devs at all?

SPEAKER_01 (17:38):
Uh, a bit, for sure.
Uh, like I mentioned, we weren't as heavily impacted as some other sites, so there wasn't as much to postmortem.
If there would be one takeaway, it's that redundancy is always something that you want to aim for.
Uh, obviously, like you said, all these systems depend on each other, so if one goes down, the whole thing goes down.
The problem you run into with that, though, is we can't

(18:00):
necessarily have two different cloud providers that are redundant to each other, right?
We can't run half of our sites on Azure and half of them on AWS.
I mean, I guess we could, technically, but it would be a nightmare to maintain.

SPEAKER_00 (18:10):
Yeah.

SPEAKER_01 (18:11):
So we have internal, uh, replication as much as we can.
Like, for instance, every day we do a backup of our databases.
So if those just magically, you know, fall out of the sky and everything's gone, uh, everything's not gone.
You know, we can recover.
So the most important thing that you can do, and I think we're doing a very good job at Punchmark right now, is just

(18:32):
having that redundancy, so that when something goes down, you can get back to a working state quickly and, uh, get on with your day.
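As an illustration of that daily backup habit, here's a minimal Python sketch that dumps a MySQL database to a dated file and keeps a second copy in another location. The database name, the paths, and the choice of mysqldump are hypothetical stand-ins, not a description of Punchmark's actual setup.

```python
import shutil
import subprocess
from datetime import date
from pathlib import Path

# Hypothetical names; swap in real database names and paths as needed.
DB_NAME = "example_app"
BACKUP_DIR = Path("/var/backups/mysql")
OFFSITE_DIR = Path("/mnt/offsite/mysql")  # the second copy is the redundancy

def nightly_backup():
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    OFFSITE_DIR.mkdir(parents=True, exist_ok=True)

    dump_file = BACKUP_DIR / f"{DB_NAME}-{date.today():%Y-%m-%d}.sql"

    # Dump the database; assumes mysqldump is installed and configured.
    with dump_file.open("wb") as out:
        subprocess.run(["mysqldump", DB_NAME], stdout=out, check=True)

    # Keep a copy somewhere a failure of the primary machine can't reach.
    shutil.copy2(dump_file, OFFSITE_DIR / dump_file.name)

if __name__ == "__main__":
    nightly_backup()
```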

SPEAKER_02 (18:38):
And there you go.
I think it's a really interesting, uh, story.
It was something I was thinking about a lot.
I almost did an episode, like, as it was happening, because I was like, well, we can't work on a couple of other things right now, so what am I gonna do? Make a podcast episode.
But, uh, it's something that, I think, as the world becomes even more connected, these types of things, I believe, will happen

(19:00):
more, because we are becoming more and more reliant on code.
And I'm sure that you can attest: uh, there are some parts of the internet that are ridiculously behind when it comes to, like, you know, not safety, but, like, uh, best practices, because they were built by some, like, overworked guy 20 years ago, and they've just never updated it.

(19:23):
Um, and you never really want to, like, kind of mess around with them, you know.
You don't want to remove that tree, because it brings the entire service down.
So, uh, just something I wanted to kind of share with our listeners: this was a really big deal.
Like, I don't know how much the news cycles really covered this, but it was an important one for people that were maybe in the

(19:45):
know.
So, uh, I can't thank you enough, Ando.
I think this was really a cool conversation.
I love having you come on.
Always a pleasure, Mike.
All right, thanks everybody, and we'll be back next week, Tuesday, with another episode.
Cheers, bye.
Alright, everybody, that's another show.
Alright, everybody, that'sanother show.

(20:05):
Thanks so much for listening.
My guest this week was Andy Szoke, backend developer at Punchmark.
He's one of my best friends, so it was really cool chatting with him.
This episode is brought to you by Punchmark and produced and hosted by me, Michael Burpo.
This episode was edited by Paul Suarez, with music by Rod Cochrane.
Don't forget to leave us a five-star rating on Spotify and Apple Podcasts, and leave us feedback at punchmark.com slash

(20:29):
loupe.
And that's L-O-U-P-E.
Thanks, and we'll be back next week, Tuesday, with another episode.
Cheers.
Bye.