Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:03):
Welcome to episode 414
of the Microsoft Cloud IT Pro podcast recorded
live on 10/31/2025.
This is a show about Microsoft 365 and Azure from the perspective of IT pros and
end users, where we discuss the topic of
recent news and how it relates to you.
Fortunately,
when we went to record this, the Internet
(00:24):
is back online after an AWS
and Azure outage, both related to DNS and
both within the last couple of weeks.
So what better to discuss today than what
happened, how it was resolved, and what IT
pros should keep in mind for future resilience
planning when it comes to your cloud infrastructure.
(00:45):
So, Scott, I saw this funny meme the
other day. I'm gonna read it to you.
I intentionally did not read this to you
earlier.
So I saw this. Somebody sent it to
me. I have to go oh, I know
where it is. I have to go find
it. I should have pulled it up earlier.
And this can tie into another topic as
well. Where is that
message?
(01:06):
Wow. Okay. Here you go, Scott. After getting
fired from ungrateful
AWS,
after an outage where my job was to vibe code all the DNS entries to IPv6, happy to announce that it's my first day at Azure. Azure recognizes the value of vibe coding IPv6 DNS, and I just force pushed my first 1,000,000 entries. Now off
(01:27):
to grab some coffee.
Yes. I've seen this one. The Internet. Somebody
has been following
at wrecked on X.
Yes. Actually, someone sent it. I do not
follow them, but somebody sent that because
DNS is apparently hard as evidenced by this
last week of both AWS and Azure.
I guess it wasn't quite within a week.
(01:47):
AWS was October 20. Azure was October 29.
Nine days. There was a little bit of
a spread in between,
but it does happen.
It's always a good reminder when
the cloud goes down
that it really
is somebody else's data center someplace else. It's
just not
it's not your data center. These things tend
(02:09):
to be far reaching. I'm always
amazed
when
Herndon goes down, so like US East Virginia
for AWS,
and
50% of the Internet just goes offline. Because
there are so many
of the
modern day
SaaS services,
like the things that you would depend on,
(02:31):
like, hey, I listen to music on Spotify,
I stream my podcast from here, I do
my banking with like, all these different things
are all homed out of that region.
So when bad things happen to Herndon,
particularly in AWS land,
bad things tend to happen on the Internet
for the rest of us or at least
(02:51):
I think the parts of the Internet that
folks who listen to this podcast would go
for. So for me, like I said, that's
things like Spotify going down,
that is
Reddit suddenly disappearing and going no. Yep. There
went the body of knowledge that we were pulling all these things out of. And
then in this new world
of LLMs and everything else that are doing
(03:14):
both ingested plus real time searches of these
systems,
like, all that stuff starts to show its
cracks
along the way
as well. So the AWS one, interestingly,
like, manifests, I think, as a little bit
of, like, oh, this all sounds like a
lot of DNS. My understanding was it was
(03:34):
actually a problem with DynamoDB
and kinda, like, load balancing with Dynamo and
the way that they push
configuration
and things like that into it. But I
could be a little bit off there. I
didn't have a ton of time to dive
into theirs,
especially, like you said, with the Azure outage
coming on October 29, just nine days later,
and that one being
(03:56):
certainly more DNS related or at least like
a I think to the spirit of it
being that it was Azure Front Door and
kinda
some of the global load balancing capabilities of
Front Door that got out of whack due
to a
configuration update. And in both cases, in both
systems, these were configuration updates
that kind of went a little bit sideways,
(04:18):
and things got a little bit squirrely. It's
hard. Stuff at that scale is very complicated,
but it always
amazes me how
one of those configuration changes
can take down everything
so quickly, because we've seen it multiple
times from multiple different cloud vendors where you
(04:39):
would think they would have figured out by
this time how they could do, like, small
configuration changes that don't have the snowball effect,
but yet we continue to
see these. And, yeah, both were DNS. I
was reading some on the AWS one too,
and it sounds like it was DynamoDB, but the automation that's used to update its DNS. And it was
(04:59):
like two different services were trying to update the same DNS records tied to DynamoDB. And when two things try to update the same DNS record, it's like trying to update the same line in a file multiple times and SharePoint complaining that you have version mismatches. It's definitely possible to
encounter these race conditions.
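To make that "two writers, one record" version mismatch concrete, here is a minimal sketch of a conditional write, sometimes called optimistic concurrency. None of this is AWS's actual code; the record name and version field are invented for illustration:

```python
import threading

# Toy in-memory "DNS record" store with a version number per record. A writer
# must present the version it originally read; if someone else committed in
# the meantime, the write is rejected instead of silently clobbering it.
_lock = threading.Lock()
_records = {"db.example.internal": {"value": "10.0.0.1", "version": 1}}

def conditional_update(name, new_value, expected_version):
    with _lock:
        record = _records[name]
        if record["version"] != expected_version:
            # The "SharePoint version mismatch" moment: somebody else updated
            # the record after we read it, so our write would be stale.
            raise RuntimeError(
                f"stale write rejected: expected v{expected_version}, "
                f"record is at v{record['version']}"
            )
        record["value"] = new_value
        record["version"] += 1

# Two writers both read version 1; only the first commit wins.
v = _records["db.example.internal"]["version"]
conditional_update("db.example.internal", "10.0.0.2", v)      # succeeds, now v2
try:
    conditional_update("db.example.internal", "10.0.0.3", v)  # stale read
except RuntimeError as err:
    print(err)
```

The race described in the episode is what happens when that version check is missing or only done once: both writers think they hold the latest state, and the last one to finish wins.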
Even small changes do have big impacts, so
(05:22):
I think it's a little
off or maybe, like, not the right color
to say, like, oh, it's surprising when a
little configuration change
or, like, that a bigger configuration change goes
out. Like, all these things go out, whether
it's Amazon, whether it's Microsoft, whether it's Google.
Everybody has their own deployment practices
for safe deployments,
(05:43):
for making sure that things get flighted through,
like, multiple rings and they follow a general
progression. You see the same thing, like, when
a feature rolls out in SharePoint, for example.
Right? We all know about the different rings
that go in there with deployment rings and
things like that. So it's the best of
intentions.
The interesting thing for me in the
AWS RCA
(06:03):
was they got into some of the nitty
gritty around how
complex these things are with all these microservices
that are running, talking to each other. So
you'd like things are starting to manifest where
we've built these really awesome
machines, right, to go and manage this all
for us and have all this underlying logic
and all these other things into them. But
(06:24):
when these, like, little subtle race conditions are
coming through or other things are coming out
and stuff gets out of whack, in in
the case of the Dynamo thing, these workers
between these various microservices
becoming desynchronized,
bad things
happen.
Right?
So I think in the AWS one just
(06:45):
pulling up their RCA real quick. So they've
got a couple components. They've got this planner
and these enactor workers
within DynamoDB that help with some distribution of traffic
and other things via DNS, but it's a
bunch of basically, like, internal components. I'd encourage
somebody to go read about this. Like, if
you're interested in, like, distributed computing, hyperscalers, all
(07:06):
these things, like, it's always interesting to see
how these things are designed. But, you know,
apparently, you had this one service, which is
the DNS Enactor,
which when it fires up, it verifies
plan freshness, what it's supposed to do, what
it's supposed to process,
what updates or endpoints it's supposed to
(07:27):
update, all those things.
Turns out, the DNS Enactor within
Dynamo does a very, like, sane thing in
that it verifies the freshness of what it
needs to do
anytime that process starts or at the start
of processing. But it's not doing, like, state
management as it goes. It's always assuming that,
(07:47):
hey. I spun up. This is current state.
Let me go make some changes and then
check again kind of thing. So you had
these multiple actors that are talking to each
other,
and like you said, it's a contention issue.
So by the time one spins up and
it says, okay. Here's the plan. Here's what
I'm gonna go do,
and it goes and does it, well, it
turns out that another one was spinning up
(08:10):
with a potentially different plan that had been in flight. And all of
a sudden that check that had been performed
that was fresh was now stale, and it's
applying a stale configuration
and overriding what was already there,
and that leads to a series
of cascading
failures.
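The RCA obviously doesn't ship the real code, but the failure mode described here, freshness checked once at startup and never re-validated before apply, can be sketched as a toy model. Everything below (the plan version counter, the timings, the enactor function) is invented to show the shape of the race, not AWS's implementation:

```python
import threading
import time

# Shared state: the planner bumps plan_version when it produces a newer plan;
# "applied" records which plan the endpoint is actually serving.
state = {"plan_version": 1, "applied": None}
lock = threading.Lock()

def enactor(name, work_time):
    # Freshness is checked once, at the start of processing...
    with lock:
        my_plan = state["plan_version"]
    # ...but applying takes time, and nothing re-validates the plan here. A
    # safer enactor would re-check state["plan_version"] before writing and
    # abort if the plan had moved on.
    time.sleep(work_time)
    with lock:
        state["applied"] = my_plan
        print(f"{name} applied plan v{my_plan}")

slow = threading.Thread(target=enactor, args=("enactor-A", 0.6))
fast = threading.Thread(target=enactor, args=("enactor-B", 0.1))
slow.start()                      # A reads v1 and starts its slow apply
time.sleep(0.2)
with lock:
    state["plan_version"] = 2     # planner publishes a newer plan
fast.start()                      # B reads v2 and applies it quickly
slow.join(); fast.join()
print("final applied plan:", state["applied"])  # v1 -- the stale plan wins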
And for services like Dynamo,
(08:31):
they're so integral to the fabric of AWS.
So there's a bunch of other services that
are depending on DynamoDB. So if you're
doing compute and you're using virtual machines with
EC2,
you're doing functions with Lambda,
even things like RBAC and IAM ultimately tie
back to these database systems like Dynamo, and
(08:54):
they have these, like, just really bad, no
good, horrible days.
Closer to home for me on my side, I've seen when we've had outages in storage, and it's a very similar thing, like, you'd be amazed
at the number of services that depend on
storage for something. Right? They publish some kind
(09:15):
of state in there.
Maybe they're not even using, like, unstructured storage.
It's not like they're storing logs or something,
but maybe they're using, like, NoSQL
tables or they're using queues
or things like that along the way. So
there there's just a bunch of moving pieces.
There's a bunch of dependencies,
and those dependencies
just tend to
(09:35):
bleed their way out. And I think what
we're seeing a lot more of is with
these outages, at least these last couple, these
two most recent ones, and I think if
we look back a couple months as well,
the impacts are just so far reaching because
so many customers today
are dependent on the cloud. Like, I saw
a lot of chatter after this one, like,
oh, AWS went down, and then, oh, Azure
(09:56):
went down, and, oh, we should all be multi cloud, and all these things. Right? Like, sure. Absolutely.
We should. If we had infinite money, infinite
time, infinite skilling, all those kinds of things
that are out there, but that's ultimately not
the reality for a lot of us. So
I fall back to, are these things bad?
Yes. Do we learn from them? Also, yes.
(10:18):
Like this particular race condition in the
case of AWS,
the thing that happened in Azure, they happened.
They should not happen again because we learn
from them, we implement those changes, and we
go forward. And as bad as it is
to have half the Internet go down, well,
half the Internet was down. It wasn't just
you. It was everybody else. And
(10:38):
the fix also wasn't on you. The fix
was on somebody else. Right? So while all
those servers were catching fire, while everything's spinning
back up and there's just this big retry
storm going on and network links are getting
overloaded and CPU and memory and all these
things are going down, like, as bad as
it sounds to say it, it was somebody
else's problem to fix.
(10:59):
It wasn't our problem to fix. So I'm
still reminded of that part, like and very
mindful that, like, when these things do happen,
yes, they're bad.
Clearly, they can be very severe and go
out there and have some crazy kind
of impact. But at the same time,
while you're maybe up all night trying to
inform your customers or
(11:21):
you're kind of running around trying to figure
out what's going on, ultimately, that responsibility sits
with somebody else
to make sure that it is ultimately where
it needs to be and that it's back
up and it's running. And I think, like
I said, like, these things happen.
We're talking like
these massive distributed systems.
They're built by the best engineers that are
(11:42):
out there, and
they still have these issues even with testing,
things like that, but they will get hardened.
These are just battles in the war. They
make these systems
more resilient at the end of the day.
Everybody learns from these. Like the AWS outage,
I can guarantee you folks in Azure learned from. The Azure outage, I can guarantee you
folks at AWS and Google and competitors are
(12:04):
also learning from as well, as we're all
publishing these RCAs and getting things out there
and kinda talking about what broke, what we're
doing to make it better, how we're fixing
it. Yeah. And even the whole multi cloud
thing doesn't always work. Like, I was looking
at the AWS and the Azure one, and
under both of them, Starbucks went down. So
it's like Yes. In that case, multi cloud
didn't even help. Like, Starbucks crashed with AWS.
(12:26):
They crashed with Azure. It is what it
is. And the Azure one too, like, you
mentioned the network storm, and I think that's
some of it. We talked about how a
small change can trigger a widespread effect.
Looking at the Azure outage,
that one was a little bit more that
way where there was a configuration change that was applied to Front Door, and
(12:47):
it caused a few of the Front Door
nodes to fail. And then everything starts failing
over to working ones, but the working ones
don't handle all the failovers, and then they
start failing, and
it just snowballs from there where it wasn't
like to your point, people didn't just go
apply everything to all the front doors at
once, but one cascaded to another.
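A back-of-the-envelope way to see how that snowball works: once a few nodes fail, the survivors each inherit a bigger share of the same traffic, and if that pushes them past capacity they fail too. The numbers below are arbitrary; it's just a sketch of the shape of a cascading overload, not a model of Front Door itself:

```python
def cascade(total_traffic=800.0, nodes=10, capacity=100.0, initially_failed=3):
    """Redistribute a fixed load over surviving nodes until it fits or all fail."""
    healthy = nodes - initially_failed
    round_no = 0
    while healthy > 0:
        per_node = total_traffic / healthy
        print(f"round {round_no}: {healthy} healthy nodes, "
              f"{per_node:.0f} units each (capacity {capacity:.0f})")
        if per_node <= capacity:
            print("load fits within capacity; the cascade stops here")
            return
        healthy -= 1   # crude model: one overloaded node drops out per round
        round_no += 1
    print("no healthy nodes left: total outage")

cascade()                       # losing 3 of 10 nodes cascades to a full outage
cascade(initially_failed=1)     # losing only 1 node, the survivors hold the load
```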
(13:12):
Do you feel overwhelmed by trying to manage
your Office 365 environment? Are you
facing unexpected issues that disrupt your company's productivity?
Intelligink is here to help. Much like you
take your car to the mechanic that has
specialized knowledge on how to best keep your
car running, Intelligink helps you with your Microsoft
cloud environment because that's their expertise.
Intelligink keeps up with the latest updates in
(13:34):
the Microsoft cloud to help keep your business
running smoothly and ahead of the curve. Whether
you are a small organization with just a
few users up to an organization of several
thousand employees, they want to partner with you
to implement and administer your Microsoft cloud technology.
Visit them at intelligink.com/podcast.
That's intelligink.com/podcast
(14:00):
for more information or to schedule a thirty
minute call to get started with them today.
Remember, Intelligink focuses on the Microsoft cloud so
you can focus on your business.
It wasn't our problem to fix, but I also
caused a cascading failure this week, Scott. Unless
you wanna talk more about AWS and Azure
failures. We should talk about the Front Door
(14:21):
one really quick. Alright. And I think I
just wanna take this opportunity maybe as someone
who's a little bit closer to the lingo
that's used internally around these things to clarify
some things. Yep. I saw a thread on
Reddit that was diving into the Front Door
outage. And if you go read the RCA
(14:42):
that comes out, like, I'll go with this
first sentence in the what went wrong and
why. An inadvertent tenant configuration change within Azure
Front Door triggered a widespread
service disruption affecting both Microsoft services
and customer applications
dependent on Azure Front Door for global content
delivery. And I'm gonna go back to the
(15:02):
very first part of that. An inadvertent
tenant configuration change
within Azure Front Door triggered a widespread service
disruption.
There were folks on Reddit who were reading
that, and they were taking that terminology
of a tenant configuration change
to mean that a customer tenant, like you,
(15:23):
maybe you have a Front Door profile and I have a Front Door profile,
that you would have the ability to push
a configuration
change to your Front Door profile that would
take down the whole system. That cascaded to
everything? Yeah. I coulda told you that from externally. Right? But I can see where that language, tenant, is used so broadly. So broadly.
(15:44):
Yeah. Somebody familiar with it could take it that way. So
I just wanted to maybe provide a little
bit of clarification there. So when we say
tenant in
this respect,
really what we're saying is service tenant
or maybe tenant that the service itself is
hosted on. So maybe another word for
tenant here would be scale unit. Like, what
(16:05):
are the scale units that host Front Door
versus
what are the actual
customer tenants and things that are out there?
And I think the confusion for this one
was maybe a little bit further born out
of the fact that the Front Door team has currently blocked all Front Door configuration changes. Oh, interesting. If
(16:25):
you have a Front Door profile and I have a Front Door profile,
we are blocked from making changes to those
profiles right now. And I think this kinda
perpetuates that thinking
that, oh, you and I are blocked from
making changes, and that's because I could make
a change that's gonna impact you. And
I don't think that's the case with this
(16:45):
one. I think this is more like
scale units, internal service things,
all of that again. So there was a
configuration change internally.
That configuration change introduced
an invalid state, very similar to those race
conditions that we were talking about with Dynamo
on the other side.
(17:06):
That inconsistent
state
caused a whole bunch of AFD tenants or
AFD nodes, AFD scale units, whatever we wanna
call them, to crash,
and on that crash, to subsequently not be
able to load properly.
So Azure Front Door is kind of a
global load balancer
and a DNS load balancer. All of a
(17:28):
sudden, you started seeing all this weird stuff,
increased latencies,
timeouts, connection errors
for
every sort of downstream service that exists out
there. So, like, in storage land, you ever
provisioned a ZRS storage account? A ZRS storage
account, your DNS endpoint, your public endpoint
is a DNS CNAME that is part of
(17:49):
a Front Door profile and points to a Front Door profile. So,
not good. Right? Like, all of a sudden
your ZRS zone-redundant thing, like, could be
having some trouble due to lack of DNS
resolution.
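One practical thing you can do with that observation: walk the CNAME chain on your own endpoints and see which shared platform pieces they actually resolve through. A rough sketch using the dnspython package; the storage account name is a placeholder, so substitute one of your own:

```python
import dns.resolver  # pip install dnspython

def cname_chain(name, max_hops=10):
    """Follow CNAME records from `name` and return the chain of hostnames."""
    chain = [name]
    for _ in range(max_hops):
        try:
            answer = dns.resolver.resolve(chain[-1], "CNAME")
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
            break  # end of the chain, or the name doesn't exist
        chain.append(str(answer[0].target).rstrip("."))
    return chain

# Hypothetical account name -- the intermediate hostnames in the output are
# what reveal the hidden dependencies (Front Door, Traffic Manager, and so on)
# behind an endpoint you might otherwise think of as "just yours."
for hop in cname_chain("mystorageaccount.blob.core.windows.net"):
    print(hop)
```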
The other one that happens in Azure land
is
so much of the tooling talks to API
endpoints that are available via Front Door or
(18:11):
fronted via Front Door. So you think about,
like, management.azure.com,
which is the RESTful API surface for all
of Azure Resource Manager. That's behind Front Door.
Lots of folks notice it when the portal
goes down because, say, you're a public Azure customer, it doesn't matter if you're in the United Kingdom or
(18:32):
the United States. We all just go to portal.azure.com, and we get redirected to the closest portal instance via DNS load balancing via Traffic Manager. So
there's actually, like, regional endpoints for the portal,
but they're all masked out because they're part
of this resolution chain on the DNS side
that can go a little sideways
(18:52):
in the case of Front Door clearing out and getting to where it needs to be. So,
yeah, definitely not a good look for either
Azure or AWS on this one. I'm very
mindful of, like, the customer pain that's felt
on these and the friction that comes with
it. I think the consolation
is,
one, as
(19:14):
folks who
curate and look after these environments that are
hosted in Azure or AWS,
as much as we own the message to
our users that, yeah, it's broken and it's
down, at least we don't have to own
the fix for it, which is a double-edged sword. I don't think many of us could
fix it faster than the folks who built
these things
could anyway all along the way. But it
(19:35):
it does give us some stuff to go
out and think about
and see if we can do a little
bit differently next time. Yeah. And while they
do go down,
I would say lately,
and this was kinda the case of AWS
and Azure, I would say, is I feel
like
response times and fix times
for Azure and AWS have gotten quicker.
(19:57):
Like, the
time from when they first go down to
when they come back online
used to be longer. I think back to several years ago where you'd see outages that
would be, like, day long outages, whether it
was eight, ten,
twelve, twenty four hours. There have been outages
in Azure, AWS,
Microsoft 365, all of those. I
feel like the recovery time,
(20:19):
from catching the issue starting to happen to when it's starting to resolve, is shorter. Maybe it's not
completely resolved,
but you're not hard down for, like, eight,
ten hours. Companies have gotten better at that,
catching it, mitigating it, and getting things back
up quickly or at least starting to get
them back up quickly. That seems to have
gotten a lot better, I would say,
(20:40):
in the last few years. It goes both
ways. When the entire Internet is down, it
feels like forever.
And it's not just when the entire Internet's
down. I think there's
economic loss that's associated with these things. So
I saw some estimates talking about, like, the
AWS outage, even for the, quote, unquote, brief period of time that it was down, being as high as, like, $500,000,000 to $600,000,000
(21:01):
in lost revenue. Yeah. I saw some of
those numbers too. For the companies that
are hosted on top of it. I think
like any
dark cloud, like, you gotta look for the
silver linings. It can't always be glass half
empty kind of thing. So I will say
a couple of maybe, like, positive things that
happen in both of these outages, both the
(21:22):
AWS one and the Azure one. I'm seeing
that communication's getting better. So while folks are
still complaining that, like, oh, the status pages
aren't updating, things like that, I do think
the kind of proactive communication,
like, we're finding a better balance between
how many engineers do we put on fixing
the problem, which, generally, I would say
(21:42):
let's index towards putting everybody on it. But
if we put everybody on it, that's at
the expense of being able to communicate to
customers, because we might even be taking the
person who can take that message and
figure out how to get it to where
you need to be. So I think the
transparent communication is getting way better. I've been really
impressed by the
post incident
reviews that have come out from both Amazon
(22:03):
and Azure recently.
They're kinda going above and beyond in the
things that they talk about and expose. Like,
you as a regular customer, me as a regular
customer, we should never need to know the
names of
the internal
microservices
that are part of DynamoDB.
And, like, we should never need to know
about
(22:23):
these things like AWS's internal planner and enactor
workers.
Like, alright. Great. Like, let's not worry about
that kind of thing. So I think you
are seeing, like, a level of transparency
from the hyperscalers
that run these things
and good transparent communication
that's happening during the outages.
The other thing I'll call out, like, these
(22:44):
took some time in both cases to fix,
but all those rollback procedures
and stopping the bleeding and all that stuff,
it worked. We're sitting here a week later
and people are still banging their heads against
the wall going, we don't know what the
problem is. We don't know how to fix
it. We don't know what changed. We don't
know what happened. That's not the case here.
Like, these things happened. They were point in
(23:05):
time, tons of friction, tons of pain, horrible,
yes. But they got fixed. They got fixed
by somebody else, and they were fixed successfully.
And then for whatever these failure modes are,
like I said, you can be pretty confident
that they're not gonna happen in the future.
Are other things gonna happen? Yes. They haven't
been discovered yet. But as they are, it
all bleeds to more resiliency and it lends
itself to more resiliency
(23:27):
for these services.
In some cases, I think there were mitigations
put in place in a timely manner. So
in the case of the Front Door outage, I saw that they actually pulled the portal out from behind Front Door. Like, they went
and manipulated some DNS records to be able
to give customers relief so that they could
(23:47):
reach the portal without having to go through
AFD
and the load balancing
mechanics
that it brings along the way. I think
the tooling's getting better. You're getting the ability
in the tooling to target specific API surfaces,
have other workarounds there, so that's all good.
And, yeah, in general, like, sucks that it
happened, but I'm actually, like, really happy with
(24:08):
the responses here and the way they came
out. Stuff could always go quicker. But that
said, I think for what happened and the
scale of both of these outages,
stuff actually happened in a very timely way.
And, ultimately,
not much that I would have wanted to
do as a customer anyway. Like, if I
was already a multi cloud customer and I'm
hosting
(24:29):
in AWS and Azure,
it's not like I'm gonna go out and
bang on the door and say, well, let's
go put ourselves into Oracle or Google and
get yet another cloud here. Like, that's not
necessarily the answer
or the thing that's going to save you.
You're only as resilient as your least resilient
service kind of thing still at the end
of the day.
I think there is a little bit of
an opportunity for customers to go through. Maybe
(24:51):
you do wanna audit your dependencies a little
bit, like, hey. Do I have to take
a dependency on this thing? Or if I
do, is there an alternative or a fallback
service for me? Along the way, review your
DR plans. So while you're not responsible, like
I said, for fixing the servers and
the underlying microservices that power these things, I
think you still wanna have good ways to
communicate to your users about what's going on.
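Even a tiny, machine-readable version of that dependency audit pays off during an incident, because the "is there a fallback?" question has already been answered. The entries below are placeholders to show the idea, not a recommended list:

```python
from dataclasses import dataclass

@dataclass
class Dependency:
    service: str    # your workload
    rides_on: str   # the cloud/platform service it depends on
    critical: bool  # does the business stop without it?
    fallback: str   # what you do when it's down, even if that's "wait it out"

INVENTORY = [
    Dependency("customer web app", "Azure Front Door", True,
               "fail over to the direct regional endpoint"),
    Dependency("order processing", "AWS DynamoDB", True,
               "none: queue requests and drain when the service returns"),
    Dependency("internal wiki", "SaaS vendor", False,
               "local copies of the runbooks"),
]

# During an outage, this becomes the checklist you walk with your users.
for dep in INVENTORY:
    if dep.critical:
        print(f"{dep.service}: depends on {dep.rides_on} -> {dep.fallback}")
```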
(25:14):
So if you're a company that works with
Azure
and you have admins who are maybe more
click ops and they're dependent on the Azure
portal, you wanna make sure that you have,
like, good documentation
for your employees about what happens when the
Azure portal is unavailable,
what happens when the M365 portal is unavailable? What happens when this
service is unavailable? Just so they know what
(25:35):
to do, and they've got that kinda measured
comfort. You also need to think about
kinda documenting
recovery plans and expectations
in terms of timing.
So what happens if my cloud provider is
down for ten seconds? What happens if my
cloud provider is down for ten hours? Those
are very different scenarios.
And the way we react, the way we
(25:57):
communicate with our user bases,
all those things are
going to be impacted.
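One way to back that "ten seconds versus ten hours" distinction with data you control is a crude independent probe of the endpoints you care about, with escalation thresholds lifted straight from your recovery doc. The URLs, thresholds, and actions below are placeholders, and the sketch assumes the requests package:

```python
import time
import requests  # pip install requests

# Substitute the endpoints your users actually depend on.
ENDPOINTS = ["https://portal.azure.com", "https://app.example.com"]

# Continuous-downtime thresholds (seconds) and what your plan says to do.
THRESHOLDS = [(60, "blip: log it, no comms yet"),
              (15 * 60, "notify internal users and leadership"),
              (2 * 3600, "invoke the documented recovery plan")]

down_since = {}
while True:  # runs forever; in practice this lives in a scheduler or monitor
    for url in ENDPOINTS:
        try:
            ok = requests.get(url, timeout=5).status_code < 500
        except requests.RequestException:
            ok = False
        if ok:
            down_since.pop(url, None)
            continue
        outage = time.time() - down_since.setdefault(url, time.time())
        reached = [action for limit, action in THRESHOLDS if outage >= limit]
        if reached:
            print(f"{url}: down ~{outage / 60:.0f} min -> {reached[-1]}")
    time.sleep(30)
```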
I think you also do have to think,
like, I mentioned status pages.
Both AWS and Azure, like, the status pages
are not the greatest things at getting updated.
So, like, are there alternative systems that you
wanna look at? I still see lots of customers using things like Downdetector and things
(26:19):
like that to see when these things are
occurring or if they have broader impact within
geo, outside of geo,
things like that. I think those are all
good to stand up. And then the last
thing I would think about is as you're
going through and you're figuring out maybe some
of these things around
recovery plans, things like that, is making sure
(26:39):
that you're not only setting the expectations with
users, but also setting the expectations with your
leadership.
So, like, if you work for a company
that's single cloud, multi cloud, does your leadership
have the right expectations
around
your company's dependency on the cloud? Has that
been communicated in the right way? Does your
leadership understand
what they've bought into?
(27:01):
Because there's the dream of the cloud: oh,
it's somebody else's cloud, it's somebody else's problem,
it's 100% available. And then there's the reality
of the cloud, which we know so far,
no system out there is truly 100%.
So making sure that those things are ready
to go so that your LT can weigh
out all those options they need to, like
multi cloud
strategy options,
(27:22):
ultimately understanding that whole, like, risk reward scenario
or maybe risk versus cost
for things like additional resiliency
and redundancy
and where that all falls out for you.
Sounds good. Well, with that, Scott, I actually have family waiting for me to go do Halloween-y stuff. Halloween-y stuff. It is the day for it. At
(27:43):
least the weather is nice here in Jacksonville.
Nice and cool out there. It's a balmy
68. Yep. I think this is the first
year it's under, like, 80 degrees Fahrenheit for
Halloween in a while. It's been a while
since it's been this cool. So yes. Well,
thanks for that. Hopefully, no more DNS
cloud outages here for a while. Hopefully
Yes. It's something that nobody wants to happen.
(28:04):
Nope. So go enjoy your weekend. Enjoy the
rest of your Friday, and we'll be back
again in a couple of weeks. Alright. Sounds
good. Thanks, Ben. Alright. Thanks,
Scott. If you enjoyed the podcast, go leave
us a five star rating in iTunes. It
helps to get the word out so more
IT pros can learn about Office 365 and Azure.
(28:26):
If you have any questions you want us
to address on the show or feedback about
the show, feel free to reach out via
our website, Twitter, or Facebook.
Thanks again for listening, and have a great
day.