
November 12, 2021 26 mins

So we’ve discussed how to save a company's time and manpower by fixing problems with Instana. We’ve explored how Turbonomic is a game changer when applied to managing resources. These are great tools for solving the problems of the present. So what about the future? Robert Barron from IBM walks us through how that future is evolving with proactive measures driven by artificial intelligence in Watson AIOps. He gets into the potential of tools that can accurately anticipate, predict and solve IT problems. With these innovations, we’ll uncover the world of business intelligence looming around the next corner. And in an uncertain world ahead, it pays to have insights from Watson AIOps making improvements for the future.

Scaling AIOps is a show brought to you by IDC and IBM. If you want to learn more from IBM about AIOps, visit: https://www.ibm.com/cloud/aiops

See omnystudio.com/listener for privacy information.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Robert Barron (00:06):
Automation is really the name of the game. When we think about...
If we want five-nines availability, that means that we have
seconds to solve a problem during a month, minutes over
a year. So if I have to log in to the system, to
a server to solve a problem, that's it, that's already
too long.
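
The five-nines arithmetic Robert describes can be checked with a short calculation. This is a minimal sketch, assuming a 30-day month and a 365-day year (both are assumptions, not figures from the episode); the function name is illustrative only.

```python
# Downtime allowed by an availability target over a given period.
# Assumptions (not from the episode): 30-day month, 365-day year.

SECONDS_PER_DAY = 24 * 60 * 60
FIVE_NINES = 0.99999  # 99.999% availability


def downtime_budget_seconds(availability: float, period_seconds: float) -> float:
    """Return how many seconds of downtime the target tolerates in the period."""
    return (1.0 - availability) * period_seconds


per_month = downtime_budget_seconds(FIVE_NINES, 30 * SECONDS_PER_DAY)
per_year = downtime_budget_seconds(FIVE_NINES, 365 * SECONDS_PER_DAY)

print(f"Per month: {per_month:.1f} seconds")
print(f"Per year:  {per_year / 60:.2f} minutes")
```

Under these assumptions the budget comes to roughly 26 seconds per month and a little over 5 minutes per year, which is why a manual login to a server already uses up most of it.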

Matt Eastwood (00:33):
Automation, even if you're not in IT or working in
the tech industry, it seems to be a conversation that's
happening all the time. Some find the amount of disruption
that comes along with it to be a bit scary. Others
find automation to be fascinating, with all the challenges and
ethical issues that it brings. Regardless of your point of
view, though, in IT the fact is that automation is

(00:54):
here to stay. So the real question is where will
it be going next?
Hi, I'm Matt Eastwood, Senior VP of enterprise IT research
at IDC. And I'm one of the hosts of Scaling AIOps,
Artificial Intelligence for IT Specialists and Business Outcomes, a joint
venture between IDC and IBM. Joining me is Steven Elliot, our

(01:18):
analyst on the show and group vice president of I&O,
Cloud Operations and DevOps.

Steven Elliot (01:24):
In this, the third episode of our series, we're going
to bring the conversation back full circle. We'll be talking
about IBM's Watson AIOps. Watson encompasses applications like Instana and Turbonomic
from our previous episodes and brings them into one suite
that IT professionals can use to advance their operations. But

(01:44):
it goes beyond that to allow for proactive incident management,
to plan for bigger problems of the future.

Matt Eastwood (01:54):
So we talk a lot about digital transformation and how
this has continued to accelerate, and it's driving what I
like to describe as a continuum of applications and data
that stretches from the edge to the core. And we
see businesses differentiating more and more with apps and data.
And that means that the typical application portfolio's growing pretty quickly,

(02:15):
and these applications are also highly interconnected and they're very
data dependent. So Steven, before we introduce our guest, maybe
it would be a good idea to get a little bit
of a sense about automation. We're talking a lot about
automation in this episode, and if you could give us
a brief setup as to why it's so important these days,
why we're talking more and more about automation?

Steven Elliot (02:37):
Well, I can tell you, automation is such a hot
topic now, almost more so than ever. In fact, in
some recent IDC research, we predicted that by 2023, 75% of
Global 2000 IT organizations will adopt automated operational practices to
transform their workforce and to support unprecedented scale. One other

(02:58):
thing I'll mention, we're also seeing organizations put together different
centers of excellence, sometimes called centers of enablement, specific
to automation, or certainly cloud architects, and we're finding that
this is becoming a central rallying point for best practices
for tool selection, for driving analytics tied to automated tasks

(03:23):
more and more. So this theme of AIOps, this theme
of really tying together automated decision making with great data
and great sets of analytic models to drive some type
of, in some cases, autonomous automation, or in other cases,
the ability to automate certain pieces of a process to

(03:45):
bring humans into the fold, to help drive decision making,
and then to move that process to some conclusion or some
specific repeatable outcome.

Matt Eastwood (03:53):
So now I'd like to bring in Robert Barron. He's
been working at IBM for the better part of a decade
and focuses on finding IT solutions for better architectures, for site
reliability and so much more. So let's let him say
the rest.
It would be interesting to get a little bit of a sense
of some of the things that interest you inside and

(04:13):
outside of the company.

Robert Barron (04:15):
So certainly Matt, I'm very happy to be here. So
one of my big hobbies is the history of space exploration,
the Apollo landings on the moon, the space shuttle, things like that. And I'm
also very fortunate that this hobby is very much aligned
with the concepts of operations and reliability and things like

(04:38):
the computer restarted seconds before Neil Armstrong lands on the moon.
How did they handle that? All sorts of near disasters.
The most famous, of course, being Apollo 13, which had
an explosion tearing the spacecraft in half, halfway to the moon,
and yet thanks to the engineers on the ground, they
managed to solve all their operational problems and bring the astronauts back safely.

(05:02):
So I take all these concepts and I bring them
together into lessons that are relevant to site reliability engineers today.

Matt Eastwood (05:13):
And I think that's a really interesting way to think
about our industry and where we've gone and what we've
done and the whole aspect of mission critical that came
from what happened back in the space race. So could
you tell us a little bit more about Watson AIOps
and what it does?

Robert Barron (05:30):
Watson AIOps is the name of the suite, the solution
that IBM has for the whole domain of modern operations that
use AI as a helper for the clients. It starts
off with a component called Instana, which then delivers the
observability section. The next part of the puzzle is application

(05:51):
resource management, which we have as a solution called Turbonomic.
And what Turbonomic does, it uses artificial intelligence to understand all
the costs, all the concepts, all the checks and balances
between my different applications to maximize my resource usage, but in
the context of what is best for my applications. The

(06:12):
third part of the puzzle is really what we call
Watson AIOps, which is the development coming from two sides.
The one side is our traditional operational solutions, which have
been super charged with extra AI. And the other side
is the things that are coming out of IBM's research.

(06:34):
The same concepts that started off, if you may
remember, with the Deep Blue chess computer beating Kasparov. Watson came
to the public consciousness when it won Jeopardy.

(07:01):
And then it went on to do a lot of other AI
solutions, such as in health, detecting cancer and so on.
So what we did was we took the generic concepts of universal
AI that had been developed for Watson and took all these things into
the operational domain. So for any new concept that comes up,
any new data source that we might find, which we

(07:23):
want to plug into Watson AIOps, the underlying system is
ready for it and will get the most benefit out
of this new integration.

Matt Eastwood (07:32):
So IBM really saw the problem that was coming at
a lot of organizations, which is really this rapid growth
in apps and data, and how to essentially enable the
automation around that, so people could scale in ways that
they probably have never had to even think about scaling before.
So what did you see that set up this

(07:54):
conversation around AIOps and Watson AIOps in particular?

Robert Barron (07:58):
Yep. So it's not just the amount of data that
is unprecedented. It's also the collaboration of the data or
breaking down the silos because often, if you have an
application problem or middleware problem, then the solution can be found
in the networking data or in the hardware data. But this

(08:20):
was never made available in a timely manner to the
teams that were investigating the application problem. As soon as
you have the AI saying, I'm sitting in the middle,
collecting information from everyone and understanding the insights, then you
have a lot more collaboration between the different teams. You're breaking
down the silos, helping people work better together, where every success

(08:44):
leads to people saying, "I'm going to feed Watson
with more information, more data, and then it's going to find out,
it's going to detect more problems even earlier."

Matt Eastwood (08:56):
So one of the things that's come out of this
pandemic that we're living through has really been the awareness
around the importance of resiliency. And I know we've touched
on this emergence of site reliability engineers, this revolution that's
happening. And I'm wondering if you could just describe a
little bit for us what some of... again, some
of the biggest challenges are that people are facing and how

(09:18):
SREs came to be?

Robert Barron (09:20):
SREs came to be... they first started off at Google,
where they realized that they were scaling their software at
a rate much higher than they could scale the humans
who were dealing with the software. And they decided that the
solution here was automation. They needed to start getting software
engineers to do operational tasks in a way which would

(09:43):
be repeatable, scalable, and they could reuse it in other organizations.
And a lot of people say that automation is the hallmark
of a good site reliability engineer, and that's very true,
but I also think that a big difference between a
site reliability engineer and an operator, or a sysadmin or

(10:04):
a systems engineer is the responsibility to the application or
the service. And here we see a big challenge with
developers sharing this responsibility and line-of-business owners who
have to say, "Wait, I need my functional features from
the developers, but I'm also getting a lot of nonfunctional
information about the reliability, the resilience, the performance of the application

(10:28):
from the site reliability engineer." And we see more and
more, as SRE organizations mature, the backlog or the tasks
they're giving the application start being merged with the functional
requirements or the new features that the developers are developing.

Matt Eastwood (10:45):
So I'd like to build on what you're saying here, Robert. I
want to bring Steven back into the conversation and Steven,
I'm hoping you could bring a little bit more of
the business perspective to the modern day SRE. What's a
sweet spot in defining some of the value here? How
are businesses thinking about the value associated with an SRE
to the business itself?

Steven Elliot (11:07):
Yeah. No, it's been a fascinating journey, particularly the past
couple of years, around site reliability engineering. And from a
business context, SREs are increasingly creating what's called an error
budget. We're finding that with these error budgets, there's no one
number, right? So in other words, the error budget is
really essentially the budget that's required for the expected level

(11:32):
of system reliability that a customer is happy about. Because
as we all know, generally, for most of us, we
have about two seconds worth of patience on an app.
And if it's not working, we're gone. So now we're
finding that, because customer engagement models have gone digital
for almost every business, system reliability is absolutely critical to

(11:55):
customer experience and a happy customer.

Robert Barron (11:58):
Can I expand with an example I often use, exactly on
Steven's point? I was talking to someone who worked in a
bank and he said, "10, 15 years ago, the primary
clients of my systems were the bank tellers." So a bank
teller works nine to five. So I have all the time in

(12:19):
the world in the evening to do my maintenance, my shutdowns,
my upgrades. Now I have an error budget of over 50%
because the bank is not open 24 hours a day.
And if the system is a little bit slow, it
takes 10 seconds, 15 seconds to start up in the morning.

(12:39):
The bank tellers are going to complain, but they're not going to
do anything with this complaint. They're not going to quit
because the system is 10 seconds slow. But now the
same system, the primary client is the mobile client. So I
have millions of customers and they demand the stuff to be
available 24/7. I no longer have my after five

(13:02):
o'clock in the afternoon. I no longer have my weekends
for maintenance, and if it takes 30 seconds to open, I'm
going to close my account and move to another bank.
So technically I'm doing the same, but my availability and
performance requirements, my error budget has completely changed.
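
Robert's bank example can be put into numbers. This is a minimal sketch under stated assumptions: tellers work 8 hours a day, 5 days a week, and the mobile channel has a 99.9% availability target. Neither figure comes from the episode, and the function name is hypothetical.

```python
# Error budget as the share of the whole week in which the system
# is allowed to be unavailable.
# Assumptions (not from the episode): tellers need the system 8 hours a
# day, 5 days a week; the mobile channel has a 99.9% availability target.

HOURS_PER_WEEK = 7 * 24


def error_budget_fraction(service_hours_per_week: float, slo: float) -> float:
    """Fraction of the week the system may be down.

    service_hours_per_week: hours in which users actually need the system.
    slo: required availability within those service hours (0.0 to 1.0).
    """
    required_uptime_hours = service_hours_per_week * slo
    return 1.0 - required_uptime_hours / HOURS_PER_WEEK


teller_budget = error_budget_fraction(5 * 8, slo=1.0)
mobile_budget = error_budget_fraction(HOURS_PER_WEEK, slo=0.999)

print(f"Teller-facing system: {teller_budget:.1%} of the week")
print(f"Mobile-facing system: {mobile_budget:.1%} of the week")
```

Even if the teller-facing system had to be perfect during opening hours, well over half the week would remain free for maintenance. The same system serving mobile customers around the clock is left with a fraction of a percent.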

Matt Eastwood (13:20):
There's so much I find interesting in the story of
Watson AIOps, but what we really want to accomplish in
this episode, and this whole series really, is to give
you information that you can work with and put in
your tool belt as you go forward in the world of IT operations.
You're listening to Scaling AIOps, a podcast by IDC and IBM for

(13:42):
industry leaders and professionals to better understand how AI is
reshaping the world around us. Again, I'm your host, Matt Eastwood,
along with my co-host Steven Elliot.

Steven Elliot (13:53):
This is our third episode in Scaling AIOps. And we hope
that you've learned a lot on this journey, but if
you're just joining us on this episode, I highly recommend
you go back and listen to our other two episodes.
In one, we discussed automation and rapid decision making. And
in the other, we talk resource optimization. If you've already

(14:13):
heard those, thank you for following along with us so far.

Matt Eastwood (14:19):
If I think about a successful digital transformation, one of the things that it
allows businesses to do is to go from a more
reactive state to a more proactive state, which is I think
what you're really getting at there. So the conversation around
automation in proactive AIOps, I'm wondering if we could talk a
little bit about that. And Robert, if you could describe

(14:40):
some of the levels of automation that Watson AIOps offers
and even the role that incident management can play in this?

Robert Barron (14:49):
So automation is really the name of the game. So I need
Watson AIOps to understand the problem, alert me well before the
client is going to feel it, so that I can solve the
problem and if the problem can't be avoided, to trigger
an automated solution that's going to happen behind the scenes without
waiting for a human. And each of these is a

(15:11):
different type of automation working on a different aspect, or a
different AI model. One of the newest features of Watson
AIOps is going over change records and understanding which of
these changes caused the problem. And then when we have
a new change, we can come up to the people
who want to make this deployment change and say, "Look,

(15:33):
you made this similar change in the past and caused
this problem." So we're going to raise a red flag
to tell you that if you do this again, change
the same system in the same way, there's a high risk.
So Watson AIOps is not just solving problems faster, but helping you avoid the cause
of the problem in the first place.
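
The idea of flagging a change because a similar past change caused a problem can be illustrated with a toy example. This is a minimal sketch only: it is not how Watson AIOps works internally, "similar" is reduced to an exact match on system and change type, and every name and record below is hypothetical.

```python
# Toy change-risk check: estimate the risk of a proposed change from the
# share of similar past changes that were followed by an incident.
# Hypothetical data model; "similar" means same system and same change type.
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ChangeRecord:
    system: str
    change_type: str
    caused_incident: bool


def change_risk(system: str, change_type: str,
                history: List[ChangeRecord]) -> Optional[float]:
    """Return the incident rate of similar past changes, or None if none exist."""
    similar = [c for c in history
               if c.system == system and c.change_type == change_type]
    if not similar:
        return None  # No precedent, so no evidence either way.
    return sum(c.caused_incident for c in similar) / len(similar)


history = [
    ChangeRecord("payments-db", "schema-migration", True),
    ChangeRecord("payments-db", "schema-migration", True),
    ChangeRecord("payments-db", "schema-migration", False),
    ChangeRecord("web-frontend", "config-update", False),
]

risk = change_risk("payments-db", "schema-migration", history)
if risk is not None and risk > 0.5:
    print(f"Red flag: {risk:.0%} of similar past changes caused an incident")
```

A real system would need a much richer notion of similarity, for example the text of the change record or the affected topology, but the shape of the check is the same: look up precedent before approving the deployment.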

Matt Eastwood (15:52):
Where I'd like to go next is to talk a
little bit about scalability of AI and Steven, could you
describe for us really the demand for automation today and
the need to scale it for the future? A little
bit of perspective there might help.

Steven Elliot (16:07):
Yeah, sure. Many of the IT organizations we advise,
especially as it relates to automation and AIOps, are really
looking at things such as event noise reduction, anomaly detection
for root cause analysis. They're looking at predictive capacity insights.
They're looking at unified visualization and causation correlation. And of

(16:32):
course, also lean-forward organizations are looking at AI-driven
closed-loop automation. And so when you think about AIOps
and the role of automation, they really go hand in
hand and we're finding that organizations as they get more
comfortable with the idea of, what do different analytic models

(16:55):
bring to their problem or use case. And then certainly
what are the processes that they want to have a
direct impact on automating. And so all the different types
of analytics really do have a specific driver, specific impact
on those benefits and mapping that to the specific processes

(17:16):
and having that tight use case definition really helps drive
success, not only in measuring where you are today and
baselining it, but measuring and gauging the success of these projects
moving forward so that you can continue to expand out
the impact zone and drive real business and technical benefits
over the course of multiple years.

Matt Eastwood (17:39):
So Steven, you touched on this with the... When
we start talking about automation, we immediately start to touch
on people and process, and that's really a driver, but it's
also sometimes a bit of an inhibitor when people
really think about change and change management. So Robert, if
we could go over a little bit, what Steven's talking

(18:01):
about in terms of the need for scale in the future,
when we're using AI. I think I'd love to get
you to talk a little bit about how organizations are
happy to adopt more automation, or maybe where they're a
bit more reluctant. What would you say to them? How
do you get them to really separate some of the
challenges of making those changes in people, in process for

(18:22):
the sake of, kind of moving forward with the technology
and with the business?

Robert Barron (18:27):
Okay. That's an excellent question, Matt. It's always the technology
that is the easy side. It's always the humans and the
culture that's the difficult thing. And while the answer to technology
is scale, the answer to the human side is often
the exact opposite: the quick win. You come in and
you say, "You don't need the whole big gorilla." You

(18:51):
can start off with something small. That'll just showcase the
capabilities of AI in a specific little corner. Let's take
one specific model, like metrics anomaly, or event grouping, or
using a topology to find the blast radius of a
problem. Let's take this quick win, implement it on

(19:16):
one application and then we'll show you that it works
and show you the benefit and what happens is that
often there's an avalanche effect or snowball effect because the
operators will say, "Oh gee, this AI system both solved

(19:37):
a problem for me. And it was actually much easier
to use than I thought. And now I can start
maybe customizing it and I can add my own concept
and my own ideas." 'Cause a lot of these modern
solutions, Watson AIOps and Turbonomic certainly, are much more customizable
and friendly to the end user than some of the

(20:01):
old monolithic monitoring solutions used to be. Because now as
we have more automation, the system is becoming more stable,
because things are becoming more repeatable and you have fewer
anomalies that arise because humans were working in different ways.
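
One of the quick wins Robert mentions, using a topology to find the blast radius of a problem, comes down to walking a dependency graph. This is a minimal sketch using a breadth-first search over a hypothetical topology; the service names and the `dependents` mapping are assumptions made for illustration, not part of any product.

```python
# Blast radius: everything that directly or indirectly depends on a
# failed component, found with a breadth-first search.
# The topology below is hypothetical.
from collections import deque
from typing import Dict, List, Set

# Maps each component to the components that depend on it.
dependents: Dict[str, List[str]] = {
    "database": ["orders-api", "billing-api"],
    "orders-api": ["web-frontend"],
    "billing-api": ["web-frontend"],
    "web-frontend": [],
}


def blast_radius(topology: Dict[str, List[str]], failed: str) -> Set[str]:
    """Return all components reachable from the failed one via 'depends on' edges."""
    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in topology.get(node, []):
            if dependent != failed and dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


print(sorted(blast_radius(dependents, "database")))
```

Starting with a single application keeps the graph small enough for operators to verify the result by hand, which fits the quick-win approach described above.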

Matt Eastwood (20:17):
So there's some really great context there. Robert, I'm going
to ask, I want to really build on the importance
of leadership in this overall conversation around proactive AIOps. Robert,
any thoughts on, and from your perspective on leadership and
really how to drive this vision and mission forward?

Robert Barron (20:36):
I'm going to start by going back to the KPIs,
and there's something that is almost Kafkaesque about AI metrics. 'Cause
if we look at the metric everyone talks about, it's mean time
to repair. But what does the AI promise? The AI promises,

(20:57):
I'm going to be able to solve all the simple
problems automatically, make them go away. So ironically, the mean time
to repair is going to go up, 'cause all the simple things are going
to go away, won't even feature in our numbers. And the
only things that remain will be the ones that we
have to actually think about a lot to figure out. So,

(21:21):
a big part of leadership here is to understand that
metrics are really important but you have to understand the
reasons behind these metrics. And if at the end of
the month, the only thing that interests you is mean time
to repair and how it changed and number of incidents
and how it changed, then at some point, when you

(21:42):
are adopting site reliability engineering, and when you are adopting AIOps,
you're going to see these numbers moving in the opposite
direction of what you expect, but for very good reasons.
Leadership has to be strong enough to say, "I understand
why this is happening and I'm happy with this and we're going to
continue till the numbers change in the way that

(22:05):
I do want, because it's just a temporary shift in
the paradigm of what these numbers are representing."
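
The metric shift Robert describes is easy to reproduce with made-up numbers. This is a minimal sketch: the repair times and the 10-minute threshold are invented for illustration, and it assumes, as Robert puts it, that automatically fixed problems no longer feature in the incident numbers at all.

```python
# Why mean time to repair (MTTR) can rise after automation is introduced.
# All numbers are invented; incidents that automation resolves are assumed
# to drop out of the incident statistics entirely.
from statistics import mean

AUTOMATION_THRESHOLD_MINUTES = 10  # hypothetical: simpler fixes get automated

repair_minutes_before = [5, 5, 5, 5, 120, 240]
repair_minutes_after = [m for m in repair_minutes_before
                        if m > AUTOMATION_THRESHOLD_MINUTES]

mttr_before = mean(repair_minutes_before)
mttr_after = mean(repair_minutes_after)

print(f"Incidents: {len(repair_minutes_before)} -> {len(repair_minutes_after)}")
print(f"Total repair time: {sum(repair_minutes_before)} -> "
      f"{sum(repair_minutes_after)} minutes")
print(f"MTTR: {mttr_before:.1f} -> {mttr_after:.1f} minutes")
```

In this example the incident count and the total repair time both fall while the average nearly triples, because only the hard problems are left in the sample. That is the kind of counterintuitive movement leadership needs to be ready to read correctly.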

Matt Eastwood (22:12):
That's really well said, Robert. I think I'd like to
close out this conversation. I'm going to start with you
Robert. And then I'll come back to Steven. Just a
view to the future. If you could both really just
take a swag or provide a little bit of perspective
on where do you think we go next? What's the future of AIOps

(22:34):
in IT?

Robert Barron (22:35):
So I hope, to a certain extent, and perversely perhaps,
that the AI goes away. That it's just... This is
the way we do it. Like we no longer say
digital computers. You just say computers. So AI needs to
be part of operations. It needs all the things that

(22:56):
are trivial to the computer, just have to be raised
up, bubbled up as insights to humans to make the difficult decisions.
And these decisions should as much as possible be the
operational ones and not the technical ones. Too often, I see, yes,
the AI is recommending that we do this, that or the other. Okay.

(23:19):
So why is it recommending? Just go and do it.
I'm not going to waste half an hour of figuring out
whether the recommendation is the right one or not. Let the
AI do whatever it wants to do to solve the problem.
And then come back to me telling me, "No, we
tried this, it didn't work. We tried this, it didn't work. Now it's up
to you, the human to think about the new things."

Steven Elliot (23:41):
Yeah, I think one point is just continued acceleration of adoption of AIOps.
We're going to find it really becoming more and more
of a necessary set of capabilities to drive that particular
business outcome that these organizations require. I think the second
piece is that we're going to find more data sources

(24:05):
coming into the fray where not just operational data or
application data, but certain types of business data, maybe it's
weather information, maybe it's supply chain, maybe it's trucking. Right,
there's just so many information sources that are going to
come into these models. And they're going to play a
role in really driving that tighter alignment between what the

(24:28):
business is experiencing and how the digital services are driving that.
And if there's a particular problem across that end-to-end
stack, where is it, and who needs to
be identified and brought into the fray. So, we're going
to see a compression of people, process, technology, business sources

(24:51):
and processes come together even more so than what we probably
ever have in the past.

Matt Eastwood (24:57):
So I think that's a perfect place to leave this
conversation. So I want to thank you, Robert and Steven,
for both joining us today.

Robert Barron (25:04):
Thank you. It was a pleasure being here.

Matt Eastwood (25:10):
Thank you for listening to our show, Scaling AIOps, Artificial
Intelligence for IT Specialists for Business Outcomes. Be sure to
go back and listen to any of our other episodes,
wherever you get your podcasts. Our aim here is to
provide useful information for IT professionals.

Steven Elliot (25:26):
The information we touched upon in this series is only the beginning
of the capabilities and conversation around AIOps and the future of
IT. If you want to learn more, then I suggest
you have a look at some of the research we've
done at idc.com.

Matt Eastwood (25:41):
I've been your host, Matt Eastwood.

Steven Elliot (25:43):
And I'm Steven Elliot.

Matt Eastwood (25:44):
And thank you for listening to Scaling AIOps, a joint
venture between IBM and IDC with production support from JAR Audio. It's been
a pleasure to have you all listen in, take care.