
July 2, 2025 39 mins

For many, AI success isn't limited by how many GPUs you can buy; it's limited by how fast those GPUs can talk to each other without tripping over the plumbing. In this episode of the AI Proving Ground Podcast, two of WWT's top networking minds, Justin van Schaik and Eric Fairfield, lay out the real choke points slowing AI projects to a crawl and how powerful, modernized network architectures are quietly rewriting the rulebook for scaling AI.

Support for this episode is provided by: Nokia

Learn more about this week's guests:

Justin van Schaik is a Technical Solutions Architect at World Wide Technology, specializing in High Performance Networking, AI and Open Networking. A seasoned technologist, he helps organizations design and deploy advanced infrastructure to support next-gen workloads at scale.

Justin's top pick: The Future of High Performance Networking: Ultra Ethernet Explained

Eric Fairfield is a Technical Solutions Architect at World Wide Technology with a passion for solving data center networking challenges. He specializes in Cisco ACI, VMware NSX and their integration. Outside of tech, Eric has spent over 30 years immersed in motorsports, working with F1, IndyCar, IMSA and other major racing organizations.

Eric's top pick: WWT at ONUG AI Networking Summit—Dallas 2025

The AI Proving Ground Podcast leverages the deep AI technical and business expertise from within World Wide Technology's one-of-a-kind AI Proving Ground, which provides unrivaled access to the world's leading AI technologies. This unique lab environment accelerates your ability to learn about, test, train and implement AI solutions.

Learn more about WWT's AI Proving Ground.

The AI Proving Ground is a composable lab environment that features the latest high-performance infrastructure and reference architectures from the world's leading AI companies, such as NVIDIA, Cisco, Dell, F5, AMD, Intel and others.

Developed within our Advanced Technology Center (ATC), this one-of-a-kind lab environment empowers IT teams to evaluate and test AI infrastructure, software and solutions for efficacy, scalability and flexibility — all under one roof. The AI Proving Ground provides visibility into data flows across the entire development pipeline, enabling more informed decision-making while safeguarding production environments.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
For many, AI success isn't limited by how many GPUs you can buy. It's limited by how fast those GPUs can talk to each other without tripping over the plumbing: the network. It can be the unsung hero or chief villain of your AI journey. Today, two of WWT's top networking minds, Justin van Schaik and Eric Fairfield, lay out the real choke points

(00:22):
slowing AI projects to a crawl. Things like bad multipathing, tail latency spikes and reliability gaps that turn training jobs into week-long sagas. They'll also explain why next-gen Ethernet, not exotic accelerators, is quietly rewriting the rulebook for scaling AI. By the end of this episode, you'll hear why the fastest path

(00:44):
to AI value might not start with another GPU purchase, but with a ruthless look at the wires, switches and software stitching it all together. This is the AI Proving Ground Podcast from World Wide Technology: everything AI, all in one place. Let's get to it.

(01:08):
Justin, Eric, thanks so much for joining us today on the AI Proving Ground podcast. How are you? Doing great? Fantastic. GPUs, accelerators and that kind of thing being what drives AI progress: tell me why it's the network, or networks, and not GPUs, that is the AI choke point for organizations that might be stumbling or stalling on their AI journeys.

Speaker 3 (01:34):
So I'm going to give you a sub-slice of that, Dan. The network: there's always a moving bottleneck. Something is always going to be the fastest part and something's going to be the slowest part. The challenge has been, with AI, you have to get the GPUs collectively speaking to each other, and that collective

(01:55):
action requires a lot of very high bandwidth, very low latency interconnectivity. Historically, any kind of flaw in the transport in the network, retransmits, dying transceivers, slow things like that, tends to have an inordinate impact. A 1% to 2% fail rate of transceivers can have a 60%

(02:18):
impact on the job completion time for a generative training run. So that's what we've been dealing with historically: we're just stepping through, trying to fix it every step of the way.
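
To make that transceiver math concrete, here is a minimal back-of-the-envelope simulation in Python. Every parameter below (link count, glitch probability, stall penalty) is an illustrative assumption rather than a measurement; the point is only that when every synchronized step waits on every link, a small population of marginal optics lands in the same ballpark as the figure cited above.

```python
# Back-of-the-envelope sketch (not a WWT model): how a small number of
# marginal transceivers inflates job completion time when every
# training step must wait on every link. All parameters are invented.
import random

random.seed(42)

LINKS = 512            # GPU-to-fabric links in the cluster (assumption)
STEPS = 10_000         # training iterations in the job (assumption)
FLAKY_FRACTION = 0.02  # 2% of transceivers are marginal
GLITCH_PROB = 0.005    # chance a marginal link glitches on a given step
STALL_FACTOR = 12      # a glitched step costs 12x while the collective
                       # waits out retransmits (assumption)

flaky_links = int(LINKS * FLAKY_FRACTION)

ideal = float(STEPS)   # step time normalized to 1.0
actual = 0.0
for _ in range(STEPS):
    # The step stalls if ANY marginal transceiver glitches, because the
    # collective cannot finish until every rank's data has arrived.
    glitched = any(random.random() < GLITCH_PROB for _ in range(flaky_links))
    actual += STALL_FACTOR if glitched else 1.0

print(f"extra job completion time: {actual / ideal - 1:.0%}")
```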

Speaker 1 (02:28):
Yeah, and real quick, I'll jump to you here, Eric, but first, Justin, why is that? You mentioned how GPUs need to talk collectively. Why do they need to talk collectively before they move forward?

Speaker 3 (02:40):
I'll use the human brain analogy. One brain cell ain't that smart. You have to have a few hundred thousand neurons interconnected to make a really powerful neural network that can solve these problems. One large monolithic GPU is not going to cover everything. You have to have several thousand working together, and

(03:02):
that's where the collective comes in.

Speaker 1 (03:03):
Yeah, no, absolutely. Eric, this is going to be a dumb question by design, but why can't we just plug these GPUs, or plug AI, so to speak, into the same network that email runs on, or any other enterprise application for that matter?

Speaker 2 (03:19):
Well, as Justin alluded to, these GPUs talk to each other collectively, and they have to come to agreement within a very specific time frame to be performant. And you don't want these GPUs having to contend with traditional traffic, right? Email, web surfing, March

(03:42):
Madness, all those kinds of things. You don't want to have to contend with that traffic; otherwise it's going to affect the job completion time very significantly.

Speaker 3 (03:52):
Yeah, anything impacting that communication has a huge impact, like you just said. To use a basic analogy, these things are essentially doing very high-dimensional math, but they're breaking it up into chunks. So, for example, they all say: okay, we've multiplied X by Y, now we need to carry the seven. Who has the seven? Oh crap, find the GPU with the seven.

(04:13):
He's running a little bit slower, so you wait until the seven shows up before somebody can carry that seven. That's a very dumbed-down concept, but that's essentially what's happening. Any GPU that slows down will slow the whole process.
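
A minimal sketch of that straggler effect, with made-up numbers: when steps are synchronized, the step time is the maximum over all GPUs rather than the average, so one slow rank sets the pace for the whole collective.

```python
# Minimal sketch of the "who has the seven" effect: a synchronized step
# finishes at the pace of the slowest GPU, so one laggard out of
# thousands drags the whole collective. Numbers are illustrative.
import random

random.seed(7)

GPUS = 1024
STEPS = 500

unsynced = 0.0  # average pace if no GPU ever waited on a peer
synced = 0.0    # pace when every step waits for the slowest rank
for _ in range(STEPS):
    # Per-GPU step times: tightly clustered, with a rare slow outlier.
    times = [random.gauss(1.0, 0.02) for _ in range(GPUS)]
    if random.random() < 0.05:                # occasional straggler
        times[random.randrange(GPUS)] *= 3.0  # one GPU runs 3x slow
    unsynced += sum(times) / GPUS
    synced += max(times)  # the collective waits for the laggard

print(f"synchronized / unsynchronized runtime: {synced / unsynced:.2f}x")
```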

Speaker 1 (04:28):
Yeah, and how have networks evolved over just the last three or four years, as generative AI has jumped onto the scene? What types of changes have been made in networking to enable that fast communication?

Speaker 2 (04:47):
I would say probably one of the biggest things is that bandwidth has been changing very frequently. It was just a few years ago that we were in shock that we were doing 100 gig in our data centers. Now the GPU networks are 400 gig, we're moving to 800 gig, and 1.6 terabit is in the very near

(05:11):
future. The amount of bandwidth that's being utilized is amazing, how that's changed and how Ethernet is having to change as a protocol. This is something Justin's been very involved with: Ultra Ethernet, right?

(05:32):
And how Ethernet's going to change its way of communicating to overcome some of the historic challenges of congestion management.

Speaker 1 (05:44):
Yeah, I do want to get into Ethernet and Ultra Ethernet, but just real quick. You mentioned how the needs are constantly rising, Eric. Is it a sustainable path forward, where we're going to be able to account for all that moving forward?

Speaker 2 (05:57):
I would say it's definitely sustainable.
But what people have to take into consideration is that as we're getting faster, the fiber needs are going to change. There's going to be a distinct transition from our typical one

(06:18):
pair of fiber to using cables like MPO-12 and MPO-16 cables to transfer that much data, and really even moving from multi-mode fiber to single-mode fiber. We're seeing this with 800 gig and 1.6 terabit: the movement towards single-mode fiber.

(06:38):
So it's sustainable. It's just that we have to change the way we're looking at our cabling structure, which is also going to work hand in hand with power and cooling delivery. Those are probably the biggest sustainability challenges right now: actually the power and cooling, more so than the network.

(07:01):
Exactly.

Speaker 3 (07:02):
Well, there's always going to be that moving target. One of the old rules of networking is that no matter how much bandwidth you put out there, something will consume it. But yes, we have a sustainable growth path for the actual bandwidth. Latency is about as low as it can get. Reliability has been vastly improving, and that's also what

(07:24):
Ultra Ethernet is talking about. But we've seen Ethernet modify itself many times over the years. I use voice over IP as a common analogy. When they first had voice over IP, we had PBXs, and that was what ran your phones. And then eventually they said we can do voice over IP, but you

(07:48):
have to have that entirely separate Ethernet network to run this. You can't converge it, because it's very sensitive and Ethernet will drop every packet and it's going to be horrible. And then eventually we figured out how to do the QoS properly. We figured out how to have the right bandwidth, all the proper adjustments. Now VoIP is ubiquitous. Ethernet evolves. The requirements that we're seeing here are, in one sense,

(08:09):
just the next iteration of every bit of growth we've had to deal with in networking, and, in another sense, some of the peculiarities are why they're actually going into the Ultra Ethernet Consortium to deep dive into some really granular aspects of the transport for AI that need to be updated, need to change, and that's irrespective of whether

(08:29):
you're running on Ethernet or InfiniBand. Some stuff has to be modified.

Speaker 1 (08:40):
Yeah, Justin, give us a little bit of that. Maybe dive a smidge deeper into the Ethernet, the InfiniBand, now Ultra Ethernet, the requirements. You don't care what it is?

Speaker 3 (09:07):
So if you look at AI as an application, just like we say voice is just an application, AI is Skynet.exe, and as long as you deliver what it needs, it doesn't care how you're delivering it. InfiniBand can do it, Ethernet can do it now, and we're moving

(09:29):
into Ultra Ethernet, where it's addressing those very specific problems, such as those retransmits once something drops. Currently, the go-fast juice that actually makes Ethernet or InfiniBand really, really fast is RDMA, remote direct memory access, and what that does is allow one machine to put

(09:50):
information directly resident in the memory of another machine without having it checked, similar to what we saw with the acceleration from X.25 to frame relay: it said we now trust the transport, we're just going to send it straight through. RDMA has always been very, very twitchy. One of the worst parts about it is that if you drop anything in sequence in RDMA, it does a go-back-N, so it rolls back

(10:14):
to the last known good, which means that any kind of successive retransmissions or glitchy things, even one dropped packet per session, can actually have a huge outsized impact on the job completion time. So RDMA itself is being rewritten in Ultra Ethernet. Not throwing out the baby with the bath water, but they are looking at things like

(10:37):
granular retransmits. So it only says: you dropped packet seven out of ten, please retransmit packet seven, instead of let's go back to zero and go through the whole thing again. There's a lot of other tweaks going in there. Retransmits in IP-based networks have traditionally been handled at the transport layer. They're moving all that down to the link level.

(11:00):
So it's actually happening in hardware now instead of in a software retransmit. There's no CPU interrupt to make that happen. It's not even going into a TCP offload engine; it is just hitting directly. And then there's a lot of other things in there, like quality of service. Instead of prioritizing a whole large elephant flow, in

(11:21):
generative AI you're going to have traffic that's highly varied. It is not a homogenous stream. Not every packet has to be delivered reliably. So now I can actually categorize into sub-levels: am I going to be doing reliable ordered delivery, reliable unordered

(11:43):
delivery, unreliable unordered? It essentially can classify every single packet in a workflow so that only the stuff that needs the highest priority will get sent that way. That frees us up a lot, because then we suddenly don't have to have every part of a flow between two GPUs, or between 1,000 GPUs or 10,000 GPUs, prioritized the same. It gives us a lot more leeway in what can and can't be done.

Speaker 1 (12:07):
And is that just enabling speed, or is it avoiding or mitigating some of those ripple effects that you mentioned earlier if something is not delivered reliably? What's the benefit of that?

Speaker 3 (12:19):
So let's roll back to the RDMA again really fast. If you look at a day in the life of a packet, the network crawl from point A to point B, we don't look at it from NIC to NIC anymore. We look at that process in terms of GPU to GPU. So you're going to have two 8-byte registers going into a

(12:39):
GPU; one 8-byte register will come out. That is a 64-bit flop. Now that goes out to the L1 cache, L2 cache, L3 cache, onto the PCI bus, over to the NIC, gets checked for everything along the way, and then finally hits the wire. RDMA removes all of that error checking and just sends it right out to the wire. And then on the other side, the other GPU does the exact same

(13:01):
thing, going back up the stack. So what you're doing here is not necessarily creating more bandwidth, not necessarily creating lower latency, in the sense that the speed of light is still the speed of light. We are minimizing the number of touches, the number of steps that we have to do in the middle in order to make sure it happens. So it's a question of efficiency, if you will, more

(13:23):
than just throwing more scale-up or scale-out at the problem. Does that make sense?
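
A toy tally of that "number of touches" argument. The stage names and microsecond figures below are invented placeholders, not measurements; they only illustrate that RDMA's win is fewer steps in the middle rather than a faster wire.

```python
# Toy tally of the "number of touches" point. Stage names and the
# microsecond figures are invented placeholders, not measurements; the
# shape of the comparison is what matters.
KERNEL_TCP_PATH = {
    "GPU cache hierarchy -> host memory copy": 2.0,
    "syscall + socket buffer copy": 3.0,
    "TCP/IP checksum, segmentation, error checks": 2.0,
    "NIC DMA + wire": 1.0,
    "receiver interrupt + kernel stack": 3.0,
    "socket -> application -> GPU memory": 2.0,
}
RDMA_PATH = {
    "GPU memory -> NIC (GPUDirect-style DMA)": 1.0,
    "wire": 1.0,
    "NIC -> peer GPU memory, no intermediate checks": 1.0,
}

for name, path in (("kernel TCP", KERNEL_TCP_PATH), ("RDMA", RDMA_PATH)):
    print(f"{name}: {len(path)} touches, ~{sum(path.values()):.0f} us total")
```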

Speaker 1 (13:28):
Yeah, no, absolutely. Eric, anything you would build on top of that? Or where do you see the delineation between Ethernet, Ultra Ethernet and InfiniBand, and how does that affect our clients or enterprise IT teams? Do you have to pick one and stay in that lane for a long time, or can you bounce back and forth, or can you even use a best-of-breed type of situation?

Speaker 2 (13:55):
So that is a fantastic question, because we get this all the time: Ethernet or InfiniBand? What's the best way to look at this? And there are a few things that we always have to keep in mind, and one of them is: how well do you know InfiniBand? There are a lot of customers that have deployed InfiniBand that really have challenges operationally, because they don't

(14:15):
have anyone that really knows the ins and outs of InfiniBand, how to troubleshoot it. Actually installing InfiniBand is very, very easy; it's very plug and play. But it goes off the rails when it comes to problems. As soon as you have that problem and you don't know how to troubleshoot it,

(14:35):
now you have to call the experts. Who are you going to call? There aren't a lot of InfiniBand experts out there, and the ones that are out there are rather busy dealing with other implementations, troubleshoots, stuff like that. So really you have to look at: is Ethernet good enough?

(14:56):
And that's something that we have shown in the AI Proving Ground time and time again now: in a lot of small networks we're able to deploy Ethernet and it's just as performant, if not more performant, than InfiniBand. And network

(15:17):
engineers can go in and troubleshoot it, because they already understand Ethernet operationally. So that's one of the biggest things we look at: what's your operational model? Can you even handle InfiniBand for when it goes sideways?

Speaker 1 (15:42):
Yeah, Eric, how is the need or rise of distributed architectures compounding that even more? Or is it?

Speaker 3 (15:42):
If by distributed you mean kind of the calico tapestry of networking, that's not necessarily going to be happening a whole lot with AI networks. They kind of require a homogenous transport. You're not going to be going from 100 gig to 10 gig to 400 gig to different QoSes.

(16:03):
It requires a fairly homogenous island of communication. You can connect that in with the rest of the network, but not at a spot where it's going to be sharing any kind of traffic with the AI itself.

Speaker 2 (16:16):
Yeah, what we are seeing when it comes to distributed architecture is that some of it will come down to edge applications, and that's not part of an overall AI training network

(16:40):
or an inferencing network; it's at the edge, because the performance needs to happen at that location instead of coming back to a data center, sharing with something collectively. So there is a distributed nature within AI for very specific use

(17:04):
cases.

Speaker 4 (17:07):
This episode is supported by Nokia.
Nokia helps you realize your digital potential with trusted, purpose-built IP, optical, fixed and data center solutions, providing superior performance and security and seamlessly integrating into any ecosystem. Nokia: pioneering networks that sense, think and act.

Speaker 1 (17:28):
Justin, I am curious. At least as of this recording, we just got out, relatively speaking, of Cisco Live, and we've had a bunch of other conferences, NVIDIA GTC, et cetera. Tell me more about this partnership that I'm hearing about between NVIDIA and Cisco to bring Cisco's networking and operating systems to NVIDIA's Spectrum-X ecosystem.

(17:50):
What does that all signal to the industry?

Speaker 3 (17:53):
So in terms of industry patterns, it's the best of both worlds, just to be perfectly candid here. A lot of the industry has been absolutely flocking to NVIDIA solutions because they're excellent, they're fast and they're powerful. But they're also the only solution on the market, and they have a tightly integrated vertical stack where everything

(18:13):
works together. Many of our customers, many customers in the industry, are looking to have some level of diversified risk around those suppliers, and if it's one great supplier for GPUs, they don't want to have that same great supplier for the network. So integrating Cisco into it gives them choice as well. Under the hood, what we're hoping to see is some level of

(18:36):
standardization on the best way to do it. Currently there are several thoughts around it. At the end of the day, it's: where do you want to reorder your packets? If you're going to use all of your bandwidth, you can spray packets out across the environment. There's going to be out-of-order delivery, but it still has to be delivered in order to that GPU on the other side. That's the RDMA way of things. If it's out of order, things go wrong.

(18:59):
So you reorder it either inside the network, or you reorder it at the edge on something like a DPU, a BlueField, a ConnectX-7 or a SuperNIC. And various partners have various methods for doing it. Cisco has one way, NVIDIA has another, Broadcom has another, Arista has another, Juniper has a different one.

(19:20):
Also, we are hoping to see, by combining some level of engineering expertise between these major vendors, that they'll start to come up with a best option, with other options available. Cisco uses DLB, NVIDIA uses Spectrum-X. The first stage is just allowing Cisco Nexus to participate in that environment, meaning at layer two

(19:43):
there's going to be a handshake between the SuperNIC and the switch in the middle that says: ah, you are a Spectrum-X compatible switch, therefore I will use you. That's the NVIDIA side. And then on the Cisco side, they just have to use some P4 programmability on the Silicon One chips to allow it to speak Spectrum-X, meaning: will it do adaptive routing in the middle?

(20:04):
Will it have some kind of congestion metering available as well? All these things are being factored in. So we're very excited about where it's going, hopeful that it will deliver everything that was expected. But of course, yes, write that comment down: of course they're going to be doing deliverables as expected.
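
A simplified sketch of the trade-off Justin describes, with hypothetical path delays: spray one flow's packets across all uplinks for bandwidth, then restore sequence in a reorder buffer at the receiving edge before handing the stream to order-sensitive RDMA.

```python
# Simplified sketch of packet spraying with edge reordering, using
# hypothetical path delays: packets of one flow fan out over all paths
# for bandwidth, then a reorder buffer at the receiving edge restores
# sequence before the data is handed to order-sensitive RDMA.
import heapq
import random

random.seed(3)

SEQS = list(range(20))   # sequence numbers of one flow

# Spray: each packet takes a path with its own (random) latency.
arrivals: list[tuple[float, int]] = []
for seq in SEQS:
    path_delay = random.uniform(1.0, 2.0)      # per-path skew
    heapq.heappush(arrivals, (seq * 0.05 + path_delay, seq))

# Edge reorder buffer: release packets upward only in sequence.
pending: set[int] = set()
delivered: list[int] = []
next_needed = 0
while arrivals:
    _, seq = heapq.heappop(arrivals)   # packets arrive out of order
    pending.add(seq)
    while next_needed in pending:      # drain any in-order run
        pending.remove(next_needed)
        delivered.append(next_needed)
        next_needed += 1

assert delivered == SEQS
print("in-order delivery despite per-packet spraying:", delivered)
```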

Speaker 2 (20:22):
Yeah, you know, to add on to that, I always look at it like this: one of the things that helped drive InfiniBand was the use of adaptive routing and SHARP together. And really what the Spectrum-X architecture is doing is taking those capabilities from InfiniBand and applying

(20:44):
them to the Ethernet world. So they've had that special sauce, and a lot of the other Ethernet vendors really didn't have a special sauce outside of implementing things like ECMP entropy tools, like Justin mentioned, DLB, dynamic load balancing, or something we

(21:05):
call flowlet switching, or packet spraying. There's a variety of ways around that, and I wrote a whole article around ECMP. And now Cisco, in this relationship, has brought a special sauce to their solution by being able to tie into the NVIDIA adaptive routing architecture.

(21:28):
So again, it's very exciting to see where this is going to take it, because this will give us the ability to look at Cisco in NVIDIA reference architectures as well. So very exciting.
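
For the flowlet idea Eric mentions, here is a minimal sketch with assumed numbers: re-path a flow only at an idle gap longer than the path-delay skew, so the flow can migrate to a less-loaded uplink without reordering its packets.

```python
# Minimal sketch of flowlet switching with assumed numbers: a flow may
# pick a new (least-loaded) uplink only at an idle gap longer than the
# path-delay skew, so re-pathing cannot reorder packets within the flow.
PATHS = 8
FLOWLET_GAP_US = 50.0   # gap that starts a new flowlet (assumption)

class FlowletSwitch:
    def __init__(self) -> None:
        self.last_seen_us: dict[tuple, float] = {}
        self.flowlet_path: dict[tuple, int] = {}
        self.path_load = [0] * PATHS

    def route(self, flow: tuple, now_us: float) -> int:
        gap = now_us - self.last_seen_us.get(flow, float("-inf"))
        if gap > FLOWLET_GAP_US:
            # New flowlet: safe to re-path, so take the least-loaded
            # uplink instead of a static ECMP hash.
            self.flowlet_path[flow] = min(
                range(PATHS), key=lambda p: self.path_load[p])
        self.last_seen_us[flow] = now_us
        path = self.flowlet_path[flow]
        self.path_load[path] += 1
        return path

# Usage: back-to-back packets stay put; a pause lets the flow migrate.
sw = FlowletSwitch()
flow = ("10.0.0.1", "10.0.0.2", 4791)   # hypothetical RoCE flow key
print(sw.route(flow, 0.0), sw.route(flow, 10.0))  # same path
print(sw.route(flow, 200.0))                      # may move after the gap
```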

Speaker 1 (21:45):
And maybe dive in a little bit deeper, Eric, on why that's exciting, and more so, why is that exciting for enterprise IT teams?

Speaker 2 (21:52):
Well, again, if we talk about some of the operational models that are out there, you have a lot of organizations that have standardized on Cisco switching. And by Cisco having this partnership with NVIDIA, they're going to be able to get into the NCP program for helping

(22:13):
build reference architectures. So a customer can now implement these AI networks with Cisco solutions, knowing that they're designed and accepted by NVIDIA, and they don't feel like they're having to go rogue.

(22:34):
So, again, it's going to allow us to tie into existing operational models and not have to worry about convincing them to move to a different switching platform to support this.

Speaker 3 (22:48):
And not just a different platform; in simplest terms, think of it as a different operating system. It doesn't really matter where you're standardized, but there's a huge pushback and a lot of institutional and technical inertia associated with bringing in an entirely different operating system. A lot of customers are very happy with their Cisco and their

(23:09):
CLIs. They do not want to start incorporating Cumulus or SONiC into the environment. It just complicates things. So this allows them to port their same data center-wide and enterprise-wide skill sets directly into the AI without having to go through that sharp learning curve. It's now another protocol to learn, a few more tweaks, a special-environment sort of thing: much more easily absorbed into their

(23:32):
operating models.

Speaker 1 (23:34):
Yeah, Justin, a couple of months ago, or it might even have been back in 2024, you wrote an article with a line that I really liked: quote unquote, science that needs to be verified. These systems need to be tested on real data bouncing between real servers and GPUs or storage or whatever it might be. You're talking again about having a new OS, a

(23:54):
new operating system. How would our clients, or any organization for that matter, start to verify that these systems would work within their own real-world settings?

Speaker 3 (24:05):
So it's the scientific method. First you define the problem: they have to define what they're trying to accomplish. Then you gather the information, you form a working theory, like: hey, Ethernet will work great, or we need Ultra Ethernet, or we need to run on Nexus, or we need to run somewhere else. And then you test it. The testing part has been the biggest challenge in the industry, really, because there is such a complex ecosystem with

(24:28):
a lot of very expensive hardware, and if you're going to make a $10 million or $100 million investment in a full-on AI architecture, you want to understand how it's going to perform before you do it. And that's why we actually built the AI Proving Ground, and that's a lot of the work that we have coming in: we want to see if what we think will happen will happen. Hence, science that needs to be verified.

Speaker 1 (24:51):
Yeah, I like that you mentioned the AI Proving Ground here, certainly the namesake for our podcast. Eric, can you explain a little bit better what the AI Proving Ground is and what it offers clients or organizations out there as it relates to testing, validating and proving out that they're on the right path for their AI journeys?

Speaker 2 (25:11):
Absolutely. The AI Proving Ground really gives a customer the ability to bring their ideas to reality. How can we test building out an AI system, hardware and software, and make this a reality to see if we can

(25:34):
even do this? How can we make the art of the possible happen, and what architecture is going to work best? Do we need to look at InfiniBand versus Ethernet? Do we need to look at Cisco versus Arista? What software is going to make the most sense? Do we run Slurm?

(25:56):
Do we run Run:ai, right, to do our orchestration? And the AI Proving Ground is the perfect place for them to do that because, one, we have the people to help build it. A lot of times the customers may not have the knowledge to do that; they don't have the lab; they don't have

(26:18):
the budget to buy all this just to see if it even works. And that's one of the biggest benefits of the AI Proving Ground: we have the people, we have the software, we have the resources and the relationships with our OEMs to make the art of the possible happen.

Speaker 3 (26:39):
And let's also look at time to delivery here. Because if you look at the vast majority of our customers, they have excellent internal teams, a lot of intelligence, and they've got their own labs. But the turnaround time on average is going to be like six months to a year to get hardware in, rack and stack it, get it all going. This is our focus, and we have this down

(27:02):
to a martial art, so we can actually bring it in very quickly. We can turn a one-year evaluation cycle into a three-month evaluation cycle. It helps them have the right information to make the right choices faster.

Speaker 1 (27:26):
Well, that vendor ecosystem figures to get even more complex. Every networking OEM seems to be touting AI-integrated offerings. How are you making sense of the marketplace, Justin? Is it just going to continue to expand and expand and expand, or are we going to see some consolidation or partnerships along the way?

Speaker 3 (27:42):
A bit of both. You're asking me to give you a prognostication of what the market's going to look like. We will see alliances, ongoing alliances, happening between, say, Cisco and NVIDIA, or similar kinds of connections where we can collaborate on this. We'll see some fracturing as well. We'll see new entrants, AMD, you know, starting to produce some good

(28:03):
performance. We've seen, like with DeepSeek, a bit of a disruptor there as they came up with a less resource-intensive but more efficient way of processing. The numbers were still tweaked a little bit when they published the data, but there are going to be disruptors that will change everything. There are going to be continuing alliances, and there are going to

(28:24):
be breakups every once in a while. To bring up the sordid past, we had VCE between VMware and Cisco and EMC, and they fractured, and the entire integrated stack became a broken stack. We will be seeing those things happen as well. We try to stay on top of it by collaborating tightly with our

(28:46):
partners, also keeping our fingers in the wind to see exactly what our customers are asking for. And we've had instances where what we've been hearing from our customers is significantly removed from the strategic direction of our partners, and that's where we talk to them and say: we are hearing different things than you are hearing; we should go through this.

Speaker 1 (29:09):
Yeah, well, understanding that it's a relatively chaotic landscape and that there are going to be changes in the market in the near term and long term: Eric, is there any advice or guidance that you would give to clients on how to handle that rapid pace of change? How should they look at the landscape and be able to advance

(29:31):
their organization forward, knowing that there could be changes coming down the line at any moment?

Speaker 2 (29:38):
I think one of the most important things, when we look at what we're doing, especially from an AI networking perspective, is that when you're making choices, you have to recognize that the systems are going to change rather quickly. What was the new shiny object is going to have lost its luster

(30:00):
after 18 months, 24 months, very easily. But you have to think about how this plays into the bigger picture, because as the systems change, it's possible to reuse some of that architecture somewhere else. And this comes down to, you know, a great example of the Ethernet versus InfiniBand

(30:21):
argument: as things change, can you put your InfiniBand network anywhere?

Speaker 3 (30:28):
No. Are you going to do file and print on InfiniBand? Probably not.

Speaker 2 (30:31):
Exactly. So what you have to do is think, again, big picture: as I go from, let's say, 400 gig to 800 gig, can I utilize this 400 gig in different areas of my network? What a lot of people don't think about is: when I make a

(30:56):
decision about my high-performance architecture, what do I need to do with it in 18 to 24 months? Is there a place for it somewhere else in the organization where it makes sense, instead of just a quick one-off?

Speaker 3 (31:10):
Yeah, when it comes to the enterprise, you want very much a top-down solution. It involves coordinated action at the C-suite level. You're going to be looking at the CFOs, and you're looking at the CapEx and the OpEx, but at the same time, they have to understand that their standard depreciation cycles of six years, seven years, are not going to apply.

(31:31):
They have to look at a one-to-two-year refresh cycle consistently. So they have to bring it in to function flawlessly and be able to tear it down and tear it out also flawlessly. Those are huge operational concerns. That's where the CEO has to be able to talk to the managing directors, who have to be able to talk to everybody else. So it has to be very much a top-down solution. Customers who

(31:52):
are doing a grassroots solution, or the historical data scientists sitting in a little island of performance, will find it becomes apparent very quickly that those do not scale. What did work has to change.

Speaker 1 (32:08):
Yeah, well, so far in this conversation, which has been fantastic, we've only talked about how the network can support or drive AI. I'm curious, let's flip that script here, Justin: where can AI really enhance and help accelerate or make the network more efficient? What types of use cases are we seeing in terms of applying AI to the network?

Speaker 3 (32:28):
So I mean, we've already seen a lot of machine learning inside networks. Whether you'd put it at a full AI level, I'm not certain, but we've had things like security, where they run a heuristic engine to look at patterns in the environment, learn the patterns of the environment, and

(32:50):
then notice when something is outside of that norm and apply a fix with varying levels of autonomy. They've had self-healing networks. So really, bringing AI into the network is nothing new at all. As we've improved the capacity for this, we have a few other choices now. Other partners have their own little AI networking engines

(33:11):
that run on a small L40S or something, and we can put that in. Then I can have things like natural voice interaction. I can say: hey, Siri, what does my network look like today? And the network will come back and say, oh well, blah, blah, blah, and it will tell you exactly where things are wrong, or: please drill down to where that performance issue is that you're watching in real time and brief me on it. It makes it easier to track down very complex environments and see where the problems are and how to fix them.
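
A minimal sketch of that learn-the-baseline idea, with synthetic data: a rolling z-score over an interface error counter stands in for the heuristic engines Justin describes; the window, threshold and data are arbitrary.

```python
# Minimal sketch of the learn-the-baseline idea: a rolling z-score over
# an interface error counter stands in for the heuristic engines Justin
# describes. Window, threshold and the synthetic data are arbitrary.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(samples, window=30, threshold=4.0):
    """Yield (index, value) samples that deviate from the learned norm."""
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield i, value
                continue   # keep the outlier out of the baseline
        history.append(value)

# Synthetic counter: quiet baseline, then a burst from a dying optic.
samples = [2.0 + 0.1 * (i % 5) for i in range(100)]
samples[60] = 40.0
print(list(detect_anomalies(samples)))   # -> [(60, 40.0)]
```
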
Yeah, eric, any other AI use?

Speaker 2 (33:46):
cases on the network that you foresee either coming
soon or maybe in the distantfuture.
I would say the big thing isoperational change.
Right Again, how do we make iteasier to troubleshoot something
, how do we make it easier tomanage the network?
And I think one of the biggestthings that it's going to drive
is more appropriate data lakedesign.

(34:07):
Right, where is the information?
How can we access thatinformation?
Is really going to drive thesedifferent AI for network
operations discussions right,you already see it happening.
You know Juniper, with theirMIS product, has a fantastic AI

(34:29):
agent built into it.
Cisco, as you know, recentlyannounced that Cisco live.
They're looking to have a, anAI, uh, enablement across their
entire platform.
Right, to make it easier tolook at the network holistically
.
So that the big thing there, Ithink, is we're going to see

(34:50):
better telemetry andobservability data lakes occur,
and that's going to beabsolutely huge and that also
pushes it out to the, the um,back to the edge again.

Speaker 3 (35:03):
You're going back to that point because, again, if you look at the data gravity, you have edge compute. I use cars as an example: Waymo self-driving cars. They have to have a certain level of autonomy. They can make decisions themselves in real time based on the environmental inputs, and they do, but they also need to have some level of coordination with the mothership to ensure

(35:24):
that they have the latest data. That could be either a dispatch to tell them where to go, or it could be a real-time understanding of traffic patterns in the city so they know where to avoid congestion versus driving straight into it. And then they have to worry about backhauling their own data back to the mothership so that the data center actually has an updated understanding of what's going on.

(35:45):
They also have to be able to function completely autonomously without a network. Without a network, they have local city maps that they can continue to apply even when the connection goes down, so that they can still navigate the streets even if they don't have a real-time understanding.

(36:05):
So it's very collaborative, and AI will then help us analyze exactly what data has to live resident in that car, what data has to be easily fetchable, what data can wait for a couple of days. It goes like that. The ultimate evolution of this is going to be something like: I summon my Waymo to my house to pick me up, to

(36:29):
take me to work. The Waymo shows up and says: good morning, Justin. I see from your Fitbit that you did not sleep very well last night. I've taken the liberty of ordering your favorite latte at Starbucks, and we'll stop through for that. However, it also seems like your heart rate's a little bit elevated, and your cardiologist has suggested you watch that, so we might cut back

(36:49):
on the extra shots of espresso today. If you think about all of that little conversation, there is a pre-existing HIPAA agreement in place to be able to pre-fetch. So when the car is being summoned to you, it has pre-fetched your actual medical data, it has accessed your Fitbit to see exactly what's coming through, it has brought it all in, and the car has a

(37:11):
localized understanding as well as more of a direction as to what to do with it. That's the Shangri-La, and yes, I know there are tons of ethical considerations around that level of integration, the privacy concerns, but that's theoretical. That's the art of the possible, and AI will help us determine, again, exactly how to make those decisions: where to put the data, how to move it faster.

Speaker 1 (37:33):
Yeah, no, absolutely. That's fantastic; that's a great analogy. Well, we are running short on time, so I do want to cut it short here. Justin, Eric, thank you so much for taking time out of your day to join us on the AI Proving Ground podcast. Hopefully we'll have you back soon. Thank you very much.

(38:07):
All right, thanks again. First, start with the flow of data, and then decide whether to add more GPUs. Second, AI belongs inside the network, not just on top of it. From anomaly-hunting security engines to conversational network co-pilots, embedding machine learning where packets live turns troubleshooting from a war room into a quick chat. And third, telemetry is the new gold: rich, well-designed data

(38:31):
lakes and the observability pipelines that feed them let ops teams shift from reacting to predicting, whether the endpoint is a data center switch or an autonomous car at the curb. The bottom line is: if you want AI that scales, start by asking how fast your GPUs can talk, not how fast they can think. The network is the heartbeat of every model you'll build next.

(38:54):
If you liked this episode of the AI Proving Ground Podcast, please consider sharing it with friends and colleagues and leaving a rating or review. And don't forget to subscribe on your favorite podcast platform or on WWT.com. This episode was co-produced by Naz Baker, Cara Kuhn, Mallory Schaffran and Stephanie Hammond. Our audio and video engineer is John Knobloch.

(39:15):
My name is Brian Felt.
We'll see you next time.