Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
This is the Art of Network Engineering, where technology meets the human side of IT. Whether you're scaling networks, solving problems or shaping your career, we've got the insights, stories and tips to keep you ahead in the ever-evolving world of networking.
Welcome to the Art of Network Engineering podcast. My name is Andy Lapteff, and I'm here to tell you that InfiniBand
(00:20):
will always be the standard for AI and HPC workloads, and Ethernet will never support the lossless fabrics these workloads require. I mean, it's obvious, guys, right? Just look at the market share that InfiniBand has in AI and HPC workloads. I was telling our guest Jay before the episode that I was down at the Frontier supercomputer last year. It's the world's fastest supercomputer, at least it was
(00:42):
then, and they are running all kinds of flavors of InfiniBand. There's a little bit of Ethernet for management, but I don't know. If I look at the market, InfiniBand seems to have it, and people like to say don't bet against Ethernet, but I'm just not feeling it. So I have brought some luminaries in here. Mike Bushong, we all know Mike. How you doing?
I'm doing well, good to be here.
Speaker 3 (01:12):
And Jay Metz, or Dr Jay, as I've been instructed by his public relations firm to call him. Jay, why don't you tell the folks who you are quickly, what you do, and why you might be offended by what I said about Ethernet?
Oh, offended is not the right word. Pleasantly amused, I'd have to say. So my name is Jay Metz. I am a technical director at AMD. I am the chair of the steering committee for Ultra Ethernet, which
(01:32):
is an organization that is pulling together to create a tuned Ethernet for AI and HPC.
Speaker 1 (01:39):
Let's get to the problem, right? What's wrong with Ethernet? I mean, I've had people tell me, well, RoCE v2, and there's all kinds of tweaking and playing you can do, and you need special cards plugged into servers to do that stuff. Is it true that Ethernet out of the box, without a ton of work and customization and nerd-knob turning, cannot create a lossless fabric that can support modern AI and HPC workloads?
(02:01):
Is that a true statement?
Speaker 3 (02:04):
I think probably that would be a true statement. So let's define our terms just a little bit, though, because knowing where you are in the system makes a big difference as to whether we're going to be comparing apples to apples or something else. Generally speaking, when we talk about these kinds of workloads, there are effectively three different types of networks. You've got your general all-purpose network, where
(02:28):
you've got your LAN traffic, your WAN traffic, oftentimes your storage traffic, and then you've got the actual AI or HPC network, which is a back-end network, and that back-end network actually can be broken down into two different types of networks as well, which we call scale-up and scale-out. So that gets a little confusing for some people. So if you'll give me two seconds, I can try to explain
(02:50):
what it is, effectively, and we'll use primarily AI for the examples here, because it winds up being a little easier for people to grok than the specific use cases for a particular HPC kind of thing. Generally speaking, people are more familiar with the ChatGPTs and that kind of stuff. So when I say scale up and I say scale out, what I'm really
(03:10):
referring to is the fact that I've got these accelerators, GPUs, TPUs and so on, and I have to put in all the data to get the work done. Sometimes I have to get these things to communicate together, and I spread out all of these different communications. I send the workload from one to another, to another, to another,
(03:30):
and I call that scale out. And so, ultimately, what that really means is that the larger these models wind up getting, or the more work that has to get done, you want to put more GPUs into the system, so you have to connect them together, so I scale that out. Sometimes the workloads themselves are pretty big and you can't fit them inside of a single GPU's memory, and so I need to put multiple GPUs together and pretend it's one
(03:53):
huge honking GPU. In order to do that, I've got to network those little suckers together. I call that a scale-up network. So my scale-up makes one big, huge honking GPU, and a scale-out network connects all those big, huge honking GPUs together. And so what we're doing at UEC is to create the scale-out network that creates the interconnection between all these big, huge honking GPUs, and that's the difference
(04:14):
between what we're trying to do. We're not talking about your general-purpose data center, you know, connect your home directories to your laptop kind of a thing. That's not the kind of stuff that we're talking about. We're talking specifically about a purpose-built network for a type of workload. It just so happens that AI is a type of workload with many
(04:37):
subtypes and HPC is a type of workload with many subtypes, and they all have their own little requirements. But we're working specifically on interconnecting a lot of these different back-end GPU or TPU accelerators together.
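To make the scale-up versus scale-out arithmetic concrete, here is a minimal sketch in Python; the domain sizes are hypothetical examples, not figures quoted in the episode.

```python
# Hypothetical sizing sketch: a scale-up domain glues a handful of GPUs into one
# "big honking GPU"; the scale-out fabric then interconnects many such domains.

def total_accelerators(gpus_per_scale_up_domain: int, scale_out_domains: int) -> int:
    """Total GPUs reachable across the back-end network."""
    return gpus_per_scale_up_domain * scale_out_domains

# Example: 8-GPU scale-up domains (a common server building block) joined by a
# scale-out fabric of 125,000 domains reaches the million-endpoint scale the
# conversation mentions. These specific numbers are illustrative only.
if __name__ == "__main__":
    print(total_accelerators(gpus_per_scale_up_domain=8, scale_out_domains=125_000))  # 1000000
```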
Speaker 1 (04:49):
And I was surprised to learn that AI has been around, what, 60 years or something like that? I mean, I thought it was a new technology when the whole ChatGPT thing happened, like, oh my God, look at this, this is amazing. But I guess what I'm surprised at, and maybe it's just because of the explosion of LLMs now, is that Ethernet's been around a long time, AI technology, I guess, has been around a long time, but now we seem to be at a crossroads of
(05:11):
like, uh-oh. I mean, you know, the UEC was created to try to shore up some of the shortcomings of Ethernet so it could support everything you just described.
Speaker 3 (05:22):
Is that fair?
Yeah, well, I mean, like everything else, there's scaffolding involved, right? There's the ability and the approach to solving particular problems, but the AI consideration is old, I mean really, really old, and it also has been frighteningly prescient. I mean, if you've ever seen the movie Colossus, you know, the
(05:43):
Forbin Project, one of my favorite movies. By modern standards it's very slow and plodding, but it is one of the scariest movies I've ever seen in my life, because it is so accurate, and I think that it's one of those things that we should probably be taking as a moral lesson and an ethical lesson in the work that we do. I know I certainly take it with me whenever I go into the conversations that we've got about the unbridled passion for
(06:06):
the halcyon days in the future. But the concept of the AI and how it works has been around for quite a while. What we are trying to do here is actually go underneath that workload process, though, right? We're trying to understand the infrastructure requirements to make that happen, hopefully in a positive way. Because it turns out that when you start to
(06:27):
look at how these messages get passed back and forth across a network, from one device to another device, one endpoint to another endpoint, you start to realize very quickly that the amount of nuance and the variability in all of these different functions is very difficult to control, right? So you have to create this sense of flexibility while at the
(06:50):
same time creating rigid boundaries inside that allow these kinds of traffic flows to happen unimpeded. And that's where things can get really complicated. But at the same time, what you want to do is create an environment on the network that allows for the rapid free flow of information, quick fixes when there's a problem, and
(07:12):
information about telemetry that allows the devices to handle the problems that happen, when they happen, without the need for manual intervention. If all that kind of stuff sounds like the old self-healing networks from the past, there are some parts of that that are probably in there, and I think some of us have the battle scars from that.
(07:32):
But the principle is there. You want to use the proper telemetry to get the proper tuning up and down the Ethernet stack to get the workload to work properly when you're talking about really large numbers of devices.
Speaker 2 (07:45):
So on the Ethernet side, I mean, do you see Ethernet versus InfiniBand... is Ethernet adding stuff to try to, I guess, get to parity with InfiniBand, or do you think that these are fairly decoupled technology streams? They'll kind of overlap in different areas, but they're going to pursue their own ends because they can do fundamentally different things.
Speaker 3 (08:06):
Well, in some ways they solve the same problem and in some ways they don't, right? So one of the things that happens with InfiniBand... and let me just say, I mean, I don't want to sit here with a cup of coffee and a sign that says "change my mind." I'll leave Andy that role. I like InfiniBand, I do. I like the technology. I like the way that it has been the
(08:27):
gold standard for high-performance networking for a very long period of time, and it solves a problem in a very good way. I am not looking to beat or kill InfiniBand by any stretch of the imagination. As a matter of fact, I think for the end user, for the consumer, having options for whatever tool they need to use, for whatever job they have to complete, is going to be in their best interest. So I'm not looking to kill or defeat anybody in any way, right?
(08:50):
Having said that, I do know that there are approaches that have natural limitations, right? So what we're looking to do is we're trying to solve a problem that has emerged over the last couple of years, not just for AI but also for HPC, where the number of devices is just growing by a factor.
(09:10):
So we were talking about 10,000 devices; now we're talking about 100,000 devices, 250,000 devices, because a device now is a different thing, right? We used to talk about initiators and targets and endpoints and cards and switches. We're not talking about those anymore, because every initiator, every target, every switch has multiple endpoints on it.
(09:33):
Every GPU has multiple endpoints on it. I mean, all these things are exploding because we're moving the trust boundary of what constitutes an endpoint further and further and further into the processing core. Now, that means that we have to rethink the entire end-to-end solution, and if I'm going to be talking about a million endpoints, which is what we've been doing for Ultra Ethernet, I
(09:54):
have to make sure that those million endpoints are being treated equally across a network that is also being treated equally. Now, that means that the problems that we've been trying to solve with RoCE, the traditional RDMA, the stuff that InfiniBand does, have a difficult time trying to get to that level. I mean, InfiniBand, for example, has a 16-bit LID, which means
(10:16):
that it only has about 48,000 devices or endpoints that you can put into a single subnet, which means you have to create multiple subnets to route between them, which creates questions of latency and topologies and that sort of stuff. That's fine, there's nothing wrong with it, but it does mean that there are some people who want to take the opportunity of
(10:39):
saying, hey, look, I want to try to do this with a different type of an approach, and Ultra Ethernet is the way that's doing that, in an open-ecosystem kind of a way that allows people to say, hey, look, I can build upon Ethernet, I don't have to create a brand new proprietary network, I can stand on the shoulders of giants, I don't have to modify the things that people already understand.
(10:59):
But I can do the tuning that goes up and down the stack that would otherwise be considered a layer violation and verboten, right? So Ultra Ethernet allows us to take the opportunity to solve some of the problems of scale, scope, expanse and equivalent treatment of the traffic in those kinds of environments for those types of workloads, and so that's one of the reasons why
(11:21):
we're trying to say, look, we're just taking a different approach to solve problems that have emerged as the scales have gotten bigger. Not to say that, you know, InfiniBand is necessarily our target.
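A quick back-of-the-envelope sketch of the addressing math behind that point; the usable-LID count is the approximate 48,000 figure quoted above, and the rest is simple arithmetic.

```python
import math

# 16-bit LID space; roughly 48,000 of it is usable for endpoints in one subnet,
# per the figure quoted in the conversation (the remainder is reserved/multicast).
LID_BITS = 16
USABLE_LIDS_PER_SUBNET = 48_000  # approximate, as cited above

def subnets_needed(target_endpoints: int) -> int:
    """How many routed subnets a 16-bit-LID fabric needs to reach this many endpoints."""
    return math.ceil(target_endpoints / USABLE_LIDS_PER_SUBNET)

print(2 ** LID_BITS)              # 65536 raw LID values
print(subnets_needed(1_000_000))  # ~21 subnets to cover a million endpoints
```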
Speaker 1 (11:30):
Did you say one million endpoints is the goal? That's how many endpoints you're going for? Dr Evil: one million?
Speaker 3 (11:38):
Endpoints is our starting point. And the reality of it is that now I'm starting to wonder if that was too small, right? So let me try to explain a little bit why I'm saying that. All right, so, from a practical, nuts-and-bolts, rubber-to-the-road perspective, these models... and we'll talk about large language models, and believe me, those aren't the only kind of models that are in play here.
(11:59):
Right, you know, large language models are not the same kind of models that you see with the video or the audio that's out there. That's a completely different type of AI in terms of the way that the infrastructure works. But we're not even talking about that. We're just talking about good old-fashioned, two-years-ago LLMs. All right, if I look at the number of parameters in an LLM, it started off with 70 billion parameters.
(12:20):
Could I fit that onto a laptop? Realistically? Small models, right. But I could definitely do it inside of a server. If I move it up a little bit, if I move it from 70 to, let's say, 200, that's a little bit different. Now I'm talking about a small cluster. If I go to 405, that's a little bit more, because I've got to fit a model into the memory of these accelerators, and they don't fit. It doesn't matter which GPU manufacturer you're talking about, they don't fit. You've got to be able to make these things work together. But the larger the models are, the more you have to do this swapping in and out of memory. That means that the network has to be good enough to handle that swap of data in and out, which also means that the knock-on effect of this is pretty significant as well.
(13:02):
Right? So I've got my storage network, which is usually on my front-end network, nowhere near my back-end network. My front-end network now has to do a lot of swaps in and out between my back-end network and my front-end network. So now I've got to accommodate storage in a persistent fashion which gets closer to the memory. Memory and storage are colliding, are becoming very similar. All of these things are happening at the same time.
(13:24):
They're all going at the same time and they all have to be accommodated at the same time. And so what happens if I get to a point where I've got a trillion-parameter model? And we're right around the corner from a trillion-parameter model, right? We've already seen people talking. Llama just came out, you know, the new Llama paper came out with a 405B, right, and that's already on its way to being an older version of the LLMs. From the conversations that I've been
(13:45):
hearing, people are talking about what they want: massive amounts of data to train, huge amounts of data to train. But you're going to have to put it somewhere. You're going to have to move it somehow, because each of these different GPUs has to be able to have all of that information when it needs it, where it needs it, at the time that it needs it, reliably and safely, so that you don't lose data and you
(14:05):
don't lose precious nanoseconds in trying to move the bits around. That's what we're trying to accomplish, and the way that we are addressing it in Ultra Ethernet is that we're looking and saying, hey, look, we've got a lot of flexibility in all of these different places in the stack. Right, we've got a standardized approach to solving problems up into the workloads and down into the hardware, and that
(14:28):
interface is the network. The network is the compute process for what we're looking to accomplish here. And so all of that means that whether we're talking about AI workloads for LLMs or video or audio, or HPC environments, you can tune the Ethernet based upon the semantics of that requirement in an Ultra Ethernet environment to get the best performance that you need at that particular
(14:49):
time. And that's the goal that we're trying to achieve, because those are going to require a lot of devices to be able to move that data around.
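To put rough, hedged numbers behind the "models don't fit in one GPU" point, here is an illustrative sketch of the memory math: parameter count times bytes per parameter against per-GPU memory gives a floor on accelerator count for the weights alone. The 80 GiB memory size and fp16 weights are assumptions for illustration, and real deployments also need room for activations, optimizer state and KV cache.

```python
import math

def min_gpus_for_weights(params_billions: float,
                         bytes_per_param: int = 2,      # fp16/bf16 weights (assumed)
                         gpu_mem_gib: int = 80) -> int:  # assumed per-GPU memory
    """Lower bound on GPUs needed just to hold the weights of a model."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    gpu_bytes = gpu_mem_gib * 2**30
    return math.ceil(weight_bytes / gpu_bytes)

# Illustrative only: 70B roughly fits in a couple of accelerators, 405B needs a
# small cluster, and a trillion-parameter model needs dozens of GPUs before you
# even count activations, optimizer state or the swapping traffic described above.
for size in (70, 200, 405, 1000):
    print(size, "B params ->", min_gpus_for_weights(size), "GPUs minimum")
```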
Speaker 2 (14:55):
I had a couple of, I guess, questions around that. So, I guess, with what UEC is doing, and spoiler alert, I actually think UEC will be successful, there are kind of two things you've mentioned, and you somewhat cleverly mentioned it, by the way. You've got a bunch
(15:16):
of technical challenges, and you sort of dropped in the open ecosystem comment, and I think people might've missed that, by the way. Do you think that UEC succeeds primarily on the back of the technical capabilities or on the back of the open ecosystem part? Like, I know it's kind of a 1A, 1B thing.
(15:37):
What do you attribute the momentum to? Because I do think there is momentum in the industry around it. But I think, depending on what you're trying to do, you might look at one thing or the other thing and kind of value it more.
Speaker 3 (15:49):
Well, in my position as the chair, I don't have the luxury of second-guessing people's motivations. All of the major graphics processing unit manufacturers are involved and participating, and we're happy to have all of them, and I think that
(16:12):
their own particular motivations are ultimately to solve a particular problem. And the truth of the matter is that it's not really fair to assume any one particular person or company or group is going to be able to come up with an answer that's going to solve everybody's problem, right? That's an incredible burden to place on somebody in the first place, regardless of how you feel about it.
(16:32):
I do believe that the best approach to solving a problem comes when you get an emergence of a similar way of thinking about solving the problem, right? The good teams, the good squads, the good approaches to addressing an issue never come through based upon political
(17:00):
strife. At the same time, healthy competition is always useful, right? Competition of ideas, competition of dissent, competition of, you know, approaches: all of these things are ultimately going to be good if the end goal is clear in mind. And I do think that having an open ecosystem where people can feel free to contribute, because they've got a vested interest in
(17:20):
that outcome, they can have a voice and they don't have to say, well, I'm going to have to take whatever I get. That's going to be a good thing as well, and I think that's why we have as many different companies inside of Ultra Ethernet as we do, including all of the players that you'd probably expect would want to have proprietary systems.
Speaker 2 (17:37):
I actually think the open ecosystem piece is probably the biggest part of UEC, because we're probably at the first time in networking in the last 25 years where the future is not particularly certain, right? I mean, we were on a 20-year path where it's like, you know, you're going to double the capacity, yeah, right. And even if you look at the major network
(17:58):
silicon providers, right, whether it's custom silicon, merchant silicon, whatever, I mean, there's a pretty well-worn roadmap where everyone's trying to hit specific targets based on what obviously comes next. I think when you get into AI, the explosion of endpoints, I don't think people really understand the order of magnitude we're talking about and the size of these data
(18:20):
centers. To give people just a little bit of background: a large data center, and let's take the US out, because the US is a little bit weird, but in other geographies, you know, a 20-megawatt data center is a good-sized data center, right? A 60-megawatt data center, a hundred-megawatt data center, is a massive data center.
(18:40):
People are building gigawatt data centers now. When you said it's, you know, an order of magnitude, I mean, people need to get their heads wrapped around that. It's huge. And when you do something that's that much bigger and, frankly, around a technology where it's not obvious what comes next, I mean, the rate of change is crazy.
(19:00):
And so the open ecosystem, to me, represents optionality. It's like, you know, if you go all-in on one particular direction, ignoring whether it's a single supplier or whatever, right, you're limited to whatever that direction can provide. I think when you go in and you say we're going to allow people to compete with their ideas, to, you know, maybe be a little
(19:22):
bit speculative in their approaches, I think what that does is it provides optionality, and I think that's the value. And then obviously there are some benefits that come out of competition. Right, I mean, the number one driver of economic advantage is competition. When people are forced to compete, what you see is a level playing field, and then people have to step up, and I think that's good for everybody.
(19:44):
But that open ecosystem piece, I don't know. I think if you did all the technology bits but you didn't have the open ecosystem piece, I think the value prop is more than halved. That open ecosystem, to me, that's where UEC really shines.
Speaker 3 (20:00):
Yeah, and I'm certainly not going to disagree with you there, not just because of the fact that I agree with you. I do think that history has borne you out. There are technologies that I have worked in and still continue to love, but the fewer the players that are involved, you know, eventually you wind up with the
(20:21):
last buggy whip maker, right? And, quite frankly, even if it's a really good buggy whip, you know, you still need the buggies. There's always going to be room for innovation in that regard. So I think that, ultimately, we want to make sure that there is an encouragement of this kind of openness.
(20:43):
And actually, it's a really good point, Mike: I think we need to identify what openness means. Okay, openness is one of those terms that is so overloaded now that I don't think people actually quite get what it's supposed to be. Openness is the ability for anybody who has a vested interest to participate. That's what openness means. It doesn't mean that you are just given things, and it doesn't
(21:04):
mean that you can just throw things out and then people have to take it. It means that there's a marketplace of ideas, where you get the opportunity, the chance, to take a broad stroke of your peers and try to persuade them that your idea is a good one. And that is what open really means. Not necessarily that you are going to have to take
(21:25):
something or that you have to give something. It is all about the ability to get that opportunity to put your idea out there and to accept other people's ideas in that marketplace.
Speaker 2 (21:34):
How do you keep this from turning into... I mean, the standards bodies, I think, move at a fairly glacial pace, you know, and AI, I don't think, will tolerate that. And so the question really is: how do you give everyone the opportunity? They compete. Everyone has their own sort of perspectives and, in some cases, interests.
(21:55):
How do you prevent, like, an ecosystem-type environment from essentially devolving to, you know, the lowest common denominator, so this stuff arrives, you know, Sunday after never? How do you handle that?
Speaker 3 (22:11):
That's a very fair question. It's a combination, right? So we talk about the technology, but what you're asking about is the people, right? So the thing is that you need to have the combination of the technology and the ideas to put forth with the people, who can't have a herd-of-cats mentality. They've got to be able to have a vision that they can all agree
(22:32):
to and subscribe to. And then there are guidelines and boundaries that we place in play to allow the companies to do the work that they need to do. You know, mediation is often the art of pissing everybody off equally, right? So if you want to create an equal playing field
(22:54):
for everybody, nobody winds up being super thrilled, because they can't get everything they want. And that takes communication, that takes constant, you know, negotiation and persuasion. That takes, you know, the ability for somebody to come in and say, look, these are the rules that we've all agreed to play with. And then you have the referees to be able to do that, and I
(23:15):
will say that UEC is actually very good at doing that. We've got very strong leadership. Not every standards body does, but we have a very strong group of companies and a very strong group of people who are actually in the technical advisory committee, which is the technical arm for the steering committee, and we have a very
(23:35):
good set of chairs for each of these different work groups who are dedicated to the cause. And so we have a series of checks and balances that go on in the organization, that are built in from day one, so that people know exactly what to do when they need to do it, and a long, long list of preparation in order to be able to do that.
(23:56):
It's one of those things that is the unsung part of, you know, any standards organization. It's all the parts where, well, what do I need to know? What do I have to do next? Right, who do I have to talk to? Knowing that in advance has solved a lot of problems before they even came up, and then you can actually do the technical stuff. It's knowing the right pit crew for that race.
Speaker 1 (24:14):
And there hasn't been, like, infighting or personality disorders coming to light, right? Like when you get a group of people... I've talked to Russ White and Radia Perlman in the past. I remember Radia going on, I don't know if it was about Spanning Tree, but she told a really interesting story to me about
(24:35):
just, you know, the egos in the room and how certain people wanted certain things done. So the open ecosystem, to Mike's point, is amazing, right? But I can't imagine all these networking vendors agreeing on anything.
They don't have a choice, though, right?
Speaker 2 (24:48):
So I think what's driving this is different than what you see in some of the protocols work. The protocols work initially actually got off the ground pretty quick. Initially, the standards bodies didn't move glacially; they moved that way when it started getting into some of the more advanced stuff that was maybe more speculative, a little bit more niche. I think there's forced adoption that's happening when
(25:09):
you have this kind of a catalyzing event: someone's going to show up with a solution. If you spend all of your time fighting, no one shows up with a solution, and the person who goes around the end sort of wins out. Honestly, if you look at just the amount of money that's being spent, I mean, there are strong commercial
(25:31):
reasons for people to kind of figure it out together. I don't think that the market will wait, you know. And so either people come together and figure it out or they don't. And if you look at kind of the folks in UEC, you can't have that many people involved and have them all in a dominant, incumbent position. So almost by definition, by volume of people,
(25:54):
it's in most people's best interest to work it out so they can sort of get a place at the table. I think that's the thing that's different than some of the standards work over the last 15 years or so, and that's why I think... I'm sorry to interrupt. Go ahead.
Speaker 3 (26:10):
Please continue. No, it's okay, go for it. I think you're absolutely right, because when you look at the scope of the problem, we were talking about tuning from the physical layer all the way up to the software, and with that sheer scope of touchpoints, no one company can do it all. They just can't. You can't do it all. I mean, there is so much stuff that has to go on that
(26:31):
the barrier to entry for any new company is immense. The barrier to entry for existing companies is immense. There's a reason why companies who have never been part of standards bodies are now part of Ultra Ethernet: it's because it's that big of an issue, right? And so it's not about the iteration of where you are now, it's
(26:52):
what are you going to be putting out that comes out to compete with what you're putting out now? And everybody who is a part of UEC, at some level in the back of their head (I believe; I'm not a good mind reader, so you're going to have to take this with a grain of salt), everybody believes that if they're going to be competitive, they've got to be able to find who their allies are going to be
(27:13):
on the technology level to work with, both above and below the OSI layer that they're working on, and they can't invent every single nerd knob, as Andy was saying. You've got to work together in order for this to come together, or you might as well just pack up and go home, I think.
Speaker 2 (27:29):
Well, can you, for folks who aren't, I guess, familiar with who's in UEC (you don't have to go through the names of folks), but, you know, Ethernet's in the name, yet it's not like it's just a bunch of networking vendors; as you mentioned, it's that full stack. Can you give people a mental model for how expansive an
(27:53):
effort this is? Because I do think this is pretty unique. I mean, the amount of technology that's represented by this group is crazy.
Speaker 3 (28:03):
Okay, so let's just take a very basic infrastructure mental model here. You've got two devices that are connected through a switch. Well, if you're running a workload, you're going to have to have the software, right? So you have to have the software interface into the network, right? So we've got to have the software on one device, and you have to have a similar software interface on the other side. And then you've got to have the actual ability to identify how
(28:26):
you're going to formulate those packets. So you're going to have to be able to understand, in the network architecture, whether it be in a NIC or inside of a chip or something, where to actually put the bits in the right format you need. Then on the other side you have to go all the way down that stack. So you've got the software, you have the power to run the GPUs
(28:47):
and the CPUs. You need to have the PHYs and the SerDes at the network level to be able to connect that onto a wire. You need to have the cable that goes to a switch. You have to go from that switch to another cable, another PHY and a SerDes, off to another network interface card. And then you've got to be able to take that at a high enough bandwidth back into a processor and into the memory. You have the memory component that is tuned to this kind of
(29:09):
workload and can handle the type of addressing and, you know, synchronization that goes along with the networking stack, and then you have to be able to bring that back up into the software stack on the other side. All of those different pieces of the puzzle have to exist for one packet to work in one workload at any given point in time.
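Purely as a mental-model aid, here is a small sketch enumerating the touchpoints described above for one packet's trip between two devices; the hop names are a paraphrase of the conversation, not an official or exhaustive list.

```python
# Touchpoints a single packet crosses in the two-devices-through-a-switch model
# described above. Names and ordering paraphrase the conversation only.
PACKET_PATH = [
    "workload software (sender)",
    "software interface / packet formatting (NIC or on-chip)",
    "PHY + SerDes onto the wire",
    "cable to the switch",
    "switch",
    "cable, PHY + SerDes into the far NIC",
    "processor and memory (tuned for the workload's addressing and sync)",
    "software stack (receiver)",
]

for hop_number, hop in enumerate(PACKET_PATH, start=1):
    print(f"{hop_number}. {hop}")
```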
Then you have the issue of what happens if I've got multiple
(29:30):
devices, and I have to be able to negotiate across that network and across those links and across those wires to make sure that I know exactly what's going to wind up happening in all of these different systems, because now I've got to have a traffic cop that goes along with it. Then you have to be able to understand how I'm going to configure all of these different things.
(29:51):
What are the topology considerations? That's another element, that's an entire art form in and of itself, right? That has a lot to do with switching, it has a lot to do with cabling, it has a lot to do with patch panels and so on and so forth. So the cabling folks, the signaling folks, the power folks, they're all part of this as well, for that very reason.
(30:13):
And then you have the idea of, well, where are we going to go? How do we make this forwards compatible? Where are we going to go next year? How do I add things into this system moving over time? Right, because in HPC, for example, we're used to wholesale budgeting in one go. Right, everything from your storage to your networks to your compute, it's all in one bill of materials. That's how Frontier was built. That's how anything is built in terms of high-performance computing. That's not how Ethernet networks are deployed inside of
(30:34):
regular data centers, your normal everyday mom-and-pop data centers. You've got a budget cycle for your compute. You've got a budget cycle for your network. You've got a budget cycle for your storage, and they're never aligned, right? ... How do I scaffold into something there?
(31:16):
That's a different question. So now I've got power architectures that I have to take into consideration. Now, we don't get into the power stuff at UEC, right? That's not what we do, but we do consider ourselves to be everything up to that point, because we need to consume that power and distribute it appropriately across the system, which means we can't do this in a vacuum. We have to be able to understand what the consequences
(31:38):
are. All of these things have long-term consequences that are going to affect a lot of different companies and a lot of different...
Speaker 2 (31:46):
Does the scope of that become, I guess, a risk?
Speaker 3 (31:50):
It's always a risk. I mean, what's happened is that we've always come up with more interesting, clever ways of solving problems. We don't have enough room inside of the memory for our GPUs, so we create parallelism, right? We create different ways of handling it. But that's not a cure-all, that's not a panacea for the problem, because when you create parallelism, you introduce
(32:11):
other types of problems. You create overhead, you create systems where you have additional latency between these different parallel paths. So you've got pipeline parallelism, you've got data parallelism, you've got tensor parallelism. They all have their own tradeoffs. It's all about mitigating those tradeoffs, and anybody who comes
(32:32):
up with a better way of mitigating them is going to be successful for the short term.
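As a small illustration of how those parallelism dimensions compose (a common way of reasoning about cluster sizing, not something specific to UEC, and the numbers are made up), the total accelerator count is roughly the product of the tensor-, pipeline- and data-parallel degrees, and each degree puts a different traffic pattern on the network:

```python
# Illustrative composition of parallelism degrees; the numbers are made up.
def cluster_size(tensor_parallel: int, pipeline_parallel: int, data_parallel: int) -> int:
    """Total accelerators implied by the three parallelism degrees."""
    return tensor_parallel * pipeline_parallel * data_parallel

# tensor parallelism:   heavy, latency-sensitive traffic inside a scale-up domain
# pipeline parallelism: point-to-point activation hand-offs between stages
# data parallelism:     bursty gradient all-reduce across the scale-out fabric
print(cluster_size(tensor_parallel=8, pipeline_parallel=16, data_parallel=64))  # 8192
```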
So I think that we're going to eventually have power problems. We're just not going to be able to power a 20-trillion-parameter model, so we're not going to have 20-trillion-parameter models. We're going to have to come up with some other way of addressing that need, just because of the fact that there's not enough metal to bend in order to make
(32:55):
that happen, let alone the amount of nuclear power requirements that go along with it. I think that we try to solve the problems that we have in our hands and we try to see what we need to be able to do. Our focus right now is to say, if we were to get to that point, from a network perspective, what are the problems that we have to solve and what are the things that we can control?
(33:15):
And that's ultimately what we're kind of focusing on, because we don't want to lose the scope of our own abilities by spreading ourselves too thin.
Speaker 1 (33:25):
With the insane growth of everything you just said in AI, is there a finish line for the UEC? Is this an effort that's never going to end?
I almost envision, like with IPv6, we thought, oh, we'll never run out of addresses, and then something happens and, oh crap, we ran out of addresses. I mean, will the Ethernet that the UEC is building, do you think that'll be good for a very long time? Or could we hit a wall of, like, uh-oh, we didn't foresee this
(33:46):
other explosive thing happening. Like, now we're up to 10 bajillion endpoints. Whoopsie. Okay.
Speaker 3 (33:52):
So there's a couple of different ways to answer that question. I believe that there are enough problems with physics right now that are yet to be solved that there is a very long runway for work that can be done inside of Ultra Ethernet. We've had to deliberately put a pause on a number of the things that we want to do because they're just, quite frankly,
(34:12):
outside the scope of our bailiwick. That's why we're working with organizations like SNIA and OCP and IEEE, because we don't want to be an island, right? We want to work very closely with a lot of these other organizations and ecosystems, because they're solving the problems that are going to affect what we're doing, and what we're doing is going
(34:33):
to affect them. So the way I see it, and the way that I've been approaching the leadership of Ultra Ethernet, is that we have a job to do, but we are not everything, right? You really need to understand your role in the world, and, you know, underestimating your role is equally as dangerous as overestimating it. You really need to understand where you fit so that you can make the best possible contribution, not only to what
(34:56):
your members are doing, but to what the industry and the consumers are doing. Because if you make the best buggy whip again and no consumer needs it, because somebody's figured out a better way to do it and you weren't paying attention to the industry, well, that's your fault. And so I'm trying to take that perspective on the way that Ultra Ethernet works, from a bigger-picture point of view. So I think that, ultimately, the problems that are being resolved in storage and memory, addressing
(35:19):
and topologies, all of these different things that we're not really focusing on now, are going to be extremely important in the future. So, as far as I can tell, nobody's come up with a roadmap of problems that has an endpoint, and as long as there are problems that affect what we're looking to do and the partners that we've got in our ecosystem, I think we're going to be around for quite a while.
Speaker 1 (35:38):
That was my take on it. I don't think the UEC is going away anytime soon. It seems like it's going to be an ongoing effort for a long time.
Speaker 3 (35:46):
Yeah, and I'd like to get 1.0 out before we start talking about shutting it down.
Speaker 1 (35:51):
Yeah, yeah. When's the... you have a V1 coming up? Are you allowed to say when that is?
Speaker 3 (35:56):
Yeah, we're anticipating, in all likelihood, by the end of Q1 of this year, 2025. Like I said, we've got a lot of people. I think we've got 1,400 or 1,500 individuals now. We've got about 120 companies in a little over a year, and we have eight different working groups, not including the technical advisory committee and the steering committee and the
(36:18):
marketing committee and all that kind of stuff. But there are a lot of people working very, very hard on getting this thing out as quickly as possible.
Speaker 1 (36:25):
I was just looking through your working groups on the website. It's amazing. I didn't realize you had a group working on the physical layer, another group on the link layer, another on transport. So you're just breaking the problem into manageable pieces and putting some of the smartest people on it.
Speaker 3 (36:40):
Yeah, well, so that's a really good point, because when networking people have thought about networks, they've known this model so well they've even forgotten why it was there. And so the problem is that when we make changes to the link layer, go into 802.1Q, or make changes to the physical layer,
(37:01):
802.3, there is a limit to the people who are involved in each of these different problems. They're trying to solve a very specific problem in a very specific, constrained boundary. There's not a lot of discussion about what the consequences are once you start sending packets up and down that stack, right? You encapsulate it and you're good to go. Anything else is considered a layer violation. The end user has to say, all right...
(37:21):
Well, let me give you a really good practical example from storage. It also affects AI and HPC, so bear with me for a second.
(37:41):
So we have priority flow control, which is a way of putting a pause frame on a link between two devices in order to basically keep things in order. You want to make sure, if you have no way of getting to your destination, that you put a pause on the link so that the packets can remain in order and then you don't have to do any reassembly on the other side. This was useful for RoCE, useful for RoCE v2, useful for FCoE; they all require in-order delivery. Now, the problem is that it works really, really well if you
(38:05):
have a good understanding of your traffic type. Really well. If you have a good understanding of your fan-in ratio and your oversubscription ratios of the target from the initiator, you're okay. The problem is if somebody said, hey, look, this works really well, I can have lossless traffic all the way across, not realizing that their fan-in ratio was off the charts and
(38:25):
that created head-of-line blocking that would cascade across the network, right? So you had to treat your lossless traffic very differently than you had to treat your lossy traffic. Now, that means that once we're trying to solve that particular lossless problem in a large-scale environment, you can't use the same techniques and expect the same type of results. Sad but true.
(38:46):
Where this went really off the rails, about 10, 12 years ago, was when they tried to put iSCSI traffic onto a lossless environment. Now, Fibre Channel has an oversubscription ratio of about 4-to-1 to 16-to-1, depending upon the application. iSCSI had a 400-to-1 oversubscription ratio. So they were jamming 400 different links into one target,
(39:07):
and it was causing all kinds of head-of-line blocking problems with iSCSI once it went outside of that single-switch domain. So you take a solution that was really good for a well-defined and well-understood problem and you try to extrapolate that to what it wasn't designed to do, and you're going to have all kinds of issues. So as we start to do that with large-scale Ethernet, AI, HPC,
(39:30):
you start to realize that, hey, I can't do that. I need to understand what's going on: the link layer is going to affect the transport layer, what goes on in the transport layer is going to affect the software layer, and so on and so forth, right? And so what we're doing is saying, I don't want to change the Ethernet structure, right? What I want to do is tweak it so that what goes on above and below are now
(39:51):
aligned for that type of traffic. So if I have reliable unordered delivery, I want my Ethernet to do that. If I have reliable ordered delivery, I want my Ethernet to do that. And if I want to do idempotent operations for HPC, I want my Ethernet to do that. But I've got to change that all the way up and down the stack, and I've got to get the error messages to go back. I've got to get the OO codes to go back and forth between the link
I got to get the OO codes to goback and forth between the link
(40:14):
layer and the physical layer.
That's not in Ethernet rightnow.
That communication does notexist natively or mandatorily
right.
So we're trying to make surethat anybody who puts together
an ultra-Ethernet environmentknows that we've done the
thought about saying, okay, thelink layer and the transport
layer and the software layerhave to align this way.
If you're going to be usingthis type of AI, for example,
(40:37):
you want this type of congestion control. If you're going to have really large systems with, probably, incast environments, you may want to have receiver-based congestion control. You may even want to put trimming inside of the switches, but you don't have to do that. But you should know why, and we're doing that heavy lifting for you, so that you can say that in this environment, these are the kinds of things that you're going to be doing. We've got the compliance and performance and test work groups
(41:06):
to help say to an end user: this is why we do what we do and what we're recommending, and how you can be sure that you're actually compliant in this kind of environment. We're trying to provide all of these tools, not just for the vendors but also the end users, to understand why we're breaking those layers and making it work the way we are, because it's tuned specifically for the back-end network of a type of workload. Hopefully that made sense and wasn't just too much of a ramble.
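As a concrete illustration of the fan-in arithmetic in that story, here is a small sketch that computes an oversubscription ratio and flags when a lossless, PFC-paused class is at risk of the head-of-line blocking described above. The 16:1 threshold is just the Fibre Channel-style upper bound mentioned in the conversation, not a UEC rule.

```python
def oversubscription_ratio(ingress_links: int, ingress_gbps: float,
                           egress_links: int, egress_gbps: float) -> float:
    """Aggregate ingress bandwidth divided by egress bandwidth toward a target."""
    return (ingress_links * ingress_gbps) / (egress_links * egress_gbps)

def pfc_hol_risk(ratio: float, threshold: float = 16.0) -> bool:
    """Crude flag: beyond the ~16:1 figure quoted for Fibre Channel designs,
    pausing a lossless class is likely to cascade into head-of-line blocking."""
    return ratio > threshold

# The iSCSI-style case from the conversation: ~400 initiator links into one target.
ratio = oversubscription_ratio(ingress_links=400, ingress_gbps=10,
                               egress_links=1, egress_gbps=10)
print(ratio, pfc_hol_risk(ratio))   # 400.0 True
```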
Speaker 2 (41:25):
I think it's good. The question I have: so when you do that, when you break the layers, you can do that in, I guess, a couple of different ways. You can say that we're going to bound this by a reference architecture for a specific use case, and so everything is, let's say, hardwired to work a certain way, and then the integrations are done sort of
(41:47):
before the things are even deployed. The other way to do it, at least for parts of it, is to add some orchestration layer over the top and say some of that is configurable, because these devices are deployable in different areas. And, you know, today you've got front-end and back-end networks that are fairly distinct, and I think there are questions over time.
(42:07):
You know, do people reuse different devices, and kind of, where's the boundary? You've already talked a bit about some of the storage implications. Do you think that these will be, I guess, architecturally defined, or is there, like, an orchestration requirement that comes in over the top to handle how these things come together?
Speaker 3 (42:25):
There's a third option, right, and that is to make the actual packet and message delivery system be a lot more flexible and dynamic. So the way that we're approaching that is kind of navigating between that Scylla and Charybdis of the full-scale proprietary stack, which is very rigid, or the overarching software architecture layer, which is very slow.
(42:47):
So what we're looking to do is we're trying to say, hey, look, each of these different messages has to be able to have equal treatment across the network, but we don't want to keep state across a million nodes or the system that's going to require that. So we're not going to have it. We have a stateless infrastructure. So what we do is we create transactions for each individual flow, where the address information of the final
(43:08):
destination, of the memory location, is built into the packet itself. And that means that I'm going to set up a transaction. There's no slow start; I can immediately send off this packet, and once that transaction flow is done... each packet itself, each message is identified, the message ID has its own identification, and the destination can do the
(43:29):
reassembly of that message in that transaction and close down the system without having to maintain state across the network. It's an incredibly flexible approach, because each of these different transactions has its own semantic requirements based upon the workloads that you're running, which means you can actually run different types of packet delivery systems at the
(43:50):
same time, because it's all addressing that marries the semantic layer, but it's not tied to it to the point where every single packet, every single message has to be that way. And that allows us to do some really incredibly flexible things with equal cost, with the packet spraying and the ability to directly talk into the memory locations at the other end,
(44:12):
while also maintaining the congestion control notifications that go back to a sender, where the sender itself can have a lot more control over which path is supposed to be taken for the next flow, and it makes it incredibly flexible.
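A loose sketch of the idea being described: each packet carries enough information (a transaction ID, a message ID, a sequence number and the destination memory address) for the receiver to reassemble the message without the fabric holding per-flow state. The field names here are illustrative and are not the actual Ultra Ethernet Transport wire format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SprayedPacket:
    # Illustrative fields only; not the real Ultra Ethernet Transport header.
    transaction_id: int   # identifies the short-lived transaction (no fabric state)
    message_id: int       # which message within the transaction
    seq: int              # position of this packet within the message
    dest_mem_addr: int    # final memory location carried in the packet itself
    payload: bytes

def reassemble(packets: list[SprayedPacket]) -> bytes:
    """Receiver-side reassembly: order by sequence number regardless of the
    path each packet took, then deliver to the address carried in the packet."""
    ordered = sorted(packets, key=lambda p: p.seq)
    return b"".join(p.payload for p in ordered)

# Packets can arrive out of order after being sprayed across many links.
pkts = [SprayedPacket(7, 1, s, 0x1000 + 8 * s, bytes([s])) for s in (2, 0, 1)]
print(reassemble(pkts))   # b'\x00\x01\x02'
```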
Speaker 2 (44:26):
You mentioned ECMP. Are you exploring non-ECMP approaches to fan traffic out over all available links?
Speaker 3 (44:29):
Yeah, so a lot of people get... ECMP is... it was a mistake to put it that way, because the way that we're doing packet spraying is more granular than normal ECMP, and I do have to be a little bit careful because there are some things I'm not supposed to be talking about in great detail before we go for 1.0. But nevertheless, it is a form of ECMP.
(44:52):
It is not equal-cost in the sense that we would normally have it deployed in a traditional data center environment. It really has to do with the fact that we have a strong degree of variability in the radix of our links to allow us to be able to keep a fine distribution across many, many,
(45:13):
many links with this kind of granularity. So it gives us a higher degree of sprayability without having to go to flow-level dedication of a link, which is what you would get with ECMP.
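To illustrate the distinction being drawn, here is a toy comparison, not UEC's actual algorithm: classic ECMP hashes a flow's 5-tuple so every packet of that flow sticks to one link, whereas per-packet spraying rotates consecutive packets of the same flow across all available links and relies on the receiver to reassemble.

```python
# Toy comparison of flow-hash ECMP vs per-packet spraying. Illustrative only.

def ecmp_link(flow_tuple: tuple, num_links: int) -> int:
    """Classic ECMP: one hash per flow, so the whole flow rides a single link."""
    return hash(flow_tuple) % num_links

def sprayed_link(packet_index: int, num_links: int) -> int:
    """Per-packet spraying: consecutive packets of the same flow rotate
    across every available link (reassembly happens at the receiver)."""
    return packet_index % num_links

flow = ("10.0.0.1", "10.0.0.2", 6, 49152, 4791)  # src, dst, proto, sport, dport
links = 8
print({ecmp_link(flow, links)})                     # one link carries all packets
print({sprayed_link(i, links) for i in range(32)})  # all 8 links get used
```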
Speaker 2 (45:25):
Well, and the failure domains, going back to the previous part of the discussion (and Odysseus would be proud, by the way): when you, I guess, navigate between, like, an orchestrated outcome or sort of a hard-coded, you know, pre-deployed outcome, you also change some of the failure
(45:47):
domains on that, which I think is nice. You know, having worked on scheduled fabrics in the past, let's say, you know, pre-current technology, and watching entire data centers go down, I think having something that's a little bit more tolerant of different types of workloads is pretty good. I have, I guess, to put a bow on something we started earlier
(46:11):
We spent a lot of time talking about all the reasons UEC might fail. If Andy's going to change his mind, maybe... is there a particular reason, something you've seen, right, where you look at it and you say: this is why UEC is going to succeed? I'm not looking for, like, the secret sauce, but what's a
(46:34):
thing that you look at and you're like, you know what, that doesn't just give me hope, that gives me confidence?
Speaker 3 (46:41):
I love that question.
Speaker 1 (46:42):
You got Bushong right there.
Speaker 3 (46:45):
It goes back to something you said, Mike, so I'm going to take that and I'm going to turn it around just a little bit. I probably spend close to 20 hours a week on UEC. It's a part of my job, but it's definitely the biggest part of my job. And in the last two years I have watched UEC go from six
(47:07):
companies trying to solve a very specific type of a problem, where there was a period of time when people were like, you know, these are six companies with very big egos and we're not going to see these guys agree on anything, to 115 companies, people proudly displaying the UEC logo at Supercomputing or OCP
(47:28):
or something along those lines. When I sit in on the meetings, it's spirited, something that Andy said, but they all genuinely believe in the value of the outcome. So the one thing that I have seen, and I've been part of these standards bodies for a very long time, the thing that's going to make this succeed is just the level of sheer will in the
(47:49):
people who are putting this together. This is something that they're excited about, that they're passionate about. You know, passion derives from the Greek word for suffering, and sometimes that's exactly what happens when you deal with a passionate person. But all joking aside, they do fervently believe that they're doing something that is going to solve the
(48:11):
problem that is going to vex a lot of people in the very near future, you know, and they're working nonstop on it. So I think, ultimately, I have to say: the people. The people that I've been working with, they're not resting on their heels, they're not cooling their jets, they're going full bore on this, and I mean from all companies. A lot of companies are putting people in; it's not just two or three, it's
(48:33):
like a lot of them, and that kind of self-motivation is unmatched in anything I've ever experienced.
Speaker 2 (48:40):
I'll do a pile-on to that, and then I'll maybe kick it back to Andy to see if we've moved him a little bit from his starting position. I was involved a lot in, like, the OpenDaylight stuff, and OpenDaylight had kind of come out of the gates really strong, a lot of interest. You know, ultimately, I don't think you saw the volume of deployments that people had hoped.
(49:00):
I think it was instructive to the broader SDN movement, but outside of a couple of different OpenDaylight distributions, it wasn't the huge deployment success that people had hoped for. I think what's different here is that there's an immediate, very acute need, and I think when you take all those people, what unifies them is that there's a very
(49:22):
tangible thing that's very concrete, that has very real, you know, business drivers behind it. I think that's the thing that OpenDaylight didn't have. It was a little bit of theory, it was this idea that there was a better way of doing things, but it wasn't immediate, it wasn't this real kind of central need.
(49:43):
I think what you have here is a very strong need. It's being driven by a bunch of big players, but it's not only the big players, and I think that you've got some technology milestones that are forcing, like, look, it's got to be deployable by... and then they pick a date, right? I think you put those two things together, and that's why you see success.
(50:05):
You know, if necessity is the mother of invention, I think we've got the necessity side, and I think what it's doing is it's driving a lot of the invention, and we're seeing that. That's how you break ties, that's how you remove the standards, you know, sort of slowness. I think all of that comes together, and that urgency, to me, that's the thing that's different this time. So, Andy, I don't know if that moves you, but you opened with
(50:31):
questions. I don't know where you're at now.
Speaker 1 (50:33):
All right. So before I hand down my decision and bring the gavel down, I do have one question left for Jay, and then maybe a comment. So you said the V1 is going to come out in Q1 of 2025. So I guess my question is: what will that information look like? I'm a network person and I manage networks, and you're completely revamping Ethernet from the physical layer on up.
(50:55):
How does a network engineer who has worked with traditional Ethernet all this time internalize, learn, and be able to support and deploy what the UEC is doing? Is this going to be an 800-page white paper that I have to memorize? How are we going to help people take what you're building and learn and deploy it? Does that make sense?
1,600 pages, dude.
Speaker 3 (51:15):
Come on, you're right. Well, okay, yeah. So we were already planning on how to help educate people on this because, like I said, there's an awful lot in there and, quite frankly, there's a lot of stuff in there that no one person has the depth of background to be able to get in one sitting. I mean, there's stuff for firmware developers, there's stuff for HPC people, there's stuff for AI people, for lib
(51:37):
fabric, for storage. There are just a lot of things that have nuances that are not universally understood. So the spec itself is going to be public. We're not charging anything for it. You'll be able to download it from the ultraethernet.org website. You'll be able to read everything yourself. We're also going to open up for public comments and feedback
(51:58):
and that kind of stuff, for error corrections or revisions or possible future ideas.
So there's going to be a way for people to actually provide feedback into the organization.
At the same time, we're already starting.
I've asked the chairs of the different work groups to, in their copious free time, start thinking about how to start educating people on the work of their own particular projects,
(52:21):
because there are a lot of them.
There are a lot of independent projects that are going on in each of these, and we've got a, we call it the marketing committee, but that's really just the, you know, the communication group.
It's the one that is designing the white papers, the presentations, the seminars, the webinars, those kinds of things that are going to help get people a little bit better understanding about how to deploy UltraEthernet.
(52:43):
And then we're also offering to the vendors themselves, the members of UltraEthernet, any kind of assistance they need for helping with their own materials, for getting their piece of the puzzle out and saying this is what we're doing and this is how it works with UltraEthernet, because there are so many different moving parts that any one particular company may only have a small part, or they may have a large part, and
(53:06):
we're offering all kinds of support for making that message as consistent and clear as possible for them.
Some of those companies are rather large and some of them are rather small, but we've got a good spread of contributions from all of the big names that you've probably heard of.
So we're already starting to plan on a campaign of
(53:27):
understanding and giving as much information for people to be able to use so that they can make informed decisions.
And they may still wind up going with InfiniBand, and that's perfectly fine, but we want to make sure that everything is out and open to people, to be able to understand how this stuff works and ask the questions that they need to ask.
We're going to be doing an awful lot of integration with
(53:49):
other organizations: OCP, SNIA, IEEE, OFA.
The OFA puts together the libfabric stuff, so we expect a lot of joint announcements and presentations and educational
material to be coming forth.
Speaker 1 (54:06):
Awesome.
Will UltraEthernet run on existing hardware, or is this going to require...?
Speaker 3 (54:13):
Yes. So there's only really one mandatory thing you have to do in order to be UltraEthernet compliant, and that's the transport layer.
And since most of the deployments we expect to be running for UltraEthernet are going to be DPU- or NIC-based, you know, basically the server NIC-based approach to this transport, we don't anticipate that's going to be too difficult
(54:35):
because you won't have to change your switches.
It'll fit inside of existing GPU clusters.
We don't expect anything from an Ethernet infrastructure to have to be changed.
Obviously, when we start to go into silicon spin, anything that involves UltraEthernet-based trimming support or the physical layer modifications, that's a different story.
(54:58):
But those are optional, those are not mandatory things you have to have in order for it all to work.
But when it comes, you'll be able to use the existing infrastructure for your environments, or you can wait for the new ones.
But obviously there's a scaffolding that has to happen anyway.
So we're trying to make it as compatible as possible.
Speaker 1 (55:17):
So I said I had one question, which is a lie because I just asked two, but I will end it with just one comment, which is around, I guess, the glacial pace that networking seems to move at, right?
I mean, if IPv6 adoption or network automation in the past 20 years is any indication, and kind of the lackluster adoption rates that we have.
(55:38):
I'm guessing, and I don't know how you guys feel about it, but it seems like the financial incentive to be able to support AI/HPC workloads, right? Like, this is the thing, we all have to do it.
I'm wondering if that'll push us faster than we traditionally move in networking, right?
Does that make sense?
Speaker 2 (55:56):
I think it's the
great tiebreaker.
I think it's a good way to put it.
Speaker 3 (56:02):
Yeah, I mean, the thing is that, remember, we're also talking about a backend network, right?
We're not talking about a general-purpose network where you've got a lot of different workloads you have to...
You're not going to be doing VLAN configurations like you would in a typical data center.
This is for a specific purpose.
So you know, I think that what you're really looking to do is, how do I connect my GPUs together for AI properly?
I can use that without needing to necessarily disrupt my
(56:26):
traditional glacial pace of networking adoption, if that's...
Speaker 1 (56:32):
InfiniBand is going to rule the world forever, and that's biased. I mean, we're all biased, I guess, in one way or another, and I just see the market share they have.
But then, after hearing everything that the UEC is working on, and I'll be honest with you, probably 15 to 20% of
(56:52):
what you said I think made sense to me, and that's a compliment to you, because you and the UEC are just such a brilliant group of folks who are working on such an important thing at such deep levels. Like, just when I saw all those working groups and you broke it down into the levels, I'm just blown away at what you're doing, and it seems real to me.
I told you before the recording, I'm like, oh, the UEC, they've
(57:14):
been doing this for years and we're still waiting on a spec.
I mean, a real cynical, shitty kind of thing to say to the man who's chairing the thing.
So that's why I didn't say that.
And here we are, and I'm saying it on the record. But that's how I felt, right? And, you know, my mind has been changed here.
I mean, you know, actually, the work that you're
(57:35):
doing on Ethernet.
I mean, I believe that this is going to be the way of the future.
I mean, I don't see how one company that owns InfiniBand can retain their stranglehold forever on AI/HPC.
There's just too much growth there, there's too much revenue to be had, and we all know Ethernet.
So, like, updating Ethernet makes a hell of a lot more sense
(57:55):
than us all just trying to figure out something else.
I guess Mike was right.
I heard him once say, like, don't bet against Ethernet, and at the time I'm like, yeah, okay, pal.
But once again Mike Bouchon is right, I have been proven wrong, and my mind has been changed.
So that's...
Speaker 3 (58:10):
The gavel has come
down.
Speaker 1 (58:12):
That's why we're here
.
Speaker 3 (58:22):
Jay, thank you so
much for your time and all your
efforts.
Speaker 1 (58:24):
I feel like we could
have spent days talking about
this.
Maybe we can have you back on someday.
I didn't ramble enough? You want more?
There's just so many rabbit holes we could have gone down, and just in the interest of time we didn't.
But the technical stuff, it's just been really fascinating.
Thanks so much for your time.
Thanks for all the work you're doing.
I can't wait to see the V1 spec and all 1,600 pages.
Mike, always a pleasure.
Thank you so much for being here and for your insightful questions, as always.
You can find all things Art of Network Engineering on our Linktree,
(58:46):
that's linktree.com forward slash Art of NetEng, most notably our Discord server, It's All About the Journey.
We have about 3,500 people on there now.
It's a community.
If you don't have a community, it's one you could try out and hop in.
(59:07):
We have study groups spanning all kinds of vendor certifications and different technologies, and in Q1 of 2025, we'll probably have an Ultra Ethernet group in there of folks talking about all the things that, as network engineers, we're going to have to learn and figure out and deploy.
Thanks so much for listening, and we'll catch you next time on the Art of Network Engineering podcast.
For links to all of our content, including the A1 merch store and our virtual community
(59:27):
on Discord called It's All About the Journey.
You can see our pretty faces on our YouTube channel named the Art of Network Engineering.
That's youtube.com forward slash Art of NetEng.
(59:48):
Thanks for listening.