
January 7, 2025 45 mins

What if you could optimize AI storage solutions without breaking the bank? Join Cameron Brett and Jonmichael Hands, Co-Chairs of the SNIA Solid State Drive (SSD) Special Interest Group, as they dive into the latest in solid state drive technology advances – including their increasing presence in artificial intelligence and high performance computing. This discussion looks at technological leaps in SSDs, zeroing in on how enterprise and data center class NVMe SSDs can transform AI system performance while balancing cost considerations. 

You'll hear about advancements in QLC SSD technology and the introduction of PCIe Gen 6. The conversation also highlights the importance of energy efficiency and total cost of ownership in navigating the future landscape.

About SNIA:

SNIA is an industry organization that develops global standards and delivers vendor-neutral education on technologies related to data. In these interviews, SNIA experts on data cover a wide range of topics on both established and emerging technologies.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:02):
All right, folks, welcome back to the SNIA Experts on Data podcast. My name is Eric Wright. I'm the Chief Content Officer at GTM Delta and co-host of this amazing podcast. I'm so lucky because I'm surrounded by experts on data, experts on so many things, and I've got two fantastic folks joining me today, because we're going to talk about what's new in storage.

(00:26):
As we're recording this, we're wrapping 2024, which has been a big year for so many things. We've seen a lot of innovation. You know, show season winds down and the tech events start to move into let's-prep-for-next-year mode. But the one thing that's interesting, of course, is that the work that's going on in SNIA is well ahead of stuff you're going

(00:48):
to be seeing announcements for in the first part of the year, because the stuff that we're talking about is stuff that really is going to have pretty interesting, long-lasting effects. But really the thing about it is the people that you get to hang out with and spend time with, learning about what's happening, what's coming, and really the idea that we can

(01:09):
innovate at incredible paces when we collect together in great groups like SNIA. So with that, I'm going to open up, and Cam, if you want to do a quick introduction for folks that are brand new to you. Then we'll move on to JM next, and then we'll jump in and talk about what's new in storage for 2025.

Speaker 2 (01:31):
Yeah, sure thing.
Thanks, Eric.
My name is Cameron Brett.
I'm co-chair of the SNIA SSD SIG along with John Michael, and we try to be an authority and representation for the industry for things about SSDs, such as form factors, classification,

(01:51):
types of SSDs, TCO models, things like that. So John Michael and I work very closely on that. I am also an employee of Kioxia.

Speaker 1 (02:02):
And John Michael?

Speaker 3 (02:04):
John Michael Hands, Senior Director of Product Planning at FADU. And we thought it'd be fun today to get together and chat about some of the new trends in storage. There's a lot of discussion on, of course, AI and storage workloads, QLC, Gen 6, EDSFF, CXL, all kinds of fun stuff, so we thought we'd just kind of go back and forth and chat about a

(02:25):
little bit of what's going on in the trends for next year.

Speaker 1 (02:28):
Well, there's no better trend than when you can just say let's rub some AI on it. So, Cameron, let's talk about what AI is doing to drive trends and changes in the storage industry.

Speaker 2 (02:43):
Well, I think both John Michael and I will have a lot to say about this, but at a very simplistic level, GPUs and DRAM and storage are definitely the big cost drivers for AI, and storage is critical for AI systems in all the various phases. One of the goals of storage is to help keep the GPUs

(03:04):
working so they have less idle time. All types of workloads are being used, and pretty much all the SSD attributes are tested: low latency, read/write throughput, small block, large block, IOPS performance, et cetera. In some AI phases, the lowest latency and highest performance are needed, but in many cases they're not.

(03:25):
This is where you have a choice for kind of a cost versus performance metric comparison, and you can choose different SSDs. The highest performance, lowest latency SSDs are enterprise class, and the data center class SSDs still provide very good performance and latency, but have a focus on consistent and predictable performance and latency.

(03:45):
So these are typically required by hyperscale applications, and this is one of the pages that John Michael and I worked on, kind of drawing a distinction between enterprise class and data center class NVMe SSDs. SSDs will also likely play a role in the expansion of memory

(04:10):
space to augment existing DRAM in places like RAG, specifically with approximate nearest neighbor search, or ANN. Being able to scale by storing both index and vectors in SSDs as the database grows will boost performance while keeping costs in check, since the SSDs can, in some cases, replace DRAM.
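A minimal sketch of that idea, assuming a NumPy-style memory map; the file name, sizes, and brute-force scan are illustrative assumptions, not anything described in the episode. A production ANN index (IVF, HNSW, or a DiskANN-style graph) would layer structure on top of the flash-resident vectors.

```python
import numpy as np

DIM, COUNT = 768, 100_000   # illustrative sizes only

# Vectors live in a file on the SSD; the OS pages them in on demand,
# so DRAM holds only the hot part of the working set.
vectors = np.memmap("vectors.f32", dtype=np.float32, mode="w+", shape=(COUNT, DIM))

def nearest(query: np.ndarray, k: int = 5) -> np.ndarray:
    # Brute-force inner-product scan. At a billion vectors this table would
    # be roughly 3 TB, which is why it goes on flash rather than in memory.
    scores = vectors @ query
    return np.argsort(-scores)[:k]

print(nearest(np.random.rand(DIM).astype(np.float32)))
```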

(04:30):
You'll see more technologies like this in 2025. So there's a lot going on. I'll hand it off to John Michael to say a few more words on SSDs and AI.

Speaker 3 (04:47):
Yeah, where I started was Meta with the Llama 3 training. Obviously it's an open source model, and there's been a tremendous amount of information they've put in those white papers and research studies on the storage configuration for the actual training, which I thought was really fascinating. Like, for instance, in the Llama 3 training there was, I

(05:10):
think, something like 16 or 17,000 GPUs in the cluster that they actually used for Llama 3 training, and in that they had, you know, a 240 petabyte NFS cluster, and they're doing this with NVMe over Fabrics. They kind of describe their Tectonic architecture that they used for warm storage and blob storage, and how they migrated that to Flash. But the really interesting part is these ratios, like this
(05:31):
checkpointing, where it's very bursty. They said up to a peak of seven terabytes a second of bandwidth is needed to do these checkpoints on the cluster, which is just insane. You have to remember, all these GPUs have to be coherent to do the training.
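For a rough sense of scale, here is a hypothetical back-of-the-envelope calculation; the per-drive write bandwidth is an assumption, not a figure from the episode:

```python
# How many NVMe SSDs would it take to absorb a 7 TB/s checkpoint burst?
PEAK_CHECKPOINT_BW_GBS = 7_000   # 7 TB/s, the peak quoted for the cluster
SSD_SEQ_WRITE_GBS = 12           # assumed sustained sequential write per Gen 5 drive

print(f"~{PEAK_CHECKPOINT_BW_GBS / SSD_SEQ_WRITE_GBS:.0f} drives writing flat out")  # ~583
```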
So I really enjoyed, you know, these videos that Meta put out. There's one I think it's called Training Llama: A Storage

(05:51):
Perspective. They kind of talk about some of these challenges and tail latency. Obviously, all these storage vendors are trying to ask the question of, okay, what do we do in storage to make it better for AI? So for one, I think it's clear that high performance is needed. Like, lots of bandwidth for

(06:12):
checkpointing and sequential writes, and for training, random reads.
Like, I ran all the MLPerf Storage stuff, and it's really just, you know, the three MLPerf workloads that are up for training, like UNet3D and ResNet, are all just high block size random read. And unfortunately, in the Rev 1.0,
(06:39):
there wasn't that much stuff you could tune on the drive to make it go faster. All the tuning was at the file system and the network storage level to make the training workloads go faster.
And so, actually, I am part of the MLPerf Storage work group, and they're targeting the 2.0 for, like, the middle of next year, and there's a bunch of stuff they're trying to tackle: RAG, S3, checkpointing, updated training benchmarks. So everybody wants to know, what can we do to

(07:01):
benchmark the storage for AI.
The other really interesting thing that happened was NVIDIA has kicked off this Storage Next motivation. It's unclear yet how they're going

(07:22):
to do it. Obviously a bunch of storage vendors and companies are part of SNIA; they want to maybe do this in OCP, maybe some stuff in NVMe; they don't really know yet. But there were 110 storage vendors on that kickoff call, and it was really about, like, where are we today? And that was part of what Cameron was talking about,

(07:44):
you know, characterizing these workloads: data loading, sequential versus semi-random; checkpointing, sequential large block size; RAG embedding lookup, small block size, semi-random; and understanding some of these workloads. But they're looking more into the future, and I thought it was really interesting to look at it from that perspective. You know, right now storage is obviously not a huge spend in the CapEx for AI. I

(08:07):
think some of these hyperscalers were over $30 billion in a quarter for CapEx, you know, in data center spend. So if you just think about storage as 2 to 5% or something of these giant numbers, it's very easy to see why the storage market is so interested in understanding this. But it was really nice because they kind of focused on, okay,

(08:30):
where are we at today? We understand these. I just mentioned some basic understanding of some of these training workloads: you need lots of bandwidth, you need lots of IOPS. But I thought it was really interesting where they are going and what they were explaining, and I think CJ gave a session on this at OCP as well as Supercomputing, and

(08:53):
I think a lot of those sessions are up on the internet. But they were the ones that pioneered GPUDirect Storage. They wrote the white papers on the Big Accelerator Memory and GPUDirect, and so they're showing us this world where the IOs come from the GPU directly to the SSD, or over NVMe over Fabrics through some kind of BlueField NIC or something,

(09:14):
directly to the SSDs. And this looks a lot different than storage today. Right, it's thousands of queues, thousands of threads. And the really interesting use case, I thought, was RAG. You know, we talked about CXL a little bit, and they're like, okay, memory solves your couple-of-terabytes problems.

(09:35):
But what if you have a database that's 100 terabytes, or two petabytes, for RAG? It's not going to work in memory. Right, you need SSDs. So the thing that just blew my mind was their target for Gen 6 is 200 million IOPS per GPU.

Speaker 1 (09:54):
Good golly.

Speaker 3 (09:56):
Their challenge to the storage industry is, how do we get there? It's very clear. No pressure. We can get to, like, 7 million IOPS at Gen 6 with standard SSDs we have today. But what are we going to do?
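Taken at face value, those two figures imply a striking ratio; a quick sanity check using only the numbers quoted above:

```python
# NVIDIA's stated Gen 6 target versus a single standard Gen 6 SSD.
TARGET_IOPS_PER_GPU = 200_000_000
IOPS_PER_GEN6_SSD = 7_000_000

print(f"{TARGET_IOPS_PER_GPU / IOPS_PER_GEN6_SSD:.0f} drives' worth of IOPS per GPU")  # ~29
```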
I thought this was just a lot of fun. A lot of this

(10:17):
work is just getting kicked off, but there's so much going on. I'll just do, obviously, the shameless plug for SNIA SDC: I thought the AI sessions were fantastic. I went to the one from Dell and one from Microsoft. They were so good, and most of them way over my head, which means that they're good. That's the SDC way.

Speaker 1 (10:41):
Yeah, well, they're amazing to watch because of how quickly people sort of head to the stage, and the one-hour talk is what precedes the three-hour talk that really is the one that we move to the hallway track. There's such a great set of people there that are really sharing what's going on. It's such a beautiful sounding board. And you know, it's funny,

(11:03):
you talk about how do we predict what's coming, and there are so many parameters, so many opportunities for tuning and changing, and then the workloads. This is a brave new world, although some would say it's a silly new world, but it is brave in what we're trying to accomplish, and trying to do it in the most environmentally, you

(11:26):
know, intelligent way. We know there are many different sacrifices in this equilibrium, what we have to sacrifice in order to achieve these gains, so it's amazing. So, John Michael, one of the things you mentioned there was PCIe Gen 6, and obviously we're seeing some neat stuff

(11:50):
that's coming around that. So what does the 2025 outlook look like as far as Gen 6, and why is this actually important in what we're seeing around some of the innovation coming up?

Speaker 3 (12:06):
Yeah, the funny thing is, every single generation, you know, Cameron and I have had this discussion, with SATA and NVMe. When NVMe first came out, it was like, why does anybody need seven times the performance? SATA is just fine for these workloads. And then you go from Gen 4 to Gen 5, you double the bandwidth. I don't think people quite understand that: you're doubling the bandwidth.

(12:27):
It's 2x the performance, and that's why these interface changes are so important. And as we just talked about with AI, when you go from, say, 200 gigabit networking to 400 gigabit networking, now you need to go from Gen 4 to Gen 5 on the NIC, and you need to go to Gen 5 on the drives to basically saturate the network

(12:47):
and saturate the GPUs and all that stuff. So we are seeing now all the Gen 5 shipping in volume from all the vendors, which is great. We saw a couple of announcements for Gen 6 at FMS, like the companies working on SSD controllers. I don't expect any of them to ship in 2025; I think there's a lot of engineering work that goes on.

(13:09):
But the interesting thing is, obviously, the new company in town for high speed computing is NVIDIA. And it's clear what AMD and Intel have said about Gen 6: it'll be, like, the 2026, 2027 timeframe, right when NVIDIA is

(13:32):
like, we have Blackwell and we have DPUs and SuperNICs from Mellanox that are going to be shipping Gen 6 next year. So they're asking for Gen 6 yesterday, right?

Speaker 1 (13:41):
Yeah, the demand is already there. No market testing required.

Speaker 3 (13:46):
So back to that: we're going to go through this cycle that we go through every time we go through an interface transition, which is, why do we need the extra performance? And I talked to a few of the hyperscalers about this. They said, well, if you can do Gen 6 at the same or less cost than Gen 5, then of course. And when they say cost, they don't mean, like, cost of the controller

(14:08):
or cost of the NAND; that stuff doesn't materially change. It's really the retimers in the system and the low loss material and all the other stuff.

Speaker 2 (14:19):
Yeah, overall system costs.

Speaker 3 (14:21):
Yeah, and when they say TCO and lower costs, they mean, okay, how do you go from Gen 5 to Gen 6 without increasing the system cost by 30 or 40 percent, because that's not going to work. And so again, every generation we go through this: at first it seems like it's not possible, you have to have extra costs to get there, and then people figure out ways. Retimers are getting more common, there are more

(14:41):
vendors, prices are going down. So yeah, if it wasn't clear from the AI discussion: AI absolutely needs higher bandwidth drives. They are absolutely going to 800 gig networking and they need Gen 6 drives. They have Gen 6 GPUs with Blackwell and they want Gen 6 drives. So it's really, you know, a race.

(15:02):
The storage vendors, you know, are going to try to figure out how they get there.

Speaker 2 (15:06):
Yeah, when NVIDIA wants something and they're ready for it, people are going to jump as quickly as possible. I mean, NVIDIA is basically the bellwether for AI right now.

Speaker 1 (15:18):
So that is definitely true. You know, Jensen is the name that drips off the tongues of so many pundits when there are announcements coming up. I know they've got stuff with CES, so we're likely going to see tons of new announcements coming about what's already on the

(15:41):
ground and what is coming. And then this always has this incredible downstream effect. And it's funny, like you said, JM, this idea of why do we need more. You know, I used to have a network guy at one organization. I'd say, hey, we need to get on the good backbone for this one;

(16:01):
we need to move off the 10 gig to the 100 gig. And he would just look at me, in that grizzled networking style, through his cigarette smoke out in the parking lot, going, if you can max the bandwidth on that thing, I'll buy you lunch every day for the next year. He says, this isn't a bandwidth problem.

Speaker 3 (16:21):
You cannot possibly stress the bandwidth, because at the time the bottleneck was always way closer to the workload; now, what needs to go over the wire is incredible. You know, I forgot to say one key trend, which is, obviously, all the controller vendors are trying to figure out, okay, can we make a controller that's twice as fast at similar power?

(16:43):
You can; you have to go to fancier TSMC nodes and do lots of tricks to basically be able to scale the performance. But the NAND also has to get more power efficient. You can't double the bandwidth and double the IOPS and then not have a huge generation over generation improvement in NAND IOPS per watt and bandwidth per watt, because if you want to

(17:06):
saturate Gen 6: Gen 6 is 28 gigabytes a second, and at 4K that's about 7 million IOPS. That's a lot of IO, and you still have the same power envelope per SSD.
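That 7 million figure falls straight out of the link math; a one-line check, taking the ~28 GB/s effective number for a Gen 6 x4 link as quoted above:

```python
# Saturating ~28 GB/s of PCIe Gen 6 x4 bandwidth with 4 KiB random reads.
LINK_BW_BYTES_PER_S = 28e9
IO_SIZE_BYTES = 4096

print(f"{LINK_BW_BYTES_PER_S / IO_SIZE_BYTES / 1e6:.1f} million IOPS")  # ~6.8, "about 7 million"
```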
So this is incredibly important, and it kind of moves us into this discussion about energy efficiency and form factors. Gen 6 just highlights all these

(17:29):
issues that were present on Gen 5 and were already magnified compared to Gen 4; in Gen 6, they're very polarizing.

Speaker 1 (17:43):
We quite literally have been pounding against these walls now as we close out the year. Now then, on the hardware side: QLC. Neat stuff, we're going there, and it's funny.

(18:32):
Over the years, I remember, there were always three things, software, firmware and hardware, that were giving us gains, and now we're baking it into the hardware layer so that potentially we can unlock future efficiencies again with better software that can leverage these capabilities. So, Cameron, do you want to talk about what's going on in QLC, and why it may not be the year of VDI, but it will be the year

(18:54):
of QLC.

Speaker 2 (18:54):
Well, as John Michael was commenting, all the improvements in performance and improvements in storage density have to happen at the same time. QLC has been talked about for many years, and this is likely the year when QLC will kind of take a major foothold. You know, there were some QLC SSDs out in 2024, maybe a little

(19:17):
bit in 2023. But 2025 is likely when it's going to take a big foothold, especially towards high-cap SSDs, and possibly even into some cold or archival SSD use cases. One thing about the storage density, and we'll

(19:39):
probably touch on this a little bit more when talking about power efficiency, is that once you grow an SSD in capacity, you still have the same power envelope for an NVMe SSD. There are some cases where you can kind of go up the power envelope to 40 or 70 watts, but generally

(20:01):
speaking, keeping it at a 25 watt cap is kind of the mark that you don't want to go over for an NVMe SSD. Yeah, and not all QLC is going to be created equal; it's going to vary by supplier, and a lot of it is

(20:22):
going to depend on the chip and packaging architecture. There's CBA architecture, CUA architectures, and some of those will have an effect on the robustness of the cells, and thereby can determine whether it's really fit for enterprise use or data center class SSDs, or archival, or relegated to client

(20:46):
types of workloads. But definitely, QLC is certainly going to be the key to the high-cap SSD space that we're going to see a lot more of in 2025.

Speaker 1 (21:02):
I for one welcome our high-cap QLC overlords.

Speaker 3 (21:08):
You know, it's funny: the more I talk to folks about QLC stuff, the more I realize we have to just go back to the basics. I think that some people forget some of the absolute fundamentals. So, you go from three bits per cell to four bits per cell: you're storing 33 percent more capacity and you have a cost

(21:31):
reduction of 25 percent. Now, that's just at the NAND level. If you have one wafer and you can turn it into TLC or QLC, that's the density increase you get. So if you're looking at just bit output from the industry, they will get more bits from the same wafers if they go to QLC, and they will be at a lower cost. This is just fundamental, and this is the whole reason why you want to go

(21:51):
to QLC.
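Those two percentages are the same fraction viewed from opposite sides; a quick sketch of the arithmetic:

```python
# Same wafer, same cell count, one extra bit per cell.
TLC_BITS, QLC_BITS = 3, 4

capacity_gain = QLC_BITS / TLC_BITS - 1   # 4/3 - 1 = 0.33 -> 33% more bits per wafer
cost_reduction = 1 - TLC_BITS / QLC_BITS  # 1 - 3/4 = 0.25 -> 25% lower cost per bit

print(f"{capacity_gain:.0%} more capacity, {cost_reduction:.0%} lower cost per bit")
```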
Now, this cost reduction is a trade-off. The trade-off is endurance, retention and write performance. You're storing more bits per cell, you have more voltage levels, and you have to do different programming stuff, and the way this works is it takes more power to do the NAND program, so typically there are a lot of trade-offs on the

(22:12):
write performance. Now, some of this is alleviated when you go to these much higher capacities; you can at least get reasonable write performance at 25 watts. But just remember, this is a massive trade-off, right? You're going typically from, like, 10,000 program/erase cycles for enterprise TLC to QLC that ranges down to 1,500 cycles at some of the main vendors.

(22:35):
I think Solidigm has 3,000 cycles for their hard drive replacement high capacity QLC, and then they also have a TLC replacement that is like 5,000 cycles. So, as Cameron said, it's all across the board; there are a bunch of vendor differences in the QLC. But we do see something kind of standardizing around this two terabit die, which enables these 60 and 120

(22:59):
terabyte SSDs. So it's very clear that these are very desirable for AI workloads, where power is a huge constraint in the data center, or where they want to alleviate network bottlenecks by basically having a lot of capacity local to the storage nodes. And all we can say is, people are buying these really high capacity drives as fast as vendors can make them.

(23:19):
So now it's a scramble of, oh yeah, okay, reminder, we need to look at this high capacity QLC. There's always going to be this TCO story. By the way, the origin of the SNIA TCO model was me doing a bunch of analysis on QLC versus hard drive racks and looking at

(23:39):
some of the various workloads, and it gets really interesting right in the 3x to 4x dollar-per-gigabyte range of QLC versus hard drives. And you have all these extra IOPS for deduplication and compression and erasure coding and stuff. There's all this fun stuff you can do with TCO. But yeah, just as a reminder, QLC is always a TCO thing.

(24:02):
It always will be a TCO thing.

Speaker 1 (24:03):
Versus TLC, it is cheaper, but it has a bunch of trade-offs, and whenever you're doing trade-offs, it's now a TCO discussion. And one thing that you brought up, and it's starting to move to the fore, is that when we talk about life cycle, it means from manufacturing, even pre-manufacturing, like the artifacts that generate

(24:26):
the physical drives and memory themselves, and then all the way through to destruction. So it's beautiful to see more and more discussions where we talk about lifecycle meaning, quite literally, birth to destruction at the physical layer. And we're seeing, I'll say, a lot of centralized use. Before, every enterprise would have bought up

(24:47):
all the NVIDIA gear, and now obviously it's the hyperscalers that are doing the bulk of the buying. In doing so, that means we could probably get more efficient use of it, because by the nature of their sales cycle, they have to make it effective and efficient for them as a business.

(25:07):
So I hope that, while there's a continuous battle between centralization and decentralization over who should own innovation, we're going to be able to do innovation at scale in some of these pieces of hardware that we couldn't have had access to unless it was in that sort of shared model. But anyways, that's just my piece as the outsider looking in: I'm excited by what we've got coming ahead, and that we're

(25:30):
not just talking about speeds and feeds. You know, TCO is far more than just what it costs me to buy the drive. Now, power efficiency, again, is close to my heart. I'm somebody who does have a number of children, and I hope that they have plenty of

(25:52):
excitement in the outside world to enjoy, and that comes from things like what we're doing around power efficiency. We're changing the way we do computing in general. So, when it comes to power efficiency, down to the metal, what's new and what should we be really looking

(26:12):
forward to in 2025?

Speaker 3 (26:17):
I gave a presentation at FMS this year on power efficiency and SSD controllers, and where we started was in consumer SSDs. The drive sits there idle 99% of the time, so power efficiency in a laptop is all about battery life: how fast can the drive go to sleep, go to zero power, and then how fast can it wake up. And that is all about these NVMe power states and PCIe L1.2,

(26:41):
very low power states. And none of that works in the data center, because you have very tight latency requirements. But, by the way, consumer drives are really good now: they can go to zero idle power and get back in, you know, 5 to 10 milliseconds, which is like the latency of a hard drive. That's pretty wild. Still too much, though.

(27:01):
In the data center world, we're talking about 50 to 100 microsecond read latency, so 5 milliseconds is way too much; those don't work. And obviously you have battery backup and you have power loss capacitors and PLI stuff, so some of these tricks to go to sleep don't exactly work. But the important part is, in data center SSDs,

(27:23):
historically the assumption is that these things are always being used, and the measure of power efficiency is performance per watt in the active state. So you run an active workload, you measure how much active power the drive consumes to run that workload, and then you divide the performance by the amount of power in watts, and now you have

(27:45):
performance per watt.
This is a really important metric for a ton of reasons. One, we just talked about these form factor power limits: if you're at an interface limit or a form factor limit of 25 watts, the better power efficiency you have, the higher performance you can actually deliver at a certain TDP. The example I gave was actually going the other way, which is

(28:07):
saying, okay, we have a 25 watt drive, what happens if we cap the power at 16 watts? You want to know what your performance per watt is, because if you have better performance per watt, then you won't lose as much performance when you're reducing the power envelope.
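A toy model of that reasoning, with made-up efficiency numbers; real drives don't scale perfectly linearly with power, but the first-order effect is what matters here:

```python
# Two hypothetical drives capped to the same 16 W envelope.
def capped_performance(iops_per_watt: float, power_cap_w: float) -> float:
    # First-order estimate: deliverable performance ~ efficiency * power budget.
    return iops_per_watt * power_cap_w

for name, iops_per_watt in [("drive A", 100_000), ("drive B", 80_000)]:
    print(name, f"~{capped_performance(iops_per_watt, 16):,.0f} IOPS at a 16 W cap")
# The more efficient drive gives up less performance under the same cap.
```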
The power savings aren't just on the drive. I think people forget that the drives go into a system, and then

(28:28):
the systems have fans and all this other stuff going on, and that matters when you have a drive that consumes 16 watts instead of 25 watts. And, by the way, hyperscalers have been playing this TCO game for a long, long time: running drives kind of close to their operating temperature, towards the upper limit, to make sure they run fan speeds as low as they absolutely can, to basically reduce the power on the servers and reduce the TCO. So yeah, there's the whole sustainability

(28:54):
angle, which we haven't even discussed yet. But what I just mentioned are the practical SSD architecture applications of performance per watt: understanding how much performance you can deliver in a certain power envelope, in certain form factors, and modulating at different power limits for certain customer requirements and certain use cases. The other benefit is that if you can lower the active power

(29:16):
in the use phase and lower the TCO of the server, you have less wasted power on idle power, on thermal loss and fans. And there's a bunch of work going on in the sustainability space to enhance the PUE metrics, to be able to describe how you actually measure that inefficiency at the server level, not just at the data center level. Traditionally they've been talking about PUE at the data

(29:37):
center level, like the cooling at the data center level, at the rack level. But there's cooling within a server; it has fans, and you need to be able to quantify that as well.

Speaker 2 (29:47):
Yeah, that's also where EDSFF comes in, with the cases and the heat sinks, so you can run the fans at lower CFM and help keep the power consumption down. And taking that metric, IOPS per watt or performance per watt, that John Michael was talking about, you can then factor in a cost, so you can look at IOPS per watt

(30:10):
per dollar, kind of extending that to really take a look at the economics in addition to the performance.
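A tiny illustration of folding price into the metric; every number below is invented for the example:

```python
# Extend performance per watt with cost: IOPS per watt per dollar.
iops, active_watts, price_usd = 2_000_000, 14, 1_500   # hypothetical drive

iops_per_watt = iops / active_watts
print(f"{iops_per_watt:,.0f} IOPS/W, {iops_per_watt / price_usd:,.1f} IOPS/W/$")
```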
And as far as power efficiency goes, the power consumption of a high cap drive is also still

(30:30):
going to be constrained. It's not double the power if you double the capacity or triple the capacity; you still need to keep either a 25 watt or some sort of constrained power envelope. So that's where additional efficiencies through QLC, and eventually five bits per cell, come into play.

Speaker 3 (30:54):
Yeah, and I kind of touched on it briefly, but when a workload has a drive that's idle for an extended period of time, now you need to optimize idle power, and we've started to see that. I think Micron was the first to announce a drive that actually has a decent power reduction for L1 in a data center drive. The new high-cap drive goes from 5-point-something watts to 4 watts in an L1 substate, which is great.

(31:16):
It doesn't sound like a lot of savings, like a watt and a half or something, but that's a 20 to 30% reduction. It's actually a pretty big chunk.
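The percentage checks out from the quoted watt figures, taking 5.5 W as the "5-point-something" starting point:

```python
# "5.something watts to 4 watts" at idle.
idle_before_w, idle_after_w = 5.5, 4.0
print(f"{(idle_before_w - idle_after_w) / idle_before_w:.0%} idle power reduction")  # ~27%
```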
And so now, in this world where you're potentially replacing hard drives with QLC or higher bits per cell or something, and you have a drive that can do hundreds of thousands of IOPS but a hard drive workload that's 200 IOPS, the

(31:40):
drive's going to be idle a lot of the time. So now you have to figure out, how do we save the power on the drive side? We're not there yet on the data center side as far as those technologies being deployed, but we have hyperscalers actively asking those questions, which is great, because I was asking them four or five years ago in OCP and everybody was like, shut up.

(32:00):
We're never, ever going to use L1 substates on a data center drive; just don't even talk about it.

Speaker 1 (32:09):
Yeah, and it's funny, because that becomes this thing that we find: there's an innovation that will create an efficiency in something we thought was a dead technology, and then that ultimately becomes the new technology. You know, it's funny. I'm on the straight software side. I'm a Ruby on Rails fan, just because it's super easy. I've been using it for so long, and it comes out of the box with

(32:31):
SQLite, and the first thing you used to do was get rid of it. But now there have been optimizations in SQLite that make it as performant, if not more so, and as scalable as most

(32:53):
Postgres implementations for moderate to large size sites. So we're seeing software development that's being enabled by work that's happening closer to the metal, and the impact is so broad. That's what's exciting to me, because it's not just this one thing that we're innovating on, it's the entire ecosystem that sits atop of it,

(33:17):
and I know every day is a bloody wondrous day to be in computing. I'm excited as heck about it every day. Maybe I'm too much of a nerd, I don't know; I think we all are at some level. That's it. Well, and you know, high cap drives. Let me tell you, on a previous podcast I was talking about, remember the days of the excitement when you could get,

(33:37):
wait, we've got 64 gig drives? Oh my goodness, this is wild, what an unlimited amount of storage! Then 128, then 256, and now you wouldn't even hand someone a free USB stick if it was less than 512.

Speaker 2 (33:54):
You're like what am I gonna do with that?

Speaker 1 (33:57):
Yeah, exactly, what are you going to do, keep one picture on it? So let's talk about the future of high cap. Where are we seeing high capacity work show up in the enterprise space?

Speaker 2 (34:12):
Well, the most talked about place right now is AI, and AI certainly has a demand for more storage. But for many years, 960 gig and 3.84 terabytes were the sweet spots, and then, even not that long ago, it quietly went up to 8

(34:33):
terabytes and 15. And I know that my company was one of the first ones to offer a 30 terabyte SSD, but those were used in kind of special use cases. And then, once AI became a reality, 30 terabytes is

(34:54):
kind of like the starting point for high cap drives. And this is also where, talking about EDSFF, the E3.S 2T and E3.L form factors are where the high cap drives are going to be most prevalent. But yeah, AI has definitely spurred on the high cap race, and every company is sprinting in it right now.

Speaker 1 (35:22):
So I was going to say, it's kind of like when we look and we say, why do we choose these sort of moonshot missions, and what does it actually get us? While AI doesn't necessarily seem like it's doing what we would hope or expect it to do at the moment, what it is doing is creating a fantastic, burgeoning innovation

(35:42):
around data center architectures and hardware and software to allow AI to be efficient and performant. So while we're sitting here getting it to generate emails for us while burning off thousands of watts, at least we can hopefully get it better down to the bits and get these

(36:02):
drives to where they're getting the most out of that hardware. Sorry, JM, I cut you off there.

Speaker 3 (36:13):
You touched on one of the use cases, which is terabytes per watt, right. You have a 122-terabyte drive now, and the NAND vendors have all said, yeah, we're going to go to 1,000 layers or whatever. So it's not going to slow down; there are certainly paths to 256-terabyte drives and above. And on hard drives, Seagate's shipping HAMR in

(36:34):
production, and they just announced 30-terabyte CMR and 32-terabyte SMR in production, even for the channel. So the trend is going up on hard drives too. But remember, SSDs are already four times bigger, at a fraction of

(36:55):
the physical size. So from just a capacity standpoint, SSDs are already far, far ahead.
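The terabytes-per-watt point can be made concrete with rough numbers; the power draws below are assumptions for illustration, not quoted figures:

```python
# Capacity per watt of power envelope, SSD versus HDD.
ssd_tb, ssd_watts = 122, 25   # high-cap QLC SSD at its full 25 W envelope
hdd_tb, hdd_watts = 32, 10    # large SMR hard drive, assumed ~10 W active

print(f"SSD: {ssd_tb / ssd_watts:.1f} TB/W vs HDD: {hdd_tb / hdd_watts:.1f} TB/W")
# ~4.9 TB/W vs ~3.2 TB/W, before counting idle power or rack consolidation.
```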
And that's going to be continually stressed in data centers where they have no more power, and this could be in a regional data center, in a colo or an AI data center. Yes, they cost a lot more; SSDs are still 10X the price of

(37:17):
hard drives. But maybe if the QLC market comes back down to earth, it's going to be like a 6X. And now, if you have a 6X multiplier on the price but you can go from five racks of hard drives to one rack of SSDs at the same power, man, these are really tough decisions for data center operators to make. Remember, you also don't just get the capacity, you also get a

(37:41):
ton of performance. Now you can open up a bunch of AI use cases for reading that data.
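A toy version of that trade-off, with every input invented for illustration; rack power, electricity price, and time horizon are not figures from the episode:

```python
# Five HDD racks consolidated into one QLC flash rack at the same capacity.
HDD_RACKS, SSD_RACKS = 5, 1
RACK_POWER_KW = 10           # assumed draw per rack, either way
ELECTRICITY_USD_KWH = 0.10   # assumed $/kWh
YEARS = 5

saved_kwh = (HDD_RACKS - SSD_RACKS) * RACK_POWER_KW * 24 * 365 * YEARS
print(f"${saved_kwh * ELECTRICITY_USD_KWH:,.0f} in power alone over {YEARS} years")
# Against that saving sits the (assumed) ~6x dollar-per-gigabyte flash premium.
```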
Some interesting things we've seen: one thing that we mentioned on QLC is, just from a technical standpoint, it's actually physically cheaper.

(38:02):
There are more bits per cell. But the other trend we've seen is some of these vendors have used consumer TLC to basically get to market faster on a high-capacity drive, with 16 die stacks or even potentially these crazy 32 die stacks of consumer 3,000 program/erase cycle TLC NAND. So there isn't just one way to build a super high cap drive,

(38:24):
but certainly QLC is the way to build the biggest drives right now. And as far as the use cases, there are tons of use cases for high capacity SSDs: object stores, power savings, training data. You know, all the training data is going multimodal.

(38:46):
You know, I just got the update pushed on my Tesla this morning, which I'm excited for: the new self-driving that they trained in the new xAI data center, with all this high bandwidth, high resolution footage from the new cameras. It's so awesome, right? For these new training sets, this new big data, you want

(39:08):
fast SSDs to basically do that type of workload. So yeah, I don't think it's going to slow down. I think we're already at like 20% of the bits shipped as QLC. I think it's going to continue to ramp, and now all the analysts are finally like, yep, now it's time, QLC is going to go nuts next year.

Speaker 1 (39:26):
It's such an interesting innovation area, because when I think of general optimizations, we always link back to the sort of Goldratt-esque approach of find the constraint, elevate the constraint. And we look at this as, how do we tackle just the bottleneck?

(39:46):
But what's actually happening with these types of innovations is that we're eliminating someone else's bottleneck by adding innovation in an entirely different area. So that's why there are so many moving parts, but we're seeing them all converge together. That allows, like you said, stuff that's going on in the xAI data center that just would not have been possible five years

(40:07):
ago, from 50 different areas of innovation. And now we're actually pushing the technology with stuff like FSD. I love it. I love seeing use cases that are real, not just, hmm, what could we do with this? We're literally already doing it; it's being used every bloody day.

Speaker 3 (40:28):
Now, I forgot the actual plug for the SNIA SSD SIG. So if you are working on SSDs, and big SSDs, come to the SNIA SSD SIG. I believe Solidigm is about to contribute a paper on how you benchmark large capacity drives for things like endurance. For instance, a drive might have an indirection unit size that's not four

(40:51):
kilobytes, maybe much bigger, so you can reduce the amount of RAM. The typical generic workloads just don't make sense there. So there's a lot of nuance to how you test these big drives, and that's actually a lot of the technical discussion that we're driving through the work groups.

Speaker 1 (41:07):
Yeah. Even as a buyer of enterprise storage gear for SANs, you'd get those performance numbers from the vendor and you're like, you sure these aren't just linear 4K reads? I think your Iometer test is cute, but let me try it out in production. And every time I put

(41:29):
it in production, I would get told about a month later, well, you see, the problem is your workloads. I'm like, oh, it's my fault. Sorry for me and my silly workload getting in the way of your performance. So I could literally go on for hours on this stuff. You guys are both fantastic, so thank you both for sharing what's coming up. But as a quick closer, what's super exciting to you, and

(41:54):
how do we best get a hold of you if we want to chat more on this stuff? Let's start with you, JM, and then we'll close out with Cam.

Speaker 3 (42:00):
Yeah, Cameron and I, that's us. If you're a member of SNIA, you should come to the SSD SIG. Everybody's welcome. I think the 2025 rules are different; you used to have to be part of CMSI or whatever.

Speaker 2 (42:13):
It's open to any member now.

Speaker 3 (42:14):
It's open to anybody next year. So just come into the work group if you want to talk about SSDs. And yeah, the TCO stuff: there's a ton of interest in AI TCO, and like I mentioned, now you're talking about IOPS per dollar and stuff, where before it was just gigabytes per dollar. So I'm going to be updating all the SNIA TCO models to basically be ready for AI workloads, and that's going to be

(42:36):
a lot of fun.
I can't wait.
NVIDIA is asking me to do this.

Speaker 2 (42:58):
And yeah, it's going to be fun. For me, it's AI, as well as kind of continually monitoring and being the advocate for EDSFF, this form factor transition away from 2.5-inch and M.2.

Speaker 1 (43:16):
So I think that's some of the exciting stuff that we have going on in the SSD SIG. And as they say in sports, you've got to be in it to win it, and getting joined up with SNIA is super easy. On the new membership options, we had a great discussion with J Metz talking about how membership looks for 2025. So it's much easier to get involved, and much easier to get involved in multiple disciplines within the group now, so every barrier to entry is being

(43:40):
lowered. And there's just the fact that you can sit with a peer group who is living these problems. If you're sitting in your chair shaking with excitement like I am, and you want to go and start putting this stuff into action, then get beside JM and Cameron and all the fine folks at SNIA. There literally is no better community of people that are

(44:03):
really doing fantastic things, and there's lots of opportunity ahead. So let's see what's coming up in 2025. Thank you both for, number one, having created great content together in 2024. It's been great to spend time with both of you. And for folks that want to get involved, again, head to snia.org and check out the other podcast episodes as well.

(44:24):
We're on audio, and we also have these on YouTube, so if you'd like to see these beautiful smiling faces, do that. And of course, JM, you mentioned stuff from SDC and even some of the other stuff with OCP. There's tons of great public-facing content, so even if you missed it, you didn't necessarily need to be there. Check out the YouTube channels.

(44:44):
There's lots of great content available for recap and review, and I'm a fan. So with that, both of you, thank you very much. And for all the folks that are watching and/or listening, thank you, happy new year, and we'll see you all in 2025, from the SNIA Experts on Data crew. Yeah, thanks, Eric.