Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
All right, and hello everybody and welcome to the number one
(00:05):
generative AI meetup podcast in the world. So today we have a very special interview and special guests. Today we are doing an event in Palo Alto at SambaNova headquarters. So for those that don't know, SambaNova is a chip startup that is competing with the best of them. They are competing
(00:28):
with Nvidia, Cerebras, Groq. And you know, I think that they are going to be competitive. So it's very exciting because together with the South Bay GenAI meetup, we are doing one of their first events. So if you're here in person, thanks for coming. If you're just listening
(00:49):
to the podcast, thank you for listening. So today we have two special guests. We have Urmish, who is the director of machine learning here at SambaNova. And we also have Raghu, who is part of the founding team at SambaNova, and actually turned his PhD into the hardware that this company is
(01:11):
using today. So thanks so much for taking time out of your busy day and sitting down to chat with us. So maybe we can start with this: would you be able to give a brief rundown of what SambaNova is? How are you different from your competitors?
(01:32):
Yeah, so maybe I can give a high-level picture and then dive deeper as the conversation takes us. So SambaNova, as you already alluded to in your introduction, is effectively a full-stack AI company at this point. We make our own processors, and we have our own
(01:53):
different kind of architecture called the Reconfigurable Dataflow Unit, or RDU. We've taped out a few generations of it, and the most recent generation is called the SN40L. SN is SambaNova, 40 means it's the fourth generation, and I don't quite remember what the L stands for. That's the
(02:14):
most recent version that we have taped out, and it's in production right now; people are using it. We are a platform where we do both training and inference, and recently, of course, in the past few months, we've had a cloud inference API offering. So the core thesis of where
(02:39):
we are different from architectures like Nvidia. My background is partly in computer architecture, so I'll answer your question with that lens in mind. The architecture does not have the traditional notion of instructions that you stream through. It works in a different way. It's a dataflow architecture, and because of that we
(03:02):
have to build our own compiler, which is the other half, arguably the more dominant half, of SambaNova's advantage, I would say. And yeah, as of now we've been mostly focusing on inference through the cloud API offering. At least that's where all the limelight is being shone. But we've always been a training and inference
(03:27):
player in the market. Okay, very cool. So you mentioned that you have the RDU, which I guess stands for Reconfigurable Dataflow Unit. Could you maybe go into a little bit more detail on what that is? Yeah, sure. As the name says, it's the Reconfigurable Dataflow Unit, and the genesis of this architecture comes from some research done at Stanford that I was
(03:54):
lucky to be part of. The core idea behind dataflow architectures in general, and the RDU in particular, is that typically when you execute something on any architecture, you specify what to execute in terms of a stream of instructions, like the von Neumann-style
(04:17):
architecture. You stream through a bunch of instructions, and as the instructions flow through your processor they have side effects where they, you know, load data, manipulate it, and store the results back to memory. The RDU, and the dataflow architecture concept in general, doesn't have the traditional notion of instructions. Instead, the RDU has a sea of units, a two-dimensional grid of compute and memory units that are all talking to each other
(04:41):
over an interconnect fabric. When you program the chip, instead of specifying a sequence of instructions, you statically assign a role to each of these units such that the whole chip behaves like a compute graph. So at this point there's no stream of instructions flowing through
(05:03):
your chip. Every unit is going to do one or a few things that are pre-programmed, and you flow the data through it. So every unit fires and does its thing when its data is available, and pushes the results out to its next set of units. The main advantage of this is that you don't have to pay the overhead of instructions, because instruction fetch, decode, you know,
(05:24):
the register file, branching, all of these things consume overhead on your silicon, and they also introduce a bunch of synchronization overheads, so you lose performance because of that.
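The firing rule described here, where a unit executes as soon as all of its inputs are present and then pushes results downstream, can be illustrated with a toy pure-Python sketch. This is purely illustrative; it is not SambaNova's actual programming model, and all names are made up:

```python
# Toy dataflow graph: each node fires when all of its inputs have arrived.
# Purely illustrative; real RDU programming goes through SambaNova's compiler.

class Node:
    def __init__(self, name, fn, n_inputs, consumers):
        self.name = name
        self.fn = fn                # computation to run when ready
        self.n_inputs = n_inputs    # how many operands must arrive
        self.consumers = consumers  # downstream node names
        self.inbox = []

def run(graph, source_name, values, results):
    # Seed the source node, then let data "flow": no instruction
    # stream, no global scheduler, just readiness-driven firing.
    pending = [(source_name, v) for v in values]
    while pending:
        name, value = pending.pop(0)
        node = graph[name]
        node.inbox.append(value)
        if len(node.inbox) == node.n_inputs:   # all operands present: fire
            out = node.fn(*node.inbox)
            node.inbox = []
            if node.consumers:
                for c in node.consumers:
                    pending.append((c, out))
            else:
                results.append(out)

# y = (a * b) + 10, expressed as a tiny spatial graph
graph = {
    "mul": Node("mul", lambda a, b: a * b, 2, ["add10"]),
    "add10": Node("add10", lambda x: x + 10, 1, []),
}
results = []
run(graph, "mul", [3, 4], results)
print(results)  # [22]
```

The point of the sketch is the absence of a program counter: nothing sequences the nodes, the arrival of data does.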
The primary thing with AI inference, at least now, is performance. You need to run as fast as possible and remove bottlenecks, and with the dataflow architecture data just naturally flows, so there's no
(05:49):
waiting, no forced synchronization points, things like that. And as a model builder it's just very cool to see: you write your code in PyTorch and then you see this thing running really, really fast. When you're running things on a GPU, you have a neural network graph, right? Each operator is a node, and you can see each operator running step by step. When you're running things on the RDU,
(06:09):
the compiler and the hardware lay the entire graph out spatially, and then you can see all of those things running simultaneously, and that's where you get the performance. So it's pretty cool to see the same PyTorch code running on a GPU, with the bottlenecks and slowness that you see there, and then switching to the RDU, the same PyTorch code, and seeing the 6x to 10x performance benefit that you get on our stack right now.
(06:34):
Okay, interesting. So you can just easily switch between the GPU and the RDU? It's like a one-line change? Yeah, a very trivial amount of change.
Okay, so it seems like that's where your competitive advantage is, mostly for inference, and you mentioned a 6x performance boost versus traditional GPUs, and I've definitely seen some 10x
(06:57):
faster performance. Yeah, so not just inference, training and inference both, right? It's just that right now the market wants to talk about inference. If you look at what we've talked about and the papers that we published over the last two, three years, we've talked about faster training performance. Last year we did work together where we showed that for these large language models we can do 6x faster training than GPUs, using some algorithmic
(07:22):
innovations, and converge to the same accuracy. In fact we trained a 13 billion, or was it 70 billion, parameter model. I think 13. Yeah, for almost 500 to 600 billion tokens at 6x speed, and got to the same accuracy the model would reach on a GPU. So training and inference both, right? We go beyond training and inference to other domains also, but right now inference
(07:45):
is what we talk about, given where the market is. I think that's fair to talk about inference,
but I think it would be interesting to know what else these chips can do, because when I was looking at your website I kind of got the impression that these chips could only do training and inference. But as you were talking about earlier, you mentioned it was
(08:08):
kind of a general-purpose type of chip. What other types of use cases do people use this chip for, or could they use this chip for? Yeah, so we have quite a few customers who are, you know, national labs, who use our chips for scientific computations, for example. So we've certainly
(08:30):
made certain design choices that make the chip better suited for AI. For example, if you just look at the speeds and feeds of the SN40L, it's about 630-odd teraflops of bfloat16. So the difference really between AI and maybe other domains is that some domains are okay with
(08:50):
data types like bfloat16, but if you're doing scientific compute you might require, you know, FP32, the full single-precision data type. Our chip is capable of doing both bfloat16 as well as full floating point. The capabilities are different, as is the case with, say, Nvidia too, but at the end of the day the architecture of the chip and its capabilities
(09:17):
are such that it can run pipelines of parallel operators in a dataflow fashion. That's the way I'd put it.
So I want to introduce one term here, and I'll describe what it is. The architecture has been inspired by some prior research on what we call parallel patterns, and really what a parallel pattern
(09:42):
is, is just a higher-order programming construct which represents how a particular operation is getting parallelized. An example would be a mathematical operator like a dot product: you take two one-dimensional arrays and you, you know, multiply and add them together, and you get one value out of it. You can decompose that into a map operation, or a zip, which means an element-wise multiply of the two arrays, and then a reduction. So zip and reduce would be
(10:05):
the two operations there. There are other patterns like this, you know, map, filter, groupBy, sort, things like that, and very early on we knew that these primitives, whether it's AI or scientific computing or, you know, other domains which can be accelerated, you can break down the thing that takes a lot of time
(10:25):
into these components, into maps, zips, reduces, etc. So we have a path to taking a compute graph expressed like that and, you know, programming it on the RDU. In that sense, any dense or sparse linear algebra kernels, which form the heart of pretty much any compute workload, you can actually program and
(10:48):
accelerate on the RDU. Now from a business point of view, of course, you know that AI in general, and inference in particular, is one of the hottest burgeoning markets right now. So most of the messaging on the website is probably on that, but the underlying technology is not built assuming, you know, it's only inference. We would maybe have made different design choices
(11:14):
if it were only inference. Yeah, I think that makes sense given how many of the startups are using
prebuilt models and just running with them, versus the handful of companies that are actually training these larger models. But for the training case, Nvidia seems to have a moat with
(11:34):
the developed ecosystem they built with CUDA, and you mentioned it's relatively easy to port over the inference side. How hard would it be for some company to port over their whole training pipeline to SambaNova? It's not that difficult, right. I think we need to understand that when people are in PyTorch, the operator graph is traversable, so CUDA doesn't matter; CUDA is in the backend,
(11:56):
right. You traverse your graph and then, depending on your hardware, figure out how to execute it best. So if your hardware is Nvidia, you use the CUDA kernels, at least that's personally how I see it. And if your hardware is our hardware, it's mapped differently. So from an ML developer's point of view, again, it doesn't really matter, right. With a trivial amount of
(12:16):
change you can port the PyTorch code from GPU and run it on our hardware.
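Going back to the parallel-patterns idea from a moment ago, a dot product decomposed into a zip (element-wise multiply) followed by a reduce (sum) can be sketched in a few lines of plain Python. This is just an illustration of the pattern, not SambaNova's compiler interface:

```python
from functools import reduce

# Dot product decomposed into parallel patterns:
#   zip    -> element-wise multiply of the two arrays
#   reduce -> combine the partial products into one value
# Each pattern exposes parallelism that a compiler can map spatially.

def dot(xs, ys):
    zipped = map(lambda ab: ab[0] * ab[1], zip(xs, ys))  # the "zip" pattern
    return reduce(lambda acc, v: acc + v, zipped, 0)     # the "reduce" pattern

print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

Written this way, nothing in the program dictates sequential execution: the zip stage is trivially parallel, and the reduce stage forms a tree, which is exactly what makes such patterns amenable to spatial mapping.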
And we see that: we still have a lot of training customers, right. A lot of them are in the national labs, and more of them in enterprise use cases, and we see them successfully training a lot of different large language models in a wide variety of languages. So at least the way I see it,
(12:39):
it's not a big deal. What about in terms of integrating with cloud service providers? So let's say a developer can do both training and inference easily. Another moat that Nvidia has built up is all the deep integrations they have with these cloud service providers, because apart from just the fast chip there's a lot of overhead that goes into building large servers; there's networking,
(13:01):
switches, memory, and things like that. You mentioned you provide training as a service, or sorry, inference as a service, on your own hosted solutions. Can people use, now or in the future, one of your systems through AWS or Google Cloud or Azure?
Yes, I think so. Firstly, the way we've structured our business model, right, when he said we're
(13:27):
full stack, we are really full stack. We actually ship to a lot of our enterprise customers: we ship our hardware to them and manage it for them. So, effectively, the way I see it personally as a developer, and maybe there are some nuances here which I'm missing, but the way I see it is that the entire data center is managed for you. So all of the things that you talked about, networking, etc., are taken care of, right. You get APIs for fine-tuning and training,
(13:49):
you get APIs for fine-tuning, pre-training, inference, whatever you want to do. How the things get mapped to the entire distributed cluster, etc., is all hidden from you, right. So in a way we are already doing that; we are kind of doing it on-premise, behind a customer's firewall. As for integrations with cloud providers, etc., I personally don't see a reason why that won't happen;
(14:14):
why we are not there yet is something that I'm not aware of. Yeah, I think there's no technical reason,
because the programming interface is something that's standardized, and, you know, the usual mantra is "write once, optimize everywhere". There's always something, when you want to extract peak performance, that you'll have to do on every platform, so I don't think we're immune to that. But that said, we still enter
(14:40):
from the same tensor, or torch, abstraction, so there's no fundamental reason why that wouldn't happen; it might happen in the future, we just haven't done it yet. I mean, we are a startup, and as a startup one of the important things you learn is to pick and choose your battles, right. You need to build that ecosystem out as you build the company out, and picking and choosing a battle means you try to
(15:03):
understand where you can get the most business as of today and try building things out there. One tangential point on this before we move on to the next one, which is,
in addition to what Urmish just said, and Urmish kind of alluded to this earlier, so just to highlight the point again: one part is the integration with other cloud providers, and there's
(15:26):
something there, but the other important thing is, for some of our customers, something we've learned as well, for a variety of reasons they want to maintain control over their data. They want it to be within their firewall; they're not okay uploading it to a cloud service, for regulatory, security, you know, sovereignty reasons, what not. So for a variety of reasons there is a
(15:50):
concrete use case for people holding, you know, these machines behind their firewall. Yeah, they want to fine-tune, train, and deploy stuff to their employees based on their data, and there's a big use case, a big clientele, in that space, and SambaNova is effectively one of the bigger players in that space, where you get your APIs for training and inference,
(16:17):
and you get them on machines which are behind your firewall. So in a way, not being on GCP, not having to access it via some other third-party provider, is a benefit for those clients.
Yeah, 100%, I mean it makes sense. I think a lot of companies sometimes have very confidential, proprietary data, and they don't necessarily trust the big clouds,
(16:42):
and they just feel more comfortable having their own hardware. So I think that is very true; it's a really valid and good market segment. And it's a universal problem we've seen: I mean, we have customer deployments across both national labs and enterprises, spanning so many countries and continents, so it's not just one specific region or a specific set of countries, it's
(17:04):
everywhere. It's the same problem: they want their hardware behind their firewall so that their data is protected from leaks, for sure. So one thing I was curious about is, it seems
like there aren't many huge limits to the hardware from what I understand of what you were saying, but we were curious:
(17:25):
right now in your cloud offering, I think the biggest model you have is the Llama 405B model, which is running well. I tried it out, it was very fast, quite impressive, amazing I would say. But I'm curious what the limit is. Some people say there are rumors
(17:50):
that OpenAI has a trillion-parameter model. I don't know if they do, but, you know, what if we want to build GPT-5, GPT-6, and we want something bigger? We don't want just a trillion parameters, we want 10 trillion. Is that something that's possible to run on the hardware right now? Yeah, well,
(18:12):
so I'll answer your question maybe by dissecting it into two axes. One of them is how you're going to run different kinds of model architectures, and the other is what the limit is in terms of size, how big you can go, and where we qualify.
(18:34):
For model architectures, I think the prior discussion probably covers it, which is: if the entry point is similar, then you can express whatever kind of model architecture through, you know, the PyTorch layer. So in that sense we've already, you know, run Mixtral, mixtures of experts, transformers, convolutional models, and so on, so the breadth is not a problem.
(18:59):
That's the generality of it. Now comes the size: okay, how big can you go? This is where, if we divide up the AI market segment, sorry, the accelerator segment, into maybe three categories: one is GPUs and GPU-like architectures, the other one is SRAM-only architectures, it's really the memory system choice that's used,
(19:22):
and maybe us; I'm singling us out because I want to compare against these two buckets. If you look at GPU-only architectures, the bigger the model, the more memory you need. That's just the universal truth: you need to store the parameters somewhere, you need to stream them from somewhere,
and do all this. GPUs only have HBM attached to them, high-bandwidth memory, and the typical trade-off with
(19:44):
HBM is you get more bandwidth but less capacity. So most GPUs have about 60, 80, to maybe 150 gigabytes of HBM; that's usually the ballpark. So for models which are bigger, like a trillion-parameter model, you have to shard it, because you want to spread the memory required over multiple GPUs, for the capacity reason. SRAM-only architectures, you know, maybe
(20:10):
Cerebras, Groq, maybe there are others, it's the same idea, but you spread it out over SRAM, on-chip or on-wafer SRAM; that's where you hold it. Now SambaNova, the SN40L specifically: we have more types of memory. We have SRAM, we have HBM, but we also have DDR, which is high capacity but lower bandwidth, directly attached to the
(20:36):
socket. So what that enables us to do is we can adopt memory optimization algorithms where you use the DDR for capacity. You can host not just one but several models, keep them in DDR, and for the most critical things use HBM, which is
(20:58):
high bandwidth, and then on-chip SRAM gives you the dataflow and operator fusion capabilities. So in that sense the limit is really as big as the DDR, which is like 24 terabytes, 12 terabytes I should say,
(21:22):
per node, which is 8 RDUs. And if you want to go larger, you know, we have the software mechanisms to shard it across multiple sockets. So in that sense, I would say, the larger the model,
the better suited we are to take advantage of that. Yeah, the way I personally think about it is: for the same footprint, we can run the largest models amongst all of the companies that you mentioned, right. Think about it: HBM is small, and SRAM is even smaller than HBM, so if you want to
(21:46):
run a larger model you have to scale out so much just to run a single model, right. I think we've heard numbers where, even for inference, one of these companies has to run 576 sockets for, I think, a 70B model. We run a 70B model on eight sockets; we don't need 576 sockets, just because of
(22:08):
this hierarchical memory that we've created. And similarly on GPUs there are a lot of issues with scaling. And we see this being very useful. So coming back: you can run as big a model as you want. The bigger the model, the more you run into these memory constraints, and the more memory you have, the bigger the model you can run on a single socket,
(22:28):
and if you run out of space on a single socket, you scale out, right. And for us, the advantage that I personally see as a developer is that I have an option to scale out or not to scale out, depending on how much capacity I have, right, because we have this larger memory in addition to the HBM and the SRAM, while for the other providers the only way is to scale out. And we've seen that being very useful, and our customers see it being very useful. We have trained 13B, 70B, 176B parameter models,
(22:50):
and even larger ones are in training, and depending on how big a cluster and how many nodes are available internally in my ML farm, the data center that I have, I can change the socket count that I want to train on. You can't do that on a GPU or another architecture, because to fit that model you need X number of sockets, right. So if, for whatever reason, you don't have those X number of sockets to start with, it becomes very hard for you to play.
(23:11):
So the size of the model is not a problem. In fact, I think for inference we can run five trillion parameters on a single node. Five trillion parameters on a single node, is that true? Yeah, I think it's more than that, probably; five for sure.
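Pulling the memory discussion together, the placement policy being described (park many models in the big, slow DDR tier; promote whichever model is actively serving into the fast HBM tier) can be sketched as a toy in Python. The capacities, model names, and single-slot HBM policy below are all made up for illustration:

```python
# Toy three-tier placement: many models parked in big, slow DDR;
# the model currently serving requests is promoted to fast HBM.
# Capacities, names, and policy are illustrative, not SambaNova's design.

DDR_GB, HBM_GB = 12000, 512

class Tiers:
    def __init__(self):
        self.ddr = {}       # model name -> size in GB (capacity tier)
        self.hbm = None     # name of the single "hot" model (bandwidth tier)

    def load(self, name, size_gb):
        assert sum(self.ddr.values()) + size_gb <= DDR_GB, "DDR full"
        self.ddr[name] = size_gb

    def serve(self, name):
        # Promote to HBM; on a GPU-only system this swap would instead
        # cross PCIe from host memory, or force scaling out to more sockets.
        assert self.ddr[name] <= HBM_GB, "model too big for HBM"
        self.hbm = name
        return f"serving {name} from HBM"

t = Tiers()
t.load("llama-70b", 140)
t.load("llama-8b", 16)
print(t.serve("llama-8b"))   # serving llama-8b from HBM
```

The interesting property is that loading many models is a capacity question (DDR), while serving any one of them is a bandwidth question (HBM/SRAM), so swapping the hot model never requires provisioning a new machine.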
But of course we can scale out any way you want, right. The system we have has the
(23:36):
scalability to do whatever you want, and that flexibility makes it really useful for most developers in different deployments and other things. One thing I found funny was you mentioned you use the DDR RAM, the slow, high-latency RAM, almost as a storage solution, whereas on personal computers that is like the fastest memory that we have. Yeah, it shows the
(24:00):
scale of models today. Yeah. So I guess along those lines, what is the smallest model that it would make sense to host on your servers versus just running it locally? Is there a trade-off where your chips go underutilized if you run a really small model? How local are we talking
(24:24):
about here, your phone or your laptop? Phone, laptop, something like that. The thing is, as you go smaller, you can store more and more things in your SRAM, and they can run really, really fast, at insane speeds. The way I think about running locally versus running on the cloud is that it's just a choice depending on privacy concerns, latency, and things like that. But for us, I never think of a
(24:48):
smallest model where things don't make sense; the smaller the model, the better it is, because you can run it extremely fast, right. And also, I think there are two things to consider when you think about what takes up memory. One of them is model size, small versus large; the other one is how big of a context that model has. If it's a small model but you're looking at a million tokens, you will still require a lot of memory, and you still require the same bandwidth. So there you're back to saying, okay, my
(25:13):
model is fast, but I still need to keep my context somewhere. And then there's the input: if it's a multimodal thing, there's the input size as well. So all of these things require capacity as well as bandwidth. So I would say, for a very, very small model which operates on a small-to-moderate input sequence, and when I say small I mean something that fits in your CPU cache, for example
(25:36):
a hundred megabytes, well, even less than that, maybe you can run it on your laptop. But when you're using that in conjunction with speeding up a larger model, like, you know, speculative decoding for example, or you're building some agentic system where you want a small and a large model to ping-pong between each other, you'll quickly run into this problem: okay, the small
(26:01):
model runs here, the big model runs there, someone has to manage the memories, and so on. So I would say, anything beyond the CPU's cache, which is maybe like 50-60 megabytes of on-chip memory, generally you'll start seeing the impact of, okay, I'd rather use something like SambaNova for it. Talking about agents, it seems like that's the new wave of AI, and that is
(26:28):
because these models are somewhat plateauing in terms of their performance across all of these competitors, and the only way to push more performance is by building some kind of agentic behavior, just throwing more compute at the problem and having these models think through their steps and work with each other. So if we were to build an agent on the SambaNova architecture,
(26:53):
would it be better in terms of performance to host multiple models in the same environment and have them talk to each other locally? Oh yeah, definitely, and our platform is really well suited for the agentic paradigm, right. So there are a few things that you require for the agentic paradigm. One is extremely fast inference, because each model needs to run at a very fast pace:
(27:16):
either you're generating multiple answers and selecting one of them, or you're generating a sequence of answers while each answer is getting corrected with some feedback. So that's the fast inference that you need. The second thing these agentic systems sometimes need is multiple models collaborating with each other, right. He talked about this three-tiered
(27:37):
memory architecture, SRAM, HBM, DDR RAM, right; that allows us to host all of these different models simultaneously and instantly swap between them. On a different architecture, like GPUs or the SRAM/HBM architectures that he talked about, you can only host models as long as they fit in the memory. As soon as they don't fit in the memory, you need to instantiate another node or socket
(27:59):
to run that model, so it starts becoming very expensive, right. So because our platform allows these extremely fast inference speeds, and you can switch between the different models instantaneously without having to boot up another AWS instance, you can run really good agentic systems on our platform, and you'll see some of these demos today in the upcoming talk,
(28:24):
right. You can run a lot of these agentic systems very efficiently on our stack,
both from a raw performance point of view and also from a cost point of view. Yeah, there's actually a paper we wrote about this; it's going to appear in a conference soon, where we try to quantify this, saying, okay, what exactly are the costs that we need to pay attention to when it comes to agentic
(28:45):
systems. And as an infrastructure provider, any workload kind of looks like an agentic system. I can get away with saying these things because I'm not a hardcore machine learning person. But if we think about any agentic system: you have different models, you're going to produce tokens from one, which are going
(29:10):
to be the input of the other, maybe with some post-processing in the middle, and this happens iteratively, and there's going to be a collective refinement, which is the output. So at some point, as an infrastructure provider, you need to host all these models somewhere, put them somewhere, and then, based on which models are participating in the agent, you have to do
(29:30):
fast inference on them. So you need the models to be fast, and you also want the switching time to be down in the noise. And if you translate that over to, okay, how would you do this on GPUs, for example: you have to put all the models in HBM, which means if you have more agents, or, you know, the sum total of all the models that you want to provide as an infrastructure provider
(29:53):
exceeds HBM capacity, you need that many machines. So when we compare: a single RDU node, which is eight sockets, can host models which are 19 times larger than what you can host on a DGX H100. In other words, if you fill up one node
(30:15):
of SambaNova entirely with parameters, the equivalent would be 19 DGX boxes to hold the same in HBM. That's just capacity; we simply have more capacity. That's the DDR thing.
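The 19x figure follows from the capacities quoted in this conversation; a quick back-of-the-envelope check (the DGX H100 breakdown, 8 GPUs at 80 GB of HBM each, is my assumption, not stated in the transcript):

```python
# Back-of-the-envelope check of the "19 DGX boxes" capacity claim,
# using the numbers quoted in the conversation.

ddr_per_node_tb = 12                 # SambaNova node: 12 TB DDR across 8 RDUs
hbm_per_gpu_gb = 80                  # assumption: H100 with 80 GB HBM
gpus_per_dgx = 8
hbm_per_dgx_gb = hbm_per_gpu_gb * gpus_per_dgx   # 640 GB per DGX H100

ratio = (ddr_per_node_tb * 1000) / hbm_per_dgx_gb
print(round(ratio, 2))  # 18.75, i.e. roughly the quoted 19x
```

So the claim is a straight capacity ratio: one node's DDR divided by one DGX box's aggregate HBM.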
The other way is, well, GPUs also have a host, a CPU, and
(30:36):
CPUs have memory, so you'll say, well, I'm going to store it there and keep switching between CPU memory and GPU memory. That switching happens over a very thin straw, which is PCIe. That switching cost can be more than 10 to 15 times smaller, sorry, longer I should say, which means SambaNova switches around 15 times faster. So we have some results in the paper where, well, if you don't pay
(31:00):
attention to the switching, almost half your time is going to go into switching between the agents. Yeah, and at that point that's not a good user experience. If you're able to switch between them, because of the three-tier thing, it's a much more natural architecture. And we can kind of see this in Nvidia's trajectory as well, like Grace Hopper, for example; they have another layer
(31:24):
of memory. Although they've not come out and explicitly said it's for agentic workloads and so on, you can see there is a requirement for another layer of memory; it's kind of being validated by other solutions like that. I see. So I'm curious, it seems like you've done some experimenting with agents;
is there kind of like a sweet spot that you've found for maybe like the type of model that you found
(31:48):
fun runs good with agents so like I think like in a certain sense we might think like oh like I just
want to use the biggest model possible for my agents because this can be me the smartest but then
like maybe it'd be a little bit slower and then if you get the slower model then it's not as smart
and then it's like way faster I don't know if you have you done any research into like
the sweet spot is there yeah it's a very open question so we're still figuring that out
(32:13):
I think one of the interesting things that we notice is people would sometimes find a sweet spot
and then it's finding that sweet spot on a system like GPU where they're more hampered with what
they have they can't read it and fast inference of switch between models and then it sort of do
things on our stack it looks very the trade-offs look completely different right so I think it's
(32:35):
still a very open question. Part of the challenge is that the models are also always changing, like
the models are getting really, really good. If you look at it,
Llama 3.1 8B is almost as good as, or maybe better than, what Llama 2 70B was, right? So all of these
trade-offs are changing every day, so it's kind of hard to answer. I think some rules of
(32:59):
thumb that I personally use: if I want to use an LLM as a judge, use a larger model. So the smaller
model does a lot of the generation, and then to pick and choose which generation to keep, you use a
larger model, so that you incur the cost of the larger model only once and the workhorse becomes the
smaller model. And then you sort of use that kind of a system to build your agent system. But again, as
(33:21):
these models are improving, these trade-offs also keep on changing. And that's where, again, sort of
coming back to the platform: for us as a platform, we can switch between these models so
easily, and we don't care whether it's small or large, whatever
the size. We don't have to scale out, right, we can stay on the same system. It just becomes very easy
(33:44):
for us to do all of these analyses, not just for us but for our customers as well.
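A minimal sketch of that rule of thumb: the small model as the workhorse generator, the large model as a one-shot judge. The function bodies below are placeholder stubs, not SambaNova's API; in practice each would be a call to an inference endpoint.

```python
# Sketch of the "small workhorse, large judge" pattern described above.
# Both model calls are stubbed out; in a real system each would hit an
# inference endpoint (the names below are hypothetical, not a real API).

def small_model_generate(prompt: str, n: int) -> list[str]:
    # Stand-in for n cheap generations from a small model (e.g. an 8B model).
    return [f"{prompt} :: candidate {i}" for i in range(n)]

def large_model_judge(prompt: str, candidates: list[str]) -> int:
    # Stand-in for a single expensive call to a large model that picks the
    # best candidate; here we just pick the longest string as a placeholder.
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))

def generate_with_judge(prompt: str, n: int = 4) -> str:
    candidates = small_model_generate(prompt, n)  # n cheap calls
    best = large_model_judge(prompt, candidates)  # 1 expensive call
    return candidates[best]

print(generate_with_judge("Summarize the meeting"))
```

The cost shape is the point: n cheap generations plus one expensive judging call, instead of n expensive generations.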
So we talked a lot about performance and memory optimizations. I assume all this comes at some cost,
some price to the end users. What is the price difference, let's say, of generating a million
(34:05):
tokens for the latest Llama model, both in terms of the explicit cost and maybe the implicit
cost of having to change some of their infrastructure to account for these systems?
I actually don't know what the price per token is, I just know that it was competitive, I don't remember
exactly what it was. On the price per token that we charge, there are other factors influencing
(34:31):
that, which is that you have to be competitive in the market and so on. But I can speak to the
infrastructure and what the costs are in some sense, and I can answer that in terms of two variables.
One is performance per area, which is: we have a rack, what are you doing with one rack, how many
(34:53):
tokens, how many racks do you need to run a model, and so on. And the other one is performance per
watt, which is: what is the power consumption? And I think for both scenarios, not most but all
the models that we have in the cloud deployment are running on 16 sockets, and 16 sockets is one rack.
(35:13):
And a single rack, because of the details of how big our chip is and
where we end up using power, tends to be much more power efficient and hence more energy
efficient than some of the other deployments. One reason is because it's a different technology; the other is
(35:34):
that you end up using your on-chip memory instead of HBM. If you look at where the power goes, about half of it is
in data movement in inference, or maybe even more than that. So the more data
movement you can capture on chip, the less the power. That's one of the unique advantages
of dataflow. So, you know, we haven't made everything public, but we might have written a
(35:58):
blog about this: we have an energy efficiency gain of almost up to 5x compared to H100s
in some benchmarks. I think there's a public blog post on this, or a LinkedIn post or
something. That directly translates to cost, because you're paying for the electricity, it's the
(36:18):
operating expenses, and if you're able to squeeze out more tokens for less power, that's
a direct saving. The other part is the performance per area, which is
how many racks you need to sustain this many users; that's where the capacity comes in.
So 405B, you said you liked that it was very fast, and it's still running on 16 sockets, right? So it's not
(36:41):
the case that as the models increase, as the model diversity increases, it puts pressure on you,
like now I need this many more racks because I need this many parameters. Those axes are
decoupled, so you can use your hardware much more efficiently, and that also translates to cost.
So in both those regards, even though two different parts of the system are helping, one is
(37:09):
the memory, the other one is the dataflow part, it tends to make it much more cost effective
to become an inference provider. That makes sense. Yeah, we have a blog post somewhere
that you can find on our website; the number that I remember right now is 10x cheaper to
host models, just because of the three-tier thing we talked about. The other line of questions that I had was
(37:38):
the competitive landscape. So there are a couple of other companies that have come out that are
touting really fast inference. Before the SambaNova announcements, there was another
king of the hill, Groq, which was really fast, and then you guys came out, and I think it was like
twice as fast as Groq's inference, if I'm not mistaken. But I'm sure there are gonna be more
(38:01):
competitors coming out. Is Groq not doing 405B? No, I don't think so. The biggest
model, I think they did have early access, they had it up for like a day or something,
then it was brought down. Okay, I see. So, that competitive landscape?
I'd definitely use it, I guess. It depends on the price point; they might have a cheaper price point
(38:22):
at some of the smaller models. Yeah, right. And then one of our friends also works at Cerebras,
and they're thinking of IPOing, and they have a different solution to this.
So what do you think about this competitive landscape, and do you see more architectures? You
talked a little bit about SRAM, what you guys are doing, and the traditional GPUs. Do you see other
(38:44):
architectures, especially from your research at Stanford, and where is the future of this space going?
Oh, for sure, for sure. I think, just because of the amount of interest, it's bound
to attract a lot of smart people to think hard about it. You always find something to
innovate in if you go one level deeper, because this is one of those cases where, well, why wouldn't
(39:08):
we do X? It could be different ways of stacking, different packaging methodologies, and so on.
On the couple of other startups, I'm assuming you're referring to Groq and Cerebras here.
I mean, they've done a good job; it's definitely an engineering
accomplishment to scale to 500 sockets, or building wafer-scale compute engines. They've
(39:29):
done an excellent job. It doesn't change the fact that SRAM is not as dense as HBM,
so there is this trade-off between capacity and bandwidth. If on one end we're saying the
models have to get bigger, the world will have more net parameters overall, we need larger context
(39:49):
lengths, more people want to use inference, larger batch sizes; if everything
is increasing and you have a memory technology that's not going to scale with process technology
beyond a certain point, since it's physics, then you are eventually going to run into other kinds of
limitations. That's one of the things I want to mention, not to take away from the engineering
(40:14):
accomplishments, right? I'm sure they have more tricks
up their sleeve, as do we. So I would say we are kind of in this race together,
but we of course feel like the way we have built it is the more scalable way of doing things.
(40:34):
Yeah, and more players I think will enter; we had recent fundraising from these photonics
startups coming up. As he said, the market is big with a lot of potential, so you will see more smart people
figuring out different solutions. I think one thing that always needs to be thought through is that
it's not just the hardware, the software ecosystem matters, right? And the software ecosystem takes
(40:56):
time to mature, and you really need to focus on doing a lot of different things with your software,
running a lot of different models, so that your software ecosystem is always improving. That's
what gives me more confidence in SambaNova: just the
number of models that we're running, both training and inference, the sizes of the different models that
we're running. It just looks like we are in a better state in terms of software ecosystem
(41:21):
than where other people are. So the hardware and the complementing software, they're both
growing right now. The only exception to that statement I can think of is NVIDIA. Yeah,
the 800-pound gorilla in the room. Yeah, 3.4 trillion market cap.
The competitor in the room, I guess. Why did you both decide to join a smaller startup, or start your
(41:46):
own startup, as opposed to joining this dominant player? Well, my case is a bit different in that,
prior to coming here, I was a graduate student at Stanford, and the group of
students I was part of came up with the initial concept of
(42:09):
this dataflow architecture. So the journey for me, from academia
to industry, was slightly different. In that sense I wasn't really looking for a job to go to
NVIDIA or anything, I didn't have to. But the other thing is the promise. We know that
NVIDIA has done a great job; they've invested billions of dollars in not just building
(42:33):
their software ecosystem but teaching it; they've grabbed tons of mindshare over the past
decade and a half. People understand what the SIMT programming model is, what GPGPUs are.
I got into this field because of CUDA, that was my first internship, so to speak.
So they've done an excellent job in grabbing the mindshare. But we also understand that
(42:58):
with the programming model come assumptions. So if you want to make a difference, an incremental change is
not really the way to do it, that was one of the takeaways. You have to think about it from the ground
up, from the grassroots level, which means it's a lot of work, you take on a lot of risk. But the reward is,
seven years later, we're able to see that we can routinely outperform them
(43:20):
in spite of having half the bandwidth, I should have mentioned that earlier. So
one DGX box versus one DataScale box, which have eight sockets each: we have half the bandwidth but twice the
performance, right? And it's not a magic thing, we just happen to use the memory more efficiently.
So just the potential of doing something like that is the
(43:44):
allure of a small startup. I don't know what your story would be, but that's mine, yeah.
Yeah, right, it's a few reasons. So firstly, I've always been a hardware enthusiast. I come from a
background of computer architecture; before I switched my research I was doing computer architecture,
working on dataflow computers, right? So I didn't spend a lot of time
(44:06):
there, but spent enough time there to understand the potential impact that it can have in terms of
performance. So I'll always cheer for new hardware startups that are trying to change the ecosystem,
right? That's one reason. The second reason is, I've been in ML for a long time and I really
enjoy personally these problems around hardware-software co-design: how can you rethink algorithms
(44:28):
to really suit your hardware's performance? You really need to deep dive into
machine learning, understand a lot about why things work the way they work and how you can
restructure them and rethink the math to actually make them faster. So that's a very interesting problem
that I personally enjoy, right, and coming from a hardware background it becomes my problem to solve
(44:51):
in a way, right? And I think the third thing is, I generally believe that from a purely ecosystem point
of view it's important to have multiple players. You cannot just have one dominant player, because then
the ecosystem is never going to grow, right? If you're really serious about
democratizing machine learning, it's important that you democratize training and inference, and
(45:11):
platforms is where it starts, right? So that also was one of the reasons I really wanted to
join a startup. Again, why SambaNova? The reasons have been discussed before: as someone who comes from
a hardware background, I could see the appeal in what they're trying to do and
why they're going the way they are. And I've been here for four years, I don't regret it, and I do think
(45:35):
it was a good call. Yeah, I mean, this is definitely an exciting place to work, especially having the
fastest inference in the world right now. And both Mark and I, I think, understand what it's like to work
at a big corporation: there's a lot of comfort, stability, and we're building large products at scale,
but with that comes some overhead, bureaucracy, and we can't move as fast, we have to think
(45:59):
about scale and making everything compatible for everyone, all of our customers. So maybe NVIDIA is
in a similar place. I don't understand the hardware side as much, but maybe they have to move a
little slower than startups can. Yeah, possibly, right? I also don't know what goes on inside
NVIDIA and how they think, but in general, for a larger company, the bigger the ecosystem they
(46:24):
already have in place, the harder it is for them to move quickly. Speaking a little bit about the
education side: most of the research in ML, I think, until about 10 years ago was either
at universities or at Google, and then in the last five years or so the industry
(46:49):
started sucking up all the academics from the best universities, and they're just throwing money at
ML people from top universities. So, like you said, you started or joined this journey
about four years ago. What was the education landscape and the amount of research funding or
(47:14):
enthusiasm in academia like, and were a lot of other people also leaving academia for industry?
Yeah, well, for seven years actually, not four. But there's no shortage of enthusiasm. I mean,
I would say the machine learning class at Stanford taught by Andrew Ng, that's the most
(47:38):
widely taken class on the planet, probably. That was on the undergrad side, I
mean, it's open to everyone, so you'll find undergrads, grads, people who aren't computer science majors,
everyone. I don't think they book the stadium for the class, but I think
(47:59):
it's something close to that, with so many people, so many posters. That class has more
TAs than a normal computer architecture class would have students, like 60 or so teaching assistants.
There is no dearth of popularity in the set of people working on it, right? What about as you go
further along, on the graduate-level degrees, or even PhD candidates? So I think the
(48:28):
one thing is, even though we say AI and ML, it is a very broad term, and there are so many
niches even here where you find specializations, where people are trying to find, across domains,
what co-design techniques you can apply between two domains. Something that
we ourselves do here, saying how can we advance the field while we use what we have here.
(48:48):
I mean, our co-founder Chris Ré has a lab at Stanford and he does all sorts of things:
how do you increase the context length? Today Llama has 128,000 tokens; there are
fundamentally new architectures with state space models which are at like one, two million tokens.
The existing approaches can't scale, but for that you need to go one level deeper again
(49:15):
and invest in a much more radically different approach. Not all the ideas might
pan out, but that's the seed. And the path from coming up with a cool idea in academia to it being
adopted in the industry is probably the shortest now, because you don't have to wait for a publication
to be accepted at NeurIPS or ICML or something. If there's a cool idea you're able to explain and
(49:38):
able to justify, someone's going to be watching and pick it up, right? That's
the bar. In that way it's the best time, I would say, to be in this field. Oh, go ahead.
And in ML it's not just the top labs, right? You have a lot of startups in America,
so the talent is kind of spreading out. People are excited about it, they want to take on a new
(50:01):
challenge. So I would say yes, a lot of excitement. You see the number of PhD students increasing significantly
in these domains; the amount of research that is coming out is just insane, it's impossible to follow
research anymore. You go to these top ML conferences, it's like 10,000, 15,000 people, so you
can just see the amount of enthusiasm everywhere in the world around this, right? So yeah,
(50:23):
it's a great time, and it has been this way for almost six, seven years. I remember going to
EMNLP, which is the top NLP conference, or at least was at that point, in 2018, and they were
just not ready for the inflow of all the people that were coming. We actually ran out of food at
the reception, and I was just 10 minutes late. And there's this plot from Jeff
(50:47):
Dean and David Patterson, you've probably seen it, it's an old plot now so it's even
much more exaggerated today. There's one line showing Moore's Law, which is an exponential,
something doubling every two years, and another line showing the number of arXiv papers, and the arXiv papers
are growing faster than Moore's Law. Just the sheer volume of people thinking about this is staggering.
(51:10):
Yeah, and the most beautiful thing about all of this is that a lot of it is happening at a
grassroots level also. Yes, top industry labs have kind of stopped publishing, but a lot of interesting
innovation is happening at the grassroots level. You have these organizations,
EleutherAI, ML Collective, all of these organizations doing a lot
(51:32):
of this fundamental ML work with the compute they have, pushing the boundaries and talking about
it without actually publishing a paper, more releasing the artifacts, writing blog posts, and
educating everybody. And those things are turning into well-adopted technology, right? The
RoPE embedding, or the RoPE theta scaling that we talk about for context length extension
(51:53):
without training, that I think was a random Reddit post somewhere. Someone did
something interesting, people loved it, a lot of thumbs up, people picked that up, improved
upon it, and then it showed up in a Llama 2 or Llama 3 paper, I forgot which one, I might
be mixing stories. But there are such stories, and if you look at this whole world of synthetic
(52:13):
data generation, a lot of innovation is happening in the open source communities: people working at the
grassroots level, in startups, in these decentralized labs, I don't know what to call them,
right? It's just people on Reddit, people on Twitter figuring out recipes, the Nous
Research folks, I don't know how to pronounce it, right? Again, that sort of open source
(52:34):
collaboration is amazing. Yeah, so I'm curious, you mentioned Moore's Law, and I feel like
a lot of us speculate about it. For those who don't know, Moore's Law is roughly defined as:
the number of transistors is gonna double every two years, but some people have kind of
extended that to say the amount of computing power will sort of double every two years
(53:01):
or one year. What do you think, do you think that trend will continue? I've heard
Moore's Law has been dead for like the last 10 years, but I'm not sure. What do you think?
So I actually have a great quote to answer your question, which I read recently on probably Reddit or
LinkedIn. It said the number of people predicting that Moore's Law is dead is doubling every
(53:23):
two years. It was fantastic. I mean, if you look at what Moore's Law is, it's like you said:
your computing power, your overall capabilities, are increasing. It used to be every
year, then it became 18 months, now it's every two to three years, but it's some exponential which is
kind of slowing down. And earlier there used to be two effects. One is, because your device, or
(53:48):
the transistor size, is decreasing, you also could lower the operating voltage. Because of that,
up to a certain point, as you shrunk the size of your transistor, you could pack more transistors
onto the same area and double the number of transistors at the same power. So that is a straight-up
(54:08):
increase in capability without having to increase the power draw. That scaling is called
Dennard scaling, and that has stopped for now, you don't see the same scaling anymore.
We do see shrinkage in transistors: for example, our first chip was a 7
nanometer part, now it's a 5 nanometer part, and there'll be 3 nanometers and beyond
(54:34):
in terms of technology scaling, and the physical designs have a lot of innovation there.
What is also true is that transistors are only one component of a chip; that's the
thing that makes up the arithmetic logic units, the muxes, all that
stuff, and the SRAM, although the SRAM bit cells are constructed in a different way.
(54:55):
So the SRAM, the memory cells and the compute cells, those transistors scale differently.
Beyond a certain point the memory cells don't shrink anymore, because of other reasons.
But more importantly, the communication between them, the wires,
where you have an interconnect on chip, a crossbar, say,
(55:16):
yeah, those wires are implemented using metal layers on top of your silicon, and that is not scaling
at the same rate. Which means even though you have a lot of transistors, you might not be able to
power them beyond a certain point, and the way you manage your interconnect becomes a lot
(55:36):
more important. But there will be scaling, I think. I wouldn't call Moore's Law dead,
because otherwise we wouldn't be looking at, okay, what's the newer thing to be done.
But I would say that it's not the sole knob for hardware improvement. We do see much more
advanced packaging technologies now, like chiplet designs, where you do what you can on a
(56:01):
single chip and then you play a packaging exercise where you connect multiple chips together in
a multi-chip module, or chip-on-wafer-on-substrate, CoWoS-style 2.5D packaging.
All of it is effectively saying, I'm going to give you net more silicon area, but it's accomplished
using these other techniques. So I wouldn't call it dead. Long answer, short summary: it's not dead.
(56:30):
Okay, well, that's good to know, and that's also exciting, because I'm
all here for the exponential growth. So I know we are kind of running out of time, just because we
have an event that's going to start soon, but I am curious, it seems incredible that
you were able to take your PhD and turn it into a company. Do you have any advice, let's say,
(56:57):
for somebody who's maybe a graduate student, maybe an undergrad, who just has an idea?
Is there any advice you'd give to somebody who is maybe in a similar place, who
thinks, hey, I have an idea, I want to start a company? Any advice you'd give to that
person? Interesting question. I mean, first of all I want to say that,
(57:24):
because it's a systems project, it has almost, not almost, it has always been a team effort. So even
at Stanford it was a group of graduate students who collectively brainstormed, argued,
iterated, and we came up with what we came up with. And even within the company, when
we decided to start and transition, it's the group, it's a collective decision. So I would say
(57:47):
it's equal parts technology and people, really, that made it happen, and I happened to be
at the right place at the right time, so it was very fortunate. It definitely helped that we
were prepared, that we had built sort of the real thing, when this
transition opportunity came by. So just based on my own experience, one is, you can never time any of
(58:12):
these things to perfection, but if there's an idea, chase it and build it, and good things might happen.
And talk about it: if you do some cool things in a vacuum, it's very hard, especially
now with so much happening, it's very hard to get discovered, and nothing speaks better than something that
(58:33):
works. So I would say, with our co-founder Kunle, who was my PhD
advisor back at Stanford, a bunch of fortunate things happened, but we were able to act upon them because
we had enough ready, saying, okay, this could be a good starting point, and then a whole bunch of things
happened within the company. But if we weren't ready for it, then, well, something else would have
(58:57):
happened. So luck was there, but just go chase it down if you have a cool idea,
and find the right angels who can advise you and guide you in the right way.
Don't sit on your idea. Yeah, I think that's really good advice. Also, I thought it was really
(59:22):
interesting how you mentioned that it's good to kind of share the idea too, because I see a lot of
people who are very secretive about their idea, who kind of have this belief that's like, oh well,
I'm just gonna build in silence and then do this big grand reveal
to the world, and then nobody actually cares what they do. And I mean, I'm sure there
(59:44):
are exceptions to that, like if you're the type of person who's working on cold fusion
or a time machine, maybe then you can make it in secret. But I feel like
most of the time we're not doing something of that kind of extent. So yeah, I like that,
and also the people part. So I'm curious, how did you find your co-founders? Yeah, so,
(01:00:07):
just one correction: I'm not a co-founder of the company, I'm one of the founding
employees. We have three founders: Rodrigo, Kunle, and Chris Ré. And for me, since
I was a student at Stanford, I knew Kunle, of course, he was my PhD advisor, and Chris Ré,
we had a shared collaboration for a long time. So although Chris wasn't my advisor, we worked together,
(01:00:31):
we spoke pretty regularly, I don't know the cadence, but we spoke pretty regularly.
And Kunle and Rodrigo have a shared history from before. Kunle, of course, is known for his
prior work on multi-core processors and domain-specific languages and all these things,
and his first startup, called Afara, tried to commercialize the first multi-core processor.
(01:00:57):
That's where Kunle and Rodrigo, I think, had their first interaction, and the connection has been
there ever since. And Rodrigo, our CEO, of course then led the whole SPARC division
at Oracle. So that shared connection helped us translate what we had come up with at Stanford
(01:01:20):
with the right sort of industry experts, who were able to spot
the stuff that had value and then improve the stuff that didn't have value.
That's super interesting. Yeah, so that's all the questions I had, but I am curious, is there anything
(01:01:40):
that we didn't ask you that you feel needs to be mentioned, anything you want to plug,
anything you want to ask us? Kind of the floor is yours.
Only one plug: go to cloud.sambanova.ai, sign up for an API key, and try everything for free. Whatever
(01:02:02):
we've talked about, it's all there. And show us what you built, we would love to see what you built.
Yes, we'd love to see what you built. Tag us on LinkedIn, tag us on Twitter. Yeah, we'd love to see
what you build on top of the APIs. Yeah, for sure. I mean, our intention is
also to be more open, we want to learn. It's the same idea again: the community is out there,
(01:02:24):
so there's no reason for us to guess. Anything people use, even if you don't tell us anything,
is already a lot of valuable information for us. Yeah, that's all exciting. I want to use it at
the next hackathon and see what I can build specifically for agents. Yeah for sure. I'm
(01:02:45):
definitely going to use it, just because the price we were looking at seems very competitive,
and also the speed. It's so fast, it'll be good for a hackathon, because the thing is
that sometimes it takes a while to get your response, so that'll be nice.
(01:03:06):
I'm ready to try it out and use it. I played around with the cloud, I'm impressed, but yeah, I'm
ready to use it more. So yeah, awesome. Well, thank you so much for your time, it's been an
awesome conversation. And yeah, thank you everybody, we'll catch you in the next one. Thank you.