Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Jerod (00:04):
Welcome to Practical AI, the podcast that makes artificial intelligence practical, productive, and accessible to all. If you like this show, you will love The Changelog. It's news on Mondays, deep technical interviews on Wednesdays, and on Fridays, an awesome talk show for your weekend enjoyment. Find us by searching for The Changelog
(00:24):
wherever you get your podcasts. Thanks to our partners at fly.io. Launch your AI apps in five minutes or less. Learn how at fly.io.
Daniel (00:44):
Welcome to another episode of the Practical AI podcast. This is Daniel Whitenack. I am CEO at Prediction Guard, and I'm joined as always by my cohost, Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris?
Chris (01:01):
Doing very well. How's it
going today, Daniel?
Daniel (01:03):
It's going great. Yeah. It's been a fun, productive week in the AI world over here at Prediction Guard, so no complaints. But I'm really excited about this episode because it's one I've been wanting to make happen for quite a while.
Today, we'll be talking about both AI hardware and software
(01:26):
with DJ Singh, who is a staff machine learning engineer at Groq. How are you doing, DJ?
DJ (01:34):
Hey, Daniel. Thanks for
having me. It's been going well,
yeah.
Daniel (01:38):
Yeah, good, good. Yeah. And I guess we should specify for our audience, this is Groq as in G-R-O-Q. I imagine maybe some people get confused these days with that. But yeah, this is one that I've been really excited about, DJ, because I've been observing what Groq has been doing for some time and, of
(02:00):
course, innovating in a lot of different ways, like I mentioned, on the hardware side and on the software side.
Could you maybe just set the stage for us a little bit in terms of the overall ecosystem as you see it, in terms of what may be a bloated term of, like, AI accelerator or hardware, and
(02:21):
also the software that goes along with that, and kind of where Groq fits into that ecosystem?
DJ (02:27):
Right. So I think I'll first start with a quick brief about Groq. So Groq is, of course, a company which provides fast AI inference solutions. So whether it's text, image, or audio, we are delivering AI responses at blistering speeds, an order of magnitude faster than traditional providers. Now you
(02:49):
spoke of AI accelerators, and, traditionally, training and inference have been done on GPUs.
But I think in the last few years, we've seen all sorts of AI accelerators come into place. So there are those more mobile device oriented ones that phone companies like Samsung and Apple
(03:10):
come up with, right? And then there's more happening on the server side, part of which is what Groq is also leading towards.
Daniel (03:22):
Yeah. That's great. And on the server or hardware side, am I correct that Groq does have their own hardware that they've developed over time? Is that right? What's kind of been the progression of that, and the current state?
DJ (03:37):
Absolutely. So Groq developed this technology which we call the Groq LPU. It's essentially a software and hardware platform which comes together to deliver that breakthrough performance of low latency and high throughput. But how Groq got into it was first to develop that software.
(03:59):
So we developed the software compiler first before moving on to the hardware side, kind of a shift from how traditional development was being done previously.
Daniel (04:11):
And that does seem very unique to me. So what was, I guess, the motivation or the thought process behind taking maybe that non-standard approach, kind of compiler first, then hardware?
DJ (04:25):
Yeah. No. Absolutely. So, traditionally, as I mentioned, development is done such that a new accelerator is developed. So somebody makes the hardware first, and then the software has to deal with the inefficiencies of the hardware.
Whereas when Groq decided, and this company is founded by Jonathan Ross, our CEO, who was a co-founder of Google's TPU
(04:50):
program, the tensor processing unit program, and based on his learnings from there, one of the key decisions was, let's develop the software first. Right? So we have developed this software compiler which helps to convert these AI models into code
(05:11):
which runs on the Groq LPU, and specifically, the compiler is responsible for scheduling each and every operation of that AI model. So you can think of an AI model, in computing terms, as being made up of, like, additions and multiplications. And the software compiler decides where and
(05:35):
when to schedule something.
And that goes into our, you know, various design principles, one of which, of course, I mentioned is to be software first. Right? Now you might ask, why do we do this? Right? So one key consideration is that not only does the software
(05:58):
have to deal with hardware inefficiencies.
Right? But there are other aspects of the hardware which can add on delays, whereas Groq prefers to have a deterministic system in place. So determinism, I would say, is like deterministic compute and networking, to kind of have an understanding of where and when to schedule an operation. So to
(06:23):
understand this, we can consider an analogy. Now imagine a car driving along a road with several stop signs.
Stopping at every sign is essential for safety, but it does add some delays. Right? Now what if the world was perfectly scheduled and we knew when to start the car and drive at
(06:45):
maximum speed so that there are no collisions? Right? So there would be no need for these stop signs, no delays as such, and it also makes more efficient use of the road, since you can then have more cars and everybody's, like, going at maximum or near maximum speed.
So to reflect this analogy back to the hardware space, Groq
(07:09):
chose to remove components which can add delays. So it could be, let's say, network switches or even algorithmic delays, some sort of algorithms which control packet switching. These things all add non-determinism into the system.
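To make the scheduling idea concrete, here is a minimal toy sketch of static scheduling, where every operation is assigned a fixed start time at compile time so nothing has to wait or arbitrate at runtime. It is purely illustrative and not Groq's compiler; the operation names and durations are made up.

```python
# Toy illustration of static scheduling: every operation gets a fixed start
# "cycle" ahead of time, so at runtime there is no arbitration or queuing.
# Conceptual sketch only, not Groq's compiler; names and durations are made up.

ops = [
    {"name": "matmul_0", "duration": 4, "deps": []},
    {"name": "add_0",    "duration": 1, "deps": ["matmul_0"]},
    {"name": "matmul_1", "duration": 4, "deps": ["add_0"]},
]

def schedule(ops):
    """Assign each op a fixed start cycle once all its dependencies finish."""
    finish = {}   # op name -> cycle at which it completes
    plan = []
    for op in ops:  # assumes ops are listed in dependency order
        start = max((finish[d] for d in op["deps"]), default=0)
        finish[op["name"]] = start + op["duration"]
        plan.append((op["name"], start, finish[op["name"]]))
    return plan

for name, start, end in schedule(ops):
    print(f"{name}: starts at cycle {start}, finishes at cycle {end}")
```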
Daniel (07:28):
I did want to maybe, for some of our listeners out there, you've been talking about this compiler level, and, you know, I think of a compiler similar to what you said: hey, I'm writing some higher level software code that's compiled to these instructions that run under the hood on the actual
(07:48):
hardware components doing, as you said, additions or whatever those sorts of numerical operations are. But people might sort of be confused also in terms of the software stack. They may be familiar with something like CUDA, which helps, you know, have drivers to run on certain hardware like NVIDIA GPUs. Or I know, you know, we've
(08:12):
worked a little bit with Intel Gaudi processors, and there's the driver package Synapse, which is similar in that sort of way; it helps translate kind of your higher level code to run on these hardware components.
Could you help us kind of map out that software stack, like where this compiler fits in? And are there other components like
(08:34):
these drivers that would have a parallel in the Groq world?
DJ (08:39):
Yeah. So traditionally, as you've mentioned, like on, let's say, the NVIDIA ecosystem, there are like tons of engineers who go and create these kernels which are invoked when you have some sort of model operations. So there would be maybe even
(09:01):
thousands of engineers in the company who would work towards developing these very specialized kernels to go and execute things. However, due to the structure of the GPU itself architecturally, this is not the best philosophy for design. You know, I'm sure the audience is familiar with GPUs.
(09:23):
I remember playing games on them growing up and editing videos, and these grew to be more powerful in recent decades. But, you know, GPUs started in the nineties and the design hasn't changed all that much. We've had the addition of high bandwidth memory and other hardware components, but all of it
(09:49):
essentially still originates from the original design. It does make the system, again, less deterministic, so that goes back to the compiler system here. And let's talk about the NVIDIA GPU kernels here, right?
So they have to deal with the different hierarchies of
(10:10):
memories, as an example. So for those of the listeners who are familiar with the different memory systems in a computer system, right? You might be familiar with, like, an L1 cache, which has an access time of like one nanosecond, but you do then have these bigger memories, which are
(10:31):
high bandwidth memories, which are like closer to 50 to 100 nanoseconds. And for a task to be processed performantly, data needs to be fetched between these different memories onto the compute which is there, and that transfer of data adds in more delays. And since this is a
(10:52):
conservative system, right?
So let's say you have two operations and one depends on the other; it's waiting on that operation to complete. So it adds on further delays, you know. So one operation is stuck waiting on the data. The other operation is stuck waiting on that one. Right?
So that kind of just incrementally adds more and more
(11:15):
delays into this. So that's an example of how the traditional, I guess, compiler or the traditional kernel based system doesn't scale as well. What Groq chooses to do, of course, is not have any kernels whatsoever, but have a compiler which controls this at a fine grained level. So a typical
(11:38):
system will have multiple chips. Right?
So, you know, I'm sure people are familiar with models like Llama 70B, right? And these models tend to be spread across multiple GPUs, and even on multiple Groq chips, right? And this compiler kind of controls how this model is precisely
(12:00):
split across these different chips and how it's executed, to get the best performance out of it, down to the level of the chipset and the networking. So as I mentioned before, we've removed a lot of the hardware which adds delay, and this sort of scheduling is done by Groq's compiler alongside
(12:24):
some assistance from, of course, the firmware which is there.
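A rough back-of-the-envelope model of the stalls described above, using the approximate latency figures from the conversation (L1 around 1 ns, high bandwidth memory around 50 to 100 ns). The compute time per operation is an assumption added purely for illustration; none of this reflects measurements of any specific chip.

```python
# Toy model of dependent operations stalling on memory fetches.
# Latency figures are the rough numbers from the conversation, not benchmarks.

L1_NS = 1          # ~1 ns access time for L1 cache
HBM_NS = 75        # ~50-100 ns for high-bandwidth memory; midpoint used here
COMPUTE_NS = 10    # assumed compute time per operation (illustrative only)

def chain_latency(num_ops, hbm_fetches_per_op):
    """Total time for a chain of dependent ops, each waiting on its inputs."""
    per_op = COMPUTE_NS + hbm_fetches_per_op * HBM_NS + L1_NS
    return num_ops * per_op   # ops are dependent, so latencies add up

print(chain_latency(num_ops=100, hbm_fetches_per_op=0))  # data kept close: 1,100 ns
print(chain_latency(num_ops=100, hbm_fetches_per_op=2))  # two HBM trips each: 16,100 ns
```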
Chris (12:27):
And I appreciate that. As we talk, I'm trying to get a good sense of kind of how the whole stack looks as you're starting to dive into it. You've talked a bit about the compiler versus having a kernel at the model layer there. But with you guys covering both the hardware and the software,
(12:50):
would you say Groq, as I try to understand the whole business model that you're approaching it with, is it more of an integrator that's full stack all the way from the hardware up through the OS and into the model layers, or more of an integration layer? Are you writing most of the software stack that's touching the hardware?
(13:11):
How do you choose whether to go pick, and I'm just pulling things out of there, not attributing them to you, but go and pick Linux and pick CUDA and pick this and that, versus what you're writing to create your own full stack? I'm trying to get a sense of kind of how those decisions are distributed from a design
(13:31):
standpoint.
DJ (13:32):
Yeah. No. That's a great question. So all the way through our stack, right, let's start at the top. Most folks, when they think about using AI models in production, would end up using some sort of API.
So our cloud organization designed a REST compatible API.
(13:55):
It's compatible with the OpenAI spec, which makes it very easy for developers to really integrate with it. And then that ties all the way into the rest of our stack. And to answer your question directly, yes, most of the stack has been custom written. We are of course using
(14:16):
some Linux based primitives underneath our system, and there are of course some components, such as, for the compiler, this MLIR system which is being used. MLIR is like a compiler term. I don't want to go super deep into
(14:37):
it, but it's a multi-level intermediate representation which kind of helps to transform things in between. So overall, I would say this entire design pattern has been thought through from scratch, and it's taken the company a couple of iterations to get to that point.
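For listeners who want to see what an OpenAI-spec-compatible request looks like in practice, here is a minimal sketch using plain HTTP. The endpoint URL and model id are assumptions for illustration only; check Groq's documentation for the current values.

```python
# Minimal sketch of an OpenAI-spec-compatible chat completion request.
# The base URL and model name below are assumptions for illustration.
import os
import requests

resp = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",  # assumed endpoint
    headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
    json={
        "model": "llama-3.3-70b-versatile",  # assumed model id
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```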
Sponsor (15:07):
Well, friends, I am here with a new friend of mine, Scott Dietzen, CEO of Augment Code. I'm excited about this. Augment taps into your team's collective knowledge, your code base, your documentation, your dependencies. It is the most context aware developer AI, so you won't just code faster. You also build smarter.
It's an ask-me-anything for your code. It's your deep thinking
(15:28):
buddy. It's your Stack Overflow antidote. Okay, Scott. So for the foreseeable future, AI assisted is here to stay.
It's just a matter of getting the AI to be a better assistant. And in particular, I want help on the thinking part, not necessarily the coding part. Can you speak to the thinking problem versus the coding problem and the potential false dichotomy there?
(15:48):
A couple of different points to make. You know, AIs have gotten good at making incremental changes, at least when they understand customer software. So first, the biggest limitation that these AIs have today: they really don't understand anything about your code base. If you take GitHub Copilot, for example, it's like a fresh college graduate: understands some programming languages and
(16:09):
algorithms, but doesn't understand what you're trying to do. And as a result of that, something like two thirds of the community on average drops off of the product, especially the expert developers.
Augment is different. We use retrieval augmented generation to deeply mine the knowledge that's inherent inside your code base. So we are a copilot that is an expert and that can help
(16:31):
you navigate the code base, help you find issues, fix them, and resolve them over time much more quickly than you can trying to tutor up a novice on your software.
So you're often compared to GitHub Copilot. I've got to imagine that you have a hot take. What's your hot take on GitHub Copilot?
I think it was a great one point zero product, and I think they've done a huge service in promoting AI. But I
(16:55):
think the game has changed. We have moved from AIs that are new college graduates to, in effect, AIs that are now among the best developers in your code base. And that difference is a profound one for software engineering in particular. If you're writing a new application from scratch, you want a web page that'll play tic tac toe, piece of cake to crank that out.
(17:16):
But if you're looking at a tens-of-millions-of-lines code base, like many of our customers, Lemonade is one of them.
I mean, a 10 million line mono repo. As they move engineers inside and around that code base and hire new engineers, just the workload on senior developers to mentor people into areas of the code base they're not familiar with is hugely painful. An AI
(17:38):
that knows the answer and is available seven by 24, you don't have to interrupt anybody, and it can help coach you through whatever you're trying to work on, is hugely empowering to an engineer working on unfamiliar code.
Very cool. Well, friends, Augment Code is developer AI that uses deep understanding of your large code base and how you build software to deliver personalized code
(18:00):
suggestions and insights. A good next step is to go to augmentcode.com. That's augmentcode.com. Request a free trial, contact sales, or if you're an open source project, Augment is free for you to use.
Learn more at augmentcode.com. That's augmentcode.com. Augment
(18:25):
Code dot com.
Daniel (18:32):
So, DJ, you mentioned that a lot of the focus around, you know, really that design from the hardware layer up through those software layers, and digging into all of those, was to achieve fast inference. Could you tell us a little bit about the kinds of models that you've run on Groq
(18:52):
and just some highlights in terms of, when you say fast performance, what does that mean in practice? Now I've seen some pretty impressive numbers on your website, so I won't steal your thunder. But yeah, just talk a little bit about kind of what is achievable with what kinds of models on the Groq
(19:16):
platform.
DJ (19:17):
Yeah. So first of all, you know, I'll share some numbers, but we are just getting started. So these numbers are only going to get better with time. But, like, let's take Llama 3 70B as an example; it tends to be one of those industry standards for comparing performance. So we've had
(19:37):
numbers all the way from like 300 tokens per second to, like, multiple thousands of tokens per second, depending on the use case, right?
And, yeah, we've had some smaller models which go up to several thousand tokens per second. We've had one of our
(19:57):
speech to text models, called Whisper, which is again an OpenAI model, running on Groq. And on this model, I think we've gotten around a 200x speed factor, as they discuss it in the audio world.
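As a quick sanity check on what those figures mean in wall-clock terms, here is some rough arithmetic; the response length is an assumed value used only for illustration.

```python
# Rough arithmetic on the quoted figures; illustrative only.

tokens_per_second = 1000           # "multiple thousands" on smaller models; 1,000 used here
response_tokens = 500              # a typical medium-length chat answer (assumed)
print(response_tokens / tokens_per_second)    # 0.5 s for the whole response

speed_factor = 200                 # Whisper at roughly 200x real time, as quoted
audio_minutes = 60
print(audio_minutes * 60 / speed_factor)      # 18 s to transcribe an hour of audio
```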
Daniel (20:12):
Yeah. And maybe talk a little bit about, for those out there thinking about processing these thousands of tokens per second, what does that imply? I would say, you know, if you're using a chat interface, for example, and something is responding at thousands of tokens a second, it's potentially a wall of text. It's sort of almost all at once
(20:36):
as far as our human eyes see it. Could you talk a little bit about the implications of that?
So I mentioned the chat interface, and certainly some people are using chat interfaces. Right? But at the enterprise level, for true enterprise AI use cases, why is fast inference for these kinds of models important?
(21:00):
Because in a chat interface, I can only read so much text so fast with my own human mind as it comes back to me. And I certainly have my own thoughts on this, but I'm wondering if you could speak to why that speed matters in enterprise use cases, and why it matters to push
(21:23):
that maybe further than, you know, our own speed of reading, for example?
DJ (21:30):
No. Great question. So I think if you were to start with what Google's studies from a decade ago on perception of, like, search results showed, it's that if it takes longer than, I think it's about two hundred milliseconds or so, somebody, like, loses interest, you know? So speed is critical, whether it's for the
(21:51):
enterprise or everyday people, right? I mean, we've demonstrated this several times, and you can try it out for yourself.
You can have, like, let's say you open ChatGPT with something like o1, or you have Groq on the side with one of our reasoning models, and you can try comparing them side by side. So
(22:12):
what becomes more critical, as I'm coming to, is that everybody thinks of speed as being, yes, important for real time applications, but then there is the aspect of accuracy, right? So if you could reason for longer, let's say in the case of our reasoning models, so we've had, like, DeepSeek R1 for
(22:35):
example, right, and these models, they generate a lot of tokens. And if you can reason for longer, you can get higher quality results as a consequence of this. So while not making the system too slow for the user, right?
So whether, again, it's enterprise or it's for everyday users,
(22:59):
speed can translate to quality as well.
Chris (23:01):
So to extend that just a little bit, and we've kind of been talking directly about inference speed and stuff like that, more from the practitioner standpoint, if you're maybe a business manager or a business owner out there, and you're looking at Groq, and you're kind of comparing it against more traditional inference options that are
(23:24):
already out there, when you're talking in terms of speed, and, for instance, being able to have the time to do the research and stuff, what are some of the use cases from a business standpoint where they need to go, it's time for us to reassess the more traditional routes that we've taken on inference, and look at
(23:45):
Groq for these solutions? Could you talk a little bit about what some of those business cases would be?
DJ (23:50):
Yeah. I mean, if you care about accuracy, speed, or cost, you should consider Groq. So not only are we fast, the Groq LPU architecture allows us to offer really low cost, or I would say, our costs per token are really low, and we pass on those savings to all of our customers. So if you are concerned about
(24:14):
any of these cases and you want to work with different modalities, if you care about image, text, or audio, if you care about RAG, if you care about reasoning, we are there for you.
Daniel (24:25):
Yeah. And just to tie into that as well, some people might be listening to this and thinking in their mind, oh, Groq has this whole platform that they've designed, hardware and software. I don't have a data center. It's going to be expensive for me to spin up racks of these things. Could you talk a little bit about, and I
(24:50):
could be mistaken, so please correct me,
I think that is something that can happen. There are physical systems that people can access and use and potentially bring into their infrastructure. But I know also I see a login, I see an API, as you mentioned the REST API in your previous answer about
(25:10):
the developer experience. Maybe just talk through some of those access patterns and also how you as a company have thought about which of those you provide. Because certainly there are advantages on the hardware side of maybe a fixed cost, but then there's the burden to support that.
(25:31):
Just talk us through a little bit about the strategy that you all have taken. Because you are deploying this whole platform, how have you thought about providing that to users, and in what sort of access patterns, I guess?
DJ (25:45):
Right. So I'd say to start with, one can go to our website, groq.com, and just experience the speed themselves. It's a chat interface, and it's trivial to sign up for an account over there, and on the free tier, we offer like tons of tokens over there for free. You can sign up and get access to our APIs. So,
(26:11):
once you get access to our APIs, and let's say you've already been using an existing API, let's say you're using OpenAI, it's pretty easy for you to switch to Groq, and it's maybe a single or two line change, and just try it out for yourself, right?
We firmly believe in letting people experience the magic
(26:32):
themselves rather than us talking about it. I think actions just speak louder. So, yeah. For our deep enterprise customers, we, of course, do offer other services on that side, right? So, we're talking about single tenant, and then there are, of course, multi tenant based architectures over there.
(26:55):
So we do offer dedicated instances where there's a real need for that, and we do manage that. So Groq kind of deploys its own data centers, and we offer those all over an API. So it's very easy for our customers to go and sign up and use them.
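Here is a sketch of the "single or two line change" DJ describes, using the OpenAI Python SDK and pointing it at an OpenAI-compatible Groq endpoint. The base URL and model id are assumptions; consult the current documentation before relying on them.

```python
# Sketch of switching an existing OpenAI SDK integration over to Groq.
# The base_url and model id are assumptions for illustration.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",   # changed line 1: point at Groq
    api_key=os.environ["GROQ_API_KEY"],          # changed line 2: use a Groq key
)

chat = client.chat.completions.create(
    model="llama-3.3-70b-versatile",             # assumed model id
    messages=[{"role": "user", "content": "Say hi in five words."}],
)
print(chat.choices[0].message.content)
```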
Chris (27:14):
I'm going to ask you if you could kind of talk a little bit about it, just because as folks are listening, they will go try that out afterward, and I know that we'll have links in the show notes to the site so that they can do that. But could you talk a little bit about, and you could pick your example, but you mentioned OpenAI, and that's
(27:34):
something that they've probably had experience with. It's one of those things that everybody has at least touched at some point out there, and you're providing a better experience here. Could you talk a little bit about what that is? When you talk about, go experience this yourself, and you're going to see how amazing it is, could you talk through what you've seen
(27:56):
your customers experience in that way, just so that listeners will kind of get a sense, or maybe a preview, of what they should experience?
Having messed around with OpenAI for a while, and now they're going over to Groq and they're doing that, and they're going, whoa, this is amazing. What is that amazing thing that you're expecting them to see?
DJ (28:14):
Well, first of all, people are just amazed by the speed that they get, like the speed of the output that comes up, you know, whether it's text or audio, you just get the output right away, right there. It's really, really fast, and I think it really makes people think of new ways of
(28:34):
doing things. So, you know, one example from our developer community, and, you know, our developer community has grown to over a million developers now. So one recent example from a hackathon was that somebody developed this snowboarding navigation system based on Groq, taking images and kind of trying
(28:57):
to guide people while snowboarding. I mean, my mind was blown by these creative geniuses that are out there. Just amazing. So all sorts of new applications out there, enabled by the speed.
Daniel (29:11):
Well, DJ, I do want to follow up on some of what you had talked about there on the developer community. So could you maybe clarify one thing for me? So there are the Groq systems that you have deployed, and models that you have deployed in those systems, which it sounds like, if I'm interpreting things right, people can just use, I'm
(29:34):
assuming, your programming language clients or REST API to access, and build off of those models that are in that environment. So in that case, it's sort of accessing models, like you say, in a similar way to how they would access OpenAI models and that sort of thing. Is there another side of
(29:54):
the developer community that is saying, hey, well, actually, we have our own custom models, whatever those might be? What is the process? I guess my question is, what is the process of getting a model supported on Groq? You've talked about mainly kind of the GenAI level models, LLM or vision or
(30:19):
transcription. How wide is the support for models? In terms of, hey, if I have this model, I'm thinking in my mind of a manufacturing scenario, if I have a very specific model that needs to run at extremely fast speeds to classify the quality of products coming off of a manufacturing
(30:44):
line, right? But it's a custom model.
I say, okay, Groq has the fastest inference. What should I expect in terms of model support as of now, in terms of architectures, and then your vision for that in the future, and also how maybe people could contribute there if there is an
(31:04):
opportunity?
DJ (31:05):
Yeah. I think right now, one can just reach out to our sales team and we can figure it out. So based on the workload and the size of the model and things like that, we can figure out what's the best path going forward. Now going to the future, we have some very exciting developments, but I
(31:27):
don't want to spoil that right now since it's still a work in progress. So I guess we'll disclose that whenever we can.
Daniel (31:35):
And maybe kind of along with that, I know we have, you know, even my team, tried out running models on a variety of GPU alternatives. Sometimes what happens there is the latest model comes out on the market, and it's maybe, you know, supported in certain driver ecosystems very quickly. And
(31:59):
then maybe on some of these alternatives, there needs to be a kind of longer pathway for support in custom software stacks that aren't GPU based. How do you all navigate that right now? I know, of course, our team is small and it's hard for us to navigate that.
Maybe you have people thinking about those things every day.
(32:22):
But yeah, how do you navigate that challenge as an engineering team, to support all of these different models as they're coming out, given that you have a completely different software stack than others are working with in the ecosystem?
DJ (32:40):
Yeah. If you think about it, we don't have to write kernels at a per model level, you know? So when a new model comes out, generally in the GPU world and even on other custom accelerators, typically, people spend a lot of time writing more optimal versions of it. So you might hear about new CUDA kernels
(33:03):
being launched. Let's say, you know, after the original attention, there was FlashAttention.
So that's a more optimal way of running some of these models on the GPU. But we don't have to do this at a per model level. What ends up happening is, as we enhance our compiler over time,
(33:23):
all these enhancements just reflect onto all of the models that we end up supporting, and the process to support different models on Groq is kind of similar. We end up spending some time removing vendor specific hardcodings. Right?
So there tends to be a lot of GPU specific code, which we end
(33:46):
up removing. And then we kind of run our compiler to translate this, finally, to the Groq hardware, but there are a lot of knobs we tweak and turn to give you the best possible performance out of that. And as the compiler improves with time, we just end up passing on these improvements to
(34:07):
all the models right away. So our effort per model is not as high, you know.
Daniel (34:14):
So, just to clarify on that point, these models would kind of roll out. You would kind of build into the compiler less vendor specific, more general functionality over time, which would expand your ability to support certain types of operations. But you wouldn't necessarily be able to say, hey,
(34:37):
I've got this random model, you know, some research team created their own architecture, right, of this crazy thing.
It may take some effort to kind of map that into the Groq software stack, but maybe, if I'm hearing right, sort of less burden over time as the ecosystem develops. Is that the
(34:59):
right way to interpret that?
DJ (35:01):
Partially, yes. But I would add that if you think about what the Groq system is at the heart of it, right, it's matrix multiplications and vector matrix multiplications. And that's what most machine learning models are. Right? Yes.
When we have a generational shift like transformers, one might want to go and look at what the new model type is and
(35:25):
how well it maps to our hardware. We might want to have some strategies to address some of that, right? But fundamentally, models haven't changed all that much after transformers were introduced. Now, you know, you kind of hear about diffusion models, even in the text world most recently. But as long as these fundamentals don't change
(35:50):
frequently, I think our core belief of just supporting this wide ecosystem of models continues to hold sturdy.
If you look at other AI accelerators, some of them have gone and hard coded, let's say, to the transformer architecture itself, and their bet is that super specialization is the way
(36:14):
to go. But our belief is that we would like to support a wider range of models. And that's pretty much what our compiler system does, to kind of map between this high level, let's say, PyTorch model and the Groq platform, converting it to, let's say, an intermediate layer where the
(36:40):
compiler can work independently of what model it is. So there's no hard coupling, let's say, to a particular model or even to an architecture type. It's very low coupling.
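To illustrate the "mostly matrix multiplications" point, here is a tiny NumPy sketch of a transformer-style feed-forward block reducing to two matmuls with an elementwise nonlinearity in between; the shapes are arbitrary and the code is purely illustrative, not Groq code.

```python
# A transformer feed-forward block is, at its core, two matrix multiplications
# with an elementwise nonlinearity in between. Shapes here are arbitrary.
import numpy as np

def feed_forward(x, w1, w2):
    """x: (tokens, d_model), w1: (d_model, d_ff), w2: (d_ff, d_model)."""
    hidden = np.maximum(x @ w1, 0.0)   # matmul + ReLU
    return hidden @ w2                  # second matmul

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))        # 8 tokens, model dim 64
w1 = rng.standard_normal((64, 256))
w2 = rng.standard_normal((256, 64))
print(feed_forward(x, w1, w2).shape)    # (8, 64)
```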
Chris (36:55):
I'm curious. I've been really kind of spinning on the speed you're talking about in terms of inference, and some of the capabilities that your stack offers. In general, as the model ecosystem has been developing through the second half of last year and into this year, raging into this year, kind of agentic AI, and then that's kind
(37:17):
of evolving into physical AI, so you're dealing with robotics and autonomy and things like that that you're supporting, to where we're expecting an explosion of devices out there in the world that these systems are supporting, what is your strategy and approach going forward for thinking about kind of
(37:37):
physical AI that we're evolving into, where you have agents that are interacting with physical devices that are interacting with us in the real world, so it's not all in the data center, but the data center is supporting that? How does that fit into your overall view forward?
DJ (37:57):
Yeah, I think the AI industry evolves very rapidly. Like, personally, I don't think there can be any long term strategy which will not need adjustments based on developments, but our belief is still that, for edge based deployments, you know, calling things over APIs
(38:18):
will be the preferred interface going forward for a long time, right? So, sure, your, let's say, mobile chip might be able to perform some basic level tasks over there. But if you need, like, really high accuracy, high quality model inference, doing this over an API, I think, would get you
(38:42):
there, compared to the, you know, model size which you can actually deploy on a mobile phone.
So that's just another example, for, like, an edge device.
Daniel (38:52):
I have one question for you, just as an engineer that has been working at the forefront of this inference technology: what have been some of the challenges that, I guess, you've faced as you've really dug into these problems, maybe that were unexpected, or maybe they were expected for you? What
(39:14):
have been some of the biggest challenges, and maybe some learnings that, looking back on your time working on this system, you can share with the audience?
DJ (39:24):
Yeah. No. Great question. As I said, I think the AI industry moves really fast, and sometimes there are these shifts. Right? So we saw this shift to large language models, and that's when the company itself kind of pivoted to focus on this. So Meta releasing Llama and the Llama 2 series of models was really
(39:45):
what got our company to focus on this side and really push on this, right? So similarly, I think, we are a startup. We are always pushing on all fronts, always trying to improve on things. So whenever there's some new architectural change, we look to see how we could best adapt our system for that, to
(40:09):
maximize throughput, right?
So sometimes there are these kinds of changes, and, you know, this is something which actually excites me about Groq and working at such a talent dense company. My colleagues really come up with great, exciting new ways of doing
(40:30):
things, to really push the bar on some of these things. So, maybe it could be a mixture of experts or reasoning models, whenever something new comes up. Right? I think getting the maximum performance out of that is something we care about. We deeply care about it. And, yeah, I think that's been one of the
(40:54):
key areas.
Daniel (40:55):
Awesome. Well, as we kind of get close to an endpoint here, this has been fascinating. I'm wondering, DJ, if you could just close us out by sharing some of the things that you think about personally kind of going into this next year. As you mentioned, things are moving so fast. There are shifts that
(41:16):
are happening.
What are some of the things that are most exciting for you as you kind of head into this next year of development and work?
DJ (41:24):
So as a developer and amateur data scientist, I would say that for me, the push on the coding side of the AI world has been very exciting. It helps me kind of think about how I can have more impact, whether it's at Groq or in the world in general. So the push of AI on the coding side, reasoning
(41:48):
models, multiple modalities, and the fusion of all of this, right? I think that's what I really look forward to for the next couple of years. There's of course the robotics bit, which we touched upon, but that, I feel, is probably a couple of years down the line.
Daniel (42:06):
Awesome. Well, thank you, DJ, for representing Groq, and congratulations on what you and the team have achieved, which is really amazing and monumental work. So great work. Keep it going. We'll be excited to follow the story, and hope to get an update again on the podcast sometime soon.
(42:27):
Thanks.
DJ (42:27):
Sounds great, guys. Thanks
for having me.
Jerod (42:37):
All right, that is our show for this week. If you haven't checked out our Changelog newsletter, head to changelog.com/news. There you'll find 29 reasons, yes, 29 reasons, why you should subscribe. I'll tell you reason number 17.
You might actually start looking forward to Mondays.
DJ (42:57):
Sounds like somebody's got a
case of the Mondays.
Jerod (43:00):
28 more reasons are waiting for you at changelog.com/news. Thanks again to our partners at Fly.io, to Breakmaster Cylinder for the beats, and to you for listening. That is all for now, but we'll talk to you again next time.