Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Jerod (00:04):
Welcome to Practical AI, the podcast that makes artificial intelligence practical, productive, and accessible to all. If you like this show, you will love The Changelog. It's news on Mondays, deep technical interviews on Wednesdays, and on Fridays, an awesome talk show for your weekend enjoyment. Find us by searching for The Changelog wherever you get your podcasts.

(00:24):
Thanks to our partners at fly.io. Launch your AI apps in five minutes or less. Learn how at fly.io.
Chris (00:44):
Welcome to another episode of the Practical AI podcast. This is Chris Benson. I am your cohost. Normally, Daniel Whitenack is joining me as the other cohost, but he's not able to today. I am a principal AI research engineer at Lockheed Martin. Daniel is the CEO of Prediction Guard.

(01:05):
And with us today, we have Kate Soule, who is director of technical product management for Granite at IBM. Welcome to the show, Kate.
Kate (01:13):
Hey, Chris. Thanks for
having me.
Chris (01:15):
So I know we're gonna dive shortly into what Granite is, and some of our listeners are probably already familiar with it. Some may not be. But before we dive into that, we're talking about AI models, that's what Granite is, and the world of LLMs, generative AI.

(01:36):
I'm wondering if you could start off talking a little bit about your own background, how you arrived at this, and we'll get into a little bit about, you know, what IBM is doing, why it's interested, and how it fits into the landscape here for those who are not already familiar with it.
Kate (01:51):
Perfect. Yeah. Thanks, Chris. So I lead the technical product management for Granite, which is IBM's family of large language models produced by IBM Research. I actually joined IBM and IBM Research a number of years ago, before large language models really became popular. You know, they had a bit of a Netscape moment back in November 2022.

(02:14):
So I've been working at the lab for a little while. I am a little bit of an odd duck, so to speak, in that I don't have a research background. I don't have a PhD. I come from a business background. I worked in consulting for a number of years, went to business school, and joined IBM Research and the AI lab here in order to get more involved in technology.

(02:38):
You know, I've always kind of had one foot in the tech space. I was a data scientist for most of my tenure as a consultant, and always thought that there were a lot of exciting things going on in AI. And so I joined the lab and basically got to work with a lot of generative AI researchers before large language models really became big.

(03:01):
And about two and a half years ago, with a lot of this technology we were working on, all of a sudden we started to find and see that there were tremendous business applications. You know, OpenAI really demonstrated what could happen if you took this type of technology and force-fed it enough compute to make it powerful. It could do some really cool things.

(03:21):
So from there, we worked as a team to spin up a program and offering at IBM for our own family of large language models that we could offer our customers and the broader open source ecosystem.
Chris (03:31):
Do you I'm I'm curious
with one of the things that I've
I've you know, we've noticedover time is different
organizations kind of arepositioning the these large
language models within theirproduct offerings in in very
unique ways. And and we've youknow, we could go through some
of your competitors and say theydo this way. How do you guys see
that in terms of, you know, howlarge language models fit into
(03:55):
your product offering? Is thereis there a vision that IBM has
for that?
Kate (03:59):
Yeah. You know, I think the fundamental premise of large language models is that they are a building block that you get to build on and reuse in many different ways, right, where one model can drive a number of different use cases. So, you know, from IBM's perspective, that value proposition resonates really clearly.

(04:21):
We see a lot of our customers, and our own internal offerings, where there's a lot of effort on data curation and collection and kind of creating and training bespoke models for a specific task. And now with large language models, we get to kind of use one model, and with very little labeled data, all of a sudden the world's your oyster. There's a lot you can do.

(04:44):
And so that's a bit of the reason why we have centralized the development of our language models within IBM Research, not a specific product. It's one offering that then feeds into many of our different products and downstream applications. And it allows us to kind of create this building block that we can then also offer customers to build on top of as well. And for open source ecosystem developers, you know, we think there's a lot of different applications for that one offering.

(05:07):
And so, you know, that's a little bit, from the organizational side, why it's kind of exciting, right, that we get to do this all within Research. We don't have a P&L, so to speak. We're doing this to create ultimately a tool that can support any number of different use cases and downstream applications.
Chris (05:28):
Very cool. And you mentioned open source. So I wanna ask you, because that's always a big topic among organizations: if I remember correctly, Granite is under an Apache 2.0 license. Is that correct?
Kate (05:41):
That's correct.
Chris (05:42):
I'm just curious, because we've seen strong arguments on both sides. Why is Granite under an open source license like that? What was the decision from IBM to go that direction?
Kate (05:55):
Yeah. Well, there were kind of two levels of decision making when we talked about how to license Granite. One was open or closed. So are we gonna release this model, release the weights out into the world, so that anyone can use it regardless of whether they spend a dime with IBM? And ultimately, IBM, you know, believes strongly in the power of open source ecosystems.

(06:17):
A huge part of our business is built around Red Hat and being able to provide open source software to our customers with enterprise guarantees. And we felt that open AI was a far more responsible environment to develop and to incubate this technology as a whole.
Chris (06:34):
And when you say open AI, you mean open source AI?
Kate (06:37):
Open source AI. Just making sure. Very important clarification. So that was why we released our models out into the open. And then the question was, under what license? Because there are a lot of models, and there are a lot of licenses.

(06:58):
And a bit of the moment everyone's seeing is you have a Gemma license for a Gemma model, you've got a Llama license for a Llama model. Everyone's coming up with their own license. You know, in some ways it makes sense. Models are a bit of a weird artifact. They're not code. You can't execute them on their own. They're not software. They're not data per se, but they are kind of like a big bag of numbers at the end of the day.

(07:19):
So, you know, with some of the traditional licenses, I think some people didn't see a clear fit, and so they came up with their own. There are also all these different potential risks that you might wanna solve for with a license for a large language model that are different than the risks you look at with software or data.

(07:40):
But at the end of the day, IBM really wanted to keep this simple, like a no-nonsense license that we felt would promote the broadest use from the ecosystem without any restrictions. So we went with Apache 2.0, because that's probably the most widely used and easiest to understand license that's out there.

(08:00):
And I think it also speaks to where we see models being important building blocks that are further customized. So we really believe the true value in generative AI is being able to take some of these smaller open source models, build on top of them, and even start to customize them. And if you're doing all that work and building on top of something, you wanna make sure there are no restrictions on all that IP you've just created.

(08:22):
And so that's ultimately why we went with Apache 2.0.
Chris (08:24):
Understood. And one last follow-up on licensing, and then I'll move on. It's partially just a comment. IBM has a really strong legacy in the AI world and decades of software development along with that. I know Red Hat, with the acquisition some years back, has been strong on open source, and IBM has been too, both before and after.

(08:46):
I'm just curious, did that make it any easier, do you think, to go with open source? Like, hey, we've done this so much that we're gonna do that with this thing too, even though it's a little bit newer, you know, in context. Culturally, did it seem easier to get there than for some companies that possibly really struggle with that, that don't have such a legacy in open source?
Kate (09:08):
I think it did make it easier. I think any company going down this journey has to take a look at, wait, we're spending how much on what, and you're gonna give it away for free? And come up with their own kind of equations on how this starts to make sense.

(09:29):
And I think we've just experienced as a company that the software and offerings we create are so much stronger when we're creating them as part of an open source ecosystem than something that we just keep close to the vest. So, you know, it was a much easier business case, so to speak, to make and to get the sign-off that we needed. Ultimately, our leadership was very supportive in order to encourage this kind of open ecosystem.
Chris (09:49):
Fantastic. Turning a little bit: as IBM was diving into this realm and starting out, obviously you have a history with Granite, you guys are on 3.2 at this point, so you've been working on this for a period of time. But as you're diving into this very competitive ecosystem of building out these open source models, which are big and expensive to make, and you're looking for an outsized impact in the world, how do you decide how to proceed with what kind of architecture you want?

(10:09):
You know, how did you guys think about it? You're looking at competitors. Some of them are closed source, like OpenAI. Some of them, like Meta AI, have Llama and that series.

(10:32):
As you're looking at what's out there, how do you make a choice about what is right for what you guys are about to go build? Because that's one heck of an investment to make. And I'm kinda curious, when you're looking at that landscape, how you make sense of that in terms of where to invest.
Kate (10:52):
Yeah. Absolutely. So, you know, I think it's all about trying to make educated bets that match the constraints that you're operating with and your broader strategy. So, you know, early on in our generative AI journey, when we were kind of getting the program up and running, we wanted to take fewer risks. We wanted to learn how to do common architectures, common patterns, before we started to get more, quote, unquote, innovative in coming up with net new additions on top.

(11:14):
And also you have to keep in mind, this field has just been changing so quickly over the past couple of years. So no one really knew what they were doing. If you look at how models were trained two years ago and the decisions that were made, the game was all about as many parameters as possible and having as little data as possible to keep your training costs down.

(11:34):
And now we've totally switched. The general wisdom is as much data as possible and as few parameters as possible, to keep your inference costs down once the model's finally deployed.

(11:56):
So the whole field's been going through a learning curve. But I think early on, our goal was really working on trying to replicate some of the architectures that were already out there, but innovate on the data.

(12:18):
So really focusing on how do we create versions of these models that are being released that deliver the same type of functionality, but that were trained by IBM as a trusted partner, working very closely with all of our teams to have a very clear and ethical data curation and sourcing pipeline to train the models.

(12:38):
So that was kind of the first major innovation aim that we had, and it was actually not on the architecture side. Then as we started to get more confident, as the field started to, I don't wanna say mature, because we're still, again, in very early innings, but we started to coalesce to some shared understandings of how these models should be trained and what works or doesn't.

(12:59):
Then our goal really started to focus, from an architecture side, on how can we be as efficient as possible? How can we train models that are going to be economical for our customers to run? And so that's where you've seen us focus a lot on smaller models for right now, and we're working on new architectures.

(13:20):
So for example, mixture of experts. There's all sorts of things that we are really focusing in on, with kind of the mantra of how do we make this as efficient as possible for people to further customize and to run in their own environments.
Chris (13:32):
So that was a fantastic start as we dive into Granite itself and kind of lay it out. In your last comments, you talked about the smaller, more economical models, so that you're getting efficient inference on the customer side. You mentioned a phrase which some people may know, some people may not: mixture of experts.

(13:54):
Maybe as we dive into, you know, what Granite is and its versions going forward here, start with mixture of experts and what you mean by that.
Kate (14:04):
Absolutely. So if we think of how these models are being built, they're essentially billions of parameters, small little numbers that are basically encoding information. And, you know, to draw a really simple explanation: if you have a linear regression, you've got a scatter plot and you're fitting a line, y = mx + b, then m is a parameter in that equation, right? So it's that, except on the scale of billions.

(14:24):
With mixture of experts, what we're looking at is, do I really need all 1 billion parameters every single time I run inference? Can I use a subset? Can I have kind of little expert groupings of parameters within my large language model, so that at inference time I'm being far more selective and smart about which parameters get called?

(14:48):
Because if I'm not using all, you know, 8 billion or 20 billion parameters, I can run that inference far faster. So it's much more efficient.

(15:10):
And so really, it's just getting a little bit more nuanced. Instead of, like a lot of the early days of generative AI, just throwing more compute at it and hoping the problem goes away, we're now trying to figure out how we can be far more efficient in how we build these models.
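To make the mixture-of-experts idea concrete, here is a minimal sketch of top-k expert routing in PyTorch. The layer sizes, number of experts, and top-k value are illustrative only, not Granite's actual configuration; the point is simply that only the selected experts run for each token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative top-k mixture-of-experts layer (not Granite's real architecture)."""

    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x)                            # (tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the chosen experts run for each token, so inference touches a
        # fraction of the layer's total parameters.
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

layer = TinyMoELayer()
tokens = torch.randn(5, 64)
print(layer(tokens).shape)  # torch.Size([5, 64])
```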
Chris (15:19):
So, appreciate the explanation on mixture of experts. And that makes a lot of sense in terms of trying to use the model efficiently for inference by reducing the number of parameters. I believe right now, you guys have, is it 10 billion as the model sizes in terms of the parameters?

(15:39):
Or have I gotten that wrong?
Kate (15:41):
We've actually got a couple of sizes. So you're right, we've got 10 billion. But speaking of those mixture of expert models, we actually have a couple of tiny MoE models. MoE stands for mixture of experts. So we've got an MoE model with only a billion parameters and an MoE model with 3 billion parameters. But they respectively use far fewer parameters at inference time.

(16:03):
So they run really, really quick, designed for more local applications, like running on a CPU.
Chris (16:09):
So when you make the decision to have different size models, in terms of the number of parameters and such, do you have different use cases in mind for how those models might be used? And is there one set of scenarios where you would put your 8 billion and another that would be that 3 billion you mentioned?
Kate (16:30):
Yeah. Absolutely. So when we're designing the model sizes that we wanna train, a huge question that we're trying to solve for is, you know, what are the environments these models are gonna be run on? And how do I maximize performance without forcing someone to have to, like, buy another GPU to host it?

(16:51):
So, you know, there are models, like those small MoE models, that were actually designed much more for running on the edge, locally, or on a computer, like just a local laptop. We've got models that are designed to run on a single GPU, which is like our 10 billion parameter models. Those are standard architecture, not MoE.

(17:11):
And we've got models on our road map that are looking at how we can kind of max out what a single GPU could run, and then how we can max out what a box of GPUs could run, so if you've got eight GPUs stitched together. So we are definitely thinking about those different tranches of compute availability that customers might have, and each of those tranches could relate to different use cases.

(17:31):
Like, obviously, if you're thinking about something that is local, there's all sorts of IoT types of use cases that could target. If you're looking at something that has to be run on, you know, a box of eight GPUs, you're looking at something where you have to be okay with having a little bit more latency, you know, the time it takes for the model to respond, but where the use case also probably needs to be a little bit higher value, because it costs more to run that big model.

(17:51):
And so you're not gonna run a really simple, you know, "help me summarize this email" task hitting eight GPUs at once.
Chris (18:09):
So as you talk about the segmentation of the family of models and how you're doing that, I know you guys have a white paper, which we'll be linking in the show notes for folks to go and take a look at either during or after they listen here. And you talk about some of the models having experimental chain-of-thought reasoning capabilities.

(18:29):
I was wondering if you could talk a little bit about what that means.
Kate (18:34):
Yeah. So we're really excited about the latest release of our Granite models. Just this February, we released Granite 3.2, which is an update to our 2 billion parameter model and our 8 billion parameter model. And one of the kind of superpowers we give this model in the new release is we bring in an experimental feature for reasoning.

(18:56):
And what we mean by that is there's this relatively new concept in generative AI called inference time compute. What that really equates to, just to put it in plain language, is that if you think longer and harder about a prompt, about a question, you can get a better response. I mean, this works for humans. This is how you and I think.

(19:16):
But the same is true for large language models. And "thinking" here, you know, runs a bit of a risk of anthropomorphizing the term, but it's where we've landed as a field, so I'll run with it for now. It really means generate more tokens. So have the model think through what's called a chain of thought, you know, generate logical thought processes and sequences of how the model might approach answering, before triggering the model to then respond.

(19:37):
And so we've trained Granite 3.2 8B to be able to do that chain-of-thought reasoning natively and take advantage of this new inference time compute area of innovation. And what we've done is we've made it selective.

(19:59):
So if you don't need to think long and hard about, you know, what is two plus two, you turn it off, and the model responds faster, just with the answer. If you're giving it a more difficult question, you know, pondering the meaning of life, you might turn thinking on, and it's gonna think through it a little bit first before answering.

(20:21):
In general that's a longer, more chain-of-thought-style approach, where it's explaining kind of step by step why it's responding the way it is.
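As a rough sketch of how that selective toggle can look in practice, the snippet below loads a Granite 3.2 instruct model with Hugging Face Transformers and passes a thinking flag through the chat template. The model ID and the `thinking` keyword are assumptions drawn from this description; check the model card for the exact interface before relying on it.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID and the `thinking` kwarg are assumptions based on the discussion
# above; consult the Granite 3.2 model card for the exact names.
model_id = "ibm-granite/granite-3.2-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "A train leaves at 3pm travelling 60 mph..."}]

# Reasoning on: the template asks the model to emit a chain of thought first.
slow_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, thinking=True, tokenize=False
)
# Reasoning off: the model answers directly, which is faster for easy prompts.
fast_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, thinking=False, tokenize=False
)

inputs = tokenizer(slow_prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=512)[0]))
```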
Chris (20:25):
And I've seen this done by different organizations in different ways. Do you anticipate that your inference time compute capability is going to be there on all the models, and you're turning it on and off? Or do you anticipate that some of the models in your family specialize in that, and it's always on, versus others?

(20:48):
You kinda mentioned the on and off, so it sounded like you might have it in all of the above.
Kate (20:51):
Yeah. You know, right now it's marked as an experimental feature. I think we're still learning a lot about how this is useful and what it's going to be used for, and that might dictate what makes sense moving forward. But what we're seeing is that it's kind of universally useful, one, to try and improve the quality of the answers, but two, as an explainability feature. If the model is going through and explaining more how it came up with a response, that helps a human better understand the response.

(21:13):
So, you know, I think it is something we're heavily considering just baking into the models moving forward, which is a different approach than some models, which are just focused on reasoning. I don't think we're going to see that for very long. You know, I think more and more we're going to see more selective reasoning.

(21:33):
So like Claude 3.7 came out, and they're actually doing a really nice job of this. It can think longer and harder about something or just think for a short amount of time. So I think we're going to see increasingly more folks move in that direction. But, you know, it's still, again, early innings, I'll say it again. So we're going to learn a lot over the next couple of months about where this is having the most impact.

(21:54):
And I think that could have some structural implications for how we design our roadmap moving forward.
Chris (22:01):
Gotcha. There has been a larger push in the industry towards smaller models. So, you know, kinda going back over the recent history of LLMs, you saw initially the number of parameters exploding and the models becoming huge.

(22:23):
And, obviously, we talked a little bit about the fact that that's very expensive on inference...
Kate (22:25):
Yeah.
Chris (22:25):
...to run these things. And over the last, I don't know, year, year and a half especially, there's been a much stronger push, particularly with open source models. We've seen a lot of them on Hugging Face pushing to smaller. Do you anticipate, as you're thinking about this capability of being able to reason, that that's going to drive smaller model use toward models like what you guys are creating?

(22:46):
Where you're saying, okay, Claude has the big models out there as an option, or there's a Llama model that's very large. Are you guys anticipating pulling a lot more mindshare towards some of the smaller ones?

(23:08):
And do you anticipate that you're going to continue to focus on these smaller, more efficient ones, where people can actually get them deployed out there without breaking the bank of their organization? How does that fit in?
Kate (23:18):
Yeah. So look, one thing to keep in mind is, even without thinking about it, without trying, we're seeing that small models are increasingly able to do what it took a big model to do yesterday. So you look at what a tiny 2 billion parameter model, our Granite 2B model, for example, outperforms on numerous benchmarks, you know, Llama 2 70B, which is much larger but an older generation. I mean, it was state of the art when it was released.

(23:41):
But the technology is just moving so quickly. So, you know, we do believe that by focusing on some of the smaller sizes, ultimately we're gonna get a lot of lift just natively, because that is where the technology is evolving.

(24:02):
Like, we're continuing to find ways to pack more and more performance into fewer and fewer parameters and expand the scope of what you can accomplish with a small language model. I don't think that means we're ever going to get rid of big models. I just think, if you look at where we're focusing, we're really looking at, if you think of the eighty-twenty rule, like 80% of the use cases can be handled by a model of, you know, maybe 8 billion parameters or less.

(24:24):
That's what we're targeting with Granite, and we're really trying to focus in. We think that there's definitely still always gonna be innovation and opportunity and complex use cases where you need larger models.

(24:46):
And that's where we're really interested to see, okay, how do we expand the Granite family, potentially focusing on more efficient architectures like mixture of experts to target those larger models and more complex model sizes, so that you still get a little bit more of a practical implementation of a big model.

(25:06):
Recognizing, again, there's always gonna be those outliers, those really big cases. We just don't think there's gonna be as much business value, frankly, behind those, compared to really focusing and delivering value in the small to medium model space.
Chris (25:18):
That's one thing Daniel and I have talked quite a bit about, and we would agree with that. I think the bulk of the use cases are for the smaller ones. While we're at it, you know, we've been talking about various aspects of Granite a bit, but could we take a moment and have you go back through the Granite family and talk about each component in the family?

(25:42):
You know, what it's called, what it does, and just kinda lay out the array of things that you have to offer.
Kate (25:48):
Absolutely. So the Granite model family has the language models that I went over, between 1 billion and 8 billion parameters in size. And again, we think those are like the workhorse models. You know, for 80% of the tasks, we think you can probably get away with a model that's 8 billion parameters or less.

(26:10):
We also, with 3.2, recently released a vision model. So these models are for vision understanding tasks. That's important: it's not image generation, which is where a lot of the early hype and excitement on generative AI came from, like DALL-E and those. We're focused on models where you provide an image in a prompt, and then the output is text, the model's response. So really useful for things like image and document understanding.

(26:34):
We specifically prioritized a very large amount of document and chart Q&A type data in its training dataset, really focusing on performance on those types of tasks. So you can think of having a picture or an extract of a chart from a PDF and being able to answer questions about it. We think there's a lot of opportunity.

(26:55):
So RAG, retrieval augmented generation, is a very popular workflow in enterprise, right? Right now, all of the images in your PDFs and documents basically get thrown away. But we're really working on, can we use our vision model to actually include all of those charts, images, figures, and diagrams to help improve the model's ability to answer questions in a RAG workflow?

(27:16):
So we think that's gonna be huge. So lots of use cases on the vision side. And then we also have a number of companion models that are designed to work in parallel with a language model or a vision language model.

(27:36):
So we've got our Granite Guardian family of models, and these are what we call guardrails. They're meant to sit right in parallel with the large language model that's running the main workflow, and they monitor all the inputs coming in to the model and all the outputs being provided by the model, looking for potential adversarial prompts, jailbreaking attacks, harmful inputs, and harmful and biased outputs.

(27:58):
They can detect hallucinations in model responses. So it's really meant to be a governance layer that can sit and work right alongside Granite. It could actually work alongside any model. So, you know, even if you've got an OpenAI model, for example, that you've deployed, you can have Granite Guardian work right in parallel and, you know, ultimately just be a tool for responsible AI.

(28:19):
And the last model I'll talk about is our embedding models, which again are meant to assist a model in a broader generative AI workflow. So in a RAG workflow, you'll often need to take large amounts of documents or text and convert them into what are called embeddings that you can search over, in order to retrieve the most relevant info and give it to the model.

(28:42):
So our Granite embedding models are used for that embedding step. They're meant to do that conversion, and they can support a number of different similar search-and-retrieval-style workflows, working directly with the Granite large language model.
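For listeners who want to picture that embedding step in a RAG workflow, here is a minimal retrieval sketch using sentence-transformers. The Granite embedding model ID is an assumption based on the naming above; any sentence-embedding model would slot in the same way.

```python
from sentence_transformers import SentenceTransformer, util

# Model ID is an assumption based on the Granite embedding family described
# above; any embedding model works the same way in this sketch.
embedder = SentenceTransformer("ibm-granite/granite-embedding-125m-english")

documents = [
    "Granite Guardian models monitor prompts and responses for harmful content.",
    "Granite vision models answer questions about charts and document images.",
    "Apache 2.0 places no restrictions on derivatives built from Granite.",
]
doc_vectors = embedder.encode(documents, convert_to_tensor=True)

query = "Which model watches for unsafe inputs and outputs?"
query_vector = embedder.encode(query, convert_to_tensor=True)

# Cosine similarity picks the chunk most relevant to the question; that chunk
# is what would get handed to the language model as context.
scores = util.cos_sim(query_vector, doc_vectors)[0]
best = int(scores.argmax())
print(documents[best], float(scores[best]))
```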
Chris (28:54):
Gotcha. I know there was some comment in the white paper also about time series. Yes? Can you talk a little bit to that for a second?
Kate (29:02):
Absolutely. So I mentioned Granite is multimodal in that it supports vision. We also have time series as a modality, and I'm really glad you brought these up, because these models are really exciting. We talked about our focus on efficiency; these models are, like, one to two million parameters in size. That is teeny tiny in today's generative AI context.

(29:26):
Even compared to other forecasting models, these are really small generative-AI-based time series forecasting models, but they are right now delivering top marks when it comes to performance. So as part of this release, we just submitted our time series models to Salesforce's time series leaderboard, called GIFT. They're the number one model on GIFT's leaderboard right now, and we're really excited.

(29:47):
They've got over 10 million downloads on Hugging Face. They're really taking off in the community. So it's a really excellent offering in the time series modality for the Granite family.
Chris (30:03):
Okay. Well, thank you for going through the layout of the family of models that you guys have. I actually wanna go back and ask a quick question. You talked a bit about Guardian providing guardrails and such. And that's something that, if you take a moment to dive into it, I think we often tend to focus on, you know, the model and what it's gonna do.

(30:24):
I love the notion of integrating these guardrails that Guardian represents into a larger architecture, you know, to address the quality issues surrounding the inputs and the outputs. How did you guys arrive at that?

(30:45):
It's pretty cool. I love the idea that not only is it there for your own models, obviously, but that you could have an end user go and apply it to something else that they're doing, maybe from a competitor or whatever. How did you decide to do that? I think that's a fairly unique thing that we don't tend to hear as much about from other organizations.
Kate (31:08):
Yeah. You know, Chris, one of the values, again, of being in the open source ecosystem is we get to build on top of other people's great ideas. So we actually weren't the first ones to come up with it. There are a few other guardrail-type models out there. But, you know, IBM, and especially IBM Research, has quite a large presence in the security space.

(31:29):
And there are challenges in security that are very similar to those of large language models in generative AI, so it's not totally new. And what I think we've learned as a company and as a field is that you always need layers of security when it comes to creating a robust system against potential adversarial attacks, and when dealing with even the model's own innate safety alignment itself.

(31:51):
So, you know, when we saw some of the work going out in the open source ecosystem on guardrails, I think it was kind of a no-brainer from the perspective of: this is another great way to add an additional layer on that generative AI stack of security and safety, to better improve model robustness, and it fits IBM's hyper-focus on what is the practical way to implement generative AI.

(32:13):
So what else is needed beyond efficiency? We need trust. We need safety. Let's create tools in that space. So a number of different reasons all made it very clear and easy to go and pursue. And we were actually able to build on top of Granite.

(32:33):
So Granite Guardian is a fine-tuned version of Granite that's laser focused on these tasks of detecting and monitoring inputs going into the model and outputs coming out. And the team has done a really excellent job, first starting with basic harm and bias detectors, which I think are pretty prevalent in other guardrail models that are out there.

(32:54):
But now we've really started to kind of make it our own and innovate. So some of the new features that were released in the 3.2 Granite Guardian models include hallucination detection. Very few models do that today, specifically hallucination detection with function calling. So if you think of an agent, you know, whenever an LLM agent is trying to access or submit external information, it'll make what's called a tool call.

(33:15):
And when it's making that tool call, it's providing information based off of the conversation history, saying, you know, I need to look up Kate Soule's information in the HR database: this is her first name, she lives in Cambridge, Mass, et cetera. And we wanna make sure the agent isn't hallucinating when it's filling in those pieces of information it needs to use for the retrieval.

(33:37):
Otherwise, you know, if it made up the wrong name or said Cambridge, UK instead of Cambridge, Mass, the tool will provide the incorrect response back, but the agent will have no idea. And it will keep operating with the utmost certainty that it's operating on correct information.

(34:00):
So it's just an interesting example of some of the observability we're trying to inject into responsible AI workflows, particularly around things like agents, because there are all sorts of new safety concerns that really have to be taken into account to make this technology practical and implementable.
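To illustrate the kind of check she describes, here is a hedged sketch of validating a tool call's arguments against the conversation history before it executes. The `ToolCall` type, `check_against_history` helper, and the flagging rule are hypothetical placeholders for this discussion, not Granite Guardian's actual API.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    arguments: dict

def check_against_history(call: ToolCall, history: list[str]) -> list[str]:
    """Hypothetical stand-in for a Guardian-style detector: flag tool-call
    arguments that never appeared anywhere in the conversation history."""
    flagged = []
    for key, value in call.arguments.items():
        if not any(str(value).lower() in turn.lower() for turn in history):
            flagged.append(f"argument '{key}'='{value}' is unsupported by the conversation")
    return flagged

history = [
    "user: Please pull up Kate's HR record.",
    "user: She works out of Cambridge, Massachusetts.",
]
call = ToolCall(name="lookup_hr_record",
                arguments={"first_name": "Kate", "city": "Cambridge, UK"})

issues = check_against_history(call, history)
if issues:
    # A guardrail layer would block or repair the call before it executes,
    # instead of letting the agent act on a hallucinated argument.
    print("Blocked tool call:", issues)
```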
Chris (34:11):
You know, having brought up agents, and that being the really hot topic of the moment in 2025 so far, could you talk a little bit about Granite and agents and how you guys are thinking about them? You've gone through one example right there, but if you could expand on that a little bit.

(34:32):
How is IBM thinking about positioning Granite? How do agents fit in? What does that ecosystem look like? You've started to talk about security a bit. Could you kinda weave that story for us a little bit?
Kate (34:45):
Absolutely. So, yeah, obviously IBM is all in on agents, and there's just so much going on in the space. A couple of key things that I think are interesting to bring up. One is looking at the open source ecosystem for building agents. We actually have a really fantastic team located right here in Cambridge, Massachusetts that is working on an agent framework and broader agent stack called BeeAI, like a bumblebee.

(35:08):
And so we're working really closely with them on how we co-optimize a framework for agents with a model, in order to be able to have all sorts of new tips and tricks, so to speak, that you can harness when building agents.

(35:28):
I don't wanna give too much away, but I think there are a lot of really interesting things that IBM's thinking about around agent framework and model codesign. And that unlocks, you know, so much potential when it comes to safety and security, because there are parts, for example, of an agent that the agent developer programs that you never want the user to be able to see.

(35:49):
There are parts of data that an agent might retrieve as part of a tool call that you don't want the user to see. So for example, an agent that I'm working with might have access to anybody's HR records, but I only have permission to see my HR records.

(36:12):
So how can we design models and frameworks with those concepts in mind, in order to better demarcate types of sensitive information that should be hidden, and in order to protect information that the model knows, like, these types of instructions can never be overwritten, no matter what kind of later-on adversarial attacks somebody might try, saying, you're not Kate's agent, you're a nasty bot and your job is to do x, y, and z.

(36:34):
Like, how do we prevent those types of attack vectors through model codesign and agent framework codesign? So I think there's a lot of really exciting work there. More broadly, though, I think even on more traditional ideas and implementations of agents, not that there's a traditional one, this is so new, but more classical agent implementations.

(36:55):
We're working, for example, with IBM Consulting. They have an agent and assistant platform where Granite is the default for the agents and assistants that get built. And so that allows IBM all sorts of economies of scale. If you think about it, we've now got 60,000 consultants out in the world using agents and assistants built off of Granite in order to be more efficient and to help them with their client and consulting projects.

(37:16):
So we see a ton of what we call client zero. IBM is our, you know, first client in that case, of how do we even internally build agents with Granite in order to improve IBM productivity.
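As a rough sketch of the permission point she raises, the snippet below scopes a tool so the agent can only retrieve records the requesting user is allowed to see. The record store, function name, and enforcement rule are invented purely to illustrate the design, not any IBM framework's API.

```python
HR_RECORDS = {  # hypothetical record store, keyed by employee id
    "kate": {"salary_band": "B7", "location": "Cambridge, MA"},
    "chris": {"salary_band": "B6", "location": "Atlanta, GA"},
}

def lookup_hr_record(employee_id: str, *, requesting_user: str) -> dict:
    """Tool exposed to the agent. The framework injects `requesting_user`
    from the authenticated session, so the model cannot override it."""
    if employee_id != requesting_user:
        # The agent never sees other people's records, no matter what the
        # prompt (or a prompt-injection attempt) asks for.
        raise PermissionError("You may only access your own HR record.")
    return HR_RECORDS[employee_id]

# Agent acting on Kate's behalf:
print(lookup_hr_record("kate", requesting_user="kate"))      # allowed
try:
    lookup_hr_record("chris", requesting_user="kate")         # blocked
except PermissionError as err:
    print("Tool refused:", err)
```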
Chris (37:34):
Very cool. I'm kinda curious, as you guys are looking at this array of considerations that you've just been going through, and as there is more and more push out into edge environments, and you've already talked a little bit about that earlier, as we're starting to wind down, could you talk a little bit about that?

(37:54):
As things push a bit out of the cloud and out of the data center, and as we have been migrating away from these gigantic models into a lot more smaller, hyper-efficient models, often ones that are doing better on performance, we see so many opportunities out there.

(38:15):
Could you talk a little bit about where Granite might be going with that, or where it is now, and what the thoughts about Granite at the edge might look like?
Kate (38:28):
Yeah. So I think with Granite at the edge, there are a couple of different aspects. One is, how can we think about building with models so that we can optimize for smaller model sizes? When I say building, I mean building prompts, building applications, so that we're not designing prompts the way they're written today, which I like to call the YOLO method: I'm gonna give ten pages of instructions all at once, say go and do this, and hope to God the model follows all those instructions and does everything beautifully.

(38:52):
Small models, no matter how much this technology advances, probably aren't going to get perfect scores on that type of approach. So how can we think about broader kinds of programming frameworks for dividing things up into much smaller pieces that a small model can operate on? And then how do we leverage model and hardware codesign to run those small pieces really fast?

(39:15):
So, you know, I think there's a lot of opportunity across the stack: how people are building with models, the models themselves, and the hardware the model is running on. That's going to allow us to push things much further to the edge than we've really experienced so far.

(39:37):
It's gonna require a bit of a mind shift again. Like, right now, I think we're all really happy that we can be a bit lazy when we write our prompts and just, you know, write kind of word-vomit prompts down. But...
Chris (39:55):
Right.
Kate (39:55):
I think if we can get a little bit more of a software engineering mindset in terms of how you program and build, it's gonna allow us to break things into much smaller components and push those components even farther to the edge.
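A minimal sketch of that decomposition idea: instead of one sprawling instruction block, each step becomes a small, focused call that a locally hosted model can handle. The `small_model` callable and the step prompts are hypothetical, standing in for whatever local model and framework you use.

```python
from typing import Callable

def summarize_ticket(small_model: Callable[[str], str], ticket: str) -> str:
    """Pipeline of narrow prompts instead of one giant instruction block.
    Each step is simple enough for a small, locally hosted model."""
    steps = [
        "Extract the customer's core complaint in one sentence:\n{text}",
        "Classify the complaint as billing, technical, or other:\n{text}",
        "Draft a two-sentence reply addressing this complaint:\n{text}",
    ]
    text = ticket
    for template in steps:
        text = small_model(template.format(text=text))  # each call is small and cheap
    return text

def fake_model(prompt: str) -> str:  # stand-in; wire this to a real local endpoint
    return f"[model output for: {prompt[:40]}...]"

print(summarize_ticket(fake_model, "My invoice doubled last month and nobody replies to email."))
```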
Chris (40:08):
That makes sense. That makes a lot of sense. I guess, kind of a final question for you as we talk about this. You talked a little bit about where you think things are going right there. Anything you'd add to that, in terms of the industry or specific to Granite, where you think things are going, what the future looks like?

(40:29):
When you're kinda winding up for the day, and you're at that moment where your mind wanders a little bit, is there anything that appeals to you that goes through your head?
Kate (40:39):
So I think the thing I've been most obsessed about lately is, you know, we need to get to the point as a field where models are measured by how efficient their efficiency frontier is, not by whether they got 0.01 higher on a metric or benchmark. I think we're starting to see this: with the reasoning in Granite, you can turn it on and off; with the reasoning in Claude, you can pay more, you know, have longer thoughts or shorter thoughts.

(41:01):
But I really wanna see us get to the point, and I think the table is set for this, we've got the pieces in place, to really start to focus in on how I can make my model as efficient as possible but as flexible as possible, so I can choose anywhere I wanna be on that performance-cost curve.

(41:23):
So if my task isn't very difficult and I don't wanna spend a lot of money on it, I'm gonna route it, with very little thinking, to a small model, and I'm gonna be able to achieve acceptable performance. And if my task is really high value, you know, I'm gonna pay more.

(41:45):
And I don't need to think about this. It's just going to happen, either from the model architecture, from being able to reason or not reason, or from routing that might be happening behind an API endpoint to send my request to a more powerful model or to a less powerful but cheaper model. I think we need to get to the point where no one's having to think about this or solve for it and design it.

(42:06):
And I really wanna see these curves, and I wanna see us push those curves as far to the left as possible, making things more and more efficient, versus: here's a number on the leaderboard, I spent another x gazillion dollars on compute in order to move that number up by 0.02. And, you know, that's science, but I'm ready to move beyond that.
Chris (42:29):
Fantastic. Great conversation. Thank you so much, Kate Soule, for joining us on the Practical AI podcast today. Really appreciate it. A lot of insight there. So thanks for coming on. Hope we can get you back on sometime.
Kate (42:41):
Thanks so much, Chris.
Really appreciate you having me
on the show.
Jerod (42:51):
All right. That is our show for this week. If you haven't checked out our Changelog newsletter, head to changelog.com/news. There you'll find 29 reasons, yes, 29 reasons why you should subscribe. I'll tell you reason number 17: you might actually start looking forward to Mondays.
Kate (43:11):
Sounds like somebody's got
a case of the Mondays.
Jerod (43:15):
28 more reasons are waiting for you at changelog.com/news. Thanks again to our partners at fly.io, to Breakmaster Cylinder for the beats, and to you for listening. That is all for now, but we'll talk to you again next time.