
July 29, 2024 48 mins

In this episode, we’re joined by Robert Nishihara, Co-founder and CEO at Anyscale.

Enterprises are harnessing GenAI across many facets of their operations to enhance productivity, drive innovation, and gain a competitive edge. However, scaling production GenAI deployments can be challenging, because it requires AI infrastructure, approaches, and processes that can evolve to support advanced GenAI use cases.

Nishihara will discuss reliability challenges, building the right AI infrastructure, and implementing the latest practices in productionizing GenAI at scale.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:06):
In today's AI Explained, we are going to be talking about productionizing generative AI at scale. This week has been a very exciting week: as you have probably seen, Meta launched Llama 3.1, an open source model that seems to come very close to the closed source models in terms of accuracy.

(00:29):
It's an exciting time that we are living in. For those who don't know, I'm the founder and CEO of Fiddler AI, and I'll be your host today. We have a very special guest today: Robert Nishihara, the CEO of Anyscale. So, welcome, Robert. Just a brief intro about Robert.

(00:51):
Robert is one of the creators of Ray, a distributed framework for scaling Python applications and machine learning applications. Ray is used by companies across the board, from Uber to OpenAI to Shopify to Amazon, to scale their ML training, inference, data ingest, and reinforcement learning workloads. Robert is one of the co-founders and CEO of Anyscale, which

(01:12):
is the company behind Ray. Before that, he did his PhD in machine learning and distributed systems in computer science at UC Berkeley, and before that, he majored in math at Harvard. So excited to welcome you, Robert. Thank you for sharing this opportunity with us.
I'm excited as well.

(01:33):
Thanks for having me on.
Awesome. Robert, first up, I would love to hear from you: how did you get into Ray? Give us a brief intro into Ray and the Anyscale founding story.
Yeah. So today I'm spending all of my time working on distributed systems and systems

(01:54):
for AI, but I started grad school with no background in systems. Actually, I was spending all of my time more on the theoretical side of AI, trying to design algorithms for deep learning training or reinforcement learning, and proving theorems about how quickly these algorithms could

(02:15):
learn from data and converge. So I'm really coming from that side of things. But what we found, me and my co-workers in grad school, was that even though we were trying to spend our time designing algorithms, we were actually spending all of our time managing clusters and

(02:37):
building systems, and writing code to scale GPUs, and things like that. There wasn't an off-the-shelf tool that we felt we could just use to solve these problems. And it wasn't just us doing this, right? A lot of other researchers in AI were building their own tools

(02:59):
and building their own systems for managing compute. And we felt that the tooling here was really a bottleneck for AI research. So we thought there was an opportunity to try to make that easier, and we started Ray, which is an open source project. Our goal was just to build useful open source tools to make

(03:21):
it easier for people doing machine learning to take advantage of multiple machines, to scale in the cloud, and to solve the distributed systems challenges for you: dividing up the work across different machines, scheduling different tasks on different machines, recovering if one machine crashes, moving data efficiently.

(03:44):
There are many different software engineering challenges around scaling compute, and we just got started building Ray to really try to solve these challenges. The underlying thing here is that AI was becoming more and more computationally intensive.

(04:07):
That was kind of the core premise of all of this: that people needed to scale things in the first place. If you could do all of this on just your laptop, there wouldn't have been any problem to solve.
Yeah, I remember those days when I was an ML engineer at Bing Search in the 2000s; we would train simple models on our desktops, and the world has changed a lot now.

(04:29):
And you guys are working with some amazing tech-forward companies like OpenAI, Uber, and Pinterest. We were talking about it before the call: what are some of the challenges that Ray helps solve for these types of companies now, when they're building these large-scale AI models?
You know, the challenges have changed over time.

(04:50):
I think a lot of companies have gone through, or even now are starting to go through, a transition of adopting deep learning. You mentioned Uber. They were a pioneer in machine learning and machine learning infrastructure and had been doing machine learning for many, many years.

(05:11):
And of course, the starting point for them was not deep learning. It was smaller, simpler models like XGBoost and so forth. When they went through this transition of really enabling deep learning in all of their products, the underlying infrastructure challenge and systems challenge got a lot harder because

(05:33):
it's far more compute intensive. All of a sudden, you need to manage GPU compute instead of just CPU compute. You need to manage a mixture of them. And you end up with potentially a different tech stack for deep learning and a different tech stack for classical machine learning. And so

(05:55):
providing all of these capabilities internally to the rest of the team can be quite challenging. I mean, we see all sorts of challenges. One is just enabling distributed computing, right? Enabling distributed training, things like that. There are other challenges around the handoff

(06:18):
from training and developing the model to deploying it, right? We hear people say it takes them six weeks or 12 weeks to get from developing a model to getting it into production. That could involve handoff to a different team, rewriting it on a different tech stack. And a lot of the challenges that people face are also around

(06:43):
just how quickly their machine learning people or data scientists can iterate and move, right? Are they spending most of their time focused on the machine learning problems, or are they spending most of their time managing clusters and infrastructure?
Yeah, so let's go a couple of levels deeper. I think distributed training is a very interesting problem.

(07:06):
Maybe you can walk us through: what would it take? What does the architecture look like? Are you sort of sharding the data across different replicas? How do you maintain the state? Give us a taste of what's under the hood. What makes Ray such a robust platform for distributed training?

(07:30):
That's a great question. Just before I dive into training: there are three main workloads that we see people using Ray for. It's quite flexible, but training is one of them, serving and inference is another, and the last is data processing, like data preparation or ingest and pre-processing.

(07:52):
And companies typically have all of these workloads. And they're all challenging. So, we see a number of challenges around training. The first is just going from a single machine to multiple machines. But a perhaps more subtle challenge is, as you're scaling training on

(08:15):
more and more GPUs, and you mentioned Pinterest, this is a challenge that was one of the reasons Pinterest adopted Ray for training their models. As you're scaling training on more and more GPUs, it's very easy to become bottlenecked by the data ingest and preprocessing step. Training is expensive because the GPUs are expensive, so you want to keep them fully utilized.

(08:38):
And that means pre-processing the data, loading the data, and feeding it in fast enough to keep those GPUs busy. That may mean you need to scale the data ingest and pre-processing even more, right? And that can often be done on cheaper CPU machines. So you want to scale up the data ingest and pre-processing on a ton of CPU machines, a separate pool of CPU compute, and then

(09:01):
pipeline that into training on GPUs. It's conceptually simple, but hard to do in practice. And you may want to scale both of these things elastically. You may need to recover from failures if you're using spot instances or things like that. There are other challenges you run into at even larger scales

(09:25):
around actual GPU hardware failures and recovering quickly from that. In the regime where you start to have really large models, model checkpointing and recovering from checkpoints can be quite expensive, so you need to build out really efficient handling of checkpoints. So there are many challenges.

(09:47):
Those are the challenges that Ray solves. And I'll just call out the data ingest and pre-processing pipelining: that is an area where we're seeing more and more AI workloads requiring mixed CPU and GPU compute, and really being

(10:08):
both GPU intensive and data intensive, and that's a regime where Ray does really well.
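To make that ingest-to-GPU pipelining concrete, here is a minimal sketch using Ray Data and Ray Train (recent Ray 2.x APIs). The S3 path, column name, and model are hypothetical placeholders, not anything from the conversation.

```python
# Sketch: CPU-pool preprocessing streamed into GPU training with Ray Data + Ray Train.
# Paths, columns, and the model are hypothetical stand-ins.
import ray
import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def preprocess(batch):
    # Runs on cheap CPU workers; `batch` is a dict of numpy arrays.
    batch["x"] = batch["x"] / 255.0
    return batch

# Lazy, streaming read + transform; nothing is fully materialized up front.
ds = ray.data.read_parquet("s3://my-bucket/train/").map_batches(preprocess)

def train_loop_per_worker(config):
    model = torch.nn.Linear(128, 1)                  # toy model
    model = ray.train.torch.prepare_model(model)     # wraps in DDP, moves to GPU
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    shard = ray.train.get_dataset_shard("train")     # this worker's slice of the stream
    for _ in range(config["epochs"]):
        for batch in shard.iter_torch_batches(batch_size=1024, dtypes=torch.float32):
            loss = model(batch["x"]).mean()          # placeholder loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 2},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    datasets={"train": ds},  # Ray pipelines the CPU preprocessing into the GPU workers
)
trainer.fit()
```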
So these days, of course, larger technology companies like Uber and Pinterest will probably be building and training models from scratch, but a lot of the enterprises that we work with at Fiddler are looking forward to, and are already working with, these

(10:32):
pre-trained models. There was Llama 3.1 launched this week, and there are quite a few other options. There are closed source models like GPT and Claude.
So where does Ray fit into this equation? If I'm an enterprise customer looking to build a conversational

(10:53):
experience using one of the pre-trained models, how can I leverage Ray to fine-tune or build this LLM application?
That's a good question. So not every company is going to be training really large models, right? Although I do think the vast majority of companies will train some models.

(11:16):
But what every company will do for sure is they'll have a lot of data, and they'll use AI to process that data and extract insights or draw conclusions from it. That may sound like just a data processing task, but there's going to be a lot of AI and a lot of

(11:39):
inference being used in doing that, because you're going to want to draw intelligent conclusions from your data. And that is again a workload that won't just be inference; there will be regular processing and application logic combined with AI. So this is something that also falls into the regime of, I think,

(12:05):
large-scale processing with mixed CPU and GPU compute, combining machine learning models with other application logic. And of course, when you think of large-scale data processing, you might think of systems like Spark, which are mature, battle-tested, and fantastic for data processing, but are also really built for a CPU-

(12:28):
centric world, right, and not really built with deep learning in mind. And so when it comes to scaling data processing in a way that uses deep learning, in a way that uses GPUs, and is often working with unstructured data like images and text and video, then you end up with a hard data processing problem where

(12:52):
you're scaling on CPU compute and GPU compute, you're running a bunch of models, and these are the kinds of challenges we're seeing today that we can help with.
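A hedged sketch of that offline, throughput-oriented side: running a model over a large unstructured dataset with Ray Data, mixing CPU-side reads with a GPU actor pool. The bucket paths and the Hugging Face pipeline are stand-ins, and the `concurrency` argument assumes a recent Ray release.

```python
# Sketch: offline batch inference with Ray Data (hypothetical paths/model).
import ray
from transformers import pipeline

class Classifier:
    def __init__(self):
        # Loaded once per actor and reused across batches; device=0 assumes a GPU.
        self.pipe = pipeline("sentiment-analysis", device=0)

    def __call__(self, batch):
        batch["label"] = [r["label"] for r in self.pipe(list(batch["text"]))]
        return batch

ds = ray.data.read_text("s3://my-bucket/reviews/")            # CPU-side ingest
ds = ds.map_batches(Classifier, batch_size=64,
                    num_gpus=1, concurrency=4)                # pool of GPU actors
ds.write_parquet("s3://my-bucket/labeled/")
```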
So it seems to me that Ray could definitely be a platform of choice if you're thinking about scaling

(13:14):
your model training and inference, especially optimizing your GPUs to the fullest extent possible; it seems like that is the unique advantage. Now, when it comes to productionization, you mentioned something about LLM inferencing and ML inferencing, and

(13:34):
how does that work with Ray? What are some of the challenges that you've seen in productionizing GenAI and also traditional machine learning workloads?
Yeah, so with inference, we tend to divide it into online inference, where you're powering some real-time application, right?

(13:57):
And offline inference, where you're perhaps processing a larger dataset, and it's less latency sensitive and more throughput or cost sensitive. So, the challenges we see around online inference, well, the challenges actually change over time as

(14:23):
you're building AI applications, right? We were talking about this the other day, but as businesses start adopting AI, they're often in this exploratory phase where they're trying to figure out how to use AI in their business, and what product even makes sense. And so at this point, the question is about quality.

(14:44):
Are the models good enough? Is the quality high enough? How do I fine-tune to improve the quality? How do I reduce hallucinations, use RAG, all these kinds of things. And they care about iteration speed just for experimentation, and there's also a lot of data pre-processing that has to be done at this stage, to see if you can get your data into the model in various ways.

(15:08):
Then, once they figure it out, once they validate, hey, this is the right product, it makes sense, people like it, and you start moving these applications to production, the challenges change, right? At this point, you might start to care about cost. You might start to care about latency: is it responsive enough for people to really engage with it?

(15:32):
Reliability: how do I upgrade the models? So the nature of the challenges changes. It always starts with quality, because that's just what determines whether it's possible or not. But once you meet the quality bar, the criteria very quickly change to latency and cost and other factors.

(15:52):
Got it. That's actually a very good point you mentioned, reliability. One of the things when you're productionizing model inference is, how do you upgrade the models? How do you swap an existing model with a challenger model that you have trained? How do you deal with that?

(16:12):
How does Ray help developers do these things, where you upgrade those models, you A/B test model versions, or move from champion to challenger, in a manner where you're not affecting latency or throughput and all those things?

(16:33):
Yeah, great question. So the challenges are different for large models and for small models. For example, we see companies that want to deploy thousands of models, that might fine-tune or train one model per customer of theirs.

(16:53):
And if they have thousands of customers, they end up with thousands of models. Of course, they might be low-utilization models, so there's the operational challenge of managing and deploying many, many models. There's an efficiency question of how to perhaps

(17:13):
serve all of these models from a shared or smaller pool of resources. With larger models, the challenges end up being around GPU availability and things like this. Just to give one example: as you deploy a model and then upgrade it, a natural way to do that is to,

(17:39):
you know, keep the old model that's serving in production, deploy the new one, and then slowly shift traffic over. Done naively, that will require double the amount of GPUs, right? So can you do this without requiring that many extra GPUs?

(18:00):
Can you do it kind of in place, or with maybe one extra GPU? So there are all sorts of challenges around making that work well. There are also things like bursty traffic, right? If you have to reserve a fixed pool of GPUs, but then you have bursty traffic, there are going to be a lot of times when you're sort

(18:22):
of provisioning for peak capacity, but you have unused capacity. Can you run other workloads, can you multiplex other less critical workloads into that unused compute at those times, for cost efficiency reasons? These are some of the kinds of challenges we can help work on.
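Two of the ideas above, autoscaling for bursty traffic and packing many low-utilization models onto a shared pool, can be sketched with Ray Serve's autoscaling and model-multiplexing APIs (available in recent Ray Serve versions). The model class here is a trivial stub, not a real fine-tune loader, and the GPU setting is an assumption.

```python
# Sketch only: autoscaling replicas + multiplexing many per-customer models.
from ray import serve
from starlette.requests import Request

class TinyModel:
    # Stand-in for a per-customer fine-tuned model.
    def __init__(self, model_id):
        self.model_id = model_id
    def generate(self, prompt):
        return f"[{self.model_id}] echo: {prompt}"

@serve.deployment(
    ray_actor_options={"num_gpus": 1},                           # assumes GPU nodes
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},   # scales with load
)
class MultiModelServer:
    @serve.multiplexed(max_num_models_per_replica=3)
    async def get_model(self, model_id: str):
        # Loaded lazily and cached; least-recently-used models get evicted.
        return TinyModel(model_id)

    async def __call__(self, request: Request):
        # Clients pick a model via the `serve_multiplexed_model_id` request header.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return model.generate(await request.json())

serve.run(MultiModelServer.bind())
```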
Yeah, makes sense.

(18:43):
So I think we have a couple of audience questions. People are asking, what's the relationship between Anyscale and Fiddler AI? As you may have heard, Anyscale is this large-scale model training, deployment, and inferencing product, and Fiddler is focusing on observability. So you can think about those two things as complementary in your AI workflow.

(19:05):
So hopefully that clarifies some of those things when you're thinking about your AI workflow. And actually, there's a relevant question around that. There are multiple parts of this AI workflow we just talked about: data processing, feature engineering, model training, model selection, deployment, inferencing, monitoring.

(19:29):
So when you think about this, especially in the GenAI setting, moving from traditional AI to GenAI, how should enterprises streamline this process? Some of the things may have gone out of the window. What does feature engineering mean in the case of GenAI or LLM applications? What does model training mean?

(19:51):
Is it fine-tuning? What does model selection mean? And then of course, we are facing some of those challenges in observability of LLMs. How has it affected you, and what are some of the things that you think enterprises are trying to solve in these GenAI operations, end to end?
There are so many different challenges that you mentioned, and,

(20:14):
like I was saying, the challenges really do change over the life cycle of deploying generative AI, or really deploying different AI applications in production. We see a couple of distinct phases: one around

(20:34):
experimentation and just fast iteration, and then another around scaling and managing many different models.
The, um, the nature of the observabilitychallenge is very different from, you

(20:54):
know, previously in machine learning orin, in, uh, traditional, uh, Applications.
Um, I mean the challenges aroundevaluation are far more complex.
And one thing we actually often recommendwhen people are getting started with
building generative AI applications isto front load the model evaluation part,
like to over invest in, um, internal, inbuilding like internal evals early on.

(21:21):
Because that's going to determine the speed of everything else you do, right? A new model is going to be released and you're going to swap that in, or you're going to fine-tune a model, and then you're going to ask yourself, is the new one better or worse? And if you have automated evals that you feel confident in, then you're going to be

(21:42):
able to answer that question very quickly. And new models are going to be released all the time, so that's a question you're going to have to ask a lot. Also, evals are one of the things that you can't easily outsource, because they require domain expertise, right? It's very custom to what you have.

(22:03):
And the same is true with the way you craft your specific data into a form that can be used by the AI application, right? There are things that are, I think, easier to factor out or easier to

(22:25):
outsource, like a lot of the AI scaling and infrastructure components, or just the performance work of making it as fast as possible on the GPU. These are things that you could solve in-house, but they are more consistent from company to company. But a lot of the other things, like

(22:45):
evaluation and data processing, leverage a lot of domain expertise.
Got it. And so, as your customers think about using Ray for both traditional ML training, like deep learning model

(23:08):
training, and now expanding to LLM operations, like LLM fine-tuning, how are you supporting that today? For example, would they be able to coexist on the same platform that you're providing? And now customers

(23:29):
are also kind of confused, right? There are a lot of options today: this one for machine learning, that one for LLMs. How are you solving those challenges for your customers? Because clearly these two worlds have emerged in the last couple of years.
Well, I think the people that we work with

(23:51):
don't tend to want a different tech stack for LLMs and another tech stack for computer vision models and another one for XGBoost models. If you can have a common platform or framework that supports all of these things, that's advantageous. Especially because AI is evolving very rapidly, it's moving very rapidly, right?

(24:16):
There are going to be new types of models released, new optimizations, new frameworks. A lot of companies have gone through this big migration to enable deep learning, coming from classical machine learning, only to find at the end of it that they need to change things again in order to enable LLMs, right?

(24:39):
And that's not the end of it, right? All of these models are going to be multi-modal, and agentic workflows are coming out. They're going to become more complex, right? So from the perspective of the ML platform team, the people who are providing AI capabilities to the rest of the company, you really want to optimize for flexibility and being able to be

(25:01):
relatively agnostic to different types of models, even different types of hardware accelerators, different types of frameworks, right? I wouldn't only support XGBoost or only support PyTorch, right? Or only support one inference engine.

(25:22):
The more that you can position yourself to immediately take the latest thing off the shelf and work with it, the faster you're going to be able to move, because those new things are going to come.
Right. So let's say, in the GenAI case, I were to fine-tune, say, Llama for my business use case.

(25:45):
Could I basically have a Ray inferencing module that would load up my fine-tuned Llama and help me scale my inferencing? If I'm an engineer doing that, what would that process be like?
Yeah, so there are really two layers to Ray. There's the core system, which is just scalable Python, basically.

(26:08):
It's the ability to take Python functions and Python classes and execute them in the distributed setting, right? And that's where the flexibility of Ray comes from, because with Python functions and classes, you can build anything, but it's too low level, right? If you want to do training, or data processing, or

(26:28):
fine-tuning, then using that API, you'd have to build all of the training logic and data processing logic on top of just functions and classes, which is too low level. This is why Python has a rich library ecosystem: Pandas and NumPy and all these different tools

(26:49):
that you can just use off the shelf to build powerful Python applications. In the distributed setting, Ray tries to do something analogous. There's the core API, which is like Python, or scalable Python, and then there's an ecosystem of scalable libraries. Certainly not as many as Python has, but these

(27:10):
include libraries for training and fine-tuning, libraries for data ingest and preprocessing, libraries for serving, right? And they form an ecosystem, which is really powerful, because in order to do training or fine-tuning, you also typically want to process some data and load some data.

(27:31):
So the fact that you can do these together in a single Python application is pretty powerful. Instead of: I run my big Spark job to prepare the data, then I pull up a new system for training, and I have to glue these things together and manage a bunch of different frameworks, right? The way people were doing things before is often more analogous to taking each

(27:55):
Python library and making it a standalone, separate distributed system, when what you really want is a common framework with different libraries on top that can all be used together.
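A minimal sketch of that core layer: plain Python functions become distributed tasks and plain classes become stateful actors, which is what the higher-level libraries are built on.

```python
# Sketch of Ray's core API: remote functions (tasks) and remote classes (actors).
import ray

ray.init()

@ray.remote
def preprocess(chunk):
    # Runs as a task on any machine in the cluster.
    return [x * 2 for x in chunk]

@ray.remote
class Counter:
    # Runs as an actor: a stateful class pinned to one worker process.
    def __init__(self):
        self.total = 0
    def add(self, values):
        self.total += sum(values)
        return self.total

counter = Counter.remote()
futures = [preprocess.remote(list(range(i, i + 5))) for i in range(0, 20, 5)]
totals = [counter.add.remote(f) for f in futures]  # task outputs feed the actor
print(ray.get(totals)[-1])                          # final running total
```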
Yeah. So now, when I'm actually doing that productionization, how are you seeing teams maintain consistent high performance and accuracy across these multiple

(28:18):
production applications? What have you found? Are they evaluating the same metrics pre-deployment and post-deployment of GenAI applications? What are some of the other considerations around scalability and throughput that they are looking at?
Yeah, so you're talking about, like, quality metrics?

(28:41):
Quality, yeah; quality, cost, performance. Give us a catalog of things that your customers are looking at, pre- and post-deployment of GenAI.
Certainly, of the ones you mentioned, quality is probably the hardest to measure, but also probably the one

(29:01):
that's foremost and most important. And then latency and cost are really big ones. It might sound straightforward to measure latency, but there are a lot of subtleties, especially for LLMs, where you don't just produce one output, you produce a sequence of outputs, and depending on the application you're

(29:25):
working with, you may care about the time to generate the first token, or you may care about how many tokens per second you're generating. And both of those numbers can vary quite a bit depending on the amount of load on the system and the degree to which you do batching.

(29:45):
And so we often expose knobs for making that trade-off: more model replicas or lower batch sizes to decrease the latency for generating new tokens, but at the cost of throughput.

(30:06):
So there are some trade-offs like that. And of course, there are a lot of optimizations you can do to improve both of these things.
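For illustration, here is one way to measure those two quantities, time to first token and decode throughput, around a streaming call; `stream_tokens` is a hypothetical stand-in for whatever streaming client you actually use, not a specific API.

```python
# Illustrative only: timing a hypothetical streaming token generator.
import time

def stream_tokens(prompt):
    # Stand-in for a streaming LLM call; yields tokens with some delay.
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)
        yield tok

start = time.perf_counter()
first_token_at = None
n_tokens = 0
for tok in stream_tokens("hi"):
    if first_token_at is None:
        first_token_at = time.perf_counter()   # time to first token
    n_tokens += 1
end = time.perf_counter()

print(f"time to first token: {first_token_at - start:.3f}s")
print(f"decode throughput:   {(n_tokens - 1) / (end - first_token_at):.1f} tokens/s")
```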
And what have you seen customers do in such a scenario? Are they writing custom evaluators for these things? What have you seen customers do when they're trying to

(30:28):
evaluate the quality of their models?
Yeah, on the quality side, of just how well the thing works, the starting point is always to look at the data by hand, score quality, and come up with good reference examples that you can run benchmarks on.

(30:54):
It can be done with a small amount of data, but you're often hand-labeling a number of examples to begin with, and we also see people using AI quite a bit to come up with these reference examples. For example, one of the applications we built in-house

(31:16):
was just question answering for our documentation, right? If people want to use Ray, here's a chatbot you can use to ask Ray questions and get answers. So we wanted to come up with evals for this, so that if we swap in a new model, Llama 3.1 or whatever, then we can quickly evaluate how good that was,

(31:40):
whether the change was beneficial or not. And in order to generate the evals, what we did was generate a synthetic dataset of questions and answers. We did that by taking our documentation, randomly selecting

(32:00):
one page from the documentation, feeding it into GPT-4, and asking GPT-4 to generate some example questions and answers, or to take a passage and generate a question that can be answered with that passage, and then pairing those together to form your dataset. Then, when you run the

(32:26):
evals, you have the system take in a question, generate the answer, and then compare that to the reference answer. And you have another LLM do that comparison. So there are a lot of steps and a lot of AI being used, but that's a common pattern.
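A hedged sketch of that pattern: synthesize question/answer pairs from random documentation pages, then use an LLM as the judge. `llm` and `rag_answer` are injected callables standing in for your generation model and your RAG application; nothing here is a specific Anyscale tool.

```python
# Sketch: synthetic QA generation + LLM-as-judge evaluation.
import random

def build_eval_set(doc_pages, llm, n=50):
    """Generate (question, reference answer) pairs from random doc pages."""
    pairs = []
    for _ in range(n):
        page = random.choice(doc_pages)
        q = llm(f"Write one question answerable only from this passage:\n{page}")
        a = llm(f"Answer the question using only this passage.\n"
                f"Passage:\n{page}\nQuestion: {q}")
        pairs.append({"question": q, "reference": a})
    return pairs

def run_evals(eval_set, llm, rag_answer):
    """Score the system under test against the references with an LLM judge."""
    scores = []
    for ex in eval_set:
        candidate = rag_answer(ex["question"])        # system under test
        verdict = llm(
            "Does the candidate answer match the reference? Reply YES or NO.\n"
            f"Reference: {ex['reference']}\nCandidate: {candidate}"
        )
        scores.append(verdict.strip().upper().startswith("YES"))
    return sum(scores) / len(scores)                  # fraction judged correct
```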
Yeah. So this is kind of interesting, right? Early on, in the ML world, evaluation used to be evaluating

(32:50):
a closed-form function, like root mean squared error or precision and recall. Whereas now, evaluation means you're running a model against another model, as you just described. Are you seeing this growth of evaluation models in customers? Are you seeing that take off?
Right, it's so funny. In the past, model evaluation? What do you

(33:14):
mean, model evaluation? You just compute the accuracy: how many does it get right and how many does it get wrong. But it's very different when the output is a sentence or an image.
That's right.
By the way, the whole field has really shifted in a lot of ways over the past decade, right?

(33:35):
I'm sure you remember, in the years after ImageNet, after everyone was excited about deep learning, a lot of the field was driven by the ImageNet benchmark. And every year, people came up with new models that performed better on

(33:56):
the ImageNet benchmark. And the dataset was static, right? It was just the ImageNet dataset. You split it into your train and test data, and it was all about: can you come up with better model architectures and better optimization algorithms to do better on that dataset? And now the optimization algorithm is more static, right?

(34:18):
It's variants of stochastic gradient descent. The model architecture, of course, there's still a lot of innovation there, but it's more static than before because you have transformers and so forth. And all of the innovation is really going on the dataset side, right? That was considered a static thing in the past, and now that's actually where people are putting all of their energy and spending tons of money

(34:40):
and using lots of AI to curate the data. And that's just a complete paradigm shift.
Yeah, absolutely. Awesome. Let me take some audience questions. There are some questions around the training aspects. There's a specific question of how you differentiate between PyTorch DDP and Ray's distributed training library.

(35:01):
Are they complementary?
They are complementary. Most people using Ray use it along with PyTorch or with different deep learning frameworks. One way to use Ray is to take your PyTorch model, and

(35:22):
we have a Ray Train wrapper, which you can pass your model into, and then Ray will set up different processes on different machines and set up DDP or other distributed training protocols between the different PyTorch functions in the different processes.

(35:45):
So essentially, in that case, it's like a thin wrapper around DDP, around PyTorch. And what Ray is adding is setting it up easily, having a standardized way to do this across different frameworks, and also a way to handle the data ingest and preprocessing and feed that in, as well as the fault tolerance pieces around handling

(36:06):
application or machine failures. So it is very complementary. In some sense, you can think of frameworks like PyTorch, and also various inference engines like vLLM and TensorRT-LLM, as being focused on running the model as efficiently as possible on the GPU or on a set of GPUs, really single-machine performance.

(36:29):
And then Ray handles the multi-machine scaling challenge, a lot of the distributed systems challenges. So those are very complementary things.
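A minimal sketch of that "thin wrapper" idea with Ray Train's TorchTrainer: the inner function is near-vanilla PyTorch, and Ray sets up the worker processes, DDP, and retries on machine failure. The toy model and data are placeholders.

```python
# Sketch: Ray Train wrapping a plain PyTorch training loop with DDP.
import torch
from torch.utils.data import DataLoader, TensorDataset
import ray.train.torch
from ray.train import ScalingConfig, RunConfig, FailureConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))  # toy data
    loader = DataLoader(dataset, batch_size=64)
    loader = ray.train.torch.prepare_data_loader(loader)  # adds DistributedSampler

    model = torch.nn.Linear(16, 1)
    model = ray.train.torch.prepare_model(model)          # wraps in DDP, moves to device
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for _ in range(config["epochs"]):
        for x, y in loader:
            loss = torch.nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 3},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),   # multi-GPU workers
    run_config=RunConfig(failure_config=FailureConfig(max_failures=2)),  # retry on crashes
)
trainer.fit()
```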
Yeah. So I guess the next question is probably around RAG and embeddings. How does it work for those use cases?

(36:51):
Maybe you could take some specific examples of customers developing RAG applications on top of your platform.
Yeah. So there are many different pieces to the whole RAG pipeline, right? You may do a subset of these and not all of them, but first of

(37:14):
all, there's embedding computation, which involves taking your data and processing it, and there are many decisions to make about how you process and chunk the data and compute the embeddings. That can be a large-scale problem or a small-scale problem depending on the data you have. But the data preprocessing, and letting you iterate

(37:35):
there and scale that, that's something that Ray does very well. Then there's also the actual real-time inference part, where you are serving requests, right? And there are often a number of models being composed together to do this.

(37:56):
It's not necessarily just one LLM, right? You have the retrieval stage, where you are retrieving different pieces of content based on your embedding of the query. You may use additional models to rank the context and decide what context to feed into the model.

(38:17):
You can also use other models to rewrite the query before you embed it. Then you feed that into your generation model and generate the output, and you may have other models fact-check or check the correctness of the output. So the way you're doing serving is actually often running

(38:40):
many different models together. That's something we didn't talk about as much, but the inference problems we see with Ray often involve more and more models deployed together, not just a single model, and many calls to different models. So there's the serving side of things, which can have a growing amount of complexity.
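A hedged sketch of composing several models behind one endpoint with Ray Serve deployments and handles (recent Serve API), mirroring the retrieve, rerank, generate flow described here; every model is stubbed out, so only the composition pattern is real.

```python
# Sketch: multi-model composition behind a single Ray Serve endpoint.
from ray import serve
from starlette.requests import Request

@serve.deployment
class Retriever:
    def __call__(self, query: str):
        return [f"doc about {query} #{i}" for i in range(5)]        # stub retrieval

@serve.deployment
class Reranker:
    def __call__(self, query: str, docs: list):
        return sorted(docs)[:2]                                      # stub reranking

@serve.deployment  # in practice this deployment would request GPUs
class Generator:
    def __call__(self, query: str, context: list):
        return f"Answer to '{query}' using {len(context)} passages"  # stub LLM

@serve.deployment
class RAGPipeline:
    def __init__(self, retriever, reranker, generator):
        # Bound deployments arrive here as handles to the running replicas.
        self.retriever, self.reranker, self.generator = retriever, reranker, generator

    async def __call__(self, request: Request):
        query = (await request.json())["query"]
        docs = await self.retriever.remote(query)
        top = await self.reranker.remote(query, docs)
        return await self.generator.remote(query, top)

app = RAGPipeline.bind(Retriever.bind(), Reranker.bind(), Generator.bind())
serve.run(app)
```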

(39:00):
And there's also this iteration loop where you need to continually improve the quality of the application, right? It's not just the quality of the model; it's the quality of the end-to-end application. That may mean fine-tuning the model, but it also may mean fine-tuning your embeddings.

(39:20):
It may mean chunking your original dataset in different ways. And so it's actually very important to develop unit tests for different stages of the RAG pipeline, right? Because if you make some change and it gets better or worse, say it gets worse, you need to know what got worse, right?

(39:44):
Is it the generation that got worse? Is it that you fed the wrong context in? Maybe you ranked the context incorrectly. Maybe you retrieved the wrong context in the first place. So there's a whole software engineering practice that needs to be built up around this, in order to really unit test these

(40:06):
different pieces and quickly identify where the room for improvement is. Where Ray comes in: Ray doesn't solve all of these pieces, but it is useful for all of the compute pieces, the data handling, the fine-tuning, the serving, so that the people developing these RAG applications can focus mostly on

(40:29):
the application logic and how information is flowing from one place to another.
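One way to act on that advice is to pin down each stage with its own small checks, for example the following; the `retrieve` and `generate` callables are injected stand-ins for your own pipeline stages, and in practice they could be wired up as pytest fixtures.

```python
# Sketch: stage-level checks that localize a RAG regression to retrieval vs. generation.
def check_retrieval_hits_expected_doc(retrieve):
    docs = retrieve("how do I start a Ray cluster?", k=5)
    assert any("ray start" in d.lower() for d in docs), "expected doc not retrieved"

def check_generation_grounded_in_context(generate):
    context = ["Use `ray start --head` to start the head node."]
    answer = generate("How do I start the head node?", context)
    assert "ray start --head" in answer, "answer not grounded in provided context"
```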
Got it. That makes sense. And so I guess people want to know how this differentiates from the whole RAG architecture that has emerged, using a vector database and a

(40:51):
modeling layer and an orchestration system together to hook up your RAG application. Let's say you use one of the open source vector databases, say Qdrant or whatever, and maybe you have a Llama model and you're trying to put all of these things together through LlamaIndex. How does that workflow differ from using something like

(41:16):
Ray? What are the pros and cons here?
I mean, they're complementary. In fact, Ray is very focused on the compute pieces. We don't actually provide a vector database; we don't store your embeddings. And so people who are using Ray to build RAG

(41:38):
applications are using Ray along with a vector database or along with LlamaIndex and these other tools. Where Ray ends up being really useful is, again, if you're running everything on your laptop, then that's fine, and that's really not the regime where you need Ray.

(42:00):
Of course, it can be useful on your laptop for using multiple cores and such, but where Ray really adds value is when you need to scale things, when some step of the process starts to be too slow. Maybe you need to fine-tune on more GPUs, or maybe you need to do the data pre-processing on more CPUs, these kinds of things.

(42:28):
Yeah, basically moving from single-node or workstation-based RAG applications to a distributed, highly scalable RAG application using that compute layer, essentially.
And giving you the experience, making it feel very similar to just writing Python on your laptop.
Yeah, yeah.
By the way, I actually want to emphasize the complexity

(42:50):
we're seeing around inference, because it's been growing quite a bit. A lot of people think of machine learning serving as taking a model and hosting it behind an endpoint and maybe auto-scaling the replicas of the model. But we're starting to see applications, or AI products,

(43:12):
that people are building where they want their product to complete an end-to-end task, like booking an Airbnb, or sending an email for you, or making a reservation at a restaurant, these kinds of things. Writing code is another example.
Yep.
And the way they're architecting these today, at least, it's not

(43:33):
a single call to a single model. If you want to write code, or you want to book an Airbnb, there are many steps, right? You have to take some vague description that a person gave about what kind of vacation they're looking to have; you may have a model turn that into different types of requirements. You may have another model retrieve different candidate Airbnbs.

(43:59):
You may have another model call score each candidate against the different criteria. You may have a model generate an explanation for each of the top-ranked Airbnbs: why that is a good choice based on the criteria. I'm sort of oversimplifying it, but completing these

(44:22):
end-to-end tasks often involves many calls to different models. So you end up with a serving challenge that's not just a single model behind an endpoint. It's highly dynamic, where you have calls to different models that are determined based on the output of previous models.

(44:42):
You may start with GPT-4 for all of these, just to prototype and get it working. But when you have hundreds of model calls stacked together, you often, at some point, find that, hey, the latency really adds up, the cost really adds up. And you don't actually need a fully general-purpose

(45:02):
model for each of these things; in some cases, a small or specialized model that's really good at one thing may do the trick. So you end up starting to use fine-tuning and open source models, and composing these things together. It can actually get quite tricky. And I think this is the direction that inference for machine learning is

(45:24):
going in: in the coming years, you're going to see very complex serving systems that have huge numbers of models being composed together in a very dynamic way to complete complex tasks.
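To illustrate the shape of that kind of multi-call task, here is a hypothetical sketch where small specialized models handle narrow steps and a larger model is called once at the end; `small_model`, `large_model`, and `search_listings` are injected stand-ins, not real APIs.

```python
# Sketch: one end-to-end task decomposed into several model calls of mixed sizes.
def plan_trip(request: str, small_model, large_model, search_listings):
    # 1. A small model turns a vague request into structured requirements.
    reqs = small_model(f"Extract destination, dates, and budget as JSON: {request}")
    # 2. Ordinary application logic retrieves candidate listings.
    candidates = search_listings(reqs)
    # 3. A small model scores each candidate against the requirements.
    scored = [(c, small_model(f"Score 1-10 for fit.\nReqs: {reqs}\nListing: {c}"))
              for c in candidates]
    top = sorted(scored, key=lambda s: s[1], reverse=True)[:3]
    # 4. A larger model writes the user-facing explanation once, at the end.
    return large_model(f"Explain why these listings fit the request {reqs}: {top}")
```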
Absolutely. I think there's this whole notion of model merging that has emerged, where people want to blend these different

(45:46):
models, merge together responses, and hook up these workflows. This is great. So I guess, finally, as you describe the stack: you may have a vector database, you can have a modeling layer of your choice, you can have an orchestration system.

(46:07):
And you can have, I think, Ray and Anyscale functioning as this really robust distributed computing framework to bring all these together. And then, how does observability fit into this? For example, would you consider that as an additional add-on to this workflow, to evaluate these metrics that we talked about?
Yeah, I think of observability as critical, and it comes up in a lot of ways, because

(46:34):
you often think about how hard it is to develop an application, but where a lot of the time is spent is really debugging when something goes wrong. And many things can go wrong. It can be that something is not behaving correctly, or something is crashing, or it's working just fine but it's too slow.

(46:56):
There are many forms of this: you could be debugging performance issues or debugging correctness issues. And that is something we've seen people spend countless hours trying to resolve. And if you don't have the information readily at your fingertips that

(47:17):
you need to answer those questions, then you may have a really bad time. And of course, sometimes you don't know what information you wanted to log or store until the error happens, and at that point it's too late. So it's very hard to run production systems without observability tooling.

(47:38):
In fact, it's really essential.
Absolutely. I think that's a great line to sign off on at this point. Thank you so much, Robert, for joining us on this webinar and sharing your valuable thoughts. I've learned a lot.

(47:59):
And for those of you who are thinking about moving from workstation-based AI development for prototyping to productionization, please definitely look at Ray and Anyscale; they're doing some great work with tech-forward companies. And then of course, as you think about observability, we are always

(48:19):
there from Fiddler to help you out.
That's about it for this AI Explained.
Thank you so much, Robert.
Thank you.