Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes

Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Tobias Macey (00:11):
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
Your host is Tobias Maci, and today I'm interviewing Robert Nishihara about the challenges of maximizing the utility of your available hardware for AI and data intensive applications.

Robert Nishihara (00:26):
Robert, can you start by introducing yourself? Thanks for having me on. I'm Robert. I'm one of the co founders of AnyScale,
and we are commercializing Ray, which is an open source project distributed
system that
a lot of companies use to scale
and run compute intensive AI workloads
ranging from model training to training data preparation,

(00:49):
inference, reinforcement learning. We can go into a lot more detail on all of those, but this is an open source project that we started
as grad students at Berkeley, and then started any scale to commercialize.

And do you remember how you first got started working in the data and AI space?

Yeah, well, I actually got my start more with AI research, more on the theoretical side of things. This was just around the time that deep
learning was taking off. If you remember in 2012,
2013,
deep learning was delivering amazing results, groundbreaking
results
in computer vision, and the whole world was It felt like AI was just exploding at the time,

(01:31):
so that's when I got into AI research.
I had no background or interest in distributed systems at the time, on the infrastructure side of things. My interest was really more on the algorithmic
side. Can we come up with better algorithms for learning
from data, for training these models? Can we develop better general purpose techniques for learning?

(01:52):
So at the time, we were doing research on deep learning training methods, optimization algorithms,
reinforcement learning algorithms.
And the bottleneck that we faced, or one of the bottlenecks, was that in order to AI is very empirical, so if you come up with a new algorithm,
in order to validate whether it's good or not, you have to try it out and see how it performs in practice.

(02:15):
Often need to really convince yourself you need to not just try it out on a small toy problem, you need to try it out at some meaningful scale with a largish model,
a large amount of data, and see if it can really solve the problem.
In order to run those experiments, you need to scale your
algorithm across a bunch of machines, across

(02:35):
a bunch of GPUs,
or you have to And so we found ourselves spending all of our time I'm talking about my fellow grad students and myself.
We were spending all of our time building tools for managing clusters, for handling machine failures, like moving data across machines,
getting stuff to run on
cheaper
spot instances or on GPUs,

(02:57):
and we thought, Wow, surely there's
some reusable tooling that can be built here so that not everyone has to build their own tools and redo all of this all the time. And that led us to start RAID. Basically,
we
thought that AI was taking off and that the need for scale was only going to grow, right? That the era of doing machine learning research on a single machine was going to be a thing of the past, and so the need for scale and the degree of scale was only going to grow.

(03:26):
And that was just going to introduce a lot of hard systems and infrastructure challenges, and so there 's a big opening to try to build something useful there.
Initially, we were just building it for ourselves, but we thought, Hey, distributed computing is really hard. We'd love to build tools that are useful for a lot of people.
That's kind of how we got started. And it ended up being very prescient

and also coincided
fairly closely with the introduction and rise of Kubernetes.
I'm wondering what are some of the evolutions
from when you first started Ray and introduced it as an available project
through to where we are now where we have moved beyond
the niche aspect of these deep learning. Despite the popularity that it grew to, it was still fairly bespoke in terms of who was using it to LLM training and inference, which has gained much broader adoption,

(04:20):
and the distributed compute ecosystem has also evolved substantially.
Just wondering if you can talk through some of that growth and how Ray has taken advantage of it.

Yeah, and you mentioned far more people are building
models today than before,
and that is really going to accelerate with the rise of coding agents, which it's still, even today, it's hard to build models and make use of your data.
As coding agents like Cursor and Cloud Code and others reduce that barrier and the amount of expertise required to really make use of your data and train models, I think far more people are going to do it. And you're right, the landscape has changed a tremendous amount.

(05:02):
Kubernetes
was not anything like what it is today, the standard it is today.
Today, the vast majority of Ray users are using Ray on top of Kubernetes, and Kubernetes has really just emerged as the dominant standard for container orchestration and managing
provisioning

(05:22):
compute and managing container
life cycles across different clouds.
And that was not the case when we started Ray.
We were in a distributed systems and AI lab at Berkeley,
and so there was a lot of The
people who had created Apache Spark were in that lab, and so we had a lot of experience with other distributed systems, and we launched a bunch of Spark clusters, and the standard way to launch Spark clusters was not on Kubernetes, it was to

(05:53):
run this script that another grad student had written to directly talk to EC2 and spin up virtual machines and SSH to them and install all the relevant stuff, and
that has changed quite a bit. But I would say the change you've seen over those years is
the shift
from a high degree of fragmentation to

(06:14):
consolidation
of the infrastructure tech stack.
And Kubernetes is one example. There used to be a number of container orchestrators and
ways of doing container orchestration.
At another layer, think about deep learning frameworks.
Of course, you're familiar with
PyTorch and TensorFlow.
There used to be Theano

(06:36):
was a popular one, Torch.
Actually,
folks at Berkeley
created Caffe, which was one of the early popular
deep learning frameworks. There were a dozen of these. There are a ton of different deep learning frameworks,
and
now it is primarily PyTorch.
So what
often happens when you have a new use case or new workload, like the emergence of AI, there's a proliferation of different frameworks

(07:04):
and then consolidation,
because you tend to have a standard emerge over time that gains momentum that many people are using and contributing to, especially with open source. Open source really lends itself to the
emergence of a standard.
And so the PyTorch layer

(07:24):
that we see, the Kubernetes layer, those are really
some of the most important layers that we see, pieces of software that we see people using together with Ray. And so typically,
our users are not just using Ray on its own, they're using Ray plus PyTorch plus Kubernetes,
often plus a VLLM or SGLANG
to implement their overall workloads.

And
I know too that one of the
early standout features of Ray was the RayTune
library,
which was focused on hyperparameter
tuning, which was a very frequent topic of conversation
and one that I hear a lot less now, particularly because there are so many more of them.

(08:08):
And I'm just wondering if you can talk to some of the ways that the focus of Ray has shifted from when you first started it and in those early days of deep learning growth and adoption to where we are now, where deep learning is still valuable and there are areas of utility for it, but the majority of the focus is on these transformer based language models or vision models?

Yeah,
hyperparameter tuning itself is a fun area to talk about, and then I'll come back to
how Ray relates to all of that. But was always hard to difficult to optimize
neural networks, because there are many choices you have to make. Choices around the architecture,
choices of the parameters of the optimization algorithm, learning rates, things like dropout and so forth. And

(08:58):
if you got the parameters a little bit wrong,
it was easy
for the optimization
to not work. And so
there were a lot of papers written and a lot of experiments done to
figure out the best way to just search over these different parameters.
Hyperparameter search is basically

(09:19):
train the model a bunch of times with different settings of the parameters and see what works the best. And you can be more clever about that, you can be less clever about that and search randomly.
In fact, random search is a very strong baseline.
And
of course that lends itself well to Ray because

(09:39):
if you're training the
same model a bunch of times, it's a compute intensive workload, you're using a bunch of things in parallel,
and so you need a good way to express that. And it can get more complex than just
run a bunch of experiments in parallel,
because you're often looking at the results of some of the experiments,
stopping them early,
investing more resources in the promising ones, spinning up new experiments based on what worked well or what worked poorly,

(10:05):
but it's not that complicated.
Now,
what has changed in hyperparameter
search? First,
a few things have changed.
One is
we're starting to train really big models,
and you just, for your big run, you can't actually run 100 copies of the training run, because you just don't have the compute resources to do that. So that

(10:28):
is no longer
viable in a lot of cases.
Second,
we came up with a lot better initialization techniques for
neural networks.
Lot of the problems that hyperparameter search was solving was just not knowing how to initialize
the weights of the neural network or

(10:48):
how to set learning rates and things like that, and that has been a lot more best practices have emerged around that, so you can get it right more frequently.
And the third thing is
where
the emphasis is now is not about run a bunch of experiments and take the best result.
It's really more,

(11:08):
I'm only going to do one big training run.
So
I need to get that one right.
I can do a bunch of small scale experiments first. And so how do I do small scale experiments
and learn how to set the parameters at that small scale and configure things properly so that when I do my big run,

(11:30):
will work? And so there's a
progression
of run more experiments at a small scale, use that to set a smaller number of experiments at the next scale, and gradually go up in scale,
but reduce the number of experiments,
and use what you learned at the smaller scale to ensure that the larger scale runs go well. So there's a lot of shift in perspective

(11:53):
of how to do what it means to do hyperparameter search.
Now, just to say hyperparameter
search is one use case that people used Ray for. Another very early
use case for Ray was reinforcement learning. This was actually kind of the motivating use case because
the whole world was excited about this previous wave of reinforcement learning with Atari and MuJoCo and AlphaGo,

(12:16):
and we wanted to do research on those types of algorithms.
And
now, of course, we can talk about how the use cases we see
people running with Ray have shifted, but
reinforcement learning, of course, has made a comeback. And now we see a ton of people using Ray for reinforcement learning

(12:36):
for post training LLMs.
And
one of the
actually, just the other day, Cursor released their Composer two model,
which
uses a lot of reinforcement learning to
build a great coding model. And
they use Ray for all of that for
basically the RL infrastructure.

Then in order to be able to
build these models,
whether you're doing a from scratch
training
run or you're doing fine tuning of the model, there's also a lot of data preparation and data manipulation involved, which Ray is also very well situated for, particularly given the fact that it's easily integrated into the broader Python ecosystem,

(13:18):
which has become the de facto language for a lot of these workloads.
And for people who are
building these data workflows, data pipelines,
there are numerous tools to choose from. I'm just wondering if you can give some of the ways that you think about the juxtaposition of Ray versus something like an orchestrator like Airflow or Dagster or a distributed compute system like Spark and some of the selection criteria for when somebody would use Ray for a particular use case versus reaching to some of the other tools that are more of these, I'm going to say,

(13:53):
pipeline native workflows?

Yeah, that's a great question.
So data preparation, training data preparation and preprocessing is something that has changed so much since
the early days of deep learning. If you remember
early on when people were training deep learning models,
you would preprocess your data, but
the preprocessing you would do was very limited. You might in the case of ImageNet, a computer vision data set, you might try to

(14:24):
take each data point and generate multiple data points to augment your data set. And you could do that by
taking each image and randomly cropping or scaling it to get some
more versions of that same image.
Going back to that point in time, all of the research was on model architecture.

(14:48):
The
ImageNet dataset was a fixed dataset
and split into your training set and your test set, so it's this fixed benchmark.
All of the research was,
how do I choose the best optimization algorithm and the best neural network architecture
so that I can when I train that on my training set and then test it on my test set, I get the best results, get the best score.

(15:12):
That has almost entirely flipped. Of course, there's still interesting research happening
on
model architectures, but people have largely converged on the transformer architecture.
And optimization algorithms are largely
variations of stochastic gradient descent,
and despite
tons and tons of research in that area.

(15:34):
And
people have realized that
really getting
great results comes down to getting the data right.
And so there was this maybe previous mental block where the dataset was treated as static. It was treated as kind of a given that you don't optimize, and now that's the thing you really optimize over, and you invest money in collecting data, and you spend a lot of your experimentation

(16:00):
and compute budget
on curating the data. And so
now we see
it's not simple data cleaning where you just strip trailing white space from your sentences and you crop your images so they're all the same size and things like that, and normalize your data.
You are
really
doing a ton of experimentation

(16:22):
and actually running a ton of models to filter out low quality data,
to augment your data with high quality synthetic data,
and annotate your data.
For example,
often using models to generate a lot of structure in your data. For example, if
you have a data set of images or videos, you might use a vision language model to generate captions for those videos and images,

(16:46):
and then to later use in training. You may compute embeddings. It's very common to run dozens of classifiers
in your data preparation
process
so that you can use
those tags, you can filter out subsets of your data to train on specific subsets,
and
you can filter out low quality data.

(17:08):
So it's very common to see data curation pipelines that involve dozens of stages of filtering and annotation
and
are really,
really complex.
It's just enormously different from how people did prepared training data in the past. So a lot of things have changed about this data preparation stage.

(17:30):
Some of the big differences
are that it's now model driven and GPU driven instead of CPU driven.
Basically,
are
More and more data processing is shifting to GPUs
because
more and more data processing
is being done with inference,
and
that is a massive shift. So you ask about how does Ray relate to the data world,

(17:56):
and when would you use Ray versus Apache Spark,
or when would you use
Airflow and workflow orchestrators like that? Let me start with the workflow orchestrators,
because those are actually quite complementary with Ray. They operate at different levels of granularity.
Of course, there's always overlap,
but I think of Ray as

(18:18):
running an individual
workload.
I'm
running my data processing pipeline, and it has
this op stage where I download the data and
decompress the videos, and then I stream that into
some CPU
based heuristic filtering, and then I stream that into some

(18:38):
model based filtering, which
judges aesthetic score and
filters out explicit content or these kinds of things. And then this streams into some vision language model stage where I'm running using transformers,
and then maybe I'm writing out to Parquet or writing to some blob storage.
And
Ray would be used to express this workload,

(19:01):
assign
different compute resources to each stage of computation,
manage the processes
sort of that are executing each operation, stream data between them, handle back pressure if one stage is too slow,
handle failures if one process dies and needs to be recreated,
handle auto scaling of

(19:23):
the compute resources to match the throughput of all the different operations
and things like that, really solving these core distributed
computing challenges.
Now, that data processing
workload
might just be one
part of some overall broader
workload. For example, maybe you want to set up something like you're collecting new data every day. You're a robotics company. You have

(19:48):
robots out in the real world that are streaming video back, so you've got new data coming in every day. Every day, you want to run that data processing job to curate your data.
Once that finishes,
you want to take the curated data,
maybe copy that over to a different cloud,
and then your Neo Cloud, where you run training, and then kick off a training job. That kind of coarse grained

(20:09):
workflow orchestration is where you would use something like Airflow or a different one, and that's very compatible with Ray. You might use Ray to run the individual components of each
stage.
So we see Ray being used together with
Airflow, Flight, and all of these different engines.
The Spark comparison is interesting. So there are

(20:30):
First,
the high order thing, of course, is that
Ray is used for a huge range of workloads, and Spark is specific for big data processing. So Ray is also used for big data processing, but it's also used for reinforcement learning, inference, training, and
stuff like that. But you
can ask the question specifically about data processing. If I have a bunch of data and I need to transform and manipulate my data,

(20:55):
when would I use Ray and when would I use Spark?
They are actually designed for very different use cases. So to oversimplify,
Spark was built at a time when, with the emergence of big data,
when
people were not really thinking about GPUs.
So if I to, if I have a bunch of tabular data, if I have a bunch of data that is nicely structured in tables,

(21:18):
and I want to
do analytics, I want to join a bunch of tables together
and run SQL queries
and do analytics like that, Spark is a fantastic choice. Where Spark starts to run into limitations
is when I have more multimodal data. Perhaps I'm working with images,

(21:38):
video,
robotic
sensor data,
all of these types of things.
And
the way that I manipulate this type of data, the way I get value out of this really
unstructured,
really multimodal data is not by running SQL queries on it.
Instead, it is by running inference on it. So what am I gonna do with

(22:00):
just a PDF or a video?
I'm going to feed it into a model. And that because that model is is able to understand it and and and reason about it. And so
we're entering this regime where data processing is becoming
an inference heavy workload and is running on GPUs. You've still got CPUs, of course. You've got it's a mixture of CPUs and GPUs. And

(22:25):
that heterogeneity,
this scenario where you have tons of multimodal data
and it is being processed both with inference and with regular processing
on a mixture of CPUs and GPUs,
that is a scenario where Ray is the best system for is really designed for using that for that kind of scenario.
And so to oversimplify,

(22:47):
Spark is amazing for running SQL queries
on tabular data,
and Ray is amazing for running inference on multimodal data.

You
mentioned
GPUs
being a critical
hardware need for a lot of these data processing workloads.
You mentioned
the focus on inference that Ray has been investing in. And we also briefly touched on some of the distributed systems
capabilities
around Ray and Kubernetes

(23:19):
and some of the overlap there.
And I think that all of that focuses in on an interesting question as well that a lot of people are dealing with right now is how do I actually get the most out of my hardware because these GPUs are super expensive. I wanna make sure that they're not just sitting idle when I could be putting them to some good use or deallocating them from my cloud environment if you're in a cloud and using some form of auto scaling.

(23:43):
I also know that, in particular, Kubernetes,
their orchestration
system is optimized for
a different style of use case than what a lot of these high data throughput
workloads need. And so you might sometimes find yourself fighting with the Kubernetes orchestrator where it's saying, hey. You're done over there, and you're saying, no. I'm really not. And I'm just wondering if you could talk to some of those aspects of how teams are dealing with some of this question of resource optimization at the hardware level to be able to get the most out of their expensive compute.

Yeah, that's a great question.
I'll talk about
the Ray Kubernetes relationship, and also about just really getting compute optimization,
getting the best utilization.
I want to say one more thing on the data topic,
which is that I mentioned Ray is amazing for this world of GPU data processing,
processing data with inference, and multimodal data. I think it's worth pointing out that

(24:39):
all of this multimodal data was previously useless.
So if you think about And I'm exaggerating a little bit, but think about all the random PDFs and documents sitting in your organization,
or video recordings of meetings,
or of sales calls, or
audio recordings of things.
You would store that data and then not do anything with it. Think about how many meetings people have been in where

(25:06):
you might record the meeting, but no one's going to go back and actually listen to it.
And so that data was There's tremendous
value and information in all of this data. It's just really hard to manipulate and get insights out of it. So the thing that has changed is that
AI is making it possible to programmatically

(25:27):
analyze
and manipulate
all different types of data because you have powerful multimodal models.
And so of course, reason Spark
and other systems have been primarily working with tabular data is not that that's the only valuable data, but rather it's just the easiest to work with and to
ask questions about.

(25:49):
Tabular
data is really a tiny, it's a miniscule fraction of the world's data, And now that we can unlock value in all the rest of the data,
we're going to start storing way more of it. We're going to start using it all the time, and this is going to be tremendously valuable for her. So I wanted to share that. Now, on your point about

(26:12):
cost of GPUs, yes, GPUs are tremendously expensive. And so
getting good utilization of your expensive resource
is a hard problem, and it has to be solved.
And this is why a lot of companies have infrastructure teams that are responsible for managing all of the compute

(26:34):
and making that compute
available to their AI researchers
and
practitioners.
So think about the challenge. It's also much harder at different scales. There are many different dimensions of
complexity.
So it's one thing if you have one AI person
running
one training job on one cluster. But if I have five different Kubernetes clusters

(27:00):
across different clouds,
and I've got a team of
dozens of researchers
that need to share that compute. All of a sudden, I need to solve for a few things. I need my researchers to be productive. I need them to be able to really move quickly, debug easily, run things at scale, search for capacity across different clouds, and take advantage of compute resources wherever they are so that they can be productive. I also need to get great utilization, right? So I need

(27:28):
to be able to
prioritize different workloads against each other.
I may have a big training run that needs all the GPUs and needs to run all at once. But when that finishes,
I don't want the GPUs to just sit idle. So I might need some background
elastic job that can just soak up all the unused compute and just expand elastically to fill the unused compute.

(27:51):
And how do I set that up? How do I enable
teams to run different workloads with different priorities and appropriately
share their compute resources among them. The kinds of challenges that we see these infrastructure teams needing to solve are, one, making their developers and their end users, researchers
productive.
Two is having a standardized interface to all of their compute so that you can easily plug in new sources of compute. That's important for two reasons. One is it lets you shop around for the cheapest GPUs and plug them in. And second,

(28:24):
it lets researchers and the users run their workloads everywhere, so that factors into the productivity point. So that's a standardized interface to your compute. Third is
being able to
really do a good job of assigning the right resources to the right workloads. And that will mean
workload prioritization.
It will mean elasticity of low

(28:45):
priority workloads so that they can expand and
make sure you always have
stuff running so that your GPUs are not sitting idle. And then fourth is being able to search for capacity across different regions and different clouds,
which if you have bursty jobs,
you're going back and reprocessing all of your data to compute some new feature,

(29:05):
being able to quickly acquire a lot of capacity across a bunch of different regions is very helpful for that. So those are some of the
in order to do a good job with GPU utilization,
those are all of some of the challenges you have to solve. That is in addition to Those are challenges that sit around
what Ray does.

(29:27):
Ray is responsible for running your training workload, your RL workload, your data workload, and running that individual workload and making sure it is performance and reliable and fast and cost efficient. And then everything else I just described
sits outside of the workload and has to be solved at a different layer. So you have to think about many different layers of the stack. Now, Ray and Kubernetes, you asked about the relationship between Ray and Kubernetes. They're highly complementary,

(29:53):
and they sit at different layers of the stack. So for the AI infrastructure stack, I like to think about the PyTorch layer, the Ray layer, and the Kubernetes layer. All of this is the software stack that sits on top of your GPUs and cloud providers.
So each layer on its own is not sufficient. Each layer solves some fraction of the infrastructure problems that you need to solve. What PyTorch is responsible for is

(30:18):
squeezing the most performance out of the model on the GPU, just running the model both for training and inference
in the most performant way possible.
And
that is
not just PyTorch. There's a rich ecosystem around PyTorch. So think about frameworks like VLLM and SGLANG for optimizing
inference with transformers, or frameworks like Megatron for

(30:41):
training.
But fundamentally, the responsibility of that layer is about squeezing the most performance out of the chips by running the model. Ray, we talked about a bunch. The responsibility at that layer is about solving
the distributed computing challenges.
So this means process management, process lifecycle management, process coordination

(31:05):
and communication,
data movement,
data ingest, failure handling, because a lot of the hardware is unreliable,
resource allocation to different portions or components of your workload,
stuff like that. And Kubernetes is responsible for container orchestration,
so provisioning of the compute, managing container life cycles,

(31:27):
and things like that. And those are all
very complementary.
And we've also seen them co evolve with each other. So
the Ray open source community and the Kubernetes open source community
collaborate deeply to really
share information
and have the right interfaces and be able to optimize in a way that they

(31:49):
couldn't without each other. For example, Kubernetes has the ability to
resize containers, resize
containers to add more memory or things like that.
Kubernetes
on its own doesn't know when it's appropriate to resize the container. On the other hand, Ray is running the workload inside of those containers.

(32:10):
And so Ray has knowledge of what the workload is and what its resource requirements are.
And so Ray is actually in the perfect position to
say, hey, we need more resources over here. But Ray is not managing the containers, and so is not able to execute that without Kubernetes'
help. And so there are lots of things like this where,
by working together and co evolving,

(32:30):
the different layers of the infrastructure stack are
developing to work better together.
And this is also especially true with Ray and VLLM,
where the
VLM community and the Ray community have worked very deeply together
to
make it possible to do performant
cross node inference. So LLM inference gets far more complex

(32:55):
when you have large models that span many machines.
If you have a single model on a single machine,
that's a simpler scenario.
But a large model that spans multiple machines is its own little distributed system. A single query to the model may touch many different experts
your expert layers, and those experts may be sharded across different machines. And so the query is getting routed around in

(33:18):
complex ways.
Different parts of the computation may get disaggregated
into separate pools of compute. There are many things like this where
Ray and VLM work together to enable complex forms of cross node parallelism.

The whole idea of sharding the models is also interesting, particularly for people who aren't deep in the weeds of the actual model architectures.
And I know that the models themselves, they present as being this monolithic thing when in reality, they're just a hierarchy of the actual neural layers. I'm wondering if you can maybe talk a little bit too to some of those challenges of being able to distribute the model effectively across different GPUs for the case where you can't fit it all on a single chip.

This is an area that's become far more complex recently,
because in the past, when you were scaling inference,
you would just stick your model in a container,
and then you would replicate the container
however many times you need it. And it really just wasn't that complicated.
If you need to scale more, you replicate the container more. Now

(34:23):
you may have many containers and many
machines
that are running a single replica of the model.
And that is
so questions like elasticity
become a lot more complex, failure handling become a lot more complex, because you can't just think of failure handling at the container level.
Locality matters.

(34:46):
I mentioned when you are running a big model,
you may separate So with transformers especially,
there's this pre fill stage where you process the input tokens,
which might be more compute bound, GPU compute bound. Then you've got your decode stage where you're generating the output tokens one at a time, which might be more GPU memory bandwidth bound. And it might make sense to separate out those two stages into different pools of compute and assign different GPUs to each one. But there may be

(35:16):
certain shards of your prefill workers and certain shards of your decode workers that correspond to each other, and actually you want them to be co located. And so now you need to think about co locating containers in
the same nodes.
And that never happened before when you were thinking about just, I have a single model and a single container and replicating it. This has especially gotten more complex with large mixture of expert models,

(35:41):
where
the expert stages are often
sharded across a bunch of different GPUs
and
a single query needs to be routed around to different experts.
So that is something that is,
and there are many different
ways of sharding and partitioning your model. It is not like, oh, there's just one strategy that

(36:03):
always works the best. So that is an area that's grown far more complex.

In order to be able to
use and manage these systems effectively,
obviously, there's also the question of
observability
and being able to understand what is being executed,
how efficient it's being executed. I'm wondering if you can just talk to some of the leading reasons for
wasted or inefficient compute.

(36:30):
We obviously talked a lot about some of the ways that Ray helps to alleviate that situation,
but also just some of the organizational
and team education that's required to make sure that they understand how best to effectively apply the capabilities of a framework like Ray to
such a complex and multivariate
problem space.

Yeah.
One of the biggest reasons, I would say, is not
having a good way to share GPUs between training and inference. Fundamentally, and this is at the fleet organizational
level,
not at the individual workload level. If I have training workloads and inference workloads,
and
I am partitioning my GPUs between them and trying to provision

(37:15):
each one or inference for peak capacity,
then there are going to be a lot of times when my inference GPUs are idle and not
being used because
we're not at the peak capacity.
And so being able to
efficiently share GPUs between training and inference in a way that doesn't
risk

(37:35):
your most important production workloads,
but allows
excess capacity to be used by lower priority workloads
is
probably the number one thing.
And that is complex to get right. That's the kind of problems that are solved by the MECL platform
and the layers that sit around Ray.

(37:55):
And
I would say that's
outside of the workload.
Within the workload itself,
there can be many different bottlenecks.
It's important to be able to get
good performance within a single workload.
It's very important to have the tools to tackle
whatever bottleneck is you're running into at the moment. So

(38:20):
what do I mean by that? I mentioned with
in inference,
might separate out the prefill and decode stages into separate pools of compute, because
they are different shaped workloads. They have different types of bottlenecks, and so
it's natural that you might want different compute resources for each one into potentially different accelerators

(38:41):
and to
right size the compute for each stage.
The same thing can be true of a data processing pipeline or a training pipeline. You might have stages that are GPU bound.
You might have stages that are IO bound. You might have components that are memory bound.
And if you don't have a system that can support a large degree of heterogeneity

(39:04):
and
assigning,
breaking down a workload into different pieces,
assigning different compute resources to each piece, and rightsizing and scaling those resources appropriately,
then
you are going to likely have inefficiencies.
And this is just to give an example with training. As you scale training on more GPUs,

(39:25):
it's very easy to become bottlenecked by the data ingest and preprocessing side of things. I might be using some It might be very CPU heavy processing.
I might be using some models in my preprocessing right before it gets fed into training.
Now,
if I can't
separate out that preprocessing
into a different pool of compute

(39:46):
and then scale that pool of compute independently from training,
then
I'm not going have a way to eliminate the data in just bottleneck.
And so this is really where Ray shines. It's giving you the control
to separate out different pieces of your workload into different pools of compute
of any different type, connect them all together, scale them independently,

(40:10):
manage them independently, handle failures independently.
And so when there is a bottleneck in one component of your workload,
you can address that without everything being so tightly coupled.
And that is,
I would say,
the key for eliminating
bottlenecks within a workload. So that's probably the biggest thing inside the workload. Then

(40:32):
the biggest thing outside of the workload is
sharing resources between workloads, especially training and inference.

Yeah, the sharing resources
is one of the pieces that I think is
most complex
and probably
least effectively executed by most teams because it requires so much of that coordination of understanding
what the workloads are,
which also brings up the question of, well, what if I have two different Ray clusters running or I have one system?

(41:04):
Maybe I just have an independently deployed VLLM that's using up a portion of my GPU, and then I have Ray
using the other portion of that GPU for
data preprocessing,
and I'm just curious how you can give some level of visibility to Ray to be aware of the other workloads that are co located within that same pool of hardware.

Yeah. This is a great example of how you can do a better job by
co designing different layers of the stack,
because
the layer outside of Ray, the
overall
compute provisioning and orchestration layer, is the layer that's responsible for
deciding
which resources to allocate to which workloads,

(41:50):
which nodes to preempt, things like that.
But to do a great job of that, you really want to know what's running inside the workload, and that's information that Ray has,
information that exists at the
distributed compute, like the workload layer.
And
so
if you can combine those pieces of information, then you can do a much better job of preempting

(42:15):
non critical
portions of your workload that can be more easily scaled down or restarted later.
And
that is something that the kind of thing that we think about when co designing
Ray with these other layers of the stack. So
you're calling out a very important problem.

As
you have been
working
on
and building the AnyScale company around it and just evolving along with this fast moving ecosystem,
what are some of the most interesting or innovative or unexpected ways that you've seen Ray being applied and maybe some of the lessons that you've learned from these unexpected uses that have helped guide the future trajectory of the project?

Yeah,
it's been really
interesting to see how
the broad categories of workloads
have largely remained the same. If you think about training,
inference,
data processing,
but each one has
evolved a ton. We talked about how inference has grown in complexity with

(43:22):
large models and multi node models.
We talked a bit about how the data processing side has evolved,
to
accommodate
all this multimodal data and really these heterogeneous
inference heavy pipelines.
On the training side, we've seen the evolution of, or the
resurgence of reinforcement learning, which is far more complex than regular training because,

(43:48):
well, it has all the complexities of regular training. You're still doing regular training,
but you're also doing inference.
Are also running To generate new data, you're running simulations or environments
that are tightly coordinating
with the inference, going back and forth to generate data.
You are shuffling data around. You're moving the data back to training. You're moving the new model weights over to the inference portion.

(44:14):
You
can have failures in each of these different stages.
Training can fail, of course, but as
models, we have more powerful agentic models that can
reason and take actions over long periods of time,
you need to think about failures
on the rollout side of things, the data generation side, because if rollout

(44:36):
is happening over the course of
ten hours or a day or multiple days,
you really don't want to lose that progress. And so you have to handle failures on that side of things as well.
So this introduces a ton of both algorithmic
complexities as well as infrastructure complexities.

(44:57):
And
Ray is, the more
challenging
AI workloads become,
the more important Ray becomes.
Because
if
you're just taking
a single process and replicating that a bunch of times and running the identical thing

(45:18):
everywhere,
then
that's fairly simple to do and you
don't need Ray.
When
there is heterogeneity,
when your
workload is broken down into different components that have different responsibilities and need to coordinate with each other, like in reinforcement learning, like in multi node inference, like in these

(45:39):
AI data pipelines,
then you really need Ray.

And to that point of statefulness
and failure recovery,
it also brings
question of systems such as Temporal that are focused on that.
They're termed crash proof, which is a little bit of a misnomer, but I'm wondering what you're seeing as some of the ways that people are integrating with some of those more
stateful workflow management

(46:06):
systems like Temporal to be able to do some of that failure recovery, for instance, if you're in the middle of a training run and you want to be able to make sure that you can resume from the appropriate checkpoint or whatever.

Yeah, failure recovery is essential,
right? Because failures happen all the time. And not only is failure recovery important, but fast
failure recovery is important because
the larger your cluster, the more frequent the failures.
And if you have a failure in extreme cases happening
every few hours,

(46:37):
and you spend thirty minutes recovering,
reloading from a checkpoint, that is a significant fraction of your overall time.
So
systems that
Great
recovery from failure is really a non negotiable.
And
one interesting trend we've seen there is that

(47:00):
the application needs to have quite a bit of control over how to handle failures, because the appropriate way to handle failures
may be quite custom.
And we see this more and more with the GB200s
and 300s, some of the latest generation of NVIDIA hardware,
which
previously you would reason about

(47:21):
a node of maybe eight GPUs.
And now
the
right level to reason about is not a node, but a rack. You might have 72
GPUs in a rack where
each
rack has a number of nodes and communication within the rack is far more

(47:41):
faster
than communication between racks.
And there are multiple levels of topology here.
And because of the performance difference,
the application
needs to precisely
control
which parts of the application run-in the same rack, which ones run-in different rack. And so if

(48:03):
a GPU fails,
let's say there's 72 GPUs in the rack,
I know these things are going to fail, and they're
going to fail frequently.
And so I may start my application and run without
using all of the GPUs. Maybe I'll use 64 and keep some of them as backups so that I can
swap them in when something goes wrong. And now I need to know, okay, GPU dies.

(48:27):
My
application needs to have control over which precise
processes to tear down and making sure we can recreate them in the same rack, and it needs to track how many spares it has,
and
then wire everything up with specific other processes.
But then it may need to do something different when I've used up the spares.

(48:48):
It may need to
recreate,
tear down a bunch of stuff and acquire a whole new rack.
Or maybe it will decide
at that point, Actually,
we can't acquire a new rack.
We need to
proceed with training on just a smaller number of workers, a smaller number of racks, and that might mean adapting batch sizes and things like that. And so there is a ton of

(49:11):
logic and
control that the application needs to have
over
what to do in the presence of a failure, And that is something that
Ray uniquely provides, is really this level of control over process failures
and what to do about them.

And
in your work of
building these technologies and understanding the use cases and applications,
what are some of the most interesting or unexpected or challenging lessons that you've learned personally in your work of this very data and AI heavy workloads?

Yeah.
First of all, there are many potential bottlenecks all over the place. And when you're trying
to build a flexible tool like Ray that supports a variety of use cases,
it's not enough to
do one thing well, you have to really
have the tools to eliminate any bottleneck that could come up.

(50:11):
Now,
I would say another lesson,
there is a or there can be a tension between
building
general purpose tools
and really optimizing for a specific use case.
And
this is actually a common
question people will ask about Ray. They'll say, Ray is so general purpose, it's so flexible.

(50:34):
Doesn't that mean
it's less optimized for each individual use case?
And
there is some truth to that, but the way that Ray,
the way we approach that with Ray is to really
build two layers.
There's the lower level primitives,
which are highly flexible.

(50:54):
And really, this is the array as an actor framework. So the ability to The low level primitives is basically the ability to
spin up a Python class as an actor and to manage processes.
And
that is very flexible. You can build anything with functions
and classes, and so you can build any kind of distributed system with actors.

(51:18):
And at that level, that core part of Ray, we focus on keeping it simple,
having a very small number of highly flexible primitives,
and making sure they are
battle tested, they're hardened,
their performance.
And we are very conservative about adding new functionality at that layer.

(51:40):
Then
there is the library ecosystem on top of Ray, in the same way that you have Python and then the core primitives, then you've got the library ecosystem around Python.
And there are libraries built on top of Ray for
data processing, for training, for reinforcement learning, for inference.
And
that's the layer at which you can specialize and really

(52:03):
deeply optimize for a specific use case, building on top of the lower level primitives.
And so
that's the approach we've taken to
trying to get the best of both worlds,
combining optimization with generality.
And
I think that's worked quite well.
We've actually seen a rich ecosystem emerge on top of Ray. So nearly every

(52:26):
open source RL framework
is built on top of Ray.
Companies like NVIDIA are building their data curation frameworks on top of Ray. We see inference frameworks and agent frameworks built on top of Ray. So that is
one of the lessons.

And as you continue to build and invest in this technology and ecosystem,
what are some of the predictions that you have for some of the next set of challenges or the next evolution of use cases as we
maybe move into newer model architectures
or invest more in optimizing the existing set of transformer applications?

Yeah. Complexity is going to continue to grow
along every axis.
So
on the hardware side, there
will be more and more
levels of topology that we need to reason about, more and more topology aware
scheduling.

(53:30):
Hardware failure
rates will continue
to be very prominent. There will be more heterogeneity,
just way more types of hardware accelerators
and different cloud providers that people will use. So
the complexity and heterogeneity
will
continue to grow on the hardware side.
The same will be true on the application side.

(53:52):
Data scale will grow,
model scale will grow,
the
types and heterogeneity of data will grow. So for example,
one thing you don't really have with tabular data, with tabular data, your data roughly all looks the same.
With multimodal data, you might have a video that's
ten seconds long, and you might have one that's three hours long.

(54:14):
And
you're
going to have to deal with both.
And
really,
every different type of data you work with might introduce different systems challenges.
Working with PDFs is different from working with textbooks,
and
there's going to be more heterogeneity there.
You're going to see much longer running and much larger scale workloads. Like with reinforcement learning, you're going to see

(54:39):
rollouts that are happening
at much larger timescales.
You're going to see models continue to grow in size.
So
all of these are
things that will continue to happen.

Are
there any other aspects of the Ray project, the ecosystem,
the use cases that it supports that we didn't discuss yet that you'd like to cover before we close out the show?

I would say that
the
shift to the world becoming
GPU centric really
plays to Ray's strengths.
A lot of our investment areas are in
really
natively supporting
high

(55:28):
performance communication
between GPUs,
natively supporting RDMA and Ray,
different
transport back ends. This is
growing in importance and is really essential for performance
across a variety of use cases.

Well, thank you for taking the time. For anybody who wants to get in touch with you, I'll have you add your preferred contact information to the show notes. And as the final question, I'm interested in getting your perspective on what you see as being the biggest gap in the tooling technology
or training that's available for data and AI management today.

Yeah, it is still not a solved problem.
Would say one of the biggest gaps is
the tooling around multi cloud workloads,
and
it's very common now for
companies to work with
one hyperscaler
and one or more Neo Clouds,

(56:25):
and it's quite complex to
find capacity in the right location at the right time to make your data available in a performance and cost efficient way in the right location
at the right time to make your workloads portable so they can run-in different locations
and
elastic,
fault tolerance so they can

(56:45):
take advantage of whatever compute resources are available. So at the same time, it's incredibly valuable when you can get it right and you can have a setup so that
your infra team can just
continuously
shop around for the cheapest compute, the cheapest GPUs,
the latest hardware accelerator, plug that in into a uniform interface that

(57:07):
your
researchers and AI teams can take advantage of. So that
is a lot of the stuff that we're excited about solving at any scale,
and one of the biggest challenges we see.

All right. Well, I appreciate you taking the time today to join me and share all the great work that you and your team are doing on the Ray project and helping to
address the myriad complexities
of all of the technological
utility that we get from these large language models and the whole ecosystem of ML and data processing.

(57:39):
It's definitely a very complex problem as we discussed, so I appreciate all of the energy that you're putting into it. And I hope you enjoy the rest of your day. Thank you so much.
Thank you for listening, and don't forget to check out our other shows.
Podcast.net
covers the Python language, its community, and the innovative ways it is being used. And the AI engineering podcast is your guide to the fast moving world of building AI systems.

(58:08):
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com
with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.

All Episodes

Episode Transcript

Popular Podcasts

Hey Jonas!

Stuff You Should Know

Las Culturistas with Matt Rogers and Bowen Yang

.css-15opob5{left:0;position:absolute;top:0.8rem;} All Episodes

.css-14f5ked{margin:0;word-break:break-word;display:-webkit-box;-webkit-box-orient:vertical;box-orient:vertical;-webkit-line-clamp:2;overflow:hidden;}Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes

Episode Transcript

Popular Podcasts

.css-r6mb8g{margin:0;word-break:break-word;display:-webkit-box;-webkit-box-orient:vertical;box-orient:vertical;-webkit-line-clamp:1;overflow:hidden;}Hey Jonas!

Stuff You Should Know

Las Culturistas with Matt Rogers and Bowen Yang

All Episodes

Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes

Hey Jonas!