Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:02):
So is Azure the best place
to build and run your AI apps and agents,
even if you plan to use
open-source models and orchestration?
I'm joined today by Azure CTO, Mark Russinovich,
who some of you probably know
as the Co-founder of Sysinternals.
- It's great to be here, Jeremy.
- And today we're talking about inference
and we're going to start with a multi-agentic solution
to show what's possible,
then we're going to dig into what runs it
(00:23):
and what you can build.
- Let's do it.
We're going to use several agents working together
to build a custom video ad
with voiceover from scratch using the best AI models
and tools available for the job.
This page lets you provide a basic prompt
and upload pictures
and my agentic app will create a 30-second video
with voice narration and multiple scenes
for a product launch.
I'll start with the prompt to generate an ad for our new SUV
(00:46):
with its Overlander option package
that lets it go anywhere and escape everything.
I'll upload some pictures of the car with different colors
and angles from my local device.
Then I'll submit my prompt
and you can watch what the agent is doing on the right.
While the videos generate,
let me go into VS Code and explain what's behind this app.
This is using the open-source Semantic Kernel from Microsoft
for orchestration with Python code,
(01:07):
and we can see what's happening
play-by-play in the terminal.
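In rough terms, the agent wiring might look something like the sketch below using Semantic Kernel's Python agents API. The deployment names, endpoint, keys, and instructions are placeholders rather than the app's actual code, and exact class and method names vary across Semantic Kernel versions.

    import asyncio
    from semantic_kernel.agents import ChatCompletionAgent
    from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

    # Each agent wraps a different Azure AI Foundry deployment (placeholder names and keys).
    planner = ChatCompletionAgent(
        service=AzureChatCompletion(
            deployment_name="planner-model",
            endpoint="https://<your-resource>.openai.azure.com",
            api_key="<key>",
        ),
        name="planner",
        instructions="Break the ad request into narration, scene, and assembly steps.",
    )
    copywriter = ChatCompletionAgent(
        service=AzureChatCompletion(
            deployment_name="copywriter-model",
            endpoint="https://<your-resource>.openai.azure.com",
            api_key="<key>",
        ),
        name="copywriter",
        instructions="Write about 25 seconds of voiceover narration for the ad.",
    )

    async def main() -> None:
        # The planner's output feeds the copywriter; str() of a response gives its message text.
        plan = await planner.get_response(
            messages="Create a 30-second ad for the Contoso EarthPilot SUV.")
        script = await copywriter.get_response(messages=str(plan))
        print(script)

    asyncio.run(main())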
We're using Azure AI Foundry Models.
First, the DeepSeek R1 model in Azure
powers the main planning agent.
The next agent is a copywriter that interprets my prompt.
It's using Azure-hosted Llama,
which uses the open-source Llama 4 model from Meta
to write narration text that's around 25 seconds in length.
We then have another agent that uses text-to-speech
(01:29):
in Azure AI Foundry to take the output
from the copywriter agent to add voiceover to the ad copy.
It uses our brand-approved voice
and outputs an MP3 file.
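As a stand-in sketch of that text-to-speech step, here is roughly what it could look like with the Azure AI Speech SDK; the key, region, voice name, and sample text are placeholders, and the demo's brand-approved voice isn't shown here.

    import azure.cognitiveservices.speech as speechsdk

    # Placeholder key/region and a stock neural voice standing in for the brand voice.
    speech_config = speechsdk.SpeechConfig(subscription="<key>", region="<region>")
    speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
    # Request MP3 output so the narration can be muxed with the video later.
    speech_config.set_speech_synthesis_output_format(
        speechsdk.SpeechSynthesisOutputFormat.Audio48Khz192KBitRateMonoMp3)
    audio_config = speechsdk.audio.AudioOutputConfig(filename="narration.mp3")
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config, audio_config=audio_config)
    synthesizer.speak_text_async(
        "Adventure calls. Answer, with the Contoso EarthPilot hybrid SUV.").get()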
We have another video agent deciding which video scenes
make the most sense to generate.
This is based on the script generated
by the first ad copy agent.
The app then calls the Sora model in Azure OpenAI.
(01:49):
It will reference my uploaded images
and use the text prompts describing the scenes
in the order they appear
in the talk track to generate videos.
The prompts are used to generate
a handful of five-second videos.
This is an early look at the Sora API,
which is rolling out soon and unique to Azure,
with image-to-video support coming soon after the launch.
Once the video and audio files are complete,
(02:09):
the app uses the open-source FFmpeg command-line tool
to do the video assembly,
combining the video with the audio track,
and it will also insert prebuilt Contoso
intro and outro bumpers as the first and last video segments
to align with our brand.
Once all of that is complete,
it creates the finished, downloadable MP4 file.
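A simplified sketch of that assembly step, driving FFmpeg from Python; the clip and bumper file names are placeholders, and it assumes all clips share the same codec settings so they can be concatenated without re-encoding.

    import subprocess

    # Scene clips in talk-track order, wrapped with the brand intro/outro bumpers.
    clips = ["contoso_intro.mp4", "scene_01.mp4", "scene_02.mp4",
             "scene_03.mp4", "contoso_outro.mp4"]
    with open("clips.txt", "w") as f:
        f.writelines(f"file '{c}'\n" for c in clips)

    # Concatenate the clips without re-encoding, then mux in the MP3 narration.
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", "clips.txt",
                    "-c", "copy", "video_only.mp4"], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", "video_only.mp4", "-i", "narration.mp3",
                    "-c:v", "copy", "-c:a", "aac", "-shortest", "ad_final.mp4"], check=True)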
And it's fast
because as I was talking, the entire process finished.
(02:31):
And here's the end result.
- Adventure calls.
Answer, with the Contoso EarthPilot hybrid SUV,
rugged, reliable, and ready for anything.
From rocky trails to open highways,
this powerhouse gets you wherever your heart dares.
Upgrade to the Overlander option package
and unlock the ultimate getaway,
sleek rooftop tents,
elevated storage, and cutting-edge trail tech.
(02:51):
The EarthPilot takes you further, keeps you moving,
and lets you truly escape.
Contoso EarthPilot.
Go anywhere. Escape everything.
- Right, and in terms of inference,
you know, this is a lot more intense
than most text generation scenarios.
And you showed that the agents are actually consuming
quite a few different models from OpenAI.
We saw open-source orchestration, we also saw Llama,
(03:14):
we saw DeepSeek, all Azure AI Foundry models.
So what hardware would something like this run on?
- Everything you saw is running on the same
battle-tested infrastructure that powers ChatGPT
with more than 500 million weekly active users,
all running on Azure.
To put this into context,
if you want to run the agentic system
I just showed you on your own,
you need a pretty sizable cluster of H100
(03:34):
or newer GPU servers to run a video generation model
and encode everything.
On top of that, a large LLM like DeepSeek R1 671B,
even though it's considered efficient,
requires more than 1.3 terabytes of GPU memory
and 16 clustered NVIDIA A100 or newer GPUs.
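As a rough back-of-the-envelope check on that memory figure, assuming plain 16-bit weights and ignoring the KV cache and activations:

    # 671B parameters x 2 bytes (FP16/BF16) is roughly 1.34 TB just for the weights,
    # which is why you end up clustering 80 GB-class GPUs to hold the model.
    params = 671e9
    bytes_per_param = 2
    weights_tb = params * bytes_per_param / 1e12
    print(f"~{weights_tb:.2f} TB of GPU memory for weights alone")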
With the way we manage services on Azure,
we take care of everything for you.
You don't need to worry about provisioning compute
(03:56):
or connecting everything together.
OpenAI models including GPT-4o and Sora, DeepSeek, and Llama models
are part of our Models as a Service in Azure,
where we're running those specific models serverless.
You don't need to set up the runtime
or worry about tokenization or scheduling logic;
it's just an endpoint with built-in quota management
and auto-scaling.
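Calling one of those serverless endpoints might look roughly like this with the azure-ai-inference Python package; the endpoint URL, key, and prompts are placeholders.

    from azure.ai.inference import ChatCompletionsClient
    from azure.ai.inference.models import SystemMessage, UserMessage
    from azure.core.credentials import AzureKeyCredential

    # Placeholder endpoint and key; quota management and scaling happen behind the endpoint.
    client = ChatCompletionsClient(
        endpoint="https://<your-model-endpoint>.models.ai.azure.com",
        credential=AzureKeyCredential("<key>"),
    )
    response = client.complete(
        messages=[
            SystemMessage(content="You are an ad copywriter."),
            UserMessage(content="Write a 25-second voiceover script for a rugged hybrid SUV."),
        ],
    )
    print(response.choices[0].message.content)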
- In terms of scale, last May when we did
(04:16):
the Azure Supercomputer show together,
at the time, we were already supporting
more than 30 billion inference requests per day on Azure.
- And we passed that a while ago.
In fact, we processed over 100 trillion tokens
in the first quarter of this year,
which is a 5x increase from last year.
And the growth we're seeing is exponential.
Last month alone we processed 50 trillion tokens.
(04:36):
Peak AI performance requires efficient AI models,
cutting-edge hardware, and optimized infrastructure.
We make sure that you always have access
to the latest and greatest AI models.
For example, we're able to offer the DeepSeek R1 model,
fully integrated into Azure services
and with enterprise-grade security and safety,
just one day after it launched.
And when I say enterprise-grade security and safety,
(04:57):
I mean it's integrated with services
like Key Vault, API gateway, Private Link,
and our responsible AI filters.
You can access our model catalog directly from GitHub
and experiment without an Azure subscription.
Or if you use Azure, you can access it from Azure AI Foundry,
where we have over 10,000 Foundry Models,
including thousands of open-source and industry models.
(05:17):
And we've always been at the forefront
of bringing you the latest AI silicon
and making it available on Azure.
We closely partner with AMD on the design
of their MI300X GPUs with 192 gigabytes
of high-bandwidth memory, which is critical for inferencing.
And working with NVIDIA,
we were the first cloud to offer H100 chips,
along with the NVIDIA GB200 platform,
the most powerful on the market today.
(05:39):
It means we can generate tokens at a third of the cost
compared to previous generations of GPUs.
And we lead in terms of capacity, with tens of thousands
of GB200 GPUs in our massive, purpose-built data centers.
To take advantage of the best cost performance,
we have developed advanced liquid cooling
to run our AI infrastructure.
This includes our in-house chip Maia,
which is currently used to efficiently run our large-scale
(06:01):
first-party AI workloads,
including some of our Copilot services.
And our systems are modular, allowing us to deploy NVIDIA
and AMD GPUs on the same InfiniBand network infrastructure
to meet the specific demand for each.
- And what all this means is whether you're building now
or for a few years down the line,
you always have access to the most cutting-edge tech.
- Right, and I can prove the inference performance to you.
(06:23):
As part of our MLPerf benchmark test,
we use the industry-standard Llama 2 70B model.
It's an older model, but its size makes it
the industry standard for hardware benchmarking and testing.
And we ran inference on Azure's
ND GB200 v6 Virtual Machines,
accelerated by the NVIDIA GB200 Blackwell GPUs,
where we used a single,
full NVIDIA GB200 Blackwell GPU rack.
(06:44):
One rack contains 18 GPU servers
with four GPUs per node, totaling 72 GPUs.
We loaded the Llama 2 70B model on these 18 GPU servers
with one model instance on each server.
This is the Python script we ran on each server,
and using SLURM on CycleCloud to schedule the jobs,
we ran them in parallel.
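A hypothetical sketch of that launch pattern: one inference job per GPU server, submitted through SLURM from Python. The benchmark script name and its flags are placeholders, not the actual harness.

    import subprocess

    # Launch one Llama 2 70B instance per GPU server (18 nodes in the rack),
    # letting SLURM place each exclusive, 4-GPU job; run_llama2_70b_bench.py is hypothetical.
    procs = [
        subprocess.Popen([
            "srun", "--nodes=1", "--ntasks=1", "--gpus-per-node=4", "--exclusive",
            "python", "run_llama2_70b_bench.py", "--node-rank", str(rank),
        ])
        for rank in range(18)
    ]
    for p in procs:
        p.wait()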
You can see on this Grafana dashboard
(07:05):
the tokens-per-second performance of model inference.
As you can see in the benchmark results at the bottom,
we hit an average of around
48,000 tokens per second on each node.
And above that, you can see that we're totaling
865,000 tokens per second for the entire rack,
which is a new record.
The bar charts on the top right show how consistent
the performance is across the system,
(07:25):
with very low deviation.
- So how does this performance then translate
to the everyday AI and agentic solutions
that people right now are building on Azure?
- So I don't have the exact numbers for tokens
consumed per interaction,
but we can use simple math and make a few assumptions
to roughly translate this to everyday performance.
For example, something easy like this prompt
where I asked Llama about Sysinternals
(07:45):
consumes around 20 tokens.
Under the covers we need to add roughly
100 tokens for the system prompt
and an extra 500 tokens is a proxy
for what's required to process the prompt.
Then finally, the generated response is around 1,400 tokens.
So the grand total is close to 2,000 equivalent tokens
for this one interaction.
Remember, our benchmark test showed
865,000 tokens per second.
(08:07):
So let's divide that by the 2,000 tokens in my example.
And that translates to around 432 user interactions
per second per rack in Azure.
Or if you extrapolate that over a day
and estimate 10 interactions per user, which is pretty high,
that's around 3.7 million daily active users.
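The arithmetic behind that estimate, written out:

    # Per-interaction token budget from the example above: 20 + 100 + 500 + 1,400, call it ~2,000.
    tokens_per_interaction = 2_000
    rack_tokens_per_sec = 865_000                                          # GB200 rack benchmark result
    interactions_per_sec = rack_tokens_per_sec / tokens_per_interaction    # ~432 per rack
    daily_interactions = interactions_per_sec * 86_400                     # ~37 million per day
    daily_active_users = daily_interactions / 10                           # ~3.7 million at 10 per user
    print(round(interactions_per_sec), round(daily_active_users))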
And by the way, everyone should already know
how to use the Sysinternals tools
(08:27):
and not need to ask that question.
- Exactly. That's what I was thinking.
Actually, I've committed all of this stuff to memory.
- I'm not sure I believe you.
- A little bit of command-line help also helps there too.
But why don't we switch gears.
You know, if you're running an app at this scale,
how would you make sure the response times
continue to hold up?
- So it depends on the deployment option you pick.
If you run your model serverless,
you also have the option to maintain a set level
(08:49):
of throughput performance when you're using
shared models and infrastructure.
With the way we isolate users,
you don't need to worry about noisy neighbors
who might have spikes that could impact your throughput.
When you provision compute for serverless models
directly from Azure,
you can use standard, which is shared,
local provisioned, and global provisioned.
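For context, which of those deployment types a request hits comes down to the deployment name you call. A rough sketch with the Azure OpenAI Python client, where the endpoint, key, and deployment names are placeholders:

    from openai import AzureOpenAI

    # "ad-copy-ptu" would be a provisioned-throughput deployment and
    # "ad-copy-standard" a shared, pay-as-you-go one (both placeholder names).
    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key="<key>",
        api_version="2024-10-21",
    )
    response = client.chat.completions.create(
        model="ad-copy-ptu",   # the deployment name, not the underlying model name
        messages=[{"role": "user", "content": "Summarize today's launch plan."}],
    )
    print(response.choices[0].message.content)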
Here you can see that I have a few models deployed.
(09:09):
Moving to my load testing dashboard,
I can run a test to look at inference traffic
against my service.
I'll start the test.
If I move over to my Grafana dashboard,
you'll see that under load,
we're serving all the incoming requests.
This informs me how much capacity I should set
for provisioned throughput.
Now I'll move over to a configuration console
for setting provisioned throughput,
and I can choose the date range I want to see traffic for.
(09:32):
This bar chart time series conveniently represents
the waterline of provisioned throughput in blue,
and I can use this slider to match the level
of performance I want guaranteed.
I can slide it to match peak demand
or a level where there is constant, predictable demand
for most of the requests.
I'll do that and set it to around 70.
Now, if traffic exceeds that level,
by design, some users will get an error
(09:52):
and their request won't get served.
That said, for the requests that are served
and within my set PT limit,
the performance level will be consistent
even if other Azure users
are also using the same model deployment
and underlying infrastructure.
I can show this in a Grafana dashboard
with the results of this setting under load,
where it's still getting lots of requests,
but here on this line chart you can see
where provisioned throughput was enforced.
(10:14):
That's where we can use spillover
with another model deployment
to serve that additional traffic
beyond our provisioned throughput.
I'll change the spillover deployment option
from no spillover to use GPT-4o mini.
The model needs to match the model I used
with the PTU portion of served traffic.
Then I'll update my deployment type to confirm.
And now I'll go back to the Grafana dashboard one more time.
(10:36):
Here, you'll see where the standard deployment kicks in
to serve our spillover requests with this spike
on the same line chart.
Below that, we can see the proportion of requests
served using both standard and provisioned throughput.
That means all traffic will first see
predictable performance using provisioned throughput.
And if your app goes viral,
it can still serve those additional requests
at standard throughput performance.
And related to this, for tasks like fine-tuning
(10:57):
or high-volume inference,
we also support fractional GPU allocations.
You don't have to rent an entire H100
if you only need a slice of it.
- And this idea of renting GPUs makes me think
of other options out there, like GPU-focused hosters,
that are getting a lot of attention these days.
So where does Azure then stack up?
- Well, so Azure is more than just the hardware.
We built an AI supercomputer system
and our AI infrastructure is optimized
(11:19):
at every layer of the stack.
Starting with the state-of-the-art hardware
that we run globally for raw compute power
at the silicon level,
along with power management and advanced cooling
so that we can squeeze out every ounce of performance,
to, of course, our high-bandwidth,
low-latency network of connected GPUs.
There are also software platform optimizations
where we've enlightened the platform and hypervisor
to be able to access those networked GPUs
(11:40):
so that the performance is comparable
to running on bare metal.
We then have a full manageability and integration layer
with identity, connectivity, storage,
data, security, monitoring, automation services, and more
across more than 70 data center regions worldwide.
And moving up the stack,
we also support a variety of popular
open-source AI frameworks and tools,
(12:00):
including PyTorch, DeepSpeed, vLLM, and more,
so that you can build and scale your AI solutions faster
using the tools that you're already familiar with.
- And as we saw, at the top of that are your AI apps
and your agents, all running on the whole stack.
Now, last time that you were on the show,
you actually predicted accurately, I'll say,
that the agentic kind of movement was going to start next.
(12:21):
So, what do you think is going to happen in the next year?
- We'll see a shift from lightweight agents
to fully autonomous agentic systems.
AI is making it easier to build powerful automation
by just describing what you want.
This is only getting more pervasive for everyone.
And everything we're doing in Azure
is focused on enabling what's next in AI
with even faster inference and all the supporting services
(12:42):
to run everything reliably and at scale.
- And with things changing so rapidly from month to month,
I look forward to seeing where things pan out.
So, thanks so much for joining us today for the deep dive,
and thank you for watching,
and be sure to check out aka.ms/AzureAI.
Let us know in the comments what you're building.
Hit subscribe and we'll see you next time.