
March 13, 2025 44 mins
Welcome to Episode 397 of the Microsoft Cloud IT Pro Podcast. In this episode, Scott and Ben dive into the world of local LLMs: large language models that run entirely on your device. We're going to explore why more IT pros and developers are experimenting with them, the kinds of models you can run, and how you can integrate them directly into your workflow, including in Visual Studio Code for AI-assisted coding.

Your support makes this show possible! Please consider becoming a premium member for access to live shows and more. Check out our membership options.

Show Notes

Ollama
Running LLMs Locally: A Beginner's Guide to Using Ollama
open-webui/open-webui
LM Studio
LM Studio Model Catalog
Why do people like Ollama more than LM Studio?
A Starter Guide for Playing with Your Own Local AI!
host ALL your AI locally
Run your own AI (but private)

About the sponsors

Would you like to become the irreplaceable Microsoft 365 resource for your organization? Let us know!

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:03):
Welcome to episode 397
of the Microsoft Cloud IT Pro podcast recorded
live on 03/10/2025.
This is a show about Microsoft three sixty
five and Azure from the perspective of IT
pros and end users, where we discuss the
topic or recent news and how it relates
to you. We've been talking a lot about
AI recently, particularly

(00:24):
Microsoft Copilots.
But what if you want to play around
with AI outside of Copilot or ChatGPT
or any other hosted AI tool? In today's
episode,
Scott and Ben dive into the world of
local LLMs,
large language models that run entirely
on your device. We look at what models
you can run, how you can integrate them

(00:45):
into your workflow, and more.
Oh, Scott. Here we are back in the
stormy South. Stormy South. It has been stormy,
but it's bright and sunny now. So I'll
take it while I can get it. I
don't have anything to go with Nordic. From
the Nordic North, I'm back to the stormy
South.
From sea to shining sea and everything in

(01:06):
between the seas? As long as you count
Lake Michigan as the sea, which if you're
from Michigan, you do. Like, East Coast, West
Coast in Michigan are Lake Michigan and Lake
Huron. We don't really count oceans in Michigan.
Some of those lakes are kinda big, so,
you might even say they're great. They could
be great. It's always interesting.
Side topic, coming down to Florida and talking
to people about lakes and being from Michigan
and Lake Michigan

(01:27):
and Lake Superior and
they're like, but it's a lake. And I'm
like, yeah, but you can't see across it.
So it kinda looks like an ocean when
you're standing on the shore, and we get
waves that are like, well, I think the
biggest waves I've ever recorded in Lake Michigan
were like 25 or 26 feet, and Lake
Superior was up to 32
foot waves. It's like these are not just
like little lakes. These are massive bodies of

(01:49):
water.
They're really, really big ponds, you know.

(01:53):
So now we're going across the
pond, and that means across Lake Michigan.
We should ask an LLM what it thinks.

(01:56):
We should ask an LLM, because we're going
to talk about LLMs.

(02:00):
We're back
to those things again. So I wanted to
have a chat today
about
LLMs
and
running them locally. Like, I've been doing this
more and more, and I think there's an
interesting set of use cases and workflows. And
I was having a chat with you,
and this isn't something that you do in
your kinda day-to-day from what it sounds

(02:22):
like, but maybe I can, like, get you
in and and and hook you in a
little bit along the way. Oh, you've already
got me hooked. You sent me a few
YouTube videos, and I started watching them, and
the wheels started clicking.
And I have one of the browser tabs
up here. We'll put a link to it,
Scott, about a use case that I already
have for a local LLM.
And you definitely got my wheels
turning about

(02:43):
what possibilities there are about how some of
this works. In Microsoft three sixty five, I
have played around with Copilot. I know a
fair amount, but I've never really looked at
running them locally, and you've whetted my appetite
for this. So this will be an interesting
discussion, and I'm curious to see where it
goes, your thoughts,
my new

(03:03):
thoughts. And my expanding list, Scott, you added
something new to my list. I was doing
so good. It's been a hot minute, but
I I I think this is an important
one. So as we talk about
the
kinda growth
of
generative AI
and models along the way for,

(03:23):
you know, certainly the the copilots of the
world,
the OpenAI's,
Anthropic with Claude,
DeepSeek with r one,
all all these different kinds of things that
exist out there. So
they're they're nice that you can run them
in a service.
And I think most of us have kind
of grown accustomed to that, and and it's

(03:46):
it's a place that most of us are
comfortable. Like, we know how to sign in
to chat GPT on the web and maybe
either
have a chat with an LLM and and
do some structured prompting and and try and
get some responses out of it
versus
things like ChatGPT web search. And it's great.
Right? It's it's all cloud based.
Some of them are free. Some of them

(04:07):
cost money.
They really only start to get powerful
when they do cost money. So now you're
in the world where you're relying on this
external service. You're gonna pay per request.
And probably most importantly, there's a privacy angle
here where you're sending your data out into
the wild. Like, when you're chatting with
ChatGPT in the web interface,

(04:28):
you're passing that data to them. We saw
this with DeepSeek. When DeepSeek kinda came out
of the woodwork
a couple weeks ago and the market freaked
out. You know, they were about a month
behind freaking out when it had actually been
released. But that said, you know, DeepSeek immediately
had a data leak and people broke in
and they got all the usernames, they got
the passwords, they got the prompts that were

(04:48):
flowing through that system, things like that. So
I think one of the most powerful things
here is the ability to
run a local LLM
with
data privacy in mind. So I'm gonna run
these things locally. They're only going to be
on my machine. They're not gonna communicate with
the outside world. And then if you're in
that world of, you know, being a little

(05:09):
bit more cost conscious,
you might wanna try some of these things
out without paying per request in a service
like ChatGPT or Claude or something
like that. And in that world, you're gonna
also have a cost savings angle.
You're gonna have offline capabilities.
So the ability to chat with these models
locally can be a little bit interesting

(05:30):
and and how all that composes.
And, you know, I think the kicker is
most of us are geeks, and we run
around with these really powerful computers. You know,
you've got a laptop with gobs and gobs
of RAM on it, and it's running a
modern processor, it's got a GPU, it's got
an NPU,
you know, you might be sitting there at

(05:51):
home and you're like a PC gamer and
that's how you,
you know, just relax at the end of
the day. Well, guess what? You got that
monster GPU, you know, that 5090 or
whatever that you can potentially use during the
day with these things. And it turns out
that you might actually chat with local LLMs,
like, more than you think. You know, like
so we've talked about how we're Apple users,

(06:11):
so iOS, things like that. The predictive text
on iOS is all based on an LLM.
It's based on a transformer.
So that thing is running a local model.
Well, you can run those similar models on
your side. So it gives you this really
interesting opportunity to kinda take advantage of AI
while maintaining the privacy aspects, maybe letting you

(06:32):
play with new things. Like, if you wanna
play with DeepSeek without signing up for the
DeepSeek service,
like, hey, that that that that's a great
way to do it. So we'll talk a
little bit about that and kind of some
of the advantages and what you can get
on with. We should also talk about what
folks can actually run,
like, what's useful useful that can run locally
for you. So we're gonna talk a little

(06:53):
bit about, like, parameter size in a model
and how big these things are. So turns
out there's a big difference between a
1 billion parameter model, a 7 billion parameter model, a
65 billion
parameter model, or, you know, like I said,
if you wanna play around with DeepSeek,
I was watching some videos on YouTube of
people who are playing around with some clustered

(07:15):
servers
to do, like, 400 billion
parameter model runs. And, you know, you can't
run, like, 400 billion parameters locally. You need, like,
a distributed system, and, you you know, you
can potentially do it across a series of
servers within your premises. But that said, like,
those aren't for everybody. They're gonna be too
slow, cost a bunch for the GPUs, things
like that. So we'll talk a little bit

(07:36):
about that, about like parameters and, you know,
maybe where more parameters doesn't always mean, like,
better results. I think that's important too.
There there's a little bit of nuance and
kind of trade off here between
speed of response, like how many tokens can
an LLM respond back to you with, what's
the accuracy of that, and probably most importantly,
like what are the compute requirements on your

(07:58):
end. So like the things that I'm gonna
talk about that I run today, so I
rock an M1 Max MacBook Pro
most of the time, and that's kind of
like what I'm running on. And I've got,
you know, 32 gigs of RAM in there,
and and I'm all set in my world.
You have a a different model on a
different processor
with more memory and potentially more GPUs, so

(08:19):
you'll be able to run, like, maybe even,
like, bigger things than I can run here.
And that's okay. And then,
you know, your mileage may vary. But it's
kind of like anybody can get started with
these things, even on, like, a little,
you know, off the shelf NUC kind of
PC or things like that. So beyond chatting
with these things,
you can also use them to empower your

(08:40):
workflows.
So you can use local AI models with
Visual Studio Code. Like, you might sit out
and go and say,
I'm coding a dot net application.
Let me go find the best LLM model
for dot net applications,
but I don't wanna pay for it. I
I don't wanna, like, go to OpenAI or
Anthropic and and do the cloud thing, anything

(09:02):
like that. Well, maybe you can go out
and actually just download a model and run
it locally, and we'll kind of talk about
the hosting engines for these things that expose
things like standard OpenAI endpoints. So you can
literally point VS Code at a local LLM
and have it write you PowerShell and all
those things that are, like, private just to
your machine
without having to go out to the Internet

(09:23):
and get those kinds of things done. So
I think that's a fun little way to
kind of think about integrating these things
into your life and how they come together.
So we just kind of want to go
end to end and full circle between, can
you run your own chat GPT
like
thing, model
locally? And the answer is yes. So, yeah,

(09:43):
like we should just kind of have a
conversation
about that. So
why don't we start with like the whole
data and privacy cost efficiency thing and all
that stuff? I think that's one of the
ones that can be super important
that people think about. And kinda like you
said, the DeepSeek leak exposed millions of
sensitive
data records. One thing I've heard even when

(10:05):
you start looking at things like ChatGPT
versus
Copilot and Microsoft three sixty five and going
back to the local ones or doing OpenAI
in Azure or
something in AWS is it it very much
goes back to where does that data go.
Some people see rolling out Copilot as a
security benefit because then they're not taking all

(10:26):
that data from
SharePoint, from Teams, from their Microsoft three sixty
five tenant, sending it out into ChatGPT
where it's escaping that Microsoft three sixty five
boundary. OpenAI and Azure, same thing. If all
your data's up in Azure somewhere, if you're
working with Scott to store petabytes of data
in blob storage and you want that to

(10:47):
be used for OpenAI, you can do that.
But then you do get into this local
thing. All your data's
local. Or one of the scenarios I have
that we can put a link to is
I use Home Assistant for all my smart
home stuff because
I like everything
local. I don't want it all going out
to relying on Samsung or
any of those. What if you wanna integrate

(11:08):
AI into your local smart home stuff and,
again, you wanna keep it all internal? You're
in an industry where
you need to keep things on premises for
some reason or certain regulations around that. I
think there's a huge benefit to doing
local AI, whether it's at that small
in your house,
you and I type scenario of

(11:30):
smart home or something here or large enterprises
that have very stringent data requirements and need
to run it locally, maybe in their own
data centers in clusters that they build internally
and stuff. Home Assistant is a fun one.
So if you think about AI and Home
Assistant and what they're doing with like Home
Assistant voice and some of those things, it

(11:52):
relies on two paths. One is text to
speech. So can I have Home Assistant talk
to me? So some text goes in and
can I have it talk back to me?
And then it's also speech to text in
the form of things maybe
like Whisper,
which is, you know, typically what I see
integrated with most on that side. In fact,
we use Whisper for generating transcripts sometimes for

(12:15):
the show. So it's not just LLMs.
It could be things like text to speech,
speech to text. Could also be image generation.
Like, if somebody is looking to, like, play
around with stable diffusion, that that runs pretty
well locally
on most of these things as well. It
could be a little bit slow, but, hey,
that that's okay. That's that's part of the
trade off of not having to pay and

(12:35):
and push these things through. But I I
think the most important thing is just when
you're running an LLM locally, you're basically mitigating
a bunch of that risk
of having to worry about compliance, having to
worry about legal concerns. Like, hey, I'm submitting,
like, this thing that's important to me. Like,
I'm never like, for example, I'm never going

(12:56):
to chat with my taxes
with anything other than, like, a local LLM
to help me break some of that stuff
down.
But, you know, somebody else might be out
there, but good good luck when you're when
you're in the next data breach or or
or whatever happens. So there's things like that.
I think the other one that's important to
consider is kind of the cost angle of
things. Like, I'll be the first to admit

(13:17):
that I'm pretty frugal. So if you're thinking
about maybe like OpenAI and having to go
out and pay for OpenAI, and you're either
paying per request or you're on one of
the monthly plans. And those can get pretty
expensive. Right? If you wanna get up there,
you can spend up to, like, $200 a
month. But typically, they're on the order of,
like,
you know, 1¢ US per 1,000 tokens.

(13:39):
And then you're like, Well, what's a token?
Like, how many words comprise a token? Like,
it can be a little bit weird to
figure out the pricing. So sometimes you just
want to play around with these things locally
without having that cost constraint,
because costs can run away from you pretty
quickly, especially if you're being like super chatty
and doing longer chat threads and things like

(14:00):
that. Or the other place they tend to
get pretty expensive
is if you're integrating
these
AIs
into,
like, your coding workflows, like, hey, you're you're
out there and you're sitting there and you're
like, I wanna vibe code. Well, great.
When you're like vibe coding across 10,000 lines
of code, it starts
to add up and get pretty expensive. So

(14:21):
you already bought this, you know, honking computer.
You got a GPU. It's got CPU.
It's got a fast disk. You might as
well use it for a little bit more
than just writing your PowerShell scripts. Like, why
why are you sitting there writing in VS
Code by hand when, you know, you could
be just vibing your way through that stuff?
For sure. And I think that's one thing.
I guess I kind of always realized it

(14:42):
in the back of my head, comparing local
LLM
to ChatGPT to Copilot to cloud based, it
kinda struck me that
from a pricing perspective, when you're using cloud
based LLMs, you're not paying for the models.
Like, these companies, these models are all out
there,
whether it's
DeepSeek
or Llama

(15:03):
or any of those. What you're really paying
for is the compute to
process the request to these models, and that's
where that cost comes in. Do you wanna
spend it in on premises hardware and hardware
running in your house, or are you giving
it to these cloud providers for the hardware
out there running models that maybe you don't
physically have the capability of running on your

(15:24):
compute that you own? It is an interesting
one. The other thing that, you know, like,
once you get a little bit more advanced
and you start going down the path of
some of this stuff, if you really get
into it, you start looking at things like
fine tuning
and doing RAG or retrieval augmented generation against
things. So we'll put a link in the
show notes to a Network Chuck episode where
he talks about running local LLMs. And one

(15:46):
of the things that he does, he has
this really interesting use case where when he
attends church, all the sermons are transcribed,
and he uses local LLMs to summarize the
sermons for himself. Like, he doesn't always get
to attend live, but he still wants to
get the messaging out of it. So he
does all that stuff like local LLM, and
it's just all there ready to go. It

(16:07):
does the transcription,
like pulls it all off a YouTube thing,
transcribes it, runs it through an LLM, gives
him the summary, and then that summary is
written back as a markdown file where it
lands in Obsidian,
and then he can just use his network
brain in Obsidian to go and figure some
of that stuff out too. So you can
get pretty rich with these things if you
start to kinda, run through the use cases.

(16:29):
So where, like, NetworkChuck might be doing sermon summaries,
I might be working on coding a new
application,
and I just want it to learn off
maybe an existing code base from, like, the
previous two versions or iterations
or things like that that I did along
the way. So you can also do these
things like fine tuning and get up and
running
pretty pretty quickly. It's actually, like, turns out

(16:50):
a lot of the tooling's already out there.
Like, these things are
not the hardest thing to stand up. But
before we stand them up, we should also
probably talk a little bit about, like, what
kinds of models you can run
because your mileage may vary here based on
your your hardware and what's available to you,
your your network bandwidths, and a couple other

(17:11):
things.
Do you feel overwhelmed by trying to manage
your Office three sixty five environment? Are you
facing unexpected issues that disrupt your company's productivity?
Intelligink is here to help. Much like you
take your car to the mechanic that has
specialized knowledge on how to best keep your
car running, Intelligink helps you with your Microsoft

(17:32):
cloud environment because that's their expertise.
Intelligink keeps up with the latest updates in
the Microsoft cloud to help keep your business
running smoothly and ahead of the curve. Whether
you are a small organization with just a
few users up to an organization of several
thousand employees,
they want to partner with you to implement
and administer your Microsoft cloud technology.

(17:53):
Visit them at intelligink.com/podcast.
That's intelligink.com/podcast
for more information or to schedule a thirty
minute call to get started with them today.
Remember, Intelligink focuses on the Microsoft cloud so
you can focus on your business.

(18:15):
So talking hardware, do you wanna drive into
hardware or models? Where should we go? It's
kinda like a both conversation.
So I think we can cover kind of
the whole parameterization
question
and how big these things are to run
locally
along with some of the hardware
constraints
that
come along with them. So when you think

(18:35):
about the models that you can run, one
of the first things that's gonna happen is
you might go out and grab Ollama,
you might grab LM Studio. You're you're gonna
grab some system that's going to let you
basically
run that model and be able to run
prompts against it. So though those models are
gonna have different

(18:56):
sizes,
and those sizes equate back to parameters. So
you're gonna go out and you're gonna see
things like, oh, I wanna run Llama 3.
And Llama 3 might have,
you know, a 7 billion parameter model. It might
have a 300 billion
parameter model. It could have a 1 billion parameter model.
It could have something that's even smaller than

(19:16):
that. So these things start to kind of
become important. So if you're thinking about, like,
parameters, number of parameters in a model, which
is going to equate to
kind of functionality
within that model,
In some place, like a 7 billion parameter model,
if you're looking at, like, Llama 2 7B,
you're looking at Mistral 7B, like,
those are pretty good starting points, and you

(19:38):
don't need a super monster laptop or desktop
to do it, just something decent. So if
you have about 16 gigs of RAM and
some CPU, you're good. Like, you don't need
a dedicated GPU. You can absolutely do this
stuff on CPU.
I hesitate to say fast. It'll be fast
ish. It might feel a little bit slow,
like you'll see, like, the words typing out
on screen, but that's okay. That that kind

(20:00):
of equates to the experience that you might
have in a ChatGPT or a
Claude or things like that.
But they're also super lightweight.
So you you can get models that potentially
when you download the model, they're measured in,
like, hundreds of megs.
Some are in the gigabyte range. Like, if
you're in, like, a 7 billion
parameter model, you're talking about maybe, like, two
to three gigs of downloading a quantized model

(20:22):
and being able to chat against it. And
with 7 billion parameters, you'll probably find
that they're good enough for most tasks,
for most
personal tasks. Hey. Summarize this for me. Hey.
Give me a quick idea of this.
Translate this to this. Like, those kinds of
things, it's perfect. Hey. I wanna pump in

(20:43):
the transcript from a YouTube video and have
a local model summarize it for me. That's
an awesome job for, like, a 3 to 7 billion
parameter model,
things like that. You can get a little
bit bigger, and a little bit bigger is
typically gonna be in the something of, like,
10 to 30 billion
parameter range.
So
now you're getting a little bit more honking.

(21:04):
You're actually gonna need some GPU here, and
you're probably gonna need more RAM as well.
So, like, 16 gigs of RAM isn't gonna
cut it. You're probably gonna need something closer
to 32 gigs of RAM.
You're gonna need some kind of GPU to
drive that.
You know, I think you could maybe get
by on, like, an RTX 3090 or
something like that. You'd probably wanna be in,

(21:25):
like, a 40 series, like, a
4060 or 4070. Or if you're all
on board and, like I said, you're a
PC gamer and you've got that 5090
sitting in there, like, go ahead. Use it.
It's ready to go. Nobody has the 50
series. There were only, like, 10 of them
produced and nobody could buy them. Well, and
out of the 10 that were produced, 10
out of 10 were broken, so the the
yields are great. And melted power cables. Okay.

(21:46):
Anyways, sidetracked. Yes. But you're gonna need one
of those high end GPUs. Yeah. Well, you're
gonna need a GPU. Like, I think the
difference between, like, a 3 to 7 billion parameter model
and then you get up to those, like,
10 to 30 range
is, do I need a GPU or do
I not need a GPU?
So you can do the smaller models just
with CPU as long as you have enough
RAM. At some point, you're gonna want GPU

(22:06):
as well to go ahead and offload those.
So if you're thinking like, hey, my use
case for running a local LLM is doing
advanced coding, like, I'm I'm beyond, like, summarization,
and I want this thing to help me
write applications,
PowerShell scripts, bash scripts, anything like that, you're
probably gonna wanna be in that range where
you've got a little bit more RAM and

(22:28):
you've got a GPU,
and then you kinda find the model that
you like, and and that ends up being
your sweet spot there. After that, you get
into, like, the big, big models. So you're
into, like, 65 billion plus. I think I was watching
another NetworkChuck video. He ran one on a
cluster
of those Mac Studios. I think it was the
M1 Studios. It was, like, a cluster
of, like, six of those where he was

(22:48):
able to run, like, a 400 billion parameter model,
but it was only able to output,
you know, like, one
word
a second. Like, it's just so slow that
it's that it's not actually useful. Right. So
slow. A few times it looked like it
even got stuck and,
yeah, it was it was interesting. We'll put
a link to that video in the show

(23:08):
notes too. Yeah. So the way I think
about that, the really big models, they're basically
not there for, like, the faint of heart.
They're there if you know what you're doing,
if you've got the hardware to back it,
both CPU,
RAM,
and and GPU.
So if you think about it, like, there's
kinda like a way that you can just
break it down into a simple set of,
like, pros and cons. So when you're sitting

(23:29):
out there, you're in that, like, three, five,
seven billion range,
that's gonna be fast. You can do it
on simple low hardware,
or you can even do it on beefier
hardware. Like in my case, like when I'm
on my M1 Max, typically, I'm also running
Windows in a virtual machine. So that's typically
got half my RAM already. And then I've
got a little bit of RAM that's going
to the OS and things like that as

(23:50):
well. So even if I could run a
bigger model, I'm not going to because I'm
still having resource contention and other things. Like,
sometimes I don't wanna shut down my VM
or I don't wanna shut down VS Code
because I'm I'm using those things. Right. You
know, smaller models, fast,
commodity hardware,
good enough for
easy tasks. Like, sum summarize that transcript for

(24:11):
me thing, they're gonna be great for that.
You get into that middle range, probably your
sweet spot, like, if you do have a
little GPU to drive these things,
good accuracy,
more context awareness,
and kinda longer context windows. So as you're
chatting with these things, they can remember,
quote, unquote, big air quotes here. They can
remember what you previously typed with them. So

(24:32):
having bigger context windows and and more RAM
and VRAM from your GPUs to host those
context windows in becomes a little bit important.
And then, like, if you're,
you know, a monster gamer, you've got just
a bunch of these things laying around and
you wanna network them all together, it's super
easy to do that too if you got
enough hardware running around,
and and you can go and,

(24:52):
make that happen. So once you've kinda figured
out your your hardware and you've got a
sense for what you wanna do and what
you're gonna be able to run locally, well,
then you need a way to
run these
things
locally, which, you know, it's not a little
decision to make. Right. And another thing about
the hardware that I found interesting watching the
NetworkChuck videos as well was because we talked

(25:15):
about the Macs, Macs have, like, that shared
memory. They don't have dedicated video memory and
system memory. So one thing he was doing
was when he was running these models, like,
all the memory was going to process the
model because it doesn't have, like, those physical
boundaries
between
physical and system memory. So I think that
was another thing to watch out for. And

(25:36):
the other thing, because you mentioned networking,
he also found, like, running a 10 gig
network. Something I didn't realize because I've never
done this locally, how chatty these are if
you're running a cluster over a network. Super
chatty. He'd, like, saturated his 10 gig network,
and that appeared I would say, I don't
know that it was definitive in his videos,
but appeared to be the bottleneck

(25:56):
using these,
clustered studios.
So then he switched to Thunderbolt, which gave
him a 40 gig network essentially. And even
that, he managed to saturate, get a little
bit more speed out of it using Thunderbolt
as opposed to a 10 gig network. But
if you do start thinking of clustering
larger models,
networking is also huge when it comes into

(26:19):
the hardware for these things. I don't really
get into the network model kind of thing.
Like, I just don't have enough hardware running
around here at home to do it. I
I certainly think it's interesting if you can
get there. So, yeah, we can kinda talk
about that maybe with, like, more advanced stuff.
Yes. So on to software. So you got
your hardware. You got your software. I keep
seeing you sent LM Studio. I've not looked

(26:39):
at LM Studio. The one that always seems
to pop up for me both in
the Home Assistant world as well as in a
lot of the NetworkChuck videos is Ollama for
running these locally. Your decision here is
how geeky do you wanna be and and
what is your workflow?
So if your primary workflow
is
you just want to
chat with a a chatbot, like, you wanna

(27:02):
hop right in, you wanna download a model,
and you wanna be able to chat right
away in, like, a nice GUI and a
graphical interface,
LM Studio is great for that. There's like,
if you go out and you look this
stuff up and you hop on Reddit or
things like that,
there there's going to be that set of
folks out there who hate LM Studio
because it's closed source,

(27:22):
but, you know, I I'm just looking to
play with these things. So for what I
wanna do, it certainly
works
works great. Comes together, does what I need
it to do. That said, you can also
do Ollama,
and Ollama is gonna be more command line
driven, like you're gonna do more installations from
the command line, you're even gonna download your
models from the command line, so you're kinda

(27:44):
trading off ease of use there. There's pros
and cons to both depending on what you're
doing. LM Studio is great if you just
want to chat, you want to immediately have
OpenAI-spec'd endpoints
exposed maybe to things like VS Code locally,
and you just don't want to wire anything
up. You're looking for just like a one
shot install, and you're going to be one
and done. The other way you can do
it is you can go to Ollama, and

(28:05):
you can find your model that you wanna
run on there. So, you know, I wanna
run Llama 2 7 billion, and you'll go
download that, and you're gonna do all this
from the command line. Now you wanna chat
with that thing.
Well,
you can certainly chat with it from the
command line. That that's totally a possibility. If
if that's your jam or your jelly, awesome.

(28:25):
Go for it. But if you want to
chat with it in a GUI, now you
gotta go install something else. Like, you might
have to go install
Open WebUI
to to to get that piece going and
and stand all that up. So it's not
like it's hard to do. It's just your
your flavor and and and where you sit
and where you wanna land.
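For anyone who wants to see what that looks like in practice once a model is pulled, here's a minimal sketch of chatting with an Ollama-hosted model from a script instead of the terminal. The port, the llama3 model tag, and the response shape are assumptions based on Ollama's documented defaults, so adjust them to whatever your own install reports.

```python
# Minimal sketch: chat with a locally hosted Ollama model over its REST API.
# Assumes Ollama is running on its default port (11434) and that a model
# tagged "llama3" has already been pulled, e.g., with `ollama pull llama3`.
import requests

def chat_locally(prompt: str, model: str = "llama3") -> str:
    # Everything goes to localhost, so the prompt never leaves your machine.
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,  # one JSON response instead of a token stream
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

if __name__ == "__main__":
    print(chat_locally("Give me three reasons to run an LLM locally."))
```

The same idea is what a front end like Open WebUI is doing under the hood: it's a friendlier interface sitting on top of that local endpoint.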
You know, if I'm looking to just do

(28:46):
things quickly and, like, I'm just in there
to maybe, like, oh, hey. I see Microsoft
released a new model, Phi-4,
and they, you know, just
pushed new models for Phi-3 and Phi-4,
and I I wanna compare those two things.
I'll probably just spin those up in LM
Studio. Super easy. Next, next, next my way
through it. I don't have to remember a
bunch of command line parameters, things like that.

(29:08):
If I'm doing more like application development and
I'm thinking about, like, hey. I want to
stand this thing up. I wanna have it
running in the background. I want some endpoints
that are exposed. Maybe I can build, like,
an app that's doing, like, some light RAG
or some fine tuning on top of it,
and I've got, like, a Python script over
here that needs to talk to the model.
Awesome. Great. Like, that's that's where Ollama sits,

(29:28):
and it has its space ready to go
for you. So much like picking a model
size, you're you're just doing a pros and
cons and a little bit of a trade
off thing. So Ollama, if you want a
simple command line experience and you're comfortable at
the terminal, go for it. Windows, macOS, Linux,
it's all there. LM Studio, if you're not
opposed to closed source and you just want

(29:51):
a GUI from the start for all the
things, for downloading, for chatting, for,
all all that stuff. Again, macOS, Windows, Linux,
ready to go. It's just closed source versus
open source is really how I think about
it. And then if you really do go
down the Ollama path, you're probably gonna end
up in a space where you wanna run
a local chat UI, like a web based

(30:12):
chatbot style thing,
and
then you'll just use something like Open Web
UI for that. And, again, super easy to
install. You're just basically hosting a little
a little web server locally that knows how
to
chat with chat with that model. And then
it could be a little bit different depending
on like the extension tooling that you're going

(30:32):
to use from there. So I talked about
maybe like integrating VS Code with one of
these locally.
So if you're doing
VS Code, you're gonna typically go grab an
extension. So there's things like CodeGPT,
there's Continue.dev,
there's an Ollama extension, which can actually just
talk natively to your Ollama endpoint.
Or like I said, LM Studio exposes OpenAI

(30:55):
compatible
endpoints. So that's kind of a known, like,
you know, web interface that you can throw
a request at in a structured way,
and it will respond in a in a
way that most of the extensions
are going to understand
and get you ramped up for and ready
to go with. Yeah, looking through this and
most of the videos I saw, and again,
were all Ollama, even the command line based

(31:17):
looked really
simple, lots of guides to just walk through,
type this in, this is how you tie
that in, this is how you go stand
up the WebUI,
point the WebUI to all of those.
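To make the editor integration concrete, here's a minimal sketch of talking to a local OpenAI-compatible endpoint with the standard openai Python client; this is the same endpoint shape that extensions like Continue.dev or CodeGPT typically get pointed at. The base URL, port, and model name below are assumptions (LM Studio's local server commonly sits on port 1234, and Ollama exposes a /v1 compatibility layer on 11434), so use whatever your server actually lists.

```python
# Minimal sketch: hit a local OpenAI-compatible endpoint with the openai client.
# Works against LM Studio's local server or Ollama's /v1 layer; the URL and
# model name are placeholders for whatever your local server reports.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # e.g., LM Studio's default local server
    api_key="not-needed-locally",         # local servers generally ignore the key
)

completion = client.chat.completions.create(
    model="local-model",  # placeholder; use the model name your server exposes
    messages=[
        {"role": "system", "content": "You are a concise PowerShell assistant."},
        {"role": "user", "content": "Write a one-liner that lists the five largest files in the current directory."},
    ],
)
print(completion.choices[0].message.content)
```

Swap the base_url and nothing else has to change, which is why the OpenAI-compatible endpoint ends up being the convenient lowest common denominator for these tools.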
So none of this
really seemed that complicated in everything I watched
and, again, made me excited, like, I need
to go try this out and go find

(31:38):
a computer that I can
absolutely bury with a model. See what I
can do. See what damage I can do
to my computer, Scott. It is not hard
to do. So the other thing that you
can do, if you're comfortable on the command
line, there's another project out there that's called
Fabric.
So Fabric is kind of a framework that
allows you to easily network and

(31:59):
distribute traffic across multiple nodes, but you can
also do it on a single node. So
So I was talking earlier about, like, that,
you know, sermon summarization thing. Yep. And that's
all based on Fabric. So Fabric, again, command
line, it can run with local LLMs. It's
a little kinda
opaque for for how it does it. So,
you know, make sure you download one of
the newer versions of it. And Fabric
is all run from the command line as

(32:21):
well. But then you can super easily integrate
Fabric into things like bash scripts. So, like,
I use it for the same thing. Like,
if I think about the the podcast, I
just have a bash script that runs Whisper
locally. So Whisper is
a speech to text model from OpenAI. Yep.
And I can run that locally. Like, that
runs on my hardware just fine. So I've

(32:41):
just got a little bash script that takes
that, generates the transcript, and then I just
pipe the summaries out into Fabric to have
those for myself
in just my notes on the side. Right?
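As a rough sketch of that transcribe-then-summarize pipeline in one place, here's what it might look like in Python using the open-source openai-whisper package and a locally hosted model for the summary. The file names, the Whisper model size, the llama3 tag, and the Ollama endpoint are all placeholders rather than the exact setup described here.

```python
# Minimal sketch: transcribe audio locally with Whisper, summarize the
# transcript with a locally hosted model, and save the result as markdown
# (e.g., to drop into an Obsidian vault). Assumes the openai-whisper package
# is installed and an Ollama server is running on its default port.
import requests
import whisper

audio_file = "episode.mp3"                      # placeholder path
asr = whisper.load_model("base")                # small, CPU-friendly Whisper model
transcript = asr.transcribe(audio_file)["text"]

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",                      # placeholder local model tag
        "prompt": "Summarize this transcript as short bullet points:\n\n" + transcript,
        "stream": False,
    },
    timeout=600,
)
summary = resp.json()["response"]

with open("episode-summary.md", "w", encoding="utf-8") as f:
    f.write("# Episode summary\n\n" + summary + "\n")
```

Fabric wraps this kind of flow up with reusable prompt patterns, but the moving parts are the same: local speech to text, local summarization, and a plain file at the end.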
Like, hey, here's the things we talked about
and and how they're coming together. So
very, very, very easy to get on with
this stuff. And I think for most of
our audience as well, like you folks are
all comfortable on the command line. You don't

(33:03):
need a GUI for this stuff. You can
follow some instructions
and wire these up. And we're not talking
like super complicated things. We're basically talking the
equivalent of like a brew or a chocolatey
install or a Winget install, like just little
one liners to get all this stuff up
and running. Absolutely. You don't need to go
write 50 line PowerShell scripts or pipe a
bunch of things. It's really straightforward from everything

(33:24):
I saw. Super easy to get up and
going with that. I would say, like, the
other thing you might wanna do a little
bit is
when you're exploring models.
So if you go into, like, LM Studio
and you're going through their model catalog or
you're on, Ollama and you're exploring their model
catalog, you might wanna just start with, like,
some of the more popular ones to get

(33:45):
up and running. So, you know, there there
are differences between these things,
you know, depending on what you're doing. Like,
you can't go ask DeepSeek
what happened in Tiananmen Square. Like, that is
not programmed into that model,
e even in the one that you you
download and
you run locally, but, you know, you can
do that with, other stuff. So these models

(34:06):
all vary. The other thing that you can
do is you can go through the model
catalogs,
and you can find models that are purpose
built for certain things.
So there are models that are generated within
these families. So you talk about, like, Llama.
There's gonna be versions of the Llama model
that are better for doing coding assistance things
with it than there are for doing just
straight one shot text summarization,

(34:29):
stuff like that. So
you you have to think through that a
little bit too, like, just what's your workflow
and
what are you trying to
get at
along the way?
And then be prepared for a little bit
of latency
and maybe differences in perf when you're running
with these things. I think lots of people
set out and they say, oh, I'm gonna

(34:49):
be able to run that model locally, and
it's gonna be so much faster because it
doesn't need to go out and talk to
the Internet. Like, it doesn't need to talk
to Claude. It it it doesn't need to
talk to chat GPT, anything like that. Yeah.
Like, absolutely. You've eliminated the latency of that
whole, like, request response thing having to traverse
the Internet,
but you still have to have the hardware
that's capable of running this and standing it

(35:10):
all up. So you might wanna even, like,
play around before you integrate these things. Like,
if you're interested in, like, a coding workflow
with or integrating with VS Code, things like
that, you'll probably wanna play around with the
the models a little bit locally to find
the one that's got the the sweet spot
for you based on
number of parameters, your hardware, things like that
before you go down the path of integrating
it in VS Code and then being disappointed

(35:30):
that it's too slow or or things like
that. There's a lot of blogs out there
that'll just tell you, like, oh, running AI
locally, like, it's super fast. It's it's super
easy. It is super easy. It's not always
super fast. So you so you do have
to be prepared for that depending on your
hardware. Yeah. Along with the model, Scott, this
is another thing again, being fairly new to
this, have you
compared at all? Because another thing you can

(35:52):
run into is quantization of these models. Right?
And this is something else Network Chuck talked
about in one of his where some of
these larger models, they quantize.
I don't know if that's the word. They
quantize them down, and it sounds like it's
essentially taking
different aspects of the model. And inside
models, they have model weights with, like, 32

(36:14):
bit precision, and they reduce these down to
eight bit, four bit precision,
which makes them not as accurate but makes
them smaller so you can run a
larger model. Some of those bigger ones we
talked about, like 65 billion
plus parameters,
on less hardware,
but without

(36:34):
the accuracy
versus running maybe a model with less parameters,
where you get the full
model weights in there, where you're
running the 32 bit precision instead of quantizing
them down. Again, when you're downloading models, definitely
something to watch out for because if these
are quantized
and they're smaller, they can be less

(36:54):
accurate, you can run them. Like, have you
ever compared those? Like, let's go run a
30 billion parameter model on local hardware versus a
65 billion
parameter model that's quantized down to
eight bit instead of 32 bit? I don't
think many folks are running 32 bit. Most
are probably running

(37:15):
Four bit. Some kind of like well, something
like 16 or lower, so like four, eight,
16. I think when you go out and,
like, you watch a lot of YouTube videos
and and,
you know, if if you do go down
this path and you start getting into it,
I think YouTube is a great place to
go to and start to see. You'll see
lots of people playing around with massive models,
but with,

(37:37):
like, only, like, four bits. Right. So they're
doing that just so they can run it,
not so they can run it effectively to
drive a workflow. Like, they're just trying to
try it out and see how many tokens
a second they can get out of it
or something like that. So a four bit
model is
absolutely going to
run on, like, consumer grade GPUs, CPUs.
Like, you're gonna be all good, ready to

(37:58):
go there, but you have to know that
it's been extremely compressed. So it can get
it down to a smaller download size,
and thus, it's going to take less memory
and less processing power to go ahead and
run it. So you might be running like,
you know, like if I think about, like,
the transformer that's running in iOS,
that's probably

(38:19):
a a a four bit model. Right? Like,
it's sitting there. It's running on commodity hardware
Right. And it's just doing what it needs
to do. Now, if I'm on my desktop
or my M1
MacBook, you know, I might be thinking about
an eight bit model, and I'm okay with
the performance trade off. Like, I'm I'm okay
if it chats with me at, like, you
know, like, two tokens a second kinda thing.

(38:40):
Like, it can be super slow. It's it's
okay. But you're not gonna run these, like,
massive models because those are absolutely running in
those
massive data centers and and that set of
infrastructure. Like, I I just wanna be clear.
Like, you can't do the things that, like,
ChatGPT
can do with, like, o1 running in
their data center
locally at your house. Like, that's just not

(39:01):
the way these things work. It's it's not
how they come together. So if you think
about, like, the the whole quantization thing, it's
all about
packing things down and basically, like, archiving them,
right? Put a tar or zip together of
this
thing and reduce the size, reduce the computational
requirements,
all that kind of stuff. So you're going
to get small models. Hey, that's great. They're

(39:23):
going to use less memory,
and you might be able to run a
larger model. Like, you could run a four
bit, you know, 30 billion parameter model, but it's
gonna be less accurate. And is accuracy important
to you? Well, you might wanna go to,
like, an eight bit, like, 7 billion parameter model,
something like that. So it it's gonna be
very dependent on, like, your workflow

(39:43):
and
your
use case for these things.
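To put rough numbers on that trade-off, the back-of-the-envelope math is that the weights alone take about (parameter count x bits per weight) / 8 bytes; this ignores the KV cache and runtime overhead, so treat the results as floor estimates rather than exact requirements.

```python
# Back-of-the-envelope estimate of weight memory for a quantized model:
# bytes ~= parameters * bits_per_weight / 8 (ignores KV cache and overhead).
def weight_memory_gb(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9  # gigabytes

for params in (7, 30, 65):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit ~ {weight_memory_gb(params, bits):.1f} GB")

# Roughly: a 7B model is ~3.5 GB of weights at 4-bit (in the ballpark of the
# few-gig quantized downloads mentioned earlier), while a 65B model still
# needs ~32.5 GB even at 4-bit, which is why the big models want big hardware.
```

That's also why quantization and parameter count have to be weighed together: dropping from 16-bit to 4-bit cuts the footprint by roughly four times, which is often the difference between fitting in RAM or VRAM and not fitting at all.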
I think the biggest thing you miss out
on is accuracy.
So, you know, like, if I'm summarizing
the podcast transcripts, I want those to be
kind of accurate. Like, I I don't want
them to just be hallucinating all over the
place.
But,
you know, if I'm doing something else, like,

(40:04):
hey. Help me write a poem about, you
know, iPads. Like, whatever. Do it with all
the least accuracy
that you want out there
along the way. I think the most common
thing, like so the other thing you run
into with quantization
is
there there's a bunch of different methods for
this. So
there's
Q,
which is basically like four bit quantization.

(40:27):
There's another format called
GGUF. So that's kind of,
like, the standard for running these things
efficiently. So you'll see a lot of these
things when you go in like, what's the
format of the model? Oh, it's a
GGUF. I don't even know how
it's pronounced.
But, you know, you can go in and
and grab those things
and and figure those out. So you can

(40:48):
think of, like, quantization maybe as, like, another
weight that you can put on that scale
when you're trying to find that balance between
model size, parameter count, quantization,
and the hardware that you run and the
workload that you wanna do. So how does
that scale tip and where do you wanna
land? It just becomes another consideration in
there for you. Sounds good. Anything else before

(41:09):
wrapping this episode up? So a couple things.
If folks haven't done this yet, like Go
do it. You should totally go out and
just try and play around with Ollama or LM
Studio.
If you're already doing it today, come back
and give us some feedback. Let let us
know what you're using it for. I think
there's all sorts of interesting use cases for
this stuff. We're just getting Ben started on

(41:29):
his list. Let's make his list a lot
longer for things that he is missing out
on in his life. Home Assistant and AI. He
needs to go run a
chat model locally. And then if you're doing
other things besides chat models, like I said,
there's the stable diffusions of the world, there's
image generation, there's whisper, there's all these other
things out there. I was very surprised at

(41:50):
how approachable they are. I always thought this
was going to be like mystical dark arts
and magic and not for mere mortals kind
of thing. It's very much for mere mortals,
Like, super easy to get started with, super
turnkey,
and I would guarantee that almost anybody who
listens to this podcast probably has the hardware
to run this stuff and make it happen.
I'm actually excited to go try this out

(42:11):
and play around with it. I did find
an article too on a Raspberry Pi cluster
for AI. I don't know if I'm gonna
try that or use an extra Mac mini
I have sitting around here to start, but
I, like you, I would love to hear
what other people are doing. If you are
running them locally, what are you using them
for locally,
different use cases,
where have you found a good place to
start, all the things.

(42:31):
So if you do want to join us
and discuss these things,
we need to redo our outro, Scott, because
I think that has changed. I think we
actually still have Twitter in it. Let's not
say Twitter. Let's say probably Blue Sky. Are
you more active on Blue Sky right now
than any other one? Pick one anyway. Anyone
that's not Twitter, you can find Scott on,
except that I can never find you on
Blue Sky because you chose a weird handle

(42:51):
that isn't the same as any of your
other social media. You need to go grab
a new handle on Blue Sky that matches
everything else. I would say Blue Sky is
probably where I'm the most active
as of late and where I feel like
the biggest
tech community has
moved to. So go chat with us on
Blue Sky. LinkedIn is another good one. I'm

(43:11):
always on LinkedIn.
So if you wanna chat, give us feedback
on LinkedIn,
you can do that. If you wanna sign
up for membership, we still have our membership
at msclouditpro.com/membership.
Todd's in
Discord today. He got a new laptop that
he's gonna go try to run some LLMs
on. So if you wanna join us, chat
with us during the recording. You can go
check out our membership

(43:32):
options
there as well and join us in Discord
for these. So
looking forward to hearing from people how you
use LLMs, what you're gonna do with LLMs,
and how they run locally. Who can bury
their computer first and
crash it? Super easy to do. Yeah.
Anything else? I think that's it. As always,

(43:53):
thanks, Ben. Alright. Thank you, Scott. We will
talk to you later.
If you enjoyed the podcast, go leave us
a five star rating in iTunes. It helps
to get the word out so more IT
pros can learn about Office three sixty five
and Azure.
If you have any questions you want us
to address on the show, or feedback about
the show, feel free to reach out via

(44:14):
our website, Twitter, or Facebook.
Thanks again for listening, and have a great
day.