Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:01):
How'd you like to listen to dot NetRocks with no ads?
Speaker 2 (00:04):
Easy?
Speaker 1 (00:05):
Become a patron for just five dollars a month. You
get access to a private RSS feed where all the
shows have no ads. Twenty dollars a month will get
you that and a special dot NetRocks patron mug. Sign
up now at patreon dot dot NetRocks dot com. Hey
(00:34):
guess what it's dot NetRocks episode nineteen forty four. I'm
Carl Franklin.
Speaker 2 (00:39):
And I'm Richard Campbell. Nineteen forty four, Richard. I'm looking forward
to the end of World War Two. Yeah, it's the
beginning of the end. So nineteen forty four, the Allies
launched D Day, the largest amphibious invasion in history, landing
troops on the beaches of Normandy, France on June sixth,
marking a turning point. In August, the Allied forces liberated
(01:02):
Paris from Nazi occupation. You're welcome. In December, here's an
anecdote to go with D Day if you like. Yeah.
Out of concern for the soldiers on D Day, they mass
produced penicillin for the very first time. There were two
and a half million doses of penicillin made for the
D Day invasion. That is so awesome. So post World
(01:22):
War two, the reason we have antibiotics was that preparation. Yeah.
Speaker 1 (01:27):
In December, the Battle of the Bulge, the Germans launched
a major counter offensive in the Ardennes region of Belgium.
Speaker 2 (01:34):
Did I say that right? Arden? The Ardennes? Yeah.
Yep, the Ardennes.
Speaker 1 (01:38):
But the Allied forces eventually repelled the attack and in Rome,
three hundred and thirty five Italians were killed in the...
Here's another thing I had to pronounce correctly in high school:
A R D E A T I N E, Ardeatine. All right?
Speaker 3 (01:56):
Right?
Speaker 1 (01:56):
Ardeatine. We're going with that, A R D E A T
I N E, the Ardeatine massacre, including seventy five Jews and over
two hundred members of the Italian resistance, from various groups.
Speaker 2 (02:07):
So yeah, it's sort.
Speaker 1 (02:08):
Of the beginning of the end, the unwinding and leading
up to the following year, nineteen forty five, which ended it.
Speaker 2 (02:17):
Right. Yeah. It's also the year that the first plutonium
was ever made, at the Hanford site in Washington, which will
eventually lead to the bomb at Nagasaki. Yeah. And the Harvard
Mark One, built by IBM based on a design
from a professor at Harvard: thirty five hundred relays and a
fifty foot long camshaft, because computers were different back then. Yeah,
(02:39):
they were. And famously, because it's a relay based computer,
the next version of this, they call, cleverly, the Mark Two, yeah,
will have a moth get trapped in one of the relays,
which Grace Hopper will find and remove and call the bug,
and that will be the first bug, first bug in
the machine. Yeah. We don't use a lot of relays
in computers anymore.
Speaker 1 (02:59):
Yeah. And before we get started with Doctor Burchell, I
wanted to just have you comment on the amazing recovery
of the astronauts on the space station that happened this
past week.
Speaker 2 (03:11):
Really not that amazing. It was so perfectly, you know...
it wasn't an unexpected thing. Butch and Suni are both
very experienced astronauts. When there were concerns about Starliner, they
sent up the next crew on a Crew Dragon with only two passengers, so they
had the two additional seats for them to come back
at any time. Yeah. But since they had two extremely
(03:35):
qualified astronauts already up, why pay to send them back
down when you can put them to work and in fact,
they put Suni in charge of the mission. She took
over as mission commander for the station for the duration.
Speaker 1 (03:48):
And she and Butch were happy to stay there. They
were like, no, we don't want to come home.
Speaker 2 (03:52):
Come on. Totally. They were never going to get to
fly again. Those are retired astronauts, right, Yeah, so they
got a great gig. Now it's going to take them
more than a year to recover, which is also normal
for a six month stay, and they had a nine
month stay. Mark Kelly did a year, and you can
read his book on this, Like, recovery is not a
trivial thing. Yeah, I was watching him being interviewed. You know,
(04:14):
you haven't walked on your feet in nine months, your vestibular
system's messed up, your eyes have been bent out of shape. Like,
it's not a small problem, right to recover from this.
Speaker 1 (04:23):
Yeah, I watched them being interviewed on the news when it
was happening. It's just still amazing to see that
Falcon booster land.
Speaker 2 (04:31):
Land on its tail perfectly.
Speaker 1 (04:33):
Always it always is just going to be amazing to me.
Speaker 2 (04:36):
Yeah, no, it's, it's a miracle. The crazier thing,
it really is, is that Starship booster being caught out of
the air. It's literally a twenty story, two hundred ton
building that flies, yeah, and they catch it out of
the air. So yeah, we are in amazing times. So
the space industry has been fundamentally changed by this, right.
The cost of flight is so much lower. It's hard
(04:59):
to even get your head around what's actually going on
up there right now. So it's very cool with the proliferation.
That was a very good experience for me this week.
I felt very good about it, all right.
Speaker 1 (05:08):
So yeah, so that's a cue for me to roll
the music for Better Know a Framework.
Speaker 2 (05:12):
So that's awesome. All right, man, what do you got? Our
good buddy Simon Cropp, the genius. Simon Cropp
the genius. This guy is just, he's so brilliant. He's
brilliant and he comes up with solutions for things that
you didn't even know you needed. Yeah.
Speaker 1 (05:33):
But this one is called Cymbal. It's a NuGet
package and it's an MSBuild task that enables bundling
dot net symbols for references with a deployed app.
Speaker 2 (05:44):
Nice.
Speaker 1 (05:44):
The goal being to enable line numbers for exceptions in production.
Speaker 2 (05:50):
Oh okay, that's interesting.
Speaker 1 (05:52):
Yeah, because I guess you don't get that. Yeah, yeah,
and this is this is what it does. So if
you're in production you have an exception and yeah, I
guess you log it, you're gonna see line numbers, all right, Yeah.
Speaker 2 (06:06):
That's cool. You got to know he had that problem, right, like, yeah,
this is clearly a guy who built the thing to
fix a thing that he had, and now we all
get to benefit.
Speaker 1 (06:14):
Another alternative, I guess is just deploying the debug symbols
with it, and now you're slowing things down in production.
Speaker 2 (06:20):
So yeah, it's a lot more weight than just yeah,
you know, use this library.
Speaker 1 (06:25):
So thank you, Simon. And it's SimonCropp slash Cymbal on
Speaker 2 (06:29):
GitHub continues to be awesome.
Speaker 1 (06:31):
C Y M B A L. Yeah, the musical thing, the musical thing,
all right?
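(A quick note for anyone who wants to try it: wiring in an MSBuild-task package like Cymbal is typically just a project-file reference. The fragment below is a minimal sketch, not the project's documented setup; the version number is a placeholder and PrivateAssets="all" is just the usual convention for build-only packages, so check the SimonCropp/Cymbal readme for the real details.)

```xml
<!-- Hypothetical csproj fragment: pull the MSBuild task into the app you deploy.
     Version is a placeholder; confirm the exact properties in the repo readme. -->
<ItemGroup>
  <!-- PrivateAssets="all" is the usual pattern for build-only packages. -->
  <PackageReference Include="Cymbal" Version="x.y.z" PrivateAssets="all" />
</ItemGroup>
```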
Speaker 2 (06:35):
Who's talking to us? Richard grabbed a comment off of
show eighteen thirty five, the one we did with
our friend Mads Torgersen talking about the next C sharp,
because we've got a great comment, LLM related. This is
from Murray, who said: Mads mentioned making sure language features
work with the tooling, such as ordering and LINQ syntax.
Increasingly, with Copilot and other LLMs, this is part of
the tooling. Yes. True. Obviously this is a year ago,
(06:56):
this comment, so you know, so much change has happened. It's challenging.
So given a piece of code using a new C
Sharp language feature, which is what Mads was talking about,
have you tried asking ChatGPT or Copilot or some other
LLM to describe how that code works? If it
gets it right, does it mean it's intuitive? Is an
(07:16):
LLM's intuition, and at least you put that in quotes
because there is no intuition in software, a
good approximation for the one that human programmers have, or
a bad approximation? And if programmers are using Copilot, does it
matter about the human's intuition or the LLM's? Let's
complicate this fact with next year's LLM, that would be now,
which will probably be profoundly different. Yes. So, having said
(07:40):
all that, it's probably best to just aim for the
human and let the LLM catch up. Yeah, no intuition
in software. The reality is, of course you would expect
it to not understand a new language feature. There has
to be some time for that language feature to be
documented properly. The good news being, as they keep regenerating
these LLMs on a regular basis, and Microsoft builds these
(08:01):
features in public view on GitHub even before it ships,
it's likely in the knowledge base that is the LLM. Yeah, curiously,
you know, on my last trip to Microsoft, talking to folks
about what they're using, they've been using Claude Sonnet three seven.
That's their favorite for working in dot net, which isn't
(08:22):
that funny? Fascinating. But you know, that's where it's at.
So Murray, you're right, let's focus on the human understanding
the language the most, because the software is only going
to generate what it's got in its model, and it's
up to you to assess it, although admittedly the compiler
has a say also. Yes. And a copy of Music to
Code By is on its way to you. If you'd like
(08:42):
a copy of Music to Code By, write a comment
on the website at dot NetRocks dot com or on the Facebooks.
We publish every show there, and if you comment there
and we read it on the show, we'll send you a copy of
Music to Code By.
Speaker 1 (08:50):
And if you don't want to wait for that, or
you have other ideas and you just want to buy
Music to Code By, you can go to musictocodeby
dot net, and track twenty two is newish, and
you can get the entire collection in MP three, FLAC or
WAV for a very good deal. It's a very good price.
So happy coding. All right, well, let's bring on Doctor Burchell.
(09:14):
Doctor Jody Burchell is the developer advocate in data science
at JetBrains and was previously a lead data scientist
at Verve Group Europe. She completed a PhD in clinical
psychology and a postdoc in biostatistics before leaving academia for
a data science career. She has worked for seven years
as a data scientist in both Australia and Germany, developing
(09:37):
a range of products including recommendation systems, analysis platforms, search
engine improvements and audience profiling. She's held a broad range
of responsibilities in her career, doing everything from data analytics
to maintaining machine learning solutions in production. She's a longtime
content creator in data science across conference and user group presentations, books, webinars,
(10:01):
and posts on both her own and the JetBrains blogs.
In other words, a slacker.
Speaker 2 (10:09):
It occurs to me, Jody, that you and I hang
out several times a year at various conferences. But I
don't know that Carl's had time with you since we
did that show at Techorama. Techorama was the last time
I saw you, now a couple of years ago.
Speaker 3 (10:20):
Yeah, yeah, exactly, So it's been a long time actually, Yeah.
Speaker 2 (10:25):
Things have changed. You're at JetBrains now.
Speaker 3 (10:27):
I have, certainly I think changed a lot. Yeah, yes, yeah, yeah,
I was at JetBrains when we first met as well,
but I think I had only been there just over
a year and so I was still like, I don't know,
a little bit more shy, I think, a little bit
less opinionated.
Speaker 2 (10:45):
You've been hanging around with the troublemakers for a while.
Speaker 3 (10:47):
Now, yeah, you talking about you?
Speaker 2 (10:49):
Yeah?
Speaker 3 (10:49):
Actually, well, and we're going to be hanging out in
my hometown of Melbourne next month.
Speaker 2 (10:59):
Yeah, we're excited about that, yeah, NDC. Yes, so, And
of course I've got family in New Zealand, so I've
got to do a little time in Sydney to see
some folks there, and then I'll be in Melbourne for
the show with you, and then a week on the
farm hanging with the cows and the cousins and the
sheep and the sheep, No sheep, the sheep, what sheeps?
(11:20):
The South Island thing? No sheep on the farm. No, no,
it's it's it's a dairy farm. Dairy farm. Yeah. And
by the way, cows are awesome. Sheep are dumb, dumb,
dumb dumb, holy cow dumb. But they're tasty. Like,
Jody says they're cute. I say they're tasty. That's where
my mind is at. The cows
(11:41):
are smart enough that if they're actually having distress, you know,
in birthing or anything, they will come for help. Wow. Right,
Like, they're bright, and they follow, they
follow the gates and the paddocks where you want them
to go. But it doesn't mean they don't know how
to open them themselves if they really wanted to. I've
seen them do it. Yeah, damn, they're just playing along.
Cows are great, they really are. And LLMs are great.
Speaker 3 (12:01):
Right in the right settings. Yeah, they are great.
Speaker 2 (12:06):
Yes. But even at that show we did in twenty three,
you know, you were the grown up in the room there.
It's just, like, listen, there were limits like that.
We were so hypeish in twenty three, not that it's
all calm and rational in twenty five, but it's so...
Speaker 3 (12:21):
Funny actually, because I remember I was this was the
first talk I did on LMS, so that one at
Techorama actually was the first one I ever did.
Speaker 2 (12:29):
No free lunch.
Speaker 3 (12:30):
Yeah yeah, yeah yeah, And I was I was actually
really scared of getting up and giving my opinion, like
being a contrarian. Obviously, I'm feeling so vindicated right now.
Speaker 2 (12:39):
But it's nice being right, isn't it?
Speaker 3 (12:41):
It's great being right, but, I will say, like,
the hype has died slower than I thought it would.
So I think DeepSeek finally has spelled the beginning
of the.
Speaker 2 (12:51):
End, but not the end of the business, but the
end of the hype cycle.
Speaker 3 (12:57):
The end of the hype cycle.
Speaker 2 (12:58):
Okay, I appreciate that. The approach...
Speaker 3 (13:00):
to how we're going to be, I guess, manufacturing these models,
deploying these models, and thinking about these models fundamentally changed
with DeepSeek. So it sort of showed that
this hyperinvestment in data centers, which was kicking off with
the Stargate project in the US... to explain context to
anyone in the audience who doesn't know
Speaker 2 (13:21):
It, five hundred billion dollars.
Speaker 3 (13:23):
An intended five hundred billion dollar investment between OpenAI,
the US government, and I think Microsoft was involved, so I...
Speaker 2 (13:31):
Think Microsoft pulled out of it.
Speaker 3 (13:33):
It was Oracle. Oh, okay, got you. Yeah, yeah, that
just got announced.
Speaker 2 (13:39):
Yeah, there was a little political game there that
was also going around the town. They sort of announced this: hey,
you know, I know we had this deal with OpenAI
where they're going to run on Azure, but we're
ready to let that go. I think it was because
of Stargate. Yeah, you know, there was sort of
this pressure on Microsoft, you have to keep growing, growing, growing,
and they're like, this is getting irrational. So if you
want to go play with someone else, you knock yourself out.
So, bringing it back to DeepSeek for a minute. From
(14:02):
what I understand, you know, OpenAI and all
these other models are looking at that and learning from
it and figuring out how to make their own models
more efficient. And at one point I heard that the
Chinese model was, you know, hey, let's spend a lot
less money on these things so that they're less expensive,
(14:24):
we don't have to use as many processors and all
that stuff. And I think I heard that, you know,
the response from the American companies was, oh no, we're
just going to make it ten times, one hundred
times more powerful, you know. So a different kind of
mindset. But that was originally. Now I think that
there's more of a desire to get smaller LLMs, right, yeah,
(14:53):
that are more specialized.
Speaker 3 (14:54):
The nuance of the story is that basically we've
known that there are ways to make neural nets more
efficient, right, like, there are ways of making the models smaller,
or, after you've trained them, actually trimming them down and
getting the same performance, or almost the same performance, for
a much smaller number of parameters. We've also known for quite
(15:16):
a long time, and this is true with any machine
learning model, that the higher the quality of the data, the,
you know, the better the model can perform for a much
smaller number of parameters. So this was proven last year,
or the year before, with
the Falcon models; they were sort of the first big
open source ones that were trained on higher quality data
sets and got a lot more performance for fewer parameters.
(15:38):
But the most reliable way to get better performance was
to scale. And I think what happened, the story I've
heard, in China is that they just couldn't get access
to the same size of GPUs because of sanctions. Not
sanctions, basically they weren't being sold in China, and so
(16:02):
they had to make do with older and much less
efficient processors, and they had to do all these tricks
to basically share the training across a bunch of smaller machines.
So this meant that they just couldn't create absolutely massive models.
And essentially this meant that, yeah, they were forced to
create a smaller model. But you know, the thing is,
the quality of AI researchers and AI engineers that
are being employed at companies like OpenAI and Anthropic
and companies like this, I'm sure that they knew it
and companies like this. I'm sure that they knew it
was possible. It was just as I understood it, a
less reliable path to performance. And you know, the American
companies had they had the money and they had the
(16:45):
servers to train it, so why not go big?
Speaker 2 (16:49):
And they understand that race right like they understand build bigger,
keep going like it's a very American approach to things. Yes,
you can always tune later, right, do your land grab now, but.
Speaker 1 (16:59):
Also, there's a difference between having one huge model
like, you know, ChatGPT, that knows everything, has bazillions
of nodes or whatever it is, and then can, you know,
cross reference things, right, and connect the dots
very much in ways that humans do, but even
(17:20):
more broadly. Whereas if you have smaller, less expensive models
that are just LLMs that are trained on specific data, right,
you'll probably get more accurate things out of them
for that particular set, you know, that particular context, maybe,
(17:41):
and then be able to have many of those
that have different expertise, but they won't necessarily be able
to connect the dots
like a large, huge model can.
Speaker 3 (17:53):
Right. This can actually lead into a further discussion about
measurement if we want. But basically, looking at the current
benchmarks that they're using to assess performance of LLMs, DeepSeek
and smaller models coming out of China are actually rivaling
the performance of larger models. So basically the understanding seems
to be that a lot of the parameters that
(18:15):
these big models have are not actually being used every
single time you try to do like inference for a
particular task. It's only a subset of the parameters. So
the way to think about parameters is think about neural
nets as like you have inputs and then you have
a bunch of neurons that are connected by what are
(18:35):
called weights. They're basically multipliers, and you can kind of
think about inference as a path that you take through
the neural net, where like, you know, the whole thing's
going to be used, but only certain weights will actually
have an impact for particular types of tasks. And it
sort of seems that what's happened with scaling down these
(18:57):
models is that because they learned on so much data,
and so much of the data seems to have not
been high quality, that they really, like a lot of
the parameters were not really being used in the majority
of cases, they were just I see dead weight.
Speaker 1 (19:14):
And so so if you wanted to translate parameters and
neurons to language, we're talking about the probability of the
next word exactly right that it spits out. Yeah, and
what you're saying is that they're only choosing from parameters
with higher weights.
Speaker 3 (19:31):
Yeah, it's it's like or words.
Speaker 2 (19:34):
With higher weights.
Speaker 3 (19:35):
Yeah. So basically the way it works is, like you
think about the last layer of the neural net is
basically like all the words in the vocabulary. So it's
obviously really really huge, and so the whole neural net
is trying to predict the probability of which of
these words is the most likely to come next. So
it's basically saying that for a particular input, only a
(19:58):
subset of that, you know, the paths that go through
the neural net are actually going to give good information
about what the next word is. And so yeah, it's
it's also like it's kind of fascinating because the models
are such black boxes. No one fully understands how the
decisions are being made. I'm putting decisions in air quotes.
(20:20):
I want to make this clear, because interpretability is,
well, actually, interpretability is becoming a really hot
area in twenty twenty five. So actually understanding how LLMs
come to the conclusions they come to, or, sorry, how
the predictions are being made, to put it in more clinical terms,
and that's going to help firstly make the models more efficient,
but also demystify a lot of the assumptions we make
(20:43):
about the predictions they make. Like we look at the prediction,
we're like, oh, it's solving problems because if a person
did that, it would be showing problem solving. Or the
model's more intelligent because if a person did that, it
would be showing more intelligence, but that's just us projecting.
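(To make the "probability of the next word" idea concrete, here is a tiny illustrative sketch, not any real model's code: made-up scores for a toy four-word vocabulary pushed through a softmax, which is all the final layer's "prediction" amounts to.)

```csharp
using System;
using System.Linq;

class NextTokenDemo
{
    // Softmax turns raw scores (logits) into a probability distribution.
    static double[] Softmax(double[] logits)
    {
        double max = logits.Max(); // subtract the max for numerical stability
        double[] exps = logits.Select(l => Math.Exp(l - max)).ToArray();
        double sum = exps.Sum();
        return exps.Select(e => e / sum).ToArray();
    }

    static void Main()
    {
        // Toy vocabulary and made-up logits for a prompt like "the cat sat on the".
        string[] vocab = { "mat", "dog", "moon", "keyboard" };
        double[] logits = { 4.1, 1.2, 0.3, 2.0 };

        double[] probs = Softmax(logits);
        for (int i = 0; i < vocab.Length; i++)
            Console.WriteLine($"{vocab[i],-9} {probs[i]:P1}");
        // Picking the highest-probability word (or sampling from the distribution)
        // is what "predicting the next word" means: weighted scores, not reasoning.
    }
}
```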
Speaker 2 (20:58):
Sure, yeah, anthropomorphization. Now, you know, maybe
I'm thinking about this the wrong way, but you know,
as soon as you say that, I'm like, hey, there's like,
what, six hundred thousand words in the Oxford Dictionary, that's
just English, and most people use fifteen hundred of them.
So oh yeah, yeah, yeah. You know, here you've built
this model that has this huge potential range of comprehension
(21:18):
and you're using a tiny subset of it depending on
what you're doing. Especially when we're coming at this from
the Copilot point of view, it's like, I'm working on code.
Speaker 1 (21:27):
Yeah, every symbol in the language is a is a
word essentially right.
Speaker 2 (21:33):
So, but you also talked about performance. And my immediate
reaction was, so, what do we mean when we say performance?
Speaker 3 (21:42):
Yes?
Speaker 2 (21:42):
Is that speed? Is that a speed measurement or is
that an accuracy measurement?
Speaker 3 (21:47):
Yeah? So to kind of put this in context, I
gave a keynote at NDC Porto about all the hairy
things that go along with assessing LLMs. So I didn't
get into speed. We can come back to that if
we get time. But it's more about like how do
people judge if these models are good? And last time
(22:09):
we talked and you gave the episode this name, we
talked about the concept of there's no free lunch in
machine learning, and what this means is there is no
there's no one model that will be best for every
possible task you can do.
Speaker 2 (22:24):
Right.
Speaker 3 (22:25):
But what we've seen with the way people talk about
LLMs is they're advertised exactly like this. Like, it's like, oh,
OpenAI just came out with the one model, and
it is the best model on the market, right, right,
And even if we're not, let's put like engineering considerations aside,
let's talk about like, let's put cost aside, let's put
(22:45):
speed aside. That's still not going to be true.
Speaker 1 (22:47):
It's like who's the best guitar player in the world?
Speaker 3 (22:51):
Yes, how do you measure this?
Speaker 2 (22:53):
That's an impossible question to answer. Well, I think when they
were saying best at that time, we were talking about the largest
number of parameters, weren't they?
Speaker 3 (23:00):
Well, what they're talking about is there's this suite of
benchmarks that are designed to assess LLM performance. And we
talked about this last time. But LLMs were originally designed
to be natural language processing task generalists. So they're good
at doing a range of natural language tasks, often without
(23:20):
further training out of the box, so they can do
things like classification, summarization, they can do translation, things like this.
So generally, when these models were first designed, they were
benchmarked against how well they could do these natural language tasks,
like specific things like question answering, translation, blah blah blah.
(23:43):
But as as the capabilities of the models have grown,
or maybe they seem to have grown, we don't know.
What we started doing is getting them to do things
like grade school math problems, or we've gotten them to
do suites of questions that are designed to assess problem
solving or blah blah blah. And then what we do
(24:07):
is we collate a bunch of these gold standard measures
together and we combine them in such a way, and
we create leader boards and we rank these models and
we say, oh, Okay, this model is the best because
it did the best at the MMLU, which is like
a reasoning benchmark, or this one's the best because it
did the best at like a collated collection of all
(24:28):
of these benchmarks. So it's doing well on reasoning, and
it's doing well on problem solving, and it's doing well
on math, and it's doing well on coding. But this
is the thing, like, firstly, a lot of these measures
have been found to have serious problems. Then they've been
found to really not measure what they said they claim
(24:49):
to measure in a variety of ways. And the second is, Okay,
I am an application developer. I want to design an
application that uses an LLM. Say I want to make
a chatbot that can help people plan their holidays. What
does it matter to me that an LLM is
(25:09):
really good at solving science problems, grade school math problems,
Like is that going to be good for my application?
Speaker 2 (25:18):
So, got a calculator do you have, like, okay, gotta
coverage and.
Speaker 3 (25:24):
It's probably going to do the math wrong anyway, because
they're not symbolically calculating it. Exactly, they do that. But...
Speaker 2 (25:33):
Then you also have it return the response in the
form of a limerick.
Speaker 3 (25:39):
That's fantastic. It's what our customers needed.
Speaker 2 (25:41):
That's it.
Speaker 3 (25:44):
So yeah, this is part of the problem. The way
we talk about, the way we talk about LLMs,
is we talk about them like they are a thing
independent of machine learning, but they are absolutely not. And
part of the problem with that is it means that
the way that we use them is we tend to
trust their outputs too much, and we also tend to
(26:08):
you know, not have scrutiny about like whether a model
is the best fit for our use case, so we
don't design assessments to see like is this actually doing
what it's supposed to do, which we would absolutely do
with traditional machine learning.
Speaker 1 (26:23):
I have had the experience of using LLMs in, you know,
both ChatGPT and Copilot to help with coding things,
and I found a situation where I asked it to
do something, you know, to write something, and instead of
pointing to something in the framework that already did that
(26:45):
and saying why don't you just use this, it just
went ahead and created the thing, you know, reinventing the wheel.
And then you know, an hour later, I've got something
that works. But I'm like, hey, there's something in the
framework that works just like this.
Speaker 2 (27:00):
Yeah.
Speaker 1 (27:02):
So that's why you need a human in
the equation.
Speaker 2 (27:06):
Although, although, was there a prompt there to say, is
there a class that does X?
Speaker 1 (27:11):
Well, that would have been yeah, that's the human error
that because that should be the first question. It's like
when somebody says, you know, I have an idea for
an app, and my first question is, well, first of all,
I don't I'm not going to write it for you
unless you pay.
Speaker 2 (27:25):
me. And second of all, does it already exist? And
the answer is usually, yeah, if it's really that good
of an idea, somebody else has done it. Okay, well,
why don't we take a break then. I want to
dig into some of these evaluation strategies.
Speaker 1 (27:39):
All right, we'll be right back after these very important messages.
Stay tuned. Do you have a complex dot net monolith
you'd like to refactor to a microservices architecture? The Microservice
Extractor for dot NET tool visualizes your app and
helps progressively extract code into microservices. Learn more at
aws dot Amazon dot com, slash modernize.
Speaker 2 (28:05):
And we're back. It's dot NetRocks. I'm Richard Campbell, that's
Carl Franklin, talking to Doctor Jody Burchell. Hi. And if
you don't enjoy those ads and you'd like an alternative,
we do have a Patreon that provides an ad free feed.
Go check it out at patreon
dot dot NetRocks dot com. Yeah, so I found the
DeepEval site that talks about MMLU. But as near as I
(28:28):
can tell, this is just a set of questions in
different topic areas.
Speaker 3 (28:32):
Yes, yes, so, so let's talk about benchmarks. So there
is a very famous leaderboard called the Hugging Face
Open LLM Leaderboard, something like that. Okay. So Hugging
Face is a company, they're based in France, and basically
(28:57):
what they do in their open source branch is provide
access to all of the major open source, what are
called foundational models, so big LLMs that are open
source, computer vision models, those that can generate audio, you know,
do transcription, all these sorts of things. And so Hugging
(29:19):
Face take the open source models, they run these models
against a suite of benchmarks, and then they collate them.
And they used to have, used to have a
leaderboard up until June last year. This was the first
version, and it included scales like HellaSwag and the MMLU.
(29:41):
So it got retired for a couple of reasons. But
one of the reasons it got retired is people started
going through the questions, and MMLU was bad. It had
a few questions that literally were, like, I think one
of them was something like, the continuity of the theory.
That's, that's the full question. And then it was a
(30:02):
bunch of multiple choice answers that were just lists of
numbers. Like, that was the question, and you think, wow,
the gold standard is a human, so humans are meant to
be able to answer this, and then you rank how
well the LLM goes, and I'm...
Speaker 2 (30:15):
Like, nobody can answer that.
Speaker 3 (30:17):
What does this mean?
Speaker 2 (30:18):
What does it even mean?
Speaker 3 (30:20):
What does this mean? But my favorite, my favorite, my
favorite was HellaSwag. So apparently HellaSwag, I think,
was made using Mechanical Turk, so they got people to
generate the questions and then validate them. But clearly whoever
picked up this task was, like, not particularly invested. Like,
you know, they're not getting paid a lot of money,
they probably didn't care, right? And I have actually an
(30:42):
article with some of my favorite, absolutely bizarre HellaSwag questions. Okay,
now keep in mind I am reading this out as
it's written. Okay, So we have a question, and we
have a bunch of multiple choice answers, and what the
LLM is supposed to do is complete the scenario. So
it's meant to pick the option that has the most
(31:05):
you know, fitting scenario end. Okay, so I've got one
for you. Man is in roofed gym weightlifting. Woman is
walking behind the man watching the man. Woman is: A,
tightening balls on stand on front of weight bar; B,
lifting eights while he man sits to watch her; C,
(31:29):
doing mediocrity spinning on the floor; D, lift the weight
lift man.
Speaker 2 (31:37):
That doesn't make any sense.
Speaker 3 (31:39):
It doesn't, and probably around a third of the questions
in HellaSwag were this garbage.
Speaker 1 (31:44):
I just want to know what mediocrity spins are. I
want to do that, and I just don't know because
I don't.
Speaker 2 (31:51):
Know the definition.
Speaker 3 (31:52):
That's every time I turn around and knock something off
a shelf with my clumsy hair. That was twenty twenty.
That was mediocol for the child.
Speaker 2 (32:04):
Yeah, there's been some times. So I mean this just
seems lazy then. Like, well, let's back up. Is
asking questions of an LLM actually a good way
measure its effectiveness?
Speaker 3 (32:16):
Now? Yes and no. So you can create well defined
problem suites if you have a good idea of what
you're assessing. So this is this is basic measurement theory, right,
It's like we learned this in psychology. It's tricky with
LLMs because we have a tendency to extrapolate too much.
(32:38):
We try to project what their performance would mean if
a human did that, and we can't do it because
LLMs do not have what's called fluid intelligence or general intelligence, right?
They have what you could essentially call crystallized intelligence, which
is that they have a bunch of little templates of
how things work based on scenarios they've seen before. They
(33:00):
can pattern match questions they see against this. So you've
got to be really careful about how far you deviate
from them doing pattern matching to them showing intelligence, right,
But it is possible. Let's say you want to assess
how well they do specific tasks, like they can answer
questions about history or whatever. That's fine. I think that's
(33:21):
fine to assess. It gets tricky because there are two
main problems with using questions other than the one I've
just said. The first is that the answer type
that an LLM is presented with actually impacts their performance.
So most of these measurements use multiple choice questions, and
(33:44):
the reason that they do that is because it's much
easier to score because they're essentially ways of seeing, you know,
the probabilities of words I was talking about. You can
quite accurately tell what's the highest probability sequence that it
would have ended up predicting based on, you know, the
ones that is present with, So you know, it's much
much easier to work this out. But you can also
(34:06):
get them to generate answers, and generating free form answers
is really hard to assess unless you're gtting humans to
actually compare it to a gold standard because the statistical
ways we have of comparing two sequences are imperfect. So
most of the time people will use these multiple choice
answer keys. But the problem is that LLMs seem
(34:28):
to do a lot better when they're presented with multiple
choice answers compared to free form answers. Sure, and the
reason it seems to be is because it's a lot
easier to just pattern match to something they've already memorized
as opposed to having to generalize a bit more. And
then the second big problem is that LLMs are ridiculously
(34:51):
sensitive to the format of the prompt template you use.
We've already talked about this, like did you tell them
to use a framework that already exists? But it's so
much more subtle than that. So using a different placement
of punctuation, using different spacing, this can impact the performance
(35:11):
of LLMs on a task by, like, thirty, fifty, seventy percent. Wow. Yes, yeah,
and like why it seems to be again pattern matching.
So if that like particular formatting is closer to something
that it's seen already in training, it's more likely going
to be able to get it right.
Speaker 2 (35:32):
So all I got to do is ee cummings my
prompt and it just.
Speaker 1 (35:40):
Exactly it's just Richard invents a new ferbs.
Speaker 2 (35:45):
Yeah. But you know an interesting point, like anytime you
want to remind a person that this software is not intelligent,
it's that that recognize that this is pattern matching. The
fact that that as a human, I can hand you
only lower case there's no punctuation statement or a perfectly
punctuated statement, and you'll see it as exactly the same,
(36:08):
just one lazier than the other. But the software treats
it completely differently.
Speaker 1 (36:12):
Exactly do you guys know the story of what the
moment that Bill Gates went nuts over chat GPT and
began to trust it and his mind was blown over it.
So the Richard I sent you a link in the
chat you can post it there. This is the story
and I heard about this story on the on the radio.
(36:34):
So the story from this CNBC dot com things. Bill
Gates watched chat GPT asen ap bio exam and went
into quote a state of shock. And this was August eleventh,
twenty twenty three. So but what you don't know, and
I don't even know if they say it in this article,
(36:55):
I don't think they do. But a couple of months
before is when Sam All actually showed Bill chat GPT
and he added a couple of things and he said,
you know what, you know, it would be a great test.
Sam is if we could give it the ap bio test.
(37:17):
And it aced it and then Sam goes home and
two months later brings it back and it ass the exam.
So what do you think happened in those two months?
Speaker 3 (37:27):
What a mystery.
Speaker 2 (37:29):
It's so strange, I can't imagine. I'd also point out
that an AP exam is largely multiple choice.
Speaker 1 (37:34):
There you go, Well, I mean I thought it was
I thought they were essay questions. There were five questions,
but I don't I didn't read that part, but I
heard that they were five five essay questions. Anyway, they
did not say that in the article, but that that
apparently happened.
Speaker 2 (37:53):
It all depends on what you.
Speaker 3 (37:54):
Train it on, right, Yes, and this is actually a
bigger issue. So this is an issue that's called data leakage, again,
well known problem in machine learning. It's when your model
gets access to the test set during training, so it
can basically learn the answers. And, well, the implication from
Carl is that this may not have been an accident
(38:16):
this time. But you know, we don't have a clear
idea of what's in the training data for a lot
of these models. Even open source models now are being
super cagey about what's in their training data. So they
say it's a competitive advantage. But we know from experiments
people have done that even benchmarks have ended up at
(38:37):
least partially leaking into the data. So we know that
a lot of these companies will optimize for benchmarks. They'll
keep training the models until or not keep training them,
but they'll keep tuning them until they do well on benchmarks.
But even accidentally, because they're just scraping the open internet,
sure they've accidentally shoved a bunch of these questions.
Speaker 2 (38:57):
Which is probably where the benchmarks came from in the
first place. Any so exactly eventually you're going to meet
up with the data. It doesn't seem surprising at all.
Speaker 3 (39:06):
So yeah, the modern suite of, like, benchmarks that started
being created last year, they started making them private
to mitigate this. But it doesn't mean that, you know,
you as a consumer, you're a lay user of an
LLM. Maybe not a lay user. You might
be a bit more technically advanced, but none of us
(39:26):
here are AI researchers, right. Right. And so we
might not be far from that, an informed consumer. Let's
put it that way.
Speaker 1 (39:40):
But you know, I just found it. I'm sorry to interrupt,
I just found it. The story is that, you know,
Gates issued what he believed to be a rather difficult
challenge to Sam Altman: bring ChatGPT back to me
once it could exhibit advanced human level competency by achieving
the highest possible score on the AP Bio exam.
(40:01):
And so two months later, oh magic, OpenAI's
developers came back and Gates watched it get the top score of
five on the test. So, so yeah, so there it is,
right in black and white, that actually happened. And as
I was hearing this, I was like, you idiot, why
didn't you just say, give it to me when it
(40:24):
can answer a test question, and don't tell them what test? Yeah,
you know, a test question, and then just do it.
Speaker 2 (40:32):
Try it.
Speaker 3 (40:33):
Yeah, I don't know.
Speaker 2 (40:34):
Far be it from me to call Bill Gates an idiot.
Did I actually do that? But you know, there's this
confirmation bias situation you can put yourself into. Yeah.
Speaker 3 (40:43):
And this is the thing too, Like I don't blame
people for feeling enchanted by the models, like there is
something so human feeling about them because they're echoing back
our humanity. Yeah, but you need, you always need to
be cautious, and like we were trying to do at
the beginning, like, you see, we slip into anthropomorphizing the
(41:05):
models even though we know better.
Speaker 2 (41:06):
For sure, because it's easy.
Speaker 3 (41:07):
It is easy. But really, like, even with the latest
benchmarks trying to assess AGI, this one called ARC-AGI
that OpenAI's o3 actually did very well on, got
seventy percent late last year, this is still just pattern matching,
(41:30):
but pattern matching in a more organized way. It's basically
the model has more of an ability to sort of
sort through which patterns might be the best to apply.
But again, we're just talking about a more systematic application
of crystallized intelligence. We're not talking about generalizability yet.
Speaker 2 (41:50):
Yeah, And I mean the more I read, the less
I'm concerned about the AGI side of the equation. It
seems more and more like a marketing term to hire
more people to work at OpenAI.
Speaker 1 (41:59):
Yeah, it's such a fluid term that keeps changing. The
definition keeps changing.
Speaker 3 (42:05):
But how do you assess AGI? Like I don't know
if we talked about this last time, because I had
that in my first talk, But you know, how can
you even assess the gap between what a model knows
and you know, a task, so like the difficulty that
a model would have doing that task based on what
it already knows, and then standardize that across a bunch
(42:27):
of different models that have potentially been exposed to very
different tasks and knowledge. Like, sure, it feels, it
feels like such a difficult...
Speaker 2 (42:35):
challenge. And it's way too broad, and ultimately I feel
like it's a distraction from the fact that we're just
trying to be engineers making use of a useful tool. And
I mean, I led off this conversation talking about the
fact that I always ask folks, like, which one are
you using right now? What are you enamored of? And
the fact that, you know, I had sort of a
universal, everybody likes Claude right now. It's like, why? What
(42:59):
do you... What is your innate benchmark that made you
switch to this, or is it just a social pressure
thing, vibes, man, because that smart person was using Claude, now
I'll use Claude, and then there'll be some nice confirmation
bias there. Well, yeah, no, it seems to be doing
the thing. Is it actually better than the ChatGPT equivalent?
I don't know, how would you measure that? So we're
(43:20):
in this loop, and I don't feel like I can
get a version, a new version of anything from any
of these folks come out, a new OpenAI, a new
Claude, or any of these, and say, okay, is
it worth switching? Yeah. I mean, I know they want
me to, I know it's invariably more expensive, but is
it better?
Speaker 3 (43:37):
Yeah. Look, I have a prediction. Here we go,
I'm going to do a prediction, why not?
Speaker 2 (43:43):
Why not? Why not? Why not?
Speaker 3 (43:46):
I'm going to say, probably in a year's time, the
landscape of providers is going to look quite different. Oh, definitely.
And it's because the advantages of using smaller models
just drastically outweigh using bigger ones. They're cheaper, they're more
environmentally friendly, they're more, they can be more specialized,
like it's easier to tune them so that you can
(44:08):
focus them on specific tasks. And yeah, ultimately, they're just,
you know, it's easier to control what happens to your data.
Speaker 2 (44:15):
So right, that's a big one.
Speaker 3 (44:18):
That is a big one.
Speaker 1 (44:19):
It's a big one, especially with something like DeepSeek.
You know, the only way I'm going to run that
is on my own network not connected to the internet.
Speaker 2 (44:27):
Well, they do offer a local version, yeah,
they do, right, which NVIDIA has been benchmarking with
a fair bit, I noticed, which I thought was cool,
like a smart thing to do, not just to... because there's
lots of folks saying, no, don't use the Chinese LLM.
But yeah, the fact that NVIDIA just said...
Speaker 1 (44:45):
That's the same reason they're saying don't use TikTok, right,
so they don't trust it more or less?
Speaker 2 (44:49):
What could happen I don't.
Speaker 3 (44:51):
Use Yeah, I don't use TikTok just because I'm deeply uncool,
Like it makes with you.
Speaker 2 (44:58):
I'm so with you.
Speaker 3 (44:59):
I actually had to make a TikTok. I was at
a workshop for my job like two months ago, so
you know, I'm an advocate. So they're like, hey, let's
teach these old people to make TikToks. Nice. And I
made this TikTok with Michelle, you know, Michelle, Richard, Michelle Frost,
so yeah, yeah, yeah, she just started with JetBrains.
And so we made this TikTok with another of our
(45:20):
colleagues, and it involved Wilbur, her dog, and it was
just like it was so bad. And then they're like
it's awesome, like you should go on TikTok right now,
and I'm like no, I'm deeply ashamed, like.
Speaker 2 (45:34):
I should see.
Speaker 3 (45:35):
This was bad.
Speaker 1 (45:39):
I do have to confess that I have a TikTok
account Carlotphoenix dot com. I have not used it yet
for anything more than hey, I'm here, and I certainly
don't scroll TikTok. I have so many better things to
do than to scroll inane, insane, crazy music videos of
people doing stupid things, and dogs and.
Speaker 2 (45:58):
Cats. Also, you know what's interesting about TikTok
is you're not really picking the content, they're picking the content. Yes, yeah,
they are watching your loiter time, so it's your behavior
that's really selecting the content. But you know, it is
a different mechanism there where you can't really curate a
list or build a social graph. That's not up to you.
(46:19):
And, and I find that interesting, right, like that, we
literally are handing over our attention to something else that's
driving it.
Speaker 3 (46:28):
Yeah, I do have to admit I'm so curious about
their recommender, the...
Speaker 2 (46:32):
Yeah, well, as a technologist, right, anything like... well, because
that's the thing that they're all upset about. This is
our secret sauce, and we'll be keeping it to ourselves,
thanks very much. Oh, well, here's the thing. Is
there a way, when you see something that you don't
like, to say, give her to death, I don't want
to see this again? No, because it's too late to
scroll past it. You've already scrolled. The problem is that
(46:53):
before you found out you didn't like it, you watched it.
You know, it's the old, old, I'm trying to improve
the quality of my diet by eating everything and deciding
what I like. Yeah, there is no nutritional label on
any of these things.
Speaker 3 (47:13):
Sorry, it's called democracy.
Speaker 2 (47:15):
Okay, so that's what you want to call it. The
only person who doesn't get a vote is the viewer.
I'm not cynical at all. I don't know what you're
talking about.
Speaker 1 (47:28):
You remind me of David Mitchell, the British comedian,
who just every once in a while will just go
off on a rant.
Speaker 2 (47:36):
Just start him and he'll keep on going. I'm fine, everything's fine, okay. Yeah,
we're still at this core issue of how do I
select an LLM for my app? I mean, part of
it is the running context, or I can, I can
go down the cost side, and can go down the,
(47:58):
does it, you know, integrate well with my... Do I have
any cloud access? Or can I run local?
Speaker 3 (48:02):
Like?
Speaker 1 (48:02):
There's all those decisions. We need an LLM to answer
this question.
Speaker 2 (48:06):
I don't think.
Speaker 3 (48:09):
I have something even better. I've got a blog post. Hey,
all right? So I will share this with you
so you can share it with the audience. But I
came across, when I was writing my talk, I came
across this absolutely phenomenal blog post. So the guy's an AI
engineer called Hamel Husain. So this guy works in an
AI consultancy, so, exciting job these days, goes out and
(48:33):
he needs to basically build applications for companies that use AI.
And one of the jobs that he talks about is
he and his company were hired to build a chatbot
for real estate agents. So basically, they wanted the real
estate agents to be able to type in natural language,
like give me the contact details for everyone in this area,
(48:55):
whatever, you know, and then the LLM would generate a
query to a CRM, something like that, and return the
information. So when they first started building the app, they, like,
they picked a good LLM based on the leaderboard, a good one,
and then they wrote the initial prompt templates and then
you know, everything looked good, and then things started not
(49:17):
working on the edge cases, so they made the prompt
a bit more elaborate, and then the prompt started getting
really unwieldy, and then they realized the only evaluation metric
they had was vibes and they were like, really, this
is a mess. So he set out in this really
interesting way how they actually went back to ground zero
and they started again, and he said, like, basically, we
(49:39):
realized we needed a tiered assessment. So he said, like,
the first tier of assessment is unit tests, Like it
seems really obvious, right, But he's like, the thing is
is like, because it's nondeterministic, you're not going to have
one hundred percent pass rate on your unit tests. So
you need to determine what error rate you're happy with,
(50:00):
and that's going to require a bit of experimentation.
Speaker 2 (50:03):
But you also have to accept that level of error rate,
like you're not getting all green. Exactly.
Speaker 3 (50:07):
Exactly, so it might be like you just need ninety
five or ninety nine percent or whatever to pass whatever
looks realistic. But you know, an example of unit tests
is let's say the query from the user was return
me the phone number of you know, Jane Smith or
you know someone like that, and then basically what you're
going to expect from the CRM is a phone number,
(50:28):
So you can write a unit test for that. You know,
it's basic engineering. And then he said, you know, you
can create a suite of manual evaluations, so you basically
look at the traces, how the LLM is interacting with
the users and the rest of the system, and you
manually evaluate that. And you don't have to keep doing
(50:48):
that forever because then you can use a new method
called LLM as a judge, where you get another
LLM to also do the same assessments and try to
get them to converge. And once you have a relatively
strong sense that the LLM is giving similar assessments to
your human, you can, you know, you need to check
in on it from time to time to see if
(51:09):
it's okay. But that you know, takes over that part
of the assessment, and then you know, you can go
up to your normal kind of higher level assessments like
a B testing. You know, it's really it's just a
normal engineering system, and you can create a feedback loop
where you can you know, refine your prompts or fine
tune models, or use different models that maybe are smaller
(51:32):
or cheaper and see whether you can get the same
sort of performance. So you know, obviously you're going to
need to just pick a model to start with. You
might be able to get a sense of whether it's
good for chatbot applications in this language, you know, do
your research on that. But this really shows me, like
it's just it's so obvious, right, like this is how
(51:54):
we do monitoring.
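(Here is a rough sketch of what that first tier could look like in code; it is illustrative only, with a stubbed-out ExtractPhoneNumberAsync standing in for the real LLM-plus-CRM call, and the run count and ninety five percent threshold are arbitrary numbers you would pick for your own app.)

```csharp
using System;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class LlmEvalHarness
{
    const int RunsPerCase = 20;           // rerun each case: output is nondeterministic
    const double RequiredPassRate = 0.95; // chosen threshold, not 100 percent

    // Hypothetical stand-in for the real pipeline (prompt -> LLM -> CRM query -> answer).
    static Task<string> ExtractPhoneNumberAsync(string query) =>
        Task.FromResult("+49 30 1234567"); // stub so the sketch runs

    static async Task<double> PassRateAsync(string query, Func<string, bool> isValid)
    {
        int passes = 0;
        for (int i = 0; i < RunsPerCase; i++)
            if (isValid(await ExtractPhoneNumberAsync(query))) passes++;
        return (double)passes / RunsPerCase;
    }

    static async Task Main()
    {
        // "Return me the phone number of Jane Smith" should yield something
        // that at least looks like a phone number.
        double rate = await PassRateAsync(
            "Return me the phone number of Jane Smith",
            answer => Regex.IsMatch(answer, @"^\+?[\d\s\-()]{7,}$"));

        Console.WriteLine($"Pass rate: {rate:P0}");
        if (rate < RequiredPassRate)
            throw new Exception($"Below threshold: {rate:P0} < {RequiredPassRate:P0}");
    }
}
```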
Speaker 2 (51:56):
Yeah, and it's, I'm sorry, this looks too adult for me.
Speaker 3 (52:00):
Well, I know, it looks like a lot of hard work.
Speaker 2 (52:03):
Literally, as in, you actually have to work at building a
decent testing framework specific to your case. I
know, I wanted, I wanted a happy button. Jody
made me sad. I want a button. Yeah.
Speaker 1 (52:18):
We recorded a show with Spencer Schneidenbach, which is actually
next week's show.
Speaker 2 (52:23):
We recorded it a couple.
Speaker 1 (52:24):
Of days ago, so we have the benefit of future
looking here, and we talked about some of these things
with him, and uh, you know that just the comment
came up, and I think it was even me or Richard,
I can't remember who, but you know, we, we used
to... we used to be programmers where, you know,
we have a bug, we fix it, now the program
(52:45):
is one hundred percent accurate. And now, I mean, it's
even, it's even more like we're psychologists now instead
of scientists. You know, we make some suggestions, we examine
the output, you know, we think about it a little
bit and it doesn't seem quite right. We ask some
more questions, examine the behavior.
Speaker 2 (53:03):
You know.
Speaker 1 (53:04):
It's like, if these things are going into our software,
I have a little trepidation about that just because of
the inaccuracies. Even if it's even if something is ninety
nine percent accurate. That's that's a bug. That's a one
percent bug.
Speaker 2 (53:21):
Yeah, and one you probably can't pin down.
Speaker 1 (53:23):
And one you probably can't fix.
Speaker 2 (53:25):
Use probabilistic tools, get probabilistic results. Yeah, exactly.
Speaker 3 (53:32):
Look, it's funny because I'm probably so much more comfortable
with this than any of you, because I'm like, hey,
this is just how stuff works.
Speaker 2 (53:39):
That was how machine learning always worked. When we talked
to you in twenty three, you'd been doing this for years,
and it's like, you do the testing and there is
no one hundred percent. Exactly. Yeah, you get in the
mid nineties, you should feel good.
Speaker 3 (53:50):
Yeah, yeah, well, sometimes that's suspicious, it depends, sometimes it's too quick.
But yeah, I think it's an uncomfortable new reality. And
you know, it's something I've observed for years when you know,
you bring engineers into the world of machine learning, and
(54:10):
it is a deeply uncomfortable thing not knowing that something
is one hundred percent deterministic. I think the main problem
is, it's one thing to, say, have a system
that otherwise works totally in a deterministic fashion. So let's
say you've got some sort of system that say, takes
in queries or takes in numbers from a user, let's say,
(54:33):
like nutrition numbers for a piece of food or something,
and then you have a machine learning model that generates
a prediction that may be within a certain band of correct.
It's more difficult when you're talking about an LLM being
an actor as part of that system and generating pieces
of code that will then run that system and then
(54:54):
generating error in that way is actually quite consequential.
Speaker 2 (54:57):
I just like that phrase certain band of correct.
Speaker 3 (55:03):
We call it, we call it a confidence interval. Actually,
how confident am I that this is correct?
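(And since "band of correct" deserves a number: here is a generic worked example, not anything from the show, of the normal-approximation confidence interval you might put around an observed accuracy.)

```csharp
using System;

class ConfidenceIntervalDemo
{
    static void Main()
    {
        int n = 200;       // hypothetical number of evaluation cases
        int correct = 186; // hypothetical number the model got right
        double p = (double)correct / n;   // observed accuracy: 0.93

        // 95% confidence interval via the normal approximation:
        // p +/- 1.96 * sqrt(p * (1 - p) / n)
        double margin = 1.96 * Math.Sqrt(p * (1 - p) / n);

        Console.WriteLine($"Accuracy {p:P1}, 95% CI roughly [{p - margin:P1}, {p + margin:P1}]");
        // Prints something like: Accuracy 93.0%, 95% CI roughly [89.5%, 96.5%]
    }
}
```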
Speaker 2 (55:13):
But you know, I think as a developer, when you're
talking to leadership that want you to use these tools,
that think they're going to provide an advantage, part of
their education is: these are nondeterministic models and
there will always be a certain level of uncertainty, and
if you're not good with that, we don't get to
use these tools.
Speaker 1 (55:28):
Yeah.
Speaker 2 (55:29):
Yeah, yeah, that's right.
Speaker 1 (55:31):
So you know, the I guess first analysis you should
do in your business is what level of uncertainty are
we comfortable with?
Speaker 2 (55:39):
Can we tolerate? You know, what is the benchmark that
we're shooting for? Well, and then the other side of
this is the consequences of the uncertainty, of the mistake,
like what can happen, right? Are people gonna die? You know,
I know, I see this over and over again, where
like the first case of an LLM in an organization
is with an HR system. So it's totally internal. And
part of that is because the consequence of it not being correct
(56:01):
is minor. You know, yes, you're going to make somebody
angry if you tell them they have more vacation days
than they do, but you probably haven't cost a company
a lot of money.
Speaker 1 (56:09):
Well, and also there's a human there to make sure
that the you know, accurate information gets given to the person.
Speaker 2 (56:17):
I like your optimism, but yes, I would hope. You
would hope. I would hope. But you know, you get the point, like,
there is a bunch of ways to manage this uncertainty.
So there's going to be a new corporate title,
a job, and it's going to be a nondeterministic compensator. Oh,
I think that's from, I think, I think that's from
(56:38):
Back to the Future. You're thinking, CUO, Chief Certainty Officer.
Speaker 1 (56:44):
Chief Uncertainty, that's good. I think you're thinking of uncoupling
the Heisenberg compensators.
Speaker 2 (56:49):
There you go, that's Star Trek, Star Trek.
Speaker 3 (56:51):
Yeah, bouncing all over the place, such geeks.
Speaker 2 (56:58):
I found the blog post from Hamel, and I'll include
it in the show, from Hamel Husain, the one
called Your AI Product Needs Evals. And it's exactly
the way you describe it: building unit tests, doing model evaluation,
doing A/B testing. This seems like a real concrete
approach to just how do we at least be able
(57:19):
to look people in the eye and say we've done
our best to test this and have some certainty around it.
Speaker 3 (57:25):
Yep. And, well, I think what I like about it
is it's not unfamiliar territory for engineers. This is exactly
what you've all been doing for decades now, Like this
is just monitoring well, or at.
Speaker 2 (57:40):
Least should have been doing. This is looking like the
testing we do on software.
Speaker 3 (57:44):
But this is the thing. It's no fault of
any on the ground developer, because the way these models
are sold is that, no, they're magic, like they are
different to everything else. They are certainly not. They are
the same as any other machine learning model, except slightly
more problematic, because you're probably involving them in critical parts
of generating code.
Speaker 2 (58:05):
Just be careful and measure. Please be careful. Yeah. So, so,
Doctor Burchell, what's next for you? What's in your inbox?
Speaker 3 (58:14):
As I said, I'm heading down to Melbourne in a
month for NDC. I'm going to be giving this talk,
actually, the one I gave at Porto, and I'm going
to be giving one that I gave in Oslo just
about the psychology of LLMs. If you will not be
with me in Australia, you can also watch that on
YouTube on the NDC channel. For the moment, I'm laying
(58:35):
kind of low. I'm actually going for my German citizenship,
so, nice, yeah, I gotta do my citizenship test in June.
Just did my language test a couple of weeks ago, and,
like everything in Germany, it takes months, so I may
be able to apply by the end of the year.
Speaker 2 (58:49):
So you've got to ratchet up your complaining too.
Speaker 3 (58:52):
I laughed, actually so hard, because a friend of mine
did her exam and one of her writing tests was
to write a letter of complaint.
Speaker 1 (59:05):
Well, it started because before you came on, Richard, Jody says,
how you doing, and I said, I can't complain, but
I do anyway. She says, very German.
Speaker 2 (59:17):
Awesome, all right, thanks Jody, really appreciate it. Yeah, thank you.
What a great conversation, all.
Speaker 3 (59:22):
Right, Always always a pleasure, Okay, and.
Speaker 1 (59:24):
We'll talk to you next time on dot net rocks.
(59:47):
Dot net Rocks is brought to you by Franklin's Net
and produced by Pwop Studios, a full service audio, video
and post production facility located physically in New London, Connecticut,
and of course in the cloud online at PWOP dot com.
Visit our website at d O T N E t
R O c k S dot com for RSS feeds, downloads,
(01:00:10):
mobile apps, comments, and access to the full archives going
back to show number one, recorded in September two thousand
and two. And make sure you check out our sponsors.
They keep us in business. Now, go write some code,
See you next time.
Speaker 2 (01:00:25):
You got Jack Middle Vans and