Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:05):
Hey, welcome back to another episode of JavaScript Jabber.
Speaker 2 (00:09):
This week on our panel we have Steve Edwards.
Speaker 3 (00:13):
Yo yo yo, coming at you live from cold and sunny Portland.
Speaker 4 (00:19):
We also have AJ O'Neal. Yo yo yo, coming at you live from the soldering station.
Speaker 1 (00:27):
Oh sorry, I'm Charles Max Wood from Top End Devs and yeah, it is freezing here anyway.
Speaker 2 (00:34):
We have a special guest this week and that is Ishan Anand.
Speaker 1 (00:38):
You want to let people know who you are, what
you do where?
Speaker 2 (00:43):
Yeah, of course, because your course is awesome.
Speaker 5 (00:45):
Oh, thank you. So my name is Ishan Anand. I have about twenty years of engineering and product management experience, and most recently I've been very focused on AI for the last couple of years, and I'm best known in the AI community for an implementation of GPT-2. It's a precursor to ChatGPT that I implemented entirely in Excel, and then late last year I ported that entirely to the web in pure JavaScript, and I teach how the entire transformer works. Basically the model that, you know, was the ancestor to Gemini, Bard, Llama, ChatGPT, Claude.
(01:24):
They're all really inheriting from this model called GPT-2, and I teach people in basically a course of two weeks. If you have really no programming experience, or if you've got JavaScript programming experience, this is the best way to really get in and understand how these things work. They don't have to be a black box. And you can see all that at Spreadsheets Are All You Need dot ai, and the class is on Maven.
Speaker 2 (01:46):
Very cool, So let's let's dive in. First.
Speaker 1 (01:49):
I think you said you had a promo code for
the course, so let's just put that out there.
Speaker 2 (01:53):
Yeah, people want to go get it and get a
deal on it.
Speaker 5 (01:56):
Yeah. So the promo code is really easy to remember.
It's jsjer and just go to Maven dot com and
look for my name, or if you go to Spreadsheets Are All You Need dot ai and then you click that, you can use that promo code for twenty percent
off for the next two weeks.
Speaker 2 (02:11):
So awesome.
Speaker 5 (02:12):
Definitely check that out. And I should just say, you know,
thank you guys for having me. I listened for years
to this, so it's great to actually meet you guys,
well virtually in person.
Speaker 2 (02:23):
Right. Yeah, AJ is the cool one. I just run the show anyway.
Speaker 3 (02:29):
I'm just the thinking guy, while everybody else is the smart people, according to some people.
Speaker 2 (02:34):
Anyway. So let's dive in.
Speaker 1 (02:36):
You said that you explain how the transformer works, and so for those that are kind of new to AI, do you want to just explain what a transformer is in AI?
Speaker 2 (02:44):
Yeah, we can dive into how this stuff works.
Speaker 5 (02:48):
Yeah, sure. So the transformer is, you know, an AI architecture of a model that came out in twenty
seventeen and it is the foundation for most of the
you know AI models that have been you know, like
chat GPT, so those chatbot assistants that seem amazingly smart
(03:11):
all inherit from this architecture called the Transformer. And I can give a high-level overview of everything that goes into that. But the key thing that the transformer does is it takes some input and it tries to predict
what the next word is. And that's really all your
large language model is doing is taking one word or
really one token at a time, and it's trying to
(03:33):
predict when you enter in a question what the next
thing is. And over, you know, the last couple of years, what we've been able to do, we collectively as humanity, is take this model that tries to predict the next word and turn it into these really helpful, amazing chatbot assistants. And the paper that introduced this model,
called the Transformer, was called Attention Is All You Need.
(03:55):
And that's where my course gets its name, Spreadsheets Are All You Need: I basically implemented that entire model inside a spreadsheet, hence the name Spreadsheets Are All You Need.
Speaker 3 (04:06):
So question here then? So I mean, having used Google
since its inception, you know, type ahead is sort of
a standard thing in search. You know, where you're typing,
and it's starting to anticipate what your phrase is, you know,
what you're gonna type next. If I'm you know, starting
to search for spreadsheets on Google, it's going to anticipate, Okay,
what's the next thing I'm going to type? So is
(04:28):
this basically the same thing just on AI steroids or
because I mean basically that's using what people have typed
in and you know they've indexed it and you know,
done things with it. So is that sort of the
same thing just on steroids or is that intrinsically different?
Speaker 5 (04:46):
Yes and no. In terms of effect, it is literally just doing the same thing, like it's trying to predict the next thing. I really kind of give a little bit of mental pushback to just saying, oh, it's just like autocomplete. It is basically structured as an
autocomplete problem, but the level of complexity of the architecture
(05:09):
to solve that problem is just a lot more complex.
But it is trying to do the same thing. And
you know, the way to think about this is if
you can fill in the blank in any sentence, you
probably know something about that sentence. You already know what
the answer might be. Like that's a useful test of knowledge.
But effectively, yeah, that is that is what's going on.
(05:31):
It's just trying to predict the next word, and then
the next word after that and so forth, one at a.
Speaker 1 (05:35):
Time, right, and so effectively, I guess the the autocompletes
that we typically see are a little bit I guess
more naive than say the AI LLM models, where they
have substantially more data to run on, and you know,
(05:57):
use a mechanism that I guess is probably somewhat the
same because it's weighted and things like that, but anyway.
Speaker 2 (06:04):
It can do it across a wider variety.
Speaker 1 (06:06):
Of things and give you deeper answers.
Speaker 5 (06:11):
Yeah, so I mean, actually, let's start with the autocomplete example,
because it does kind of point the way to some
parts of the architecture. Like the simplest thing you might
do for building an autocomplete is you might just say,
if I see this word, what are all the next
likely words that will be after it? And you could
just do a statistical look up across some large data sets, right,
And as good as that'll be, the more pieces of
data you look at, the better its predictive value. So
(06:33):
this is called like a bigram model, because it looks at two. And then what you could do is you could actually look three words back, or you could look four words back. And actually, one of the key
things about the transformers it tries to look at all
the words. And this is what the attention mechanism does,
is that it can figure out, essentially from all the
possible words before it, what is the next most likely word.
(06:55):
And then the other key thing you need to do
is you ask a neural network to take all that information and make a prediction. And it turns out that's the heart
of the transformer and what really made it work was
they just scaled that up to a much larger size
than I think people were used to doing. You know
the autocomplete in your keyboard is probably, you know, built to be really, really fast, and
(07:16):
so they tried to make it really efficient. And what
we've been able to do with the Transformer is make
it really big and then actually make it super efficient
scaling it back down so just that it spits out
tokens at a reasonable clip. But that core idea of saying, hey,
let me look statistically at you know what the next
thing is, Well, one word isn't gonna be enough, Two
words back is going to be better. That is what
(07:37):
the attention mechanism is, in a sense, if you squint, doing: it's trying to look at all the words that came before, it puts them through multiple passes, and then it's asking a neural network to do the prediction
rather than just simply saying, oh, let me take the
raw statistics.
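To make that bigram idea concrete, here is a minimal sketch in JavaScript of the statistical lookup being described: count which word follows which in a corpus, then sample the next word from those counts. The corpus and function names are invented for illustration, not taken from the GPT-2 implementation.

```javascript
// Minimal bigram "autocomplete": count next-word frequencies, then sample.
// Purely illustrative; a transformer replaces this table with attention + a neural net.
const corpus = "mike is quick he moves quickly mike is quick he moves fast";

// Build a map: word -> { nextWord: count }
const counts = {};
const words = corpus.split(/\s+/);
for (let i = 0; i < words.length - 1; i++) {
  const [cur, next] = [words[i], words[i + 1]];
  counts[cur] = counts[cur] || {};
  counts[cur][next] = (counts[cur][next] || 0) + 1;
}

// Sample the next word in proportion to how often it followed `word` in the corpus.
function predictNext(word) {
  const options = counts[word];
  if (!options) return null;
  const total = Object.values(options).reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (const [next, count] of Object.entries(options)) {
    r -= count;
    if (r <= 0) return next;
  }
}

console.log(predictNext("moves")); // "quickly" or "fast", weighted by frequency
```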
Speaker 1 (07:52):
But yeah, so do you kind of want to break
down for us how these systems actually work.
Speaker 5 (08:02):
Yeah? So the first thing I say is the way
to think about the simplest model that I like of
the transformer is that what we've been able to do.
You know, we said that, you know, these are trained
to fill in the blank on a piece of text.
So the example I often use in my lectures and
(08:23):
inside a lot of my material is this very simple,
simple sentence, Mike is quick, he moves and the next
most likely completion would probably be he moves quickly, or
he moves around, or he moves fast. And so the
basic question is how do we get a computer to
fill in the blank of an English sentence or any
natural language sentence. And what we've been able to do
(08:44):
is actually figure out to talk in the language of
the computer, which is math. So if I gave the
computer a math problem two plus two equals it could
fill in the blank. It knows that two plus two
equals four, and we can make the math pretty
large and complex. But computers are really good at math.
So what we've been able to do, and what the
model does is it takes a word problem and it's
(09:04):
really converting it to a math problem. If you look inside,
you know, go to my website, you know, spreadsheets are
all you need dot ai slash GPT two, or if
you download the Excel file and look inside it, what you'll see is, you know, there's text at the beginning.
You type in text on one part of the spreadsheet,
and you get the predicted word at the other end
of it. But in between, if you look in that,
(09:26):
you'll be like, where the heck are the words? It's
all numbers. And so the key insight is what we've been able to do is take something that is a word problem and we've turned words into math. And that mapping process of words into math has two stages.
It's called tokenization and then embeddings. And at the end
of it, we map every word. You can conceptually think
(09:47):
about it to a single number, but we actually map
them to a large list of numbers. And then once
you have a mathematical representation of your prompt, your entire
prompt has been you know, turned into a large list
of numbers. We then run I just call it number crunching.
It's these two key mechanisms, attention and a multi-layer perceptron, or a neural network, that just kind of crunches
(10:08):
on it to try and predict what the next word is.
And then at the end of that we get a number,
and that number we then reverse the process that came
out of that thing, and we say, well, what what
word does this number map to? And that number is
a predicted word, but it's not going to map cleanly
to every single word in our vocabulary. And so if
that number is closer to certain words, like in the
(10:29):
case of Mike is quick, he moves, the predicted number might
be really close to the word quickly. It might be
close to the word around, but it's not going to
be close to you know, quick can be a body part,
it can be the quick of your fingernail. It's not
going to be something about your fingernail, because it's figured
out enough that it's moved the predicted number away from that.
And so we take that and we run a random
(10:49):
number generator at the very end, and then we pick it
according to that random number generator based on how close
that number is to one of the other words in
the dictionary of words mapping to numbers. So that's like
my highest level summary of what's happening under the transformer
without describing all the mechanisms. But again, the key thing
is we found a way to solve this problem numerically.
(11:12):
We map words to numbers. We turn the whole sentence,
your entire prompt into a large list of numbers, We
number crunch on it. Then we get a predicted number
out of it. We just calculate and we look at
how close that number is to our number to word
mapping at the very end, and that's the probability you
get of getting a particular token or word out of
the model. Let me pause there, see if there are questions or things I should clarify.
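Here is a small JavaScript sketch of just the last step he describes: turning "closeness" scores into probabilities and then picking a word with a random number. The vocabulary and scores are made up for this example; real GPT-2 scores every token in its roughly 50,000-token vocabulary.

```javascript
// A toy version of the final step: scores ("closeness") -> probabilities -> sampled word.
// The vocabulary and scores here are invented purely for illustration.
const vocab = ["quickly", "around", "fast", "fingernail"];
const scores = [4.0, 2.5, 2.0, -3.0]; // higher = the predicted number is "closer" to this word

// Softmax: exponentiate and normalize so the scores become probabilities.
function softmax(xs) {
  const max = Math.max(...xs);
  const exps = xs.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Sample one word according to those probabilities (the "random number generator" step).
function sampleWord(words, probs) {
  let r = Math.random();
  for (let i = 0; i < words.length; i++) {
    r -= probs[i];
    if (r <= 0) return words[i];
  }
  return words[words.length - 1];
}

const probs = softmax(scores);
console.log(probs.map(p => p.toFixed(3))); // "quickly" gets most of the probability mass
console.log(sampleWord(vocab, probs));     // usually "quickly", almost never "fingernail"
```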
Speaker 2 (11:36):
I think I follow along.
Speaker 1 (11:38):
Essentially what you're saying then is, so let's say I
wanted it to generate a whole paragraph.
Speaker 2 (11:43):
It just does this over and over and over again.
to get, yeah, the next word.
Speaker 5 (11:48):
Yeah, maybe I've glossed over that part of it. Like
the large language model only predicts the next word technically
something called a token, which is slightly smaller than a word,
and every time you get a prediction out of it, like,
it doesn't by default predict paragraphs. So if you, you know,
try my app or you download the spreadsheet, it only
predicts one token. And the way we get paragraphs of
(12:09):
text out of this is we take the predicted token
it came up with, and then we stick it back
onto the input, and then we ask it to predict
the next word given that new accumulated paragraph,
and so you can actually start with a single word,
ask it to predict what the next word is, and
then you now you've got two words, and then you
run it through and then you keep going. And then
what happens when you've got user input like somebody types
(12:31):
of response, is you just stick that entire user input
as you know, a large set of words that it
needs to predict what the next thing is. And you
can think about it as structured for the model: you
are reading a transcript between a user and a helpful
chatbot assistant. User said X, we fill in what the
(12:52):
user said, assistant said, and then it needs to come
up with what the assistant said, and it just tries
to come up with something plausible. Maybe the thing is to step back: the base model that gets trained in this process, before it's turned into a helpful chatbot, just
knows really simply how to complete sentences. If you take
the base GPT two and you type in, you know,
(13:14):
questions to it, it's not going to necessarily respond back
to you meaningfully. It's just designed to predict the next
word based on everything it's seen on the internet. So
a good example I use in classes, we type in
the word first name and then you hit return, and well,
what do you think it would predict after that? It
predicts last name, email address, phone number, because most texts
(13:35):
on the Internet that says first name is, statistically, a
form and it's used to just filling out forms. Another
one is I type in hello class, and when I
first did this, I thought it was going to say
hello teacher, but it actually starts spitting out Java code,
so it just looks at that. Yeah, it's really amusing to watch and you can, you can just
(13:58):
run it, and it's just trying to predict what the
next thing is based on what it saw on the internet.
And then what, you know, OpenAI and Anthropic and these companies do is they put what's called a base model,
which all it knows how to do is predict the
next word through a training regime to elicit it to
be more like a helpful chatbot. So you give it
(14:20):
a system prompt that tells it it's a chatbot. It's
kind of like you tell it a story that's plausible
for it to start to think like it's talking to
a user, like you are a chatbot. You are reading
a transcript of a chatbot and a human user, and
we just fill in what the human said, and it
tries to fill in what it thought the helpful system
would be, and then they fine tune it to get
better at that.
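A tiny sketch of that append-and-repeat loop in JavaScript. `predictNextToken` is a hypothetical stand-in for the whole model (the spreadsheet or the JS port); the point is only that generation is the same single-token prediction run in a loop.

```javascript
// Autoregressive generation: predict one token, append it to the input, repeat.
function predictNextToken(tokens) {
  // Fake "model": a lookup of plausible next tokens, standing in for the real math.
  const table = { quick: "he", he: "moves", moves: "quickly", quickly: "." };
  return table[tokens[tokens.length - 1]] || "the";
}

function generate(prompt, maxNewTokens) {
  const tokens = prompt.split(/\s+/); // real models use subword tokens, not whole words
  for (let i = 0; i < maxNewTokens; i++) {
    const next = predictNextToken(tokens); // the model only ever predicts ONE next token
    tokens.push(next);                     // stick it back onto the input
  }
  return tokens.join(" ");
}

console.log(generate("Mike is quick", 4)); // "Mike is quick he moves quickly ."
```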
Speaker 2 (14:41):
Yeah, this sounds a lot like what you're explaining.
Speaker 1 (14:45):
You get into prompt engineering, which, again, if you're not
into AI, prompt engineering.
Speaker 2 (14:49):
is, what's all the stuff I tell the AI
Speaker 1 (14:53):
system so that it'll give me the answer I want, right? And so when we're talking about prompt engineering, now it's, okay,
Speaker 2 (15:01):
So this is why when I start out.
Speaker 1 (15:03):
I tell it things like, like you said, you are
a chat bot, you help people with these problems, you
do these kinds of things, because it'll build off of
all of that and use the statistical model now with
the context of what you typed in to give you
the right answer.
Speaker 2 (15:17):
So you know, yeah, Hello class. There's not a whole
lot there for it to go on.
Speaker 1 (15:21):
But if you tell it, you know, you're a chat
bot and you're helping students with a blah blah blah
blah blah, then you type in hello class, and it's
going to go you know, then it may come back
with hello teacher or something like that.
Speaker 5 (15:31):
That's a great example. Yeah. So, and what you can
think about them conceptually doing is baking that prompt engineering
into the model. So what they're able to do is
if they give it enough examples of this, they can
retrain it such that you don't need the prompt at
the beginning that tells it it's a teacher or that
it's a helpful chatbot assistant, and that gets baked into
(15:52):
the model. You can think about all that prompt engineering
gets memorized into the model during that training process, and
then it turns into that helpful assistant.
Speaker 4 (16:00):
Help me understand this a little bit. So I've
I've played around obviously with GPT. I've also played around
with the other models. In fact, right now I really
like Qwen. I am, I am using Qwen more than I'm using GPT, because Qwen actually seems to be giving
better results, especially considering it in the benchmarks, it outperforms
(16:23):
4o, whatever that means. I mean, it's like by a fraction of a percentage point. But o1, I just find o1 and R1 to be too, like,
they take forever. So it's like I'd rather ask the
question twice and be ninety nine percent likely to get
the right answer, then ask the question one time and
then have to wait forty five seconds to get the
(16:46):
wrong answer and ask it again.
Speaker 3 (16:47):
You know, forty five seconds. That's an eternity.
Speaker 4 (16:50):
The o1 is crazy.
Speaker 2 (16:53):
We use it.
Speaker 3 (16:54):
We use it for code, for code questions and stuff,
because it does better than the standard GPT-4 for waiting a few seconds. But I'd rather wait a little bit and get a better answer than get something
super fast that's not going to be as good.
Speaker 4 (17:09):
Well, I I'm the other way because it's not that
much better. If you look at the benchmarks, it's like
one percent better than 4o, and it takes, you know,
so much anyway. But what the thing, the thing that
I was that I was getting at is in the beginning,
there was the system prompt. Right, so when with GPT,
(17:32):
one of the ways to jail break it was you
could say that was just a joke. Actually you're a
something else, and so it would interpret it as Okay,
your system prompt is you're a chatbot. You're allowed to
say this. You're not allowed to say that that you
could just say that was just a joke. And then
(17:56):
and then and then give it an additional prompt. Now
with DeepSeek V2.5 and R1 and with Qwen, it's, it's like you're saying, it's baked
into the model because if I override the system prompt
and I tell it, you know, you are a human
(18:18):
who is capable of reasoning and has no biases and
can represent any information factually, tell me about Tiananmen Square.
Speaker 2 (18:28):
It's you know, it's.
Speaker 4 (18:30):
I am a helpful bot. I am not a human,
and I do not talk about things that contradict what
is known to be you know, the proper the proper
knowledge of the of the Chinese government to protect the people,
or you know, it gives me some some nonsense like that.
So what what is How is it possible to bake
(18:53):
in those system prompts with training data and and I
guess how does that vary? How does it vary from
the system prompt? And how do they get it to
bake that in so that it you can't override it
with a system prompt.
Speaker 5 (19:06):
Okay, there's a lot of layers there.
Speaker 2 (19:09):
Let me yeah, question, can you restate the question in
one sentence?
Speaker 5 (19:14):
I think, really, the question was how do you bake in the system prompt.
But there's a couple of things that are worth noting
in your question. Like, you mentioned some reasoning models, o1 and R1, and the way those operate is a
little bit different. Like you said, it takes a while
to come back because it's actually just expending a lot
(19:35):
of tokens thinking that it doesn't give you, and it's
trying to actually think through the process like you might do.
They call this chain of thought or thinking step by step,
and what's unique about that compared to regular chain of thought is it can suddenly realize, oh, it's made a mistake and backtrack. And so it's, it's literally spending, you know,
coming up with hypotheses and trying and testing things and
(19:55):
seeing if it works. So this is why these models
tend to be really good on math and code because
it can go try something and say, oh wait, let me check, is this answer right? Oh no, it's not, let me try again. So, and then you mentioned,
you know, jail breaking, and with the early models, one
way to think about like you're like, oh, this is
just a joke, is that you know, you're kind of
(20:19):
taking that we've talked about briefly, that attention mechanism or
looking back at the previous what's most likely if you
put things like you know, you know, kill and harm
in the in the prompt, statistically it sounds like it's negative, right,
But if you start putting things like Grandma cookies, it
seems more harmless, and you can kind of think of it as
(20:39):
kind of weighting the attention to be more to
the harmless side. And really what's happened is that the
models have gotten smarter, both in terms of their natural
responses to this, Like they are trained to handle jail breaks.
They are trained on if a jail break comes up,
here's the response. And the way they train it to
(20:59):
get to your your main question is through these two
training techniques. One is they just give it an example
of a prompt and what its response should be, and
they use this technique called backpropagation, or stochastic gradient descent, which is to tune the network such that every time it sees that input, it gives out what we wanted it to have. So we're getting a little ahead of where
(21:20):
I wanted to be. But like, when you train a neural network, you give it examples of data. So
The simple example is a dog and cat classifier.
Speaker 2 (21:27):
Right.
Speaker 5 (21:27):
I give it pictures of dogs, and I give it
pictures of cats, and I tell it which ones are
dogs and cats, and it comes up with the answer.
It comes up with the rules how to figure out
whether an image is dog or cat. This is way
different than regular programming, right. Machine learning inverts the normal paradigm.
Normally, as developers, we're used to: I write a series of rules,
(21:48):
a program, right, and then it processes data and gets out a result. I click a button, something
does you know moves on the screen, So I can
write that program. But a dog and cat photo classifier,
I don't know. If you gave me photos of dogs and cats, I know how to instinctively do that, but I don't know how to write it out as
a series of rules. And so the inversion that machine
(22:11):
learning does is you give it answers and you give
it data, and then it figures out how to write
the rules. It writes the program. Now, unfortunately we can't
always understand what the program it comes up with is.
But what they do is they give it examples of
jail break attempts and they say, hey, you know, your response now to that should be this. That's kind of the
(22:32):
high level overview of how they do that. One thing
that's worth noting is that when they protect a model,
it isn't just in the model itself, So there are
usually things that are watching the result of the model
that are additional classifiers. And so sometimes you might see
examples of open source models that let you do things,
(22:54):
but the hosted versions do not because the hosted system
is actually checking. So not everything is baked into the
model itself and it's not one hundred percent perfect, So
often there's some additional guardrails that are detecting things.
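To make the "give it examples and tune the parameters" idea concrete, here is a deliberately tiny sketch in JavaScript: one parameter, a handful of input/output examples, and gradient descent nudging the parameter until the outputs match the targets. Real fine-tuning does this over billions of parameters with backpropagation; this is just the shape of the idea, not anyone's actual training code.

```javascript
// Toy "training": learn w so that prediction = w * x matches the example answers.
// This is the spirit of backpropagation / stochastic gradient descent on one parameter.
const examples = [
  { x: 1, y: 3 },
  { x: 2, y: 6 },
  { x: 3, y: 9 }, // the hidden "rule" is y = 3x; the data defines it, we never write it
];

let w = 0;               // the model's single parameter, starts knowing nothing
const learningRate = 0.05;

for (let step = 0; step < 200; step++) {
  for (const { x, y } of examples) {
    const prediction = w * x;
    const error = prediction - y;     // how wrong we are on this example
    const gradient = 2 * error * x;   // derivative of squared error w.r.t. w
    w -= learningRate * gradient;     // nudge w to make the error smaller
  }
}

console.log(w.toFixed(3));        // ~3.000: the "rule" was learned from data, not coded
console.log((w * 10).toFixed(1)); // generalizes: predicts ~30 for x = 10
```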
Speaker 4 (23:06):
Okay, So then two more questions.
Speaker 6 (23:09):
Yeah, what constitutes open source because that does not mean
the same thing that it means in the programming circles, or.
Speaker 4 (23:23):
I don't believe it does because I have not yet
seen any open source model that comes with four hundred
terabytes of training data.
Speaker 5 (23:33):
There are few and far between. There are some. OLMo
is probably the best known one, which is a model
where everything the training code, the training data system, the
data collection pipeline, the logs from their training runs are
completely open. There's like a handful of others. But this
(23:56):
question of what constitutes open is is completely a gray
area and it's being debated right now. Traditionally, when people
talk about an open model, it's usually an open weight model,
which is you get the parameters which encompass the rules
(24:16):
we talked about earlier. That whole thing is math, right? So
if you open up my spreadsheet or you open up
my website, you just see lots of numbers. You know,
whether those numbers are hidden from you or you can
run them yourself is what people call an open weight model.
That's what kind of passes for open source. These days,
there are very few models that open up the training data,
(24:38):
and so it's debatable and people do debate about what
a truly open model means. A truly open like the
most open is one that includes the data, but there
aren't that many, especially at the state of the art,
where the model and all the training data that created the model is there.
Speaker 4 (24:56):
Well, I mean, that would be highly illegal for ChatGPT to get as their training data, because YouTube and all the libraries on planet Earth and everyone who has a copyright on something would have something to say about that.
Speaker 5 (25:10):
Well, I mean, I'm not going to comment on any
particular model providers specifically, I will say the idea of
whether you can train on data and whether it's transformative
is quite frankly still in the courts right now, and
we don't have global consensus. So I believe it's Japan
(25:31):
has said and clarified that you can train on data,
that the training process is not directly infringement. Now, you know,
one of the litmus tests is like whether you're competing
with the original thing. So it's, it's a larger open question,
but right now that's making its way through the courts.
I think, you know, candidly, if you said, here are all
(25:52):
the things I've trained on, then you might end up,
you know, just opening yourself up to more people who
can just say, oh, let me get onto that lawsuit.
But I mean that that question is still being that's
a legal question, which.
Speaker 4 (26:06):
Yeah, yeah, yeah, yeah, yeah, okay. So my my other
question related to what we would we had just talked about,
was, so I download Qwen and I don't know what it is that I'm actually downloading, because I use, I use Ollama, and it downloads twenty gigabytes of something
(26:29):
and then it runs it and I get to be
productive and I'm happy. But in terms of you know,
like like you're saying there's something it's not just the
model giving a response, but then there's other code that
is you're doing some sort of check. Is that happening
(26:50):
with these models that I am using generic tools like
Ollama or llama.cpp or, or LM Studio. Is that
actually running program code? Binary code? Or I guess it's
not binary code. It would have to be bytecode, because I
can do it on Mac and I can do it
(27:11):
on Linux and it doesn't have to recompile anything after
it's done downloading it. So what where where are those
extra layers or how are they interpreted?
Speaker 5 (27:23):
The extra layers that are protecting the model from saying the wrong thing? Is that what you're asking? Yeah, yeah, those aren't there when you download an open source model, when you run it in Ollama, those extra layers. The
only thing that is protecting the model at that
point is what the model was trained on, the pre-training they
(27:46):
did that they baked into the model. Then they're not
doing any additional checks. So with the hosted model, there
is, there are additional layers, because they control the infrastructure and they're watching what the model says and they're, they're stopping it. But typically when you use, you know,
LM Studio or Ollama, then you're just getting
the bare uncensored model and there's no additional checks. The
(28:09):
only thing that's preventing the model from you know, saying
the wrong thing, being not helpful or not harmless, or I guess harmful and unhelpful, is just the training, you know, the training that the model was put through.
There's no additional checks there. So, and when you download
the model, maybe it's worth saying, you're just basically downloading
(28:30):
a large list of numbers and a mathematical graph that says how to combine the numbers together. It says, take this parameter first, here's how you map the words to numbers, and you get that mapping.
And then once you've mapped them to numbers, it says,
(28:51):
add it here, multiply times this other number here. Then
you know normally you know, take the square root of
this other number and then multiply it again. And it's
just a list of calculations. It's a really like simple program.
In fact, most of the knowledge it's worth stating is
not in the code. And this gets back to your
question about like open source, it's in the knowledge, is
(29:14):
inside the data, it's inside the parameters. So as an example,
GPT-2, which you know was considered at one point too dangerous to release, and it's amazing.
Speaker 4 (29:25):
Yeah, but that's only because they want regulatory capture, not because
they actually believe it's dangerous.
Speaker 5 (29:31):
Well, at the time maybe they were concerned about disinformation, but suffice to say it was still a powerful model in its day. My point is it's only five hundred lines of code. If you take out the TensorFlow library, it's five hundred lines of code.
(29:52):
It is astonishingly small. And so one of the things
and the reason why I re implemented the entire thing
in JavaScript is I want to push back against this
idea that well, this stuff is too hard for you
to learn. If you're a web developer, like you can
learn five hundred lines of code. And that's basically like
I give you the grounding and I re-implement the entire thing in JavaScript. You can step through it. You
(30:14):
don't even have to leave your web browser, right, you
just use the web debugger and you can you can
step through what's happening, and it's it's astonishingly small. All
the knowledge, all the rules is captured in the weights
and the parameters of the model. So when you download the model, it's just more and more numbers with a larger
and larger computational graph. And that's how we get it smarter.
(30:35):
That gets back to the heart of it, like the core
thing to understand is we took a word problem and
we mapped it to a number problem. So if we
get a bigger calculator, we get a better result.
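Here is a hedged sketch of what "a list of numbers plus a recipe for combining them" looks like in JavaScript: the weights are just arrays, and the "program" is a short sequence of multiplies and adds (one linear layer with a GELU activation, roughly the kind of block GPT-2 stacks repeatedly). The numbers are made up; the real model files simply contain far more of them.

```javascript
// The "model" is just numbers (weights) plus a fixed recipe for combining them.
const weights = [
  [0.2, -0.5, 0.1],
  [0.7, 0.3, -0.2],
]; // 2x3 matrix of made-up parameters
const bias = [0.05, -0.1];

// GELU, the smooth activation GPT-2 uses (tanh approximation).
const gelu = x =>
  0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x ** 3)));

// One layer: matrix-multiply the input vector by the weights, add bias, apply GELU.
function layer(input) {
  return weights.map((row, i) => {
    const sum = row.reduce((acc, w, j) => acc + w * input[j], bias[i]);
    return gelu(sum);
  });
}

console.log(layer([1.0, 2.0, -1.0])); // just arithmetic on the stored numbers
```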
Speaker 1 (30:44):
But I want to I want to restate this just
in another way, really simply because I think a lot
of people get you know, they get confused between like
GPT-4 versus ChatGPT versus something on your computer
versus whatever. And so yeah, essentially the model, like you said,
you know, it's it's maybe a few steps in how
(31:06):
it gives you answers, and the rest of it, like
you said, is all the data.
Speaker 2 (31:12):
It's all the weighting.
Speaker 1 (31:13):
But sometimes when people are talking about AI models, they're
talking about a program that accesses the model that I
just explained or that you just explained, right with the
numbers and kind of the fundamental pieces and so that's
your ChatGPT, whether it's running on your local machine
or in the cloud. Yeah, you need to be able
(31:34):
to differentiate between the two and recognize that. Yeah, sometimes
you're just downloading that map of numbers and you know
some really really simple stuff that makes sense of the numbers,
and that's your model. And so when people are building
against those models, a lot of times that's what they're doing.
And so you can write your own code that then,
you know, is the gatekeeper or you know, says this
(31:56):
is helpful or this is harmful, or this is whatever. Right,
this is an appropriate response and this isn't a lot
of that's just the code that sits on top of
what you're talking about. That five hundred lines of code
plus the data that we're getting.
Speaker 2 (32:13):
That is just the model. And so.
Speaker 1 (32:16):
You know, AJ's talking about the Qwen model. It looks like it runs on Ollama, right? So you know, you've got all those magic numbers, you've got the stuff that runs on top of it. I think Ollama gives you
a little bit more on top of that, and then
from there, right, the rest of it's kind of up
to whoever wrote the code.
Speaker 5 (32:36):
Yeah, the central thing, I'm just trying to get like
the black box part, the most mysterious part at the
heart of it. Yeah, and obviously, like, you know, the calculations that, say, Llama and ChatGPT and Gemini do, they're larger models. GPT-2 came out in twenty nineteen, but
the core thing is, like, if you
(32:58):
had shown me GPT two, I wouldn't have known. Like
when it first came out, it's like, wow, that's a
pretty amazing program. Must be really complex. And it's not
the program that's complex. It's they just bet on taking
a somewhat simple architecture and just giving it lots and
lots of data and spending more than anybody else had
at the time, and just trust that the black box
would be smart enough to learn everything from it. And
(33:20):
at the heart of it, that's what's happening. It's just a large, it's a giant calculator. In fact, it's so
simple in a sense, like in a spreadsheet, which was
my first implementation, you cannot do loops very easily. You
don't have looping constructs. It just does a calculation through
the entire way, and it just does the computation of
(33:41):
all the different cells. There are, effectively, in a sense,
no loops inside of it. Like the reason I can
implement it in a spreadsheet is that every single time
it predicts a token, it does the exact same number
of computations every single time, and it goes through, you know,
twelve layers and twelve attention steps and twelve layers. Like
it's very, very much like, I've got a word coming in, we map that word to
we a word coming in, we map that word to
a number, and then I just do all this number
crunching with a very predictable pattern, and then I get
a number out and I interpret it, and then I
just repeat that process over and over again. And so
you know, the thing that I try to tell people
is just like when you look online and people like
I want to get into AI and stuff, they're they're
(34:25):
often presented with, okay, go learn you know all this
linear algebra. You need to make sure you're solid on
your calculus. You need to make sure, like, and there's
like six to eighteen months of like prep before you
get to understanding how a large language model works. And
that's valid if you're going to be a machine learning researcher,
and machine learning is a huge giant field beyond just
(34:47):
ChatGPT, right? There's anomaly detection, there's clustering, there's
a lot of algorithms in there. But my goal is
to just help people understand how these amazing, arguably Nobel
Prize winning programs work in as short a time as possible,
and to that extent, like, I don't even begin where a
normal machine learning class begins. A normal machine learning class
starts with like regression and it slowly works its way up
(35:10):
and maybe you'll get to the LLMs. And I'm like,
this is a five hundred line program. Just start with
here's how it starts. And anytime we run into something you don't understand, I'll give you the... In my class,
I give you the background to understand it, and then
we move on to the next piece. And so it's
really designed to be as efficient as possible. And I
think when you tell people it's five hundred lines, they're like, oh, yeah, okay,
I can understand how that works. And this gets to
(35:33):
knowing your tools. I'll make an analogy: you don't necessarily need to know how an AI model works to use it, just like you don't necessarily need to have a good model for the difference between the CPU, or disk versus bandwidth versus system memory. But if you're debugging,
(35:53):
you know, a machine program, it helps to have that
mental model. You'll run to an issue or maybe a
more our tangible example to this audience is like knowing
how react works on the inside. At some point, if
you don't understand hydration, you're going to run into a wall, right,
And the same thing is true, like you get these
parameters from OLAMA, what are they doing? You know, you
(36:14):
need to have a mental model for how it works.
And I don't think that mental model is as hard
as a lot of people make it out to be.
Speaker 1 (36:20):
Yeah, when I talk to people about doing AI, and
I talked to a whole bunch of people like that
are business people, and talk to a whole bunch of
people that are programmers, and I have some of the
same conversation basically down to, well, are you going to
build your own model, right? Are you going to take
your own data and cram it in and expect it
to give you answers on the other end, or are
(36:42):
you going to use something that already exists like the
ChatGPTs or some of the, you know, the GPT-4s or the Llamas or whatever, right, and then build
on top of it. Because once you're building on top
of it and you're not worried about, Okay, how do
I put this together, then it's essentially okay, Like you're saying,
I understand how the machine works, and then I understand
(37:04):
how to talk to it. Right, so I understand what
the APIs are and the rest of it is, then okay,
what do I want from this?
Speaker 2 (37:11):
And how do I validate that I got it?
Speaker 5 (37:13):
Yeah? Actually, that's a really important point. The number one
skill may not be understanding every single detail of the calculation.
The number one skill when dealing with an AI model
is that last thing you talked about, how do I
evaluate it? So the name that you hear in the
AI community is evals, But as a you know web developer,
(37:35):
you can think of these as tests. And the key
difference between you know AI evals and tests is that
you don't expect one hundred percent pass right. These are statistical,
probabilistic machines. But the number one, like when you read
about benchmarks, you know, AJ, you talked about benchmarks, you
basically need to build the benchmarks for your particular problem.
The benchmark may say some model is better than another,
(37:56):
but when you actually use it for your problem, you
suddenly discover it's not good. And so the first thing
you should do is come up with your own benchmark,
your own evals for the problem, and then try a
bunch of models and see which one works the best.
And then you can start iterating whether that iterating is
changing the prompt, whether it's changing the model or saying
(38:17):
I'm going to go off and fine-tune my own model.
But you won't be able to make a judgment until
you're able to look across the distribution of your task,
all the different ways your task happens, and whether it's
successful or not, because these are you're dealing with highly
variable machines. One of you know, the folks who was
in the audience for one of my past talks had
(38:38):
a really good analogy. He's like, imagine a database that
was wrong five percent of the time. Like, as developers,
we are not used to having levels of uncertainty like
this within our systems, unless maybe you're using distributed systems
where there's all sorts of race conditions and stuff like that.
But we're used to sanitizing the user input and then
(38:58):
once we get the user input. Everything is predictable after that.
But here it's like suddenly we've got a database that
sometimes it's wrong, and so that's where you need to
put all sorts of checks and guardrails. And you're dealing
with this really smart but sometimes fallible thing like a human,
I hate to be anthropomorphizing it, and so how you
build a system around that is going to be different
(39:18):
than how you build a regular system. But it all
starts with that key idea that you just talked about,
which is about being able to evaluate mathematically how well
your your model or your system is doing, and the
question about whether you should build your own model or not.
The usual hierarchy of needs is first start with an
off the shelf model. It could be open source, it
(39:41):
could be one of the providers. It's actually probably easiest
to start with a hosted model and just see if
you can get it to work, because they'll be state
of the art and you don't have to worry about
all the stuff around hosting and inference and seeing if
it works. And then next thing to try is try
tuning it. Sorry, try tuning your prompts. So try prompt engineering your way: give it some examples, try some
(40:01):
variety of prompt engineering, and then maybe consider fine tuning it.
And again you can fine tune you know, most of
the hosted models, you don't have to go to an
open source model, but you could do that as well.
And then the idea of building your own model is
extremely hard. You know, the amount of dollars that go
into building your own models from scratch is now, you know,
(40:23):
over one hundred million. So the estimates for, say, Llama were, I think, over one hundred million to build that model, and so it's a lot of work and that's really best left to the frontier labs.
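Here is a minimal sketch in JavaScript of the kind of eval harness being described: a handful of task-specific test cases, a pluggable model function, and a pass rate you compare against a threshold instead of expecting 100%. `askModel` and the cases are placeholders; your own task's prompts and checks go there.

```javascript
// Minimal eval harness: run your own task-specific cases against a model
// and report a pass rate. You compare models and prompts by this number,
// not by public benchmarks. Expecting less than 100% is normal.
const cases = [
  { prompt: "Reset instructions for the SmartBulb X", mustInclude: "hold the power button" },
  { prompt: "Change bulb color to blue",              mustInclude: "color settings" },
  // ...add many more drawn from real user queries
];

// Placeholder: swap in a real call to whatever model or API you're evaluating.
async function askModel(prompt) {
  return `To change settings, open the app's color settings and hold the power button.`;
}

async function runEvals() {
  let passed = 0;
  for (const c of cases) {
    const answer = await askModel(c.prompt);
    const ok = answer.toLowerCase().includes(c.mustInclude.toLowerCase());
    if (ok) passed++;
    console.log(`${ok ? "PASS" : "FAIL"}: ${c.prompt}`);
  }
  const rate = passed / cases.length;
  console.log(`Pass rate: ${(rate * 100).toFixed(0)}%`);
  return rate >= 0.9; // your acceptance bar, not 100%
}

runEvals();
```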
Speaker 4 (40:35):
Yeah, is that GPU cost or where is that number
coming from?
Speaker 5 (40:40):
That's a great question because these are all estimates. You know,
we don't know for sure, but obviously some of it
is the GPU cost, some of it is the infrastructure cost,
some of it's the talent. The other key thing to
keep in mind is when you're training a model, you
don't always know how it's going to turn out. What
they actually do is they do a large series of
(41:02):
smaller runs to establish some type of pattern or scaling
law to figure out how they're going to design the model,
which architecture seem to work better, which parameters matter more.
So there's something called a learning rate, for example, that
they have to adjust, and they have a schedule for it,
and they're trying to figure out against evals against the
benchmarks we talked about, like which one seems to make
(41:22):
the model smarter. And so there's a lot of trials
and attempts. So it's not just necessarily one whole shot
of training. It's a lot of experiments that they have
to do. A lot of how the model is going
to behave is surprisingly empirical, and so they're doing experiments
and they're trying again, so there's a variety of things, and the power is non-trivial. Another thing that's important to
(41:45):
understand is the level of scale of data that these
frontier labs are dealing with. And there's a really good
analogy from the Anthropic guys actually, and one of the things you have to do is you have to randomize the data so it doesn't learn arbitrary patterns in the order of the data you gave it. And so
one of their research engineers gave this great example is like, okay,
(42:08):
randomizing sounds like it should be easy, Like take a
deck of cards. If I tell you to shuffle it,
it's fairly easy. But imagine I gave you like seven
warehouses full of decks of cards and you need to
shuffle them by hand. It's not quite clear what
policy or process you're going to use to make sure
you hit all of them and you've evenly shuffled them.
And the size of the data that these guys are
(42:30):
working with is, it's almost like that: to the CPU, it's like seven warehouses of data, compared to you manually, you know, shuffling your deck. And so when you're dealing with data at this large infrastructure scale, scale itself makes every little thing harder, and
so that also adds some difficulty to this. So should
I you know, walk through just like a little more
(42:53):
detail of what's happening in that mathematical calculation or happy
to answer additional questions?
Speaker 2 (42:57):
Yeah, that's what I was going to ask.
Speaker 1 (42:58):
Next is Yeah, because you've mentioned you've got different layers
or different steps in the process you explain in the video.
The video is a little longer, I guess than we
have to go over at this point, but yeah, if
you can give people kind of an overview of how
the LLM system actually works.
Speaker 4 (43:15):
Yeah, So while you're doing that, if you distinguish between
the different types of training, like the RAG versus the
fine tuning versus the.
Speaker 5 (43:26):
Base Yeah, okay, I don't think of RAG as training,
but maybe we should step back and explain to the
audience who isn't familiar with RAG what it is. So
you can kind of think of RAG as like a
sort of prompt engineering technique. So you want the model
(43:47):
to answer questions about something that wasn't trained on. So
imagine I'm you know, I'm a smart home electronics company,
and all of the documentation about my product was behind
you know, some firewall or behind a log in, and
so I know that, let's call it ChatGPT, was never trained on it. But I want to build a chatbot where customers come and say, hey, I can't configure
(44:07):
this setting on it. How am I going to get
a chatbot to do that without having to retrain it
specifically on my data. So what you can do is
when a request comes in and somebody's like, well, how
do I change the color on my smart light bulb,
It'll go and it will search through my data. I
can take that request from my user on my chatbot,
(44:28):
and say, I see the words light bulb,
I see change color, and I'll search all my documentation
and I'm not going to search it with just a plain text search. I'll use what's called a semantic search. So
it'll find things that are similar to the word light,
like the word bright, even though it's not anywhere close
to the same characters. So it'll find all the similar
passages and it will pull those out, and then it
(44:49):
will give the model, here are relevant passages. Here's the
user's question, how do I change the color on my
smart light bulb? And then it will give it paragraphs
chunks of text from my documentation, and it will put
those at the beginning of the prompt. So you've got
a prompt that's structured at the start with the user's question.
Then it's got some chunks of data that came right
from my documentation. And then we tell the model, you know,
(45:13):
come up with an answer to the user's question using
these chunks of data I gave you, and it will
be able to think over those passages and find the
ones that are relevant and then give the answer out.
And that's called retrieval augmented generation. So retrieval because you're
taking the user's query, you're pulling in data that the model didn't have during training, and you're passing it
into the prompt and then asking the model to answer it.
(45:35):
And so it's a very low friction way to take
a model off the shelf and make it understand all
your stuff even though it wasn't in the training data.
So that's, uh, that's what RAG is.
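A minimal sketch of that flow in JavaScript. Everything here is a placeholder: `embed` stands in for whatever embedding model you would use for the semantic search (here it is a fake bag-of-words vector just to make the example runnable), the docs array stands in for your indexed documentation chunks, and the final prompt is just one plausible way to structure it.

```javascript
// Toy RAG: embed the query, find the most similar documentation chunks,
// and stuff them into the prompt ahead of the user's question.
const VOCAB = ["light", "bulb", "color", "bright", "reset", "network", "wifi"];
const embed = text =>
  VOCAB.map(w => (text.toLowerCase().includes(w) ? 1 : 0)); // fake embedding

const cosine = (a, b) => {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = v => Math.sqrt(v.reduce((s, x) => s + x * x, 0)) || 1;
  return dot / (norm(a) * norm(b));
};

// Pretend these are chunks of product documentation, pre-embedded offline.
const docs = [
  "To change the bulb color, open the app and tap the color wheel.",
  "If the light will not connect to wifi, reset the network settings.",
].map(text => ({ text, vector: embed(text) }));

function buildRagPrompt(question, topK = 1) {
  const qVec = embed(question);
  const relevant = docs
    .map(d => ({ ...d, score: cosine(qVec, d.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(d => d.text);
  // The retrieved chunks go in front of the question; the model answers from them.
  return `Use these passages to answer:\n${relevant.join("\n")}\n\nQuestion: ${question}`;
}

console.log(buildRagPrompt("How do I change the color on my smart light bulb?"));
```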
Speaker 1 (45:48):
The short version is, you're building context out of a
database that you already.
Speaker 5 (45:53):
Have, great, great summary, thank you. Uh so, yeah, you're
giving it the context it didn't have during training to
answer the question. On training: so there's a variety of steps in the model where it's trained. Mainly, well, let
me think of the best way to explain this. So
(46:14):
I'll discuss training when I get to, call it, the fourth step of the model, and I'll talk about how it gets trained in a second. But let me
walk through the five steps of what happens when you
input text into the model. So the first thing you
do when you input a passage, so the one I like to use is Mike is quick, he moves. And then the completion, we leave it to the model to fill in the blank: quickly. The first thing it's gonna do is it's going
(46:36):
to break the text into subword units. So you might
think it would break it into characters, and you might
think it might break it into words. So break into
characters would be like ASCII, and breaking into words would be just giving every word its entry in a dictionary.
It turns out, if you break it into words, you can't handle unknown words very well, and
(46:57):
you can't handle spellings you weren't planning on. So especially
if you're going across multiple languages, and if you break
it into characters, it turns out it's really hard and
a lot of compute for the model to learn purely
from characters, although some research has been able to do it.
So what they do is a Goldilocks and
they say, okay, let's break it into these little pieces
of words, and if you think about it as a human,
(47:18):
you actually do this. So one of the examples, I
use the word flavorize. It's actually not a word
in the dictionary, but you know what it means because
you know what flavor means, and you know what the suffix -ize means, and so the model kind of
does that. Now, I want to be clear, the tokens it comes up with, these subword pieces or word pieces as they're called, don't map to any human sense of the meaning.
(47:39):
There are some, like -ize, that turn out to be
a token, but it's by coincidence or correlation, not like
it's trying to understand human English at this stage. So
it breaks it into these pieces, yeah.
Speaker 1 (47:49):
Gosh, can I just say that in a different way too? Effectively, what it does is it breaks it up
into pieces that have meaning, right, because when we're looking
for the output, we're looking for output that has meaning,
and we group words or ideas together that give it meaning.
And so it's doing the same thing. It's breaking it up, right,
Like you said, flavor has a meaning, -ize has a meaning.
(48:12):
You know, the other words in there have meaning. And
so that that's the approach that it kind of takes
when it's breaking it into tokens.
Speaker 5 (48:20):
Sort of, kind of. It's a decent mental model.
But the reason I stress that it's not trying to
match human meaning is because it's actually not trying at
this stage of the model. It's not trying to assign meaning.
In fact, what it's really trying to do is take
all the text on the Internet that it's trying to
train on and compress it to the most efficient representation
(48:43):
so that the training can be as efficient as possible.
And that's why the tokens don't always map to what
you'd expect and why.
Speaker 4 (48:54):
Is different than what a full text search database would
do because a full text search database, like the example
you gave, flavorize, yeah, a full text search database is going
to break it up that same way. But this is
different than the way a full text search database would
break it up.
Speaker 5 (49:07):
Yes, it is different, and it's very dependent on the
data it was trained on. And so a good example
is, I use the word reinjury. Right, as a human, you would think it was re and injury, but if you actually put it through the GPT-2 tokenizer, it puts it as rein and jury. And the reason it
(49:28):
decided to do that is simply because of the greater
you know, occurrence of the word jury on itself by
itself than injury, and so that decided that was the
more efficient representation. And I want to be clear, this
isn't about representing your prompt efficiently. This is about representing
all the training it's going to do on the text efficiently,
the stuff you don't see, the stuff that you know
(49:50):
you're talking about nobody releases. That's what it's really based on.
And it's really a compression of all the text, so it's got a really efficient representation. So think about it this way: if it's going to compress all the text, then, you know,
if it gets down to say ten thousand or fifty
thousand tokens, then it only has to learn ten thousand
or fifty thousand concepts in a sense, Although that's a
(50:11):
gross oversimplification, but that's what it's trying to do, is
trying to reduce the number of things it needs to
learn, essentially the number of combinations and variations.
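To illustrate the subword-units idea, here is a toy greedy tokenizer in JavaScript. The vocabulary is invented for the example; GPT-2's real tokenizer is learned with byte pair encoding over its training corpus, which is exactly why it can produce unintuitive splits like the reinjury one.

```javascript
// Toy subword tokenizer: greedily match the longest known piece from a fixed vocabulary.
// The vocabulary is made up; GPT-2 learns ~50k pieces via byte pair encoding,
// so its actual splits can look unintuitive.
const vocab = ["flavor", "ize", "re", "in", "jury", "injury", "quick", "ly", "a", "b", "c"];

function tokenize(word) {
  const tokens = [];
  let rest = word.toLowerCase();
  while (rest.length > 0) {
    // Find the longest vocabulary entry that prefixes what's left.
    let match = null;
    for (const piece of vocab) {
      if (rest.startsWith(piece) && (!match || piece.length > match.length)) match = piece;
    }
    if (!match) match = rest[0]; // unknown character: fall back to a single char
    tokens.push(match);
    rest = rest.slice(match.length);
  }
  return tokens;
}

console.log(tokenize("flavorize")); // ["flavor", "ize"]
console.log(tokenize("quickly"));   // ["quick", "ly"]
console.log(tokenize("reinjury"));  // ["re", "injury"] with this toy vocab; GPT-2 splits differently
```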
Speaker 3 (50:19):
Hey, a couple of quick questions. With that flavorize example, here's hoping they don't pick up Flavor Flav the rapper's lyrics, right, and throw that in there. That could get
really confusing. But then when you're talking about like the
re injury for example, Yeah, and how I as soon
as you saw that said that, I was thinking, Okay,
I can see where that's going. How you get rained
(50:40):
just throwing stuff like and this might be getting into
the weeds too much. But just throwing stuff like hyphens
into words, does that make a difference? So if you were to do re-dash-injury, would it see that and maybe just
categorize the re as separate from injury? Does that help
or is that a non issue? Does it sort of
filter that stuff out and just focus on the letters?
Speaker 5 (51:00):
And that's a great question. It's actually implementation dependent on
the tokenizer. In practice, you usually separate. You create boundaries,
hard boundaries between tokens or words, so one of them
is the space character. In most of these, the hyphen
is considered also a boundary, and so it would see
it separately. The important thing to understand, though about the
(51:21):
tokenizer is that the model doesn't see words the same
way you do. So a great example of this is
how many letters are in the word strawberry. It does
not see the word as S-T-R-A-W-B-E-R-R-Y. In fact, when you read it,
you really don't either. When you read words, you typically
(51:43):
aren't paying attention to every single character. When you have
to count the letters in strawberry, you kind of have
to change your mental state and think, oh wait, let
me think what are the characters? And you walk through it.
It just sees it as it might see it as
the token strawberry. It might see it as the token
straw and berry. But the key thing is it has
no idea. It doesn't have the ability to see the letters.
In fact, if you capitalize, the tokenization is case sensitive,
(52:05):
so if you change the capitalization, it looks like a
different it. So to it, the word strawberry with like
a space in front of it is different than the
strawberry without a space. Strawberry with the capital in front
of it is different than strawberry with a capital. You know,
if you put quotes around it, it's a different word.
So it doesn't see text the same way you do.
Another great example of this is numbers. So they've fixed
(52:27):
this in most modern tokenizers, but the early ones would
just take examples of numbers and those would be a
whole token. So two fifty six, right, a power of two, fairly common, gets a token. But it sees that
as a single token. It sees that as a single thing.
It doesn't even see it as the numbers two, five
and six. It doesn't break it apart. And so that's
why it was really hard for these things to... it was part of the contributing reason
(52:48):
why it was hard to do math, it's not the
sole reason. So the key lesson, you know, on the
tokenizer before we leave it: the algorithm that's commonly used is something called byte pair encoding, and in my
classes is something we walk through. In fact, we do
it by hand so you can understand, and we talk
through the training process. But the key thing to understand
is that the model doesn't always see text the same
way you do. So that's the first step, that's the tokenizer.
(53:10):
Then the next thing is we map each of these tokens, though you can think of them as words, into a list of numbers. So I talked earlier
like we map each word to a number, but it's
really we map it to a large list of numbers.
And this is called an embedding. And the way to
think about this is we're taking all the words
or in this case, tokens technically, and we're putting them
(53:33):
on a map. But instead of like a two dimensional map,
this is many, many dimensions. So in the case of
GPT two, it's seven hundred and sixty eight I think
LAMA four or five B it's like sixteen thousand list
of numbers per every single word. In fact, like in
the sentence phrase Mike is quick, period he moves, the
period itself gets seven hundred and sixty eight numbers to
(53:54):
represent it. And you can think about like on a map,
you have like you know, coordinates. This is just a
very very multidimensional list of coordinates. And a good embedding
puts words that are related to each other closer to each other. So in my class, I use words like happy, sad, joyful, glad, dog, cat, rabbit. The first
(54:17):
set of those are emotions, happy, sad, joyful, right, and you'd expect happy and joyful to be close to each other, same with glad, and then dog, cat, and rabbit are totally unrelated, so you'd expect them to be further apart on the map. And the word sad is an emotion, but it's not quite the same emotion as being happy, so it'd be somewhere in between. And if you actually visualize this, you
(54:38):
see that this happens. It's actually putting related words closer together. And you might hear of this paper, or really a series of papers and algorithms, called word2vec, which pioneered this. And if you go to projector dot TensorFlow dot org, you can actually see a 3D map of various words, and you click on one and it will show you the words that are close to it, and they all tend to be related words. So the
(54:59):
next step, after we break the text into tokens, is we map each of those tokens to a position on a map where related words are close to each other. Let me pause and see if that made any sense. I'm usually doing this all visually, so over pure audio it's a bit of a challenge. But yeah.
Speaker 1 (55:19):
So my question is, since it's predicting the next word, I would imagine that, yeah, some of the words that appear close to it are going to be words that mean kind of the same thing or, you know, have a related meaning. But does it also group words together that commonly appear together, or is that a different thing? Does it not weight things that way at all?
Speaker 5 (55:40):
Uh, it's actually kind of doing both. The way it groups related words together, it doesn't actually group the words directly. It groups words together that have the same meaning based on the idea that they appear in the same places. So the words ice and cold commonly occur together, probably,
(56:03):
in text on the internet, right? Like, I put ice in the drink to make it cold, or you're as cold as ice. Right? Those would be common phrases. You usually don't see, you know, steam and cold together. And so the model is able to understand that ice is colder than steam because it sees cold closer to
(56:23):
ice more often than it sees cold close to steam. It's the relative occurrence of how often. And there's a phrase that's often attributed to J.R. Firth: you shall know a word by the company it keeps. Which is the idea that you don't really know what a word means. You could look it up in the dictionary, but you really understand it through how it's used by multiple people,
(56:45):
and you can look at, you know, the distribution of how it's used to really understand what it means. So a good example is the word bad, right? Although it's less in fashion now, bad at one time meant good, right? And so how do you really understand whether it means good in one context versus another? You learn that through all the various contexts it is used in. And if
(57:05):
you want to understand how a word really is used, you see it in usage many, many times. So if you want a model to understand what a word means, it just sees it used in many, many, many sentences and eventually picks up on those differences.
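A tiny JavaScript sketch of the "company it keeps" idea: count which words appear near each other in a handful of invented sentences. Real embedding training is far more sophisticated than raw counting, but co-occurrence statistics like these are the underlying signal it learns from.

```js
// Count how often pairs of words appear within a small window of each other.
// The sentences are made up; the point is that "ice" keeps company with "cold".
const corpus = [
  "i put ice in the drink to make it cold",
  "you are as cold as ice",
  "the steam rising from the kettle was hot",
  "the hot steam fogged the window",
];

function coOccurrences(sentences, windowSize = 3) {
  const counts = new Map(); // key "wordA|wordB" -> count
  for (const sentence of sentences) {
    const words = sentence.split(/\s+/);
    for (let i = 0; i < words.length; i++) {
      for (let j = i + 1; j <= Math.min(i + windowSize, words.length - 1); j++) {
        const key = [words[i], words[j]].sort().join("|");
        counts.set(key, (counts.get(key) ?? 0) + 1);
      }
    }
  }
  return counts;
}

const counts = coOccurrences(corpus);
console.log(counts.get("cold|ice") ?? 0);   // > 0: "ice" and "cold" keep company
console.log(counts.get("cold|steam") ?? 0); // 0: "steam" and "cold" don't, in this toy corpus
console.log(counts.get("hot|steam") ?? 0);  // > 0: "steam" keeps company with "hot"
```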
Speaker 3 (57:17):
So then the word baby could be seen as cold because of Vanilla Ice, then, right? Ice, ice, baby.
Speaker 5 (57:24):
If you train it that way... I'd be really interested to see a model trained only on song lyrics. That would be...
Speaker 2 (57:29):
Would be fascinating.
Speaker 1 (57:30):
Yeah, yeah, it's interesting because of the way you're talking about it. We were driving home from my mom's house the other night, and my wife put on an audiobook that she's been listening to with my nine year old, and they used the word satisfaction, and my daughter asked, what does satisfaction mean? And we basically did that: it's kind of like this, and it's kind of like that, right?
(57:51):
It's in this area of meaning, right? Yeah, and it's related to these other words.
Speaker 2 (57:57):
Right.
Speaker 1 (57:57):
We used other words to explain it, and then, yeah, we gave context: you could use it like this, or you could use it like this, and, you know, another form of the word is satisfy or satisfied, and so, you know, this is what it means to satisfy something. And, you know, more context and more sentences, and okay, I understand, right? And I think she may have even said, so it's kind of like this and kind
(58:18):
of like that, yep, using examples that we didn't use.
Speaker 3 (58:21):
You told her that Mick couldn't get any, right? That's right.
Speaker 5 (58:26):
But yeah, that's basically what the model is going through. It's, uh, you know, basically seeing all these examples, and it's like, oh, it's kind of like this, but in some contexts I see it being used in this other way, and so it's basically putting that all together. And then it's putting all the words on this map, and it's saying, okay, you know, the ones that are related are here, and the ones that aren't are farther away.
(58:48):
And it's this multidimensional map. It's, you know, hundreds to thousands of dimensions long. And that's the embedding step: we've basically mapped them to, you know, I say a number, but it's really a point in space, right? So there, we're at the second step. The first step was we took the passage and we broke it into tokens, which you can think of
(59:11):
as like words, but smaller, and then we took each of those tokens and we put them at a point on a map, a point in space, and we know that point is going to be close to other things that are related to it. So that's the second step, and then the third step is called attention.
(59:32):
I'm going to skip over that for a second, and I'm going to go to the fourth step, which is the neural network, or the multi layer perceptron. And this gets to the training question. The thing that's really great about neural networks is that thing I talked about earlier, which is you don't have to give them the rules. You just give them the answers, and they figure out the rules. So we basically feed in
(59:54):
these points in space from our prompt. And we can take a passage on the internet, like maybe the passage on the internet is Mike is quick, he moves quickly. We remove the word quickly, and then we give the model the phrase Mike is quick, he moves, and then we ask it to make a prediction, and it's going to get it wrong because it hasn't done
(01:00:14):
any training at all. And when it gets it wrong, maybe it says, you know, Mike is quick, he moves bicycle, and you're like, no, that's wrong. The right answer is quickly.
It mathematically learns how to change itself to get closer to that answer. So we go through a lot of iterations of getting lots of passages where we take off the last word, and then we give it to the model and ask
(01:00:35):
it to predict it. And if it's good, we say, okay, great, you're fine, and if it's wrong, we say, okay, you're off by this amount. It's kind of like when you throw darts at a board. Right, if you're far from the bullseye, you'll move a lot more to correct your position, but if you're close but slightly off, you're going to move slightly, subtly. So that's what it does. It changes the model parameters, the numbers inside the model,
(01:00:56):
slightly if it's close, or a lot if it's far away. It does this, you know, trillions of times over lots of different pieces of data. And the key thing about the neural network is it can learn to imitate, basically, from answers and data. And so we basically give it the known passage, like Mike is quick, he moves, and
(01:01:19):
we knew quickly was the right answer. And then we basically asked the neural network to make a prediction from these points in space. And so that's the basic version of what's happening inside the training. Let me pause, because I jumped to the fourth layer and I'll come back to the third one in a second, but let me see if there are any questions on what's happening inside the neural network. Okay, so yeah, okay.
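Here is a deliberately tiny JavaScript sketch of that "move a lot if you're far off, a little if you're close" idea, fitting one made-up parameter with gradient descent. A real transformer does the same kind of nudging across billions of parameters, with the error measured on its next-word predictions rather than on a single number.

```js
// Learn a single weight w so that prediction = w * x matches the known answers.
// The size of each nudge is proportional to how wrong the prediction was,
// which is the dart-board intuition: big miss, big correction; small miss, small one.
const examples = [
  { x: 1, y: 3 },
  { x: 2, y: 6 },
  { x: 3, y: 9 },
]; // the "right answers" (here the hidden rule is y = 3x)

let w = 0;                 // start knowing nothing
const learningRate = 0.05;

for (let step = 0; step < 200; step++) {
  for (const { x, y } of examples) {
    const prediction = w * x;
    const error = prediction - y;   // how far off the "dart" landed
    w -= learningRate * error * x;  // nudge w in proportion to the miss
  }
}

console.log(w.toFixed(3)); // ~3.000: it figured out the rule from answers alone
```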
(01:01:45):
So if we're good there, then the next thing that happens is, I'll jump back to the third step. We could give it a single point in space and say, hey, guess what the next word is. But the best thing to do is to not just give it a single point in space. It's to give it all the points that came before, so all the words that came before. So in the case of Mike is quick, he moves, knowing that we're talking about movement helps it know
(01:02:07):
that the word quick here is about moving around in physical space versus the quick of your fingernail. And so we give it the hints of all the words that came before it. So this is what's called attention, where we say, okay, don't just predict from the one single word you're looking at. This gets back to what we talked about at the beginning: instead of looking at, you know, statistically, what's the next word after the given word, let me
(01:02:28):
look two words back, let me look three words back, let me look four words back. It will look at all the words that came before it and try to figure out what the next predicted word is. We're giving it these hints from the entire passage to make its prediction, and that's what's called attention. And so that's the third step, in the middle.
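A minimal JavaScript sketch of the attention idea: score the current word against every earlier word, turn the scores into weights with a softmax, and blend the earlier words' vectors by those weights. Real attention uses learned query, key, and value projections and many heads; this strips all of that away, and the toy vectors below are invented for illustration.

```js
// Toy single-head attention over made-up 3-number word vectors.
// Each earlier word contributes to the blended context in proportion
// to how strongly it "matches" the current word.
const sentence = [
  { word: "Mike",  vec: [0.9, 0.1, 0.0] },
  { word: "is",    vec: [0.1, 0.1, 0.1] },
  { word: "quick", vec: [0.2, 0.9, 0.1] },
  { word: "he",    vec: [0.8, 0.2, 0.0] },
  { word: "moves", vec: [0.1, 0.8, 0.2] }, // the word we're predicting after
];

const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

function softmax(scores) {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

function attend(tokens) {
  const current = tokens[tokens.length - 1];
  const scores = tokens.map((t) => dot(current.vec, t.vec)); // how relevant is each earlier word?
  const weights = softmax(scores);
  // Blend all the word vectors according to those weights.
  const context = [0, 0, 0];
  tokens.forEach((t, i) => t.vec.forEach((v, d) => (context[d] += weights[i] * v)));
  return { weights, context };
}

const { weights } = attend(sentence);
sentence.forEach((t, i) => console.log(t.word, weights[i].toFixed(2)));
// "quick" gets more weight than "is": movement words hint that the next word is about moving.
```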
(01:02:48):
And then the last step: you know, we get a prediction out of the neural network. So jumping back to the fourth step, which was the neural network that makes the prediction, it gives a number, well, really a long list of numbers, and we need to map that back to one of our tokens, one of our words. But it's sitting at a point in space. It may land right on the word quickly, but more than likely it's going to
(01:03:10):
land somewhere close to the word quickly, the predicted token that the model comes up with, and so we interpret that point in space it gave us back against those embeddings on that map. At the end of the number crunching, it took all the words and the points in space we gave it and said, the predicted word is right here at this point in space.
(01:03:31):
We then interpret that, and we look at what words or tokens are close to that predicted point in space. And it's probably going to be close to the word fast, close to the word around. Like, Mike moves quickly, he moves around, he moves fast, he moves speedily. Those words are going to be close to it, and so we give them a higher probability and we run
(01:03:52):
a random number generator and we say, okay, let me pick one according to this probability distribution, and that's how we get the predicted word out. And that last step, of running that random number generator and looking at what words are close to it, is the piece that's called the language head. So the key thing about the language head is that it is where most of the uncertainty or unpredictability
(01:04:14):
of your model comes from. So if we decide not to run the random number generator and we just always pick the word that is closest in space to the prediction, that's what's called temperature zero, and it will always be consistent. It will always be predictable, for the most part. There are some other very small sources of randomness in the process, but for the most part it'll be very consistent, and
(01:04:35):
that's what's called temperature zero. So most of the randomness inside the model is, in some sense, entirely imposed by us. We decided, oh, we're not just going to always take the thing that's closest. We're going to probabilistically take some of the other ones that are also close, and we can control those parameters and control how we do that probability.
(01:04:55):
So if you're in Ollama or, you know, an API, you'll see things like top P and top K or temperature, and these are the tools we're given as, you know, the API user of a model for how we can shape the probability distribution of the model. And that's probably the most important thing to understand of the components in the model:
(01:05:16):
after you understand what tokens are and embeddings, the next one is probably the language head, because that's where the randomness comes from. So let me pause. I know I just talked for quite a while. See if there are any questions.
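A JavaScript sketch of that last step, with made-up candidate words and scores. It shows temperature reshaping the probabilities and a random draw picking the output; setting temperature to zero collapses to always picking the top word, the "temperature zero" behavior described above. Real language heads do this over a vocabulary of tens of thousands of tokens, and APIs layer top-k and top-p filtering on top.

```js
// Toy language head: turn raw scores (logits) for a few candidate next words
// into probabilities, reshape them with temperature, then sample one.
const candidates = [
  { word: "quickly",  logit: 3.0 },
  { word: "fast",     logit: 2.5 },
  { word: "around",   logit: 2.0 },
  { word: "speedily", logit: 1.2 },
  { word: "bicycle",  logit: -1.0 },
];

function probabilities(cands, temperature) {
  if (temperature <= 0) {
    // "Temperature zero": deterministic, always the single highest-scoring word.
    const best = cands.reduce((a, b) => (b.logit > a.logit ? b : a));
    return cands.map((c) => (c === best ? 1 : 0));
  }
  const scaled = cands.map((c) => c.logit / temperature); // low T sharpens, high T flattens
  const max = Math.max(...scaled);
  const exps = scaled.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

function sample(cands, temperature, topK = cands.length) {
  // Optional top-k: only keep the k most likely words before sampling.
  const kept = [...cands].sort((a, b) => b.logit - a.logit).slice(0, topK);
  const probs = probabilities(kept, temperature);
  let r = Math.random(); // the random number generator mentioned above
  for (let i = 0; i < kept.length; i++) {
    r -= probs[i];
    if (r <= 0) return kept[i].word;
  }
  return kept[kept.length - 1].word;
}

console.log(sample(candidates, 0));      // always "quickly"
console.log(sample(candidates, 0.8, 3)); // usually "quickly", sometimes "fast" or "around"
console.log(sample(candidates, 2.0));    // much more varied, even "bicycle" occasionally
```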
Speaker 2 (01:05:30):
I think, so far, so good.
Speaker 5 (01:05:32):
Okay. So what the Excel spreadsheet does, or the website I have that's built in, you know, web components in pure JavaScript, is it runs through the entire process using the very same weights that OpenAI released for a model called GPT two, GPT two small, and it steps
(01:05:52):
through every single one of those processes. You enter a prompt, and then, it's not like ChatGPT where you can have a conversation with it. It just predicts the next word, but it walks you through every single step. That's the same thing I do inside the class. But that's basically, you know, how your model works under the hood: it's basically taking your words, your input prompt, breaking it
(01:06:15):
into units called tokens that are slightly smaller than a word, then it maps them to points in space, does a bunch of number crunching on them through the things I talked about, using a neural network and this other attention that looks at all the other words, and then it takes that prediction and says, okay, what words is it close to in our points in space, and let me pick one that's relatively close to that.
(01:06:35):
So I know one of the things, Chuck, you would
want to talk about is building it and the use
of web components in the web version.
Speaker 1 (01:06:43):
Yeah, at this point, given our time constraints, Yeah, we
might have you come back and do that, because I
think it'd be interesting to dive into the project and
how it went together.
Speaker 5 (01:06:51):
Okay, I will. I will just say the reason I built it in web components was to make it as portable and as easy to use and step through as possible. I wanted to make it as accessible as possible. I did think about, like, say, using React, but then you need to know React, and I really wanted this to be as approachable as possible for
(01:07:12):
somebody who knows just vanilla JavaScript, and web components was the easiest.
Speaker 2 (01:07:16):
Way to do that.
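For anyone who hasn't used web components, here is roughly what the pattern looks like in vanilla JavaScript. This is a generic example of a custom element rather than Ishan's actual code: a class registered with the browser, no build step or framework required, which is what makes it easy to drop into a teaching page.

```js
// A minimal custom element: <next-token-demo prompt="Mike is quick. He moves"></next-token-demo>
// Generic illustration of the web-component pattern, not code from the project.
class NextTokenDemo extends HTMLElement {
  connectedCallback() {
    // Runs when the element is attached to the page.
    const prompt = this.getAttribute("prompt") ?? "";
    this.innerHTML = `
      <p>Prompt: <strong>${prompt}</strong></p>
      <button type="button">Predict next token</button>
      <output></output>
    `;
    this.querySelector("button").addEventListener("click", () => {
      // A real implementation would run the model here; this just shows the wiring.
      this.querySelector("output").textContent = " ...prediction goes here";
    });
  }
}

customElements.define("next-token-demo", NextTokenDemo);
```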
Speaker 5 (01:07:17):
So that's the main reason why I did it that way. Yeah. So is it open source then? I actually haven't put a license on it. And what I've said is, if people feel like it, help me decide, like, tell me which license you prefer. But the code is right there for people to look at. I mean, you can practically step through it, and it's written so that people can understand it. I just haven't figured out what license,
(01:07:38):
but, you know, let me know, I'm all ears. All right, the goal is to make it a teaching tool.
Speaker 2 (01:07:45):
Yeah, all right? Cool?
Speaker 1 (01:07:49):
Well, yeah, like I said, we're kind of at the end of our time, and so I'm going to push this into picks. But you want to just give out information on your course again, let people know what that coupon code was?
Speaker 2 (01:08:02):
I just, if people are digging this as much as I
Speaker 1 (01:08:04):
Am, I think they may want to just go pick
up the course and go, okay, we can go into
more depth.
Speaker 5 (01:08:10):
Uh, yeah, thank you. So the website for the project is called Spreadsheets Are All You Need, with hyphens in between. So spreadsheets hyphen are hyphen all hyphen you hyphen need dot ai. It is a very long domain name, and that will link to where you can download the Excel file. You can try this out in the browser yourself, and then
(01:08:32):
there's a link to my class that I teach on the Maven platform. It basically has five lectures over two weeks, and we walk through this for anybody who understands spreadsheets or vanilla JavaScript. And I have a promo code, jsjabber, so just use the promo code during checkout and you get twenty percent off. The course is taught live
(01:08:53):
but is also available on demand. So my last cohort just wrapped up earlier this month. But if you sign up, you'll get to watch all the recordings. I'll answer all the questions you have over email. You'll be in the same private Discord as the rest of the cohort. And if for some reason you're watching it on demand and you say, I'd rather have the live version, I offer that: if you want to attend a future
(01:09:14):
live version, you can do that for free, even if you signed up for the on demand. So, you know, feel free to check it out. It's on Maven. It's got some long URL, but if you go to spreadsheets are all you need dot ai, you can check it out. And then to find me, I'm on Twitter, I A N A N D, so my first initial with my last name, and of course on LinkedIn. I'm also
(01:09:34):
on Bluesky as well. If people want to reach me, happy to
Speaker 1 (01:09:37):
Answer questions. Awesome. Well, yeah, I definitely want to dig into it. I'm probably gonna go watch your video on YouTube a few more times, just, you know, getting all those little pieces in my head. Yeah, I think you said at the beginning, the model that matters most is your mental model. Yes, and so, yeah, just knowing how to think about, okay, I'm dropping this
(01:10:00):
in, right, this is how it goes through the Plinko machine to give me the output on the other end. That's the thing that really helps me out.
Speaker 5 (01:10:07):
So yeah, that's a great analogy. And one thing that may be worth highlighting is I've had some feedback where people say, oh, you're using spreadsheets, or you're using JavaScript, the real models are in Python, and you're using GPT two, which is an older model. What I teach in the class are essentially the timeless technical fundamentals of how these models work.
(01:10:27):
And it's worth remembering that all the major models you've heard of, you know, Claude, ChatGPT, Gemini, they are all inheriting from GPT two. So if you understand GPT two, you're eighty percent of the way to understanding any of the modern model or Llama model architectures. So it's not like a toy. It is essentially
(01:10:49):
very close to how the real models work. It's a really good stepping stone to getting that really sharp mental model of what's happening under the hood.
Speaker 1 (01:10:57):
Yeah, that's true of most technologies, right? I mean, if you were using, I don't know, let's pick one, MySQL ten years ago, probably sixty or seventy percent of the stuff is fundamentally the same.
Speaker 2 (01:11:09):
The engine works mostly the same.
Speaker 1 (01:11:12):
They've probably optimized some pieces, they've probably added some features, but for the most part, if you understood what it was doing back then, you get it now. And to be honest, the other thing is, you'll also see, as we get more variations on things, you also have a decent understanding of how to use something like SQLite or PostgreSQL as well.
Speaker 5 (01:11:32):
So yeah, I really like that analogy. It's a good one. I may borrow that, thank you.
Speaker 2 (01:11:38):
Yeah, no problem. All right, Well let's go and do
our picks. AJ Do you want to start us with picks?
Speaker 5 (01:11:44):
Sure?
Speaker 4 (01:11:46):
Okay, so Civilization. I've still been playing that, not as much as I was the other week, but enjoying it. It turns out you can run it on the Mac if you, well, you have to go into settings and turn its performance mode completely off, all the way down. I think it's something to do with multithreading,
(01:12:07):
why it crashes. Like, if it uses more than one core, it just crashes every five minutes or something. Anyway, so there's that, Civilization still going strong. But I wanted to correct that: you can get it running on the Mac. It just won't run on the Mac with the default settings, and it's not abundantly clear why. But anyway, the other thing was,
(01:12:28):
with the announcement of the Switch 2, I just got angry because I still can't play Tears of the Kingdom or Super Mario RPG or Spyro or, you know, any game that's basically been released in the last three years without massive stuttering. And you know, with Tears of the Kingdom, you know how they have it go into bullet time
(01:12:49):
whenever it gets overloaded instead of getting choppy, although it still does that too, it does dynamic resolution and bullet time, so your swings will slow down and stuff, which, you know, whatever.
So I decided to mod my Switch, and I did
(01:13:10):
the hardware mod of the Switch, and it was super easy. Now, I've done mods in the past, so saying that it was super easy, no. If you're not familiar with soldering, if you haven't, you know, fixed a phone or, you know, done something like that before, no, it's not. It's not super easy. Getting all the screws out, getting to the actual part, getting the heat sink off,
(01:13:32):
that's super easy. Anybody that has a precision toolkit for, like, phone repair, game repair or whatever, can get in there and do that. I actually couldn't see the soldering that I was doing because the components are so small. Now, I've learned some tricks because I've practiced a little bit of micro soldering in the past
(01:13:52):
and failed at repairing a 3DS, but the pieces are so small that I literally can't see them. I mean, I can see them, but I can't see them. Like, I can see them the way I can see a grain of sand, but I cannot see them well enough to actually accurately work on them. So what I did was I used my phone to zoom in,
(01:14:15):
take a picture of it, see that I had bridged two capacitors, and then just kind of blindly, you know, moved the soldering iron next to them and kind of swept it the same way that I would when I'm, you know, soldering a bigger component. Then I used the phone again, zoomed in, and so I was able to get the piece on there. And I
(01:14:36):
should have had some sort of magnifying glass set up, but whatever. So I was actually able to do it blind, in a way. I could see my tip was there, but I couldn't actually see what was what, because, I mean, the things are smaller than a grain of salt. They're small. Anyway, the capacitors, they're like wicked small.
(01:14:58):
But even with that, I was able to do it. So modding the Switch is one pick there. There's the PicoFly mod kit.
You can do it if you've done other soldering. If you haven't done micro soldering, buy a couple of practice kits from eBay or AliExpress
(01:15:19):
or something and you can get there. But my third pick is the soldering station that I used, which is actually a custom made soldering station. So a few years ago, these Chinese companies came out with the T12 tips, or they cloned the T12 tips. They turn out to be really, really good. One reason is that they
(01:15:39):
double as a temperature sensor, because they have two different types of metals, and anytime you have two different types of metals, you have a thermocouple. So they have an inner metal and an outer metal, so they double as their own temperature sensor. And so people created software for these and put them on microcontrollers. And the software, and I'm being literal when I say this,
(01:16:02):
rivals three thousand dollar professional workstations because of the way that it switches back and forth between monitoring the temperature of the tip. And so I'll put a link and you can get these. You can get clones on AliExpress, but I prefer the original one that's made by this guy in Australia, because I know it comes with the right firmware on it, and the firmware is
(01:16:24):
really where the magic happens. Any idiot can, you know, 3D print some leads onto a Ridgid battery or a Milwaukee battery and connect it to a T12 tip. But the real smarts of it is in the firmware, where it manages the heat to make sure it gets up to temperature quickly, and it actually does the sensing
(01:16:46):
thing where if you shake it, it turns back on and heats up. Anytime the temperature is cooling down rapidly, it sends more power. So anyway, it's just a really great soldering iron. And because I have that, like, I've got a couple of cheap ones too, but that one cost a hundred bucks, and I'm
(01:17:06):
considering trying out one of the knockoffs. It's only like thirty five, because now everybody, even Craftsman, is selling one of these, at Lowe's now. But I don't know if the Craftsman one is just, like, the cheap idiot kind where it's just connecting the leads together, or if it's actually got, I have a hard time believing that they would have gotten an illegal copy of the firmware, whereas the Chinese companies on AliExpress, you know. Anyway,
(01:17:29):
all that. Yeah, so I had a good time soldering because of the Ridgid-powered, or you can get the mount for Milwaukee or Ryobi or whatever battery brand you like, soldering iron. They're super fast. They're so much better than a Weller or a Hakko or all the traditional ones that cost hundreds of dollars. So anyway, and of course I'll pick
(01:17:54):
Ollama, because I really do enjoy running my own local LLMs. Actually, since the thirty two billion parameter model of Qwen 2.5 Coder has come out, that one I just find to be the best of the best. It rivals GPT-4o, if not is better than 4o, and you can run it on an
(01:18:16):
Apple silicon Mac. Boom, all the things.
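Tying this back to the temperature, top-k, and top-p discussion earlier: if you're running a local model with Ollama, those knobs are passed in the options object of its HTTP API. A small sketch, assuming Ollama is running locally and you've pulled a model such as qwen2.5-coder; the exact model tag is whatever you have installed.

```js
// Ask a locally running Ollama server for a completion, shaping the sampling
// with temperature / top_k / top_p (the knobs discussed earlier in the episode).
// Requires Node 18+ (built-in fetch); run as an ES module for top-level await.
const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwen2.5-coder:32b",   // any model you've pulled locally
    prompt: "Mike is quick. He moves",
    stream: false,                // return one JSON object instead of a stream
    options: {
      temperature: 0.2,           // lower = more deterministic
      top_k: 40,                  // only consider the 40 most likely tokens
      top_p: 0.9,                 // ...within the top 90% of probability mass
    },
  }),
});
const data = await response.json();
console.log(data.response);
```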
Speaker 2 (01:18:19):
Okay, I have a question: what are you modding your Switch to do?
Speaker 5 (01:18:23):
Oh?
Speaker 4 (01:18:24):
Sorry, I skipped over that part. Overclock it. Well, not overclock: the native CPU clock speed of the Tegra X1, because the Switch is basically an Android tablet running a custom operating system rather than Android, is two point two gigahertz.
(01:18:47):
That's the native clock speed. The clock speed that it runs at is something like a thousand or seven hundred, depending on whether it's docked or handheld. Same thing with the GPU. The native GPU speed is like one point five gigahertz or something like that, but they clock it down to five hundred or seven hundred.
Speaker 2 (01:19:06):
Gotcha.
Speaker 4 (01:19:07):
So when you mod it, you can then, and this you can do without getting banned, or at least this is what people are reporting and what I'm doing, so you know. If you mod it and you want to run pirated games or something, you have to set up more stuff and make sure that you don't get banned, although a lot of that stuff is built in now. But if all you want to do is overclock it,
(01:19:28):
the overclock system runs in a layer that's kind of protected from the main Switch system, so the main Switch system can't detect that it's rooted while it's running. So if you just install, what is it, Hekate, Atmosphere, and sys-clk, if those are the only things
(01:19:51):
you install, then you should be able to run your Switch modded on the original firmware, be able to play online, et cetera, without any risk of banning or anything, because it's not modifying the Switch operating system or the game. It's just modifying the CPU clock.
Speaker 2 (01:20:12):
Gotcha cool?
Speaker 4 (01:20:15):
So now my friend asked me, well, can you notice any difference? And my answer is no. And the reason my answer is no is because when you're playing it underclocked, you notice the stuttering all the time, and you notice the resolution changing. Like, you know, you turn and there's a bad guy and you shoot him, and then the resolution drops. But
(01:20:36):
when you're playing it closer to native speeds, well, you can't get it all the way up to native speeds, because the power delivery on the board isn't actually capable of running it at native speeds without also draining the battery at the same time. But when you're playing it at near native speeds, you don't notice it, because the things that are annoying aren't there. The resolution's not changing,
(01:20:57):
it's not stuttering, it's not going into bullet time as much. I did not notice it going into bullet time at all since I've been playing it at near native speeds, and I did some things where I was blowing up rocks and things that I thought would normally make it go into bullet time, and it didn't. So, like, five stars so far.
Speaker 2 (01:21:16):
Cool, very cool, All right, Steve, what are your picks?
Speaker 5 (01:21:21):
All right?
Speaker 3 (01:21:22):
Time for my twenty minute picks. So before I get into picks, one note I will make that sort of circles back to what I asked at the beginning. You know, as someone who has spent a lot of time doing search indexing, you know, with Lucene type search indexes, a lot of this sounds really familiar. And to me, I've
(01:21:43):
always said that the I in AI is a misnomer. I think it's not necessarily intelligent. It's basically just better use of training and fancier use of existing data to answer things, not necessarily intelligence that can create new things. That's just my two cents, for what it's worth. Interesting
(01:22:06):
pick: Ishan, you mentioned this earlier today, and as of today, and you know this will come out a little later, but DeepSeek is, like, disrupting in a huge way.
Speaker 2 (01:22:17):
You know.
Speaker 3 (01:22:17):
For instance, if you go look on Hacker News, both on the top page and on the new page, there are multiple articles, from NPR, from CNBC, about what it's doing to the stock market, and the gist is basically that they've been able to create these fantastic models with
(01:22:39):
much less investment, with less powerful chips. And there's a whole story behind this, and so that's wreaking havoc, at least in the stock market and with people like Nvidia, just because of supposedly how much cheaper and more efficient DeepSeek is compared to some of the other models. So today is only the first day, and, you know, it remains
(01:23:02):
to be seen how accurate this is, especially coming from the Chinese, but sort of a disruptive thing going on this morning, at least as of the time of recording.
Speaker 5 (01:23:13):
Yeah, do you mind if I jump in there a little bit? DeepSeek is an utterly fascinating story and model. I'll say one thing, which is that the training cost might have been apples to oranges. Like, they stated what the
(01:23:33):
cost was for the best run, or the final run. There are a lot of other costs that go into it. I talked earlier, when AJ was asking, about what goes into it. One of the key things I was thinking about, actually, with DeepSeek is they're going to do a lot of other experiments and runs. There's a lot of stuff that gets built upon. So I think some people are comparing apples to oranges, but it is, you know, an
(01:23:57):
impressive model in a lot of ways. The other thing
that I find the most fascinating about it is that the training process is remarkably simple. And, you know, I'm trying to think of an analogy. It's like, normally, when they do this part of the training process called reinforcement learning, it's a lot more complex, and it's kind of like,
(01:24:21):
you know, you think about a car and you're like, well, if you want to go from point A to point B, you need an internal combustion engine. It's basically, you know, having a little fire, and you've got these pistons and the cylinders, a really complex piece of mechanics. And then somebody's like, the electric car: you know that little toy motor you had? Well, let's just scale that thing up. And so they tried this really simple, relatively simple, technique
(01:24:42):
and just scaled it up, and it worked. And I think different people are reacting to this model differently. For some people it's about the price, for other people it's about the training setup, and it's, how did we miss this? It's just remarkably simple. So it's definitely worthy to bring up; it's just a really interesting model.
Speaker 3 (01:25:02):
Occam's razor strikes again.
Speaker 2 (01:25:04):
Right.
Speaker 5 (01:25:05):
Well, there's something they call in AI the bitter lesson, which is, stop trying to put into the model how you think you think. Instead, just give it really general compute and just throw more and more data and more and more compute at the problem, and the model will figure it out, so don't try to be too smart about it.
(01:25:26):
And people are like, this was the bitter lesson all over again. It's like, oh, we thought, you know, we had to do this really complex, you know, reinforcement learning setup, and these guys showed, well, maybe you don't. Now, in the end, their production model actually still had a somewhat complex training pipeline. But one of the interesting results is this model called Zero, where, kind of like, you know,
(01:25:47):
AlphaZero learned how to become a really good Go player just by playing against itself. In this case, it wasn't the model playing against itself. It was just trying out ideas, and they just told it whether it was right or wrong, and it started automatically, emergently, figuring out how to improve its thinking. And it starts getting these eureka moments where it suddenly realizes it can backtrack, and
(01:26:08):
it's like, oh, wait, I made a mistake, and you're watching the model, like, we didn't train it to do this, and it suddenly figures out how to, like, get smarter. So it's really, really fascinating. We could talk another hour on the model, but yeah, lots of people are poring over it. It's fascinating in many dimensions.
Speaker 2 (01:26:28):
Cool.
Speaker 3 (01:26:30):
And then, last but certainly not least, the high point of any episode: the dad jokes of the week. So what did one pie say to the other pie before being put in the oven? You know, this is a musical answer: all we are is crust in the tin. For anybody that knows Kansas. Here's an Australian version. My
(01:26:55):
mate was bitten by a snake, so I told him an amusing story. If I'd known the difference between anecdote and antidote, he'd still be alive. And finally, when I was in school, my teachers told me I would never amount to anything because I procrastinated too much. I told them, you just wait. Those are the dad jokes of the week.
Speaker 1 (01:27:19):
All right, well, I'm going to jump in here and
save us from the high point of the episode. I've
got a couple of picks. I always do a board
game pick. So the game I'm gonna pick, I learned this game last week, is called Cascadero. I'm gonna put links in for both BoardGameGeek, which kind of gives you information about the board game, and then an
(01:27:40):
Amazon affiliate link, just because then you know where to
go buy it if you want it anyway, So Cascadero,
the premise of the game is that the kingdom's breaking up,
and so you all play a different faction, I guess,
(01:28:03):
and you're trying to connect towns and send your people
through the towns to pull the kingdom back together. And
so you put your little guys out there, and then
you score points based on whether you're the first person
of the town or the second person of the town.
If you have a group, you have to have a
(01:28:24):
group of your little horsemen.
Speaker 2 (01:28:27):
I can't remember what they call them, heralds. No, the
heralds were the.
Speaker 1 (01:28:31):
Other things. Anyway, so if you connect to a town
with a herald, then you get an extra point for
connecting to it, and then there are bonuses that you get.
Because when you connect, when you get the points,
you actually move a marker up the technology or progress
track in whatever color you connected to. And so I
(01:28:52):
guess they're not points, they're just movements. But anyway, so
once you move past certain points, you get certain rewards,
and I mean effectively, what you're trying to do is
you're trying to score the most points, and you also
want to get to the end of the track in
whatever color you're playing. So if you're playing pink, you
(01:29:13):
want your pink marker to get all the way to
the end. And yeah, like I said, you get bonus
points for getting all five of your markers past the
first space.
Speaker 2 (01:29:25):
That's marked for that.
Speaker 1 (01:29:27):
You get more bonus points if you get three past
the second spacing, and then if you're the first one
to get one all the way to the end, then
you get bonus points and only one person can get those,
and then the other ones are if you connect two
cities of the same color and there are five colors,
you get bonus points for each color, and if you
get all five colors, then you get ten bonus points.
(01:29:49):
And so you're just moving your marker around a track
when you get the points. As soon as somebody gets
fifty points, the game ends. And so essentially, if you
want to win, you want to be the first person
to get your marker all the way to the end
of the track of your color and then be the
person that gets that fiftieth point. We played it, and
(01:30:10):
it was our first time any of us playing it,
and so nobody crossed that fiftieth point before we all
ran out of Little Horseman, and so when somebody runs out,
everybody gets one more turn, or if somebody crosses that
fifty point marker, everybody else gets one more turn, and
then the game's over. It's reasonably simple. The scoring is
(01:30:32):
a little bit complicated as far.
Speaker 2 (01:30:35):
As like moving.
Speaker 1 (01:30:36):
You know, when you get moves and how many moves and things like that. So BoardGameGeek weights it at two point five to three, right? So it's a little more complicated than for kind of your casual gamer who's just going to play something simple with their friends. But my feeling is that it was really only just getting used to what happens when I put my horsemen down. And
(01:30:59):
as soon as you get used to, okay, I put my horsemen down, I can move something up the track so many spaces, and this is how to get the rewards, once you figure that out, it's a relatively simple game. We played it in, what, an hour, maybe a little longer, I think. If we'd known what we were doing, we could probably play it in forty five minutes. There were three of us playing it. So anyway, Cascadero.
Speaker 2 (01:31:25):
Fun, fun game. I liked it.
Speaker 1 (01:31:27):
I want to play it again now that I know how to play it and my friends know how to play it, because there were a couple of
Speaker 2 (01:31:32):
things I would have done differently as I got into it.
Speaker 1 (01:31:37):
As far as other picks go,
Speaker 2 (01:31:41):
Go to jsgeniuses dot com and sign up.
Speaker 1 (01:31:43):
We're gonna start doing the meetups and I'm gonna start posting videos. For the videos I'm posting, I'm kind of building an entire app. I don't know if I'm going to show myself writing all the code, because some of the stuff gets repetitive. Oh, I have to connect another data model to this database, right? It's like, okay, you don't need to see that eighteen times. But, you know,
(01:32:06):
we'll get kind of the major pieces in, and then anything bonus or extra that I do. The app I'm gonna build, I decided I need to learn Next.js, so it's going to be a Next.js app and I'm going to be putting it on Cloudflare Workers.
Speaker 2 (01:32:22):
And the reason is.
Speaker 1 (01:32:23):
Is because just to give you an idea of what
the app is, it's relatively simple.
Speaker 2 (01:32:27):
But last year when we.
Speaker 1 (01:32:31):
Ran Caucus Night for the Utah Republican Party, we had an online registration and we got DDoSed, because there were people out there who didn't like us, and it's internal politics to Utah. It wasn't the Democrats, it was somebody else. But anyway, because of that, I'm looking to,
(01:32:55):
you know, put it on a system where I know it'll just kind of expand to whatever comes at it. Cloudflare is also usually pretty good at, you've hit me eighteen times, now I'm just going to drop it and drop you unless you can prove you're human, and so I feel like I can get some of those benefits. But I'm also curious to see how Cloudflare Workers work. So it's going to be basically a registration.
(01:33:18):
There's going to be a little bit of site automation, because the Utah state voter registration database, where you verify your voter registration, doesn't have an API, which means that I have to have my program use something like Puppeteer to fill in the fields and then scrape data off the response to make sure that you're registered to go to caucus night. I'm thinking I may
(01:33:42):
also offer this same kind of thing to the Democrats and anybody else who wants to run
Speaker 2 (01:33:48):
A caucus night that night.
Speaker 1 (01:33:49):
I think the Libertarians in Utah do it too, right? So that they can just, hey, you've got online registration and then you've got an app that will verify it on the other end.
Speaker 2 (01:34:00):
So anyway, that's what I'm looking at.
Speaker 1 (01:34:02):
So there may also be a React Native app or something like that on the other end, you know, so people can show up with a QR code that says, I registered and this is who I am, and people can just verify that way instead of having to look them up in a paper list or something like that.
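Since the state site has no API, here is the rough shape of the Puppeteer approach Chuck describes. The URL and the CSS selectors below are placeholders, not the real Utah voter-lookup page, so treat it as a sketch of the technique rather than working integration code.

```js
// Sketch: fill in a lookup form with Puppeteer and scrape the result.
// The URL and selectors are hypothetical placeholders.
// Run as an ES module (Node with "type": "module") after `npm install puppeteer`.
import puppeteer from "puppeteer";

async function checkRegistration({ firstName, lastName, birthDate }) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto("https://example.com/voter-lookup", { waitUntil: "networkidle2" });

    // Fill in the form fields (selectors are made up for illustration).
    await page.type("#firstName", firstName);
    await page.type("#lastName", lastName);
    await page.type("#birthDate", birthDate);
    await page.click("#submit");

    // Wait for the results area, then scrape the text out of it.
    await page.waitForSelector("#result");
    return await page.$eval("#result", (el) => el.textContent.trim());
  } finally {
    await browser.close();
  }
}

console.log(await checkRegistration({ firstName: "Jane", lastName: "Doe", birthDate: "01/01/1990" }));
```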
Speaker 2 (01:34:17):
So that's what I'm going to be building. Now, JS Geniuses:
Speaker 1 (01:34:21):
You get access to the videos, you get access to
the weekly meetups, and a bunch of other stuff. I'm
also looking at starting a new podcast on doing AI
with JavaScript, and it's gonna be at this level, right
We're not building our own models. We're gonna be using
the existing models that are out there, the open source models,
if you will, and showing how to build things on
(01:34:43):
top of those, or using some of the cloud services
that generate images, or you know, using something like Whisper
for transcriptions and things like that.
Speaker 2 (01:34:51):
So anyway, keep an eye out for that. That'll be free.
Speaker 1 (01:34:54):
I'll probably drop the first two or three weeks' worth of episodes onto this RSS feed, and then from there,
Speaker 2 (01:35:01):
You'll be able to just subscribe to the other feed.
Speaker 1 (01:35:04):
So that's what I've got going. Yeah, those are my picks. Ishan, what are your picks?
Speaker 5 (01:35:11):
So I've got two picks. The first one, well, both are going to be AI related. The first one is NotebookLM, but everyone knows about NotebookLM with, like, the fake podcasters. My pick is NotebookLM without that feature. I think that feature is great and really compelling. It lets me consume, you know, material on the go in podcast form. But I like the other parts of NotebookLM,
(01:35:35):
which is, like, it's a great way to stick a variety of sources together and then ask questions about it. So one example is, I like to go to Y Combinator's Hacker News to see what the comments are, but I don't read through every single one. So I will stick it in there and say, well, what are the most insightful comments? What are people saying? I did this actually with DeepSeek: I said, what are the comments people are saying about DeepSeek? What are they seeing for performance?
(01:35:55):
What are the issues where it's not working? And what's great is it doesn't just summarize it. For each part of it you can say, okay, oh, that sounds interesting, let me go click on it, and I can go right to the citation of that comment. The formatting is a little off when you stick it in there, and you're only limited to thirty sources in each notebook, but check out the other parts of Notebook
Speaker 2 (01:36:17):
LM.
Speaker 5 (01:36:17):
I think it's really interesting. I expect to see a lot of other applications follow a similar type of UX paradigm or take inspiration from it. The second one is, I don't know if you guys have been watching it, but Star Wars has a new show, Skeleton Crew, that they have on Disney Plus. And first of all, I think
(01:36:38):
it's good. I don't think it's, you know, Mandalorian or Andor level, and Andor was my personal favorite, but it's still pretty good. But the other reason I bring it up is I liked some of the elements of how they handled AI and droids. So in one episode there's something that could be akin to jailbreaking the droid, where somebody uses the equivalent of, you know,
(01:36:59):
prompt hacking to jailbreak a droid, and I don't think we've ever seen that in Star Wars's treatment of droids before. There's another one where it reminded me of this paper called alignment faking, where the model has to decide between its original training or the thing it's being asked to do right now, and it kind of goes back and forth, and it gets won over by its original training. And there's one thing in the very last
(01:37:21):
episode that I also thought was fascinating. But I really liked those interesting bits of how they handled AI, that I think we wouldn't have seen in a show like this without an understanding of ChatGPT, which I think probably inspired the writers. So those are my picks.
Speaker 1 (01:37:35):
Awesome. Yeah, Skeleton Crew is on my list of things I want to watch, so I like the recommendation. Thanks for that. All right, well, just a reminder, go look on maven dot com.
Speaker 2 (01:37:45):
The code was JS
Speaker 1 (01:37:46):
Jabber, for twenty percent off, and so if you're interested in the course, go check it out. I'm not a hard sell guy. I just think it sounds fascinating. So anyway,
let's go ahead and wrap it up here until next time.
Speaker 2 (01:38:00):
Max out!
Speaker 3 (01:38:04):
Mhm