
March 21, 2025 74 mins

What exactly is an LLM doing, and why do you need to learn so many new terms? Steve Pousty is here to explain that most of those new terms are things you already know. It’s not new technology; it’s new words to describe technologies applied in a new field. We have a wild, ADHD roller coaster looping through embeddings, vectors, RAG, and LLMs. Make sure to keep your hands and arms inside the pod for this one.


Chapters
(0:00) Intro
(9:00) Embeddings
(19:00) Graph DB vs Vector DB
(21:00) Vector Algebra
(36:00) Open Source
(41:00) Vector databases
(51:00) What is RAG?
(58:00) What is an LLM doing?
(1:08:00) Dating advice


Links
• 🦋Steve on Bluesky https://bsky.app/profile/thesteve0.bsky.social
• ▶️ Steve on YouTube https://www.youtube.com/@thesteve0
• 📍Voxel 51 https://voxel51.com/
• 🎮 Vector algebra game https://neal.fun/infinite-craft/
• 📘The Alignment Problem https://www.amazon.com/Alignment-Problem-Machine-Learning-Values/dp/0393635821
• 🎥 Mitchells vs The Machines https://www.netflix.com/title/81399614
• 📀 MNIST dataset https://en.wikipedia.org/wiki/MNIST_database
• 📝 What ChatGPT is not https://blog.techravenconsulting.com/what-chatgpt-is-not/
• 📝 Why I am excited about ChatGPT https://blog.techravenconsulting.com/why-i-am-excited-about-chatgpt/


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
My general advice to people is start with Postgres before you
start with any of these specialized databases, whether that's
a document database or any of those other, start with Postgres.
And if you then run into scaling issues or something else, go read
the documentation, because you're probably not doing it optimally.
Welcome to Fork Around and Find Out, the podcast about

(00:21):
building, running, and maintaining software and systems.
Welcome to Fork Around and Find Out. I am Justin
Garrison, and with me as always is Autumn Nash.

(00:42):
And today on the episode, we have Steve Pousty,
principal developer advocate at Voxel 51.
This episode is you're going to learn about how to embed things
in your head for vector databases or something like that.
It's because I saw Steve at All Things Open in 2024.
I walked into your talk, Steve, about vectors, databases, and embeddings.
And I had been asking, at every

(01:04):
talk:
What is a vector database?
Why do I need it?
And none of them gave me a good answer.
And then you gave great descriptions of it.
So I said, you need to come tell the rest of the people because 2025 everyone
learned they didn't have a vector database and for some reason they needed one.
And so I wanted to have you.
Do you know how many podcast guests he's asked that question?
And it's
been a running joke.

(01:24):
Yes.
Really?
And they just like, they can't explain it.
Well,
most of them can't.
Yeah.
They, they can't explain it, at least not to the dumbed-down
simpleness that you did for me, which was great.
This is because people don't care about pedagogy.
That's what I have to say.
I love when people get excited about like,
you just, you missed his whole face.
Like he lit up and he was like, I'm ready.
Call me in for vector database right here.

(01:46):
Like I got this.
I got, this is exciting, because I've also talked about it enough.
I figured out the — let's start from things you know, which
is pedagogy, to things you don't know, and make that bridge
explicit so that you can ground yourself in what you do know.
So that's
why I give talks and do stuff.
That is my favorite moment.
Working in tech is when you take something and you like go down the rabbit

(02:09):
hole and you finally get to the point where you understand it in a way
that you could use an analogy to like, either make it a real world thing
or something that people will understand — but to explain a technically
deep concept, that's like the — like, I love that light bulb moment.
I live for it.
Before we go too deep down the rabbit hole, I want to do a little
bit of housekeeping for the podcast, just for everyone listening in
FYI, we are going to be moving this podcast to a monthly cadence.

(02:34):
Uh, we also.
Are going to be editing the show ourselves, which
means I don't have enough time to bleep out any words.
So there might be a little more, hopefully
the levels are still going to be fine.
There might be a little more distractions.
A lot more ums and likes possibly, but definitely
some more things that would have been bleeped before.
So if you're listening to this show with kids or in a

(02:54):
public space, that's a, you don't want that going on.
Then just FYI, probably this episode going
forward, we might have some of that.
And I'll, I'll try to warn some people for the next couple of episodes in
case you miss one, but I just want to put that out there because we have been
for the last three months paying editors, which have done a fantastic job.
Uh, it's just a, a lot of money
we don't have; we can't afford it.

(03:14):
And we're, so we're moving, moving the podcast
a little bit more to make it sustainable.
So you mean everyone doesn't teach their kids like bad
songs and NWA and just tell them to not say the bad words?
No,
I just let them listen to it.
And I don't say anything at all.
I mean, not, not, not NWA.
That's not the kind of rap I like, but in other
songs, I'm more of an old school East coast guy.
So.

(03:35):
East Coast?
You live on the West Coast, Steve!
Yo, I'm from
f ing New York, don't you give me that b h!
Yo, I'm not here for that, I'm not here for that.
I
mean, I am definitely a West Coast 90s,
like, person, but we love you anyway, Steve.
Thanks, that's really nice.
I mean, I like some West Coast, like, like The Pharcyde, but

(03:55):
that The Pharcyde feels kind of East Coast though, right?
Like, it's not that same kind of How dare
you?
That's where you were gonna pick.
What's wrong with The Pharcyde?
Okay, Steve, do you drink coffee?
Generally, no.
Is that
a good or a bad thing?
I'm going to start keeping a hash mark over here on my wall.
I know, we need a tally on who
uses Mac and Linux, and who drinks coffee.

(04:15):
Linux, Linux, Linux.
Okay, so, the only reason I run Windows is because I play a lot of video games.
And so, video games I have a Steam deck and I can run some
of the games on there, but if you want to run like a triple
A, like CoD or something like that, I need a Windows machine.
If I could be Linux every day in and day out, Linux day in and

(04:38):
day out.
Shout out to Bazzite, uh, the Linux distro,
which I absolutely love on my Steam Deck.
See, that's why I run like Linux on a dev desktop.
And use a Mac because I like it when everything syncs together
and I don't have to think about stuff in my personal life.
Well, so here's the problem for me with using a Mac. Macs are my least
favorite operating system, by the way, because it's close enough to

(04:58):
Linux without being Linux, that all my muscle memory is frustration.
That's what I love about it though, because it gives me the continuity,
but I can use all my, like, Unix and Linux commands.
Yeah, but it's Z shell and stuff — like, it's BSD.
So no.
Once you use
sed, if you don't have GNU sed, you're done.
Like, it's just like, I'm, as soon as I

(05:18):
go to sed and on a Mac, I'm like, I'm out.
I'm installing
PowerShell.
Oh, no, no.
I don't do any of that kind of stuff.
Oh,
I was going to say like, how would that be better?
Like,
no, no, no, no, no, let's, let's be calm and clear here.
He was like, well, too far, too
far, too far, too fast.
No, I mean, I played with PowerShell just a little bit, but like

(05:39):
sound effects.
Sorry.
I mean, PowerShell, I mean, it.
Kudos to Microsoft for at least moving off of DOS.
So, kudos to them for that, but I mostly have WSL. If I
need to do that kind of stuff on my Windows box, it's WSL.
Which is a way better Linux environment than a Mac is ever going to give you.
And so it's just like
Okay, I don't want to be in this religious war.
I'm clocking out.

(05:59):
I'm just letting you know.
You see,
Justin just leads you into spicy and Steve was not, like, not today, Satan.
Not today.
Today.
I've been down this
road before.
Next, you guys are going to be like, what's your favorite IDE?
And if any, if you guys say anything, don't say it because if you say anything,
yes, if you say anything other than a JetBrains product, I'm hanging up.
Okay.
See?

(06:19):
We, I was, I'm glad Justin, he has to be like the super special
snowflake of IDEs, operating systems and all of the things.
And he doesn't drink coffee and he has to have Dr. Pepper,
which means next week we will be walking up and down the
streets of California, trying to find him Dr. Pepper.
I'll bring my own
Dr. Pepper.

(06:40):
It's fine.
You said that last time when you did not bring enough Dr. Pepper.
That's true.
It wasn't cold.
That was the problem.
Anyway, back to the topic at hand.
I just use NeoVim.
It's mostly because I do remote desktop, like I just,
y'all didn't see Steve's eyebrows.
Like you just missed a gem.
Steve's eyebrows, the judgment and his eyebrows.

(07:00):
Like
I have never given IntelliJ a fair chance.
Like I tried VS code for years.
Also, I'm really bitter that they got rid of Atom and like Atom was
like my
favorite.
And I'm going to bite my tongue on that
one, too.
Now I don't like either one of you, but, um,
the thing with NeoVim.
So look, here's the thing I'm going to say about like those command line

(07:21):
people, like Vim Emacs people, which I don't want to even get into that.
Emacs different.
That's a religion.
Like that is joining a cult.
Okay.
But the thing is, I get it.
Like I've seen some of them, like when I was working on OpenShift.
Back in the day at Red Hat, like there was the lead engineer.
He was just like, I have VI with all these mappings.
And he was just like, and I was like, that looks miserable.

(07:42):
I don't want to
waste.
Ever since I've been a kid, come on Legos.
Can Steve just be in the background of all our podcasts?
And then like, when someone says something, Steve
will just drop in with the like sound effects.
I love it so much.
Is that called Foley?
Are those Foleys?
Is that what those are — Foleys?
Foley engineers.

(08:02):
Yeah.
Yeah, yeah, yeah.
Thunder.
See, look, if tech doesn't work out, Steve
can moonlight as like a Foley engineer.
Totally.
So yeah, we could get stuck like this for a long time.
I don't, I don't know if it's true, but from what I've
already observed of my, of my hosts, it seems like we might
All three be somewhere on the ADHD

(08:23):
spectrum.
And so, you know, the bottom line is kind of, like, uh, moving and jumping.
You didn't have to attack us like that.
I'm attacked
like that every day.
I wasn't diagnosed until I was 52.
Do you know how hard that was?
We can talk about that for a while.
We're nine minutes in and this is our eighth topic.
So

(08:44):
none of them.
And it's not linear.
I
think at this point, this is what's expected.
Like if we actually held a topic too long, I think they'd be disappointed.
We get bored.
I know I am the adult usually.
So I am going to try to bring it back.
Let's go back.
Let's go
back.
Let's go a little bit back.
Back to the topic of, of embeddings, vector databases, uh, not IDE stuff.
I was going to make an old man joke, but

(09:06):
like, like you are the old man of the podcast.
I'm
the responsible one.
I know.
Technically though, I bet you I'm older, but let's go.
I'm sure
you're, it's fine, but we won't.
We're going to get stuck.
Next thing's going to happen if we stay focused, because otherwise
we're going to start talking dating advice and stuff like that.
I mean, we'll get
there at some point.

(09:26):
We will,
but so like in good ADHD fashion, that will be our reward.
We will talk about vector databases.
And when we get through that topic, we're allowed,
we'll get to talk about other things.
Okay, let's set a timer.
Even before we started recording, Steve hinted that he had some
parenting advice, which we're going to leave that for the end.

(09:47):
Like, we got to stick to this whole thing.
And then
he like, didn't tell us, like, it was like when someone's
like, Hey, I got something really important to tell you.
And then they're like, I'll tell you later.
And then you just.
Did you
use Pomodoros?
He did!
It's not even a tomato or anything.
I'm trying to set a 15 minute timer.
Is that
15?
15? I think
that's 15.
The adulting of the podcast has been taken over by Steve.

(10:08):
Alright, here we go, it's a timer.
Oh crap, it went to the wrong one.
Okay, there, 15 minutes.
We got 15 minutes.
We have to talk about vector databases and vectors for the next 15 minutes.
Let's start with embeddings.
Okay.
Describe an embedding.
Okay, and we have to start with embeddings,
otherwise vector databases don't make any sense.
Okay, so the idea behind an embedding is there's two types of

(10:30):
data that you put on your computer, structured and unstructured.
Right.
Structured data is something like a database table or an Excel spreadsheet.
It's really easy.
And it's like all numbers and little strings and stuff.
So it's really easy for the computer to say, Oh, I know what to do with this.
I know what to do. Is two bigger than three?
Yes.
It's wrong, but three is bigger than two,
but it can, that's a bad, that's a bad,

(10:52):
that's what the AI would say.
That's right.
Exactly.
Cause it can't tell numbers, but the point being that in normal space,
normal computing stuff we do, it's all small stuff that's structured.
Like the computer knows what to do with it inherently.
At a certain level, I know I'm fudging that, but unstructured data
is things that don't actually have that kind of, this is an

(11:12):
integer, this is, you know, 16 characters, this is whatever that
is, and there is no real easy way to represent that in a computer.
So examples of this are.
Photographs, right?
A photograph.
Yes, it has.
It's made up of numbers, but there there's
no way that you can just look at them.
If I just showed you the matrix, which was the number for each of the pixels,
you can't look at that and go, Oh, yeah, I know exactly what that's showing me.

(11:34):
And there's nothing the computer can do with images.
Like if you say, is this kitten picture cuter than this kitten picture, right?
It can't answer that question.
It has no way of there's no semantics around that because it's unstructured.
They can say, is this image the exact same or very similar?
But other than that, yeah.
Computers don't know what to do with the rest of it.
And other examples of this, I'm going to be talking a lot about vision, probably
mostly around this stuff, just because that's what I do now for my day job.

(11:55):
But this is the exact same thing we do with all
the AI stuff, what people call LLM stuff, right?
So the book, you know, what is the general theme of this book?
A computer can't tell you that inherently.
Can you give us a general like overview of what you do currently,
just so we have like the context, whatever you can talk about, like.
I can talk about everything.
I'm a developer advocate.

(12:16):
So I work at Voxel 51.
And so Voxel 51, it was started by a bunch of computer vision
scientists who were like, we need this platform to do this.
Oh, Hey, a bunch of other people really need this platform to do this.
Cause like working with, for all of these models, all this
computer AI stuff, even if you've taken any statistics,
you understand that the data is the most important thing.

(12:37):
Right?
Like if you have messy data, if your data is not clean, it doesn't matter
how fancy your model gets, it's still going to basically produce garbage.
And so what Voxel 51 is, is this piece where you pick a sensor
and you pick a question, like what's your question that you
want to ask, and then we kind of bind together the whole process
of, you know, tuning your data, cleaning your data, making sure

(12:57):
it's okay, annotating your data, like all the pieces in between.
And then training your model, fine-tuning
your model, and then you put it into production, right?
And then we don't
put it into production for you, but we'll help
monitor it as it comes out in production.
Like, why are these, what is the class of all these errors
we keep getting with our vision model and stuff like that?
And so my job as a developer advocate is to be a bridge, right?

(13:19):
So I'm the bridge to help you understand what Voxel
does, which hopefully did an okay job right there.
And then go out into the world and help people understand that.
And understand computer vision more, rising tide lifts all boats.
But then also like when I teach a workshop or when I will talk
to Justin at All Things Open and he says, Hey, blah, blah, blah.
I tried 51 and then I come back and say to the engineers, Hey, you know,

(13:41):
I've talked to a bunch of people and they're having problems with this.
Or I hear a lot of people are doing this.
We should really think about putting this into our project or product.
We have an open source project and then we have an enterprise product.
And so that's what I do.
And then I'm also like.
A bridge between sales and engineering and support.
Like we advocates generally are bridges between lots of different groups.

(14:01):
And that's what we do.
So does that make sense?
Did I give it a good explanation for that?
Okay.
You can do embeddings for all sorts of unstructured things.
Like, so like big text, lots of text, PDF document, a book, several
paragraphs, even like for most of us who've worked with computers,
we did that exercise of like comparing two strings, right?
That's pretty easy.
But other than saying these three paragraphs are exactly the

(14:23):
same, or there's this, here's all the characters that are
different, there's not much you can do that has meaning with it
in the same way you can say: is two less than three?
I got it right that time.
So videos, audio, all these things that are generally big —
things that we, our brains, can process super easily,

(14:43):
but computers don't have an inherent way to process them.
So that's unstructured data, right?
And so for a long time, there wasn't
anything we could do with unstructured data.
And then, there were some simple things we could do,
but what comes along are these neural networks, right?
And so basically what an embedding is, is you take that
unstructured data, you shove it into the neural network.

(15:03):
It has already been trained to pick out what they call features.
So, important things.
Another way of saying it is
things that have semantic information, right?
That have, like, meaning. You'll hear the word semantics
thrown around a lot, and the word features thrown around a lot.
And at the other side of it, it spits out a vector.
Usually it's like 512.

(15:25):
It could be up to 2000.
It depends, but it's a vector of numbers typically between
minus one and positive one, some sort of floats — there's
all sorts of other ways to do it. But what's encoded in
that vector is the semantic meaning of that unstructured text.
And one of the properties of these is that things that are
more similar, unstructured things that are more similar,

(15:47):
should, in that 512 dimensional space, be closer to each other.
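A minimal sketch of that idea in code, assuming the open source sentence-transformers library and its all-MiniLM-L6-v2 model (both are assumptions for illustration; that particular model emits 384 numbers rather than 512, and any embedding model would do):

```python
# Turn unstructured text into embedding vectors, then compare them.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["a photo of a cat", "a photo of a kitten", "a motorcycle engine"]
embeddings = model.encode(sentences)  # shape (3, 384), small floats

def cosine_similarity(a, b):
    # 1.0 = same direction, near 0 = unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings[0], embeddings[1]))  # high: cat vs. kitten
print(cosine_similarity(embeddings[0], embeddings[2]))  # lower: cat vs. motorcycle
```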
So let me, let me try to
explain this back in the way that I'm thinking about it.
If I have a text file and I run it through SHA 256, I
would get a string of letters and numbers out of it.
And if I change one character in that text file,
theoretically, I get a completely different SHA 256 out of it.

(16:08):
Like that's a good hash is different, no matter, you know, with small changes.
And in this case, we want like the opposite of that is
like, I changed one character or one word in that text file.
And I want something that looks kind of like
the thing that came out before, where I could be like, Oh, you know what?
Like 80 percent of this is the same, right?
Like that hashes.
And so this, this vector is just a series of 512 numbers or

(16:30):
more or less, but generally 512 numbers between minus one and one.
And then that's our quote unquote hash in this example.
And I can say, Oh, look,
you know, 30 percent of those are pretty close.
So that's probably similar to this
other thing that has similar characteristics.
In a very broad sense, that's exactly what it is, right?
I mean, one of the things you should be aware
of is embeddings, calculating embeddings.

(16:52):
That's what they call these vectors.
These vectors — they call them embeddings.
It is a compression technique.
It's a lossy compression technique, right?
I've taken a big chunk of text and converted it into 512 numbers.
And it's lossy because I can't go exactly
back from that vector to the original text.
I remember in the early days, people were
using hashes as vector embeddings, right?

(17:14):
Like they would just say, I'm going to hash it and that's a vector embedding.
Some other examples of vector embeddings that people
have come across are principal components analysis.
I don't know if any of you did, did you guys, any of you do stats?
Nope.
It's a statistical technique.
Like when you have tabular data numbers and all that
stuff, you do matrix manipulations and then you basically
end up with vectors where each vector is orthogonal.

(17:37):
The first vector captures the most variation in the data.
The second vector captures like the second most
variation in the data orthogonal to the first one.
That's another way of creating an embedding, right?
It's basically taking some data and then reducing the amount of numbers in it.
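As a sketch of that PCA flavor of embedding, assuming scikit-learn and some random stand-in numbers:

```python
# PCA as a simple embedding: 20 numbers per sample squeezed down to 2.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 20))      # 100 samples, 20 features each

pca = PCA(n_components=2)
reduced = pca.fit_transform(data)      # shape (100, 2)

# Each component is orthogonal to the previous one and captures
# the next-largest share of the variation in the data.
print(pca.explained_variance_ratio_)
```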
But
the general idea of an embedding is, is just a way to almost fuzzy match

(17:58):
something else that is similar,
but it turns out it's better than that.
So it's not just for matching.
So this is the important part.
So first, let me explain why they're called embeddings.
So the reason — 'cause I just call them vectors — but the reason why some
people in the machine learning community, who need to rename everything
that's already well known in the statistics community to some funky name —
The reason why they call it embeddings is because, and this is related

(18:18):
to that similarity: that vector represents the coordinates of
that piece of unstructured information in a 512 dimensional space.
So you're taking that unstructured information
and embedding it into that 512 dimensional space.
So it's
a point.
Yes, a point.
It is the point coordinate in a 512 dimensional

(18:40):
space, which no one can visualize, right?
But, but
I can visualize three coordinates, right?
And say, okay, if this was a three-characteristic
vector, I could say it goes at this point, and if I say a
cat is, you know, this point, then a leopard might be
a point close to it.
Yes.
So I mean, you can actually do that, right?
There's no reason you have to spit out a 512 dimensional vector.

(19:02):
You could just spit out a two dimensional vector if you wanted.
Right?
And that would be like through the neural network.
It would be looking for features and doing stuff.
But then at the end it would say, here's the X
and here's the Y. And that's where it ends up.
And so, yes, that's what should happen in that dimensional space.
And if you wanna just think in two or three
dimensions, a cat should end up near cats.
And if you have put in a picture of a dog,
the picture of the dog should be farther away.

(19:23):
And if I put in a picture of a motorcycle, that should be far away from
either, farther away from both and somewhere else in that dimensional space.
How do vectors and graphs, graph databases relate?
So, beyond my expertise. Hey, this is, this is one thing you have
to be good at as a dev advocate: saying, I don't really know.
But I'll make stuff up.
I
think everyone in technology should be good at that.
That's real.
That's so real.

(19:44):
It is because a lot of damage could be avoided.
Or at least say I'm making this part up.
I mean, I think something like that.
So what I've seen people actually care about graph
databases in this space a lot as well, because one of the
things that graph databases do is capture relationships.
Between things.
And so when you're looking at relationships between paragraphs and

(20:04):
stuff like that, people want to kind of ground with a graphing database
because it actually captures relationships. Like, you could say the cat
in this picture is the parent of the cat in that picture.
Or you could say, this is a, you can do stuff with it.
Honestly, though, I haven't played as much with them.
Neo4j does have an embedding extension, right?
So it's, it's obviously something different.

(20:25):
It's just a way to take a piece of unstructured data and turn it into.
Numbers.
It's what it's trying to do is take an unstructured data and
turn it into an array of numbers that captures the most amount
of meaning from the original structure, the unstructured data.
That's its goal.
I think one of the things I've learned about graph databases is just how
bad they are at things that aren't explicitly put into the database, right?

(20:49):
Like if I have rows in a, you know, SQL database or something,
like I can't just like make up like, Oh, what would be the.
Customer between these two, like it can't, it can't do that.
It can't make up something in between.
And graph databases are similar to that where it's like, Oh, we're
going to capture all this information about the things you put
in here, but we can't, if you give us something new, I can't tell
you like what the relationships might be with those other things.

(21:09):
And that's where it seems like
vector databases and embeddings fill that gap.
And that's, that was my understanding.
So actually, just go ahead and
describe vector databases and how they relate to that.
No, I can't yet because there's really, there's a whole bunch of really
fascinating things about embeddings that I really want to cover because
vector databases are interesting, but like embeddings are even more interesting.

(21:31):
Embeddings, you can do algebra with embeddings.
So this part is kind of really fascinating that you can do this.
What you can do is you can say king minus man plus woman,
and it will return queen. It
knows that relationship through algebra, like you can add

(21:51):
and subtract embeddings and it will obey that relationship.
I think I might have gotten that not exactly right, but I do
know that you can do algebra like man is to woman as king is to queen.
It knows that relationship, right?
So that you can say things like king minus man plus woman equals queen,
or something like that.
I think that's the form: king minus man plus woman equals queen.
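A sketch of that vector algebra, assuming the gensim library and its downloadable GloVe vectors; the canonical form of the example is king minus man plus woman, which lands near queen:

```python
# Word-vector arithmetic: positive terms are added, negative terms subtracted.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads ~130 MB on first use

result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # [('queen', ...)] with a similarity score, roughly 0.7-0.8
```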

(22:12):
Have you ever seen it get something really, really wrong?
So here's the thing, of course.
I mean, the algebra could be, of course.
The thing is, remember the part that was in the middle that makes the embedding?
It's a neural network, right?
And everybody, especially all the tech bros, want you to think these
neural networks are actually thinking and they're kind of like humans
and they're smart and they have a soul and they don't want to die.

(22:35):
They're all really basically just really, really, really fancy
regression equations.
I mean, I think people really need to understand.
And this is the part that I think is important.
I mean, if you want to go a little bit more sophisticated,
it's a whole bunch of nonlinear regression equations —
one at each node in the neural network — is what it's doing.
So it's this big nonlinear system of equations,
but it's still a statistical fit to the data.
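To make "a bunch of nonlinear regression equations" concrete, a tiny sketch with made-up weights; a real network is the same shape, just with vastly more nodes and with weights fit to data:

```python
# A two-layer network is nested nonlinear functions, nothing more mysterious.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # first layer: 8 "nodes"
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)   # second layer: 1 output

def predict(x):
    hidden = np.tanh(W1 @ x + b1)   # a nonlinear regression at each node
    return W2 @ hidden + b2         # a weighted combination of those outputs

print(predict(np.array([0.5, -1.0, 0.3, 0.9])))  # a prediction — right or wrong
```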

(22:57):
So what people are calling hallucinations — now we're not in embedding space,
now we're into model, like model and prediction space — but what
they're calling hallucinations, they're not hallucinations.
Hallucinations are what humans do.
It's a model error.
It mispredicted.
That's all it did.
It made an error in predicting and they will always make
errors in predicting because they're a statistical model.

(23:18):
There's no way they can't, the way they're made.
If anyone tells you that they can get rid of what they're calling
hallucinations by using this technique or that technique, they're wrong —
run. Because — oh, we did 15 minutes. That was pretty good.
Um,
I'm so proud of this, but I think like that's kind of, I
love that for one you said, Hey, I'm gonna explain this,

(23:40):
but I might get parts of it wrong and I don't know this.
And then just to like, I think people think
that we're against AI and I use AI every day.
I use it to make life easier all the time, but we can't keep.
Selling these products and selling this thing for what it's not.
It doesn't think it doesn't have emotions.
It's really fancy math and a bunch of like regression.

(24:02):
Like you said, you know what I mean?
And if we just use it as for what it is and we understand that, then
we know where to put it and how to use it and how to use it safely,
you know, but I just don't get why we can't just use it and call
it what it is like, what, what is the, such a hard time with that?
Well, I can give you one reason why one is copyright violation.
They're all pretty sure that they're scraping

(24:24):
copyrighted information and not reporting it.
And the thing is, if they can make it portrayed as human-like,
like it's learning, then it's not violating copyright.
Like there's nothing stopping, I'm not violating copyright if I read
the Encyclopedia Britannica and then tell you information I learned
from it because I've taken it and consumed it and I've learned it.

(24:44):
That's not a copyright violation.
If, though, I say I'm quoting the Encyclopedia Britannica and I keep quoting
it without referencing it, then it's a copyright violation, right?
Every, every 90s CD said the, the bands that influenced them, right?
They're like, you read the like five different
artists that like were the most influential.
And if that wasn't a person that would be copyrighted.
Here's what's wild to me.

(25:05):
It would make it a more reliable source and it would be easier to use, right?
Like, okay.
So when we were all in college and high
school, we had to quote our sources, right?
If.
It gave us an answer and then said, I got it from this source.
Like we figured out how to move the times in multiple other areas before
people used to sell CDs and they used to get money in a certain way.

(25:28):
Then they had to figure out how to make money off of like iTunes.
And you know what I mean?
At what point do you figure out how to pay a little bit back,
but you are giving, you're establishing the trust of knowing,
like, let's say it says, Hey, I was, I think you mean this.
I got it from this source.
It would help us to trust those models and use them
in a better way because it's citing its sources.

(25:49):
And then it would also help our kids and just the future
and everyone that's using these to know where it's going.
Like, it's just wild to me.
This is like a billion dollar industry.
We're willing to sink so much money into it and you could almost legitimize it.
But they're like, no,
because if they legitimize it, they won't get the valuation.
One is copyright.
The other I think is they won't get the valuations they get right.

(26:11):
They keep wanting to say that we've got AGI and we're not even close to AGI.
If
you make something understandable,
then it's not like, Oh, I can dream about what it might be.
It's like, Oh no, I understand how that works now.
It's
wild
how much of the black box meant magic in tech, you
know? Like back in the day, like, yeah.
Like remember when like people thought the cloud was magic and like, yeah.

(26:33):
And people had to start going around and making the shirts
that said, bro, it's just somebody else's Linux server.
You know,
the cloud is somebody else's server.
So much of it is like, you'll hear people talk about this new cool thing.
And you're like, that's the same thing we've been doing.
With this guys,
it's just someone else's.
I don't have to manage the servers anymore.
I mean, so I'm violently agreeing with you on, and

(26:53):
I don't think it's going to change, unfortunately.
And I know we're supposed to be talking about embeddings,
but everybody gets to this eventually as well.
Do you want me to explain what an LLM is doing?
Has someone explained that to you?
Not yet.
I have two other questions.
Okay.
Because I want but I want to say it because it's
do that because I really think I would love to contribute
to educating people because it like, you know what I mean?

(27:13):
Like, yes.
And the reason I want to do this is because I for the exact reason you said,
which is people then can understand why it's hallucinating or what it's doing.
And then maybe we can work on tech to make it better.
But that apart, just you can just tell it where
to use that tool.
It is a tool.
Let's use it.
I'm for using it.
Let's just not use it where we're going to hurt people and make them

(27:33):
dumb, like implementations.
Like — two questions.
Have you all played infinite craft?
I love that.
This was your question that you stopped him from explaining it.
And this is exactly
what he was saying, where you can do, you can do algebra with.
The embeddings.
And that's literally what this topic.
Okay, dad.
So that, that is exactly what it's like.
I'm going to put it in the show notes.

(27:54):
Fun about Justin though.
Like really?
Like he pretends like he's just like super funny and just
so spicy takes, but underneath the spiciness in that beard,
he was like a math major, and sometimes Justin has the most
interesting, random, "I read these three books and I connected
them" — like, his brain works like a vector and graph database.

(28:14):
It's fire.
So Infinite Craft — you haven't played it.
It's literally just doing algebra on the embeddings.
And it's great because it does exactly that where
you're like woman plus, you know, king is queen, right?
It's like you can do that math and you can try to, like, create
everything really fun, but you're not allowed to play it right now.
On the podcast.
You would not say another word.
My kids played it for like three weekends straight.

(28:34):
It was
hilarious.
It's called infinite craft.
I will send the link, but not right now.
Okay.
I don't know.
I just want to know the name, but
Google
it later.
Okay.
Go ahead.
I want it.
So my kids will be busy for three weeks, but
the nodes, and, and how things become — how a feature becomes a number.
Basically, here's my thinking of it.
How I think about it in my head is, is we basically take.

(28:55):
A picture of a cat and we chop it up into all the features that we think.
And we put it in like a, you remember the Plinko machines from,
yeah, we're like the ball, the, like the token falls, all the pins.
And I kind of think of all the nodes as those
pins that like put it one way or another.
And in my opinion, like the way I visualize it is a weight
is just making that pin slanted one way or another, right?

(29:16):
It, like it, it bends it to one side or
another and says, Hey, this should be weighted
a certain way.
I want this thing to fall a certain way more often than not.
And in my head, that's how I kind of visualize it as okay.
A node is all these little nails on a board and the weights
are just, sometimes we just block off entire pieces.
We're like, you're not — we, we are weighting
out this thing and you're not allowed to do that.

(29:36):
But also when we use the model, we don't hit all the pins.
Like, we're only hitting certain pins for certain features.
So, like, we cut up the picture of the cat into 200 little tokens, and
we're only sending it out parts of the board to get a certain thing out.
How wrong am I?
And do you have a different way of describing nodes to people?

(29:59):
Yeah, I don't have a specific way, but I can
add some, a little bit more flavor on that.
I don't think you're necessarily wrong, right?
I mean, that's a good way of thinking about
it, about you sent in this group of pixels.
Where is it going to end up when it comes out?
Right?
At least in vision models, it's a little bit easier because we can actually
look at what the activate, when you activate the different neurons, you
can see what comes up as you're moving up layers in the neural network for

(30:22):
a fully connected neural network, you're building higher level features.
So like, if you look at the first level of a computer vision neural network,
it's, like, finding straight lines or maybe a gradient or things like that.
And then what happens at the next level is it combines some of
those different neurons together to form, let's say a curve, right?
Cause that's multiple, a curve that has colors going through it or something.

(30:43):
And as you keep going up, you start getting things like noses.
And then actually what you'll start getting is —
Some of those nodes like a nose and a couple eyes will
combine and then now you've gotten a face feature, right?
And so towards the end you're getting very close to the things
that you want to actually get out of the model in the end.
And that's the big difference between deep networks and wide networks, right?
Cuz like a wide network is just gonna be

(31:04):
like straight lines at different angles.
It can't really do the abstractions necessarily. You have to categorize
every possible thing that you want to become an embedding, versus
a deep neural network where like I can build eight straight lines to
make a triangle in a different shape in a nose or something, right?
Yes.
Yes.
And I'm not an expert on that deep level stuff just to let you

(31:25):
know, but everything I also do know about talking to those people
who are at that level is it's kind of magical in the sense, right?
It's black box.
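A rough sketch of the wide-versus-deep contrast, assuming PyTorch; the per-layer comments describe the feature hierarchy Steve is gesturing at, which training may or may not produce — they're illustrative, not enforced by the code:

```python
import torch.nn as nn

# Wide: one enormous layer of detectors; every pattern must be matched directly.
wide = nn.Sequential(
    nn.Linear(784, 4096), nn.ReLU(),
    nn.Linear(4096, 10),
)

# Deep: stacked layers can compose features built by the layers below them.
deep = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # straight lines, gradients
    nn.Linear(256, 256), nn.ReLU(),   # curves, corners
    nn.Linear(256, 256), nn.ReLU(),   # parts: noses, eyes
    nn.Linear(256, 10),               # whole objects
)
```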
Oh, so this is another topic we should talk about.
There's two reasons why you build mathematical models or statistical models.
One.
Is I want to actually understand what's happening.

(31:46):
Like, if I increase the temperature, what actually happens to that thing, right?
So I don't care if I actually get the prediction exactly right, what I care most
about is the relationship between temperature and the rate of reaction, right?
I might not be able to predict most accurately, but I really know what
it means to change the temperature and I understand the mechanism.

(32:06):
And then there's another type, which is prediction. Like, I don't —
well, I'm not, I'll, I don't give a about what is going on under the hood.
I just want to know tomorrow.
What's that stock price going to be?
And should I put money into it or not?
I don't care how you figure that out.
Just, I want you to be as accurate as possible with that stock price, right?
And those are two different ways of modeling.
And to be clear, all this stuff that we're talking about here is all black box.

(32:29):
I don't care about the relationship underneath.
I just want you to give me as accurate as a prediction as possible, right?
So these models are not built for understanding what's the relationship between.
It's just, if I give you this, get really accurate at giving me this, right?
So that's important to understand.
So I think that's part of why we don't know what's going
on underneath the hood, because they haven't been built

(32:50):
To be the kind of model where you can figure out the relationship.
They're just like, just get really good.
And if I show you this face, is this the person I should let into the computer?
Right.
That's all they really want to be.
I don't care if you've got the nose angle exactly right.
Whatever you do, which can also be bad though.
You want to hear a funny story about that with computer vision?
I can tell autumn loves funny stories.

(33:10):
Yeah.
She doesn't like to laugh, but she does like funny stories.
I do like to laugh.
I love it.
Autumn, you were laughing your
ass off just two seconds ago.
I'm kidding.
Okay.
What are you like from California or something?
What are
you from California?
We don't laugh.
Sorry.
That's a, that's a little bit aggressive, Stephen.
I'm just saying it's been very, it was a good.

(33:32):
10 to 12 years of cultural acclimation for me
coming from the East Coast to the West Coast.
I can just talk about that later too.
Just because you guys are grouchy and we have tacos, leave us alone.
We have better pizza and we're more direct.
Okay, so back to and bagels too.
Um, but you guys are, but I'll give you tacos and

(33:52):
sushi.
No, tacos and
sushi.
Tacos and sushi are life.
Yeah, I give you, I'm with you on that.
Okay.
You're going to tell us a funny story.
You're going to tell us about how someone, someone found out.
Oh, vision models.
Thank you.
All three of us together.
We make one brain.
The thing was, there were these people who were
like, Oh, let's see if we can tell dogs from wolves.
Like, can we train a model that can actually say, this is a wolf.

(34:13):
This is a dog.
Cause people are always like, Oh, is that dog
part wolf, whatever reason they wanted to do it.
And they got this model and it was.
Awesome.
Like, it predicted with 99 percent accuracy on new
data sets that, yes, this is a dog and this is a wolf.
And how can this be?
How can we have created such a great model?
And then when they actually looked at the
data — and this is tying back to data again —

(34:36):
All the pictures of wolves had either snow or
forest in the background and very few of the dogs.
Yeah, so all the model had actually been
trained on is whether you see forest or snow.
Yes.
If you see a forest or snow: wolf.
If you don't: dog.
Yeah, exact same story.
Someone was training a
model
on, on whether something was cancerous or not, and
it was literally health, like a health-related thing.

(34:56):
They were training this model and they kept getting it right.
This model was doing the exact same thing. Oh, guess what?
All of the pictures with a ruler in them were cancerous, because they
were the ones that were taken in a doctor's office. It wasn't
like someone just randomly taking a picture or something, right?
We're like, Oh, now we
actually need to know how big that tumor is.
So we got to put a ruler in there.
Oh, look
at, look, this is how we decided it wasn't the model.
And that's
innocuous.

(35:17):
Yeah.
Those examples are kind of innocuous about
that, but this has also helped happened a bunch.
Oh, we can talk about open source AI too.
This is a part of the important part about why
any open source AI needs to include its data.
There was something similar to that.
I think with health and race, I forget which races and which
health item, but it was basically because one of the races

(35:38):
had more disease in it than the other, if it knew it was that race.
Or something like that, or something like that, it would predict
that that person had the disease, even if they didn't have it,
just because it had learned, this is more likely to happen.
So I'm going to weight more towards that because of
the way the data set was unbalanced or something.
Just think about how that worked in like the mortgage
scandal with Wells Fargo and several companies.
You know what I mean?

(35:59):
Like.
And it's crazy because I argue all the time that, like,
people try to put these fancy applications on top of
bad data and I'm like, how many examples do we need?
How many examples do we need?
But it's ongoing though.
I mean, the LAION data set — this was a very famous and, they wanted,
you know, an open data set — and they said it was great, and it turns
out because they had actually opened the data, some researchers

(36:21):
had found there was a whole bunch of kiddie porn in it, right?
Like there.
Yeah, there was like a whole bunch of kids because it was such a huge data set.
There was no way for a human to actually go through and check it all.
But it was like this large scraping.
And the only reason we could tell that is because
somebody could actually go and look at data, they said.
But after that, they said we're closing the data, but we fixed it.
And so now there's no way to know.
And so the thing is, yeah.

(36:41):
This is a huge debate in the open source community and in the AI community.
What does it mean to be open source AI?
I've been also doing a lot of research on that because I'm writing
a talk about security and how to make your code secure and like
writing open source like repositories and how to keep them secure.
And it's so interesting. Like, you would think that, you know,

(37:01):
like most enterprise companies are like, it has to be, all of our
code has to be behind closed doors and we don't want anyone to
look at it, but there's also so much evidence showing that a lot
of open source problems have been found because the data is open.
So it's like a really interesting balance of what you want.
To be seen in public with open source and whatnot.
Let me just want to say one thing on this though.

(37:21):
Just one thing.
I won't go deep into it.
I just don't think everybody needs to be open source though.
Like I love open source and I'm a huge fan, but some products
should definitely not be open source
and some data sets should not be open source, right?
Like if your data set
is using proprietary data, like people's health information,
and it's personally identifying for those people —
Please do not — like
our government.
Yeah,
plus

(37:47):
this is being recorded.
What are you thinking?
So, but just don't call yourself open source and that's fine.
You want to open the weights up and give us the weights.
Great.
I mean,
they're called open AI.
Like we have a problem.
Oh, that's a very basic problem with them in general.
It's like them and Metta are the two worst.
Metta keeps wanting to They can't
tell what they want.
Do you want to be a non profit?
Do you want to be open?

(38:07):
Do you not want to be open?
They
want profit.
They want it all.
Two recommendations.
Anyone listening, read the book, The Alignment Problem.
It was the best, like it had all the examples of everything
wrong with like aligning your data with the outcomes you want.
Fantastic book.
And also go watch the movie, The Mitchells, Mitchells vs.
The Machines.
I love that
movie!
Oh, my whole — the whole problem breaks

(38:27):
down to like them trying to get a vision model
that can detect a dog versus a muffin or whatever.
Like, that's how they beat the machines. It was great.
Go watch that.
Dude, it
was, that is the cutest movie, like.
It was great.
Netflix original.
I ignore my kids movies like 90 percent of the time.
If they turn that movie on, I'm like, move out of the way.
Move.
Where's the popcorn?
Also, my favorite part is the tech bro gets

(38:49):
kidnapped by his own robots and it makes me so happy.
But Autumn, this, the fact that you just said that,
makes me think you're showing your kids the wrong movies.
Do you not show them the whole Miyazaki series?
What's that?
What?
Autumn, you're leaving the podcast right now.
I'm sorry.
You have, you have, you have, you have cat

(39:10):
headphones and you, and you have like, It's
funny when I tease you guys, but I don't like it when you do it back to me.
Oh,
Autumn.
You have just, Don't
give me the Steve eyebrows.
Those
are for Justin.
Okay.
Autumn.
What I will say, Autumn, is we've, you've,
Actually, I'll say it in a positive way.
Autumn, let me introduce you to this great
new world that you and your kids can share.

(39:32):
So we'll do it off podcast, but.
Steve, your expressions are just the top tier.
Not Princess
Mononoke.
Just FYI, don't start there.
How old are your kids?
How old
are your kids?
Seven, five and 11.
Okay.
So the 11 year old may be Princess Mononoke.
Listen, Miyazaki is a Japanese animator.
But he makes the most glorious movies ever — like, the imagination.

(39:56):
There are strong female characters, which is important for your boys.
Almost all of them have really strong female characters, even the leads as well.
They're just — yes, it's amazing.
My suggestion is start with My Neighbor Totoro.
All the kids will like that one.
And then your five year old may not get freaked out.
I started my kids a little too young.
I'm like, Spirited Away —
And the other is

(40:16):
very, it was a little, yeah.
But Totoro, Kiki's Delivery Service —
Fantastic.
The best thing you need to know about Miyazaki is he had
a quote about, they showed him AI generated animation
and his quote was, it was an insult to life itself.
I think this might be how artists actually finally get paid their worth,
because AI is going to ruin art so bad that they're going to be like:
now, everything that someone makes by hand, we'll pay a bunch of money for.

(40:39):
Also, Justin, you did a great job because you
just found Steve randomly with a vector talk.
Like, you found all this personality in a vector talk?
I didn't know he was from New York, though,
and so it was a little unfortunate there.
I couldn't get all wins, but we're, we're getting Whoa!
Whoa!
We got, we got Wait, but I think the personality
is because he came from New York, so we got, you know, there you go.

(41:02):
Bing, bing, oh, sound effect, bing, bing, bing.
We accept
you for who you are.
Thank you.
We're not just canceling people and firing them just
because of — We got, we got roughly 20 minutes left.
All right, so look, so once you've calculated these
embeddings, there's all sorts of cool things you can do.
You can actually use these embeddings to replace your information.
Like people are now doing analysis just straight
on embeddings, not the pictures themselves.

(41:22):
Once they calculate the embedding, they throw the picture away and they do their
analysis on the embedding, because it's got all the relevant information out of it.
Okay.
So now we've done a whole bunch of these embeddings — every picture.
If you send the picture through an embedding
model, you'll always get the same numbers out.
Same picture, same numbers always.
As long as you don't
change the model, you don't change the picture.
As long as you don't change the model, you can put any picture you

(41:44):
want, and all of those images will get mapped into that space, right?
And so now you've got, let's say, 10,000 embeddings.
You're like, what do I do with all these embeddings?
Like I said before, embeddings that are closer in space should be more similar.
And so what vector databases do is they allow you to take those embeddings,
put them into a database, and then create an index so that you can

(42:09):
quickly search who's close to this embedding that I just passed in.
So, like, find a picture similar.
Or find text similar or an abstract similar.
Take your, your unstructured thing and put it through the exact
same embedding model that you use to create your other embeddings.
Get the embedding that comes out, ask the database,
Hey, what embeddings are closest to this embedding?

(42:30):
And it'll give you back if you say, I want five,
it'll give you back the five closest things.
So
basically an embedding would be something good if you had
data that you didn't want to keep and have to worry about
securing and it could be something, you know what I mean?
Like, because then you wouldn't have to care or worry as much about like,
Hey, I have all this data and I don't want to be responsible for it.

(42:52):
I don't want to pay the storage.
Could you just keep the embeddings?
and use those?
There's a reason I'm not in the security field.
I will say, tentatively, yes — in the same way that —
You'd still have to secure the embeddings, but I wonder if that would be less to store.
It's the same
as, yeah, it's the same kind of thing though, right?
Like, it's any of those kinds of things where I
can compress and retain the essential information.

(43:13):
I can't, it's not the exact, right?
So I can't go back to saying, yes, this was exactly a picture of
a cat, but I can say: this is a cat-like picture that was originally there.
Yes, you could do that.
It's a way to compress your data while retaining meaning, as opposed to something
like in a hash, which doesn't have any semantic meaning in it, right?
It's just a hash.

(43:33):
This says: compress the data, but retain a bunch of the stuff that's important
to me, which I, we haven't gotten to this yet, but I want to get all the
way through in vector embeddings, the model you use is actually important
in terms of what actually comes out in the embedding vector as well, right?
It's not like I can throw it into any model and I'm
going to get the exact same kind of features to come out.

(43:54):
It depends on how that model was trained.
So when you pick an embedding model, you want to pick an embedding model that.
Has both the architecture that you would want, but also was trained
on data similar to the stuff you want to ask questions about, right?
So like if I trained a model on — so, a very
famous data set is the MNIST image data set.
It's those handwritten numbers, which I'm
sure you've seen over and over again, right?

(44:16):
So if I train a, uh, computer vision model on that,
But I want to actually tell apples from oranges
or good apples from apples that have rot on them.
A model trained on that data set's not going to do really well, right?
Because it never saw those things.
It only saw handwritten numbers.
So you really do have to pay attention to the model that you use when
you create these embeddings, because you can create a whole bunch of

(44:38):
embeddings, but they might not actually show what you want it to show, right?
And you can put them all the way through, get
similarities, but they may not actually be similar things,
because the model was
never trained to know that those were similar things.
A lot of this stuff, if you're going to do
it internally is a lot of experimentation.
Right?
This isn't like writing code.
And did I get the right output?

(44:58):
It's: let's try to run some of this stuff.
Let's try this model.
Let's see if we add this many more parameters.
If we use a bigger model, yes, it's more
accurate, but it takes this much longer to run and we have to use
this much GPU, and it's like, there's all these things, there's all
these trade offs and things that you need to pay attention to.
I'm giving you the simplified version of this whole thing, just so you can
get the framework.
no, but that helps.
And I also think that it's another example of when

(45:20):
you need the education to use these things, right?
Because it's effective.
It sounds like it could be used to have
new, more efficient ways to do these things.
But you need this education to make sure you pick the right tool.
So rad.
Yeah.
Cause that's definitely one of the things
you're going to need to pay attention to.
Okay.
So then you put them in the vector database.
The, one of the reasons you want to put them in a
vector database rather than just calculating them and.

(45:43):
If you didn't have the vector database and you wanted to do
that similarity query, you'd have to keep all your embeddings
in memory, and then you can actually calculate it exactly,
brute force, or you could build an index on top.
Most databases, the two nice parts about them: one is their low resource consumption,
right,
compared to keeping everything in memory; and two, they have an index, right?

(46:04):
So you don't have to brute-force calculate the distance
between everything all the time, which you can do.
And that'll give you an exact search.
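The brute-force alternative, as a numpy sketch — exact, but it touches every stored vector on every query, which is exactly the work an index saves you:

```python
import numpy as np

rng = np.random.default_rng(0)
stored = rng.normal(size=(10_000, 512))   # all the embeddings, held in memory
query = rng.normal(size=512)

# Cosine similarity of the query against all 10,000 vectors at once.
sims = stored @ query / (np.linalg.norm(stored, axis=1) * np.linalg.norm(query))

top5 = np.argsort(sims)[-5:][::-1]        # indices of the five nearest items
```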
Most of the vector searches that you do now are in a database. I use
Postgres — because why not Postgres; you should Postgres everything,
just like you should use IntelliJ. You should use Postgres because, one,
you all know relational already; probably
you've used some sort of relational database.

(46:25):
Postgres is open source.
It's owned by a foundation.
So nobody's going to be like, well, we closed source that.
You guys are out of luck now.
And then the third reason is there's a ton
of plugins and I'm like a big geospatial guy.
That's actually my really deep background.
And there's a, it's like the gold standard for geospatial
analysis and a database is PostGIS, which is in there.
Also, I feel like
you, you want one database, you want performance, you want

(46:49):
something that — like, there's a lot of good databases and you need to
pick the right database for the tool, but relational databases are
what we teach in school, first of all. And, uh, I won't
say non-relational, 'cause that's like a lie — with NoSQL databases,
like, there, you still have to learn the
access patterns, which people don't know.

(47:11):
So even though it might be the better database, they'll
dump it in and not do the right things that they need to do.
So if you have to make a choice, and you need something that's slightly more NoSQL and not as much structure, but you want it to perform well, and your team has a relational database background, and you don't want the license

(47:31):
pulled from underneath you.
Postgres has so many data types.
Yeah. And then, like, not just that, but, bro, half of the relational databases have Postgres somewhere underneath them with a bunch of cool tuning and, like, an enterprise interface, which is why they're so expensive.
We can get on that whole thing.
But my general advice to people is start with Postgres before
you start with any of these specialized databases, whether that's
a document database or any of those other start with Postgres.

(47:53):
You might've noticed that engineers in general think they need the fastest, biggest, whatever server there's going to be, and then they get purchased and then they're using like 2 percent resource utilization the entire time. So start with Postgres, and if you then run into scaling issues or something else, go read the documentation, 'cause you're probably doing it not optimally. After you've read the documentation and tuned your database, if you still can't get it to work, then look at something

(48:15):
specialized. But just start with Postgres from the beginning.
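As a rough sketch of what that Postgres route can look like, here's the pgvector extension used through Python's psycopg driver; the connection string, table name, and tiny 3-dimensional vectors are made up for illustration (real embeddings are much wider):

```python
# A rough sketch of storing and querying embeddings in Postgres with the
# pgvector extension. Connection string, table, and dimensions are made up.
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=mydb")  # hypothetical connection string
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)

conn.execute(
    "CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))"
)
# An approximate-nearest-neighbor index (HNSW), so similarity queries
# don't have to brute-force the distance to every row.
conn.execute(
    "CREATE INDEX IF NOT EXISTS items_idx ON items USING hnsw (embedding vector_cosine_ops)"
)

conn.execute("INSERT INTO items (embedding) VALUES (%s)", (np.array([0.1, 0.2, 0.3]),))
conn.commit()

# <=> is pgvector's cosine-distance operator: smallest distance first.
rows = conn.execute(
    "SELECT id FROM items ORDER BY embedding <=> %s LIMIT 5",
    (np.array([0.1, 0.2, 0.25]),),
).fetchall()
print(rows)
```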
Okay.
So back to the databases and the similarity search, right?
So those are the reasons you want it in the database.
Once you've done this.
The important part with these kinds of
searches is you're not doing an exact search.
If most of your questions are exact questions,
like how much money did we make last year?
You don't want a vector database for that, right?

(48:35):
That's like, that's relational or whatever.
Yeah, that's analytics
and relational. Which, again, is like giving people the information to pick the right tools for the job, because they think it's a black box of magic.
And I'm just like,
Right, exactly.
So for this one, there's questions like: what is like this thing, or what thing is close to this thing, or show me

(48:56):
other things like this thing, or clustering, or looking for anomalies, right? It's not looking for exact opposites or anything like that; you're looking for something that is not similar, like credit card transactions.
You could do that.
You could build an embedding around credit card transactions
that was both like the picture of the receipt and all sorts
of other things so that it like comes up with some vector.
I don't know.
And you can say, look, show me the closest

(49:18):
credit card transactions for this person to this new one.
And if the distance between those is really far, you could be
like, this is probably not one of their credit card transactions.
Right.
And so that's what would happen with that.
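A toy sketch of that anomaly idea, assuming you already have embeddings for the cardholder's past transactions; the vectors and the threshold below are invented for illustration:

```python
# Toy version: if the new transaction's embedding is far from everything
# we've seen for this cardholder, flag it. Vectors and cutoff are made up.
import numpy as np

past = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.85, 0.15, 0.05]])
new = np.array([0.1, 0.1, 0.9])

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Distance to the closest known transaction for this person.
nearest = min(cosine_distance(new, p) for p in past)
if nearest > 0.5:  # arbitrary cutoff for the sketch
    print(f"probably not one of their transactions (distance {nearest:.2f})")
```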
I, I think there's different ways to measure the distance as well.
Um, just so you know, and then they have different properties to them.
The one that most people use is cosine.

(49:39):
So it just looks at the angles.
It does really well because mathematical reasons.
But you can also do things like Euclidean
distance, even in 512 dimensional space.
You know, like, Euclidean distance is like the hypotenuse of a triangle: the hypotenuse is the square root of a squared plus b squared.
That thing, like straight line distance,
but you can do it in all sorts of space.

(49:59):
There's also like, some people call it Manhattan, some other
people call it Taxi distance, which is like if you had to
drive around blocks, right, rather than straight lines.
So those are those guys.
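For reference, here are the three distances just mentioned, computed on one pair of vectors with numpy; the same code works unchanged in 512-dimensional space:

```python
# Cosine, Euclidean, and Manhattan (taxi) distance on one pair of vectors.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 5.0])

cosine = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle only
euclidean = np.linalg.norm(a - b)   # straight-line / hypotenuse distance
manhattan = np.sum(np.abs(a - b))   # taxi distance: around the blocks

print(cosine, euclidean, manhattan)
```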
There's all sorts of...
Being a New Yorker does make you funny.
Yeah, but I'm not from New York City.
Thank goodness.
Um, I'm from Rockland County. Do you know where Westchester is, like White Plains, that area? It's a commuting suburb, 45 minutes north of New York City.

(50:23):
But it's in New York, but it's just like a suburb of it.
Yes.
And if you're from New York City, I'm from upstate.
And if you're from upstate, I'm from New York City.
Right.
It's like one of those places where our claim to fame of the town that I'm from
is we were one of the places supposedly that they went to on Sex and the City.
Ooh.
I've watched that show.
You did?
Okay.
So you remember Carrie had like the woodworker boyfriend?

(50:46):
Yeah.
Not Mr. Big, but the other one.
I forget what his name was.
The one who, like, had a dog, and they broke down the wall between their apartments.
Oh yeah, yeah.
Whatever his name was.
Anyway, he had a cabin.
He was my favorite.
Yeah, mine too.
But he's from Northern Exposure too.
He was in Northern Exposure.
Was he Aidan?
Aidan? Yes. Yes. So, yes.
He was the only one that wasn't toxic and total garbage, but keep going.

(51:06):
Yes.
So Aidan had a cabin that was supposedly in the woods, like far out in the middle of nowhere. And Carrie goes to visit him, and she calls up, I think, Miranda or somebody, complaining: I had to go all the way to New Jersey to get a cup of coffee, blah, blah, blah. The name of the town was Suffern, which is where I'm from. So it's an easy play on words: I'm suffering here in Suffern. But there were Starbucks all over the place.

(51:29):
They're just lying out here on Sex and the City. And also, like, maybe you had to go all the way out there because he was the only one that wasn't toxic and hot garbage, but okay, whatever. And actually had a connection to nature and real things.
Okay, anyway, I think we've covered what
you use vector databases for in general.
So their use case is generally around similarity, not exactness.
And you're using it for things which are unstructured,

(51:50):
generally, one of the things that you might have heard about
a lot lately is this thing called creating a RAG system,
Retrieval Augmented Generation, that requires a vector database.
So the idea behind a RAG is...
Could you explain RAG versus, what was it, A-G? What is it, A-I-G or... Oh, agentic.

(52:10):
Yes.
Between the two.
Okay.
So here's my understanding. I'll just give you a quick take on agentic, 'cause I haven't really worked on agentic systems that much. Agentic, to me, from what I've seen, is a fancy way of saying we've written a set of microservices using LLMs. That's basically what it is.
So when you say it's a daemon...
Yeah, it's a daemon that runs, a daemon that does AI stuff. And it's like this:

(52:30):
You send in a request to the agentic system, which is probably
going to have a specialized LLM that's going to say, Oh, given that
type of question, this is the kind of LLM you want to answer this.
So it's going to send that query off to that other LLM.
And if it's a multi step problem, we'll say, okay, I'll get that answer first.
It'll come back to that same agent again, and then it'll send it off to

(52:51):
another agent, another microservice that handles the next part of the question.
And then it reassembles the answer and sends it back.
That's the general idea of agentic.
Maybe there's something more exciting, but given the history of how
people have bragged about stuff, I'm basically thinking it's mostly
like a microservice architecture for connecting LLMs together.
That's an agentic system, right?
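A minimal sketch of that routing idea; call_llm and every model name here are hypothetical stand-ins, not a real API:

```python
# Sketch of the "microservices of LLMs" routing idea described above.
def call_llm(model: str, prompt: str) -> str:
    # Hypothetical stand-in: imagine this sends the prompt to the named
    # model and returns its text answer.
    return "general"

SPECIALISTS = {"code": "code-model", "legal": "legal-model", "general": "general-model"}

def agent(query: str) -> str:
    # A router model decides which specialist should handle the query.
    topic = call_llm("router-model", f"Classify as code, legal, or general: {query}")
    # Dispatch to the specialist, falling back to a general model.
    answer = call_llm(SPECIALISTS.get(topic.strip(), "general-model"), query)
    # Reassemble/polish the answer before it goes back to the user.
    return call_llm("router-model", f"Clean up this answer for the user: {answer}")
```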

(53:12):
So there's something smart enough, or that has some logic, to say: put these together and bring it back into a coherent whole. RAG is completely different.
The other option to RAG is fine-tuning one of your models, right? So fine-tuning means I take somebody's model, like the Llama model, and I
chop off the last layer, or I open up the weights again, and then I give it

(53:33):
my own data, and I retune parts of the model to be more focused on my data, my questions, my stuff. So for example, Justin is going to build an IoT-specific LLM.
Is all that stuff in the background for IoT, or is it more for the woodshop?
It's, it's all different things.
Okay, sure.

(53:54):
Justin is going to build a...
It's called ADHD Hobbies.
Exactly.
It's a pegboard of stuff.
It's a pegboard of... This was January of 2023.
So he's going to build a maker LLM that answers all sorts of questions
around makers, not just the general internet, right?
And so what he would do is he'd have a whole bunch of

(54:15):
question and answer and text pairings around maker stuff.
Like you take maker magazine and scan that and do some other
stuff, scrape some maker site specifically, and then open up
the last couple of layers of the model and retrain or do that
same training process basically again, but with his new data.
So the model gets all this really good information
it learned early on in the earlier layers.

(54:37):
We're leaving that kind of structure alone, like how sentences are put together, or what features in images I care about.
But the later layers are now getting tuned specifically
to the questions and the data that he's giving it.
And that's called a fine tune model.
And that's pretty cool.
And it's fun once you do it.
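As a rough sketch of that freeze-the-early-layers idea in PyTorch; the base model name is a placeholder, and the attribute path to "the last layer" varies by model family (this follows a Llama-style layout):

```python
# Sketch of fine-tuning: freeze the early layers, leave the last ones
# trainable, then train on your own data. Model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("some-base-model")  # placeholder

# Freeze everything: the early layers keep what they learned about
# language in general.
for param in model.parameters():
    param.requires_grad = False

# Open up just the output head and the last transformer block, then run
# the usual training loop over your own data (the maker Q&A pairs).
for param in model.lm_head.parameters():
    param.requires_grad = True
for param in model.model.layers[-1].parameters():  # layout varies by family
    param.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
```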
What RAG is, is: I don't really have a machine learning staff

(54:58):
there at my company. I don't have that kind of hardware. But I want to see if I can take advantage of some of this stuff. And the problem with using something like OpenAI or Claude or any of these really large foundational models is they were trained on the whole internet, right? Or some huge corpus of text.
So the example I really wanted to build that was perfect for RAG, one of

(55:18):
my friends when I was at VMware started using OpenAI as a dungeon master.
And so he would be like, we're doing this campaign.
These are the use cases that we need to use OpenAI for.
Like, you know what I mean?
It's not going to hurt anybody.
It's not responsible for any crazy decisions, but we're going to be efficient
and let humans live their best lives and spend more time doing fun stuff.

(55:39):
What he noticed when he was doing it: it kept forgetting things, and it also didn't know all the rules, right? So he had to keep saying, that was close, but remember, kobolds will usually attack like this, and their hit dice is like this. And then it would generate a random number and do stuff, but it wasn't really getting it. Which is a perfect setup for what RAG is.
So what RAG, it stands for retrieval augmented generation.

(56:01):
We were working on a demo and I've actually calculated the embeddings.
I just, I actually had to get a job rather than doing consulting.
Um, and I actually had to get paid rather
than just doing something that was fun.
But, so what I did is we found all the
fifth edition D and D manuals in Markdown.
And so what I did is I took all the manuals and created embeddings
for every markdown header section from the D&D manuals.

(56:22):
Like we had the player's guide, the dungeon master's
guide, a couple of the monster manuals, a couple of
campaigns, and I made embeddings for all of them, right?
And then I stuck those embeddings in a database.
And so now what we can do is, when the user says, hey, I open the door to dot dot dot, before we send that to OpenAI,

(56:43):
We intercept that query.
We take that query, create an embedding for that query, then
use that to query our database for information in all the
guides that is similar to the query the user was asking.
And now when we send that query on to OpenAI, we say: Okay.
Here's the original query.
And then underneath it, we say something like for context, and then we

(57:04):
include the information that came from the database that was tied to it.
Like we stored the original text along with the embeddings,
and we include that original text now with the user's query.
Now, when we send that to OpenAI, OpenAI says: Oh, right, we're not talking about the whole internet. We're talking about this very specific stuff, and I'm going to focus my answer particularly on it. And I

(57:26):
have more recent information, like you're talking about kobolds.
I maybe crawled a couple of things on kobolds, but
now you're giving me like paragraphs about kobolds.
So I know more about kobolds to give my answer back to you.
So, like, embedding is like extra information, kind of? Is it like a text file, or, like, just basically extra information to help give it more context?

(57:48):
The embeddings used in this case: you have this database of information really relevant to the user's question that OpenAI either doesn't have access to or isn't really focused on. So you want to find the most relevant information in your database to help give more flavor to the user's query.
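Putting the pieces together, a minimal sketch of that RAG flow; embed, vector_search, and call_llm are hypothetical stand-ins for the embedding model, the vector database query, and the LLM client, and the returned strings are fake:

```python
# A minimal sketch of the RAG flow just described. All three helpers are
# hypothetical stand-ins, not a real API.
def embed(text: str) -> list[float]:
    # Stand-in: must be the same embedding model used at index time.
    return [0.0, 0.0, 0.0]

def vector_search(query_vec: list[float], k: int = 3) -> list[str]:
    # Stand-in: return the original text stored alongside the k nearest
    # embeddings (e.g., sections of the D&D manuals).
    return ["Kobolds usually attack in packs...", "Kobold hit dice: ..."]

def call_llm(prompt: str) -> str:
    # Stand-in for the call to OpenAI or any other model.
    return "..."

def rag_answer(user_query: str) -> str:
    # Intercept the query, embed it, and pull the most similar passages
    # out of the database.
    passages = vector_search(embed(user_query))
    # Send the original query plus the retrieved text as context.
    prompt = f"{user_query}\n\nFor context:\n" + "\n".join(passages)
    return call_llm(prompt)
```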
Okay.
So does it then just use your database or does it use your database

(58:10):
and the whole internet, but just uses your database to enrich the data.
And now we get to: what is an LLM actually doing? So this is why you need to understand what an LLM does at its essence. If you substitute time for position in a sentence, an LLM is an autoregressive time series model.
This is the exact same thing.
So what happens is you send in your query.

(58:34):
So here's our query, right?
It's this chunk of text over here.
What we ask the model to do is to start predicting words.
So I don't know how they come up with their first word.
They've got some sort of magic thing where
they come up with their first word, right?
That's going to answer your question, but it comes up with it one word at a time. It doesn't predict the whole sentence at once; that's why you see it scroll across. So it takes that first word.

(58:55):
Let's just put that first word in: 'the'. Let's just say for some reason it came up with 'the', I don't know how it did it. It's got 'the'.
Then it says, okay, I got it, and it's time for me to predict the next word.
I'm going to predict this next word based upon the word before, and also constrained by the stuff that you passed in.
Right.
So when they trained the LLM, it learned all these relationships

(59:17):
between words and how they appear in sentences and how which
words appear more often closer to each other and all that stuff.
So what it's doing is it's saying: okay, given these words that you told me in your query and this word I predicted, this is the likeliest next word. And then for the next word: it can be given these two words, or just the word before, or

(59:37):
whatever, or the rest of the sentence. Given what you passed in and those words, what is the next most likely word?
And it just keeps doing that over and over again.
It just keeps predicting words one at a time as it goes down a chain.
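That word-at-a-time loop, sketched with a small Hugging Face model and greedy decoding; real chat models add sampling, chat templates, and much more on top, but the core loop looks like this:

```python
# Greedy next-token loop: the model scores the next token given
# everything so far, we append it, and repeat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The kobold attacks", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):
        logits = model(ids).logits[0, -1]   # scores for the next token,
        next_id = torch.argmax(logits)      # given everything so far
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```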
Can I ask a quick question?
Of course.
Why does it use the compute power or, you know, space to think, to

(59:59):
predict the next word, instead of just taking in all the information
and then looking through whatever data to do it, like, what is the
purpose of the prediction instead of just taking it all in at once?
No, it's taking all these words in at once.
The ones you passed in, and it's using those words plus the words that came
before to constrain the probability of whatever the next word is going to be.

(01:00:21):
Oh, so the next word of the output?
Yes, of the output.
Okay, the output, got it.
Sorry, sorry,
sorry.
So it's going to be like 'the kobold', and then it's going to ask: what's the next word that's going to come after 'kobold'? Given that this is the question you asked me, and all the relationships I've learned between words, and the first two words were 'the kobold', what's the next most likely word in that sentence? Does that make sense?

(01:00:43):
That does make sense. And it makes sense the way it just constantly thinks as it spits out more information, because it's predicting as it goes. So that makes sense.
So, and this is the beauty of ADHD, we're going to tie this all back to stuff from the beginning about safety and what we should be able to do. The way it's implemented currently is what's causing

(01:01:05):
the problem, because the way that we use most of those models is we tell it: it doesn't matter how much confidence you have in that word. Your probability could be 0.1 that that's the right next word, and I still want you to put it in there; you still have to make a complete sentence for me. I don't care what the probability of the next word is. It could be 0.9, you're really confident that that's the next word, or it could be 0.1,

(01:01:27):
I have no idea, but this is the best I got. And so what ends up happening is, that's how an error gets introduced, right?
Like we force it to predict a low confidence
word or it knew it was a low confidence word.
Statistically, it knew, but it still had to put something in.
So it puts it in.
Well, now we're going to predict the word after that.
Are there any models that don't force it to do that?

(01:01:47):
The programmers could, when they go to output it. OpenAI used to have this really great example, and I can't for the life of me find it, and I wish everybody would just do it now: they would color code, like highlight, each word given the probability that the model thought that was the right word.
So the sentence would look like yellows and greens and reds based

(01:02:09):
on the probability that the model thought that was the right word.
It's not exactly perfect for us, but if you see like a red in a sentence that seems kind of fishy to you, you could be like: oh, well, that's probably a low-probability word.
That's probably a low probability word.
This sentence is probably off.
This totally goes back to what we were saying.
They want it to be the magic box so bad,
but that would give more trust and more.

(01:02:30):
And like, I would pick, and probably pay for, a model that would tell me when it's unsure, tell me where it got it from, and what the other possibilities are, because then you have context.
I don't think you're ever going to get it to tell you where it got it from.
And I'll, I can explain that one in a second, but you can get this uncertainty.

(01:02:50):
You can also, they have other, like, if you play with some of the examples on OpenAI...
Maybe not tell you where it got it from, but tell you if it's a reputable source, right? Like, a reputable source in the context.
It still can't get that for you, because the only way you could get that is to constrain the data it was trained on to only be what you consider reputable sources.
I mean, I would still take that too, you know, because like, look

(01:03:12):
at, like... I usually really agree with Mark Cuban on a lot of things. I think he's great, but he basically told all of Bluesky a week or two ago that you don't need to go to college or learn things. You can just kind of use AI, because AI knows so much, and then you can do all these highly skilled jobs.
And it's like,
you can do importantly,

(01:03:32):
exactly.
And I think AI can be a learning tool.
Cause sometimes when I'm tired of just reading and doing stuff, I use it to
kind of almost play games with, to like ask questions and help me get started.
Like, how would you phrase this?
Or like how to make.
My ADHD brain sound a little bit more coherent, but I still have to go back,
read it, make sure it's saying what I need it to say and do all these things.

(01:03:53):
And like, it's almost a disservice to the fact that this could help kids.
It could help even kids with learning disabilities.
It could help people like get more into things.
It really could be an educational tool, but we're
selling it as a Bible and not an education tool.
You know what I mean?
And, like, selling it as a 'you don't have to be educated' thing is a lie. Selling it as 'this can be your new tool to get educated' would be helpful.

(01:04:16):
Like,
so I wrote two blog posts: why I don't like the current LLMs, and then one on why I do like them, like what I'm excited about for them. And one of the things I'm excited about, and this is probably, again, related to the ADHD stuff and all the neuro stuff, is that writing has always been hard for me. I think I probably have a writing disability. To me, these LLMs are a writing calculator.

(01:04:37):
Yes, that's exactly what I use it for, because I have all these thoughts, so I put all the thoughts out there, and then I'm like, how would you say this coherently and fully? And then I go back, read it, add what I want, because it always takes out stuff that's important and it doesn't have context. But it's, like, wild, because we could give the kids, or us, or ADHD folks context to use these tools to help us do things.

(01:05:00):
Just like a math calculator, right? You have to learn math. You have to learn the basic operations, you can't get out of it. But when you go to do advanced calculations, we're not going to keep you from being able to do that advanced thing just because you have trouble actually, physically, in your brain doing these early things.
It's the same thing like looking at a blank
page for me and saying, write a full essay.

(01:05:22):
Sure.
Guarantee of like terribleness,
dude.
I've been in, like, this email, like, total paralysis for like three days. And I literally was like, tell me where you would start with this. And then I got so much stuff done, because I just need it to get the, like, blank-page paralysis out of the way,

(01:05:46):
and even if it's completely wrong, you can at least
look at it and be like, Oh, they got this part wrong.
And then this thing missed so much stuff; I had to give it more context.
Exactly. But that was the point: I wasn't stuck for hours staring at a blank page. And, like, I can edit great.
Yeah.
A lot of people are better editors than they are writers. You can't express what it is that you want, but you know when it's not

(01:06:07):
the right thing, and you can nudge it back.
But it's also
executive functioning of just getting started.
And it seems like such a daunting task.
And when you feel like, okay, I don't have to do too much more.
I just need to add this in and make sure it didn't forget things.
Exactly.
I've never let it just write anything for me, but it's a great start.

(01:06:28):
So I feel unstuck.
That's basically what an LLM is doing under the hood. And that's why we could make them better, but we don't. We could start to do things like: predict the whole sentence before you send it to me, and if there's more than like three words in there that are below 0.1, just tell me you don't know.
dude, I would pay so much for that.
Like, you know what I just, it's wild that it could, like, you're telling me

(01:06:49):
that we got these scores and everything, and they're just sleeping on it and
like, and they're all doing the same stuff and not differentiating themselves.
And these are wildly clear ways they could differentiate themselves and take the market, with hundreds of millions of dollars being dumped in.
And they're like, no, it's good.
We want to be the
best.
You see them fighting on those leaderboards over a 0.1 or, like, a 1 percent improvement or a 0.1

(01:07:09):
percent improvement or whatever. And it's like, uh, that's, you know, little bits on the end of the pencil.
The thing that I need is better features. And, like, it's very tech-bro typical, is what I would say.
Before you drop Steve, we have to hear your parental, uh,
advice and the other things that we were talking about at
the beginning of the show that you were going to give us.

(01:07:30):
Right.
But there's one other thing I still want to say about the model: I want to explain to you why you'll never get sources, 'cause I understand why you want them, and I said that in my blog post. You can't get sources out of it. You took regression, though, right? Like, do you remember regression?
So what comes out of the regression is an equation, Y equals MX plus B, right?
Like if I increase temperature by this much
plus some error, I'm going to get this.

(01:07:51):
What it has is like one of those equations under the hood. There's a big, huge equation under its hood. And I can't go back and say, hey, if we're doing height versus weight, tell me exactly which height made you predict this weight. You can't do that with a regression model, and you can't do that with these models either, because all the words and relationships get mixed up into, like, one large equation.

(01:08:15):
And so you can't go back and say, Which article specifically taught
you this relationship to this word or whatever it is, you can't.
It's just the nature of the beast.
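A tiny illustration of that regression point: after fitting, all that survives is m and b in y = m*x + b, and the made-up observations below are blended into those two numbers and can't be recovered from them:

```python
# Fit y = m*x + b; only the coefficients survive the fit.
import numpy as np

heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weights = np.array([50.0, 58.0, 65.0, 74.0, 82.0])

m, b = np.polyfit(heights, weights, 1)  # returns [m, b] for degree 1
print(f"weight ~= {m:.2f} * height + {b:.2f}")
# Nothing in m or b points back to any single height/weight pair, which
# is the same reason an LLM can't cite which article taught it a word.
```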
You can't. RAG can help some with that, though, because you can give it a source and say: please include these sources in the result, related to your stuff, because

(01:08:35):
that's exactly what, like, the Bing search does, right? Bing search is doing that. It does a web search, fetches a bunch of results, and then generates an answer based on that web search. And it's like, hey, these are all the sources that I used to generate this answer.
But then everyone has to use Bing, and that's just... Bing is such a bad search engine.
So, I mean, I keep wanting to... what's your parenting advice?
Okay.
So the dating advice, that was the one that you

(01:08:57):
were asking about first; you were like, why can't guys just be...
Okay, so here's my advice.
This is what I gave to my friend who, she had some
anxiety issues and stuff and she kept dating guys.
Yeah.
So most of the guys she was dating, they didn't understand it, and they were like, why are you getting so upset?
And blah, blah, blah.
All this stuff.
I said, listen, I'll call her Mary.

(01:09:18):
Even though that's not her name.
Mary, listen.
If you can find a guy who's divorced with kids and you see his relationship
is good with his kids, that's the kind of guy you want to be with.
That's literally what I look for.
Do they exist?
And then like, how do I like, just where do they live?
Do, do they hide under rocks?
Like I just.

(01:09:38):
Do you have to look for nerdy ones?
No, those are mean
to me.
Those are mean to me.
Do they have kids though?
Yes.
I refuse.
They don't even like their kids.
I broke.
Okay.
So
that was the point.
What was the thing?
I said,
okay, but they don't, the two don't go together.
Yes, they do.
Oh,
you know what, Steve, you send me a list of your friends.
Okay.
It's
hard.

(01:09:59):
It's hard.
It's hard to find.
I'm just saying they don't
exist.
I have searched.
I'm about to create an AI bot that does my dating.
Okay.
I really do have to run to go get the kids, but Steve,
I will come back.
Okay, when you come back, I will. But just, look: when you look at their dating profile and they say they have kids, see if they have pictures of themselves with their kids in their profile.

(01:10:21):
It's like using a picture of a puppy.
I know, but it's better than just dating some guy.
I mean, you could get lucky and find some guy who's never had kids.
We have to, we have to. Okay. Okay. Okay. Look, Steve, I will be back in 10 minutes.
Thank you, Steve, so much for coming on.
Thank you everyone for listening and Steve, where can people find you online?
I'm thesteve0 on Bluesky.

(01:10:43):
You can find me as thesteve0 on GitHub.
I am in multiple different Slack communities.
I've worked in Kubernetes before, so I'm in the Kubernetes Slack.
I'm in the geospatial Slack.
I'm in the Voxel 51 Discord. Where else? They can find me on Voxel 51.
We did put you in the Bluesky starter pack for Fork Around and Find Out guests, so if you're looking for him, people can check it out there.

(01:11:06):
So
I was also going to say, you can find me on YouTube too, but I don't know if I'm thesteve0 on YouTube. Um, and some of the stuff we talked about today, I actually have talks on YouTube, you know.
I linked to the recorded talk.
I will put that in the show notes too.
Once we're done, I'll send it to you so you can have it in the show notes.
Okay.
Thank you everyone for listening and we will talk to you again soon.

(01:11:28):
Okay.
Bye.
Thanks.
Thank you for listening to this episode of Fork Around and Find Out.
If you like this show, please consider sharing it with

(01:11:49):
a friend, a coworker, a family member, or even an enemy.
However, we get the word out about this show
helps it to become sustainable for the longterm.
If you want to sponsor this show, please go to fafo.fm/sponsor and reach out to us there about what you're interested in sponsoring and how we can help.
We hope your systems stay available and your pagers stay quiet.

(01:12:12):
We'll see you again next time.