Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
One of the biggest difficulties with unstructured data right now is that we have great tools for
(00:07):
search and picking through, but that assumes you already know what you're looking for.
And a lot of the time you don't know what you're looking for, especially if it's a brand new dataset
and you have no idea what's in it. So it's the tools that allow you to step back and see the
big picture of the whole dataset or bring it into focus in a different way that's bringing in
(00:31):
that sort of coherent whole information rather than you know looking at the data through a straw
of just search and return the most likely results to your search term that only answers the questions
you already knew to ask. How did the best machine learning practitioners get involved in the field?
(00:52):
What challenges have they faced? What has helped them flourish? Let's ask them. Welcome to Learning
from Machine Learning. I'm your host Seth Levine. Hello and welcome to Learning from Machine Learning.
On this episode we have a very special guest, Leland McInnes. He's a researcher, a mathematician at
(01:13):
the Tutte Institute for Mathematics and Computing, a Canadian research institute. He's the maintainer
of many machine learning packages including UMAP, HDBSCAN, PyNNDescent, and DataMapPlot,
which together form an ecosystem for unsupervised learning and have transformed the work that I'm
doing. Leland, it is such a pleasure to have you here. Welcome to the show. Thanks, it's great to
(01:37):
be here. So let's start off with this. What initially attracted you to the world of mathematics,
computing and research? Well, to be honest, mathematics was something I grew up with. My
father is a math professor or was a math professor at University of Canterbury in New Zealand where
I grew up. By the time I was heading off to university, I wasn't sure if I was going to do
(02:02):
math or not, but I ended up getting the opportunity to do math and physics, direct entry to second
year. So I jumped in at that opportunity and it turned out I liked the math more than the physics
and I had to quickly choose one or the other. Otherwise, my course load got overloaded. So
ended up in math, stayed in pure math for a while there, but I actually took a break
(02:28):
from academia. I went and worked in industry and with government for a couple of years before heading
back to do a PhD. So I have a little bit of experience with actually working with real data,
as opposed to pure theoretical math, but then I just stayed with math for a long time after that.
But working at the Tutte Institute, we have a bunch of different people working on
(02:51):
slightly different problems and I ended up leaning more into the data science machine
learning questions despite my, to be honest, pure math background. Very cool. So what do you think
having that, like, pure math theoretical background, how do you think that
influenced your approach to some of these problems in the data science and machine learning world?
(03:12):
I guess I'm coming at them from a different perspective. So my background was in topology
and algebra. So I tend to think of things in terms of algebraic structures and the geometry of things.
So I'm always thinking in terms of the geometry of the data. I don't know if that's super novel,
but it's, I think a different take than I've encountered often chatting with people.
(03:36):
So I guess it gives you a unique view on how to think about data, how to explore data. And I guess
if you think about the geometry of it, when we're doing something like unsupervised learning,
you're trying to figure out the underlying structure of the data. So there's definitely
something that goes hand in hand there. Do you want to talk about how you approach the
unsupervised learning problem? Like when you get a new data set, what are you thinking about?
(04:00):
So unsupervised learning, that's definitely where I live precisely for that question. It's like,
well, what's the geometry? What's going on in this data set? Those are the questions that interest
me personally, just trying to explore something new rather than the more supervised learning
approach where, you know, I've got a whole bunch of answers and I just want to reproduce them. I'm
(04:20):
sure that's interesting, but it's less my field. I just want to understand data sets in general,
and I don't think there are easy ways to do it for the vast majority of data that's out there now.
So there's a lot of great exploratory data analysis methods for tabular data. You know,
if you've got a database or a spreadsheet with nicely formatted rows and columns,
(04:45):
there's standard statistical approaches, faceted plotting. This is a well understood problem.
But all the other data out there, which is to say the vast majority of data we collect, record,
and store is not that. It's free form text, it's videos, it's audio files, it's system logs on
(05:06):
computers, it's all kinds of things. And how do you explore that kind of data set? How do you
understand what's going on in it? That's an interesting challenge. So that's kind of the
sorts of problems I want to try and solve. Yeah, and the amount of unstructured data is just
(05:26):
continuing to increase, you know, obviously exponentially. So yeah, there are different
techniques and tools that you can use to take unstructured data and try to bring some structure
to it. So I guess, like, each individual piece of data, and then also trying to understand
the interrelationships between those data points as well, and how you would group data,
(05:48):
right, how you partition data, how you could explore the data, you could visualize the data,
and that's really where your libraries come in. I don't even know the best place you would
want to start, you know, in terms of clustering, dimensionality reduction, features,
what's the usual kind of pipeline, right, you have certain features for your data,
(06:09):
then usually there's too many features, you definitely can't visualize it because there's
going to be more than two, three, maybe four of your dimensions, then you're going to want to have
it in a shape and form where you could do some clustering, then you're going to want to visualize
it, you're going to want to represent that data. So yeah, tell me a little bit about, you know,
the libraries that you have and how they address some of those problems. Sure, so the first step is
(06:32):
just getting the data into useful representation. And so that's turning this messy unstructured data
into ideally something nice and mathematical and vectors seem to be the world we live in now.
So you need a way to vectorize the data. Now, there's a lot of neural embedding techniques
(06:52):
and they work super well. So I don't have too many solutions on that front. But if you have other
kinds of data, we have a library called vectorizers that tries to deal with some of the ways of getting
other kinds of unstructured data into vector formats. So whether you're using fancy neural
embedding techniques or something else, if you can get it to a vector, now you have something you
(07:17):
can work with. Because now all the data lives in a space where hopefully distance between data points
is a meaningful thing. So that means you have geometry of some kind and you want to explore that.
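As a small sketch of that first step, here is one simple, non-neural way to turn text into vectors; TF-IDF and the toy documents here are illustrative choices on my part, not something recommended in the conversation, and modern neural embedding models would typically replace this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stock prices fell sharply today",
    "markets rallied after the announcement",
]

# Turn free-form text into vectors so that distances between
# documents become meaningful quantities we can explore.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)
print(X.shape)
```

Once every document is a vector, distance between documents is defined, and the geometric tools discussed here apply.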
So there's a few things you can do. You can try and pull out dense regions, groups, clusters,
and you can try and visualize that data. Now, usually if you're using any modern vectorizing
(07:42):
technique, it's going to give you very high dimensional data. It'll be anywhere from a few
hundred to several thousand dimensions. That's really hard to work with. For starters, most
clustering algorithms aren't actually built around that kind of data set. Most clustering
algorithms, a lot of the assumptions that are actually quietly baked in behind the scenes,
(08:04):
assume that the data is pretty low dimensional, like anywhere from two to maybe 50, really.
So your first problem is you have to manage to get the data to a state where you can cluster it
reasonably and then hopefully one of those clustering algorithms that are out there will
do the job. So I started out by building clustering algorithms because that's what I was interested
(08:30):
in and then pivoted to how do I get the data to a state where I can cluster it. So I worked in
dimension reduction after clustering and that led to visualizable representations of the data, but
then you actually have to make that work for people. And so DataMapPlot is my latest attempt
(08:54):
there. So let's see what are some of the libraries that go on there. HDBSCAN for clustering. So
that's density based clustering, but we want to be able to handle variable densities and do it
as quickly as possible. UMAP is a method for dimension reduction and visualization. That'll
(09:15):
get you either down to a clusterable number of dimensions, maybe five or ten, or down to two
or three dimensions so you can visualize the data. There's also DataMapPlot, and the goal of that
one is: you've produced a two-dimensional representation of your data via UMAP, t-SNE,
it doesn't really matter. You can hand it to this program and it will make you nice visually
(09:41):
appealing static plots or interactive plots that you can explore and play with from there.
So those are a few of the libraries. Cool. So we'll start with HDBSCAN. So you started working on
that over a decade ago at this point, right? Yeah, yeah about that. What were you using for
(10:03):
dimensionality reduction at the time? Different ones? I wasn't; I was working with lower dimensional
data. So once the dimensions started going up, then you realized, okay, you had to deal with
that? Yeah, yeah. Then I had to pivot. Well, yeah, as you get to other data types, you
suddenly realize, oh, this thing I was using is not working anymore. What's not working about it?
(10:26):
So yeah. Right. So with your implementation of HDBSCAN, I mean, there's so much to it, it's such a
fabulous algorithm in and of itself, this extension of DBSCAN where you can kind of control some more
of these things. And yours is fast also. How did you get it? I mean, in the most, I guess, lay terms,
(10:49):
how did you get it to become a much faster algorithm? Well, I mean, the algorithm was written by some
other people who published great papers on it, but it was a slower algorithm. So what it came down to
was I really read their paper and then wrote a very basic implementation that was quite slow,
and was like, this does a better job of clustering than anything else. And then I just read some
(11:10):
other papers, other random papers, and was like, oh, well, if I take this paper and this paper
and just push them together, then it'll go faster. And so it was really about the question
of nearest neighbor search, because that was fundamental to what was going on internally
inside HDB scan. So how do you make that faster? And there are some algorithms to do that. And it
(11:37):
turns out that you can adapt them to work within HDB scan and make the whole thing flow together
consistently. All the bits and pieces are there. They're slightly more messy and complicated than
one would like. They're not the easiest of algorithms. So gluing them together took a little
work. But really, it's just gluing together work by other people as far as I'm concerned. Yeah,
(12:02):
that's sometimes when the breakthroughs happen, when you take the
best pieces of them, or you can apply something from some other type of technique to the work
that you're doing. What was the speed up? It was, like, n squared to n log n or
something? n log n, yeah. It had to look up everything for every point before, and then using
(12:24):
minimum spanning trees or yeah. So the problem is that it wanted to compare every pair of points
that gives you your n squared. But you can use space trees, space partitioning trees to cut
that down to n log n. Although that doesn't work in high dimensions, it turns out because space trees
(12:46):
don't work very well in high dimensions. But that's okay because you just need to get down
low enough dimensions to make it work first. Yeah, so I guess that then brings us to UMAP.
It was really a breakthrough, a major breakthrough in dimensionality reduction, being
able to capture both, you know, local and global structure of the data. But yeah, so you had
(13:10):
this problem where now you're dealing with data that had I guess a higher number of features and
you had to reduce the data. So yeah, tell me about your thought process in creating something like
UMAP. Well, I mean, I think the real breakthrough actually was t-SNE, which came out in 2008.
That's Hinton. Yeah, Laurens van der Maaten and Hinton. And I think that was a real breakthrough
(13:40):
because it really demonstrated the possibility of these kinds of algorithms in general.
Well, nothing else in dimension reduction or manifold learning up until then had really been
as effective, especially for like visualization. So getting down to really small numbers of
dimensions and having still good representations that capture a bunch of useful information.
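For context, a minimal t-SNE visualization in the spirit described here might look like the following; the dataset and parameter choices are my own illustrative picks, using scikit-learn's implementation:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 8x8 digit images: 64 dimensions reduced to 2 for visualization.
X, y = load_digits(return_X_y=True)
coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X[:500])
print(coords.shape)
```

Plotting `coords` colored by `y` is the classic demonstration of how well these neighbor-embedding methods separate structure in just two dimensions.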
(14:03):
So I think that's where it started. So I was very impressed with how effective that approach was.
And so I just wanted to build a method that would work with the math and theory that I was
familiar with. So for me, that was very much sort of, as I said, geometry of the data, algebraic
(14:25):
topology was what I knew. So I was trying to build a theoretical basis for an algorithm out of that.
Then it's just a matter of slotting various pieces together. Again, like I was a bit of a magpie and
grabbed different papers from all over the place. So there was a paper by David Spivak on fuzzy
simplicial sets, which I read. I actually read it because I was interested in it for
(14:50):
HDBSCAN. And it had a lot of application there in some ways, because again, HDBSCAN, you can view
from an algebraic topology lens or topological data analysis lens. But it really just opened my
eyes to the possibilities of what one could do and how one could interpret things. And so just
grabbing random bits and pieces of interesting math and algorithms and gluing them all together.
(15:17):
Very cool. I've used, well, yeah, I mean, all of your libraries a lot. But UMAP in particular,
you really see how using different parameters, you can get very wildly different results.
Sometimes it's hard to know and evaluate: how do you know if you're reducing dimensions
(15:41):
correctly? Do you have any sense of that? I have some sense. But actually, this is a challenging
problem. And I think the answer is that you don't, because it depends on what you're trying to do
with it. What is it that you're trying to represent in low dimensions? I think this is a problem also
for clustering, which I find is a bit of an issue: people expect the clustering algorithm to produce
(16:07):
like the true clusters, but there aren't any singular true clusters. It depends on what kind of
things you want to get out. And so expecting there to be some magical true answer that these
algorithms are approximating, or asking how well they compare to the true clusters, is, like,
there isn't a single right answer. And so that's the disappointing thing with unsupervised learning.
(16:34):
Supervised learning, you've got all these metrics that you can measure how well you're doing.
Unsupervised learning, it's kind of just like stare at it and go like, well, this is doing
what I need. That's good enough. Which is unsatisfying in some ways. But at the same time, if it's doing
what you want, isn't that good enough? Yes. Yeah, it depends so much on what the use case is.
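In that spirit, internal measures can still serve as rough sanity checks, as long as you remember they reward one particular notion of cluster quality rather than any "true" answer. A small sketch; k-means, the blob data, and the candidate cluster counts are illustrative stand-ins I chose:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Silhouette rewards tight, well-separated clusters; it is a heuristic
# sanity check, not a measure of agreement with any "true" clustering.
scores = {}
for k in (2, 4, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

Comparing such scores across settings can guide exploration, but the final judgment remains "is this doing what I need?"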
(17:01):
Clustering is such an art. I think of it as equal art to science. How many clusters should there be?
Just understanding what the structure of them should be, you know, having them in a hierarchical
nature, understanding what the groups should be, understanding what points should belong to each
group, you know, all of those things, depending on what the use case is going to be, what are you
(17:27):
going to use it for? Is this to set you up for some supervised learning later? Is this just some
exploratory data analysis? Is it just to figure out, you know, some idea of how you could group or
think about your data set. Something that you said that I really liked is that while it might not
give you the answers, it helps you ask better questions of your data, a lot of these tools.
(17:50):
Yeah, yeah, and that's very much what I'm interested in. I think one of the biggest
difficulties with unstructured data right now is that we have great tools for search and picking
through, but that assumes you already know what you're looking for. And a lot of the time you
(18:13):
don't know what you're looking for, especially if it's a brand new data set and you have no idea
what's in it. So it's the tools that allow you to step back and see the big picture of the whole
data set or bring it into focus in a different way that's bringing in that sort of coherent whole
information rather than, you know, looking at the data through a straw of just search and return the
(18:38):
most likely results to your search term that only answers the questions you already knew to ask.
Right. So going back into UMAP, you were inspired by t-SNE, so why was there a need for
another dimensionality reduction algorithm? It's a good question. I mean, t-SNE was at the time
(18:58):
on the slower side and it had a tendency to kind of smush all the clusters kind of together. It
separated them into clumps, but all the clumps were just wherever they landed. And I was interested
in something that could get a little bit more of that non-local structure, some representation of
(19:19):
how the clumps relate to each other, and also just a lot faster. And with a theory basis that I
understood because I wanted to extend it in various ways. So one of them was semi-supervised
and supervised versions of dimension reduction, but it's provided a framework for me that I can
(19:40):
hang a lot of different adjustments off it, which is why, I don't know if you've looked at the input
hyperparameter set for UMAP, but it's kind of excessively large. It's because I just kept
seeing good ideas and you know, I'm like, oh, I can add an option for that.
(20:00):
Yeah. Well, we could talk about a couple of them. I mean, the ones that I think
are my go-tos are obviously number of neighbors, min distance, number of components,
depending on the visualization. Those are the three I play with the most, but I've dabbled in some
other ones, any other good ones that you want to talk about?
(20:23):
Yeah. So one actually is output metric. That's kind of fun. So there's a metric parameter where
you determine how you're measuring distance between data points in the input space, but you can also
determine how you want to measure distance between data points in the output space, in the embedding space.
(20:44):
So if you know that your data has periodic structure, it loops around, you can embed onto
a torus, not the plane, or onto a sphere. Or one of the interesting options is you can
embed not as points, but as Gaussians with a covariance structure and measure distance between
(21:06):
Gaussians. So then you get an embedding where points have some uncertainty about where they're
going to land. Very interesting. Yeah. It's sort of going back to like what your use case is for
what you're doing. If you understand what the use case is, then you can incorporate that into
the parameter. Very cool. In terms of UMAP and creating a library like this, so you were kind
(21:30):
of creating it to scratch your itch, to solve your problem. But so many of the visualizations
of data are using UMAP now. What are some of the most unexpected uses of UMAP that you've seen?
So in 2020, I was very surprised to see it coming up in COVID research repeatedly, which
(21:58):
as a mathematician, that was not something that I felt I would ever be able to contribute to.
And yet at the same time, at the start of the pandemic, when everything was panic stations,
it was really interesting to see that I had managed to do something that helped some people
(22:19):
somehow in solving this problem. That was inspiring. It gets used in art, actually, a bunch. The pictures
behind me are by an artist, Refik Anadol, who uses UMAP among many other machine learning tools to
help develop art. There are a bunch of other artists I've been in touch with who also make use of it
(22:40):
in various ways. And like that's not a use case that I ever had in mind. Right. Yeah. Well, the
outputs that you can get are quite beautiful. The interesting thing from my perspective is,
so I was doing topic modeling, I guess, before UMAP; I was doing it when LDA,
(23:04):
LSA, and NMF, you know, were the state of topic modeling at the time. And now I've seen how your
libraries have transformed this whole topic modeling space. A highly used library like
BERTopic: the default for dimensionality reduction is a library that you're the maintainer
(23:25):
and, you know, one of the creators of, UMAP. And the default clustering algorithm is HDBSCAN,
same thing for you. So I have, like, witnessed your work transforming the work that I do. Every
data set that I take, the first thing I do is run it through some pipeline that
involves your libraries. And now as of late, I've been using DataMapPlot, which is such a cool
(23:51):
visualization library. There's so many cool things that are happening. Because
you never know exactly what you're going to get, right? When you get an output,
you could have 40 clusters, you could have 400 clusters, you know,
depending on what your data set is. And the nice thing is, you've created something that by
(24:13):
default tries to give you a really nice output that you can just kind of take and
read. Sometimes you have to do some minor tweaks, but a lot of the things are done in the
back end for you, so you don't need to worry about, like, a lot of the spacing and, yeah, a lot of that
stuff. What was the motivation there? You just wanted a way to visualize some of
(24:35):
these outputs? I think the real motivation was for some internal use cases. I've seen people
using this and they were giving presentations at the end of a workshop or something like that.
And they'd put up some nice UMAP plots. And they don't have a lot of time. It's a short workshop.
(24:55):
And then they want to give a presentation on the work they did. So they just plot however they can.
And you know, I would look at it and be like, ah, if I had the time I could get in there and I
could, you know, tweak a bunch of things about that plot and make it look a whole lot better.
And I saw that happen often enough that I realized it's not fair that, like, I have the time
(25:17):
to sit down and tweak all the parameters on the plot. But these other people they don't have that
experience of having done a lot of these kinds of plots and they don't have the time to fiddle
with Matplotlib or something like that for, you know, a couple days to make the plot look just
so. I should take all the stuff that I've learned and just try and shove it into a library that
(25:43):
allows them to get a pretty looking plot that does all the things that I would want to tweak for them
out of the box, and then hopefully give them enough knobs that
they can make it look like what they want at the end of the day as well.
Yeah, it's such a great tool. I mean, in so much of the data science work that we do,
(26:04):
like, you know, there's the technical, the heavy aspects, there's the coding, there's the
understanding of the data, there's the data engineering. But almost equally, you have to be
able to share your results, right? And you have to be able to kind of let other people in from all
different levels, executive level, different technical level, anyone, and those types of
visualizations. That's what I get most excited about is that when you show someone like that,
(26:28):
anyone can kind of relate and anyone can quickly understand. So just yeah, like the power of
visualizations. If you wanted to, like, dial in on that: a visualization, and the ability to kind
of share your work and get other people involved in it, how does that kind of help you with your work?
Seeing your data is just such a wonderful experience, because we're visual creatures. Our vision
(26:50):
is the primary means of getting input into our brains. If you can turn data that's this
giant mess into something that somebody can see, that's an enlightening experience. And I've had,
you know, plenty of cases where I've worked with, you know, a domain expert on a data set that they care
(27:11):
about that I realize I know nothing about. But we can work through, get it to a plot, stick it in
front of them. And their first question is almost always, well, what's that? And why is it
there? Because there's something new in the data that they didn't realize was there. And then we
have to dig in to those questions. And so that's where I guess also the interactive plotting
(27:34):
really starts to provide a lot more value, because just seeing the plot is good.
But now I want to answer the questions. So how do I do that bit next?
Yeah, that's a good point that you brought up because like, at first, I was just thinking about,
you know, like, oh, you're exploring the data, but why are you exploring the data? Right? Like,
(27:56):
yes, you want to figure out good groups. But yeah, another thing that I always find
whenever I create that output: you always figure out where those outliers are, like those weird
artifacts of the data that somehow got in there. And then you can kind of figure out a strategy
of how to deal with that. And then what I'll do is I'll kind of just do a
loop of that, right? Like, I use clustering to figure out what data points maybe I should pull out.
(28:21):
And then I find UMAP works better, then I find, like, HDBSCAN works better. And, like,
I just kind of keep doing this iterative process, which ends up giving
you some really nice results. It's amazing how outliers can affect,
like, everything. But that's why HDBSCAN handles noise. That's one of the things that makes
(28:42):
that a very robust algorithm. Yeah, I'm just happy that we're
talking, because I remember I posted something where I used one of your visualizations
in one of my projects. And I was looking forward to chatting with you. And then we bumped into
each other in New York, when you were on that panel with Hugging Face. Because, yeah, the work that
(29:04):
you're doing, it makes the work for so many other people so much easier. So much easier.
So many of the things are abstracted away, where you can just kind of focus on your data sets,
focus on understanding your data better, focus on asking better questions of your data. And I think
(29:25):
it's probably because, going in writing it, you weren't allowing it to be a black box,
right? In one of the times that you spoke, you mentioned something about, like,
decomposing these black boxes, and then that kind of sets you up and you can understand
how these things interact with each other. Can you talk to that? Yeah, I mean, so this is
(29:49):
a thing that I definitely see. And so you mentioned BERTopic, and Maarten Grootendorst, who wrote that,
has done a fabulous job with it of exposing the innards as Lego bricks that you can swap in. So
yeah, the defaults are UMAP and HDBSCAN, but you can use an SVD and then k-means if you want,
(30:14):
or you can swap all the different components out. And so seeing it as this
composable piece of Lego bricks, I think is really valuable. But the same applies to even the innards
of the other algorithms themselves. So personally, I see HDBSCAN as a pile of Lego bricks. There are
(30:35):
a bunch of different things. So there's a density estimation step, there's a connectivity
related step, there's a building-a-tree-of-clusters step, and then there's a cluster extraction step.
And these are all like, there are different algorithms for doing each of those things.
HDBSCAN packages together a default set, but you can easily swap out each of those parts with
(30:57):
something new. It's just a pile of Lego bricks. And the same with UMAP: there's constructing a
representation of the high dimensional data in some sort of graph based way. If you want to think
in terms of algebraic topology, it's a simplicial set. But there's lots of different ways
of doing that. You could swap out that component. There's how you're going to optimize a low
(31:21):
dimensional representation. Again, this is just components and pieces. If you look at something
like LDA, which you mentioned as earlier topic modeling, one way of looking at that is,
you know, this plate model in the probabilistic view. But you can also just view it as a matrix
factorization algorithm and decompose it into the parts of how matrix factorization works,
(31:45):
and give yourself, if not an understanding of everything there is to know about it,
an understanding of the pieces and how they fit together in a way that would allow you to adapt it
to different problems if you wanted to do that. So, you know, it works with categorical data,
because you're looking at a distribution of words and the prior for that is a Dirichlet
(32:08):
distribution. But if you had count data that was much more Poisson distributed, well, then you need
a prior for the Poisson, there's gamma distributions for that, you could build an LDA like algorithm
pretty easily that would work for an entirely different data type, if that's what you want to do,
if you have broken it down into the pieces like that. I think more time spent understanding
(32:33):
what the actual pieces of these algorithms are is pretty valuable if you ever want to adapt them
or use them for something else. Right. That makes a lot of sense. Yeah, if you don't just say, oh,
this is a black box, it's abstracted away, the only thing I have control over is just, you know,
these parameters, well, you'll never really be able to fully grasp what's taking place. You won't
(32:55):
be able to really build on top of the work in a way. So you have to decompose
those things. That's probably my guess on why you have PyNNDescent, I would say. Yep. So, I mean,
you talked about taking pieces to make tasks easier for other people. PyNNDescent is an
example of me taking a piece that makes work easier for me. Because nearest neighbor search,
(33:23):
I mentioned HDBSCAN needs to make use of some of those sorts of things. This also comes up a
lot in UMAP, it comes up a lot in a lot of other places. I needed a thing that would do that in
the ways that I needed it done. So, you know, again, I'm borrowing algorithms from other people. There
was a great paper on an algorithm called nearest neighbor descent that builds approximate k nearest
(33:49):
neighbor graphs very efficiently. Again, that decomposes into chunks, and you
could pull apart the pieces, swap in some other pieces, change a few other pieces. And, you
know, that's what I did with PyNNDescent to get the implementation that I have that solves
the problems I have. So, I wanted things like work with a lot of different metrics, a lot of
(34:12):
approximate nearest neighbor search will do Euclidean and cosine, and that's it. I wanted to be able
to do anything. And I needed to be able to work with sparse data structures as well. So,
again, it does that out of the box. These are, you know, problems that I needed solved, and it was
best to just package it up. And now I get to reuse it all the time. So, I have a new clustering
(34:39):
library called EVoC, for Embedding Vector Oriented Clustering. And it steals a whole bunch of stuff
from PyNNDescent, because it saves a lot of trouble to just reuse all of that work, packaging
things up so they can be reused, but at the same time being able to decompose them back again into
the parts that you need. So yeah, you've basically created a whole ecosystem of all of these parts
(35:02):
that you can fit together to approach many different problems, but specifically unsupervised
learning, in a very powerful way. In terms of the future of some of this work, maybe one
of the challenges in unsupervised learning, or topic modeling in particular, is something
like trying to find new topics over time, right, trying to incorporate the temporal feature.
(35:28):
Have you thought about that? Have you been thinking about that at all?
Yeah, so temporal topic modeling is definitely something I've been giving a bit of thought to
and how best to handle that. There are algorithms from topological data analysis, an algorithm called
Mapper that's based on Morse theory that actually lends itself to these kinds of problems pretty well.
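For listeners unfamiliar with Mapper, a toy sketch of the basic construction may help. This is a drastically simplified illustration, not the extension discussed here, and all of the function names are hypothetical: cover a 1-D lens (time, say) with overlapping intervals, cluster the points that fall in each interval (here a trivial gap-based clusterer stands in for a real one), and connect clusters that share points.

```python
def cover_intervals(lo, hi, n_intervals, overlap):
    """Overlapping intervals covering [lo, hi]."""
    length = (hi - lo) / n_intervals
    pad = length * overlap
    return [(lo + i * length - pad, lo + (i + 1) * length + pad)
            for i in range(n_intervals)]

def cluster_1d(indices, values, gap):
    """Trivial stand-in clusterer: split a 1-D point set wherever
    consecutive sorted values are more than `gap` apart."""
    order = sorted(indices, key=lambda i: values[i])
    clusters, current = [], [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if values[nxt] - values[prev] > gap:
            clusters.append(current)
            current = []
        current.append(nxt)
    clusters.append(current)
    return clusters

def mapper_graph(lens, values, n_intervals=4, overlap=0.25, gap=1.0):
    """Nodes are clusters within each cover set; edges join nodes
    that share data points."""
    lo, hi = min(lens), max(lens)
    nodes, edges = [], set()
    for a, b in cover_intervals(lo, hi, n_intervals, overlap):
        members = [i for i, t in enumerate(lens) if a <= t <= b]
        if not members:
            continue
        for cl in cluster_1d(members, values, gap):
            nodes.append(frozenset(cl))
    for i, u in enumerate(nodes):
        for j, v in enumerate(nodes[:i]):
            if u & v:
                edges.add((j, i))
    return nodes, edges
```

A real implementation would use an arbitrary lens function and a proper clusterer in the original space; the point of the sketch is just the cover-then-cluster-then-nerve shape of the algorithm, which is what makes it natural for tracking topics across time windows.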
(35:53):
And I actually just recently had the opportunity to work with a co-op student who was visiting the
Institute. He's now off doing his PhD at Waterloo. He worked on a paper on an extension of Mapper that
would be pretty much ideal to solve this kind of problem, specifically for the kinds of things
(36:13):
you get from topic modeling over time. So hopefully that's on arXiv now, and if people
want to go and read a highly theoretical paper, I would recommend checking that out. I don't have
the link on me at the moment, but I can edit it in. These are fun and challenging problems. So
(36:36):
adjusting over time is a big challenge. So again, another thing that would be great would be to have
a UMAP embedding that can evolve over time as new data comes in. Right now, if you rerun UMAP,
it gives you qualitatively pretty similar results, but the optimization
(36:58):
problem is invariant under rotation and flipping. So at the very least, it could flip things around.
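That rotation-and-flip ambiguity is easy to undo after the fact when two runs embed the same points: orthogonal Procrustes alignment finds the best rotation-or-reflection carrying one embedding onto the other. This is a generic linear-algebra trick, not part of UMAP's API, and `align_embeddings` is a hypothetical helper:

```python
import numpy as np

def align_embeddings(X, Y):
    """Map embedding X onto Y up to rotation/reflection (orthogonal Procrustes).
    Rows of X and Y must correspond to the same data points."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # The best orthogonal matrix R minimizing ||Xc @ R - Yc|| comes from the
    # SVD of the cross-covariance matrix.
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    R = U @ Vt
    return Xc @ R + Y.mean(axis=0)
```

This only fixes the global symmetry between two finished runs; it does not address the harder problem of a new cluster genuinely needing new space, which is what the evolving-UMAP work is about.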
And if you just use the transform method as it exists now, that is based on the data that we've
already seen. It's just going to fit new data in as best it can, given this training set. So if a new
cluster of data actually shows up, it's just going to squeeze it in amongst all the other data that's
(37:24):
there. So there is some ongoing work to try and make a version of UMAP that would allow you to
evolve in this way, so it sort of adapts to the new data that comes in. So if you get, let's say,
you're looking at a month of data, then another week of data comes in. Instead of just trying to
(37:45):
force that week into the last month, it would be adapting and changing. So if a new cluster came up,
you could see that new cluster. Yeah, a new cluster would form and the rest of the data would have to
move to fit around it and so on. Yeah, that's very much the goal. And I think that's definitely
possible. So there's some great work being done by some other people on that, which I'm
(38:07):
desperately trying to follow and keep up with and will happily merge into UMAP main as soon as
they get it in the state that they're happy with. Very exciting. Cool. So we talked a lot about
unsupervised learning. I want to zoom out a little bit. And I'm going to ask the question around the
(38:29):
hype of AI and machine learning. There are all of these promises being made, and coming
from somebody who, you know, is creating these algorithms, seeing the things that we can do,
maybe seeing some of the limitations, I'm curious what your view is on the gap between the hype
and the reality that you see. Let's start with the reality. There's a lot of value in a lot of
(38:51):
the stuff out there. It has enabled all sorts of things that are really just incredibly powerful
and useful. But the hype is something else again. Like I am not a fan of the amount of hype around
a lot of these things. I mean, for me personally as a user, a lot of the value in the generative
(39:15):
large language models is just as a natural language interface. Right? Like, I mean, retrieval
augmented generation is the thing. But really, that's old-school information retrieval.
A whole lot of hard work. And then you hand that to the LLM, which does the interfacing to the user:
(39:37):
taking the question in natural language and then taking those information retrieval results
and turning it into a nice natural language answer to the question. Now, there's a lot to be said
for the value of providing a natural language interface to computers for users. So there's
the value. That's a huge value proposition. But that's not what most of the hype about them is about.
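That division of labor can be sketched in a few lines. Everything here is hypothetical and deliberately old-school: plain TF-IDF does the retrieval, and the generative model is reduced to the prompt string it would receive.

```python
import math
from collections import Counter

def tfidf_retrieve(query, docs, k=2):
    """Old-school IR: rank documents by TF-IDF overlap with the query terms."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    def score(toks):
        tf = Counter(toks)
        return sum(tf[w] * math.log(n / df[w])
                   for w in query.lower().split() if w in tf)
    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    return [docs[i] for i in ranked[:k]]

def build_prompt(query, docs):
    """The generative model's job is only the natural-language interface
    on top of whatever retrieval hands it."""
    context = "\n".join(f"- {d}" for d in tfidf_retrieve(query, docs))
    return f"Using only these passages:\n{context}\nAnswer this question: {query}"
```

The retrieval step carries the hard work; the language model never sees anything except the question and the passages the retriever chose, which is the point being made here.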
(40:02):
And at the same time, I think a lot of the like embedding approaches where you vectorize text,
images, video, whatever, and just turn it into vectors, a lot of people seem to view that as a
thing that you can then put in your retrieval augmented generation system. But it's so much
more valuable than that. And there's so much more you can do with that. I think there's untapped
(40:26):
value there still in all the various things. So I mean, topic modeling is one example. But
I mean, you could easily convert that to topic modeling for images. I think BERTopic already
supports a basic version of that. But you could turn that into all kinds of things very easily.
Yeah. One cool injection of generative, or complement of generative, to topic modeling:
(40:52):
one of the toughest parts was that you get your clusters in the end, and you just have to kind
of figure out what the name is and how to define what that topic is and, you know, things
like that. So that has been, for me, a really exciting use case of generative. I mean, maybe
not everyone would find that exciting. But if you know the pain of trying to name all of your
(41:14):
clusters, yes, if you can get anything to help you do it, that for me felt like a very good
and valuable use case for generative: here, take these keywords and take
these example documents, and now give me a good name of three words or less for this
cluster. It allows you to try to understand your data quickly, or get a good sense of your data quickly,
(41:39):
create potential classes. That's something that I found. But yeah, in terms of the hype, the hype
of all of this couldn't be higher. It's supposed to solve all of our problems.
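A sketch of that naming workflow, with distinctive keywords scored in the spirit of the class-based TF-IDF that BERTopic popularized. The function names are hypothetical, and the actual LLM call is left as just the prompt string it would be sent:

```python
import math
from collections import Counter

def cluster_keywords(clusters, top_n=3):
    """Score words per cluster, treating each cluster's documents as one
    big document, in the spirit of class-based TF-IDF."""
    joined = [" ".join(docs).lower().split() for docs in clusters]
    df = Counter(w for toks in joined for w in set(toks))
    n = len(clusters)
    keywords = []
    for toks in joined:
        tf = Counter(toks)
        ranked = sorted(tf, key=lambda w: tf[w] * math.log(1 + n / df[w]),
                        reverse=True)
        keywords.append(ranked[:top_n])
    return keywords

def naming_prompt(keywords, examples):
    """Assemble the prompt a (hypothetical) LLM would be given to name a cluster."""
    return (f"Keywords: {', '.join(keywords)}\n"
            f"Example documents: {'; '.join(examples)}\n"
            "Give a name of three words or fewer for this topic.")
```

The generative model only performs the last, hardest-to-automate step, turning keywords and examples into a short human-readable label; everything before it is classic unsupervised machinery.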
RAG systems in general, you know, the idea of, oh, I'll just vectorize everything and I'll just
find the most relevant documents, like, if you think that, you haven't really used cosine similarity,
(42:01):
because information retrieval is really hard. And people have been
working on it for a long time. And there's a reason why there are so many different
algorithms. There's a reason why there are, like, whole businesses around it. And you have to take
into account so many things, obviously: semantic patterns, lexical patterns, just, like, incorporating
metadata. I found some, you know, you can call them RAG systems, but just being able to retrieve
(42:26):
the necessary documents by kind of using, like, other types of filtering, and that can
give you very, very good results. It is exciting to see what will happen. But I think
that there's a lot of hype and new terminology being used for things that have been worked on for
a long time. Yeah, for a long time. What do you see as a question that you believe remains unanswered
(42:47):
currently in either machine learning or some of the work that you're doing? So I think there's a
whole lot of scope for work to be done still in unsupervised learning. That's hugely biased
because that's the field I work in. But I look at the state of things and I'm like,
this could all be so much better. It's not like I have the answers for how to make it better. But
(43:09):
I can definitely see lots of room and directions for improvement. I mean, even simple things like
we have these sentence embedding models like SBERT and they're fantastic. But do you need the
full power of a giant transformer based neural network to make that work? Because I'm personally,
(43:29):
I'm a fan of what's the simplest possible thing you can do that still does a good enough job.
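In that spirit, the crudest imaginable baseline is just averaging one fixed vector per word. The sketch below uses random vectors as stand-ins for real pretrained word vectors such as GloVe, so it only illustrates the mechanics, not the quality; nothing here is a real library API.

```python
import numpy as np

def simple_sentence_embed(sentences, dim=64, seed=42):
    """Average one fixed vector per word, then L2-normalize.
    Random vectors stand in for real pretrained word vectors."""
    rng = np.random.default_rng(seed)
    table = {}
    def word_vec(w):
        # Each distinct word gets one fixed vector for the whole corpus.
        if w not in table:
            table[w] = rng.normal(size=dim)
        return table[w]
    rows = [np.mean([word_vec(w) for w in s.lower().split()], axis=0)
            for s in sentences]
    out = np.stack(rows)
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```

Even with random word vectors, sentences sharing words end up measurably closer than unrelated ones, which is roughly the bar a "98% as good but way simpler" model would have to clear with real vectors and perhaps a light reweighting scheme.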
And I'd be really interested to know if you can produce sentence embeddings,
like 98% as good as the transformer model with something that's just way simpler and easier
to understand. Because the internal workings of what that pre-trained model has learned,
(43:52):
that's harder to pick apart; that's a black box that's harder to decompose
into pieces. I'd love a more decomposable version of, say, sentence embedding or vectorization or
any of these sorts of things. Yeah. So some of the interpretability and understanding of
(44:18):
those embedding models, there's a lot of very, you know, very interesting work. Tom
Aarsen is doing an incredible job with Sentence Transformers and the ability to create custom
embeddings and things like that. That's something that I'm very excited about. But yeah, like,
I mean, to go into some of the things that we're doing, like, yeah, you create these embeddings,
(44:39):
they're 512 dimensions, 768, 1,024, like, they're huge sometimes. And it's like, well,
do you really need it to be that big? Then you have to go through some sort of dimensionality
reduction. I wonder if there's some combination of embeddings that could feed right into a
clustering algorithm. That's something that I think would be cool. Yeah. Yeah. But now you need
(45:02):
to decompose the parts and glue them together slightly differently. Like, if you have a
specific task, there's definitely things you could do. How does that work? I don't know. But I
think there'd be great answers if you could figure it out. Yeah. Well, there has to be some things
that we don't have the answers to. So there's a reason to continue thinking and working
(45:23):
on all this stuff. There's so many exciting things. All right, to zoom even further out,
I'll ask an advice question. What advice would you give to someone that's just starting out in
the field of either research, data science, or machine learning? So specifically for data science and
machine learning, my advice is don't follow the hype in the same direction that everyone else is
(45:47):
going because you're not going to make a dent in the field that everyone is already working on.
Go do whatever is interesting to you that isn't necessarily the hype thing and be good at something
that you're good at. And that's probably going to be good enough. The time will come when whatever
you're working on will come around. And the other thing is that I think interdisciplinary
(46:13):
spaces are where a lot of value comes. That doesn't mean you need to split yourself between like two
wildly different subjects. But if you can find some time to stretch into subject areas that
people aren't otherwise necessarily working on, that can make a big difference. I mean,
I kept talking about being like a magpie and just grabbing different random bits and pieces to stick
(46:38):
together. And that's because I touched on a few different fields. So you know, a bunch of pure
math, but some machine learning things, algorithmic things, and just being able to have enough of a
stretch to grab things from different fields that other people aren't putting together. That's often
what you need to create something new. Yeah, yeah, I have to agree. And then, just as a small
(47:00):
extension of that, I guess, the interdisciplinary part of it: I always found, like, when I was doing
research, when I was, you know, in my engineering program, you can be very insulated
sometimes around people who are very like-minded and who approach the problems in the same way.
So sometimes being exposed to other people who just look at the world in a different way
(47:24):
or think of the problem in a different way. Are there any people either in your work or in the
fields that you found that have kind of like opened up the way that you think about things?
There are many, many people. Let me see if I can think of a few that just spring to mind. So Matt
Rocklin, who built Dask and runs Coiled, I don't know if you've ever had the opportunity to interact
(47:50):
with Matt or watch him interacting with other people, but he is amazing at going and listening
to people about whatever their problem is. And that is inspiring because that's how you actually
find out what you need to build. Not necessarily sitting down yourself and coming up with
whatever you think is cool. Work out what the problems that your potential users are actually
(48:17):
having actually are, and listen to them. That is something Matt is amazing at.
Who else? Lorena Barba has done amazing work on reproducibility and also just education. So
thinking about how to explain things well, especially with like interactive tooling or
(48:38):
anything like that. She's done a fabulous job about just computational thinking in general.
Vincent Warmerdam is awesome because he always wants to build the simplest thing that still
works. And that is very much something that I definitely believe in. And he's just so great at
always putting those together and explaining it all so well. Absolutely. And then for something
(49:02):
completely different, Emily Riehl, who's a category theorist, she does the most amazing,
incredibly deep, complicated pure math work and still writes, like, great textbooks. Her
introduction to category theory is one that I would recommend for anyone; Category Theory in
Context is just a great introduction to the subject and it's just so approachable. And she's
(49:28):
just amazing in general. So there's four people off the top of my head. Okay, that's great. I'm
going to look. Well, I know Vincent and I'll look into the other ones. All right, I think we're
ready for the final or almost final question. So, well, yeah, you describe yourself, I think,
as a mathematician turned data scientist, but I'll phrase the question:
(49:52):
what has a career in research taught you about life? Well, one of the things I've learned
talking to domain experts on their data is that there's a whole lot that I know almost nothing
about. In fact, almost everything, I know almost nothing about. And that's always worth keeping
(50:12):
in mind is how many other different things there are out there and how little you know.
So, research also taught me I definitely don't have all the answers and that's okay. You've got
to learn to live with that. You don't have all the answers, but maybe you can get there and
(50:36):
that you just need to be patient with problems. Sometimes these things just need to sit in the
back of your brain for a long time and then eventually, I don't know what subconscious
process works back there, but eventually answers do come out. So just be patient.
I like that. Yeah. I think that research has always led me not to answers, but to just more
(51:03):
questions. And then I can definitely relate to the second one: sometimes problems,
not that they solve themselves, but, like, once you stop thinking about what the solution is
going to be, or you do something else, go exercise, or, you know, the shower ideas where you're
not thinking about anything else, that's when you get some of your best ideas. That's great.
(51:26):
That's really good advice. It's definitely how you could apply some of the thinking of research to
dealing with the uncertainty of life and yeah, how little we know. Like how little we know,
the only thing I know is that I know nothing. Leland, this has been such a pleasure.
I've really enjoyed talking about all of these things. Thanks for going through
(51:50):
topic modeling, all of your amazing libraries. If there are listeners out there that want to
learn more about you or any of the work that you're doing, where would you direct them?
Probably GitHub, I guess, has most of the projects. I try and make everything open
source as much as possible. So lmcinnes on GitHub, but also the Tutte Institute
(52:17):
GitHub page, you can find a bunch of our projects there as well. You can also find me,
I think, on Twitter and Bluesky; search for my name. I guess I have a sufficiently novel name
that hopefully you'll find me. And my email address is out there, so you can always reach out there
(52:38):
or on LinkedIn or whatever if you want to get in touch. Very cool. Yeah, and for anyone that is
not familiar with these libraries, you should definitely check them out: UMAP, HDBSCAN,
and DataMapPlot. And I'm sure soon enough there'll be some more exciting ones or extensions to these.
Leland, thank you so much for the work that you do. Thank you so much for the time
(53:01):
and letting me pick your brain for a little while. Thank you. Appreciate it.
On this episode of Learning from Machine Learning, I had the privilege of speaking with
Leland McInnes, the creator of a suite of data science tools, including UMAP, HDBSCAN,
(53:22):
and DataMapPlot. His work has significantly impacted the field, particularly in the realm
of unsupervised learning. What sets Leland apart is his unique approach to data science problems,
deeply rooted in his background in pure mathematics, particularly algebraic topology.
He views data through a geometric lens, seeking to understand its underlying structure. This led
(53:46):
him to develop UMAP, an essential dimensionality reduction technique that excels at capturing
both local and global structure of data sets. We also discussed HDBSCAN, a robust clustering
algorithm known for its ability to handle noise and variable densities within data sets,
making it highly effective for real-world applications. Beyond his technical contributions,
(54:10):
Leland shared valuable insights for aspiring data scientists and researchers. He stressed the
importance of not just blindly following hype, but instead pursuing passions. He discussed the
importance of embracing interdisciplinary thinking, and how many of his breakthroughs came from
connections between seemingly disparate areas of study. Leland highlighted the importance of
(54:32):
decomposing black boxes, encouraging a deeper understanding of how algorithms work, rather
than treating them as impenetrable mysteries. By breaking down complex algorithms into their
fundamental components, data scientists gain knowledge and flexibility to adapt them to new
problems and data types. This approach promotes transparency and empowers data scientists to
(54:55):
be more than just algorithm users. They become algorithm creators and innovators. Leland's journey
underscores the power of curiosity, the importance of interdisciplinary thinking,
and the value of understanding the tools we use. By embracing these principles, we can unlock the
true potential of data science and continue to push the boundaries of what's possible.
(55:17):
Thank you for listening, and be sure to subscribe and share with a friend or
colleague. Until next time, keep on learning.