The Common Crawl

Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
Brought to you by Toyota. Let's go places. Welcome to
Forward Thinking. Hey there, and welcome to Forward Thinking, the
podcast that looks at the future and says, Kitty McGee's
in Dublin Town upon the Crawl. I'm Jonathan Strickland and

(00:21):
I'm Joe McCormick, and today we're gonna be talking about
the crawl. We are talking about the crawl, not a
pub crawl. No, sadly, not a pub crawl, which is
what I was referring to in the lyric. But that's
not what we're talking about today. The crawl. What is that?
Is that the name of a movie that was like
a like a fantasy movie from the eighties, or it

(00:41):
sounds like a... I'm thinking Krull. Oh yeah, that's a
science fiction fantasy film, a phenomenal one, I might
add. Phenomenal science fiction fantasy film. Okay, so why would
we be talking about a crawl that's not a pub
crawl and not a sci-fi fantasy movie? And it's
not the future of babies crawling, right? No, it

(01:02):
has something to do with the Internet. Yes, it has
everything to do with the Internet, the Web in particular. Actually, uh,
and here's a funny little little tidbit of information that
you probably already knew, but you might be a little
bit fuzzy on. Wait, what's the difference between the Web
and the Internet. Because when I say the Internet, most
of the time, what I'm talking about is the place

(01:22):
where people leave comments and argue about things, which would
be the Web mostly. Right. So Internet is the network
of networks of computers. Right. So You've got all these
different computer networks that then connect to a larger backbone
that allow all these various networks to interact and communicate
with one another. That is the Internet. The World Wide Web

(01:44):
is one thing that sits on top of this network
of networks, other things being email and FTP servers and
other stuff that uses the Internet as its method of
transmitting data to and from different computers. But the World
Wide Web is often what we think of with the
Internet because it is a very forward facing part of

(02:06):
the Web, or the Internet rather, Right. One way to
think about the Web is that it's a gigantic collection
of interactive documents. Yeah, exactly. Yeah. Some of those documents
are very static and they don't change frequently or at all.
Some of them are more like programs. Yeah, yeah, some
of them are more like like white boards, where you

(02:27):
know stuff is being put up and taken down and
put up and taken down constantly. So some of them
link to lots of other documents, some do not. Yep,
so some are applications. Right, So you've you've got this
massive number of documents. And when we say massive, uh,
it's hard to put it all into context. First of all,

(02:47):
if you talk about all the information that we have created,
not us, but humanity, humanity itself, you have the three
of us have done our share, but no humanity overall.
All the information that has been created, well back in
two thousand and twelve, that was estimated to be at
two point eight zettabytes, or two point eight trillion gigabytes,

(03:10):
trillion gigabytes. It's bigger than my hard drive significantly. So yeah,
if your hard drive can hold two point eight zettabytes,
I need to see your gaming rig, sir. I think
I have downloaded two point eight zettabytes of pirated
anime before. I was gonna say I have two point

(03:30):
eight zettabytes of Skyrim mods. But so no, not
all of this data is necessarily available for access on
the web, right, This is just data that we have created.
So let's let's narrow it down and look at the
information that's actually on the web. So the web has
between ten billion and one trillion documents on it. Now

(03:51):
that's a huge range, but it tells you that it's
hard to make an estimate about something that one is
so big and two is rapidly evolving. Right, there are
always things being added to it and deleted from it. Yeah,
you have servers that go offline from the Internet. If
those servers had web pages on them, those, unless they've
been mirrored onto other servers, are no longer accessible. They

(04:14):
have they have left the web. Other people deleting their
MySpace accounts. Why would you do that? Look, I have
so few friends there, but so many awesome bands. Uh yeah,
so seriously, true moment. Does MySpace still exist? Yes, yes,
it's largely How recently did you check? Probably? Probably then

(04:37):
eight months ago. Let's look it up. Yeah, because it's
a it's a music discovery site more than anything else. Now,
Oh yeah, here we go MySpace dot com. Oh oh,
it's it's breaking my browser home. I was about to say,
why did you go to that? You realize that
MySpace is like the home of the auto-play music file, right? No,
we just talked about how that's like my least favorite thing. Well,

(04:58):
let's not. Let's not invoke the auto playing music gods. Yeah. So,
so the reason why we're even talking about how much
information is on the web and how many documents there
are out there is that the web. You can think
of the web as representing the world's largest database of information,
and that information spans every topic imaginable. Yeah, and there's

(05:21):
lots of great stuff out there that might be really
relevant to you, might have answers to questions that you have,
or it might just be very interesting to you. But
a strange question that you may never have considered is
how do I get the stuff that I want to
get from the web? I mean, you know, how you
get it in practical terms. Well, you go, you sit

(05:43):
down at Google and you type in terms or Google yeah,
or you or you maybe have some kind of aggregator,
like a friend on social media or some kind of
things content writer, or perhaps you have received the direct
URL of a website that you wish to visit. Yeah,

(06:03):
might you might have one in particular in mind that
you go to frequently, and so, like, Dinosaur Comics dot
com. Awesome, good example. Yeah, so there are all these
different ways. But let's say that you want to use
the web to do something beyond just visiting a particular
web page. If you know the URL, that's
pretty simple. But what if you're you're just trying to

(06:25):
find something. Yeah, maybe that you don't even know what
that thing is, or you know what that thing is,
but nobody has gathered that and placed it into an
easily digestible piece of information. So, in other words, let's
say that you're looking at some sort of statistical uh
result that you want to know. You want to know
the percentage of people who drove red cars in two

(06:50):
thousand and twelve who ended up getting speeding tickets, and
you know this this sort of thing like, there may
be a web page out there that has that specific
answer on it, but there may not be. However, there
may be the data out there that exists across multiple
web pages and multiple places that could answer that question

(07:11):
for you, but there's no easy way for the average
person to be able to collect and collate all that data,
analyze it and get to a meaningful answer, especially not quickly,
because if you wanted to go through the entire Internet
to try to find that information, it would take you
a minute. Yeah, it would take quite some time. So

(07:32):
what we wanted to do is tough. Sometimes well, again,
depending upon what it is you're looking for, right, because
in some cases you may have very little information and
it may take you some time just to make sure
that the information that you do have is worthy of consideration.
Or you may have the opposite problem. Let's say you
want to look at anything that has to do with

(07:54):
about cats. Good grief, you're gonna have so much information
on the Internet and on the web that relates to
cats that finding the, you know, separating the signal
from the noise would take you a really long time.
So, and uh, this problem already has a solution,
and that is why we are today talking about web crawlers. Yeah.

(08:18):
And web crawlers are something that have been around for
about as long as the web has been around, because
people realized early on that in order to make the
web really user friendly, especially once it grew beyond a
collection of you know, three computers, right, Yeah, three computers
with twelve web pages altogether, Once you get past all

(08:40):
of that and you get to a point where it
really is growing rapidly. You need a way to navigate
through the web and find the stuff you're interested in.
You need an index. Yeah, you have to have that
index because otherwise the only other option you have is
to know the address of a particular web page and
then to just follow whatever links that web page happens
to have, and then once you hit a dead end,

(09:02):
you've got to backtrack. And you know, it's kind of
like a choose your own adventure book, And it's a
choose your own adventure book that isn't even connected
to all the pages that you need. Right, So indexing
is a way of creating a means to find web
pages about any given keyword. Right, And again, this is

(09:23):
a big, big job. You can't expect this to be
something that only humans are doing under human power. It
would take way too long and it would be exhausting.
So there have to be automated ways to index web pages.
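[Editor's illustration, not part of the episode.] The keyword index the hosts describe can be sketched as an inverted index: a lookup table from each word to the pages that contain it. This is a minimal sketch under stated assumptions. The page URLs and text are invented sample data, and the names build_index and search are hypothetical, not taken from any real search engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical sample pages, invented purely for illustration.
PAGES = {
    "http://example.com/a": "Funny cat memes and cat pictures",
    "http://example.com/b": "Baseball scores and news",
    "http://example.com/c": "Cat care tips and news",
}
INDEX = build_index(PAGES)
```

With the index built once, a query such as search(INDEX, "cat news") becomes a couple of set lookups rather than a scan of every page, which is the point the hosts are making about why an index is needed.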
Well yeah, I mean, just consider the ridiculousness of the alternative.

(09:43):
So let's say you are searching for a term and
that term is, I don't know, lobster baseball. Somewhere out there,
there might be a page about lobster baseball, but it
would not be a good way to find it. To say, well,
I'm going to ping every web server in the world
and see if it's offering any public pages that say

(10:07):
lobster baseball on them. Yeah, that would not especially you know,
as the Web grows and gets larger and larger and larger,
that task becomes impossible. It would just it would take
your computer longer than your lifespan to complete the job,
especially considering that, as we mentioned before, the web is
constantly changing, so we would have new web servers joining

(10:28):
while you're still doing this pinging operation, which means you
just have you know, you've added more that you have
to ping before you're done. You never finish. So what's
the solution, Well, web crawlers would be would be the solution, Joe.
Web crawlers and search engines are our favorite things here
at HowStuffWorks. I mean, if it weren't
for them, our jobs would be significantly more difficult. So, uh,

(10:52):
let's say that you've got all right, So to break
it down, we've got web servers that have web pages
on them, right, we have it's a computer somewhere out there.
It's got a public facing document that it will show
you if you ask for right, and your browser is
the way that you ask for it right, So your
browser is your conduit to getting the information that's stored

(11:14):
on other computers that may be on completely different networks, on
another on another part of the world even and the
fact that you have a browser that is what allows
you to have the access to that document that exists
on that other page. But those servers can have really
funky names. Um, the web pages may not have a

(11:34):
title that is is identical to what it is you're
looking for, but the information may be in that page.
Sure, to use my prior example, Dinosaur Comics
dot com used to be known only as qwantz dot com.
Perfect. With the QW, the way that you sometimes spell
words. Yes, exactly, the way words are

(11:55):
never spelled in English. Uh yeah, I I My example
of my notes I wrote is that let's say that
you're looking for funny cat memes. The funniest memes happen
to be on a page that has the title "Things
FDR Definitely Didn't Say." Well, the title of the page
wouldn't tell you that there are cat memes on there.
You would need something to have searched that page to

(12:15):
understand what actually appears on that page, the context within
which it appears, and to be able to serve that
up to you. And that's really where the crawlers come in.
They they build out these indexes of words and where
to find those words on the web like uh, they
use lots of They use well, actually pretty simple software.

(12:36):
They are often referred to as either robots or spiders,
and they're called spiders because they crawl the web. That
is good. Yeah, alright, So here's where we mentioned that
most of these terms. So wait, are we the flies? Good?
Good question? I mean, I think the uh cams, I'm

(12:59):
not so. So here's where we mentioned that a lot
of these terms were all invented around the same time.
And boy, when we when we go with a metaphor,
we just go whole spider. So um so all right,
So spiders typically start by traveling to web servers that
have lots of traffic, the ones that are the most popular,

(13:22):
and they explore the most popular web pages and start
to build up the index of words of those web pages.
Then all the links that are on those popular web pages,
the spiders start to follow those links and index those
pages in turn, and then do the same thing over
and over and over again, so they just you know,
it's it is like a spider web or a crack

(13:44):
in the glass where you see its splintering over and
over while the glass shatters. The same sort of thing.
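[Editor's illustration, not part of the episode.] The link-following behavior described here is essentially a breadth-first traversal: start from a few popular pages, visit each page once, and queue every new link found. The sketch below makes that concrete under loud assumptions. It walks a small in-memory dictionary standing in for the web rather than fetching real pages, and the site names and the crawl function are hypothetical.

```python
from collections import deque


def crawl(link_graph, seeds, max_pages=100):
    """Visit pages breadth-first from the seeds, following links, each page once.

    link_graph maps a page to the list of pages it links to. It stands in
    for actually downloading a page and extracting its links.
    """
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)  # a real spider would index the page here
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical miniature web, invented for illustration.
WEB = {
    "popular.example": ["a.example", "b.example"],
    "a.example": ["c.example", "popular.example"],
    "b.example": ["c.example"],
    "c.example": [],
    "island.example": ["popular.example"],
}
VISITED = crawl(WEB, ["popular.example"])
```

Note that island.example is never visited: nothing links to it, so a spider that only follows links from popular pages cannot discover it.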
It's following all those potential pathways, and they can hold
hundreds of pages open at a time. We're talking like
three hundred pages a second. So yes, more than Google
Chrome will allow me to have opened before my computer

(14:07):
says listen, I give up. Uh. So, depending on the
crawler, the spiders will index these pages based upon
which words appear in the page and where those words
actually appear in that page, like in what context. So
you may remember in the early days of the web,
before web search engines got really sophisticated, that some people

(14:27):
would make a web page and then just litter the
bottom of the page with tons of random words that
were doing really well in search, mostly because they had
ads served on the page of the type that they
got money from per page view. So I might have
a talk Britney Spears, right, Yeah, it would often be
celebrity rumors and gossip that kind of stuff, and just

(14:50):
random recipe. Yeah, yeah, it'd be weird stuff, like totally
some of it would be disturbing to read. You're like, wow,
I can't believe that. To know that this particular
term is a very popular search term is disturbing.
Others would... Yeah. Yeah, I'm more of an AC/DC
kind of guy myself, so I'm with
you there. So anyway, Uh, you know, this was a

(15:12):
way of fooling search engines into into indexing that page
on multiple indexes so that it would appear no matter
what search you put in, your page would pop up.
You, as a, I'm saying, assuming you are
the one who is administering this web page, you have
no ethics, Like, you don't care if people come to
your page and are completely disappointed because it has nothing

(15:34):
to do with the search term they put in there.
You just want to get those sweet sweet clicks. You
just need the page views because you need to pay
the bills, right, So, uh, search engines and spiders got
more sophisticated, so they were able to look for the
placement of words where it fell in the page, whether
or not it appeared more than once within a page,

(15:54):
to understand if a page really was about that particular
search term, or if it was just one of those
things where the word happened to appear once, it may
be a saying or a quote that has very little
to do with the actual substance of the rest of
the page. You know, this would help the search engine
rank the page in search right. Right, So, the final

(16:16):
product of of these spiders doing this indexing is called
a crawl, and it's essentially a lightweight copy of
the World Wide Web that's built to be much more easily
searched than the whole web itself. Uh, and a
crawl usually consists, therefore, of this huge cache of data

(16:37):
about the web, including, like, the text of each page
its spiders encountered, the code that constructed those pages, HTML
or et cetera, uh, and a certain amount of
metadata, um, you know, certainly the page's URL and
maybe the tags. As we discussed, that's not always as
useful as it used to be due to uh scammy stuff.
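[Editor's illustration, not part of the episode.] One way to picture what a crawl stores per page is a small record holding the URL, the raw HTML, the extracted text, and any keyword tags. The sketch below is an assumption-laden illustration, not Common Crawl's actual file format. The CrawlRecord and make_record names are hypothetical, and the text extraction is deliberately naive (it would also keep script and style contents).

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class CrawlRecord:
    """Hypothetical per-page record: metadata, source code, and plain text."""
    url: str
    html: str
    text: str
    tags: list = field(default_factory=list)


class _Extractor(HTMLParser):
    """Collect visible text chunks and the <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "keywords":
            content = attributes.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def make_record(url, html):
    """Build a CrawlRecord from a page's URL and its HTML source."""
    parser = _Extractor()
    parser.feed(html)
    parser.close()
    return CrawlRecord(url=url, html=html, text=" ".join(parser.chunks),
                       tags=parser.keywords)


# Invented sample page for illustration.
SAMPLE_HTML = (
    '<html><head><title>Things</title>'
    '<meta name="keywords" content="cats, memes"></head>'
    '<body><p>Funny cat memes</p></body></html>'
)
RECORD = make_record("http://example.com/things", SAMPLE_HTML)
```

Keeping the raw HTML alongside the extracted text means later analyses can re-parse a page differently without crawling it again.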
But yeah, so so creating a crawl is a huge

(17:00):
project in terms of time and computer equipment and drive
space and spider programming and just sheer Internet bandwidth. Right. So,
for the longest time, this is something that was really
only accessible by big corporations like Google or Microsoft, Yahoo that. Yeah,
we're talking huge companies that have the computer power and

(17:22):
the bandwidth to pull this sort of stuff off on
a on a regular basis. And while those are incredibly
useful for us as consumers, if we are looking for
a specific piece of information that happens to live somewhere
on the web page, if we want to do more
of a big data analysis something where we need to
collate the information across multiple, perhaps hundreds or thousands of

(17:44):
web pages, it's not easy, right. We don't have those
tools for the most part, right right, Because when you
go to Google, you can't access that level of information.
Yeah you can, you can ask, uh, you know what
Hugh Grant was doing last week? Right? Yeah, you can
get the most popular or the highest ranking search results,
which could give you at least some useful information. But again,

(18:07):
if you want to do a wide spread study on
a specific thing, unless someone's already done it, in which
case you may just need to replicate their their study
to make sure that it was correct. Um, you're you're
kind of out of luck. So where can someone turn.
Let's say that it's a researcher who's working on something

(18:30):
and they don't work for one of these big companies.
Where can they turn to leverage the incredible asset that
is the World Wide Web? One Gil Elbaz started up
a nonprofit corporation called the Common Crawl Foundation and it
has been since then working on providing public, publicly accessible,

(18:51):
free crawls to anyone who wants to use them. And
uh, Elbaz is a really interesting dude. A little bit
of background on him. Um. He co founded a company
back in the nineties called Applied Semantics, which created software
that matched ads to web pages like contextually and automatically.
Oh we know a little bit about that. Yeah, yeah,

(19:14):
And this prompted Google to acquire them in two thousand
three for like a hundred two million bucks, So not
doing too bad for himself. Also, that's that's essentially the
reason why Google AdSense exists. That the programming that led
to Google AdSense so very very practical application of that
contextual understanding, right right. Um. In two thousand eight, interestingly,

(19:37):
and kind of a side note, he founded a company
called Factual, which seeks to gather and analyze global location
data in order to create a repository of really high quality,
easily accessible location data that's uh factual um and and
companies like Bing and Samsung and Yelp all use Factual

(20:00):
to construct local maps and personalized advertising for mobile consumers.
So uh so pretty nifty stuff. And what I am
saying is that Elbaz is passionate about and experienced with
big data, right, and we've talked about it before on
this podcast. That big data is. You know, it sounds

(20:20):
like one of those just buzz industry terms, but it
really is one of those things that holds a huge
amount of potential to affect our lives in different ways,
assuming that we've developed the right means to analyze that
massive amount of information that's out there, to collect it
in the first place. Sure, and once you have a
way of processing, of collecting and being able to access

(20:43):
and process vast amounts of data, you can do a
lot of amazing things like big data. And the ability
to process it might be the key to say, for example,
computational modeling that predicts complex social phenomena by analyzing big
data coming from social media and from news and from

(21:03):
weather and from all kinds of sources, it can really
mean that we are able to actually see elements of
order in what previously appeared to be a truly chaotic system,
which is kind of exciting. Sure. And then on the
other hand, a lot of people think that big data
could be one of the ways that we finally achieve
that next level of artificial intelligence by having machines sort

(21:27):
of plumb the depths of this data with self teaching
and self learning mechanisms. Right, well, let's get back to
the common crawl okay, Right, So the Foundation began compiling
crawls back in two thousand eight. The most recent one
that they released as of this podcast at the end
of May, was from April. It was some a hundred

(21:50):
and sixty eight terabits, terabytes in size. Huge.
That's big, uh, and contains some two point one billion
web pages. That's not that many, really, I wrote I
wrote forty seven million web pages before breakfast. No I

(22:13):
did not. I'm just kidding. No, that's a lot. Yeah, yeah, yeah,
it's it's a it's a bunch um uh. But so
they're they're continuously indexing and releasing new crawls, right, It's
not like it's here's the Internet and now we're done. Yeah. Yeah,
they've been releasing a new crawl every month since July.

(22:35):
That's I mean, that's incredible you think about the amount
of work that does. It also means that you have
like a a time a timestamp, like photograph of what
the web was at that moment from these crawls. Yeah. Yeah,
I hadn't thought about it quite that way before, but yeah,
that's it's kind of you know, things that may not
exist from one month to the next, you could actually

(22:57):
see and watch those trends. Yeah, it's fascinating. Yeah. I'd
say one of the main ways that I often encounter
web archives is when I'm trying to find evidence of
something somebody did in the past that they wanted expunged. Right,
This makes it sound like I'm some kind of detective, not
like I'm trying to find but you know what I'm

(23:19):
talking about. No, I know, I will post something and
then they'll be like, oh wait a minute, No, that
was a bad idea. I know if you try to
delete it. I know, if you use archive dot org,
you can find one of the web pages I built
way back when, and I never want anyone to ever
see it because it was that bad. But they will
forever be able to Yeah, you shouldn't talk about it

(23:40):
on podcasts. Pretty sure they're not going to be able
to find it. Tens of thousands of people. Let me
guess you had some... You had a bunch of Rage
Against the Machine lyrics, and it auto-played MIDIs. They're
so close closest, No, you're really far away. I wrote,
I made some web pages for a particular company I

(24:01):
worked for, which is unless you know the company I
worked for that I'm specifically referring to. That's why you're
never going to find that particular web page, and you shouldn't.
It was terrible. It was. It was about, we're going
to find these Go look on his LinkedIn profile. We
can figure out what company was. I already found some
lobster baseball stuff, so yeah we can. That comes from

(24:22):
you almost Got a spit take on you almost got
and uh yeah, that comes from a pre episode conversation. No,
that was actually in the episode, wasn't it. I can't
keep track anymore, all this dungeons and dragons saving throw
talk we had. OK, hold on, I've got a question,
So hold on. If the Common Crawl is trying to

(24:44):
preserve and make accessible continuously updated snapshots of the web,
where on Earth are they going to, like, store that
and make it available, right? Because it's not going to
fit on, like, a thumb drive. So where? I think
it's also funny that that sort of becomes part of
the web, Like the web now incorporates a snapshot of

(25:07):
the previous web, and so it just gets that much larger.
So yeah, where's the stuff living? It is all living
on Amazon Web Services. Uh, specifically, it's stored in
Amazon Simple Storage Service, or S3 as it is
sometimes known, and you can analyze it via Amazon's
Elastic Compute Cloud, or EC2. And this is

(25:29):
so cool guys, because because okay, given Amazon's web service scope,
it means that practically anyone in the whole world can
download entire crawls for free, or can if they don't
want to, you know, use a hundred sixty terabytes of space,
they can just use EC2 to
really easily run simple data crunches for like an hourly charge,

(25:53):
like anywhere from a few bucks to maybe fifty dollars
for pretty simple computations. So and of course, more savvy
users can write their own code to investigate stuff with.
But but but yeah, for for the common user. This
is revolutionary. Yeah. To me, this is uh. I mean,
obviously I would I would recommend doing the approach where

(26:14):
you're you're searching on Amazon stuff. I can't imagine the
phone call you would get from your internet service provider.
I see you're trying to download a hundred and sixty
eight terabytes' worth of data over our lines. You
have gone significantly over your bandwidth cap. Yeah. And they
can do all of this because Amazon has specifically chosen

(26:37):
to waive their storage fees for this and a
handful of other things that they consider to be of
wide public interest, like weather and census data.
And it's that's incredible, right, I mean, because you are
talking about a significantly huge amount of data, So to say,
you know this, this is so important and so potentially

(26:57):
beneficial to mankind that we're not going to end up,
you know, charging these the storage fees for it. That's
that's great, Yeah, very very encouraging. So when we get
down to all, right, well, what can you actually use
all that data for? Well, just think about the stuff
that's on the web and pretty much anything you could
think of that you know, you have a question you

(27:19):
might have that could not be answered through a simple
search query, you could potentially answer by leveraging this information. So, uh,
you know, we're talking about everything from lots of stuff
that deals with AI. Actually, like you were mentioning earlier, Joe,
stuff like developing better natural language algorithms so that the

(27:39):
the machines of the future can understand a wider variety
of inputs and make meaningful connections between that input and
the desired output. So in other words, I could talk
to my computer or my phone as if it were
a person, and no matter how I might word things
in my own quirky way, the machine understands what I mean.

(28:01):
So it's not it's not responding to what I say
versus it's more responding to what I mean, which would
be awesome. Uh. Also stuff like speech recognition, UM emerging
global trends. We mentioned that. You know, let's say that
you wanted to track the outbreak of a disease and
to try and get at you know, where did this start?

(28:23):
How can we prevent this from happening again? That kind
of thing that can be really useful. And sometimes you're
tracking this through uh, not like official documents, but through
you know, people on Twitter saying, oh, I have the
flu right, Yeah, it might be it might be social media,
it could be news reports. I mean, it could be
all these different, completely separate pieces that would be way

(28:47):
too hard for you to put together on your own. Sure,
we did a whole episode once about financial purposes for
big data, so playing the stock markets and all that. Yeah, yeah, exactly.
So this is this is really important stuff, and we're
gonna see a lot more of examples of people leveraging
big data, especially now that it's outside the realm of

(29:09):
just mega corporations, right, where we can see people or
researchers who have interest in all sorts of different fields
taking advantage of this massive amount of information that we
continue to accumulate day after day. And uh, that's really exciting.
It makes me think of having access to the best

(29:31):
research librarians in the world, all boiled into the largest
library you can imagine. That's essentially what we're talking about here.
So very exciting, and the common crawl is pretty inspirational.
It's one of those things where you realize it took
a lot of determination to make that become a reality

(29:52):
and the potential benefit, and it also is incredibly forward
thinking to be all the way back in two thousand
eight and putting this together. And that was before
we had really developed sophisticated tools that could leverage it properly.
Now we're getting to see that as big data has
become an industry unto itself. I mean, it's really exciting now.

(30:13):
So thank goodness that the idea was was implemented long enough,
long ago enough for it to actually have uh you know,
uh an established presence and now we can really see
how we can take advantage of it. So Elbaz is
such a such a fascinating dude. I have found so

(30:35):
many interesting interviews and stuff with him. I think that
we should maybe maybe not on the show, but maybe
if you'd want to do a TechStuff episode kind
of focused on him, that would be kind of cool. Yeah.
I love to do episodes where we are able to
look at influential figures in technology on that show, So
that would be fantastic. I will add that to the list.
Uh So, the Common Crawl is really an interesting project.

(30:57):
If you have not looked into it, go check it out, um,
you know, because it may be one of those things
that could come in handy if you're working on a
research project. If you are just curious about how big
data is gonna continue to have a huge impact in
our lives, go go seek that out. Yeah, it's a
you can find them at commoncrawl dot org.

(31:20):
And if you if you're into making donations to nonprofit
organizations that are tax deductible, you can do that thing too.
That's pretty cool, all right, guys. That wraps up this discussion.
If you have any suggestions for future topics for forward Thinking,
you should let us know. Send us an email. That
address is FW thinking at HowStuffWorks dot com, or

(31:42):
drop us a line on Facebook, Twitter or Google Plus.
At Twitter and Google Plus, we are f w thinking.
Just search fw thinking in Facebook and we will pop
right up. Leave us a message. We read all of them.
We look forward to hearing from you, and you'll hear
from us again really soon. For more on this topic

(32:05):
and the future of technology, visit forward thinking dot com,
brought to you by Toyota. Let's go places,

Hosts And Creators

Jonathan Strickland

Joe McCormick

Lauren Vogelbaum
