
May 6, 2024 46 mins

TechStuff gets in the Wayback Machine to look at the origins of the Internet Archive, which preserves information stored on the Internet. How does it work and how did it get started? 

See omnystudio.com/listener for privacy information.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:04):
Welcome to TechStuff, a production from iHeartRadio. Hey there,
and welcome to TechStuff. I'm your host, Jonathan Strickland.
I'm an executive producer with iHeart Podcasts. And how the
tech are ya? So let's take a little literary trip.
In Anthony Burgess's A Clockwork Orange, the extremely wicked protagonist

(00:29):
and that's putting it lightly, at one point early in
the novel reflects on the nature of permanence. He thinks
the reader might not remember what milk bars were like
due to quote things changing so skorry these days and
everybody very quick to forget, newspapers not being read much

(00:49):
neither end quote. Alex in this case is saying that
the combination of the world changing very quickly (skorry is
derived from a Slavic word meaning swiftly or quickly) and
people having short memories means that referencing something that happened
even just a few years ago might mean you're met
with blank stares because the world has moved on. Now

(01:12):
take that same sentiment and crank it up to eleven
when you talk about the Internet in general and the
Web in particular. So, on the one hand, we know
that the rule of thumb is that once something gets
posted online, that's kind of it, right, it's sort of
perpetually online. Like that's kind of the joke. Like once
it's up, it's up, and you can take it down,

(01:33):
but there's going to be a copy of it somewhere.
So even if the originator tries to take down whatever
the stuff was, somebody's got it. But on the other hand,
we also know that so much stuff gets added every
single day to the Internet. There's actually a colossal mountain
of content out there that just keeps getting bigger moment
by moment, and everything that came before it can end

(01:56):
up getting buried in the process. And sometimes stuff can
be added and taken down without anyone being the wiser. Now,
on top of that, web pages obviously can change. A
website might adopt a new format or style, might incorporate
new technologies and interfaces that are added to web browsers,
or it might choose to remove sections that once might

(02:18):
have been relevant but maybe now not so much. Or
entire websites could disappear as servers go offline or companies
go bankrupt, or you know, web administrators just lose interest.
The entire spectrum of human output can be found on
the web. Not every instance of human output, but an

(02:39):
example of everything is out there. Everything from deep philosophical
musings to the most banal posts you know, which often
revolve around what someone is having for lunch. All of
that finds its way to the Internet. And while you
might argue that a lot of it, or perhaps even
most of it, isn't really worth the time it
takes to consume, let alone keep it around, there is

(03:03):
undeniably a huge amount of valuable data out there too,
but there's no guarantee that it will stay there or
remain easily findable. And that's where today's topic comes in.
I wanted to talk about a project that began back
in nineteen ninety six. It's a project that aims to
preserve as much of the Internet as possible in little
slices of time, little snapshots. Not only does that mean

(03:26):
you can potentially dig up something that hasn't been online
for years, but also you can get a look at
what different sites were like in various eras of the Web.
It could be a really eye opening experience to see
something like Amazon and what it looked like, you know,
shortly after it launched, compared to what it looks like today.
So we are going to talk about the Internet Archive. Now.

(03:48):
To do that, we need to talk a little bit
about the people who founded the ding dang darn thing,
and that would be Brewster Kahle and Bruce Gilliat. So
Kahle graduated from MIT with a degree in computer science
and engineering. After he graduated, he joined fellow MIT graduate
Danny Hillis, who had created a company called Thinking Machines.

(04:10):
So this was a supercomputer company. His team specialized
in building massively parallel computer systems, mostly with the aim
of building machines for AI research and development. So yeah,
Kahle was working on the challenges of providing AI researchers
with the compute power they need, decades before our current
AI explosion. Bruce Gilliat is also a computer scientist, and

(04:33):
that's just about all I know about him. I mean,
I know he is, or at least was, married, and
I also know he owned a series of very impressive
houses in the San Francisco and San Jose areas because
it made the news whenever he sold one or bought
a new one. But other than that, there's precious little
information about him that I could find, which is somewhat ironic
when you consider that he has dedicated a lot of

(04:55):
time and effort to preserving information on the Internet. He
would go on to co-found the company called Alexa
Internet with Brewster Kahle, but that's getting ahead of ourselves.
So most of my story will center around Kahle simply
because out of the two co founders, he's the one
who acted more as the face of the efforts, and Gilliat,
from what I can tell, has just been really good

(05:15):
about kind of maintaining a very private personal life. So
I don't mean to diminish Gilliat's contributions, but at the
same time, you know, I can only cover what I
can find. So in nineteen eighty nine, Kahle, along with
a colleague named Harry Morris, created an innovative tool for
the blossoming Internet. Now remember this is the Internet. It's

(05:38):
not the World Wide Web. The Web didn't exist yet;
the Internet did. And the tool they created was called
the Wide Area Information Server, or WAIS. So people
could create a server. They could host documents on
their servers. But finding these documents was really hard

(06:00):
because you didn't necessarily have hyperlinks connecting one document to
others and vice versa. You didn't have an easy way
of even navigating through different documents from one to the next.
So it was almost the case that you needed to
know where something was and what it was called first,
and then you could go to the relevant server and

(06:22):
retrieve that document. Otherwise the document would just remain quietly
sitting on some server somewhere and no one would know
about it. Now, that is antithetical to the entire purpose
of a wide area information sharing system, because, I mean,
the name tells us the whole purpose of this technology
is to allow information to be widely shared. Jeremy Norman's

(06:45):
History of Information lists WAIS as quote the first Internet
publishing system, just predating Gopher and the World Wide Web
end quote. In a recorded presentation to some Xerox employees,
Kahle laid out a personal perspective on what he wanted from
his experience on the Internet. So first up, he said

(07:06):
he wanted his own personal information to be easily accessible
by him. Specifically, not that it should be easily accessible
to everybody, but specifically to him. He wanted the ability
to get access to all the different stuff he generates,
like articles and such, and to make it really easy
to do that. He also wanted the ability for publishers
to get their work to him. So in Kahle's mind,

(07:28):
the best approach would be for published works that are
relevant to his interests to find their way to him,
as opposed to Kale having to go out and hunt
down these published works himself. And he pointed out this
is what publishers want too, because you wouldn't publish something
unless you wanted folks to actually read it. He also
said that he wanted this technology to be usable anywhere.

(07:48):
He wanted people to be able to access it no
matter what kind of device they were relying on. Now
he was specifically referencing laptops at the time, but he
was also saying that portable computer systems, essentially things that
would become smartphones and tablets, were on the horizon and
that these needed to be able to access that stuff too.
And he said that he wanted people to be able

(08:09):
to use what he had learned should he choose to
share the information, that if he had come up with
something that was useful and he wanted to share that,
he wanted other people to be able to access that.
Kahle didn't say that people should be compelled to share,
but if they wanted to it should be possible to
do so. WAIS was Kahle's attempt to bring these ideas

(08:30):
to life. In that presentation to the Xerox employees, he
defined WAIS as electronic publishing. He further defined that term
to mean the distribution of information. So whether the end
user was to look at this information on a computer
screen or they just chose to print out the information
and then read it that way, that was beside the point.

(08:51):
Electronic publishing was all about how information got from the
originator to the end user. That's what made it e
publishing that it was publishing over wires. Now, one thing
Kahle introduced in this presentation to Xerox was this concept
of conducting searches using natural language. This concept is one
that we're really familiar with today. You enter a query

(09:13):
into a search bar. You describe what it is that
you want to know or learn about, or have access to,
or retrieve or whatever. This search engine brings back search
results that are ordered by some kind of relevance depending
upon the search engine's, you know, various algorithms. How the
search engine determines relevance really depends upon the system itself,

(09:33):
of course, Like you could run the same search across
different search engines and get very different results based upon
that methodology of determining relevance. If the system believes it's relevant,
it may or may not be relevant to what you
actually want. Like hopefully the two are aligned. If it's
a really good search engine, then you're going to get
something that is actually meaningful to you.
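To make that concrete, here is a toy Python sketch of relevance ranking that just counts query-term hits. This is only the gist; WAIS and any real search engine use far more sophisticated scoring, and the example documents are made up.

def rank(query, documents):
    # Score each document by how many times the query terms appear,
    # then return documents ordered from most to least relevant.
    terms = query.lower().split()
    def score(doc):
        words = doc.lower().split()
        return sum(words.count(t) for t in terms)
    return sorted(documents, key=score, reverse=True)

docs = [
    "milk bars and other vanished institutions",
    "a history of supercomputers",
    "how milk gets from a farm to a bar code",
]
print(rank("milk bars", docs))  # most term overlap comes first

Anyway, WAIS was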

(09:57):
kind of following in that approach back before there was
a World Wide Web, you know, when you just needed
a way to find stuff that was being stored on
these Internet servers and to be able to retrieve these
documents to make use of them. Otherwise you had this
incredibly powerful communications tool, but it was so challenging to

(10:19):
use in a meaningful way that the information stored there
would be not that useful. I think of it akin
to this: imagine that there's this one remote library and it's tiny,
but it has the world's only copy of some text.
But this library is in the middle of nowhere. It's really
hard to get to. The fact that that library has

(10:42):
that document would not be terribly useful to most people,
and so it might as well not have the document
at all. That's kind of what WAIS was trying to
do is solve this problem of making it easier to
get access to this wealth of information that Kahle saw
was only going to get more complex and more full
of data. Well, we'll move away from WAIS, because we

(11:05):
could do a full episode about that. I will say
that Kahle and Morris, the founders of WAIS, the guys
who created the WAIS technologies, would actually leave Thinking Machines
and they would found a spinoff company just called WAIS Incorporated.
And it was around this point when the mysterious Bruce
Gilliat joined the team. And while the World Wide Web would

(11:26):
debut in the early nineties, which really opened up accessibility
to information on the Internet for a lot of people,
most of them for the first time, WAIS would continue
to remain relevant. In fact, it was relevant enough that
in nineteen ninety five AOL would come calling with an
offer to purchase the company for a cool fifteen million dollars.
If we adjust that for inflation to today's money, that would

(11:48):
be around thirty million bucks, in that ballpark. Now, a
lot of the folks at WAIS Incorporated would split off
to create new companies after this acquisition, and within a
year that included Kahle and Gilliat, who went on to
found a new company called Alexa Internet and you might think, huh, Alexa,

(12:10):
you mean like the same name as the Amazon Digital Assistant,
And yes, exactly that, because, as it would turn out,
Amazon would ultimately acquire Alexa Internet just a few years
after it was founded. But the name derived from the
Library at Alexandria, the ancient library of Egypt that at
one point housed one of the world's largest collections of

(12:33):
accumulated knowledge. Now around forty eight BCE, Julius Caesar, Julie
Baby and his boys, they barged into Alexandria, and as
a consequence of their rowdy invasion, the library caught fire
and much of the collection burned. Sadly, that was not
the only indignity. In fact, it wasn't the first indignity
that the library suffered that would impact its relevance. Further

(12:57):
conflicts a couple of centuries later pretty much wiped out
whatever had been left from the previous calamities, and the
Library of Alexandria became kind of a touchstone for folks
who have stressed the importance of access to knowledge and
the protection of that knowledge, and the consequences that
could follow from the loss of such knowledge can be
really dire. See also, like, the Middle Ages, the Dark Ages,

(13:20):
for example, that loss of knowledge is a really terrible thing.
So the impetus for Alexa Internet was that Kahle and
Gilliat wanted, in the words of the Web Design Museum quote,
to develop advanced web navigation that would continually improve itself
on the basis of user generated data end quote, which
is a pretty advanced idea for nineteen ninety six when

(13:42):
the Web was still very young and the general public
was still just trying to get a grip on exactly
what the Web and by extension, the Internet were. One
of the first tools that Alexa Internet developed was a
browser toolbar. So installing this toolbar into a browser would
give users access to a sort of crowd powered
recommendation engine. In some ways, it's not that different from

(14:04):
sites like Digg and Reddit that would later rely on
the user community to actually work and to recommend links
to really interesting sites. This toolbar would recommend the sites
to users based upon how the overall community was browsing.
So the more people who were using this toolbar, the
more information was coming in about where they were going, and

(14:27):
thus you would get different recommendations. So if a lot
of people were navigating to a specific site for whatever reason,
you might get a recommendation to do the same. It
was an attempt at an organic way for folks to
suggest websites, kind of like a word of mouth campaign,
and Alexa Internet would also provide meta information about websites
to users if they wanted it. Meta information is information

(14:48):
about information, so this would include stuff like how many
web pages were part of an overall website, or how
many other websites were pointing back to the one you
were on, and so forth. A lot of the stuff
that Alexa Internet could tell you would reflect a specific
web page's relevance. It's the same sort of information that
search engines like Google would take into account when deciding

(15:10):
relevance for search results. And that meant that it didn't
take very long for Amazon to come around with an
offer to purchase Alexa Internet. I'll talk about that more,
as well as the development of the Internet Archive after
we come back from this quick break to thank our sponsors.

(15:35):
So Amazon in nineteen ninety nine takes a look at
Alexa Internet and says, Wow, this is pretty incredible. This
little company has created some means of checking for stuff
like relevance and metadata that could be really really useful
for us, And so Amazon made an offer that Alexa

(15:57):
Internet couldn't refuse to acquire the company for the handsome
sum of two hundred and fifty million dollars in
Amazon stock in May of ninety nine. So this is
a little different than the earlier deal we talked about
where AOL bought, you know, WAIS Incorporated, because this time they
bought it with two hundred and fifty million dollars of
stock. If we just treated that like it was

(16:19):
a cash exchange, then if we adjust for inflation,
that's like around four hundred and sixty nine million dollars
worth of stock. But that's not really how you deal
with the value here, right. You have to think about
how much was the stock worth back in nineteen ninety
nine versus how much is the stock worth today? I
checked and I saw that in May of nineteen ninety nine,

(16:43):
Amazon stock was trading for around two dollars eighty nine
cents per share. These days, it's closer to one hundred
and eighty dollars per share. Plus, between then and now, Amazon
had two different stock splits. There was a two to
one split in late ninety nine, and there was a
twenty to one stock split in twenty twenty two. When
you factor all that in, that two hundred and fifty

(17:06):
million dollars in stock ends up being a ton of wealth.
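For a very rough sense of scale, here is a back-of-the-envelope calculation in Python. One big assumption: that the two dollars and eighty nine cents figure is a split-adjusted price, which is how modern stock charts usually report history; if so, the later splits are already baked into it.

# Rough estimate only; real deal terms (lockups, when shares were
# actually sold) would change the answer substantially.
deal_value_1999 = 250_000_000   # $250M paid in Amazon stock
adjusted_price_1999 = 2.89      # split-adjusted $/share, May 1999 (per the episode)
price_today = 180.00            # rough $/share today (per the episode)

shares = deal_value_1999 / adjusted_price_1999   # ~86.5 million adjusted shares
print(f"${shares * price_today:,.0f}")           # on the order of $15 billion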
Like it's just a huge amount. It would take a
lot of calculating to get an estimate, and even then
it wouldn't really be accurate. Let's just say that deal is
worth a lot. So anyway, the important thing with the
Internet Archive is that Kahle and Gilliat, through their work

(17:29):
in creating tools for Alexa Internet, found themselves able to
create snapshots of the Web. So they were using Alexa
Internet to have a commercial business, and they established the
Internet Archive as a way of preserving information that had,
at some point or another found its home on the Internet.
So they were using Alexa Internet tech to crawl the

(17:52):
young Web in order to index everything, which is a
necessary step if you want to give people access to
the various documents posted on the web. We first have
to know what is there and where it is. To
do that, you've got to index everything. And then they said, well,
now that we are able to index this, we could
actually download these little snapshots and keep them. And according

(18:14):
to the Internet Archive, that would be important because the
average lifespan for a new web page was not very long,
So contrary to our belief that once something is posted
to the Internet, it's there forever, the archive found that
on average, new web pages stuck around for about seventy

(18:34):
seven days, which means it's less than three months, and
then poof, they would disappear, like maybe they would change drastically,
maybe they would just go away. Now, imagine that you
were to walk into a brick and mortar library, but
then you found out that on average the books in
that library would only stick around for three months before
being lost forever. And think of all the knowledge that

(18:57):
would disappear on a regular and ongoing basis. It
would be impossible to calculate the impact of that kind
of reality. It would be like losing the Library of
Alexandria regularly, every three months. So Kahle had come to
the conclusion that knowledge should be preserved and made available
for posterity. This is similar to an idea that was
proposed by Stuart Brand back in the nineteen eighties. It's

(19:20):
a complicated idea that typically gets boiled down to the
saying information wants to be free. That's actually an oversimplification
of what Brand was really communicating. But his point was
that information's value is kind of like a paradox. The
information could be incredibly valuable, right, it could be absolutely critical,

(19:41):
and therefore it could be expensive, but the cost of
distributing information was consistently declining. It was getting easier and
cheaper to share information, and the benefits of making information
accessible are typically pretty tremendous. But information is only accessible
if someone is able to hold onto that info. Otherwise

(20:03):
it's lost. Right, The Internet was such a volatile thing
that there was no guarantee that what you saw today
would be available tomorrow. In the days before the dynamic web,
it wasn't really unusual for someone to establish a web page,
to publish that page, and then later on to wipe
the slate clean or you know, otherwise alter vast portions

(20:24):
of that page in order to use that same web
landscape to host a totally different document. So the old
stuff would just disappear. And so Kahle and Gilliat created
the Internet Archive, a nonprofit organization dedicated to the archival
of information across the Internet. And I think most people

(20:44):
are familiar with it from the Wayback Machine, but
that's just one part of what the Internet Archive does.
As stated by the Library of Congress, the mission of
the Internet Archive was quote offering permanent access for researchers,
historians and scholars to historical collections that exist in
digital format end quote. Kahle and Gilliat founded the Internet

(21:07):
Archive the same year they founded Alexa Internet. So that's
nineteen ninety six. And it wasn't easy. And why is that? Well,
you got to think about the challenge you face if
you want to archive everything on the Internet, or at
least everything that you're allowed to archive on the Internet.
We'll come back to that a couple of times. So,
for one thing, you need to create a way to

(21:28):
capture the content of a web page and to preserve
that for posterity. And you need a way for people
to access those archived web pages and to navigate them.
So Alexa Internet would end up developing these technologies and
commercializing them in various ways, and the Internet Archive was
made possible through these tools. So you could think of

(21:51):
Alexa Internet as being the funding machine for Internet Archive
in the beginning, at least as far as the tools
Internet Archive would use in order to achieve its mission. Now,
on the capturing front, Alexa Internet created a web crawler.
So for applications like web search engines, primarily web search engines,
web crawlers are the soldiers that they send out. A

(22:14):
web crawler's job is to index content across the Internet
and to capture information about what the various web pages
on the Internet are actually about. It's complicated, right. You
could just have a directory of web pages that's based
off the title of the web pages, but title and
content are not always in alignment. So web crawlers are

(22:36):
all about following the various branching pathways across the web.
They crawl through the web, in other words, indexing every
page as they do so. Not everyone, however, wants their
web page indexed. So you can actually include some HTML
language in your web page that indicates that it's off
limits for indexing, and polite web crawlers such as the

(22:58):
ones that Alexa Internet was using, will honor those instructions
and will not index that page. But other pages
that lack this specific instruction of hey, don't index this,
they're fair game. I like to think of web crawlers
kind of like Doctor Strange from the Marvel Universe, the
Cinematic Universe in particular, where he uses his

(23:21):
time manipulation abilities to see where all the different possible
pathways can lead. The web crawlers do that across
the web. They explore all the nooks and crannies. They
follow each link; even the ones that no one
ever clicks on, they follow those too. And you know,
hats off to web crawlers for doing that to build
out these indices, because without it, web search wouldn't work,

(23:44):
and Alexa Internet wouldn't have been a thing. Anyway, Alexa
Internet and by extension, the Internet Archive used several different
web crawlers over the years, but they all basically do
the same thing, or, you know, more accurately,
they all aimed to achieve the same results. So the
crawler starts with seed URLs. This is like the starting

(24:06):
point where you let them go, and then they follow
each link and they download documents to the archive's servers.
The crawlers also reference the links to ensure that they're
not double dipping on a specific crawl. So if you
have a ton of different sites that are all linking
to the same document, like let's say that someone has
published something, and hundreds of other resources on the internet

(24:30):
reference that published document. Well, that means there's all these
different pathways that lead to the same destination, right, and
it would be somewhat wasteful to capture this exact same
document multiple times during the same crawl, so there's cross
referencing that happens in order to prevent that from occurring.
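Here is a minimal sketch of that kind of crawl loop in Python, using only the standard library. The seed URL is just a placeholder, and a real archival crawler would also honor robots.txt and the noindex instruction mentioned earlier, which this sketch leaves out for brevity.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collects the href of every <a> tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, limit=100):
    seen = set(seed_urls)       # the cross-reference: one capture per URL per crawl
    queue = deque(seed_urls)
    snapshots = {}
    while queue and len(snapshots) < limit:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue            # unreachable pages simply don't get captured
        snapshots[url] = html   # "download the document" to the archive
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return snapshots

pages = crawl(["https://example.com/"])   # hypothetical seed list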

(24:52):
This process does work, but it also has limitations. So
for one thing, these crawls they do create snapshots of
the web in intervals. So if you use the Wayback Machine,
we'll talk more about that in a second, you'll see
that the history of a web page consists of a
series of dates from which the Internet archive first received
a snapshot of that page, and it leads all the

(25:13):
way up to the most recent reference of that page,
the most recent snapshot. The various dates and the wayback
machine are not necessarily relevant to any major changes that
happened on the web page itself. This is just when
the web crawlers went to that particular web page. So
it may be immediately after a massive change has been implemented,

(25:35):
it may be well after. In fact, there might be
a point where between web crawler visits a web page has
changed a couple of times. Well, that means that the
ones that are happening in between those changes aren't going
to be captured. It's just whatever was there the first
time the web crawler came through, and whatever was there
the next time the web crawler came through. So an interesting

(25:57):
thing is that if a particular page does have a
ton of other links pointing to it, that page is
more likely to have very frequent snapshots throughout its history,
because again, through subsequent crawls, there are various routes that
take web crawlers through that web page, so they're more
likely to capture a snapshot of it. For pages that

(26:18):
have fewer links pointing to them, maybe there aren't that
many other web pages out there that cite this particular page,
they're more likely to have sporadic updates throughout their history.
You might pull up a page in the Wayback Machine
and see that there's only maybe half a dozen captures
of that particular page, and that means that there could

(26:39):
be a lot of changes that were missed in between visits.
So not everything gets captured in the Internet archive. I
think that some people work under the mistaken presumption that
anything that was ever published to the web is captured
and archived. There that's not the case. It's whatever was
there when the web crawlers came through it. So, because

(27:00):
even the Internet Archive is not a perfect record of
everything that's ever happened on the web, other elements, like
I said, could also be lost to time due to
the complexity of web navigation. For example, when web
designers started to incorporate things like Flash, which really is
no longer a thing but it was for a while,
or JavaScript, then the web crawlers that were being used

(27:24):
to index the web, a lot of them just couldn't
navigate these types of tools that were made through Flash
or JavaScript. So while human users could, you know,
interact with interfaces that had these tools created
through these various methods, web crawlers couldn't. And that meant
that if a website used like tools that were made

(27:46):
in JavaScript to act as the interface, the web crawler
might only be able to index the homepage, but not
any of the other links branching off from the homepage
because it couldn't navigate that same interface. So there's a
lot of stuff from that era that's lost to the
Internet Archive as well, simply because the crawlers just could

(28:07):
not navigate those pages. They were never captured. And like
I said, if you happen to have the instruction, the
HTML instruction not to index the site, well then that's
not going to be there either. Now let's move on
to another challenge, which is the storing of these files.
Indexing everything was one thing. How do you store everything

(28:30):
that can be indexed on the web in an archive?
That's what we're going to come back and explore after
we take another quick break to thank our sponsors. Okay,

(28:50):
so the Internet Archive, how do you store all the
information that you find across the web. Well, the big
one for web pages was that you had to figure
out where do you store and how do you organize
snapshots of the web so that, one, you have a
record of them, and, two, you can find what you're
looking for. You can navigate to the specific instance that

(29:12):
you're looking for. Keep in mind again, the archive's not
capturing everything. As I said before the break, there's a
lot of stuff that web crawlers could not access for
one reason or another. Those things would be either off
limits or inaccessible and thus would not be in the archive.
But everything else was still fair game. So to store
and organize everything, Alexa Internet created a new file format

(29:36):
called an ARC file, ARC. ARC files contain information about
all the stuff that's inside them, the metadata of the Internet.
So again, metadata is data about data. It makes the
small files inside the larger ARC files all self identifying,
so there's no need to actually build out an index.

(29:56):
The self identifying information includes stuff like the URL for
the file, like what the URL for that particular document is,
how big the document is when it was retrieved, and
other stuff like that. Each ARC file would have a
capacity of around one hundred megabytes, and it was possible
for a single website to span multiple ARC files. I mean,
there's some big websites out there that have been around

(30:18):
for a long time, so yeah, sometimes a single ARC
file would just be a portion of that website.
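To give a flavor of that self-describing idea, here is a little Python sketch. To be clear, this record layout is made up for illustration; it is not the actual ARC specification, which defines its own header fields.

from datetime import datetime, timezone

def write_record(container, url, body):
    # Each record carries its own URL, retrieval time, and length,
    # so the container can be scanned without a separate index.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    header = f"{url} {stamp} {len(body)}\n".encode("utf-8")
    container.write(header)
    container.write(body)
    container.write(b"\n")

with open("snapshots.arclike", "wb") as container:
    write_record(container, "http://example.com/", b"<html>hello</html>")

At first, the Internet Archive stored all this information on magnetic tape.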
So you would do this indexing of the web, all
these snapshots, and you would save it to magnetic tape.
I remember I used to work for a company, a

(30:40):
consulting firm that had magnetic tape backups. So it was
my job, one of my jobs to occasionally back up
all the data on our network to tape, and I
would have to swap tapes out and label them and
everything and archive them properly. The Internet Archive worked under
the same idea. It would capture a snapshot of all

(31:01):
the files across the web, save them to tape, and
that was how the Internet Archive kept track of things
for about three years. But eventually activity on the Internet
was such that that was not going to do it.
There were too many users who wanted to be able
to access things that were stored or saved within the

(31:23):
Internet Archive, and this method just couldn't keep up with
demand. And necessity, as we all know, is the mother
of invention. So the Internet Archive needed an alternative way
to store these snapshots. And of course, the Web was
really growing dramatically, which is putting it lightly, and there
was a real need to step things up considerably. So
to that end, the staff at Internet Archive developed a

(31:46):
storage system they called the PetaBox, and it was
called the PetaBox because it could house a petabyte of information.
A petabyte, in case you're curious, is a million gigabytes. Now,
the most recent data I have about the PetaBox storage
system actually comes from December twenty twenty one, so it's
a few years out of date. But at that time,

(32:08):
the Internet Archive was using two hundred and twelve petabytes
of storage, which is a lot. That wasn't all the
Wayback Machine, however; only around fifty seven petabytes of that
was for the Wayback Machine. The rest was for other
things like archiving various forms of digital media as well
as what Internet Archive references as quote unquote unique data.
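Since those units get big fast, a quick bit of arithmetic with the figures above, using decimal units where one petabyte is a million gigabytes:

PB_IN_GB = 1_000_000
total_pb = 212                              # total storage as of December 2021
wayback_pb = 57                             # the Wayback Machine's share
print(total_pb * PB_IN_GB)                  # 212,000,000 gigabytes overall
print(round(wayback_pb / total_pb * 100))   # Wayback is roughly 27 percent of it

Anyway,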

(32:33):
the page on Internet Archive's site says that the data
centers, there are four of them that house the petabyte
storage system, don't use air conditioning, which helps keep electric
bills down. They actually let the heat from the data
storage devices provide heating for the buildings that they're stored
in and that you know, this is all part of

(32:56):
a strategy to keep things at low cost but high
usability and high efficiency. So that's really the big requirements
for the PetaBox system. It has to be efficient. It
cannot require too much power to operate any single PetaBox.
Another requirement is that each rack of hard drive storage

(33:17):
has to hold a ton of hard drives. We're talking
like one hundred plus terabytes worth of hard drive space.
Another requirement is that serving as an administrator
needs to be easy, like it can't be complicated to
administer this storage system, and according to Internet Archive, the

(33:37):
structure of this is such that you need about one
administrator for every petabyte worth of data, so you know,
that's like two hundred administrators. Essentially, the whole goal was
to create systems that were relatively inexpensive, relatively efficient, and
relatively easy to use. At least from an administrative perspective.
That's a really tall order. It's hard to meet all those

(34:00):
but the folks at Internet Archive made it happen, and
it was such a useful approach to storage and to
being able to organize the files within storage so that
you didn't have to build out indices that ultimately Internet
Archive would deploy this same strategy for other organizations and institutions. Okay,

(34:21):
but that's all about, you know, collecting and storing all
the information across the Internet. How do you access it?
How, as a user, as a researcher, are you
able to tap into this? Because again, unless accessibility is easy,
then there's not much point to doing this. You're just

(34:42):
making a record that nobody can reference. Well, I would
argue the most famous of the ways to access information
contained within the Internet Archive is the wayback machine, which
is specifically for web pages. The Internet Archive first introduced
the wayback Machine in two thousand and one, and the
way it works is pretty simple. There's a little it's

(35:05):
kind of like a search bar, but it's a urlbar.
You put in a URL for the web page that
you're interested in, and the wayback machine pulls up the
snapshots that are contained within the archive if there are
any snapshots. As I mentioned earlier, not everything is in there,
but if it is in there, you will see options
available to you to look at the page at different

(35:25):
points in history. One thing I like to do is
look back at how famous web pages have changed in
their design over the years. If you put in something
really big, like CNN dot com, you can see
how the look and interface of that site has transitioned
during different eras across the web. I also used to
do this with the old website I worked for, HowStuffWorks

(35:47):
dot com. I mean, that's where TechStuff gets the
stuff in its name, from HowStuffWorks dot com.
like using the wayback machine to look at what the
site looked like when I first joined, which was a
in February two thousand and seven. In case you're curious.
It looks entirely different now than how it looked back then,
and through the Wayback Machine you can see what it

(36:09):
looked like back then. Also, these days, the Wayback Machine
is the only way I can see some of the
articles I wrote for that site, because the articles have
been either deleted or, more likely, rewritten over time. Now,
To be fair to HowStuffWorks, a lot of
my writing was in the computers and electronics sections, and
obviously things change in those fields very quickly, and something

(36:32):
that was relevant fifteen years ago is definitely not relevant today.
So you have to replace old stuff on a regular basis.
But it is kind of sad that a lot of
my work, a lot of my work for the first
you know, ten years of my career doing this kind
of stuff, is not accessible unless you use something like

(36:52):
the Wayback Machine. Now, one super neat thing about the
Wayback Machine is that you can still follow links that
are on pages, like if the archive has those linked
assets also in the archive, then you're going to be
shown a record, and the record will be one that
was captured closest in time with the first page that
you were originally on. This sounds complicated, so let me give

(37:15):
an example; it makes it way easier. So let's say
that I visit the web capture, the snapshot, for HowStuffWorks
dot com's homepage on February nineteenth, two thousand and seven.
By the way, this snapshot on February nineteenth, two thousand
and seven is the closest date to when I started
working at that company that's in the archive. The actual

(37:37):
date when I started, the website was not captured on
that day. Anyway, by clicking around on this homepage, I
can actually follow links and it'll pull up archived links
of archived articles, which is really neat. And when I
did that, at one point, I clicked on a link
for more information or related articles to how helicopters work.

(38:01):
That page, the related page was actually archived on February
twenty second, two thousand and seven. So one was on
February nineteenth, the other was February twenty second, but the
link still worked, right? Yes, these were two different pages
that were archived on two different days, but the nature
of the archive allows those links to still work between

(38:26):
the two, which is neat because I'm not just popping
around through a web of links. I'm also kind of
time traveling, right, I'm looking at a timeline of snapshots
that are all still interlinked together, even if they were
captured on different days. I think that's really cool.
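Under the hood, picking the capture closest in time is a simple idea. Here is a toy Python version with hypothetical dates matching the example above; I am not claiming this is the archive's actual code, just the gist.

from bisect import bisect_left
from datetime import date

def closest_snapshot(snapshot_dates, target):
    # snapshot_dates must be a non-empty list sorted ascending; return
    # the date nearest to target, checking the neighbors on either side.
    i = bisect_left(snapshot_dates, target)
    if i == 0:
        return snapshot_dates[0]
    if i == len(snapshot_dates):
        return snapshot_dates[-1]
    before, after = snapshot_dates[i - 1], snapshot_dates[i]
    return before if target - before <= after - target else after

captures = [date(2007, 1, 3), date(2007, 2, 22), date(2007, 6, 10)]
print(closest_snapshot(captures, date(2007, 2, 19)))   # -> 2007-02-22

Now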
it gets even more cool when you think about the
scale of this project. So, according to the Internet Archive itself,

(38:48):
the archive contains eight hundred and thirty five billion with
a B web pages, And as I mentioned earlier, that
just makes up part of all the data that's stored
on Internet Archive servers, because the organization is also home
to more than forty four million books and other texts,
fifteen million audio recordings, more than ten million videos, and

(39:11):
more than a million different pieces of software. Again, some
of this stuff might not be recorded anywhere else. There
may not be duplicates or copies of some of this
stuff anywhere else. While you might have things like Blu
ray DVDs or whatever of some of those videos, others

(39:31):
might not have anything. And history is filled with instances
of media companies generating stuff or others, you know, independent
people too, generating stuff but not keeping a copy for posterity,
and then it's here and it's gone. Sometimes that's on purpose.
Sometimes it's a statement, like you make something ephemeral for

(39:52):
that very reason. Other times it's out of convenience, Like
there are stories about how the BBC would regularly reuse
tapes and tape over previous programming because there was no
thought about preservation or a home theater industry. So there

(40:13):
are entire eras of stuff like Doctor Who that are
just gone or believed to be gone because the BBC
would just tape over old tapes and so you lost
whatever was on there originally. That's why things like the
Internet Archive exist: to avoid that, in the case
of stuff that's stored across the Internet, to make sure

(40:35):
that there is an accessible record of those things and
that they don't just disappear. In two thousand and seven,
the state of California recognized the Internet Archive as an
official library, which was important; it's not just an honorary title.
It would allow the nonprofit organization to receive federal funding,
which is a pretty important development for the longevity of

(40:56):
the program. But while the usefulness of the organization is
beyond question, the methods that the Archive has used
have not always been met with universal approval. For example, recently,
the Internet Archive has been embroiled in a pretty nasty lawsuit.
It's called the Hachette versus Internet Archive suit, and it
revolves around a group of publishers that object to how

(41:17):
the Internet Archive scans physical books for the purposes of
lending them out as digital copies. Publishers are in the
business of publishing and selling copies of books, but for years,
libraries have existed in order to get copies of various
books and to make them available for lending. So libraries
have to purchase the books or have them donated to

(41:38):
the library, and then makes those books available to lend
out to members of the library. The Internet Archive has
a controlled digital lending program to handle this sort of thing,
only we're talking about digital formats, not a physical copy
of a book. This is where things get tricky because obviously,
if you, as an American citizen at least, if you

(42:02):
go out and buy a copy of a book, you
can do whatever you like with your copy of that book,
apart from making your own copies of it and then
selling those. You can't do that. That's copyright infringement. But
if you own a physical copy of a book, you can.
You can keep it for yourself. You could lend it
to a friend and let them read it, they return

(42:22):
it to you later. You could give the book away.
You could resell your copy to someone else, even if
you're selling it for a fraction of what the book
is going for in bookstores. You could do that. You
could even burn the darn thing if you're so inclined.
Just don't do that. Don't burn books. But all of
those things are permitted with your personal copy of the book. However,

(42:43):
a digital copy, well, now we're starting to talk about
different rules. So yes, you can lend out a physical
copy of a book. That's allowed. That's fair use. But
actually it's not even fair use. That's under laws of property.
But we won't get into all that. A digital copy
is a lot trickier because it's easy to replicate, much

(43:04):
easier than replicating a physical copy of a book, and
so different rules have developed to handle digital information compared
to stuff that's in our physical meatspace. So this
lawsuit argues that the Internet Archive first digitized physical books
without permission from the publishers, and that that was problem
number one. There's been some different arguments about that, like

(43:27):
if there was no ebook equivalent of the copy of
the book, if the publishers had not digitized that, that's
slightly different than if the publishers also offer an electronic
version of the physical books they sell. But the other
problem is that the Internet Archive received donations and funding
that in part stemmed from the practice of lending out
digitized books, So the publisher said that made the Internet

(43:50):
Archives activities a commercial enterprise. In twenty twenty three, a
judge found in favor of the publishers, saying that the
Internet Archive failed to argue that their work fell under
the principles of fair use. Again, getting into fair use,
that's a whole thing, but generally speaking, fair use covers
a relatively narrow set of use cases in which the
copying or the use or the distribution of a copyrighted

(44:12):
work does not count as copyright infringement. But it has
to meet certain criteria, and it's only ever decided in
a court of law. It's not something that you
can just apply proactively. It's something that you use in
a defense if you're brought up on charges of copyright infringement.
So by the time you're actually talking fair use, it's

(44:34):
already pretty late in the game. But anyway, this particular
lawsuit is under appeal. The Internet Archive recently made final
arguments in the case. I have not seen anything about
the case being decided one way or the other since then,
so I'm not really sure which way it's going. Again.
I didn't see anything about a decision made, but then

(44:57):
most of the articles about this are about the initial trial
that happened in twenty twenty three, so hopefully I will
find some follow up on this at some point. But
there's no denying the Internet Archive has done a tremendous
amount of work in the field of knowledge preservation and
knowledge accessibility. Without the Internet Archive, there's no way of

(45:18):
knowing how much information would be lost to us forever.
Stuff that could have been incredibly useful or even just
diverting could be gone, and we'd never have a way
of retrieving it again. And I am very thankful that
an organization like the Internet Archive exists. If you're not
familiar with it, if you never used it, I recommend

(45:38):
you check it out and explore the Internet Archive. Look
at some of the things that are in that archive,
like some of the books, some of the recordings. There's
some great stuff. I think there's like a quarter of
a million live performances archived just on the Internet Archive,
like live music performances. That alone is super cool. Anyway,

(45:58):
I hope you found this episode informative and entertaining. I
hope you check out the Internet Archive. I also very much
hope that you are all well and I will talk
to you again really soon. Tech Stuff is an iHeartRadio production.

(46:20):
For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts,
or wherever you listen to your favorite shows.
