A Small Episode About Big Data

Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:04):
Welcome to TechStuff, a production from iHeartRadio. Hey there,
and welcome to TechStuff. I'm your host, Jonathan Strickland.
I'm an executive producer with iHeart Podcasts and how the
tech are you? So early on in the days of
TechStuff, back when I was still a staff writer

(00:26):
for a little website called HowStuffWorks dot com, my boss
Conal Byrne, who is now a big shot over here
at iHeart, he came over to me with an assignment.
He wanted me to do some articles and some episodes
about this buzzword concept called big data. And I had

(00:46):
heard the term big data, and obviously there's a pretty
darn good hint of what big data is all about
just in the nature of the name itself, but beyond that,
I didn't really know much, so I jumped to it.
And the interesting thing is that since that time, the
discipline of big data has evolved significantly. When I was
first working on my articles and episodes, we were mostly

(01:08):
talking about how technological tools made it easier to collect
vast amounts of information very quickly and to store it.
But we didn't necessarily have equally sufficient tools to do
anything useful with all that information, or at
least those tools weren't widely known and understood beyond a

(01:29):
certain circle of computer scientists. Flash forward a few years,
and we'd see companies developing new methods to analyze large
chunks of data. Oh, by the way, I do the
weird data data thing, and there's no rhyme or reason
to it. I don't even know which one I'm going
to say before I say it, so I apologize because
I know it's irritating. It irritates me too. Anyway, other

(01:53):
companies sprang up with products that were meant to help
with data analysis, and it seemed like we were going
from an era of well, now I have all this information,
what do I do now, to an era of I
have discovered cryptic secrets that were hiding in plain sight
thanks to data analysis, and that somehow it all happened overnight.

(02:13):
So today I thought we would actually look back over
the history of the big data concept, how various systems
have made it possible to sift through seemingly meaningless information
in order to find nuggets of wisdom, and why we
might not always be able to trust the answers that
we discover. So the history of big data starts in

(02:36):
the twenty tens, or maybe it starts in two thousand
and five, or maybe in nineteen ninety, or maybe the
sixteen hundreds, or maybe nearly twenty thousand years ago. You
might have already picked up on the fact that folks
don't quite agree on where we should start when talking
about big data. But that makes sense. Ever since humans
have started to write stuff down, we've been pretty darn

(02:58):
invested in the collection and then the classification of information.
Whether it's to figure out the best time to sow
or harvest crops, or keep track of how much we've
traded with that other band of ne'er-do-wells who live
on the other side of the holler, or we just
want to make a record of how great it was
that we kicked the butt of that mastodon, real good.

(03:20):
We've been really obsessed with data collection and retrieval. Now,
this obsession also means that we had to come up
with various ways to store and analyze this information. Raw
information doesn't do anyone much good, and so throughout antiquity
we came up with means of recording and storing and

(03:41):
making use of information. Not only did hardworking humans create
libraries where we could gather all this knowledge and then
lose some of those libraries along the way due to
the fact that we humans also are pretty stupid and
we end up having disputes that involve burning each other's
stuff to the ground. Yeah, I'm still bitter about certain

(04:03):
libraries being destroyed over in antiquity, but it means that
we also had to come up with methodologies to categorize
and classify information. Otherwise you may as well just have
a big old pile of scrolls or books or whatever,
and then people just you know, have to sort through
them and see if they can find anything, which actually
sparks two different memories in my head. One is that

(04:25):
there used to be a used bookstore I would go
to here in Atlanta, and often the used bookstore was
completely unorganized, right, Like you literally could go through a
bookshelf and it's just going to be books that are
more or less the same size, but otherwise there's no
rhyme or reason as to why they were put there,

(04:46):
and it was like you were on a treasure hunt.
And then I'm also reminded of a naval museum in Apalachicola, Florida,
which is on the Panhandle. I went to this little,
you know, naval museum, like a ship museum, and I'm
reminded that all the exhibits were kind of in a
pile on the floor, and you would literally pick things

(05:08):
up and look at them. And that's kind of what
it would be like if we didn't have these means
of classification. Once you get to a certain size, like
that little museum in Apalachicola wasn't so big as
to be a problem. But if you're talking about a
big library, obviously, if you want anything useful, you got
to come up with a way of classifying all this.

(05:29):
To that end, ancient folks began to develop a science
called taxonomy. And this isn't when you stuff dead animals
so that they look like they might still sort of
be alive. That's taxidermy. No. Taxonomy is the science
of classification, and it's perhaps best known in the field
of biology, thanks in large part to a Swedish scientist

(05:51):
from the eighteenth century named Carl Linnaeus. But there are
many applications of taxonomy that extend beyond biology. It's just
the biological taxonomy is the one that I think most
of us are familiar with because most of us were
taught it when we were going through basic biology science.
But the ancient Greeks made some early progress on developing

(06:11):
systems of classification, and obviously, within modern library science, taxonomy
is an important discipline, though oddly enough, you could say
taxonomy in library science is distinct from classification. When I
was looking this up, I found resources for library science
that made these two distinct disciplines. Classification was one and

(06:32):
taxonomy was another. Now, this is because there are various
methods of classification in library science. The one that I
was most familiar with when I was growing up was
the Dewey Decimal System, which I don't even think is
the dominant form now, but it was when I was
growing up. And it's meant to connect a specific work
to a specific physical location in a library for the

(06:54):
purposes of, you know, tracking down the book, right? But
taxonomy in library science tends to be more towards metadata
or data about data. In fact, metadata plays a huge
part in big data. Oh man, I did it both
ways in one sentence. I feel awful. Anyway, the information
about information can be as useful as the information itself

(07:17):
in some cases. I have often talked about this with
personal information about how info about info can give you
a lot of insight into a person. Maybe you don't
have a person's name, but you have a couple of
different data points about that person. In some cases, you
can actually narrow down the identity of the person you're

(07:37):
thinking of just by looking at this metadata. You don't
even have to see the information about them, which shows
you how powerful metadata can be. So you start to
see a cascading effect here where you slowly realize that
you actually have access to even more information than you
first anticipated because you also have information about that information.
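The narrowing-down effect described above can be sketched in a few lines of Python. Everything here is made up for illustration: the records, the attribute names (`zip_code`, `birth_year`, `job`), and the helper `narrow` are hypothetical, not drawn from any real data set. The point is only that a handful of metadata-style attributes, with no name attached, can shrink a pool of candidates down to one.

```python
# Hypothetical, made-up records: no names, only metadata-style attributes.
records = [
    {"id": 1, "zip_code": "30301", "birth_year": 1985, "job": "teacher"},
    {"id": 2, "zip_code": "30301", "birth_year": 1985, "job": "nurse"},
    {"id": 3, "zip_code": "30301", "birth_year": 1985, "job": "teacher"},
    {"id": 4, "zip_code": "30301", "birth_year": 1990, "job": "nurse"},
    {"id": 5, "zip_code": "30302", "birth_year": 1985, "job": "nurse"},
    {"id": 6, "zip_code": "30302", "birth_year": 1972, "job": "pilot"},
]


def narrow(candidates, **known):
    """Keep only the candidates that match every known attribute."""
    return [
        r for r in candidates
        if all(r.get(key) == value for key, value in known.items())
    ]


# Each extra data point shrinks the pool of possible people.
step1 = narrow(records, zip_code="30301")
step2 = narrow(step1, birth_year=1985)
step3 = narrow(step2, job="nurse")

print(len(step1), len(step2), len(step3))
```

With real populations the pool is far larger, but the same filtering logic applies: each additional attribute tends to cut the remaining candidates sharply.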
It gets pretty wild. Another important development in the history

(07:58):
of big data is the creation of statistics. So let's
give the Merriam-Webster definition of statistics, shall we, just
to have a baseline. It is, quote, a branch of mathematics
dealing with the collection, analysis, interpretation, and presentation of masses
of numerical data, end quote. Now, one famous early example

(08:21):
of statistics comes to us courtesy of a fellow named
John Graunt. He was looking at mortality rates in London,
and that gave him a lot more information and helped
him analyze the course of the plague. For example, he
could see when the plague was spiking or receding. Pretty

(08:41):
cheerful stuff, right, But he also used this information, the
mortality information, to start drawing some conclusions about the population
of London as a whole, So counting up everybody, like
figuring out who lives in London. That would have been
challenging at the time, to say the least. But Graunt
took information like the number of funerals and then he

(09:02):
compared it to things like the average family size in
London to try and make an estimate of London's population.
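To make that kind of back-of-the-envelope estimate concrete, here is a minimal sketch in Python. The function name and the three inputs are assumptions for illustration: roughly 13,000 burials a year, about 3 deaths per 11 families per year, and 8 people per family are figures often quoted in connection with Graunt's reckoning, but they are not stated in this episode, so treat them as placeholders.

```python
def estimate_population(burials_per_year, deaths_per_family_per_year, people_per_family):
    """Estimate a city's population from burial counts.

    Step 1: burials / (deaths per family per year) gives a rough family count.
    Step 2: families * (people per family) gives a rough head count.
    """
    families = burials_per_year / deaths_per_family_per_year
    return families * people_per_family


# Assumed, illustrative inputs (not taken from the episode).
estimate = estimate_population(
    burials_per_year=13_000,
    deaths_per_family_per_year=3 / 11,
    people_per_family=8,
)
print(round(estimate))
```

The value of the method is less the exact number than the idea: records that were easy to collect (burials) stand in for something that was very hard to count directly (everyone living in the city).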
So it gave him kind of a working figure that
was useful for certain applications, specifically government ones. Statistics as
a branch of mathematics would mature over the following centuries.
Often it would be the tool that allowed social scientists

(09:25):
to draw broad conclusions about large populations, but others found
plenty of alternative applications of statistics. Anyway, the age of
data analysis was well and truly in swing at this
point in the late nineteenth century. The United States was
getting in a bit of a pickle. And I know
we're making jumps of centuries here, but we need to.

(09:48):
We can't go through every single evolution of data collection
and data analysis. That would be a podcast series all
in itself. So we're in the late eighteen hundreds and
the US is in a bit of a problem. The
country holds a census every ten years, where they're essentially
gathering information about all the citizens in the United States.

(10:09):
This is required by the US Constitution, and there are
several reasons why the Census Bureau holds a census every
ten years. But one of those reasons is that the
US House of Representatives its membership depends upon population. So
the more populous a state is, the more representatives that

(10:29):
state has in the House of Representatives. So if your
state has a big population, there are more representatives that
go to the House. If you have a relatively small population,
then you have fewer House representatives, right, That's how that works.
So by eighteen eighty things were getting to a really
difficult situation. The process of collecting and then analyzing all

(10:53):
the information was so cumbersome that it would take nearly
the whole decade just to get to a result, and
that means by the time you're drawing conclusions, it's actually
time for you to administer the next census. In fact,
they projected that in eighteen ninety, working on the same
process that they were dependent upon previously, it would take

(11:15):
a whole decade, so literally you'd be holding your next
census while you were just getting your information from the
last one. So the Census Bureau needed a way to
collect and analyze this information in a much more efficient process.
They tapped a man named Herman Hollerith to accomplish this.
So Hollerith took a punch card system that had been

(11:36):
used in weaving, weaving with mechanical looms. I've talked about
this in the past with the history of punch cards.
In fact, this also gets into perhaps a somewhat apocryphal
story of where the word sabotage comes from, but that's
for another time. So he took this punch card system
that had been used to set weaving patterns with mechanical looms,

(11:58):
and then he adapted that to serve as a way
to record information so that you could feed the card
to a tabulation machine which then could actually tabulate the results.
And his invention meant that ten years of labor done
by clerks who are working at desks would actually boil
down to about three months of labor using the tabulation machine. Obviously,

(12:21):
that was a huge improvement. Hollerith formed a company that
over time would evolve into one of the most famous
companies in all the world, Kentucky Fried Chicken. I'm just kidding.
It wasn't KFC. Instead, it was IBM. That's the company
that would grow out of Hollerith's company that he founded
in the nineteenth century. Anyway, we're not going to spend

(12:44):
too much time in all these centuries gone by. We're
actually going to speed things up and get up to
the twentieth century. But before we do that, let's take
a quick break to thank our sponsor. We're back. Okay,

(13:08):
So the actual term big data is still waiting for us.
We're not going to really get to that until we
hit the late nineteen nineties or so. But there are
a few things to point out before we get up
to there. Folks were starting to notice that we were generating, collecting,
and storing an awful lot of information in the twentieth century,

(13:29):
and that the rate of data generation was on the rise.
Not only were we generating a whole bunch of information,
we were doing it in larger amounts year over year.
In fact, it was rising much faster than our rate
of consumption of information, meaning that we were making way
more data than we were actually able to use. And

(13:50):
a big thanks goes out to Forbes for an article
that's titled A Very Short History of Big Data by
Gil Press. A lot of the information that I'm drawing
upon came from that article. It is fantastic if you
want to learn more about this. I'm not going to
cover every element that they do. I mean, that would
just be me regurgitating their article. You should check it

(14:10):
out if you're interested in the history of big data.
We're going to touch on a few of the important points,
or what I think of as the important points. So
one of the earliest ones we're going to talk about
is in nineteen forty four, a librarian named Fremont Rider,
which is a fantastic name, wrote a work titled The
Scholar and the Future of the Research Library. So Rider

(14:32):
made an observation that reminds me a lot of Gordon
Moore's famous Moore's law, except this involves not silicon chips
but physical libraries. So Rider said that your typical library
in your typical American university was doubling in size every
sixteen years. He projected that this would mean that by

(14:52):
the year twenty forty, the library at Yale University would
be so large as to require a staff of more
than six thousand people to manage it. Of course, this
was before we had digital storage and digital filing systems
that have largely mitigated this particular requirement. We don't need

(15:13):
the physical space necessarily that we would if everything were
still in hard copy. But the observation showed that data
accumulation really had a steep trajectory even back in the
nineteen forties. Similarly, in the early nineteen sixties, a guy
named Derek Price published a piece explaining that the number
of scientific journals and papers was on a path of

(15:35):
exponential growth. It was doubling every fifteen years, so similar
to the rate at which university libraries were doubling in
size. Now, part of the reason for this, he said,
was that scientific discoveries inevitably fuel further discoveries. So you
find out something new, this inspires other scientists to look
further into it, they find other new things, and so on.
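The doubling claims above are easy to check with a little arithmetic. Here is a minimal sketch in Python, assuming smooth exponential growth; the function name `growth_factor` is made up for illustration, and the sixty-year horizon in the second call is an arbitrary choice, not a figure from the episode.

```python
def growth_factor(years, doubling_period):
    """How many times larger something gets if it doubles every `doubling_period` years."""
    return 2 ** (years / doubling_period)


# Libraries doubling every sixteen years, from nineteen forty four to twenty forty:
# 96 years is 6 doublings.
library_growth = growth_factor(2040 - 1944, 16)

# Scientific journals and papers doubling every fifteen years, over an assumed sixty years:
# 60 years is 4 doublings.
journal_growth = growth_factor(60, 15)

print(library_growth, journal_growth)
```

Six doublings means a library 64 times its original size, which gives a sense of why a projection like that would call for a staff in the thousands.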

(15:56):
In nineteen sixty five, the United States government needed to
build a place that would store records, including things like
tax returns and fingerprint sets, and so the plan was
to take the paper records and then transfer them to
magnetic tape, and then to store that magnetic tape in
this so called data center. This project fell through, however,

(16:18):
because the public got nervous. They felt squicky about this
idea of the government hoarding vast amounts of information about
its citizens. They did not fully trust the government. So
you understand like they're thinking, I don't really feel comfortable
with you just gathering all this information about us. It
feels kind of oppressive. Now, what's funny to me is
that today the average person is more than willing to

(16:41):
let companies do this to them without even protesting it.
Because that's how all the online social network companies work, right,
They work on the basis of gathering information about us
and then peddling that or hoarding it, however you
might think of it. And it's very similar to what
was happening in the nineteen sixties. And back then we
were like, no, that's not cool, and now we're like,

(17:02):
that's just how it works. It's wild to me. Anyway,
I'm going to skip ahead a little bit to the
nineteen eighties. There was a lecturer, I. A. Tjomsland, and
I know I butchered his name. I apologize. Anyway, he
gave a lecture at the IEEE symposium in

(17:24):
which he posits that one reason all this information is
piling up is that we don't really have a good
way to determine which information is relevant and which information
is not. And we can make that determination, but it
requires work, and meanwhile, we're still accumulating more information. So
it's the kind of work where you're never done, and

(17:45):
it feels like you're never making any progress. So most
of us never bother to do it at all. And
if our ability to store data is sufficient, in other words,
if we have ways of storing the information, then we
have even less incentive to make any determination about the data. Right, Like,
if we've got plenty of storage, well, let's just go
ahead and keep the information. There's no reason to have

(18:06):
to worry about whether it's useful or not. We
should keep it because it's better for us to keep
useless information without needing it, rather than accidentally deleting something
that turned out to be important. Right, And this kind
of makes sense. I mean, I'm sure a lot of
you out there can apply that to your lives. I
certainly can apply it to my life, right Like, I

(18:28):
have file folders that are full of stuff that I'm
never going to touch again, but I still feel reluctant
to delete it just in case I do need to
touch it again sometime in the future, even though the
likelihood of that is very low. So that's anecdotal. I
can't really call that evidence to prove the point, but
it feels like the point is relevant. So this is

(18:50):
also how I play a lot of those big open
world computer RPGs, by the way, things like Skyrim or whatever,
because I'll just hoard potions and scrolls and I never
use them because what if I need it more in
the future? Baldur's Gate 3 has really done a number
on me with this. I got a real problem with
that. Anyway, the Forbes article details several more entries indicating

(19:12):
how very smart people were taking note regarding the accumulation
of information, as well as methods to store the information,
and increasingly, as time went on, how we can do
useful things with all this information. So I recommend you
check out that Forbes article if you want to learn more.
I think it goes up to about twenty twelve. At
this point, it has been updated numerous times, but obviously

(19:34):
twenty twelve was quite a long time ago, so
it's not exactly up to present day. But it's still
a really interesting article that gives lots more details about this.
But I don't want to just regurgitate the article, so
we're going to hop on ahead. Now, folks in general
were becoming more aware of this information challenge that was growing.

(19:55):
But where did the term big data actually come from? Well,
chances are it sort of arose organically in conversations within
the computer sector as, you know, hackers and computer scientists
and programmers and researchers were all wrestling with ways to
deal with data. Now, by this time, folks had adapted
an observation made by Cyril Northcote Parkinson to apply to

(20:19):
computer systems and to information. So Parkinson's original observation was
that generally speaking, in public administration offices, you know, like
government offices, work expands to fill the time that was
allowed for that work. So if you have a project
that's going to be due in three weeks, but really,
if you were to be brutally honest, there's only a

(20:40):
week's worth of work to do for that project. Well,
that work will almost magically expand so that it actually
takes three weeks to complete. This gets more nuanced and
it brings into account elements like bureaucracy. But you get
the point, right, that somehow it doesn't matter, you know,
who is working the job. It doesn't matter the nature

(21:02):
of the work. The work will expand to fill the
amount of time allotted to do that work, which
meant that if you had said it would take two weeks,
it would have just expanded to two weeks, not three.
It's very weird, right? Anyway, folks in the computer biz
adapted this to say that data will expand to fill
whatever space you have available for that data. So, in

(21:23):
other words, you make a bigger storage unit, you're going
to fill it like that data will just expand to
fill that even though you thought, oh, I'm future proofing this,
and again anecdotally, I have observed this in my personal life.
I remember when hard disk drives first became a thing
in personal computers, like they already existed, but personal

(21:44):
computers didn't have them when they first came out, right,
you were using external drives like floppy disks and stuff,
and I remember whenever there would be a dramatic expansion
of storage space, and it always seemed to be dramatic, right,
it always seemed like it had doubled since last time.
And typically that's how it worked. Anyway, I would walk
away thinking, Wow, I'm never gonna fill all this space.

(22:06):
I mean, who even needs that much space? Two hundred
and fifty six megabytes? Who the heck needs that much space?
That's way too much. I mean, I'll never fill it up.
But of course I would prove myself wrong, typically in
record time. But beyond anecdotes, which again don't really count
as evidence, the observation really pointed out that we will
eagerly fill up whatever space we're given. You could argue

(22:29):
this goes back to our tendency to avoid deleting material
out of concern that it might one day become useful. Anyway,
by the mid nineteen nineties, there was a computer scientist
named John Mashey, and he was giving presentations that related
to this concept of big data. Now, Mashey has dismissed
the idea that he personally coined the phrase. At most,

(22:52):
he says that he popularized the term big data in
his talks. But his point was that he used the
phrase big data because it was a shorthand way
to give a nod to several related challenges, ranging from
storage to analysis. So one could argue that Mashey's use
of the term approached what we mean by big data today,
but it wasn't one hundred percent the same thing. And

(23:14):
the earliest use I've seen cited happened sometime around nineteen
ninety eight. So we know Mashey didn't invent the phrase,
and we know that partly because researchers found an instance
that predates his talks by nearly a decade. Steve Lohr
wrote a piece for The New York Times titled The
Origins of Big Data: An Etymological Detective Story. A great,

(23:37):
great article, by the way. Lohr spoke with an associate
librarian at Yale Law School named Fred Shapiro, and Fred
Shapiro did some research and uncovered an instance of the
phrase big data in a nineteen eighty nine article in
Harper's magazine. The author of that piece was Erik Larson,
who said, quote, the keepers of big data say they

(23:58):
do it for the consumer's benefit, but data have a
way of being used for purposes other than originally intended, end quote,
and boy howdy, we have seen that observation play out
again and again, haven't we? It's remarkable because nineteen eighty
nine predates the World Wide Web, certainly predates all the
social networks that we talk about. But Erik Larson's observation

(24:19):
is just as relevant, if not more relevant, today than
it was in nineteen eighty nine. Also, incidentally, Erik Larson
wrote one of my favorite books of all time. It's
titled The Devil in the White City. Famous book. I'm
sure a lot of you have already read it, but
for those who haven't, it's a book that tells two
somewhat intertwined stories, the eighteen ninety three World's Columbian Exposition

(24:43):
in Chicago and the tale behind H. H. Holmes, credited as
one of America's first serial killers. Now, I originally bought
the book because I was interested in Holmes's story, but
I got to be honest, I actually found the chapters
about the exposition to be far more captivating, and it
ties up into a lot of the stuff we talk
about on TechStuff. So it's a great book if

(25:03):
you're looking for something to read. But now let's get
back to Big data. So things continue on their inevitable
path through time. As it goes, time marches on, we
get up to the two thousands. By now the Internet
has greatly exacerbated our data creation and accumulation problem. In
two thousand, Francis Diebold wrote, quote big data refers to

(25:25):
the explosion in the quantity and sometimes quality of available
and potentially relevant data, largely the result of recent and
unprecedented advancements in data recording and storage technology end quote.
So we're really starting to close in at this point
on the concept of big data as we understand it today.

(25:46):
Then we get up to two thousand and five, and
a couple, actually several, important things happened that year in
the realm of big data. We get Tim O'Reilly and
his media company, fittingly enough called O'Reilly Media, and this
is the year that he would publish an article titled
What Is Web two point oh, a famous or perhaps
infamous article in tech circles. So the dot com bubble

(26:09):
had burst several years earlier, around two thousand and two
thousand and one, and O'Reilly was making observations about the
qualities that helped the companies that survived that crash versus
the companies that went under, like what set them apart?
What are some of the qualities that we can say
are really valuable on the web. And part of that
involved how successful web ventures were handling data. Now, that

(26:31):
same year, he had a guy named Roger Magoulas, and
actually I don't know how to say his last name,
but he was also with O'Reilly, and he argued that
big data refers to how we now had the capacity
and the capability to gather and store data sets that
are so large that our traditional business tools are incapable
of doing anything useful with that information. It makes me

(26:56):
think of the Joker in The Dark Knight, where
he says that, like a dog chasing a car, he wouldn't
know what to do if he caught it. That kind
of thing. Yeah, we've got all this information, but the
tools we have aren't sufficient to do anything meaningful with it.
We were overwhelmed with information. But that same year, because
an awful lot happened in two thousand and five in

(27:17):
the big data space, Doug Cutting and Mike Cafarella released
a tool that would really change things. I'll explain more,
but first we're going to take another quick break to
thank our sponsors. Okay, before the break, I teased that

(27:42):
we were going to talk about a tool made by
Doug Cutting and Mike Cafarella that would actually change our
approach to big data and make it possible to do
meaningful things with it. So these two had read papers
about Google's file system as well as a tool that
Google was using called MapReduce. Now, the purpose of

(28:03):
MapReduce is to take large sets of data and
essentially break them down into more manageable chunks, and then
analyze these chunks in parallel, and this makes the process
of data analysis faster. It's really just another form of
parallel processing when you really think about it. Anyway, Cutting
and Cafarella were inspired to make their own tool that

(28:24):
could do similar work, but you know, they can make
it for everybody, and so they created a project called
Hadoop, and the first version of Hadoop would come
out in two thousand and six, and it's an open
source project. It's still around today with thousands of contributors.
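The split-it-up, analyze-in-parallel, then combine idea described above can be sketched with a toy word count in Python. This is only an illustration of the pattern, not Hadoop's or Google's actual API: the function names are made up, the sample lines are invented, and a thread pool on one machine stands in for what would really be many machines in a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(lines, chunk_size):
    """Break the full data set into smaller, more manageable chunks."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(lines, chunk_size=2):
    chunks = split_into_chunks(lines, chunk_size)
    # A thread pool is a stand-in for a cluster (an assumption of this sketch).
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())


# Made-up sample input.
lines = [
    "big data is big",
    "data about data",
    "metadata is data about data",
    "big libraries",
]
totals = word_count(lines)
print(totals.most_common(2))
```

Because each chunk is counted on its own, adding more workers lets the map step handle more data in the same amount of time; only the final merge needs to see all the partial results.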
But the important bit is that we were now starting
to develop new business tools that actually could handle the

(28:48):
massive amounts of information that we were accumulating. But let's
take a quick step back. Let's also consider what's going
on around this same time, the mid to late two thousands,
and by that I mean the first decade of the
two thousands. So for the first several years in the
computer age, it was really computer systems themselves that were
seen as the genesis of data creation, right like it's

(29:11):
the computers are the things making all this info. But
other elements were starting to come into play by this point.
So when we get up to two thousand and seven,
we're into the consumer smartphone era, because that was the
introduction of the Apple iPhone. These consumer smartphones can generate
enormous amounts of information. You can perform all sorts of
computational tasks on them. They can track your location, you

(29:32):
can connect to the internet, et cetera. We also were getting
into the age of the Internet of Things, so we
were starting to create millions of these tiny devices, usually
designed to collect specific bits of information and then zip
that info off to somewhere else. So it might be
a speed sensor along a road. It might be a

(29:52):
thermometer at a weather data collection site. It might be
a thermostat in your own home. It could be anything.
Could be a smart speaker. All of these individual little
components would add to the amount of information we were
gathering and storing and creating, all in the hopes of
being able to do something useful with that info. And

(30:14):
we also had another buzz term that was starting to
gain traction, just as big data was really beginning to
transition from a topic that was talked about in a
relatively small subculture of computer scientists and such into a
topic that the general public had actually heard about. You know,
usually we're a few years behind whatever group is really

(30:36):
focused on the subject matter. So this other buzz term
was cloud computing, which I also got an assignment to
work on right around the same time as big data. Now,
the simplest way to describe cloud computing is that it's
when you use someone else's computer to do your computational
tasks because you log in through your computer, but it's

(30:56):
this other computer that's actually doing the work, or it
might be a network of other computers doing that work.
That work could be that you're storing photos of kitty
cats on a drive on some cloud storage, or it
might be that you're using cloud computing to help you
crunch really big numbers that your computer could not handle

(31:17):
and you're peeling back the mysteries of quantum mechanics or something.
So cloud computing would rise at the same time as
big data, and cloud computing and big data are very
closely related. They're enablers of one another in a way.
Organizations and companies feel the need to engage with cloud
computing services because their data tasks are growing increasingly complex

(31:39):
and voluminous, and it gets harder and harder to handle
all of that on your own. Right, Like most businesses
these days are not using exclusively on premises computing systems
to do all their computation and all their storage. It
just is not practical, right. You would have to continuously

(32:00):
buy or lease more space just to hold all the
systems you would need. So instead they engage with cloud
computing companies that will provide those services for them, and
then the cloud computing companies will go out and build
a warehouse and fill it full of computers. Big data
leans on cloud computing to make it practical to even
accumulate all that data in the first place, let alone

(32:21):
analyze it. Now, the lure of big data, the reason
why we're concerned with it, I mentioned this in the
very beginning of this episode. The lure is that there
are nuggets of truth hiding inside vast amounts of possibly
useless information. There is signal, but there's also an enormous
amount of noise. If we can identify those little nuggets

(32:46):
of truth, then we can potentially benefit from them. But
these huge piles of information are just so vast that
our ability to zero in on the important stuff is
just not up to snuff. It is the proverbial needle
in a haystack problem. So the promise of big
data in our current age is that when we use
the right tools, we can sift through the haystack and

(33:10):
we can find all the needles, which is a really
tempting concept, because who knows what you might find when
you analyze large amounts of information. Maybe you identify patterns
that you can then use to lead you to change
things so that you can save huge sums of money
in the way you do business. Or maybe you identify

(33:31):
a previously unknown opportunity. Or maybe you can spot connections
between data points that you didn't see before and you
start to see correlation. Maybe you even determine causation. Maybe
this leads you to make some incredible scientific progress, and
it might be on anything from medicine to astronomy. It
all depends on the type of data, obviously. However, there's

(33:54):
a big caveat that goes along with this sort of
beautiful concept, and it's possible that the tools we use
will make mistakes, that they're going to spot patterns or
meaning when in reality there isn't anything there. They mistake
something to be meaningful when in fact it's not. This

(34:17):
is kind of like when you look up at the
clouds and you see a pattern that makes you think
of a specific shape, very like a whale, as Hamlet
and Polonius would say. So the shape of the cloud
might remind you of a whale or a dog, or
a hand or whatever, but you probably are aware that
the cloud isn't actually a whale or whatever. In fact,

(34:38):
you might even realize that your point of view is
part of what is shaping your perception. It's part of
the reason why it looks like a whale. Maybe if
you were a mile away to the east or something
and you were to look at that same cloud, the
angle you would be at would mean that the cloud
wouldn't look anything like a whale. Maybe it would look
like something entirely different, or maybe it wouldn't remind you

(35:00):
of anything at all. So, from one perspective, the cloud
shape appears to have some meaning. From other perspectives, it doesn't.
So it would be a mistake to draw any conclusions
based on that one perception, because it would just be
the illusion of meaning, not actual meaning. And that can
happen when you're looking at huge data sets too. You

(35:22):
might see something that looks like it's meaningful, that it
represents a pattern or a connection, when in fact it doesn't.
That can lead you on a wild goose chase, and
in a worst case scenario, you might dedicate a lot
of time and effort and money toward pursuing this perceived
meaning and you only find out much later that there
was nothing there at all. Now that's not to say

(35:43):
that we can't trust the outcomes of big data analysis,
but it does mean that we have to make sure
that we have tests to ensure the validity of those analyses.
We need to take a scientific approach toward big data,
or else we run the risk of chasing a dream
rather than learning more about reality. And anytime there's

(36:03):
any uncertainty, there will be people who move in to
exploit that uncertainty: hucksters, scam artists, snake oil salesmen. So
as an example of this, I would point to the
explosion we are seeing in artificial intelligence right now. AI
has tons of applications, including in the analysis of big data,

(36:23):
and that means that there is also opportunity there to
take advantage of people. So it doesn't take much imagination
to think of a company that actually uses cheap
human labor and passes it off as a truly
AI company and markets that company's services to big
businesses that may not know any better. And really you're
just exploiting people in poorer countries and passing it off

(36:46):
as being this really high tech business. As it stands,
even if you're not doing that, human labor is already
the backbone of the AI industry, like it or not.
People in countries that have low wages and have very
little protection in place for working citizens, they're spending countless
hours tagging data so that AI can actually make use

(37:07):
of it. So as we marvel at how clever AI
tools appear to be, there are folks out there on
the margins who are the ones labeling images and applying
metadata to text so that the AI can grab the
right stuff based upon a query. Anyway, I think it's
important to remember that big data can, with the right tools,

(37:28):
provide us insights that we might not otherwise make because
the amount of information is just too large for us
to handle. Those insights might mean we can do things
like streamline supply chains, or identify a market for a specific product,
or find a new way to treat an illness. Big
data can also lead us to some darker outcomes. Companies
will scrape as much of your personal information as they

(37:49):
possibly can. They will sell it to other companies. These
other companies will market you to yet more companies in
an effort to serve you ads or to lure you
into doing something foolish like downloading malware or consuming misinformation.
Because behind every silver lining is a big, scary cloud.
Maybe it's in the shape of a whale. That is

(38:12):
a brief history of big data. It is a history
that is ongoing. I'm sure we're going to see some incredible,
incredible discoveries thanks to analysis of big data. I'm sure
we're also going to see some pretty scary stuff as
a result of it as well. Such is life. But
it is fascinating to see how we have arrived at

(38:35):
this point, like first from the point of how do
we collect all this information? And then what do we
do with it? I hope you enjoyed this episode. I
hope you are all well, and I will talk to
you again really soon. Tech Stuff is an iHeartRadio production.

(38:57):
For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts,
or wherever you listen to your favorite shows.
