Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Jason (00:00):
Welcome to the DataCafe.
I'm Jason.
Jeremy (00:02):
And I'm Jeremy. And
today we're talking about the
scientific method.
Jason (00:18):
So I thought we'd stop in for a bite today, Jeremy, because I've had it in my mind to talk about the scientific method and what that means for data science and for data scientists. So I'd like to pick your brain about it a bit.
Jeremy (00:32):
Okay, that's interesting, because, I mean, I'm not sure I come from a traditional scientist's background. I'm a mathematician by training, and then I got into sort of a hybrid area of computer science. But you are the scientist, Jason. So I'm interested, you know, almost in rebounding the question and saying: well, what does the scientific method mean to you as a physicist, as an
(00:55):
astrophysicist?
Jason (00:56):
Yeah, yeah, I guess so. We've talked about how the scientific method has come up a couple of times in our jobs, and I think we've even mentioned it in the cafe a couple of times. So as we go through our scientific training, we set up experiments in school and college and fundamentally gather our data and make our observations to prove or
(01:19):
disprove a hypothesis, and this whole enterprise, I guess, falls under the realm of the scientific method. Right, you can draw it in a pipeline, and there are all these pictures online. I don't think it really flows as straightforwardly as that, of course, but if I look at a picture
(01:39):
and say, you know, the main boxes in this process flow, generally or historically: you observe something in the world, and it raises a question. That question is a topic of interest and may form a hypothesis. So you decide to run an experiment, and to do that you need data from the
(02:00):
experiment. The process can gather your data, or you may have some data already; you run your experiment and report on the conclusions. And if there's something interesting that, like, proves what you thought was true, you tell everybody. And then other people repeat that, and if enough people repeat it enough times, confidence is built that your hypothesis, yeah, wasn't a fluke.
Jeremy (02:23):
So there's an interesting set of concepts there, which I think sort of form the essence of a scientific method. It's the observation-based aspect of it: you're taking measurements, often you're getting data, so that fits right into data science straight away. You're constructing a
(02:45):
hypothesis around how a phenomenon or an object works, and you're testing that hypothesis. And so there are lots of, you know, associated techniques that you might have to develop around that. And of course, then, I love the bit that you said at the end: the repeatability. How important it is that what you've done, you've written it
(03:08):
down in such a way, traditionally in a scientific journal of course, but not necessarily these days, that another group can take your data, take your algorithm or take your approach and go: yeah, we were able to do that, and we got exactly the same results, within the margin of error, I guess.
Jason (03:25):
Yeah, exactly. They replicated it in some way. And there's a big push nowadays to make sure that everybody can replicate it. So we have versioning, let's say, so you can say exactly what version of your software you used, what got you this result. And it allows for things to advance but still be recorded historically.
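As a minimal sketch of that versioning point (not anything used in the episode), the snippet below stores a result together with the Python and package versions that produced it, so a later run can be checked against the same software. The package names, result fields, and file path are illustrative assumptions.

```python
# Sketch (illustrative, not from the episode): store a result together with the exact
# software versions that produced it, so the run can be replicated and checked later.
import json
import platform
from importlib.metadata import version

def record_result_with_environment(result: dict, path: str = "result_with_env.json") -> None:
    """Write the result plus the interpreter and package versions that produced it."""
    environment = {
        "python": platform.python_version(),
        # Assumed package list: swap in whatever your experiment actually depends on.
        "packages": {pkg: version(pkg) for pkg in ("numpy", "pandas", "scikit-learn")},
    }
    with open(path, "w") as f:
        json.dump({"result": result, "environment": environment}, f, indent=2)

record_result_with_environment({"accuracy": 0.91})
```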
But what is on my mind is that in data science, and we talked
(03:49):
about the complementary team before, you bring together a lot of different scientists, and they will have fundamentally used the scientific method in their field. But in data science, we kind of have this funny world where there's now a wealth of data being gathered in various ways that may not even have been set up for a specific purpose.
(04:14):
You know, you could have so much data from Twitter, let's say, and everything that's going on on Twitter, and it gets really difficult then to mine it in the way that you'd set up for a hypothesis, to set up a construct where your data is clean, or representative, or the right sample. You know what I mean?
Jeremy (04:38):
Yeah, well, sort of, but I'm going to throw it back to you and say: well, it's not like the sun was set up to be, you know, mined and analysed and, you know, investigated and picked apart, was it? So these things are not there for our convenience. In that sense, it's quite realistic, I think.
Jason (04:58):
Yeah. And the world is messy, right?
Jeremy (05:01):
Yeah, right. I think, ask any data scientist, and they'll tell you lots of stories about exactly how messy the world is.
Jason (05:07):
Yeah. And I mean, to put it in perspective: when we say something about, like, sending satellites up to observe the sun, there's so much riding on that that we set it up right. We put a lot of investment into: well, what is the exact hypothesis? What is the exact experiment? What is the exact data? Because it's really expensive if you get any of that wrong. Whereas in the world of Twitter, when I
(05:30):
gave that example, there isn't any of that; it's just the characters being spewed out by everybody all over the world using that forum. And I just wonder: is there something to the scientific method that we need to make sure is maintained when you take any sort of off-the-shelf model, or build a piece
(05:53):
of software that fundamentally has assumptions in it, fundamentally makes a hypothesis, and has a business decision, maybe, at the end of it? There needs to be a scientific rigour that comes with forming the hypothesis, and that's what the scientific method can allow.
Jeremy (06:13):
Yeah, for me it was always about the sort of test and learn, right? I've got a phenomenon. Well, it was rarely a phenomenon, I have to say; in computer science it tended to be, I've got a white box, or for my particular brand of computer science a black box, and I'm applying it to
(06:33):
a particular data set, or a particular environment, and I wanted to understand: how is my algorithm impacted by, and how does it react to, that data, that environment? So I might be trying to construct a ranked measure of which web page is really important, or something like that. And there are loads of parameters involved in this, and
(06:56):
some are significant and some are not. And so what I would tend to do is say: right, well, let's tweak a parameter and see what impact it has. And then there'll be a measure at the other end of the process which looks at the output and goes: this has had a dramatic impact, you know, I've seen a significant improvement because of that tweak in the parameter. That's the assumption. From a data science
(07:18):
perspective, this transfers really well, because you've got this idea of keeping the vast body of your setup the same, changing one thing and, as much as possible, leaving everything else the same, and seeing what the impact is of that one change. And ideally being able to infer (this is a bit of an assumption) that the change you then saw was
(07:42):
a direct result, of course, of the change that you made. And what do you get from that? Well, you get the learning: you get to create a hypothesis about your environment, or about your algorithm, or about your problem domain and the decision that's being impacted. Which then can lead to more experiments, more tweaking, and testing and learning and all of that sort of thing. So I think what we're fast getting towards is that you start to get a nice, looping, iterative process.
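To make that test-and-learn loop concrete, here is a minimal sketch in the spirit of the web-page-ranking example: a toy graph and parameter values invented for illustration, not anything from the episode. Everything is held fixed except one parameter, the damping factor, and the impact on the ranking is read off at the other end.

```python
# Sketch (illustrative, not the episode's code): hold a toy web graph fixed and tweak
# one parameter, the damping factor, to see how the ranking of "important" pages responds.
import numpy as np

# Tiny made-up web graph: links[i, j] = 1 if page i links to page j.
links = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

def pagerank(adj: np.ndarray, damping: float, iters: int = 100) -> np.ndarray:
    """Power-iteration PageRank over a row-normalised copy of the adjacency matrix."""
    n = adj.shape[0]
    out_degree = adj.sum(axis=1, keepdims=True)
    transition = np.where(out_degree > 0, adj / np.maximum(out_degree, 1), 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1 - damping) / n + damping * rank @ transition
    return rank

for damping in (0.5, 0.85, 0.95):   # the single parameter being tweaked
    ranks = pagerank(links, damping)
    print(f"damping={damping}: order of importance = {np.argsort(-ranks)}, "
          f"scores = {np.round(ranks, 3)}")
```

Because only the damping factor changes between runs, any shift in the ordering can reasonably be attributed to that one tweak, which is exactly the inference Jeremy flags as "a bit of an assumption".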
Jason (08:15):
Yeah, which is really important, because even when you draw the scientific method, you can draw it as a loop and say: you come out of one experiment with some conclusion, and that's fine, we can replicate it, and that's proven. But we also have 10 new questions, and we want to run 10 new experiments with 10 new versions of data, you know, and go through that test and learn
(08:36):
that you're talking about.
Jeremy (08:37):
And there's the problem, or, for me, maybe, for many years, the excitement, right, which is: oh wow, I started with something I thought was really straightforward, and now I've got, you know, 10 possible data sets and 10 possible questions on each of those data sets, so I've got 100 investigations to carry out. And that, I think, in a scientific environment,
(09:00):
especially an academic environment, yes, that's great; that's just grist to the mill. That's exactly what you're expecting: you can divvy that up amongst your PhD students or graduate students, and you can have a plan for how you're going to tackle investigating and prioritising those over the next three years. Right? Yeah. But that doesn't quite work in a commercial setting and in data science.
Jason (09:23):
Yeah. And this is where I think we need, at the core of our efforts, the scientific method to be understood, so that we follow the correct process of forming our hypothesis and knowing that our experiment is valid, and the data is valid, for the setup that we have, I think. Going
(09:43):
into it, we also have, at the outset, to know the decision, to know the impact, to know the cost, where the line gets drawn. How much do we need to verify our assumptions? How much does our hypothesis need to be proven before the impact is
(10:05):
realised, before that decision is taken? Because your business model, you know, depending on the context, might hinge upon it. It might be something subtle, or it might be a massive change in your business operations.
Jeremy (10:20):
I think the change I noticed when I started working in industry, a few years ago now, was how important it was to get a really salient and to-the-point hypothesis out of the box, right. Yeah, to get the exact concept that you were trying to test, trying to prove, the one that, if you did show it
(10:43):
to be true or not true, would enable you to access the decision, access the impacts that your data science algorithms are there for. Exactly. Because what you didn't want was a sequence of interesting investigations which, in the end, gave you answers to some questions, but those questions didn't really help you with...
Jason (11:04):
Right. I think that's where our stakeholder management is so important, because when we set out to run one of these scientific experiments on the data set that we have, and to answer a question, it's exactly what you just said: we need to bring that back to the stakeholder's version of the question. What is it that they're going to actually say is
(11:27):
the hypothesis that they had, that was then interpreted in the experiment setup that we adopted, the model that we may have built? And I think that's where it's important for us, as scientists, to bring them into that way of thinking. And that's possibly where I see the opportunity: a disconnect that needs to be seized upon, you know.
Jeremy (11:48):
I see that in many projects where I have, you know, well-meaning stakeholders approaching the team, you know, at a critical point in the project, and they will show this disconnect very straightforwardly by just saying: how long is that going to take? Of course, yeah. And,
(12:10):
yeah, as a data scientist, and as a scientist, that instantly causes sort of flags to go up and alarms to ring, because I'm going: yeah, it's an investigation, I can't really give you a "how long". Which, of course, if they're used to doing some kind of Gantt-chart-based project plan, then that
(12:34):
instantly causes a bit of a problem. So I think there are some really interesting and nice modifications, then, that come from that uncertainty at the heart of this data science and scientific process, which is to say: I don't know. If I knew the answer, I wouldn't have to do all this fancy stuff; you
(12:56):
wouldn't be paying a bunch of scientists to do this investigation if I knew the answer. The whole point is, we don't know the answer, and we're going to have to work out what that answer is before we can really progress this to the next stage of the project.
Jason (13:10):
Yeah. And it raises the question, when they're asking how long it will take: were we asking the right question in the first place? Because what you should be asking is: do you have an answer yet? Or, what is it that we need to do to get confidence about it? Is it that we need more data, that we need
(13:32):
more resources, that we need a new hypothesis to be visited, maybe with a new set of stakeholders?
Jeremy (13:40):
Yeah, and sometimes you get lucky. Sometimes you can say: look, we can get you a certain quality of answer, and, you know, you tell us how long we've got, almost. So if you've got a couple of days, or a week, I can get you something. It depends on the problem. If it's like an investigation where you're sort of slowly refining your answer, getting better and better, you can say: well, I can get it to this good, and I can
(14:00):
probably even give you an approximation of how close this good is to what you want for your, you know, statistical matching algorithm or something. But on the other hand, if it's an investigation where you don't even know whether there is an answer (is there a signal which means that I can identify a cancer in a
(14:22):
medical scan or a medical image?), and I just don't know whether I can see it, then answering the question of how long it will take is much harder.
Jason (14:30):
Yeah. So I think, at the core of this, we need to make sure that scientists can adhere to the scientific method, but there is a fundamental outline of what the question is, upfront, that needs to be agreed, and a need to compartmentalise how much time is allowed for pure research, or formation of the hypothesis, or gathering the data. Yeah, there are frameworks
(14:52):
to do that, you know; we can have a whole discussion around that another time. But it's that iterative process.
Jeremy (14:59):
I know there are companies who do just investigative epics at the beginning of their projects, where they say: this epic, yeah, it's a two-week sprint, or a four-week sprint, whatever, and we're just going to be doing investigation. That's definitely one way of doing it. But I think, for me, the other thing, which I didn't mention earlier, was that, as a team, when you're setting these
(15:21):
hypotheses, you should absolutely carry those hypotheses through to the way that you set yourselves tasks. And so your hypotheses should really... I like my hypotheses to be questions, because I like them to have a yes or no answer, and for either of those answers to be a valid possibility. Whereas in
(15:42):
traditional project management, typically you'll have a block of time that's associated to "put the roof on the house", and if the answer to that is "well, we couldn't", that's not really acceptable. But in data science, it's absolutely fine to go: can we find a nice, you know, customer preference metric which tells me what film
(16:05):
an individual is going to enjoy watching? And the answer may be no, not in a way that's useful to you right now. In which case you need to reform the question; you need to think about another way of presenting the problem. You know, there are lots of ways of rebounding that and going: okay, well, we still need to make progress, but maybe we're
(16:25):
going to make progress down another arm.
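Here is a minimal sketch of what a hypothesis phrased as a yes or no question can look like in practice, assuming a made-up evaluation in which a new film-preference metric is compared head-to-head against a baseline. The numbers, the significance threshold, and the use of a binomial test are illustrative assumptions, and either answer is treated as a valid outcome.

```python
# Sketch (illustrative assumption, not the episode's work): a hypothesis framed as a
# yes/no question, "does the new preference metric beat the baseline more often than
# chance?", where both answers are acceptable outcomes.
from scipy.stats import binomtest

# Made-up evaluation: in 200 head-to-head comparisons, the film recommended by the
# new metric was the one the viewer actually finished watching 118 times.
wins, trials = 118, 200
result = binomtest(wins, trials, p=0.5, alternative="greater")

answer = "yes" if result.pvalue < 0.05 else "no, not yet"
print(f"Is the new metric better than the baseline? {answer} (p = {result.pvalue:.3f})")
```

A "no" here is not a failed task; it is the signal Jeremy describes to reform the question or present the problem another way.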
Jason (16:27):
I think, to summarise, just off the back of your analogy there: it's a case of, we were putting a roof on a house, but this might be a house we've never built before, and a roof in a format we've never used before. So our old timelines don't hold.
Jeremy (16:42):
Using materials that no
one's ever used before. Exactly.
Exactly.
Jason (16:49):
Thanks for chatting this
out with me, Jeremy.
Jeremy (16:51):
No worries, Jason, that
was fun!
Jason (16:54):
Thanks for joining us today at the DataCafe. You can like and review this on iTunes or your preferred podcast provider. Or, if you'd like to get in touch, you can email us at Jason at datacafe.uk or Jeremy at datacafe.uk, or on Twitter at datacafepodcast. We'd love to hear your suggestions for future episodes.