Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Narrator (00:07):
You're listening to the Assurance Show.
The podcast for performance auditors and internal auditors that focuses on data and risk.
Your hosts are Conor McGarrity and Yusuf Moolla.
Yusuf (00:20):
So Conor, today I want to quickly talk about some recent experiences that we've had in the use of data, and that relates to finding the easy way out.
That sounds a bit wrong. Obviously we don't necessarily want to find the easy way out.
But we do want to make sure that we're not doing work that we shouldn't need to do.
Conor (00:35):
I'm intrigued by this.
So I want to hear the examples you're going to use, because you haven't briefed me on them.
Yusuf (00:41):
Okay.
Good stuff.
So the first one is, quite often, open data, but also internal data that we might use.
We find data in things like annual reports, company presentations; in the public sector it would be in financial statements, budget papers, and the like.
And sometimes, in our excitement or haste to grab that data
(01:03):
and use it, we go to that unstructured source and then do all sorts of weird and wonderful work in order to pull it out.
So we can take a PDF, we can extract tables from it, and then do detailed work to format those tables.
We had a case like that a few months ago, and we're actually seeing another one.
We were working with a client who had taken a PDF and spent about a week cleansing that data and coming up with
(01:25):
a structured dataset to use.
And one of the first questions that we asked was: is this the only source of data that we have? Because data that flows into a PDF or report or the like potentially would have come from a structured source.
And it turns out that the data came from a database that we could have extracted from directly and used
(01:46):
in its original form.
So that creates extra work. It's not just the extra effort; we then also found errors in the conversion from unstructured form to structured form.
Conor (01:57):
So instead of just using this unstructured information that you have, can you ask around within your own organization about how it got there in the first instance, to try and shortcut that process, so that we don't end up in the situation halfway in where we find out there was a database or something that existed from the outset?
Yusuf (02:13):
That's right.
Back in the day we learned to go and ask that question through trial and error.
You learn as you go that there's a better way to do things.
But yeah, so that example is about efficiency, but also about quality.
Don't go and do that work.
So when you're faced with a PDF of an annual report or a performance statement or budget statement, the first question is not "how can I take the data out of this for my purpose?"
(02:36):
The first question is "where does this data come from?" And maybe I can source that data directly.
Now, sometimes you can't: you just don't have access to the individuals, or it may be that somebody created it manually.
But the first port of call is: how can I do this in a simple way?
So that is taking the easy way out in that case.
But it's also the better way, because it produces better quality and efficiency.
Conor (02:57):
And it minimizes
your risk of getting things
wrong in translation.
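To put that first tip in concrete terms, here is a minimal Python sketch of the two paths. The file name, page number, column names and database are all hypothetical, and pdfplumber is just one example of a PDF table-extraction library; the point is that the direct query is both less work and less error-prone than reconstructing a table from the published PDF.

```python
import sqlite3             # stand-in for whichever database the report was built from

import pandas as pd
import pdfplumber          # example PDF table-extraction library


# The hard way: scrape and reformat a table from the published PDF.
# Page positions, headers and number formats all need manual clean-up,
# and every step is a chance to introduce conversion errors.
def from_pdf(path="annual_report.pdf"):           # hypothetical file
    with pdfplumber.open(path) as pdf:
        rows = pdf.pages[12].extract_table()      # assumes the table sits on page 13
    df = pd.DataFrame(rows[1:], columns=rows[0])
    df["amount"] = pd.to_numeric(df["amount"].str.replace(",", ""), errors="coerce")
    return df


# The "easy way out": ask where the PDF's numbers came from and query that source directly.
def from_source(db="finance.db"):                 # hypothetical database
    with sqlite3.connect(db) as conn:
        return pd.read_sql("SELECT program, year, amount FROM budget_lines", conn)
```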
Yusuf (03:01):
Yeah.
So the second example is where we're taking structured data and going through a cleansing process. We've probably spoken about this before, but we all know that it usually takes a long, long time to cleanse data.
And despite what certain visualization tool providers might promise, it usually does take quite a bit of time to actually cleanse data, standardize it
(03:21):
and format it for use.
And if you don't go through that process, you're usually going to end up with poor results, or at least you won't have confidence in the results that you have.
However, when you're going through the cleansing process, the question you need to continuously ask yourself is: am I actually going to use this data field? Or am I going to use this data source?
Because often you end up with large sets of data.
(03:44):
And when I say large, I mean wide, in terms of the number of fields or the number of columns that you have.
And the automatic data analyst approach would be to cleanse everything they have, so that any potential question that might come up will be easy to answer given the cleansed data they now have.
And that is a little bit of an inefficient approach.
So we need to think about the key fields that
(04:05):
we're going to be using, and then cleanse those first.
If you're using a repeatable analysis process, if you're using a workflow where it's quite easy to go back into that workflow and make changes that flow downstream, it's better to start small.
Start with the data that you know you're really going to need and really going to use; cleanse that first
(04:27):
and then come back later if you need to cleanse additional fields, fields that may answer questions that come up later on or where you may need to delve into some of the details.
So that's something, again, that we learned over the years.
Initially, when you're starting your sort of data journey, you just want to grab everything and cleanse everything and make sure you have this perfect, 100% dataset.
(04:47):
The better thing to do is identify the specific fields that you are going to be using and focus on cleansing those.
And then if you need something later on down the track, come back to it.
So it's an iterative process.
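As a rough pandas sketch of that iterative approach (the file and column names are hypothetical): load and cleanse only the fields you know you need, and come back for the rest only if a later question actually demands it.

```python
import pandas as pd

# Hypothetical wide extract with dozens of columns; load only the key fields.
KEY_FIELDS = ["claim_id", "lodged_date", "status", "amount"]

df = pd.read_csv("claims_extract.csv", usecols=KEY_FIELDS)

# Cleanse just those fields: types, dates, obvious standardization.
df["lodged_date"] = pd.to_datetime(df["lodged_date"], errors="coerce")
df["status"] = df["status"].str.strip().str.lower()
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Later, if a new question needs another field, add it then rather than upfront:
# extra = pd.read_csv("claims_extract.csv", usecols=["claim_id", "region"])
# df = df.merge(extra, on="claim_id", how="left")
```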
Conor (04:59):
That ability to identify the fields that you will initially be using: is that dependent on having a clear objective for your audit as to what you're trying to achieve, as opposed to an exploratory process where you just want to look at everything you have?
Yusuf (05:14):
We always advocate for both: hypothesis focused, but also exploratory. Even in the exploratory approach, you don't have to have everything a hundred percent clean.
The hypothesis based approach is pretty straightforward: you just work backwards from the question you're trying to answer and get to the fields that you need.
But even with exploratory work, you will intuitively know what you need, right?
(05:34):
There are certain fields that you just need: certain dates and IDs and text fields and things that you will just need straight up.
So you cleanse those to the best level that you can.
And then as you explore, you iteratively cleanse for that exploration.
So you're exploring certain fields; there's a question that you have that has gone unanswered, that you only have a partial answer to, and you think, you know what, there's actually value
(05:56):
in cleansing that field further to be able to answer this exploratory question that I have.
But again, that's coming back to what you need, as opposed to trying to do it all upfront.
Conor (06:06):
Yeah.
One of the risks, which has a positive side and a negative side, is curiosity: data analysts like to explore a lot.
But they need to step back sometimes and ask, what is the answer I'm trying to come up with here?
Or am I adding any value by doing this further exploration for this particular purpose?
Yusuf (06:22):
It's an ongoing battle, right?
We've been doing this for a long time, but when I get new data, I automatically want to go and just fix everything before I jump into it.
You have to stop yourself, resist that urge and say: no, stop. What do you need? Cleanse what you need.
And then if I need more, I'll come back and fix it later.
Conor (06:42):
I think you're
still working on your
discipline on that.
Yusuf (06:44):
I think I am, yeah.
And I don't think I'll stop, right?
Because it's ongoing learning, and it's a nice kind of learning that sort of never ends.
So the third one is particularly important when you're bringing data together. This happens often when you're looking at data over multiple years.
Let's say you're bringing 10 or 15 or 20 years' worth of data together.
Sometimes the data from 20 years ago, not sometimes, often,
(07:06):
almost always, the data from 20 years ago will look different to the data from yesterday.
When you have a situation like that, first explore a smaller set.
So it could be one year, two years, maybe even three years; it just depends on your circumstance, but don't try to grab all 20 and cleanse all 20 the first time around.
If you are going to be looking at some sort of longitudinal analysis, start small; start with the first year of data.
(07:29):
What do I have? Maybe add a couple of years to that, and then cleanse that specific set.
And then later on, you can bring other data in.
One of the risks that you have with that is there may be some data that existed 20 years ago that doesn't exist anymore, that you may need to adjust for. But then you need to ask yourself: if I don't have a current view of that, am I really going to use that field?
So you may have 10 fields in your current dataset
(07:50):
and 20 fields in an old dataset. Those other 10 fields, if I don't have any contemporary information on them, do I really need them?
What am I going to do with them? What comparison am I going to be able to make?
The best thing you might be able to come up with as a result is that we're not collecting that data anymore.
It could have been useful. It should be useful. I mean, there's value in that, but really the incremental value of that is significantly
(08:11):
outweighed by the level of effort you're going to need to put in to get the data cleansed.
So start small, start closer to the time that you're in at the moment, and then work backwards from there.
With most workflow-style tools, et cetera, it doesn't really matter how much data you have, except when you're trying to deal with all of the anomalies, all of the exceptions that may exist.
So anybody working with data will know:
(08:34):
the more data you bring in, the more potential you have for having to fix up some of those edge cases.
That specific missing field, or this one year in which we didn't have this particular column, ends up taking a lot of time.
And so you want to try to avoid that.
So that's the third simple step to make your analysis process easier, more efficient, and less frustrating.
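A minimal sketch of that "start small" idea, assuming hypothetical one-file-per-year extracts: combine only the most recent couple of years first, and before investing effort in the 20-year-old files, check which of their columns even have a contemporary counterpart.

```python
from pathlib import Path

import pandas as pd

# Hypothetical layout: one extract per year, e.g. data/claims_2023.csv, data/claims_2022.csv, ...
DATA_DIR = Path("data")

def load_years(years):
    """Load and combine only the requested years of data."""
    frames = [pd.read_csv(DATA_DIR / f"claims_{y}.csv") for y in years]
    return pd.concat(frames, ignore_index=True)

# Start small: the two most recent years, cleansed first.
recent = load_years([2023, 2022])

# Before spending a week on the oldest files, see which of their columns
# have no contemporary counterpart - those are candidates to drop, not cleanse.
old = pd.read_csv(DATA_DIR / "claims_2004.csv", nrows=100)   # a small peek is enough
legacy_only = set(old.columns) - set(recent.columns)
print("Fields with no current equivalent:", sorted(legacy_only))
```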
Conor (08:55):
And taking those tips on board will save you a lot of time in the long run.
Yusuf (09:00):
Summing up.
The first thing is: if the data is unstructured, first look for the structured equivalent.
The second one is: don't try to cleanse all of the data fields that you have straight up; make it an iterative process.
And the third one is: try to start with a smaller set of data, if you can, before expanding into the full volume of data that you might have over years or over departments or whatever.
Conor (09:22):
Thanks Yusuf.
Yusuf (09:22):
Thanks Conor.
Narrator (09:23):
If you enjoyed
this podcast, please share
with a friend and rate us in your podcast app.
For immediate notificationof new episodes, you can
subscribe at assuranceshow.com.
The link is in the show notes.