Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Andrew Clark (00:01):
One of the things that people are now talking about is, "Okay, we can't use data now that's on the internet because so much of it is generated, so can we make fake data to train our models off of instead of real data?" There are a lot of issues there that aren't helping either, because you're now creating...
(00:22):
One of the key things to highlight with synthetic data is, although there are academic methods - and we can get into them - on how you can best replicate data, synthetic data is not 'real data,' and one of the key things you miss is the interplay, the interconnectedness between inputs to a model, which is very important. You can't replicate those, even with the most in-depth methods, at a generalized level - we can get into what that means a little bit - but it has a problem: a lack of diversity of
(00:44):
language, and bias issues, and a lot of things there. I'll let Sid add some more from the LLM-specific space, but there's a major disconnect here. And just moving to synthetic data versus training from open source data - neither one is actually good for model training. You need curated data.
Debra J Farber (00:59):
Welcome everyone to The Shifting Privacy Left podcast. I'm your host and resident privacy guru, Debra J Farber. Today I'm delighted to welcome my next two guests: Dr. Andrew Clark and Sid Mangalik, co-hosts of The AI Fundamentalists Podcast.
(01:21):
Dr. Andrew Clark is Monitaur's Co-founder and Chief Technology Officer. A trusted domain expert on the topic of machine learning auditing and assurance, Andrew built and deployed machine learning auditing solutions at Capital One. He also contributed to ML auditing standards at organizations like ISACA and the UK's ICO.
(01:44):
Andrew co-founded Monitaur to lead the way in end-to-end AI and machine learning assurance, enabling independently verifiable proof that AI/ML systems are functioning responsibly and as expected. Before Monitaur, Andrew served as an Economist and Modeling Advisor for several prominent cryptoeconomic projects while at BlockScience.
(02:08):
Sid Mangalik is a Research Scientist at Monitaur, and he's working towards his PhD at Stony Brook University, with a focus on Natural Language Processing and Computational Social Sciences. Welcome, Andrew and Sid. I'm so excited that you're here.
Andrew Clark (02:20):
Thanks, really excited to be here as well, and thank you so much for reaching out. Looks like a fantastic podcast and audience, and we're really excited to be here.
Debra J Farber (02:28):
Great. I'm excited for you to be here as well. So, I was searching for podcasts that focused on privacy enhancing technologies on this website called Podchaser - it allows you to search across all podcasts everywhere - and I came across episode five of The AI Fundamentalists Podcast, which focused on the pros and cons of using synthetic data for
(02:50):
ML/AI. Now, I've been following how synthetic data can be used for privacy use cases, but what struck me about this episode was how you underscored that synthetic data may not be all that useful for machine learning and AI purposes, except for a few key scenarios. I found the discussion really riveting. You guys are just a fountain of knowledge, and I thought it
(03:12):
would be helpful to my audience of privacy engineers to get a deep dive on exactly what these limitations are. So, to start off, Andrew, why don't you tell us what motivated you to found Monitaur and focus on AI governance? And, I guess, tell us a little bit about Monitaur as well?
Andrew Clark (03:29):
Definitely, thank you. Monitaur is, as you described earlier at a high level, a machine learning assurance and governance company. So, I have an undergrad in accounting and was working in the accounting field, and I was realizing accounting is a great career, but it's really not for me. So, I taught myself how to program in Python, started taking extra statistics and economics classes, and started going over to the more technical side of the house.
(03:50):
My first job was as an IT auditor. Sarbanes-Oxley is one of the compliance regulations that came out of Enron and some of those accounting disasters - we call it "the full employment act for accountants and lawyers." So there's lots of rote work you have to do in IT auditing and things of that nature. So, I started figuring out how I could start automating these processes.
(04:10):
Going to continuous auditing was a big thing at the time: where can we use machine learning and data analytics to try and enhance the audit process? When I started doing that deep dive, I got a Master's Degree in Data Science and just kept getting more interested in this field and learning more, and I started looking at how we traditionally audit and evaluate systems - such as getting a bank reconciliation and making sure that 2+2=4
(04:33):
if you're auditing the finances of a company, for example. In IT auditing we also check who has access to things; it's very black-and-white type scenarios. Well, as I kept digging deeper, and specifically as machine learning was coming on the scene, how model auditing and assurance is usually done, or historically has been done, has been more aggregated and high-level - looking at residual plots
(04:54):
and things like that - not the granularity and re-performance that traditional auditing does for financial auditing or technical auditing. So, I started working with ISACA on guidance like, hey, with machine learning models, it's less about (and this is something we can get into as well) - it's less about the type of model. Machine learning versus statistical modeling versus deep neural networks - any of those things, that's a
(05:16):
paradigm of modeling. The key difference is, with the rise of big data and machine learning becoming popular, the real reason we want to increase that assurance is that these are being used in ways that affect end users. Traditionally - with the techniques I mentioned that were originally developed, looking at residual plots and things like that - it was for things like liquidity measurements for a bank, or forecasting, those sorts of
(05:37):
requirements that modeling had traditionally been used for. There's a very different level of assurance that you need there than if you are making decisions about Debra, Andrew, Sid. You want to be making sure that you're not biased, that you're fair. Those are very specific things, and you can re-perform and understand how that system is working. So, I really started getting excited about that and digging in, did presentations, worked with ISACA, worked with Capital
(05:59):
One - worked over there on some of their systems for a while, setting up that audit process - and really kept digging deeper into... there's like a major gap here on how we handle AI systems, which we're going to use as the umbrella term: all modeling can kind of go under the umbrella of AI. AI is really... it's the field of trying to have automated computer systems that replicate human activity. That's very broad.
(06:19):
You can even have RPA and some other things underneath there. But for any of these AI systems that are becoming more prominent, being used for decisions about Andrew, Debra and Sid, we want to make sure that that level of governance and assurance is higher. So, that's what really prompted the founding of Monitaur. We started very machine learning-esque, and we've expanded now to really be solid model governance, really focusing on what are those fundamentals you need to have
(06:41):
correct so that we can be confident that our systems are performing as expected. That's where it really goes in well with privacy, and maybe those specific attributes of: how can we as a society be comfortable that insurance underwriting, healthcare diagnoses, all of these things are now being done by machines? How can we get comfortable with that? So the genesis of Monitaur is providing really that 'batteries
(07:01):
included' software package that can help through those parts of your journey, enforcing best practices.
Debra J Farber (07:08):
That's pretty awesome and, my goodness, do we need that. Right? I mean, is there a framework even that you're incorporating, or you're kind of coming up with best practices based on just your experience and knowledge of how machine learning works and is deployed today?
Andrew Clark (07:24):
That's a great question. So, short answer (07:25): we've developed our own internal framework, looking at all these different resources - systems engineering is a big influence - and at Sid's and my work doing this with multiple companies for a long time. There are also a lot of frameworks out there that we reference and map to. The thing is, there's not a definitive framework.
(07:45):
In privacy or cybersecurity, for example, there's the NIST Cybersecurity Framework that everybody kind of rallies around; in AI and ML, there's not that standard thing. NIST came out with an AI Risk Management Framework, which we (Sid and I) were both contributors to through that whole process, but nobody's really said you must use that. So, there are frameworks that exist, but they're very high-
(08:05):
level and conceptual, where we try to be a little bit more targeted. But that's one of the issues with AI and ML right now: there's not that definitive framework everybody must follow, which is why people are talking about regulation. The EU AI Act is probably going to be that first mover in that space.
Debra J Farber (08:17):
Yeah, that makes sense. Thank you. Okay, Sid, tell us about your career path. Why did you choose to focus in this space, and how did you end up at Monitaur?
Sid Mangalik (08:28):
Yeah, so I've followed a very traditional path, you could say. I was always a computer scientist at heart. Going back maybe 10 or 11 years, I found that I was really passionate about AI and using these types of systems to solve human problems and understand humans. So, I naturally gravitated towards natural language processing, understanding how language exists as a medium
(08:50):
between humans and computer systems, and how we can understand each other in some sense. And then, I think later on, I discovered that what I really wanted to do was this 'AI for good' and 'AI for increasing and improving human outcomes' and generally making people happier. And so, I started off my career working at Capital One as a
(09:11):
data engineer, doing classic pre-pipeline work, pre-AI work. That's where I met Andrew, and we hit it off really well, and he started Monitaur and he said, "We need a data scientist. Are you ready to hop on board?" and I couldn't say yes any faster, because I was so excited to do data science work and AI work.
(09:31):
About a year into that, I decided, "It's time for me to enter a PhD. I think I really need to become a subject matter expert in this. I really want to pursue my passion, and I really want to do this type of AI in the field of making human lives better." The research I work on in my PhD is on mental health outcomes across the US using NLP systems. So, we do a lot of work for The National Institute of Mental
(09:54):
Health and the NIH at large. And it just feels like a really great fit, because we are focused on making safe, unbiased, and fair AI models available for the general public that's going to have to interface with these models and is going to have to live in this new AI world.
Debra J Farber (10:12):
Awesome, so thanks for that. Let's talk about your new podcast that you brought to life, called The AI Fundamentalists Podcast. What motivated you to start it, and who's your intended audience?
Andrew Clark (10:25):
So, The AI Fundamentalists - I forget exactly the genesis of it. I think Sid and I were just talking about things happening in this space. Specifically, we're both passionate about the current state of LLMs, and not necessarily in the way that the media is portraying them. Something like, "Hey, there's a major gap of people speaking the truth on how these systems actually work."
(10:46):
I think one of the things that Sid was talking to me about was that a lot of his fellow researchers in NLP are not as happy or as confident in these systems as what the media is portraying, and there's just a major gap. And our whole focus is: how do we do best practices? How do we build systems properly? We saw a major space that needed to be filled. Let's go back to first principles. Let's think about how you should build AI systems that are
(11:08):
safe, fair, performant. What are those building blocks? How do we start helping practitioners that wanna be better, that wanna be that 1% better every day, that wanna be professionals? How can we help them kind of cut through the noise and learn what the key building blocks are? Because if you're building a generative AI LLM system, or if you're building that simple model that's gonna predict
(11:28):
pop-tart usage at your company, the fundamentals at the lowest level are the same. Understand what you're building and why. Understand the data, understand the modeling process. How are you gonna evaluate that this is doing what it should be doing? All of those different aspects are very much the same across the board, and they're also not new. There's lots of discussion around, like, "Oh, all this new AI stuff" - it's not new at all.
(11:48):
If you listen to The AI Fundamentalists Podcast, we keep talking about NASA and the Apollo program all the time, because that's where the genesis of a lot of these larger-scale modeling systems and the really in-depth systems engineering approach is, and it really goes back farther than that even. So, we really are bringing that fundamentalist mindset for practitioners that don't believe the hype and wanna be trying to
(12:10):
improve and make safe, responsible systems. Sid, is there anything you'd like to add to that?
Sid Mangalik (12:14):
Yeah, I think that's great. I mean, to me, The AI Fundamentalists is about two people that love AI, wanting to talk about AI and wanting it to exist in the real world. And things like hype, and big investing, and excited people that like buzzwords and don't understand how data science and
(12:35):
AI work taking over the field - that is not gonna be healthy for our field, for building trust in our field, and for getting it out there and using it to solve problems. So, it's really for us to talk to other data scientists and people who work in this field to think about doing AI the right way, which is oftentimes the hard way.
Debra J Farber (12:53):
Yeah, usually the hard way. It makes sense, especially if you're gonna go make testable systems and you gotta kind of plan that stuff out from the beginning, from the design phase, I would think. Thank you for that. I know that today we're going to be centering our conversation around synthetic data. Before we get started on that, do you mind describing or defining what synthetic data is?
(13:14):
I realize this might vary according to industry or academia, but what are some working definitions?
Sid Mangalik (13:21):
Yeah. So I think a very simple definition of synthetic data is: we want to create data which looks as close to original source data as possible. Insofar as, if we have some sample data from customers and we only have a hundred customers, but we would like to supplement that data to tell a more full and rich story, we
(13:43):
need to sometimes create synthetic data sets that can augment the original data set, or act as a proxy or clone for the original data set. Synthetic data does not mean that you pull in a random generator or your dice or your Ouija board and remake the data totally at random. You wanna make data that looks like the original data. In terms of industry versus academia, industry's gonna be more
(14:06):
interested in making data that is essentially a carbon copy of the original data, insofar as you can use it for privacy. Academia is really interested in making data that is meaningfully similar, statistically.
Debra J Farber (14:20):
So that's an interesting slight difference between the two. Has that created tension at all?
Sid Mangalik (14:25):
I think we've seen tensions insofar as what is considered good synthetic data. Academics are looking at synthetic data as virtually indistinguishable from the original data, whereas in industry it can just feel like, "What is sufficient such that I eyeball test it and it looks fine?" It's not the same rigorous process that we expect in an
(14:47):
academic setting.
Debra J Farber (14:49):
That's interesting. That makes sense. That also kind of tracks with how you look at HIPAA regulations when it comes to de-identification - which we're not gonna talk about, whether it's actually good or bad or sufficient for privacy these days - but when you look at that, there's a little more leeway for academics during research around the data sets versus if you were releasing it
(15:10):
to non-academics. So, that makes sense to me. I'm seeing that play out over the years. So the AI community has been talking about using open source information, or information that's scraped from the internet, in order to help create synthetic data. But this approach seems to be backfiring. Some argue that LLM output is synthetic data, and when models are ingesting the output of other LLMs, I hear there might be
(15:33):
some degrading going on there. What are your thoughts on doing this?
Andrew Clark (15:40):
Yeah, that's definitely happening, where you're ingesting the outputs of other LLM models. Fundamentally, the current state of those large language models is that they're trained to sound human, not to be accurate. So, the objective function - every time you train a model, there's a specific objective function of what you are optimizing for. And that's one of the key things when we get into privacy
(16:00):
and bias and things like that: what are you optimizing for? In this case, when you've built these models, they're optimizing to sound like a human, not to actually be factual. So, it's a very important distinction. That's what's gonna hinder them being adopted. They might look really cool if you say, "Hey, write me an essay," and everybody gets all excited about it. This is that disconnect of why The Fundamentalists was started. But can you actually use this in a business?
(16:22):
That's gonna really change things, because you realize you have to be correct about things. So, we started running into issues - people with copyright and things like that, from scraping data from the internet which was written by humans - to now people using ChatGPT and LLM models to generate content, which is now back online and is being ingested. And there are some recent papers out on how the latest versions
(16:44):
of the GPT models are degrading and they're not as accurate. They're not even as smart as they used to be, because they're doing that feedback loop of ingesting other LLM data, so it's just kind of a downward spiral. So, one of the things that people are now talking about is, "Okay, we can't use data now that's on the internet because so much of it is generated, so can we make fake data to train
(17:04):
our models off of instead of real data?" And that's where there are a lot of issues that aren't helping either, because you're now creating... one of the key things to highlight with synthetic data is, although there are academic methods - and we can get into them - on how you can best replicate data, synthetic data is not real data, and one of the key things you miss is the interplay. The interconnectedness between inputs to a model is very
(17:28):
important, and you can't replicate those, even with the most in-depth methods, at a generalized level. We can get into what that means a little bit, but it has a problem with a lack of diversity of language, and bias issues, and a lot of things there. I'll let Sid add some more from the LLM-specific space, but there's a major disconnect here in just moving to synthetic data versus training from open source data, and neither one is
(17:49):
actually good for model training. You need curated data.
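A minimal sketch of the kind of check Andrew is describing, assuming two hypothetical pandas DataFrames with the same columns (`real_df` and `synthetic_df`, both made up here): compare the cross-feature correlation matrices to see how much of the interplay between inputs the synthetic copy actually preserves.

```python
# Sketch: does synthetic data preserve the interplay between inputs?
# `real_df` and `synthetic_df` are hypothetical, made-up data for illustration.
import numpy as np
import pandas as pd

def correlation_gap(real_df: pd.DataFrame, synthetic_df: pd.DataFrame) -> float:
    """Mean absolute difference between the two correlation matrices."""
    real_corr = real_df.corr(numeric_only=True)
    synth_corr = synthetic_df.corr(numeric_only=True)
    return float(np.abs(real_corr - synth_corr).mean().mean())

# Made-up example: income and age are correlated in the "real" data, but the
# "synthetic" copy only matches the marginals (each column shuffled independently),
# which breaks the interconnectedness between the inputs.
rng = np.random.default_rng(0)
age = rng.normal(45, 12, 1_000)
real_df = pd.DataFrame({"age": age, "income": 1_000 * age + rng.normal(0, 5_000, 1_000)})
synthetic_df = real_df.apply(lambda col: rng.permutation(col.values))

print(correlation_gap(real_df, real_df))       # ~0.0: identical structure
print(correlation_gap(real_df, synthetic_df))  # much larger: the interplay was lost
```

The point of the sketch is that matching each column's distribution on its own is not enough; the cross-variable structure is what synthetic data most easily loses.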
Sid Mangalik (17:52):
Yeah, so I'll just pick up right there. Using LLM outputs as synthetic data poses some pretty immediate and maybe obvious issues, and you can think about this on your own. If you've used ChatGPT before, you may have felt that it has a certain tone. You may have felt that it has a certain consistency in the way that it likes to speak.
(18:12):
And maybe you've even received an email and thought, "I think ChatGPT wrote this email," and this really highlights the lack of diversity, the lack of variance, that comes in synthetic outputs. And this is not a surprise, because all AI modeling is about creating high-level, broad-stroke
(18:32):
representations of the real world, of what human language looks like. We're not going to get these deeper nuances, these different styles of talking, maybe even the same types of mistakes we would expect from human authors. And so, when you keep spiraling and ingesting and outputting, and ingesting and outputting, the same data over and over and over again, you only further entrench these biases and these
(18:57):
repeated behaviors that are in AI modeling systems already. And so, you do see a natural degradation process. Maybe you've even experientially, anecdotally, felt that ChatGPT is not as good as it used to be, because now they need to put more and more data into it. But the data they're putting into it is this synthetic data, and it's not getting this rich diet of nutritious data.
(19:19):
It's just being given the same hashed-out data over and over and over again.
Debra J Farber (19:24):
And so here we're talking mostly about the base models, right? What's happening with those who build on top of LLMs? Are they able to adjust for the bad outputs, or is it pretty much, "No - if you build on top of something that's uncertain or not necessarily a great output, then it's going to be more garbage in, garbage out"?
Sid Mangalik (19:42):
There are still some early studies in this field. I actually just recently put out... I was a co-author on a paper where we discussed this idea, which is looking at downstream tasks for AI systems: is fine-tuning going to make any difference based on the quality of the underlying model? It seems like you can mitigate a good amount of problems in
(20:03):
your specific domain. Giving it really good data to fine-tune on will make it really good at your use case. But you will have a lot of struggles changing the baseline model to adapt to your needs in broader contexts. If someone has a customer service bot and you've trained it to be really good at customer service, that doesn't mean that
(20:23):
you've fixed any problems with the facticity it had before.
Andrew Clark (20:27):
And one key thing to highlight, to underscore what Sid just said: we're assuming that your fine-tuning data is very high quality, that there's a sufficient amount of it, that it has those correlations, and that it's not synthetically generated data from your use case. So, while you can mitigate some of the base model issues, you need to have extremely good data that you're then training on.
(20:47):
So, there is still good data in the system. You're just allowing yourself to use a little bit of synthetic at the base level.
Debra J Farber (20:54):
Okay, got it. So then, maybe you could tell us: what are some use cases where synthetic data is most useful, and what are some use cases where it's least useful?
Andrew Clark (21:02):
I'll take the pros and then, Sid, you can take the cons. So, pros - as Sid actually started discussing earlier, it's supplemental data. One of the ways you can use it is, if you don't have enough records, it can be helpful to expand your data set a little bit more, because modeling, specifically machine learning, needs lots of data. So, you want to have enough data to train
(21:22):
your model properly. Sometimes augmenting your data set is good. One of the ways that, personally, I like to use it to help with the augmentation is to create data that is a little bit outside the norm. Because one of the basic issues with machine learning and AI models is that they get overfit. So, if your data set is on this one specific thing - let's do an LLM discussion, right?
(21:44):
We've just looked at history books, but now we want to start writing poetry. Well, it's not going to be an exact one-to-one, right? So, then you want to make sure you have enough poetry examples. Maybe you have enough synthetic poetry in there to help expand it, so it's not overfit. So, expansion of your data set is a good way to use it. Leading into that is one of the best ways that we've found.
(22:06):
One of the discussions we had on The AI Fundamentalists Podcast was using it for stress testing. This is, I think, the most valid use for synthetic data: stress testing. So, for instance - we talked about wanting safe, performant systems - we need to determine where is this model safe and performant? If I have pricing information from Utah and I have a great
(22:26):
model that works in Utah off of those demographics, that doesn't mean I can take that same model and move it to Massachusetts. The demographics, the income levels - there are lots of things that could be different there. So we need to test: can it perform correctly in Massachusetts? We would need to stress test it and figure out where it breaks down. Once we know where that model falls apart, we know if it hits the business objectives and where to put those monitoring
(22:47):
bounds. So stress testing - there's a huge tie to validation research there, and it goes back to that NASA discussion - and supplemental synthetic data, I think, is one of the best ways to do it. Stress testing (22:57): it's a gap - most people in ML don't do stress testing - and that's something we think is very important.
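A minimal sketch of the stress-testing idea, under stated assumptions: a hypothetical model fit on "Utah-like" data is evaluated on synthetically shifted inputs standing in for a different state, to see where its error grows and where monitoring bounds might sit. The features, the assumed "true" relationship, and the shift sizes are all made up for illustration.

```python
# Sketch: stress-test a model trained on one region by shifting the input distribution.
# The features, the assumed ground truth, and the shift sizes are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

def true_price(income_k):
    """Assumed ground truth: mildly nonlinear in income (in $1k)."""
    return 3.0 * income_k + 0.05 * income_k ** 2

# "Utah-like" training data: incomes clustered around $70k.
income_utah = rng.normal(70, 10, 2_000)
model = LinearRegression().fit(
    income_utah.reshape(-1, 1),
    true_price(income_utah) + rng.normal(0, 5, 2_000),
)

# Stress test: synthetically shift incomes upward (a stand-in for another state)
# and watch the error grow, which tells you where the model stops being safe.
for shift in [0, 10, 20, 40]:
    income_shifted = income_utah + shift
    pred = model.predict(income_shifted.reshape(-1, 1))
    rmse = float(np.sqrt(np.mean((pred - true_price(income_shifted)) ** 2)))
    print(f"income shift +{shift}k -> RMSE {rmse:.1f}")
```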
Synthetic data linking to privacy: an area where you can definitely use it is differential privacy, which is technically synthetic data. There are two definitions of that - we could get into them later if you'd like - but there's global and there's local, and basically you're adding noise at either the aggregator level or at the localized level, depending on
(23:20):
who the trusted party is. So, we can dig into that later if we'd like to. But that's another way where you're basically anonymizing certain attributes, as you mentioned about HIPAA earlier.
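A minimal sketch of the global-versus-local distinction Andrew mentions, using the Laplace mechanism; the query (a mean over a bounded age attribute), the sensitivity bounds, and the epsilon value are illustrative assumptions, not a production implementation.

```python
# Sketch: global vs. local differential privacy with the Laplace mechanism.
# The query, bounds, and epsilon are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
ages = rng.integers(18, 90, 500).astype(float)   # hypothetical sensitive attribute

def global_dp_mean(values, epsilon, lo=18, hi=90):
    """Trusted aggregator computes the true mean, then adds calibrated noise once."""
    sensitivity = (hi - lo) / len(values)          # sensitivity of the mean query
    return values.mean() + rng.laplace(0, sensitivity / epsilon)

def local_dp_values(values, epsilon, lo=18, hi=90):
    """Each individual perturbs their own value before sharing it (no trusted party)."""
    sensitivity = hi - lo                          # per-record sensitivity
    return values + rng.laplace(0, sensitivity / epsilon, size=len(values))

print("true mean:        ", ages.mean())
print("global DP (eps=1):", global_dp_mean(ages, epsilon=1.0))
print("local DP (eps=1): ", local_dp_values(ages, epsilon=1.0).mean())  # noisier estimate
```

The trade-off in the sketch mirrors the "person of trust" point: global DP adds far less total noise but requires a trusted aggregator, while local DP trusts no one and pays for it in accuracy.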
Debra J Farber (23:29):
Fascinating. Thanks. And then, Sid, some of the maybe not-so-great use cases for synthetic data?
Sid Mangalik (23:35):
Yeah. So, I think Andrew's examples here are great, because they really talk about how synthetic data is useful when you need to push the boundaries of what you can safely collect. Maybe you don't want to collect outcomes for people that are on the edges of these groups, or maybe they're just really hard to find. So this is a really great use case: creating data that's essentially not possible to collect.
(23:57):
Where this becomes a lot less useful is when you're trying to learn about the base distribution itself, when you're trying to learn about the original group of people being studied. If you create synthetic data by just repeating that data, and you feed it back to the model, you'll just create a system that becomes a little bit narrow, maybe even overfit, to
(24:19):
look more and more like that original data set, because now your model is thinking, "Well, I have so many great examples of person A, 20 times," and so your model will become something of a person A classifier or regressor. This is the same problem with LLMs: if you just put in the base distribution again, that center of the distribution, over and over and over again, you just narrow the standard
(24:41):
deviations, you narrow the variance, and you end up with a very singular model that becomes difficult and hard to tame for the outside tasks that synthetic data is much better suited to.
Debra J Farber (24:52):
That makes sense. I also see that that could be a privacy problem as well, right? If you're training the model with pictures of Debra Farber, and then all of a sudden you're like, "I'll throw in a few other pictures of other people," it's still most likely going to spit out something that looks like me. I could see it being a privacy challenge, or I could see where some of these copyright issues get in the way,
(25:13):
if it's over-trained with either personal data or copyrightable material. Talk to me about the importance of the quality of the inputs.
Andrew Clark (25:22):
That's what's huge. We like to say on The AI Fundamentalists Podcast that 'we're bringing stats back.' Data science as a field kind of grew out of statistics not giving the world what it wanted, and statistics has taken a back burner. Essentially, what we're realizing here - and people are learning the hard way with some of these systems - is you really have to have quality data that captures those complex
(25:43):
inter-relationships. Although there are a lot of research areas on how to create better synthetic data, you still have to have some sort of good data. Data augmentation is really the research being done here - making better data - but you have to have that good core data. And how do you know data is good? How do you capture those inter-relationships? Surveying and all of these traditional statistical
(26:04):
evaluation techniques - all of those things are vitally important for building these AI systems. You have to have those core good inputs, and maybe you can sprinkle in some synthetic to help expand that or stress test. But quality of data inputs - machine learning and AI are really garbage in, garbage out. Your model is only as good as the data you put in. That's why people say 'data is the new oil' and have all those
(26:25):
discussions - but it's quality data. We went to the extreme of having all this big data and not knowing how to handle it. Now we've gone, "Okay, we trained off the internet, we have copyright issues, we don't know how to handle these things, let's just generate our own." But you can build really solid models. This is one of the big downfalls of the deep neural network structure: the amount of data it requires. Sometimes the parsimonious, easier-to-use models can be
(26:48):
better, because they handle smaller inputs and smaller data sets better. So, quality of data is directly correlated with the quality of the modeling system you're using.
Debra J Farber (26:59):
That makes sense. So, after the backlash against using open source data, or data scraped from the internet that was written by humans, Sam Altman from OpenAI has been talking about using LLM-generated synthetic data. What do you think about this approach with LLMs generally?
Sid Mangalik (27:15):
Yeah, I mean, this is exactly what we've talked about before. It is this problem of just regurgitating data over and over and over again, and there's been a lot of great research coming out of Harvard and Oxford where they're trying to study the negative effects of this type of approach. People are already a few steps ahead in trying this out and seeing what happens, and these models really, really suffer in
(27:38):
these settings. And so, while it's going to be great to tell investors, "Hey, GPT-5 has twice the data GPT-4 had" - if that wasn't high quality data, we're just going to be creating weaker and weaker models over time. They might look better and better at the specific use cases that show up in the demos, but it's not going to create more
(27:58):
factual models. It's not going to create models that are better aligned to human needs. It's just going to look more like human text, but that's very superficial.
Debra J Farber (28:08):
Why do you think they're taking this approach? Is it just about PR? Is it about investment - getting people to think that this is going to be the path forward for optimizing LLMs? How will they be able to, I guess, demonstrate to the public, or even to Microsoft, who's giving them quite a bit of money as a partner, that they should continue down this path if, in the end,
(28:34):
using LLM-generated synthetic data is going to degrade the quality of the data?
Andrew Clark (28:38):
Depends. We have the appropriate responses, and then we have the real responses.
Debra J Farber (28:42):
Oh, I want the real responses. Unless you don't want to say so on this podcast, but I would love to hear what you think is bullshit from the PR.
Andrew Clark (28:54):
Yeah, well, I'll take a swing and then, Sid, we'll piñata this a little bit. I'm not a big fan of OpenAI and their structure. They actually started as a research lab - 'Open' AI is in the name - they were going to open source everything. They were going to be a good community member about how we can all just expand technology. They've really become a closed-source monopoly.
(29:15):
They even were naive enough to say, "Let's suggest regulations that we can then use to create a moat around ourselves" - which, obviously, we've never seen before. There's a lot of hubris. In the computer science community, there are individuals who are very egotistical, and this is where a lot of these concepts come from - and they have great marketing around machine learning and things: "It's new, it's brand new. We've done it here." Well, NASA was doing it in the '60s.
(29:35):
It was just called engineering back then. There's this hype train that comes out of Silicon Valley, because of the success of Facebook and things like that, that as long as you have a computer science degree and you hang out in Silicon Valley, you're infallible and you can just do awesome stuff.
Debra J Farber (29:52):
Especially if you're a white male. Sorry, I just wanted to get that in there.
Andrew Clark (29:53):
Yeah, well, it's very true. No, I agree, I agree. And there's that hype around that, and that's synthetic data: it's a research area, so what could possibly go wrong? There's that lack of fundamental understanding, lack of doing things the hard way. Everything is growth hacking, hacking this, hacking that, trying to cut corners and just thinking it's all going to work out. I think there's a lot of that culture and, you're right, it's a white male culture that operates that way.
(30:15):
And honestly, I don't think it's malicious - they're just like, "How can we quickly make money, sell it, flip it, create a moat around ourselves?" I really think it's kind of operating that way.
Sid Mangalik (30:25):
I think we should remember a little bit of how we got into this situation in the first place. We got into this situation because there was a huge backlash from creators, from people that own data, from people that create text data, saying, "I did not consent to this. I did not ask for my data to be put into this model. I certainly did not ask for it to be made public and available to everyone for free
(30:45):
on the internet," and this underscores a common and consistent issue we see with OpenAI. When we talk to lay people about this, they're excited, they're ecstatic. They're like, "Wow, this AI is here, it's cool, it wasn't here before. I can't believe this is happening all of a sudden." But people that are in NLP labs don't feel this way.
(31:06):
This type of work has been out and ready for almost five years now. This is not new research. Google had these types of models long, long, long before OpenAI published ChatGPT. Why didn't they share it? Because they felt it was irresponsible. They didn't feel they were ready. They didn't feel they were in good shape. They hadn't gone through the proper channels to make the data safe, secure, and consented to.
(31:26):
And then OpenAI does it very openly, very recklessly, and it feels like a band-aid solution. They're in a situation where they're finally getting backlash for using everyone's data, and they say, "Fine, we'll just make our own data."
Debra J Farber (31:41):
Fascinating. It'll be interesting to see how that actually develops. I am closely watching potential legislation in the area. I mean, gosh, I have so much to say about how they're coming to market, and I might just save it for another conversation, because it could take up its own conversation, basically. Let's switch to something that is, I want to say, not the opposite, but let's talk about ethics a little bit.
(32:02):
So, talk to me about the importance of diversity in the training of AI models.
Andrew Clark (32:09):
That's huge. You mentioned it a little bit earlier with the image use case: a bunch of image models trained off of your face, and then what happens if we get a different face? That's been a major problem in training data. There have been a lot of white faces in image training data, and that's created issues. There are cases where synthetic data and upsampling/downsampling really do make sense, because these models, despite
(32:31):
the PR around them, are not very intelligent. We talked earlier about optimization; you optimize over something. So, an area of research that Sid and I have been trying to highlight is 'multi-objective modeling,' where your model has to be both performant and fair. You have to find the best model that fits both of those, versus most of these models - ChatGPT is focused on "how can I make the model that sounds the most human," not "sounds the most human and is not racist."
(32:52):
You're not focusing on the fairness. Now, you could; it just means that you need more training data, and you need representative data. So, that's another reason why statistics is so important. Everybody likes to make fun of statisticians and polls and the U.S. Census Bureau and all those things, but that's a really hard job that is focused on how we can accurately represent the
(33:12):
underlying population. So why statistics is so huge is: if we're building a model that has those implications for end users, as we talked about earlier, we want to make sure that it is representative of that underlying population (33:24): all the demographics, all the socioeconomic statuses, all of those attributes. Also, we want to make sure that if there are minorities in that group that aren't well-represented - if you look at the U.S. in general, we are a very diverse group, so you should normally be able to use that U.S. data; the problem is that some of the samples people use aren't representative. But let's play it out anyway. Say you have a set of data and you still want to make sure
(33:46):
it's fair towards minorities, even if it doesn't have enough data. You can either upsample, or you use this multi-objective modeling to ensure fairness and limit bias. So, this is where it gets really complicated, and we don't like this approach of willy-nilly synthetic data for everything. There are some great use cases for synthetic data, but they're more about adding robustness and stress testing and safety to
(34:08):
systems. So, for instance, if I'm looking at a set of billionaires, maybe there aren't enough African-American or female billionaires. Well, I'll make up a couple so my data set is more balanced when I'm modeling billionaires, as an example. That's a great case for synthetic data. Saying, "I'm just going to make up a set of synthetic data with no reference point" - that's a scary use case, and that's what OpenAI is at least publicly suggesting.
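A minimal sketch of the multi-objective idea Andrew describes, under assumptions: score a few candidate models on both accuracy and a simple group-fairness gap, then pick the one with the best combined score. The candidate models, the fairness measure (a demographic-parity gap), and the trade-off weight are all illustrative choices, not his actual method.

```python
# Sketch: multi-objective model selection on accuracy *and* a fairness gap.
# Models, weights, and the fairness metric are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=6, random_state=0)
group = (X[:, 0] > 0).astype(int)                 # hypothetical protected-attribute proxy
X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(X, y, group, random_state=0)

def parity_gap(y_pred, g):
    """Absolute difference in positive-prediction rates between the two groups."""
    return abs(y_pred[g == 0].mean() - y_pred[g == 1].mean())

candidates = {
    "logistic": LogisticRegression(max_iter=1_000),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    acc = (pred == y_te).mean()
    gap = parity_gap(pred, g_te)
    score = acc - 2.0 * gap                        # assumed trade-off weight
    print(f"{name}: accuracy={acc:.3f} parity_gap={gap:.3f} combined={score:.3f}")
```

The design point is simply that fairness becomes part of the objective you select on, rather than something checked (or ignored) after the fact.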
Debra J Farber (34:30):
Oh, fascinating.
Okay.
Andrew Clark (34:31):
Yeah.
Synthetic data can be used for good, but it's also going to be used for bad. On the bad side - and I'm not going to talk about OpenAI here - normally what I notice, and we help a lot of companies with bias and fairness, is that the majority of people actually do want to do the right thing, and these models can just be so easily biased. With the researchers we've talked about with the image
(34:51):
example, I don't think anybody - maybe it's because there's a white male culture, and I'm sure there are bad apples - but not everybody is saying, "I want to make a biased model." A lot of times, it's just ignorance or not understanding the nuance. Right? So, these complexities are huge, and that's why we really need those best practices on how we can build fair, performant systems and make sure we're doing that responsibly.
(35:11):
Synthetic data is definitely part of it, but it is not a panacea. You have to know what you're doing, take it slow, and make sure you have that balanced data.
Debra J Farber (35:19):
So, don't move fast and break things? That's right - for consequential systems. For consequential systems; that makes sense.
Sid Mangalik (35:26):
Yeah, and this isn't just a human issue. This is an AI modeling problem, too. Right? This is class imbalance, point blank. If you don't let your model see these types of outcomes, if you don't let it see healthy outcomes for minority groups, you will only create a model that assumes they only have unhealthy outcomes, and it will never allow them to have the
(35:47):
same type of fairness and equity that other people experience in these models.
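A minimal sketch of the class-imbalance point, assuming a hypothetical dataset where one group contributes very few records: upsample (or synthesize) the underrepresented group before training so the model actually sees those outcomes. The DataFrame, column names, and target size are made up for illustration.

```python
# Sketch: rebalancing an underrepresented group by upsampling before training.
# The DataFrame, column names, and target size are hypothetical assumptions.
import pandas as pd
from sklearn.utils import resample

# Imagine 950 records from the majority group and only 50 from a minority group.
df = pd.DataFrame({
    "group": ["majority"] * 950 + ["minority"] * 50,
    "outcome": [1, 0] * 475 + [1] * 10 + [0] * 40,
})

minority = df[df["group"] == "minority"]
majority = df[df["group"] == "majority"]

# Upsample the minority rows with replacement so both groups contribute equally.
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_upsampled])

print(df["group"].value_counts().to_dict())        # {'majority': 950, 'minority': 50}
print(balanced["group"].value_counts().to_dict())  # {'majority': 950, 'minority': 950}
```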
Debra J Farber (35:51):
Yeah, I mean, you could see how - maybe not those building the models, but you could see how maybe a nefarious billionaire, or someone (I'm not addressing anyone specifically) with a lot of power who wants to, I don't know, curtail a certain population somewhere, could inject bias into these models, or somehow create restraints that come out of these models that do harm, on purpose, to certain
(36:13):
populations. So, yeah, that's scary. You want responsible AI. And I think that there's not a great way for the media to really understand how to cut through the hype and talk about the true challenges that are going on today. Right now, they've been pretty much distracted by the far-
(36:36):
future potential risks to humanity that those bringing LLMs to market talk about, to avoid speaking about the risks that are inherent in the models today, that could hurt people today. I don't know what the answer is to get the media more educated on this so that they can actually have a real discussion with the public at their level. But gosh, do I hope that they get there.
Andrew Clark (36:58):
Yeah, that's what's really tough: as we're digging into this, there are so many layers, it's hard to explain in a quick note. And, honestly, with OpenAI, I would like to think there's nobody doing anything nefarious there. They just don't know how to do this bias mitigation, how to do synthetic data that captures these relationships. It's not a straightforward thing. It's not even very well known in the ML community. So I'd like to think - and I have no reason not to think so - that
(37:21):
OpenAI just doesn't understand some of these things. They're not trying to be bad. Maybe they are, but we don't have any evidence that they are. That makes sense. It's just very difficult: how does the media portray this, and how do you understand this type of thing? And how do we even educate our universities that this is a thing? Google understood this; that's why they did not want to release Bard until OpenAI did it first.
(37:44):
They had to. OpenAI, I honestly think, is just 'move fast, break things' and didn't think about it - that's honestly what I think happened.
Debra J Farber (37:50):
I agree. Watching it come to market, I remember a few weeks before - I don't remember if it was Google or Meta, I think it was Meta - had released some new LLM or something new to the AI community, and people were like, "Why have you released this? It's not ready for general release, because there are so many issues with it," and there was some backlash online.
(38:11):
And then, all of a sudden, OpenAI goes boom, here's ours. Right? And provided API access to it and really hyped it so that everybody and their mother could try it - at least the free version, not the commercial version, which is even worse because there are fewer protections in there. I mean, it really felt like at least a PR campaign of: let's
(38:31):
capitalize on the interest in AI now, let's make it available everywhere in the world all at once and really hype it so that regulators can't shut us down. That's what it felt like to me, as someone who's been in privacy for 18 years and has been looking at how companies come to market - and, you know, I sit on several privacy tech advisory boards and help with go-to-market - and I'm
(38:52):
watching this happen, and it really feels to me... I'm not saying that the people who work for the company themselves are bad or nefarious, but I think the way they came to market was arrogant and was daring regulators to act. So, to me it just seems a little ridiculous, given that Google and Meta and the big tech companies already know what happens.
(39:12):
They're already under 20-year FTC consent decrees for some actions they've taken in the past, whether around privacy or misleading users. So, they've learned their lessons, or they're currently under consent decrees, so they don't want to rock that boat because they're being audited. Right? OpenAI doesn't have any of those restrictions, and what I see from the Silicon Valley investors is they're like, "Well,
(39:35):
okay, we'll just make this a line item" - all of the potential lawsuits and whatever are just a line item against the potential billions, you know, in annual spend. "At least we'll be able to come to market and make billions of dollars, and then we'll just have to pay for the lawsuits and any fines later, but at least we get the big market share." So, that's the part I'm most angry about: the way that they
(39:55):
came to market. But I do see that each PhD who works there isn't necessarily, you know, bad or evil, and it could be just ignorance. It certainly isn't helping them to have this kind of stress on the data scientists who are working on this, trying to fix something that might not be fixable. So, I know I've heard you talk in the past about this concept
(40:16):
of the 'fairness through unawareness' fallacy. Can you just unpack that a little bit for us?
Andrew Clark (40:24):
Yes, I'll take the high level and then let's add some details. So, fairness through unawareness is something where a lot of companies will think, "Hey, if I'm not looking at an attribute..." Say I have a tabular data set - just like an Excel spreadsheet, think of it with rows and columns - where I have information about pricing for an insurance policy, as an example. So I'll have a bunch of information: this person
(40:45):
lives in Tennessee, this person drives a Ford pickup, they drive around 30,000 miles a year - that kind of information. Right? Well, I'm going to just remove age and gender and ethnicity, "because my model doesn't see that stuff, so it's fair. I don't have to look at it." Actually, in certain industries, like insurance, you're not allowed to look at it. In federal housing, they actually keep an aggregated statistical database on that to make sure their models are fair.
(41:07):
That's a rabbit hole. But anyway, you basically remove any of those sensitive attributes and train your model without them, and then there's no way to track it, and you don't know if you're fair or not fair. But you think you're fair because you didn't train off of that. It's under the assumption that if I had age in the system, it might be ageism, but because I took age out, it's not going to do anything. The problem is those inner correlations and relationships - the same reason you can't get synthetic data that's as accurate as real data.
(41:28):
Those relationships still exist. Maybe there's a proxy or something that exists: well, people drive Ford pickups and they live in this specific area - and it turns out they're actually rich white males - and that correlation still
(41:49):
exists in the data set, in those inner relationships. So, you are just being blind to the fact that there might be bias, thinking that because I'm not training on age or gender or ethnicity, I'm just magically not biased, which is a fallacy.
Debra J Farber (42:01):
Oh, that's interesting.
Okay, I get that.
That makes sense.
Andrew Clark (42:05):
And, as we talked about earlier with the multi-objective approach, it's counterintuitive, but we actually want to know ethnicity, gender, age, because we can then train our model to explicitly not be biased. There are ways to say, "If you flip male/female gender or - sorry - ethnicity, any of those attributes, I want to have the exact same response."
(42:26):
So if I flip ethnicity, I should get the exact same pricing. That variable should not have any impact. I can train my model to do that. But I don't know if that's the case if I just threw out that attribute - and oftentimes, as we've figured out, it actually does have this implicit bias that you just can't see. So, that's kind of a dangerous solution.
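A minimal sketch of the flip test Andrew describes, under assumptions: keep the sensitive attribute in the data, fit a model, then check that flipping only that attribute for the same applicants leaves the prediction essentially unchanged. The model, features, and "true" pricing relationship are hypothetical.

```python
# Sketch: counterfactual "flip test" - does changing only the sensitive attribute
# change the model's output? Model, features, and pricing are illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 1_000
df = pd.DataFrame({
    "miles_per_year": rng.normal(12_000, 3_000, n),
    "vehicle_age": rng.integers(0, 15, n),
    "gender": rng.integers(0, 2, n),               # hypothetical sensitive attribute
})
# Assumed ground truth: price depends only on the legitimate features.
price = 0.05 * df["miles_per_year"] + 30 * df["vehicle_age"] + rng.normal(0, 50, n)

model = LinearRegression().fit(df, price)

flipped = df.copy()
flipped["gender"] = 1 - flipped["gender"]          # flip only the sensitive attribute

gap = np.abs(model.predict(df) - model.predict(flipped))
print("max price change from flipping gender:", round(float(gap.max()), 2))
# A large gap would signal that the attribute's effect needs to be constrained.
```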
Debra J Farber (42:45):
That's a really helpful thing to avoid. Okay, awesome. And you've also spoken in the past about the difference between 'randomized data' and 'synthetic data.' How is randomized data different?
Sid Mangalik (42:55):
Yeah, so I mean, you can really think about randomized versus synthetic data as: is the data that I get at the end of this coherent? If you just create purely random data, you'll create 17-year-olds that have been through three divorces and are the senior executive of a bank.
Andrew Clark (43:09):
Great examples.
Sid Mangalik (43:11):
That's just random data, right? This is not coherent data. This doesn't make sense. This doesn't look like anything we've seen before. The goal of synthetic data is to not just be random data. It's to be data that looks like the original data, acts like the original data, and interacts between the variables the way that real data interacts.
Debra J Farber (43:28):
Excellent, that's a really helpful definition. I really appreciate it. Okay, so there are several techniques that you've described in the past for using synthetic data with ML/AI: the Monte Carlo method, Latin hypercube sampling, Gaussian copulas (I hope I'm saying that right), and random walking. Do you mind walking us through those four methods?
(43:50):
Obviously at a high level, but if you wanted to use a use case, that would be, I think, helpful for the audience. Maybe first, for the Monte Carlo method: what are the pros and cons of using this technique for synthetic data?
Andrew Clark (44:03):
Sure, I'll take Monte Carlo and Latin hypercube, and then Sid, you can do the last two. So, the Monte Carlo method is named after the famous Monte Carlo casino in Monaco, I believe it is. In general, it's a more intelligent way of sampling. You're basically trying to do repeated random sampling to
(44:24):
obtain numeric results. Normally, what you're trying to do is use randomness to solve deterministic problems. So, it's something we'll do when we're trying to test a system or determine what all the possible outcomes are. We'll run, say, 100 times through a scenario where you have some randomness factors or stochastic generators. Essentially, we're trying to represent "here are the
(44:44):
different scenarios for economic growth," for instance - those sorts of things. You have this whole system where you run through that, you're sampling each time, and you're trying to figure out what the true value really is by perturbing that input space, just tweaking attributes. It's really a system of running experiments, if you will, for defining what the outcomes could be for a system.
(45:04):
Latin hypercube sampling is a more intelligent way of sampling, instead of just random sampling. Essentially, if you think of a chessboard, it helps you make sure you hit all of the areas within the chessboard, versus random sampling. Sid's description was great - the 17-year-old with three divorces who is the CEO of a bank. If you're just randomly sampling in the input space, you could get some crazy outputs there.
(45:25):
Latin hypercube tries to be a little bit more intelligent about sampling for Monte Carlo simulation. I used to work doing economic simulations, trying to build economies and figure out different growth patterns and things like that. So, Monte Carlo is a really good technique for stress testing models, for determining, in complex systems, what all the possible inputs and outputs are, and for defining how you should
(45:46):
build a system. It can be used specifically for those stress-testing discussions we talked about earlier, to generate synthetic data for those different scenarios.
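A minimal sketch of the two techniques together, with an assumed toy model: Latin hypercube sampling covers the input space more evenly than plain random draws ("hitting every square of the chessboard"), and Monte Carlo repetition turns those samples into a distribution of outcomes you can stress-test against. The input ranges and the toy pricing function are made up.

```python
# Sketch: Monte Carlo stress testing with Latin hypercube sampling of the input space.
# The input ranges and the toy pricing function are illustrative assumptions.
import numpy as np
from scipy.stats import qmc

def toy_model(income_k, miles_k):
    """Stand-in for the model under test."""
    return 200 + 2.0 * income_k + 5.0 * miles_k

# Latin hypercube: stratified coverage of the unit cube, so the "chessboard" is
# covered evenly instead of leaving random gaps.
sampler = qmc.LatinHypercube(d=2, seed=0)
unit_samples = sampler.random(n=1_000)
# Scale unit-cube samples to assumed ranges: income $30k-$200k, 5k-40k miles/year.
samples = qmc.scale(unit_samples, l_bounds=[30, 5], u_bounds=[200, 40])

outputs = toy_model(samples[:, 0], samples[:, 1])
print("simulated price range:", round(outputs.min(), 1), "to", round(outputs.max(), 1))
print("95th percentile:", round(float(np.percentile(outputs, 95)), 1))
```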
Debra J Farber (45:55):
Oh cool.
Sid Mangalik (45:57):
And I think, on
the other side, gaussian copulas
and random walking solve theother problem.
So, if the first to solve theproblem of picking good samples,
these Gaussian copulas and therandom walk both solve the
problem of making coherentselections.
So the Gaussian copula is atechnique, which is just
grounded on our favorite bellcurve, our famous normal
(46:18):
distribution, and thinking abouthow random variables can be
interrelated and creating outputdata that mimics the shape and
these correlations.
If you take a correlationmatrix of the original data and
a correlation matrix of yoursynthetic data, you want them to
look as similar as possible,right?
You want age and number ofdivorces to be highly correlated
(46:39):
.
You don't want those to beinversely correlated or
something, right?
So this helps to make data thatmatches the shape of the
original data, even in theinteraction space.
It's funny that I'm doingrandom walking here because this
is Andrew's PhD topic, but it'salso a very similar technique
where we're kind of walkingthrough the data and progressing
through it in a very naturalway.
Where we want to consider twovariables, let's walk over to
(47:01):
the fourth variable - the thirdvariable and the fourth
variable, and then to the fifthvariable in a logical
progression, which will let youget closer to the shape of the
original data by, in a sense,walking the original data.
Debra J Farber (47:13):
Awesome. Okay, thank you so much. Is there somewhere people could go if they want to learn more - a more in-depth overview of synthetic data techniques with ML and AI? Do you know where they can go for that?
Andrew Clark (47:27):
That's a tough one. This is where it gets a little bit tricky. You can research some of these techniques - Gaussian copulas, for instance, are a huge research field; and actually, because of the underlying distributions and things, that one got hit really hard after the financial crisis. It was one of the contributing factors in some of that. There's not, that I'm aware of - Sid, weigh in if there are any good textbooks or anything -
(47:47):
definitely, you can reach out to us and we're happy to point you to different things, but I don't know of a definitive "here, look at this place for synthetic data." It's very tough, and that's part of this lack of a fundamentalist-principles approach: how do people even learn some of these techniques? It's very much embedded within aerospace engineering, which does a lot of Monte Carlo, or computational finance does
(48:09):
this thing, or complex systems engineering does this. We try to take that interdisciplinary approach, but it's hard to find a single resource to direct people towards.
Sid Mangalik (48:18):
Yeah, sometimes it feels like you just start on Wikipedia. You learn about the techniques, you pick up a stats textbook - there are some great online stats textbooks - and you learn about the methods, but there's not necessarily this space yet - and this is a very nascent, growing space - of taking these techniques and bringing them to AI modeling. This is still a relatively new idea, so while the math is there,
(48:39):
it hasn't quite been married yet to AI.
Debra J Farber (48:43):
Got it. Well, maybe you guys can write that book, and I would be glad to promote it, because clearly data scientists need this information. I'm not saying you have to go and write a book, but somebody should, because clearly people are thirsty for doing the right thing when it comes to building models and they don't necessarily know what to do. Okay, my last question for you guys before we end today, ending
(49:03):
on a really positive note: what are you most excited about when it comes to synthetic data and how it will be used in the future?
Andrew Clark (49:11):
I love simulations. My PhD work is on them. I love building complex systems and stress testing models, and I really think the future is more awareness around these techniques and building systems that are safe, performant, and reliable - which means simulating and stress testing them. As we're moving to modeling systems and AI systems that are being used for consequential
(49:32):
decisions, we really have to start doing that stress-testing step. The OCC, which is the Office of the Comptroller of the Currency, set out a model risk management framework after the financial crisis that all banks have to follow, as an example. Model validation, effective objective challenge - that's a benchmark, but really only large banks have to follow that. Stress testing - other larger banks have to do those things.
(49:52):
But outside of that realm, model validation and stress testing have not really been utilized, aside from aerospace engineering. It's used a lot in aerospace engineering for spaceships and fighter planes and all that kind of fun stuff. (Debra (50:04): Human safety purposes there. Exactly.) Exactly - that's where it's been used. There are full fields of safety engineering and reliability engineering that NASA helped pioneer, and it's been used for... Boeing uses this for building 737s and all that kind of stuff. Right? There are those other fields that are doing these things. Once we're using consequential systems that do affect human safety and reliability,
(50:24):
we need to start bringing those techniques into building these systems. Before ChatGPT goes live, run a bunch of safety and reliability engineering tests on it - that kind of thing. So, I'm excited about the potential of using synthetic data to do those things. And that's where synthetic data is not new; it's building scenarios. So, I think there's a great area for research there. Just a caveat for the profession: don't think it's a replacement
(50:45):
for real data for training your models; use it to augment and stress test your models.
Debra J Farber (50:51):
Excellent.
Sid, anything to add to that?
Sid Mangalik (50:52):
Yeah, I mean, I don't think the field is ready yet, but I'm really excited about the potential of synthetic data as a way to do unintrusive, privacy-preserving data processing and management. We have patients and medical data and we want to learn about them, and it can be difficult sometimes to have large sample sizes, to have safe sample sizes, to collect data from patients
(51:17):
who aren't just the standard profile we always collect. Synthetic data might pose one of the first chances for us to have good-quality, privacy-forward data available to us, and there are still a lot of problems we're figuring out in this field - what that looks like, whether we're just entrenching the same biases again and again. But there's a really strong possibility that in the next
(51:38):
decade we could see synthetic data being used in a lot of use cases where we couldn't safely do it before.
Andrew Clark (51:44):
Thank you so much for having us on. I think this was a really fun discussion. It's great talking with you and, yeah, this is a huge topic, and privacy and data are becoming more important, so thank you for having this podcast. I think it's definitely something very much needed.
Debra J Farber (51:59):
Excellent. Well, Andrew and Sid, thank you so much for joining us today on The Shifting Privacy Left podcast. Until next Tuesday, everyone, when we'll be back with engaging content and another great guest... or guests. Thanks for joining us this week on Shifting Privacy Left. Make sure to visit our website, shiftingprivacyleft.com, where you can subscribe to updates so you'll never miss a
(52:22):
show. While you're at it, if you found this episode valuable, go ahead and share it with a friend. And, if you're an engineer who cares passionately about privacy, check out Privado (52:31): the developer-friendly privacy platform and sponsor of the show. To learn more, go to privado.ai. Be sure to tune in next Tuesday for a new episode. Bye for now.