Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
PJ (00:10):
Hey, everybody.
Welcome back to Tricky Bits with Rob and PJ.
So one of the interesting articles that has come out over the last month has to do with Apple's M series of chips: the M1, the M2, and the M3. And the issue in question is that researchers have successfully attacked the
(00:38):
chip to extract encrypted data, specifically encryption keys. And it's a really unique or interesting way by which they've gotten access to this. And it has to do with a performance gain that the chip is attempting to do, or does do, which is this
(00:58):
topic around instruction prefetching. So the M series has a particularly aggressive set of prefetching that it does when it's examining the information coming in. And because of the particular side effects that can occur, this allows attackers to potentially get access to
(01:21):
sensitive data.
Rob, this, uh, this is an unfortunate situation, because I think you've pointed out this is occurring all the way down at the level of the silicon, and everyone raved about the performance for so long. But we end up in this kind of situation.
Rob (01:43):
Well, one correction I'll point out: it's data prefetching, not instruction prefetching, that causes the problem here.
PJ (01:49):
Oh,
Rob (01:49):
And it's not something unique to the Apple chips. Everybody does it; it's just how they do it. But before we can really understand what's going on here, we have to go all the way back, way, way back. Why do we do this? Why do we prefetch is the ultimate question. And why do we prefetch instructions?
(02:11):
And why do we prefetch data? And what are we prefetching from and to? It's an architectural question that goes back about 30 years. And it all comes back to why we have a cache on processors. The reason we have a cache is because processor speeds and memory speeds
(02:31):
are so far apart these days. Today we can get processors that are boosted to over five gigahertz clock speeds. Memory, yes, it's faster than it was, but 3,000 to 4,000 megatransfers a second is about as fast as you're going to get. So processors have gone up 10x in speed, 20x, 30x, whatever it may
(02:53):
be over the last few years. And memory has gone up maybe 5x. So there's a huge speed discrepancy. And this started in the 80s: the original ARM3 processor, and the 386 processors back then, had the same problem. Processors were getting much faster than memory. Maybe it was 30 megahertz for the processor, uh, and 8
(03:16):
megahertz for the memory. What happens is, if you access memory, you have to make the processor wait. And it takes time. So to fix this problem, we added cache. Cache runs at the same speed, or almost the same speed, as the processor: maybe one or two cycles of access time.
(03:39):
So then, you load data, it's in the cache; you load the same data again, and you get it for free. You don't pay that delay. But what happens when memory is orders of magnitude slower than the processor? You end up with a problem where you miss the cache and the processor has to wait a thousand clock cycles while it sits there doing nothing.
And that's not acceptable. So every architectural improvement in processor speed
(04:03):
is basically addressing this problem. And it has been that way for, like I said, 30 years.
So what did we do? We started doing things like, well, first of all, we'll split instructions and data, so they're in their own caches. And the way you prefetch each of them is very different. Instructions have branch predictors and branch target predictors, to take a guess as to, okay, the code's going to go this way.
(04:24):
I assume this branch is going to be taken, so I'll fetch instructions from that side of the branch and won't fetch the other way. And if it's wrong, it throws them away, comes back, and fetches the real way. So it only pays the price if it's wrong.
And if it can predict that early enough, it doesn't pay any cost. If it can't predict early
(04:45):
enough, it pays a percentage of the full cost.
PJ (04:48):
I was gonna say, from a statistical standpoint, if you can keep that prediction accuracy high enough, like 90 percent,
Rob (04:58):
and that's where it is.
PJ (04:59):
you effectively are able to guess the right path and therefore proceed unimpeded most of the time. It's only a sort of hiccup if you mispredict every once in a while.
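A quick worked example of that statistical point (the numbers here are illustrative, not from the episode): with 90 percent prediction accuracy and, say, a 20-cycle misprediction penalty, the average cost is 0.10 × 20 = 2 cycles per branch; at 95 percent it halves to 0.05 × 20 = 1 cycle. Small gains in predictor accuracy translate directly into fewer stalled cycles.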
Rob (05:10):
Yep.
And that's the same idea on the data side. So that's the instruction side, and they have their own system for fetching instructions, like I said, guessing where the code flow is going to go. On the data side, it's a whole other prediction system. And initially it was just: have you used this cache line before? Is it in the cache? If so, it's cheap.
(05:31):
And we started adding prefetches. So, you know: I know I'm going to access this data, so I'll issue an instruction that says, prefetch this before I get to it. How far ahead do you prefetch? Because the discrepancy between CPU and memory speed is variable on everyone's machine. You might have high-speed memory, I might have medium-speed memory: same CPU, same code,
(05:51):
same optimization, but the timing changes. So manual prefetching is actually very difficult to implement efficiently, but those instructions exist, have existed for a long time, and did work for a while. And the cool thing about the prefetch instructions is you can give them a hint, like: I am not going to use this again, so prefetch it, but don't leave it in the cache.
(06:13):
So it's not polluting the cache with a line that could be taken by other memory which is more useful to you. So there are some benefits to these non-temporal prefetches, as they're called. So that's where we went initially.
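For a rough picture of what those hint instructions look like from C, here's an illustrative sketch using the GCC/Clang __builtin_prefetch builtin. The 16-element prefetch distance is a made-up number, since, exactly as described above, the right distance depends on the machine.

```c
#include <stddef.h>

/* Sum an array while hinting the prefetcher ahead of the loop.
 * __builtin_prefetch(addr, rw, locality): rw = 0 means "for reading",
 * locality = 0 means non-temporal -- fetch it, but don't keep it
 * resident in the cache, so it doesn't pollute cache for other data. */
long sum(const long *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)                              /* distance is a guess */
            __builtin_prefetch(&data[i + 16], 0, 0);
        total += data[i];
    }
    return total;
}
```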
But then we started to do things like, well, let's have a look at what data's going to be accessed.
(06:33):
And this is where it gets really difficult. Out-of-order processors are needed to get here. You need the prefetcher, the instruction prefetcher, to be way ahead of where the actual execution point is, where the instructions are being retired from. Because that gives the processor a huge window to look ahead at
(06:54):
load instructions and be like: okay, I see a load here, and I've computed the address already, so I can issue this load. And it may be hundreds of cycles early, which takes away that cache load delay. It also may be wrong. And this is where we get into speculation errors and side channel effects on speculation. There is an unspoken contract, which has been violated a lot in
(07:18):
architecture, which says: whatever the hardware does, there should be no visible side effects. So say the processor predicts, I'm going to go down branch A, and it goes down branch A and starts fetching instructions. And those instructions are all load instructions, and it starts issuing cache fetches for those load
(07:40):
instructions. Those loads are now in the cache early enough that they could be used without penalty. But what if that branch was predicted wrong? It doesn't unflush those loads from the cache; it just lets them complete.
So the processor gets to the retirement point of this branch, which it predicted to go down A, but it actually went down B. So now it throws all that work away, as far as executing
(08:04):
instructions goes, but the prefetches are still visible. Now it starts executing down path B, and there are visible side effects from the path it didn't go down. This is the root of the Spectre bug that we had a few years ago: tricking the processor into prefetching things and seeing the side effects of that.
(08:24):
And the way that we see these side effects is by basically probing the cache: seeing how long it takes to access a memory address. If you can guess the address that was fetched, you can then access it from a different thread or a different process, and
(08:44):
see the fact that, oh, this memory was accessed quickly, so it must be in the cache, whereas if it's not in the cache, it takes a long time to access. And then by doing timing, side channel timing effects on the cache, and statistical analysis, you can figure out what the speculation of the processor did. And that caused all sorts of problems back in the Spectre days.
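The probe itself is tiny. Here's an illustrative sketch of the idea in C; real attacks use raw cycle counters like rdtsc plus heavy statistics rather than a single timed load.

```c
#include <stdint.h>
#include <time.h>

/* Time one load with a high-resolution clock. A fast result suggests
 * the line was already in the cache; a slow one suggests it came from
 * DRAM. Repeating this over many addresses and many runs is what turns
 * cache state into a statistical side channel. */
static uint64_t time_load(const volatile uint8_t *addr)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    (void)*addr;                               /* the probed access */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (uint64_t)(t1.tv_sec - t0.tv_sec) * 1000000000u
         + (uint64_t)(t1.tv_nsec - t0.tv_nsec);
}
```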
But those problems, these side effects which are supposed to
(09:05):
be invisible, still exist. The recent Apple problem is related to these.
So going back to memory prefetching, and specifically data prefetching: we talked about, okay, we can see an address that we're going to load from, let's prefetch that. The processors also got a little smarter in spotting patterns. Say you're loading through an array of data structures and
(09:27):
you're accessing memory every 600 bytes. The processor will spot that and be like: okay, I see you're fetching at 600-byte intervals, so I'll just automatically fetch far enough out. Maybe not the next one; I'll skip three and fetch the fourth one. And I'll go from there, and hopefully get this into the cache in time. And obviously the last time around the loop, it might get
(09:49):
that wrong. It might fetch one that you don't need. Again, it should be invisible, and probably isn't if you look close enough.
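As a concrete sketch of the kind of access pattern a stride prefetcher spots, here's an illustrative C fragment; the struct size is picked to land near the 600-byte figure above.

```c
#include <stddef.h>

/* Iterating this array touches memory at a fixed 592-byte stride.
 * A hardware stride prefetcher can spot the pattern after a few
 * iterations and run ahead of the loop -- no hints required. */
struct record {
    long key;
    char payload[584];
};

long sum_keys(const struct record *recs, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += recs[i].key;   /* address advances by sizeof(struct record) */
    return total;
}
```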
So we have these automatic prefetchers, which will fetch data in sequential patterns. We have prefetches you can hint. But what if the data isn't so easy to prefetch?
(10:09):
What if the prefetch destination is dependent on the data itself? For example, a linked list. If you load a linked-list node, unless you actually access the next node in the data, the processor will never see it. And it's totally random where the next node is, so it can't predict anything. So we added things called data-dependent prefetchers.
(10:31):
And what they are is: as you load the node, even before you issue the load for the next node, which is typically too late to prefetch, the processor will look at the cache line that the node's in and go: oh, I see things that look like addresses, and I will prefetch those just in case.
(10:51):
And that's the root of the problem on the Apple processors.
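To make the linked-list case concrete, a small illustrative C fragment (not from the episode):

```c
#include <stddef.h>

/* Walking a list: the address of the next load lives in the data
 * itself, so the load of n->next can't even be issued until the
 * current node's cache line arrives, and a stride prefetcher can't
 * help. A data-dependent prefetcher (DMP) works around this by
 * scanning loaded lines for values that look like valid addresses
 * and fetching them speculatively. */
struct node {
    struct node *next;      /* the next address is itself loaded data */
    long value;
};

long sum_list(const struct node *n)
{
    long total = 0;
    while (n) {
        total += n->value;
        n = n->next;        /* data-dependent: address came from memory */
    }
    return total;
}
```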
Now, Intel do the same thing. They have a data-dependent prefetcher, but it's not as aggressive as the Apple one. The Apple one will basically look at multiple levels and fetch from there, so it's far more likely to have visible side effects.
(11:12):
And in fact, the GoFetch people who found it (GoFetch is the name they gave the security flaw) can't get it to work on an Intel or AMD processor, but they do get it to work on the Apple processors. Another reason is Intel have a lot of controls which can switch these things on and off, and Apple don't seem to have them, at least in the M1 and the M2. Maybe the M3 has some controls.
(11:32):
And it all comes back to this violation of a contract: the hardware isn't doing what the software said.
PJ (11:41):
So what you mean by this is, the hardware is trying to be smart; it's over-eager. For example, like you gave: a data structure that has sort of non-coherent memory, like a list or an STL map or set, where you could be bouncing all over the place chasing pointers. The Apple prefetcher is basically
(12:06):
examining every single one of these nodes, picking out all the memory addresses and going after them, even though you may not actually be exploring that in the software. You may never go down any of those nodes,
Rob (12:17):
It might not actually be an address. It might not be an address, and there's the problem. Obviously there's a lot of logic here. It's not just going, oh, that's an address, I'll fetch it. Because what is an address? It has to have some smarts to be like: okay, this is an address which makes sense for this process, or for what I've seen before, access-wise.
So there are some smarts there. It's just how smart it is, and how many levels of prefetch do you
(12:41):
do? If you fetch the next cache line, do you look at that? Do you look at the next one you fetch? Do you recursively do this? All of these smarts are where the differences between Apple and Intel are.
And Apple was a little over-aggressive, which probably gave them a few percent. And that's all we're talking about: a few percent speed increase. But it breaks this fundamental contract of what the software
(13:02):
and the hardware are doing.
So how does it actually work? What is this, uh, GoFetch doing? It's looking at the side effects of this prefetching, and then crafting fake data to reveal a security key. And you can attack other things with it too, but obviously security keys are the big one.
(13:23):
Is this a big deal? Kind of, yes, because you could have another non-root process running alongside some app that uses OpenSSL, and it could derive the private keys, or the AES key if it's a symmetric system. It takes a long time, but it doesn't take that long, and it doesn't take that much processing power. So if you had
(13:44):
a rogue process on your Mac, you may not know it's there. It's just sitting there trying to guess keys, and it may guess wrong most of the time, but in the end it will get the key. And it seems to be pretty reliable from the tests. I've not seen the code yet, but from the videos of the tests, it seems to be pretty reliable given enough time.
(14:07):
And so what is it? How is it doing this? So encryption code, first of all, you should never write yourself. You should use OpenSSL or something that's been written by somebody who really knows encryption. Because you don't want the code to have different execution
(14:27):
paths based on the bits of the private key. If it does a lot more work for a 1 bit than it does for a 0 bit, then it's really easy to detect by cache analysis, power analysis, thermal analysis; there's a whole bunch of ways to determine how it executed the various bits of the key, and it's easy
(14:49):
to derive the private key, regardless of the data. So you do things like: okay, we'll write code that has no difference in execution time for zero and one bits of the key. And Intel have a whole execution mode where you can do constant-time execution for a whole bunch of instructions, regardless of what the input parameters are.
(15:10):
So, if you're loading from memory, it's this cost. If you're loading from memory plus an offset, it's the same cost. They're all slower instructions; you don't get the maximum performance of each variation of the instruction. They all go to a lowest common denominator, but it makes it real easy to do constant-speed execution. And same thing: you want constant branching.
(15:30):
You want everything to be as constant as possible, regardless of what the input key is, so you can't side-channel attack the key.
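Two standard sketches of what constant time means in code, illustrative C rather than any particular library's implementation: compare every byte instead of exiting early, and select with a mask instead of branching on a secret bit.

```c
#include <stdint.h>
#include <stddef.h>

/* Constant-time comparison: always touches every byte and never
 * branches on secret data, so the timing doesn't reveal where two
 * buffers first differ (memcmp exits early, which leaks). */
int ct_equal(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= a[i] ^ b[i];          /* accumulate differences branchlessly */
    return diff == 0;                 /* 1 if equal, 0 otherwise */
}

/* Constant-time select: pick between a and b based on a secret bit
 * without an if/else that a branch predictor could leak. */
uint32_t ct_select(uint32_t bit, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (bit & 1);   /* all-ones if bit == 1 */
    return (a & mask) | (b & ~mask);
}
```

GoFetch's insight is that even code written this carefully can leak on the M1 and M2, because the prefetcher reacts to the values sitting in memory, not to anything the code does.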
And this is kind of getting into the territory of where the hardware violates that software contract. Because the software went through all this effort to make sure there are no visible side effects. Every instruction takes the same amount of time. There are no branches.
(15:51):
There's no 0/1 variation.
Blah, blah, blah.
There's a whole list of these things that security code implements. But the hardware then just goes and runs wild with it.
They're like, Oh, I'll do this.
I'll do that.
I'll speculate.
I'll prefetch.
And you start to reveal data about the key. And that's exactly what GoFetch does. GoFetch injects fake data on the data side.
(16:16):
So you've got the data side, and you've got the key side. The key side is what you don't know, but the data side you can control. So if you start injecting data, you can control things that look like addresses and whatnot. There's a few hoops to jump through, but basically you're generating fake data to pass through this constant-speed, constant-time, constant-power encryption code, and using the side effects of that fake
(16:38):
data from the prefetcher to guess what the bits of the key were. And it takes time. It takes a while, but it's very effective.
PJ (16:47):
You're taking advantage of the fact this was all supposed to run in constant time. But I'm now going to inject fake data to induce a hardware variation that I can detect, which then gives me information about the underlying key. So I can actually find out information by, again, statistically running this enough times.
(17:07):
Let's talk about order of magnitude of time. Are we talking that it's going to take hours to crack a code, or seconds, or somewhere in between?
Rob (17:13):
It's somewhere in between. Cracking an RSA-2048 key takes on the order of 25 to 30 minutes; a 2048-bit Diffie-Hellman key, 127 minutes. And there's no offline processing on those. When you get to the lattice encryption
(17:34):
schemes, which are supposed to be secure against quantum computers, like Kyber-512 and Dilithium-2 and things like that, it cracks those too. So Kyber-512: 43 minutes, but it also takes about 300 minutes of offline processing time, where the RSA and Diffie-Hellman keys
(17:58):
didn't take any offline time. So none of the keys are safe. It's quite bad. And like I said, it's 26 minutes to crack an RSA key. That's,
PJ (18:10):
It's
Rob (18:10):
that's quite, that's quite quick. And that's how long it takes to be reliable. It may not take that long.
PJ (18:18):
And, and, you know, to drive a point home, this is all the way down in the silicon. Like, any fixes need to be some sort of software fixes on top. I mean, it's, it's in the hardware. There's nothing you can do to fix that.
Rob (18:33):
Yeah, and it's actually quite hard to avoid these side effects. You've basically got to randomize the code such that the side effects aren't visible, or are at least so far down in the noise floor that you can't isolate what the hardware did. And RSA implementations have done this for quite a while. They'll just have random bits associated with the key, and it makes it much harder to analyze what the actual
(18:57):
algorithm is doing. Because from an external point of view, you don't know what the noise is, what the random noise is, and you don't know what the algorithm is. So it makes it very difficult to analyze from side channels. So you have to do that. And it comes at a huge performance hit. And if it's a one-off RSA key, that performance hit, even if it's 10x slower, doesn't
(19:17):
really matter if you're just doing a key here and a key there.
Where it does matter is things like AES. Where, yes, it's a symmetric key, so you're not looking for the private half of the key, you're looking for the key itself, because it's symmetric. But that's used all the time. All of HTTPS, all of the drive encryption, everything that's basically encrypted is ultimately an AES key,
(19:39):
because it's fast to encrypt. So VPNs, all of that is all done with AES. And then that key is protected by an RSA key. So getting the RSA key could give you the actual AES key that's being used. So now you can decrypt the hard drive, you can decrypt HTTPS sessions, and things like that. And that becomes a huge problem.
PJ (19:59):
And to put an order of magnitude on it, did I read correctly in the paper that it's like a 2x difference? So, I mean, it would effectively, you know, cut my speeds in half if I was on the
Rob (20:11):
Not really,
PJ (20:12):
No.
Rob (20:13):
Maybe, maybe not. Because if encryption is your bottleneck, then slowing it down by 2x is a problem.
PJ (20:20):
Yeah.
Rob (20:23):
But encryption and decryption is typically not the bottleneck in these systems; the internet itself is. So even if you slow the encryption down by 2x, are you likely to see the result? Probably not, but maybe in some scenarios. In the grand scheme of things, not a big deal.
I mean, Spectre had all these other things that had to be slowed down.
(20:43):
Some people noticed it; a lot didn't. And I think this is the same.
I think the workarounds are viable to make this attack at least difficult, because the data will be in the noise, and statistically analyzing the back end gets a lot more difficult, to pinpoint what actually happened. And if it's 2x, who cares? It's
(21:05):
much better than having this there.
Like I said, it's 20, 30 minutes to crack a key. You could quite easily have a rogue app running for 30 minutes before you even notice that it's draining your battery or that it's there. How many people have run the process monitor on a Mac to see what's actually running as bare service-level processes that don't have any UI?
PJ (21:23):
And quite honestly, how many people would know what to look for? Because, I mean, there's so many processes
Rob (21:27):
Exactly,
PJ (21:28):
that are running, and it'd be like, oh... unless you're looking each of these things up online, you could easily have a rogue process there.
Rob (21:35):
Exactly. And it doesn't need to be root access. It's just any process. So as long as, if malicious software can launch a process with minimal user rights, then it can crack the keys of all the other processes. That's the big problem right there. And like I said, who knows what's meant to be there and what isn't supposed to be there?
(21:55):
And this was reported to Apple last year, so a lot of the fixes are already in place. OpenSSL will get updated to have a different code path for these machines. But the problem is that Apple don't have any controls over this prefetcher. Intel have some controls, so you can turn it off and things like that. I think the ultimate fix is like the cacheable flag on memory
(22:16):
pages.
When we added memory managers and we added caches, there was always a need for: this memory cannot be cached. You can't access this memory out of order. This memory is a memory-mapped piece of hardware, and reads and writes have to go in the order the processor issues them in. And because this problem was already known about when we
(22:38):
added caches,
PJ (22:40):
right.
Rob (22:40):
we added flags to the memory pages from day one that said: this memory is not cacheable. And basically, if you set that flag, yeah, you don't get cache performance, but you do get in-order memory accesses, which is exactly what we need.
PJ (22:54):
Just to be checking, like: when we say we, you're referring to Intel and AMD chips, right? Because
Rob (22:59):
Intel, AMD, ARM,
PJ (23:01):
M1
Rob (23:01):
has flags on the cache
pages.
PJ (23:05):
and M2 would as well, right?
Rob (23:07):
Oh, absolutely. Every processor since 1985 has had these flags, because there was always a need for memory that wasn't cached. This existed before we added caches. So when we added the caches, it was like: you know what, this is not going to work for all of this VGA hardware, whatever it may be, things that existed back when we first added caches. We realized immediately we need to have flags to switch the cache off.
(23:28):
And it was done on a memory-page basis. It can also be done globally. What we didn't see was these problems showing up. So we didn't add flags to the memory pages to switch off the prefetcher. And we should have. That is one of the better fixes: the processor runs as normal, you just can't prefetch in these pages. You put all your keys and all your encryption in these pages.
The architectural difference between regular prefetching and
(23:51):
these pages is known and documented. And the performance difference between having the prefetch on and having it off is also known. So we can write code to take advantage of the situation that it's in.
PJ (24:03):
And correct me if I'm wrong,
the Intel chips have such a flag
on the memory pages, correct?
Rob (24:08):
They, they do not have it on the memory pages, no. They have it more globally, so you can turn it off. And I think the M3 has a flag that you can turn off. But the Intel one you can access from, I believe, a model-specific register, and you don't need the kernel to do it for you. You can just turn it off, run, and switch it back on again.
PJ (24:27):
Got
Rob (24:27):
The Apple one is a lot more complicated to switch on and off. It requires the kernel to help, et cetera, et cetera. And the early processors don't have that at all. You can't turn it off. At least, as far as we know, there's no architectural way to turn it off.
So I think that's the next thing: to have a 'this memory cannot be prefetched' flag, just like 'this memory cannot be
(24:48):
cached' already exists. This needs to be added.
PJ (24:51):
So the M3 has a control, and that is hard to, to work with because, as you say, it requires the kernel to help. The M1 and the M2 do not. They are basically,
Rob (25:01):
As far as, as far as I know,
PJ (25:03):
is software
Rob (25:03):
you can't turn it off. I assume you can't turn it off at some level, because none of them are on when you power the processor; someone enables them. So I assume at some level it may not be practical to turn them off. Maybe you have to turn off memory management and the caching and everything to get it to go off. Maybe it's controlled by the same flags as the cache. I have no idea how it works at this level, but I assume, like I
(25:26):
said, when it first booted, these things were not enabled,
PJ (25:30):
Right.
Rob (25:30):
and someone enabled them per process, per core, whatever it may be, as the processor boots.
PJ (25:37):
So I think it will be useful to talk a little bit about, you know, you've mentioned that the Intel chips also have data prefetching. I mean, everyone does. But why is this not a news story that is affecting Intel as well?
Rob (25:53):
The Intel one's not, uh, not as aggressive. I believe the Apple one will look at a cache line, see an address, prefetch that, and then prefetch what's in that too, basically doing this whole recursive thing, where the Intel one only does one layer, which makes it very hard to attack.
If you look at the paper, which you can find at gofetch.fail,
(26:14):
PJ (26:15):
We'll post a link.
Rob (26:16):
All the other information is in there. You'll see that they actually analyze access patterns: not just directly looking at what the DMP did, but looking at a whole access pattern across many accesses. And the Apple one, because it's more aggressive, reveals more about itself.
PJ (26:35):
So in terms of a speed difference, then: let's say we had a fictitious knob where the M series could be matched to the Intel aggression, which is not as aggressive. What would be the percentage difference, or, you know, the order of magnitude of the percentage difference, in speed?
Rob (26:55):
Barely, barely any.
PJ (26:58):
Like what?
1 percent less?
Rob (27:01):
I have no idea.
You'd have to look at it; you'd have to simulate it, or have a chip where you could switch it off at various levels, to actually get real numbers. I'm guessing that the more aggressive prefetcher maybe gives them a percent, two percent. It's not going to be much. The prefetcher existing at all only added a few percent over the static-analysis prefetching that processors used to do.
(27:24):
And only in certain scenarios, too. If you're doing the predictable every-500-bytes pattern, that can be prefetched much more easily without data dependency. Prefetching based on the data itself is, like I said, for things like the linked-list example, where you'd need a data-dependent prefetcher because you don't know where you're going. And doing it one level deep is all you need to fetch that.
(27:47):
If you start doing multiple levels, then you start to reveal a lot more secrets. But the whole DMP wasn't a major performance increase; it was just an incremental performance increase. So leaving the DMP there, but taking away some of its more aggressive behavior, wouldn't negate the benefit of the DMP being there. It still has benefit.
(28:08):
So if it was 1 percent overall, I'd be amazed. But those are the numbers we're working with. They re-architect an entire chip to get 2 or 3 percent speed boosts. It's all these 1 and 2 percents on top of each other that give us the performance we have today. And it's all these 1 and 2 percents working together which reveal side channels.
PJ (28:27):
So Rob, I mean, is there a, uh, a macro symptom here: that because we've become collectively shittier programmers and have moved to more maps, sets, dictionaries, things that use kind of this incoherent memory, we've now designed chips basically to kind of enable that more, rather than,
(28:48):
like, hey, use a vector and use it right, and think about how to lay out your memory? Did we back ourselves into this problem?
Rob (28:55):
We have backed ourselves into this problem to some extent by making everything generic. But it's all to enable hardware to be a black box, as software engineers like to look at it. If we go back to the early days, even not that early, even if you go back to like the 2000s... well, let's go to the game consoles. Let's just look at those; they're concrete examples.
(29:17):
The PC has always been on this generic model where the hardware will just figure out what your code does, and you don't have to optimize your code; it'll do it for you. You can optimize for different things along the way. On the Pentium, we used to optimize for the U and V pipes in execution, and then the Pentium Pro went to out-of-order execution. So then you have to optimize for decode bandwidth, because once
(29:38):
it's decoded into micro-ops, it will execute them out of order. You have no control over it.
There are some optimizations you can do at a high level, which is what the compilers do, to kind of encourage the reordering and out-of-order execution to execute efficiently. But it's a very different optimization from what you were doing previously, of counting cycles and U/V pipe scheduling
(30:00):
and things like that.
And this was true on ARM. This was true on everything, until they went out of order. When they go out of order, optimizing becomes real hard. You kind of optimize for register pressure, you optimize for decoder bandwidth, and things like that. And ARM and Intel today are no different, but that's the PC path. And all we're doing now is trying to get the most out of it.
We can prefetch predictable access patterns.
(30:21):
We do data-dependent prefetching to find unpredictable access patterns. It's just the next step along the way, and it just allows code to be code and just work.
If we go back to the PlayStation 2, for example: it had a scratchpad. It was a very fast, cache-performance-level piece of
(30:43):
memory. You could DMA into it, you could DMA out of it, and you would access it with the, uh, with the CPU. So you could preload this thing with CPU instructions, you could preload it with DMA, and then get very fast performance.
So to write real fast engine code on the PlayStation 2, it was critical that you use the scratchpad. And at Insomniac, we
(31:05):
ran the entire engine out of scratchpad. It was only like 16K, but we'd DMA in. Like, we have this list of things that need to be culled.
So rather than go through them in memory and let them load into the cache in the natural way (and this was early days, so it was basically on-demand caching; there was no prefetch), rather than do that, we'd DMA 4K, 8K or something into the
(31:27):
scratchpad, and process it from the scratchpad. And while we're doing that, we double-buffer. So while we're processing buffer A, we're DMAing into buffer B, and then we'd switch between them. We never missed the cache, because we, uh, always had the next buffer available when we got there.
And then we'd size it based on how much work we were doing. How quickly could we go through the list? How big was the list in total?
(31:48):
How many working buffers did we need? If you need an input and an output and a process buffer, then obviously you have to make them smaller, because they all have to fit in this 16K. And you can work out data structures for any scenario which fit in this model.
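A minimal sketch of that double-buffering pattern in C: memcpy stands in for the PS2's asynchronous DMA engine, and the chunk size and process() routine are made up for the example, not Insomniac's actual code.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 4096                     /* illustrative chunk size */

static unsigned char buf[2][CHUNK];    /* two scratchpad-resident buffers */

/* Stand-in for the real per-chunk work (culling, in the example above). */
static void process(const unsigned char *data, size_t len)
{
    (void)data;
    (void)len;
}

/* Walk 'total' bytes of src in CHUNK-sized pieces: while one buffer is
 * being processed, the next chunk is already being brought into the
 * other. On the PS2 the copy would be an async DMA kicked off before
 * process() and synced on afterwards. */
static void run(const unsigned char *src, size_t total)
{
    size_t off = 0, cur = 0;
    size_t len = total < CHUNK ? total : CHUNK;
    memcpy(buf[cur], src, len);                     /* "DMA in" the first chunk */
    while (off < total) {
        size_t next_off = off + len;
        size_t next_len = 0;
        if (next_off < total) {                     /* start the next "DMA" */
            next_len = total - next_off < CHUNK ? total - next_off : CHUNK;
            memcpy(buf[cur ^ 1], src + next_off, next_len);
        }
        process(buf[cur], len);                     /* work on the current chunk */
        cur ^= 1;                                   /* swap buffers */
        off = next_off;
        len = next_len;
    }
}
```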
Linked lists are still really difficult, because you can't DMA the next node until you've seen the current node. But you can
(32:09):
put the whole list in a small block and then move the whole thing into scratchpad and then work it that way. It requires programmers to rethink all of the classical data structures, to be like: how can we run this from scratchpad?
And then you get to the PlayStation 3, which had the SPUs. A lot of what we'd learned on the PS2 already applied to the PS3, because we already had the mindset of making these data
(32:30):
structures which could be DMA'd. The PS3 was slightly more difficult in that it wasn't the same processor doing the work. On the PS2, at least, it was the EE core itself that was accessing scratchpad. You were just doing DMA, effectively pre-filling a cache in a way that made sense to you. So you controlled how you filled the scratchpad, versus the cache hardware controlling how it filled the cache.
(32:52):
And you could basically eliminate all stalls, because we could make data that fits. The classical data structures were not designed to work in these constraints, but: I know what I'm accessing, where, and when, so I'll prefetch my data.
The PC has always relied on the fact that it'll just figure it out, and how well it figures it out is based on how well you
(33:12):
write the code. So today you'll get more performance from a PC by semi-optimizing your prefetch bandwidth and semi-optimizing your access patterns. But you don't have to take it to the extreme that the early machines did.
And, yeah, I do agree. We have become shittier programmers, because we just write generic code for generic architecture. And Intel and ARM
(33:34):
are both in the same boat. They're both the same thing; RISC versus CISC has been gone for years. Anybody who argues, oh, it's a RISC chip, so it's better than an Intel chip, is an idiot. The only real difference between the two chips is the decoder. Everything behind that is all state-of-the-art out-of-order architecture.
PJ (33:54):
Let's double-click a little bit more on the Intel side here. So why didn't Intel fall to the same issues? We've talked about it being less aggressive, but I think there's a little more to it.
Rob (34:03):
That's really it. Less aggressive makes it statistically more difficult to predict those patterns. Because, like I said, it's statistically analyzing what the prefetcher does. And if you're less aggressive, it's still revealing data. It's still filling the cache. It's just maybe impossible to analyze, or needs more aggressive
(34:27):
examples to be able to analyze.
PJ (34:30):
The signal-to-noise ratio is unfavorable in that case.
Rob (34:33):
Yeah.
So they're not saying it can't be done. They're just saying that this code can't do it.
PJ (34:37):
Right. You mentioned to me an interesting fact, which is that Intel will publish what they're going to do way ahead of time, in a way that almost invites folks to critique their designs. And that seemed really interesting to me. I think that was an interesting difference, of like: hey, come take a look at it, security folks, see what we could be doing wrong.
(35:00):
And we'll take some feedback on these things.
Rob (35:02):
Absolutely. Intel have always been really good at doing that. They don't always publish upfront what they're going to do, obviously for competitive reasons. But they do to certain people, and they also have lots of white papers on the general idea of what a data-dependent prefetcher is.
PJ (35:18):
Right.
Rob (35:21):
DMPs, data memory prefetchers, they go by various names. They have a white paper on how they work and what they do. People will comment on the white paper and say, like: oh, this could be a big problem, or it's violating this or violating that. So in theory, that's how it works, and the implementation will be different from the initial white paper due to the feedback.
(35:41):
And Intel have always had a lot of bells and whistles, like switching things on and off. So I'm not surprised that Intel could turn those off a lot easier. They also publish white papers afterwards on the effect it had. You can see, like, exactly how it works: the performance increase, whether it got the performance increase that it expected.
Apple don't do any of this.
PJ (36:01):
Yeah,
Rob (36:01):
We don't even know how Apple's DMP works. It's like, we have no idea. Nobody has been in there. We assumed it was there, because they don't tell us anything. It's like: oh, it's better. Better by whose standards?
PJ (36:13):
There's that culture of wanting to create a big splash, and its secrecy culture, right?
Rob (36:18):
Yeah.
The secrecy culture at this level, for security, cannot exist. They need to be more open about what they're doing, and not rely on people basically side-channel attacking it to see what it's doing. And this is the same for, like: how big are the reorder buffers in the Apple chips? They never tell you. Like, how many working registers does it have?
(36:39):
They don't tell you. People figure it out by just writing creative sequences of code and seeing how the performance changes.
Again, it's a side-channel attack. It's not a direct measure. It's like: if I write code in this really academic, very inefficient (potentially, most likely) way, but I can see some measurable difference for when it runs out of internal renaming
(37:03):
registers, then I can start to figure out how many of those registers there are. Apple could just tell us. We can figure it out anyway, so they're not saving themselves anything. All they're doing is delaying people, and in figuring it out, people figure these other things out too. So
PJ (37:19):
Right.
Rob (37:19):
they just need to come clean. I've said it before, I'll say it again, and I'll always say it: Apple suck in that they don't document anything.
PJ (37:25):
I'm curious how much of this you think is really the culture and the hubris. And this isn't just Apple; this is the big tech problem.
Rob (37:32):
Oh, it's the culture, full stop. It's the culture in Apple, full stop. It's like: oh, we'll add one of these, we'll do this, we'll do that. And a lot of the people in Apple probably don't know what they're doing, never mind anybody external. And external security folks are about the last on the list that Apple want to talk to.
PJ (37:50):
It's fascinating, because obviously a lot of the appeal of Apple is the vertical integration, the hey, we're secure. Right? It's this closed system. But it's a really interesting example of how this closed system actually creates these security issues. Which is driven by the secrecy. It's driven, probably, by managers who want to have it all.
(38:10):
Oh, I'm 1 percent faster than Intel. Like, ah, that'll get me my, my promotion next year.
Rob (38:17):
Definitely some of that there, for sure. But security at any level cannot be done through secrecy. And if there's any effect on security, in how it executes or what it executes, it has to be fully understood before it's released. And that goes to things like leaking speculative data. Things where you think, like: oh well, I didn't use this code
(38:38):
path, so it doesn't matter. That was the mindset of the original speculation: we'll just throw that work away. But there are some visible side effects, which can reveal secrets if used correctly, or incorrectly. And that's the problem.
I mean, one option could be: you prefetch everything into a second cache that was just for prefetching, and you had no other
(38:59):
access to it. All it did was prefetch into this cache, and then if you used it, it transfers to the real cache, which could be done real quick. That would solve, that would solve a lot of prefetching problems, of speculation-attack problems.
PJ (39:12):
That's a hardware solution.
Rob (39:14):
That's all, that's a hardware fix. And it's a cache that you could otherwise be using as general cache, which would make everything slightly faster. But now you've got this huge cache which is just sitting there as a prefetch buffer, which you have no internal access to. But things like that, architectural differences from what we currently do, would fix some problems. They may also reveal other problems. Is it worth it?
(39:36):
Is it worth having another megabyte of prefetch buffer, where we could instead have a two-megabyte cache and everything gets faster? And we'll have some of the visible side channels no matter what we do: over and above directly executing the instructions as they are, it's going to have something that could be attacked.
PJ (39:50):
But, again, as in the difference with the Intel example: whether or not those statistics rise to a level of detection depends upon, like, how you actually approach this. So, I mean, Intel gets away with it by virtue of the fact that it's not as aggressive.
Rob (40:06):
Yeah.
By being less aggressive, they avoided some of these problems. And it's, it's a case of: when they first found Spectre, they found it on one processor, I think, and then, oh, Intel have the same problem. Oh, ARM have the same problem. Because everybody was doing the same speculation with the same side effects. And then it's like, oh, you know, well, all modern
(40:26):
processors are affected by Meltdown and Spectre. A lot of different fixes were put in for Spectre mitigation, things like changing the way speculation worked. But again, that was new hardware, not old hardware, although the old hardware still has the same problems. You could change software to mitigate it. You couldn't prevent it, but you could mitigate it. And we kind of moved on from it. I think this will be the same thing.
(40:47):
New hardware won't have these problems. It'll either have controls or it'll have less aggression, and software will get changed to make it so it's statistically next to impossible to get the keys, and we'll move on. It'll just be a blip in history. But it could have been prevented, to some extent, if Apple would have been more open as to how their system works. Like, we still don't know how it works.
(41:09):
I mean, we haven't reverse engineered it. We don't know that Intel aren't doing what Apple's doing, and we don't know why Intel is immune to this particular attack but Apple isn't. We don't have any architectural details as to what they're actually doing. And we don't know what the trigger is. We assume it's Intel being less aggressive and not doing the recursive dependent lookups, but we don't know.
(41:30):
All we know is, statistically, we can't get the data out of Intel, but we can get it out of Apple. It'd be nice if we could have an architectural review of both and see what the trigger is: okay, this one tiny feature, probably an insignificant feature, is what's making Intel immune and Apple not. We don't know, because we've always had to reverse engineer how they both work.
both work.
PJ (41:51):
From a cultural standpoint,
is Apple an outlier or is it
like everyone else?
Meaning like we've talked abouthow it's different from Intel,
but you know, Nvidia, AMD arm,like, do they tend to be more
like Intel in terms of theirhere's how it works or more like
Apple in terms of no, trust us,we've closed it all off or
somewhere in between.
Rob (42:09):
AMD is usually the most
open of everybody.
Fact that they have open sourcevideo drivers, they started to
open source the firmware insidethe GPU.
They started to open source lotsof little bits that you would
normally not have access to.
I think AMD's ultimate goal isto have an open source from
first instruction.
Everything running could be opensource.
(42:32):
And that includes like boardmanagement and firmware and.
BIOS and everything.
So AMD have all traditionallyalways been a lot better.
I think Intel, if you sign thecorrect NDAs and you have
platform integrator or a systemvendor, then you get that same
level of access, but not opensource.
PJ (42:49):
Hmm.
Rob (42:50):
do give you that access,
just not in an open source way.
AMD tend to, A, they have thesame NDAs too.
If you need access to stuff,that's not open source.
But AMD is going down the pathof we'll just open source all of
it in the end.
And there's legal issues, legalreviews and things like that.
I assume is what stops them justgoing, fuck it, open source
everything.
AMD is traditionally a lotbetter.
(43:11):
Intel do give you the accesswith the correct, NDAs and, uh,
licensing.
Apple, zero.
PJ (43:16):
Got it.
the flaw is present on theiPhone, but there's no attack
vector because you can't get thesame access that we know of,
correct?
Rob (43:26):
Potentially, it's
definitely there.
I mean, the M series and the Aseries are kind of the same
cause.
just to have that package in theSOC is different speeds and
things like that are different.
Cash sizes of caches aredifferent, but front
architecturally, the samethings, the firestorm, ice
storm, low performance, highperformance calls are the same
between the M's and the A's.
So yes, this problem does existon the iPhone Debatable whether
(43:48):
the attack vector exists.
It's hard to run backgroundprocesses on
PJ (43:51):
On iOS,
Rob (43:52):
a.
PJ (43:52):
yeah.
Rob (43:54):
On the iPhone, it's hard to
just know what it's doing at any
given time.
Like, is it running right now?
Is it not running right now?
Where on OS X, it's a lot easierto do that.
So if you can get the apps torun at the same time, the same
problem exists on iPhone.
It's just the vector is a lotmore complicated due to the user
interface and backgrounding oftasks and things like that.
But in theory, it does exist.
PJ (44:15):
Got it.
Well, folks, I think that's agood lesson.
Watch out for what you'redownloading.
recognize that issues exist.
This is just a part ofcomputing.
It's been a part of computingand it's a balance back and
forth between performancesecurity and, know, in many
(44:35):
ways, the hubris side of things.
Rob (44:37):
And I will also throw out
that, The people who found the
problem did do this responsibledisclosure to Apple.
They told Apple about this inDecember of 23, 107 days before
it was publicly released.
So Apple did have time to startimplementing mitigations.
So by the time this hit thepublic, a lot of this was
(44:57):
already mitigated, not fixed.
PJ (45:00):
But mitigated.
Rob (45:01):
they are definitely still
attack vectors, but some of them
will have been removed.
The most common uses of theattack vector will have been
mitigated.
So it's not as dangerous as itsounds, but it's still a big
problem.
PJ (45:15):
Hopefully by the M4, we'll
have it all fixed, right?
Rob (45:17):
Well, I assume they're
going back now and redesigning a
lot of the M4 to fix thisproblem, or at least have page
memory, page controls on offcontrols, constant time
execution, whatever it may bethat they're going to have to
add to it.
To get some of this to work.
and M4 might be too close.
It might be M5.
That's going to have changesbecause M4 would have already
been laid out by now.
PJ (45:38):
a, that's going to be an
interesting question, I think.
Uh, do you think, Rob, this isactually enough to either delay
the M4?
prompt a change in culture to beif not completely open, but more
open, maybe on the level ofIntel NDAs.
Rob (45:56):
Never going to happen.
Never going to happen.
Apple are never going to beopen.
It's just not in their culture.
They just don't know how to doit.
PJ (46:04):
I find it fascinating maybe
on a, on a long term scale then,
because it really is almost aconflict in my mind between this
desire for vertical integrationthat Apple has been going after
forever, and as well as themtouting themselves to be the
most superior in security.
It feels like one of those twothings are going to have to Oh,
(46:25):
plus the secrecy.
It seems like one of those threethings is going to have to
break.
And in this case, it broke thesecurity side.
Rob (46:32):
Yeah, it's they, I just
wish they were more open, more
analysis and more investigationsinto what they're doing would
put people's mind at risk.
Cause we know like a lot of thisgood security is from the fact
that they don't documentanything.
And it's not that any moresecure than anybody else.
Once you dive into the details,it's just getting access is
difficult, which is a level ofsecurity.
(46:55):
It shouldn't be the ultimatelevel of security.
PJ (46:57):
The old security-by-obscurity line.
Rob (47:00):
Yep.
Don't do it.
I mean, it helps, but don't rely on it.