October 12, 2025 • 19 mins
This episode discusses artificial intelligence and deep learning techniques applied to malware analysis and detection, as well as other cybersecurity challenges. It covers various neural network architectures, like MLPs, CNNs, RNNs, LSTMs, and GANs, and their effectiveness in tasks such as classifying malware families, identifying malicious URLs, and detecting anomalies in network traffic or system logs. The papers also explore methods for feature extraction from malware binaries, including static and dynamic analysis, and how adversarial examples can challenge these detection systems. Furthermore, they address the use of AI for troll detection on social media platforms and image spam classification.

You can listen and download our episodes for free on more than 10 different platforms:
https://linktr.ee/cyber_security_summary

Get the Book now from Amazon:
https://www.amazon.com/Malware-Analysis-Artificial-Intelligence-Learning/dp/3030625818?&linkCode=ll1&tag=cvthunderx-20&linkId=c97d080f094227b8fb921fea640e5e56&language=en_US&ref_=as_li_ss_tl

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
Welcome to the Deep Dive, the show that digs
through stacks of sources to give you the key takeaways,
making sure you're well informed. And today, wow, we are
plunging into a pretty intense digital battlefield. The stakes are
incredibly high. We're talking about malware, you know, that nasty
software designed purely to disrupt, damage, and steal. And the scale,
the sheer scale of this problem, is just staggering. Get this.

(00:22):
Every single day, something like three hundred and fifty thousand
new instances of malicious software are detected. Just think
about that number for a second. And back in twenty eighteen,
over six hundred and sixty nine million new variants were
spotted in that year alone. This isn't just annoying pop ups.
It's a huge financial hit. Businesses were spending on average
two point four million US dollars back in twenty eighteen

(00:42):
twenty nineteen just fighting malware and web attacks. So our
mission for this deep dive is to really get into
how cutting edge artificial intelligence, specifically deep learning, is being
used as a crucial line of defense. We want
to explore how these intelligent systems are learning, adapting, maybe
even predicting threats that the old ways just can't catch.
It's not just about spotting the known bad guys anymore, right,

(01:03):
it's about anticipating the unknown, the brand new stuff.

Speaker 2 (01:06):
That's precisely it. You know, for years cybersecurity really leaned
heavily on what's called signature based detection. You could think
of it like having a huge photo album of known criminals.
It's great for recognizing malware we've already seen and fingerprinted,
very efficient for that. But its big weakness, its Achilles heel
really, is the zero day attack.

Speaker 1 (01:28):
The infamous zero days? Exactly.

Speaker 2 (01:31):
These are completely new malware variants never seen before. They
don't have a signature, no photo in the album to match.
And that's exactly where AI and deep learning are stepping in.
They use much more sophisticated methods like looking at dynamic
behavior to spot malicious intent, even if the code itself
is brand new.

Speaker 1 (01:46):
Okay, let's unpack that a bit, starting with, like, the
raw materials. How do we even study malware? I gather
there are two main ways, static and dynamic analysis.

Speaker 3 (01:55):
That's right.

Speaker 2 (01:56):
Static analysis is, well, like examining a suspicious package without
actually opening it. You're looking at the code itself without
running it: things like the library calls it might make, text
strings inside it, byte sequences, maybe the sequence of API calls
it seems designed to make. Signature based detection mostly
uses this static data, but as we said, it totally

(02:17):
misses new malware because there's no existing signature. Right, no
mugshot. Exactly. And then you have dynamic analysis. This is
where you actually detonate the malware so to speak.

Speaker 1 (02:28):
You run it? Sounds risky.

Speaker 2 (02:30):
Well, you run it, or emulate it, in a very
controlled environment, a sandbox usually, and you watch what it does.
So you track the actual API calls it makes, how
it interacts with the system, maybe even low level hardware
events. For unknown malware, seeing its behavior, what it actually
does, is absolutely critical. It's not just about its blueprint.

Speaker 1 (02:48):
But its actions. Makes sense. And I heard some people
are even combining them like a hybrid approach.

Speaker 2 (02:53):
Yes, absolutely. Hybrid analysis tries to get the best of
both worlds, looking at both the static structure and the
dynamic behavior to build a more complete picture.

Speaker 3 (03:02):
Things like MalDNA try to do this.

Speaker 1 (03:04):
So you mentioned API calls and other things you look for.
These are the features, right, the specific clues? Precisely.

Speaker 2 (03:09):
Features are the specific characteristics we extract, and API call
sequences are incredibly valuable. Why? Because they directly show what
a program is trying to do: interact with files, connect
to the network, modify the system. API calls reveal

Speaker 1 (03:24):
That. Ah, okay. And the key.

Speaker 2 (03:26):
Insight here is that the order of these calls often
screams malicious intent. Think about it: opening a file, encrypting it,
then deleting the original. That sequence tells a very different
story than just opening and reading a file.

Speaker 1 (03:38):
Yeah, it definitely sounds like ransomware. Exactly.

Speaker 2 (03:41):
So researchers use techniques like n grams, which is just
a fancy way of saying they look at short ordered
sequences of calls, like pairs or triplets to capture this
vital order information. Opcode sequences are another important feature too.
Those are the really low level machine instructions giving insight
into the program's core functions.
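
The n-gram idea is easy to sketch. Below is a minimal Python illustration counting ordered pairs of calls in a made-up API trace; the trace and its ransomware-flavored sequence are invented for the demo, not taken from a real sample.

```python
# Toy n-gram feature extraction over an API-call trace.
# The trace below is invented; real traces come from a sandbox log.
from collections import Counter

def ngrams(seq, n):
    """Count every length-n ordered window in seq."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

trace = ["CreateFile", "ReadFile", "CryptEncrypt", "WriteFile", "DeleteFile"]
bigrams = ngrams(trace, 2)

# An ordered pair like (CryptEncrypt, WriteFile) followed by
# (WriteFile, DeleteFile) is exactly the kind of sequence a
# classifier can learn to weight as ransomware-like.
print(bigrams[("CryptEncrypt", "WriteFile")])   # 1
```

A detector would then turn these counts into a feature vector, one dimension per observed n-gram.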

Speaker 1 (04:00):
So how do analysts actually get this data? What tools
are they using?

Speaker 2 (04:03):
Ah, there's a whole toolkit. For static analysis, you have
disassemblers and debuggers like IDA Pro or OllyDbg. They let
you peek inside the compiled code, see the assembly instructions,
extract op codes, potential API calls. And for

Speaker 1 (04:16):
The dynamic side, the sandbox stuff right.

Speaker 2 (04:19):
Tools like API Monitor are used to track those API
calls live, but you usually need to run the malware
inside a virtual machine or sandbox to contain it. Buster
Sandbox Analyzer, BSA, and similar tools like CWSandbox are
designed for exactly that. They run the malware safely and
log everything it does, file changes, network connections, API calls.

(04:39):
There are even more advanced tools like Ether, which use hardware virtualization.
They kind of sit outside the operating system the malware
is running in, making them much harder for the malware
to detect.

Speaker 1 (04:49):
Okay, this is fascinating. So you've got all this raw data,
API sequences, op codes, behaviors. Now how do you actually
feed this into an AI? How does the machine see
the malware?

Speaker 3 (05:00):
Well, this is.

Speaker 2 (05:00):
Where some really creative approaches come in. One of the
most surprising ones is malware visualization.

Speaker 1 (05:06):
Visualization you mean like charts and graphs.

Speaker 2 (05:08):
No, literally turning the malware code, the binary file itself,
into an image, usually a grayscale image.

Speaker 1 (05:15):
Wait, what turning code into a picture? How does that
even work?

Speaker 2 (05:19):
Or why? It sounds bizarre, I know, but researchers found
that malware samples from the same family, even if they
look different in code, often end up having similar textures
and structural patterns when you represent their binary data as pixels.

Speaker 1 (05:32):
In an image like a visual fingerprint.

Speaker 2 (05:34):
Kind of, yeah. Kindred attributes, as some call it. And
the brilliant part is this lets us use incredibly powerful
deep learning models that were originally designed for image recognition.

Speaker 1 (05:44):
You mean, like the AI that recognizes cats in photos?

Speaker 2 (05:47):
Exactly. Convolutional neural networks, or CNNs. They're designed to find
patterns in images: edges, textures, shapes, increasingly complex features. So
by turning malware into an image, we can train a
CNN to spot the visual hallmarks of malicious code,
even if it has no obvious image component itself. It's
surprisingly effective.
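
The binary-to-image trick itself is only a few lines. Here's a sketch: the fixed width of 16 and the stand-in "binary" are arbitrary choices for the demo; real pipelines pick the width from the file size and feed the result to a CNN.

```python
# Interpret a file's raw bytes as grayscale pixel intensities and
# reshape them into a fixed-width 2-D image.
import numpy as np

def bytes_to_image(data: bytes, width: int = 16) -> np.ndarray:
    """Pad the byte string and reshape into a (height, width) uint8 array."""
    buf = np.frombuffer(data, dtype=np.uint8)
    pad = (-len(buf)) % width            # pad so the length divides evenly
    buf = np.pad(buf, (0, pad))
    return buf.reshape(-1, width)

fake_binary = bytes(range(256))          # stand-in for a real executable's bytes
img = bytes_to_image(fake_binary)
print(img.shape)                         # (16, 16)
```

Samples from one family tend to share texture in these images, which is what the CNN picks up on.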

Speaker 1 (06:07):
Wow. Okay, that's pretty cool. So CNNs for the image approach.
What other AI tools are in the box?

Speaker 2 (06:12):
Well, for data that's sequential, where the order is crucial,
like those API call sequences or op code sequences we
talked about, there are different architectures. Recurrent neural networks, or
RNNs, are designed specifically for sequential data. Okay, and within
RNNs, variants like LSTMs, long short term memory networks, are
really powerful. They have mechanisms to remember information over longer sequences,

(06:36):
which is perfect for tracking complex behaviors that unfold over.

Speaker 1 (06:39):
time. So they can connect an early action with

Speaker 3 (06:40):
A later one. Precisely.

Speaker 2 (06:42):
LSTMs are actually quite successful commercially. Another popular variation is
the GRU, or gated recurrent unit, which is a bit
simpler than an LSTM but often performs just as well. Both
LSTMs and GRUs have shown really significant improvements in detecting malware,
even things like spotting cybersecurity events based on, say, patterns
in social media messages over time.
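
To show what the gating actually does, here is a single GRU step written out in NumPy. This is a sketch with random, untrained weights and arbitrary sizes; a real detector would learn these weights from labeled call sequences.

```python
# One GRU step in NumPy: two gates decide how much of the old hidden
# state to keep and how much of a new candidate state to write in.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                       # input and hidden sizes (arbitrary)
Wz, Wr, Wh = (rng.normal(size=(d_h, d_in)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(d_h, d_h)) for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h):
    z = sigmoid(Wz @ x + Uz @ h)       # update gate: how much to rewrite
    r = sigmoid(Wr @ x + Ur @ h)       # reset gate: how much history to use
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_cand    # blend old state with the candidate

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # run a 5-step input sequence
    h = gru_step(x, h)
print(h.shape)                         # (3,)
```

The final hidden state summarizes the whole sequence and would feed a small classifier head.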

Speaker 1 (07:03):
Interesting. Any other architectures?

Speaker 2 (07:05):
Definitely. There are residual networks, or ResNets. Their key innovation
is allowing the network to learn identity mappings, basically letting
the signal skip layers if needed. This helps train much
deeper networks without running into problems like vanishing gradients, where
the signal gets too weak to train the

Speaker 3 (07:22):
Early layers effectively.

Speaker 2 (07:23):
It's kind of inspired by how neurons connect in the brain.

Speaker 1 (07:26):
Deeper networks mean potentially learning more complex patterns.

Speaker 3 (07:29):
I guess? That's the idea.

Speaker 2 (07:31):
And then there are GANs, generative adversarial networks.

Speaker 1 (07:35):
These are fascinating. Adversarial sounds intense.

Speaker 2 (07:38):
It is, in a way. You have two networks competing.
A generator tries to create fake data like fake malware samples,
and a discriminator tries to tell the generator's fakes apart
from real.

Speaker 1 (07:49):
Data. Like a game of cat and mouse?

Speaker 2 (07:51):
Exactly, a minimax game. The generator gets better at
fooling the discriminator, and the discriminator gets better at spotting fakes.
The really exciting part about GANs is their potential for
things like zero day malware detection, because the generator might
create novel malicious patterns, or we can even use them
in the lab to generate challenging new threats to test

(08:11):
our defenses before similar things appear in the wild. It's
like a digital.

Speaker 1 (08:15):
Sparring partner. Proactive defense, I like that. What about understanding
the words of malware like op codes or API calls?

Speaker 3 (08:22):
Ah?

Speaker 2 (08:22):
Yes, that's where word embedding techniques come in, like word2vec,
or even approaches based on hidden Markov models,
like HMM2Vec. The core idea is similar to
how language models understand words and sentences. You treat op
codes or API calls as words. These techniques learn to
represent these words as numerical vectors in a high dimensional

Speaker 1 (08:42):
space. Vectors, like points on a map?

Speaker 2 (08:44):
Sort of, yes. And the key is that words used
in similar contexts, like API calls that often appear together
in malicious sequences, end up closer together in this vector space.
word2vec, for example, trained with just a shallow
neural network, can capture really meaningful relationships. It learns the
meaning or function of an op code from how it's

(09:05):
used alongside others, so.

Speaker 1 (09:06):
It groups similar functions together automatically.

Speaker 2 (09:09):
Essentially, yes, it captures semantic relationships. There are others too, briefly,
like extreme learning machines, or ELMs. These are super fast
because they don't use the typical backpropagation training method, solving
linear equations instead.
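
The "words in context" idea can be made concrete by generating skip-gram training pairs from an opcode stream. The opcode fragment below is made up; a real pipeline would feed these (center, context) pairs to the shallow network that learns the vectors.

```python
# Build word2vec-style skip-gram pairs from an opcode sequence:
# each opcode is paired with its neighbors inside a context window.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

opcodes = ["push", "mov", "call", "test", "jz"]
pairs = skipgram_pairs(opcodes, window=1)
print(pairs[:2])   # [('push', 'mov'), ('mov', 'push')]
```

Opcodes that keep appearing in the same contexts end up with nearby vectors, which is exactly the "similar function, similar position" effect described above.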

Speaker 1 (09:22):
Wow, okay, so it's a really diverse AI toolkit. CNNs
for images, RNNs for sequences, GANs for generating challenges, embeddings
for meaning.

Speaker 2 (09:31):
Exactly, they're not just generic algorithms, they're specific tools honed
for different facets of the malware problem. Each has its
strengths depending on the data and the goal.

Speaker 1 (09:39):
Right. It's like having different kinds of sensors and analyzers.
So let's talk about where this is actually being deployed.
Where are these AI techniques making a real difference on
the front lines?

Speaker 2 (09:48):
Good question. A huge area is Android malware detection. Think
about it, billions of smartphones out there. It's a massive target.

Speaker 1 (09:56):
Yeah, my phone feels like my life sometimes, right.

Speaker 2 (09:59):
So AI systems analyze Android apps using static, dynamic, or
hybrid methods. They look for suspicious API calls an app
shouldn't need, like ptrace for debugging other processes, or
mkdir to create directories unexpectedly, or connect for unusual network activity.
They also flag risky permission requests. Does that simple game

(10:19):
really need the SEND_SMS permission, or READ_CONTACTS, or SYSTEM_ALERT_WINDOW
to draw over other apps? AI learns the
patterns of legitimate apps versus malware.
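
A toy version of that static feature extraction might look like the following. The "risky" permission and call lists and the sample app are illustrative choices for the demo, not a vetted feature set.

```python
# Binary indicator features for a few permissions and calls an
# innocuous app usually shouldn't need.
RISKY_PERMS = ["SEND_SMS", "READ_CONTACTS", "SYSTEM_ALERT_WINDOW"]
RISKY_CALLS = ["ptrace", "mkdir", "connect"]

def feature_vector(perms, calls):
    return [int(p in perms) for p in RISKY_PERMS] + \
           [int(c in calls) for c in RISKY_CALLS]

# A "simple game" requesting SMS permission looks suspicious:
vec = feature_vector(perms={"SEND_SMS", "INTERNET"}, calls={"connect"})
print(vec)   # [1, 0, 0, 0, 0, 1]
```

A classifier trained on many such vectors learns which combinations separate legitimate apps from malware.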

Speaker 1 (10:27):
That makes sense. What about newer areas? I keep hearing
about smart cars and potential hacking.

Speaker 2 (10:31):
That's a critical emerging frontier: connected vehicle security, part of
intelligent transportation systems, or ITS, right? Modern cars are basically computers
on wheels, packed with sensors and embedded devices, communicating wirelessly,
V2V, vehicle to vehicle, V2I, vehicle to

Speaker 1 (10:46):
Infrastructure. Which means more attack surfaces.

Speaker 2 (10:49):
Exactly, and the risks are serious. Denial of service, DoS,
or distributed denial of service, DDoS, attacks could cripple communication.
Imagine jamming traffic safety messages or preventing cars from coordinating
at intersections.

Speaker 1 (11:02):
That sounds potentially catastrophic.

Speaker 3 (11:05):
It could be. So

Speaker 2 (11:06):
AI is being developed to monitor the complex network traffic
in and around vehicles, looking for anomalies, communication patterns that
indicate jamming, spoofing, or attempts to compromise vehicle systems.

Speaker 1 (11:17):
Okay, cars, phones, What about the cloud? So much runs
there now?

Speaker 2 (11:22):
Absolutely, cloud infrastructure protection is vital. A major threat is
malware injection into virtual machines, VMs. Because cloud platforms often
automatically provision lots of similar VMs, if one type gets compromised,
malware can potentially spread very easily to others configured the same

Speaker 1 (11:38):
way. Like an infection spreading through identical twins.

Speaker 2 (11:41):
A good analogy. AI techniques, sometimes even simpler machine learning
like k-nearest neighbors or local outlier factor, can monitor the
hypervisor, the software managing the VMs. They look at performance metrics:
CPU load, memory usage, network I/O. Anomalies in these patterns
can indicate a VM has been compromised and is doing
something malicious.

Speaker 1 (12:02):
Like a fever chart for the VM.

Speaker 2 (12:03):
Kind of, yeah, though it can be less effective against
low and slow malware that tries very hard to hide
its activity and not cause obvious performance spikes.

Speaker 1 (12:11):
Right, stealthy attacks. What about just general network defense, like
intrusion detection systems.

Speaker 2 (12:17):
Yes, IDSs are a classic battleground where AI is making inroads.
Instead of just relying on known attack signatures, AI can
perform anomaly detection on system event logs, think database logs,
operating system logs. AI models, particularly autoencoders, can learn
what normal activity looks like for a specific user or

Speaker 1 (12:34):
system, establishing a baseline. Exactly.

Speaker 2 (12:37):
Then any significant deviation from that learned normality gets flagged
as suspicious. It might be an attacker trying to escalate
privileges or moving laterally through the network. Some systems even
use hybrid approaches, maybe combining deep learning like autoencoders
for complex dependent data with traditional machine learning like support
vector machines for simpler independent data, like timestamps.
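
The reconstruction-error idea behind autoencoder anomaly detection can be sketched with a linear stand-in: a rank-2 PCA projection plays the role of a deep autoencoder's bottleneck, and a record is flagged when it reconstructs worse than anything in the learned baseline. All data here is synthetic.

```python
# Flag log records whose reconstruction error under a model of
# "normal" activity exceeds everything seen in the baseline.
# A rank-2 linear autoencoder (PCA via SVD) stands in for a deep one.
import numpy as np

rng = np.random.default_rng(2)
normal = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6)) * 0.1

mu = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mu, full_matrices=False)
W = Vt[:2]                                    # 2-D "bottleneck"

def recon_error(x):
    z = (x - mu) @ W.T                        # encode
    return np.linalg.norm((x - mu) - z @ W)   # decode and compare

threshold = max(recon_error(r) for r in normal)
weird = np.full(6, 50.0)                      # far outside the baseline
print(recon_error(weird) > threshold)         # True
```

A deep autoencoder works the same way conceptually, but learns a nonlinear encoder and decoder instead of a linear projection.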

Speaker 1 (13:00):
Different angles. And what about something seemingly simpler, like spam?

Speaker 2 (13:03):
Ah, but spam gets clever too. Image spam is a
big one. Spammers embed their malicious messages or links inside images,
specifically to bypass text based filters.

Speaker 1 (13:14):
Oh right, so the filter doesn't see the text. Correct.

Speaker 2 (13:17):
But AI, especially CNNs again, often combined with transfer learning
models like VGG-19, which are pre trained on millions
of images, can fight back effectively. They don't just read text.
They analyze the image itself: its metadata, like height and width,
color statistics, mean color, skewness, texture patterns, even shapes detected
using edge filters. They learn the visual characteristics of spam images.

Speaker 1 (13:39):
So the AI sees the spamminess in the image itself.
That's clever.

Speaker 2 (13:44):
It shows how AI can tackle threats designed to evade
older methods.
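
Some of those hand-crafted image features are simple to compute. Here is a sketch on a synthetic grayscale "image"; the CNN features would be learned, but metadata and pixel statistics like these are often appended alongside them.

```python
# Height/width metadata plus simple pixel statistics of the kind
# used alongside learned CNN features for image-spam detection.
import numpy as np

def spam_features(img):
    flat = img.astype(float).ravel()
    mean, std = flat.mean(), flat.std()
    skew = ((flat - mean) ** 3).mean() / std ** 3 if std else 0.0
    return {"height": img.shape[0], "width": img.shape[1],
            "mean": mean, "skew": skew}

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(8, 8))       # synthetic grayscale image
feats = spam_features(img)
print(sorted(feats))   # ['height', 'mean', 'skew', 'width']
```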

Speaker 1 (13:48):
It really does feel like a constant arms race, though,
as our AI gets better at spotting malware.

Speaker 2 (13:54):
The attackers start using AI themselves to create better malware.

Speaker 3 (13:58):
It's an unavoidable cycle.

Speaker 1 (13:59):
Which leads to this concept I've read about, adversarial examples.
Sounds ominous.

Speaker 2 (14:03):
It's a major challenge. Adversarial examples, or AEs, are inputs,
could be an image, could be a data file, could
be a software binary, that are intentionally but very slightly modified.
The modification is often tiny, maybe even imperceptible to a human,
but it's specifically crafted to fool an AI classification

Speaker 4 (14:22):
model. To make the AI misjudge it? Exactly. In the
malware context, an attacker could take a genuinely malicious file, tweak
it just a little bit, maybe adding some junk code,
changing a few bytes so that our AI detector now
classifies it as benign.

Speaker 1 (14:35):
But it still does the bad stuff.

Speaker 2 (14:37):
Crucially, yes, it preserves its original malicious functionality while wearing
this AI fooling camouflage. It highlights that even powerful AI
models can have these exploitable blind spots. There are even
techniques to create universal perturbations that can fool a model
across many different inputs.
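
On a linear model, the core of the attack fits in a few lines. This FGSM-style sketch (fast gradient sign method; the model and input are made up, and the step size is exaggerated for the toy) nudges each feature against the sign of the corresponding weight until the verdict flips.

```python
# Adversarial perturbation of a toy linear detector: a small step
# against the sign of each weight flips the classification even
# though the input barely changes.
import numpy as np

w = np.array([0.9, -0.4, 0.7, 0.2])   # weights of a toy detector
x = np.array([0.6, 0.1, 0.5, 0.3])    # a "malicious" sample: w @ x > 0

eps = 0.5                             # exaggerated step for the demo
x_adv = x - eps * np.sign(w)          # FGSM-style perturbation

print(w @ x > 0, w @ x_adv > 0)       # True False
```

For real malware the extra constraint is that the perturbation must not break the file's functionality, which is what makes crafting these examples genuinely hard.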

Speaker 1 (14:56):
That's worrying. So the malware itself is also evolving, partly
in response to our defenses.

Speaker 2 (15:01):
Constantly, and machine learning is actually being used to track
this evolution. Researchers analyze malware families over time, perhaps looking
at op code sequences within specific time windows. They use techniques,
maybe even simpler ones like linear SVMs, to detect points
where a malware family significantly changed its characteristics.

Speaker 1 (15:18):
Like finding evolutionary branches in the malware family tree.

Speaker 2 (15:21):
Precisely. Understanding how threats adapt helps us anticipate future shifts
in their tactics or structure.

Speaker 1 (15:27):
There must be practical challenges in just studying all this malware,
especially older stuff.

Speaker 2 (15:31):
For live threats? Oh absolutely. Handling live malware is inherently risky.
And for older samples, the infrastructure they relied on, especially
their command and control, or C2, servers, is often
long gone, so you

Speaker 1 (15:42):
Can't see their full behavior, not easily.

Speaker 2 (15:45):
That's where C2 server emulators become really useful. These
are tools researchers build to mimic the original C2 server.
This allows them to run the malware, even historical samples,
in an isolated lab network and observe its full range
of capabilities, because the malware thinks it's talking to its
real controller. You can extract features, understand its entire life cycle.
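
In spirit, a C2 emulator is just a server that answers the malware's beacon the way the original controller would. Here is a bare-bones local TCP sketch; the one-line beacon, one-line reply protocol is invented for illustration, and real emulators reimplement the sample's actual protocol and crypto.

```python
# Minimal "C2 emulator": a local TCP server that accepts one beacon
# and replies with a canned command, so a sample in an isolated lab
# believes its controller is alive.
import socket
import threading

def c2_emulator(host="127.0.0.1", port=0):
    srv = socket.create_server((host, port))
    def serve():
        conn, _ = srv.accept()
        with conn:
            conn.recv(1024)              # read the beacon
            conn.sendall(b"SLEEP 60\n")  # answer like the real C2 would
        srv.close()
    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()[1]          # the ephemeral port we bound

port = c2_emulator()
with socket.create_connection(("127.0.0.1", port)) as c:
    c.sendall(b"BEACON id=lab-sample\n")
    print(c.recv(1024))                  # b'SLEEP 60\n'
```

Pointing the sample's C2 address at this listener (via DNS or hosts-file redirection in the lab) is what lets its full behavior unfold safely.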

Speaker 1 (16:05):
You trick the malware into showing its hand.

Speaker 3 (16:08):
Essentially.

Speaker 2 (16:09):
Yes, sometimes you might even need to slightly patch the
malware itself, maybe to bypass some anti analysis checks it has,
or if, say, an encryption key needed for its C2
communication was lost to time, like with some old
CryptoLocker variants.

Speaker 1 (16:22):
It's a complex process. Now, with all this focus on AI,
this AI mania almost, are there downsides, things we need
to be cautious about?

Speaker 3 (16:30):
That's a very important point.

Speaker 2 (16:31):
Yes, while AI is powerful, we need perspective. Machine learning
is data driven, but it's not magic. Humans still make
crucial decisions, things like choosing the right model architecture, setting
parameters like the number of hidden states in an HMM,
selecting the kernel function for an SVM. These aren't automatic.
They require human expertise and significantly impact performance.

Speaker 1 (16:53):
Right. The human element is still key in setting it.

Speaker 2 (16:56):
up. Definitely, and there are practical constraints. More data is
often better, but it needs more computing power, more storage,
longer training times. That's a real bottleneck. Plus, some highly
tuned models can become.

Speaker 3 (17:08):
Very specific to the data set they were trained on.

Speaker 2 (17:10):
They might not generalize well to new, slightly different data,
which is a constant issue with evolving malware. There's a
real need for more robust, more generic deep learning approaches.

Speaker 1 (17:20):
Adaptability is crucial and.

Speaker 2 (17:22):
Another big challenge, maybe less technical, but just as important,
is the lack of a unified standard for malware taxonomy.
Different antivirus vendors often label the same threat differently,
even with tools like VirusTotal that aggregate results. Correlating
threats globally and building truly comprehensive data sets is harder
than it should be because we don't always speak the

(17:43):
same language when naming things.

Speaker 1 (17:46):
That makes collaborative defense tricky.

Speaker 2 (17:48):
It does, and one final sort of intriguing point. Researchers
have found that different methods for selecting the most important features,
like those API calls or op codes, can sometimes pick
vastly different sets of features.

Speaker 1 (18:00):
But they still work.

Speaker 2 (18:01):
But they still end up achieving similar classification accuracy, which
raises a fascinating question. Are these methods truly finding the
single best set of features or are there potentially multiple
different sets of features that are almost equally good at
identifying malware. It makes you wonder about what the AI
is really learning.

Speaker 1 (18:19):
That is interesting. It suggests maybe there isn't one perfect
way to see the malware.

Speaker 2 (18:23):
Okay, we have definitely covered a lot of ground in
this deep dive. We've seen how AI and deep learning
are genuinely transforming the fight against malware. From visualizing code
as images, which is still kind of blowing my mind, yeah,
to understanding behavior through sequences and protecting everything from our
phones and cars to the cloud. It's clearly a super dynamic,

(18:45):
constantly evolving field.

Speaker 1 (18:47):
It absolutely is, and I think the key takeaway is
the sheer complexity of this ongoing cybersecurity arms race. AI
gives us incredibly powerful new tools, yes, but the ingenuity
of attackers means it's never solved. Critical thinking, human oversight, asking
the right questions, understanding the limitations of the AI, these
remain completely indispensable. It's very much a human machine partnership,

(19:09):
absolutely a partnership against an ever adapting adversary. So maybe
the thought to leave you, our listener with, is this,
As AI gets better and better at spotting the hidden patterns,
the secret signatures of malicious code, what new forms of
digital camouflage will the attackers invent next? And will our
intelligent defenses always find the optimal way to adapt or

(19:31):
just one of many good enough ways? Constantly pushing the
very boundaries of what these intelligent systems can even perceive
is definitely something to think about.