Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:04):
Get in touch with technology with TechStuff from HowStuffWorks dot com. Hey everybody, and welcome to TechStuff. I'm Jonathan Strickland. I'm the host of the show, and this is a Saturday morning rerun episode where we take a classic episode of TechStuff and we present it to you guys who may have missed it. I've been
(00:25):
talking a lot about tech and music recently. If you've
been listening to the recent episodes, you know all about that,
and there have been some great discussions. But it also requires a little bit of knowledge of previous episodes at times, and I know it can be tricky to dig through the archives. So in this classic episode, I talk about how the MP three compression format works, so
(00:48):
that you can actually understand how MP three works as opposed to something like MIDI, and you can get an appreciation for the differences between the two formats. This episode originally published in January two thousand seventeen. That's more than a whole year ago now; we're in April two thousand eighteen as I record this. I hope you
(01:09):
enjoy this classic episode. I hope it gives you a
deeper appreciation of the technical aspect of creating digital music
and I'll see you guys on the other side. So
let's remember that the heart of digital information is the
bit that's either a zero or a one. The basic
unit of information for digital formats zeros and ones. Now
(01:34):
we can use those zeros and ones to describe all
sorts of information, from text to audio, to video and
really pretty much anything you can think of that's represented digitally. Ultimately,
when you get down to it, it's a bunch of
zeros and ones. So let's say you start off with
your uncompressed audio file. You've got this enormous audio file
in front of you. It's made up of zeros and ones.
(01:57):
How do you make that file smaller? So in the physical world, we can compress stuff, right? We can apply physical pressure to things. Think about packing a suitcase. You can make sure you get that extra outfit in if you just press it down hard enough and get that zipper zipped before it can burst open. But once you get to
a certain level of compression, you cannot make things smaller,
(02:19):
at least not without hurting yourself or whatever it is
you're trying to compress. Digital files are a little different
because you cannot physically cram the zeros and ones closer together.
It doesn't work like that. These are abstract things. You
can't make them smaller, right, You can't decrease the font.
It doesn't work that way. The numbers represent two different states.
(02:41):
So if you want to create a smaller audio file
containing the recording that was in a larger audio file,
you have to start getting creative. Now, in the last part of this series, I talked about how the MP three compression algorithm was born from an applied research institution in Germany, and the team behind the MP three wanted to find a way to compress audio, specifically music
(03:03):
for transmission over phone lines. Eventually this evolved into the Moving Picture Experts Group Audio Layer three compression methodology, better known as the MP three, and there are also the MPEG two and MPEG four standards. MPEG two, by the way, is the basis of compression on DVDs, although the actual DVD
(03:23):
format is really a modification of MPEG two. And MPEG four is a compression strategy for audio and video that's frequently used in lots of different capacities, including streaming media services. So by the late nineteen seventies, researchers began
to explore the possibility of leveraging psychoacoustics to figure out
how to compress audio. And psychoacoustics refers to the way
(03:46):
we perceive sound, and also the physiological effects of sound on us. So this involves not just our physical sense of hearing, but also our brains and the way our brains interpret sound. So, for example, there's a psychoacoustic phenomenon that's called the Haas effect, H A A S, and I think it's pretty interesting. So
(04:08):
here's how the Haas effect works. If you hear the
exact same sound coming from different directions, but the two
sounds arrive within thirty to forty milliseconds of each other,
your brain will be convinced that you really only heard
one sound and it came from the direction that hit
you first. So let's say a sound's coming from directly
(04:30):
in front of you and to your left, and you
get both of them within that thirty to forty millisecond range, and you hear the one coming from ahead of you first, to you, you're convinced that you only heard that sound once and it came from dead on straight ahead of you. Your brain kind of discounts the one that came from off to the left, although it can reinforce it,
(04:53):
which ends up being really useful if you're planning out
PA systems for stage shows. I'm not joking. That really is the way that people plan those things out. It's pretty neat. Humans perceive sounds in a way that's not necessarily representational of all the sounds surrounding us. You can think of your brain as the filter between your understanding and what reality actually is. A lot of stuff
(05:15):
goes on that ends up getting rid of information that your brain just says, you know what, he or she doesn't need that, it's just gonna confuse things, we're gonna dump it. And that's kind of how it works.
It's all on an unconscious level. It's not like you're
actively working to do this. So let's say you're in
a relatively busy hallway and there could be a lot
(05:37):
of sounds in that hallway. Stuff that's going on constantly
around you. Maybe there are doors opening and closing, maybe there are footsteps going up and down the hallway, maybe someone's shoes are squeaking against the linoleum floor. People are chattering
away in there. But you are having a conversation with someone,
so you turn your focus on that person and other
sounds seemingly fade away. They're still there, but they're not important.
(06:01):
So in this example, you would actually call those other sounds a distraction, and you would really focus on the conversation. That also shows how we're able to consciously direct our senses, our perception of hearing. So both of these factors
come into play. Now. One thing that MP three encoding
takes advantage of is something called masking, and there are
(06:24):
a couple of different variations of the masking effect. One
of them is called frequency masking. So let's say you've got two sound frequencies that are similar, perhaps they're just a few hertz apart. Remember, frequencies are measured in hertz, which is really the number of oscillations per second. So let's say you've got a sound that's at, I don't know,
(06:47):
one thousand hertz, and another one that's at one thousand ten hertz. Now, the human ear is precise enough to be able to tell the difference between two sounds that are at least two hertz apart from each other. That's how precise our resolution of hearing is, it's at that level. But if you get two sounds
(07:09):
played at the same time and they are that close
together in frequency, and one of those frequencies is played
at a greater volume than the other, our brains will
pick up on the louder sound and ignore the quieter sound,
even though both of them are present. What becomes important
at that point is the amplitude. Now, the further apart
in frequencies you get, the less that has an effect.
(07:33):
So if you get far enough apart where there are
two pitches, one of them noticeably louder than the other,
but they're far enough apart, you will hear both of them.
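To make that concrete, here's a toy sketch in Python. This is not the actual MP three model; the fifty-hertz window and the decibel numbers are made up purely for illustration. The idea is just: if two tones sit close together in frequency, keep only the louder one; if they're far apart, keep both.

```python
def apply_frequency_masking(tones, max_gap_hz=50.0):
    """Crude frequency-masking sketch. tones is a list of
    (frequency_hz, volume_db) pairs. When two tones are close in
    frequency, the louder one masks the quieter one."""
    kept = []
    for freq, vol in sorted(tones):
        if kept and abs(freq - kept[-1][0]) <= max_gap_hz:
            # Close together: only the louder of the pair survives.
            if vol > kept[-1][1]:
                kept[-1] = (freq, vol)
        else:
            kept.append((freq, vol))
    return kept

# The 1000 Hz tone at 80 dB masks the quieter 1010 Hz tone nearby,
# while the distant 5000 Hz tone is heard on its own.
print(apply_frequency_masking([(1000, 80), (1010, 60), (5000, 60)]))
# → [(1000, 80), (5000, 60)]
```

Again, a real encoder's masking model is far more involved; this just shows the shape of the decision.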
It only works if the two pitches are relatively close together,
and there's not a universal formula for frequency masking. As
you get closer to the boundaries of human hearing, frequency
masking becomes easier, So if it's a really low pitch
(07:54):
or a really high pitch, it's easier to get away
with it. Once you start getting into what's thought of as the sweet spot for human hearing, which is generally considered to be between two and five kilohertz, you need a greater difference in volume or a smaller difference in frequency in order for masking to work. Frequency
(08:14):
masking at any rate. But then there's also temporal masking,
and you might say, okay, I got it. Temporal that
means time. Indeed it does, my friend. This describes the
effect of a short but loud sound masking a softer
sound for a short time. Weird thing is the loud
sound can actually mask sounds that precede it slightly, not
(08:37):
by a whole lot, but a little bit. MP three
compression takes advantage of both frequency and temporal masking when
it's trying to determine which data needs to be included
and which data can be dumped, because it won't affect
your perception of whatever the audio file is in
the first place. So you also probably remember I talked
about the physical limitation to what we humans can hear,
(08:59):
no matter what our brains might be up to. So this doesn't have to do with our brains, you know, filtering through the information that's coming in. This has to do with the physical limitations of the human ear. In the last episode of the series, I said typical human hearing, and keep in mind typical, there are exceptions, covers the range of frequencies between about twenty hertz and twenty kilo
(09:21):
hertz, or twenty thousand hertz. So, twenty to twenty thousand. Higher frequencies represent higher pitches in sound, lower frequencies lower pitches, right? And as you get older, your ability to perceive those higher frequencies starts to diminish. So most adults actually have an upper range closer to sixteen kilohertz, not twenty. Kids,
(09:44):
they can hear those higher pitches. You may have heard the story about how some convenience stores experimented with getting rid of teenage loiterers by projecting these super high pitches that adults could not hear but kids could, and it discouraged kids from hanging out at the convenience store and loitering. I love that idea
(10:09):
so much. Anyway, that's because I'm old and my hearing is terrible. Well, remember I also mentioned you can detect changes in pitch at two hertz increments. If you get below two hertz of change, like if it's just a one hertz difference between two frequencies, it's too low a resolution for us to detect. To us, it will sound
(10:30):
exactly the same. So if you were to hear a frequency at one thousand one hertz, or one point zero zero one kilohertz, and one at one point zero zero two kilohertz, you wouldn't notice the difference. They would sound exactly the same to you. So if you're gonna take
(10:50):
audio and compress it, one step you could consider is
eliminating anything that's outside the actual range of frequencies that
we can hear, or simplifying any changes in frequency that
are smaller than two hertz. If you take all
that data and you say it is physically impossible for
a human to perceive this, get rid of that information,
(11:11):
then in theory it wouldn't have any effect on the
rest of the recording. But how do you go further than that, right,
how do you create a method so that you can
really compress this file? You want a method that will
preserve the important sounds while potentially ignoring all the unimportant or incidental sounds. And you want it to be automatic, because
(11:31):
if you have to do it manually, then that's going to take
countless hours just to edit a single sound file. So
that was the challenge that the MP three research team
faced as a group. Now, their solution, which ultimately created
even more challenges was to come up with what was
(11:51):
essentially a simulated human ear and brain. They needed to
replicate the experience of perceiving music so that an algorithm
could evaluate every sound in an audio file and judge
if it in fact was relevant enough to include in the
final compressed version. If a sound were imperceptible, then it
(12:13):
wouldn't make sense to include it in the MP three file.
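As a crude illustration of that one idea, here's a Python sketch, with made-up amplitude numbers, that simply throws away spectral components falling outside the roughly twenty hertz to twenty kilohertz window. This is only the easy, physical-limits part of the job, not the psychoacoustic model itself.

```python
AUDIBLE_LOW_HZ = 20.0       # rough lower bound of human hearing
AUDIBLE_HIGH_HZ = 20_000.0  # rough upper bound of human hearing

def drop_inaudible(components):
    """Keep only spectral components a human ear could physically detect.
    components is a list of (frequency_hz, amplitude) pairs."""
    return [(f, a) for f, a in components
            if AUDIBLE_LOW_HZ <= f <= AUDIBLE_HIGH_HZ]

# The 5 Hz rumble and the 30 kHz ultrasound carry no perceptible
# information, so there's no point storing them.
print(drop_inaudible([(5, 0.9), (440, 0.5), (30_000, 0.7)]))
# → [(440, 0.5)]
```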
So by leaving out all the irrelevant data, they can
make the audio information take up less bandwidth. The file
itself would be smaller because you just dumped everything that
wasn't important. So the team used an algorithm called Low Complexity Adaptive Transform Coding, or LC dash ATC,
(12:34):
as the foundation for their research. This was kind of their starting point, and this is an approach that tries to do away with redundancy as much as possible, and it also incorporates adaptation to perceptual requirements. Also, MP
threes owe a lot to the MPEG Layer two standard. So Layer two obviously came out before Layer three,
(12:56):
and so a lot of the features of Layer three are really legacy features from Layer two. In other words, the MP three group kind of got stuck with them, because otherwise they would have had a problem with backwards compatibility. So the result is kind of a
clunky arrangement under the hood, and some of the features
may make very little sense when I go through them,
(13:19):
but some of that is because it's a holdover from
an earlier compression strategy, which isn't terribly satisfying as an answer.
But the reason many parts of the MP three compression
algorithm are the way they are is because that's the
way we've always done it. So next I'm gonna dive
into the phases of compression. But before I do that,
(13:41):
let's all take a deep breath and take a moment
to thank our sponsor, and we're back. So there are
two big phases we'll need to talk about with MP
three compression. The first phase is analysis and the second
(14:03):
phase is the actual compression itself. And after that there's
the process of decoding an MP three for playback. But
that's way simpler once we get an understanding of how
the encoding process actually happens. So let's begin with analysis. Now.
This is the part where the standard has to figure
out which frequencies within an audio recording
(14:26):
are important or perceptible. So how does a program, an encoder, figure out what we can hear and what we
cannot hear? Alright, time to get technical. So you start
off with your pulse code modulation audio file or PCM file.
And you might remember I talked about PCM audio in
(14:47):
the first episode of this series, but just in case
you don't, it's a lossless digital audio file. The actual
format could be a WAV or an AIFF or
something along those lines, but the important thing to keep
in mind is that it is uncompressed. Now, that means
those files tend to be pretty big. This is our
raw material that we want to take and squish down
(15:09):
to a more manageable, transferable size. And in our
last episode in this series, I also mentioned that the
standard for CD audio is a sample rate of forty four point one kilohertz. And we learned that
you need a sample rate twice the frequency of the
highest frequency in your recording, and since human hearing tops
(15:30):
out at around twenty kilohertz, the standard for CDs is forty four point one kilohertz. The MP three standard can support lots of different sample rates, but forty four point one kilohertz is pretty much the common standard. So you've got a number of samples with
your audio file, and that number will depend upon how
long the audio file is. You've got forty four thousand,
(15:53):
one hundred samples per second, actually twice that for stereo. But for
purposes of this discussion, let's kind of stick with mono
sound so that I don't start having math coming out
of my ears. And we're still in the very easy,
simple part as far as math goes. We haven't gotten
to the complicated stuff yet. All right, So you've got
forty four thousand, one hundred samples per second. To compress
(16:15):
it into an MP three format, the algorithm first groups
all of these samples into collections called frames. So take those forty four thousand one hundred samples per second, and then you start saying, okay, we're gonna group you in batches. Each batch is called a frame, and each frame contains one thousand, one hundred fifty two samples. Now that's specifically to maintain backwards compatibility with
(16:39):
MPEG Layer two, which established that one thousand, one hundred fifty two number. But we're not talking about MPEG Layer two. We're talking about MPEG Layer three, and that means we have to get a little more complicated. So each frame consists of two subgroups called granules. So each granule has five hundred seventy six samples, and five seventy six times two is eleven fifty two,
(17:04):
so five seventy six samples per granule. Now, technically MP
three encoders only work on one granule at a time,
but they may reference the granules immediately before and immediately
after the current one in order to see how the
audio within the file changes over time. All right, So
now you've got your granules of five hundred seventy six
(17:25):
samples each. Then the MP three encoder runs the samples
through a filter bank, which sorts the sound into thirty
two frequency ranges. Are you crazy about the numbers yet, Dylan? Are you? Dylan's nodding. It gets worse from here, Dylan. So you have thirty two frequency ranges,
(17:45):
which is another nod to the Layer two method, which used those thirty two ranges for encoding purposes. But we're not talking about Layer two, are we? No, we're talking MP three, gosh darn it. That means we take those thirty two ranges and we subdivide them by a factor of eighteen. That means we have five hundred seventy six bands of frequencies, each band containing one five hundred seventy sixth of
(18:10):
the frequency range of the original sample. So what that
actually means, and this is actually pretty easy. The bands are not limited to a specific number for their frequency range, right? The bands don't mean that band number one goes from twenty hertz up to a certain range, and band five hundred seventy
(18:32):
six ends at twenty kilohertz. That's not what it means. They're dependent upon the original audio. So if the original audio contains sounds within a narrow range of frequencies, the five seventy six bands will be more precise. But if
the original recording has a vast range of frequencies, the
bands are less precise. So another way to think about
(18:53):
this is with a pizza. So let's say you get
extra large pizza and you cut it into eight equal slices,
and then you get a small pizza and you cut
that into eight equal slices. Well, in both cases you
have with each slice one eighth of a pizza. But
the extra large pizza pizza slice is bigger than the
(19:15):
small pizza pizza slice. It all depends on the size
of the pizza. So in this case, it depends upon
the range of frequencies. And and Dylan, do you think
we could go for some pizza, you know, just just
put the episode on hold and go get pizza. Dylan's nodding.
It's great for audio. Yeah, so, uh, pizza, We'll be
right back. Okay, I was good pizza. Now um oh, man,
(19:38):
I got a whole bunch more notes. Okay, well, let's go ahead and do the rest of this.
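Before picking the thread back up, the sample bookkeeping so far, forty four thousand one hundred samples a second, frames of one thousand one hundred fifty two, two granules of five hundred seventy six, thirty two sub-bands split eighteen ways, can be sketched in a few lines of Python. All the numbers come straight from the episode; the audio itself here is just a stand-in list of values, not real sound.

```python
SAMPLE_RATE = 44_100   # CD-quality rate, in samples per second
FRAME_SIZE = 1_152     # samples per frame (an MPEG Layer two holdover)
GRANULE_SIZE = 576     # two granules per frame
SUBBANDS = 32          # filter-bank output ranges
SUBDIVISION = 18       # each sub-band splits 18 ways: 32 * 18 = 576

def to_granules(samples):
    """Group raw PCM samples into frames of 1152, then split each
    frame into two granules of 576 samples apiece."""
    granules = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = samples[start:start + FRAME_SIZE]
        granules.append(frame[:GRANULE_SIZE])
        granules.append(frame[GRANULE_SIZE:])
    return granules

one_second = list(range(SAMPLE_RATE))  # stand-in for one second of mono audio
granules = to_granules(one_second)
print(len(granules), len(granules[0]), SUBBANDS * SUBDIVISION)
# → 76 576 576
```

One second of mono audio yields thirty-eight full frames, hence seventy-six granules, with a few hundred leftover samples that a real encoder would pad out into one more frame.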
All right. So you've got your sound divided up into those five seventy six sub-bands of frequencies, you know, the thing I compared to pizza slices earlier. Now you get two different mathematical processes applied to this data. One is the fast Fourier transform, or FFT, and
(20:02):
the other is the modified discrete cosine transform, or MDCT. Now, I am not going to dive
deeply into how these transforms work, because frankly, they are
beyond my mathematical understanding. But I know what they do.
I just cannot explain the process like how they do
(20:22):
what they do. So I'm going to give you the explanation of what they do, what the outcome of each of these transform processes happens to be. But I'm not going to be able to tell you the actual mathematical steps involved in each, because I don't math so good, guys. But let's start with the fast Fourier transform. So
transform is kind of what it sounds like. It's all
(20:42):
about transforming information in some way. So in this particular case,
the FFT transforms the frequency bands we just
talked about into data that can be further analyzed by
a psychoacoustic model that's in the encoder. So this is
that simulated human ear and brain we were talking about earlier.
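If you want a feel for what that time-to-frequency conversion does, here's a minimal Python sketch. It uses the plain, slow discrete Fourier transform rather than a real FFT; the FFT computes exactly the same result, just much faster. The block size of sixty-four and the three-cycle test tone are arbitrary choices for illustration.

```python
import math

def dft_magnitudes(samples):
    """Plain discrete Fourier transform (what the FFT computes quickly):
    turn a block of time-domain samples into per-frequency energy."""
    n = len(samples)
    mags = []
    for k in range(n // 2):  # only the non-redundant half of the bins
        re = sum(samples[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(samples[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

# A pure tone completing 3 cycles per block shows all its energy in bin 3.
block = [math.sin(2 * math.pi * 3 * t / 64) for t in range(64)]
mags = dft_magnitudes(block)
print(max(range(len(mags)), key=lambda k: mags[k]))  # → 3
```

It's this frequency-domain picture, not the raw samples, that the psychoacoustic model inspects for masking opportunities.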
(21:03):
So what the encoder does is it analyzes each bit
of data and looks for signs that it represents audio
that wouldn't be perceived by a human. So it's look
looking for any potential for masking possibilities. So are there
collections of frequencies that are grouped close together, and is
one of those frequencies louder than the others. You might
(21:24):
be able to do away with those softer frequencies because
of frequency masking. The encoder will also look at whether
or not the audio has a lot of complexity to it,
if it has a lot of changes, or if it's
just relatively steady or simple audio. Any transient sounds that
are present in the audio might end up enabling temporal masking,
(21:44):
so it'll analyze those as well and see if that's
a possibility. So really what it's looking for is, you know, just any really loud sounds that stand out above the rest of the recording. That's what the FFT
is doing. So what about the modified discrete cosine transform? Well, this is happening in parallel with the FFT,
(22:05):
and the samples get sorted into different patterns called windows.
And the criterion for sorting all has to do with whether the sample represents a steady sound or a varied sound. So a simple, steady sound goes into a long window. If there's a lot of variation
in the sound, like there are a lot of consonants
(22:27):
in a vocal line, or it's like a drum solo
or something like that, it would get sorted into a
series of three short windows. And each short window contains one hundred ninety two samples. That amounts to about four milliseconds, so four thousandths of a second, per window across three patterned windows. So
(22:48):
you've got these windows now, either long windows for simple
sounds or short windows for the more complex sounds, and
then the modified discrete cosine transform kicks into gear. It looks at each long window or set of three short windows and converts them into a set of spectral values.
To some of you, that probably sounds meaningless. So let's
talk about spectral analysis for a second. First, I was
(23:11):
very disappointed to learn that spectral analysis doesn't involve a
psychologist talking to a ghost about its emotional state. So bummer.
But spectral analysis is when you look at a spectrum
of information, like a spectrum of frequencies or related information
like energy states. That's what this transform does. It takes
(23:31):
data that originally represented a slice of time in a
sound waveform; that's what a sample is, an instance of time in a waveform. And it converts it into information representing sound as energy across a range of frequencies. Now,
you can plot out spectral information in a lot of
different ways, but one common method is to use brightness
(23:54):
to indicate energy levels. Higher energy levels are brighter patches in your visual representation of spectral data. High frequencies would appear at the top of a spectral view. Like, imagine a box, and at the top of the box, that's where you would find high frequencies. At the bottom of the box is where you find low frequencies, and it's
(24:14):
just lots of patches of color. The really bright patches
of color represent very high energy frequencies, so they could
be high or low in in actual frequency, but we're
talking about energy levels, not whether it's a higher low pitch.
Looking left or right represents the passing of time, and
(24:35):
looking along any vertical points shows you the actual frequency
or pitch, and then the respective energy level is the brightness.
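Here's a little Python sketch of that kind of picture: a crude text spectrogram, where columns are slices of time, rows are frequency bins with high frequencies on top, and brighter characters mean more energy. The block size and the shading characters are arbitrary choices made up for the illustration.

```python
import math

def spectral_view(samples, block=32, shades=" .:-=#"):
    """Build a crude text spectrogram: time runs left to right,
    frequency runs bottom to top, brighter characters = more energy."""
    columns = []
    for start in range(0, len(samples) - block + 1, block):
        chunk = samples[start:start + block]
        col = []
        for k in range(block // 2):  # per-frequency energy for this time slice
            re = sum(chunk[t] * math.cos(2 * math.pi * k * t / block) for t in range(block))
            im = sum(chunk[t] * math.sin(2 * math.pi * k * t / block) for t in range(block))
            col.append(math.hypot(re, im))
        columns.append(col)
    peak = max(max(col) for col in columns)
    rows = []
    for freq_bin in reversed(range(block // 2)):  # high frequencies on top
        rows.append("".join(
            shades[min(len(shades) - 1, round(col[freq_bin] / peak * (len(shades) - 1)))]
            for col in columns))
    return rows

# A low tone that jumps to a higher tone halfway through the clip:
# the bright patch moves up the view as time moves right.
clip = [math.sin(2 * math.pi * 2 * t / 32) for t in range(128)]
clip += [math.sin(2 * math.pi * 6 * t / 32) for t in range(128)]
print("\n".join(spectral_view(clip)))
```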
So it's kind of like looking at sound as a wave,
but instead of being a wave, you're looking at information
that indicates frequency range and energy level. That representation is
actually kind of analogous to how we hear audio, so
(24:55):
an encoder can analyze the spectral view and start to filter out the data we wouldn't perceive due to psychoacoustics. Now,
after all that processing, the encoder looks at the frequency
sub-bands and the levels of spectral intensity for each,
and that information can then be used for the next phase,
which is compression. But right now I think we could
(25:18):
all stand a little decompression, So let's take another quick
break to thank our sponsor. All right. So now you're
ready to compress your analyzed audio. Good for you, and
by you I mean encoders. This has to be simpler
(25:41):
than that analysis segment, right, I mean that got a
little crazy with all the different bands and sub bands
and windows and frames and granules. Sadly it gets more complicated.
All right. So there are two layers of compression going
on with IMPEG Layer three. One of those layers depends
(26:03):
upon the psychoacoustic analysis and the other doesn't. So why
would you use two layers with different strategies like that? Well,
the reason is that one strategy is great for complex
audio with lots of components, but not so great with
simpler sounds, and the other strategy is kind of the opposite.
So the psychoacoustic approach is the one that's really good
for complicated sounds. If you've got a lot of
(26:26):
volume changes, lots of different frequencies, it's just complicated and
rich sound. You've got a lot of opportunities to look
for masking and other acoustic elements that limit the actual
sounds that people perceive. So it means there are a
lot of chances for you to fudge by dropping all the stuff that people probably wouldn't notice anyway. And
(26:49):
if you take a piece that's got a lot of
elements at varying volumes, there are likely several opportunities to do this. But if you're talking about relatively straightforward
audio with few components, few changes in volume, there's really
not a whole lot of data you can ditch without
it actually affecting the quality of the audio in a
perceptible way. And this is part of what Brandenburg, that
(27:13):
guy I was talking about in our first episode in this series, discovered when he was working with the MP three standard, listening back to that Suzanne Vega a cappella track Tom's Diner. He was listening to a compressed version of it, and he said it was terrible. He said it ruined the quality of the audio. And part of that is because
(27:34):
that particular song is fairly simple, there's just not a
lot of opportunity to take advantage of masking and other
tricks without potentially compromising the quality. So they decided to
also incorporate some traditional compression strategies, which worked better with those types of recordings. So the MP three format
takes advantage of both the traditional approach and the psychoacoustic approach,
(27:58):
and that allows the encoder to compress files into a smaller size without just following a single strategy. Like, it doesn't have to do a one size fits all for all elements of audio. Now, combining those two strategies requires a
little more mathematical gymnastics. So let's go back to those
five seventy six frequency bins. You know, those sub bands
(28:20):
we talked about earlier. You gotta quantize those suckers. What
does that mean? It means assigning a quantity to each frequency bin. You have to give it a quantity of some sort so that you can end up judging how much you can get away with dropping data. So to do this, the encoder sorts those five seventy six
(28:42):
bins into twenty two scale factor bands. How you doing
over there, Dylan? Just checking in on you? Okay, Dylan's
got a thousand-yard stare going. I hope
you guys are doing okay over there? All right, So
before smoke starts coming out of your ears, let me
explain what the scale factor bands are all about. The
whole purpose of the scale factor bands is to determine
(29:05):
how the information will be stored within the compressed state.
So you want to get away with as little data
as possible before affecting sound quality. So if you can
say the same thing in a shorter space without affecting
the quality of what it is you're saying, you go
with it. Brevity is the soul of compression. So if
(29:27):
we were talking about language, I would say it's more
efficient to say it's raining outside, or even just it's raining,
because you would assume that it would be outside where
the rain is happening, and it would be inefficient for
me to say it's coming down like cats and dogs
out there. It's not as efficient as saying it's raining.
(29:49):
So if you can get away with shorter statements without
affecting the actual quality, and you could argue that switching from it's coming down like cats and dogs out there to it's raining changes the quality, and that could be a valid argument. But if you can get away with shorter without affecting quality, you do it. So each
(30:10):
scale factor band is represented by a quantity. Then the
encoder divides that quantity by a given number called the quantizer,
which is the same across the entire frequency spectrum for
that recording. The resulting number is then rounded up or
down to a whole digit. And here's an important point.
(30:33):
Individual scale factor bands can be scaled up or down
for more or less precision to represent the actual value
of those bands. So what the heck does all that mean? Well,
the purpose of dividing and rounding is just to simplify
the data to reduce the amount you need in order
to store the information. So let's go with a totally
(30:53):
hypothetical example. Let's say you've got a scale factor band
and you've decided you're representing that scale factor band with the quantity seven eight four zero, that is, seven thousand eight hundred forty. And you've chosen the number one hundred to quantize your data, meaning that you will divide each scale factor band's quantity by one hundred. So this
(31:18):
is seven thousand, eight hundred forty. You divide it by
one hundred, and the scale factor for this particular
band you have determined is one point zero. That means
that once you get that result where you've divided the
quantity by the quantizer, you multiply by one. That means
there's no change. You multiply by one you get the
same number. More on that in a bit. Okay. So
(31:40):
you take that seven thousand, eight hundred forty, you divide it
by one hundred. That gives you seventy eight point four. Well,
now you have to round that number, so you round
it down to seventy eight. Now, when you have a
decoder and you're ready to play back the information, it
comes across this quantity, seventy eight, and it knows what
the quantizer number was, so it multiplies by one hundred
(32:02):
to get back to seven thousand, eight hundred. So the
replicated number is actually forty off from the original number.
The original number again was seven thousand, eight hundred forty.
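That round trip is easy to check in Python. This is just the arithmetic from the example, a quantizer of one hundred and a scale factor of one point zero, not the real encoder's logic:

```python
def quantize(value, quantizer=100, scale_factor=1.0):
    """Divide by the quantizer, apply the band's scale factor, and
    round to a whole number -- the form the value is stored in."""
    return round(value / quantizer * scale_factor)

def dequantize(stored, quantizer=100, scale_factor=1.0):
    """What the decoder does on playback: undo both steps."""
    return stored * quantizer / scale_factor

stored = quantize(7840)          # 7840 / 100 = 78.4, rounds to 78
restored = dequantize(stored)    # 78 * 100 = 7800
print(stored, restored, 7840 - restored)  # → 78 7800.0 40.0
```

That leftover forty is exactly the rounding error the episode is describing.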
The replicated number is seven thousand, eight hundred. Now those
inconsistencies manifest as noise in the actual playback. So if
you wanted to increase the precision of any given scale
(32:24):
factor band, you could do so by changing the scale
factor number. So in that example, just now, I said
the number was one point zero, meaning there's no change
to that result. But I could have said it was ten,
which means we would multiply the quantized number by ten. So we would take that seven thousand, eight hundred forty, divide it by one hundred, you get seventy eight point four, then multiply by ten to get seven eight four, seven hundred eighty four. So
(32:48):
when the decoder decompresses the file, it would reverse this whole thing. It would multiply by one hundred and divide by that scale factor of ten. You would end up getting seven thousand, eight hundred forty again,
which means that you wouldn't introduce any noise to the file.
You would have a perfect representation. But in some cases
the encoder may determine that any noise that you generate
wouldn't be noticed or it wouldn't impact the quality of
(33:11):
the audio enough for it to be a problem because
of other factors for that particular scale factor band, like
maybe it's really quiet, or maybe it's really complex. So
in those cases, you could reduce the scale factor number
by making it something else, like point one instead of
one point oh. So that means you would multiply the
quantized number by point one, So the seventy eight point
(33:32):
four would become seven point eight four, and then you
have to round it to get a whole integer, so
you get eight, because seven point eight four rounds up to eight. Now,
when a decoder decompresses the audio, it multiplies eight
by that quantizer we've talked about so much.
Actually, at this point the multiplier would have
to be a thousand, not just a hundred, because it's also
(33:53):
taking into account the scale factor. So you would get a number
that pops up to eight thousand. And remember, the
original was seven thousand, eight hundred forty. So you look
at the difference between these two, the original seven thousand,
eight hundred forty and the new number, eight thousand. There's a pretty
big difference there. That change might introduce enough noise for
(34:15):
it to be a problem. So how does the encoder
determine if a scale factor band is meeting the proper criteria?
How can it tell if there is too much
noise or if the noise falls below the threshold. Well,
it goes through what is called a Huffman coding process.
At this point, Dylan is currently just staring at the
(34:37):
wall and drool is coming out. The Huffman coding process
converts scale factor bands into binary strings, and the process
goes through a series of tables to determine if the
data within the scale factor band requires more or less
precision to describe the sound without affecting the audio quality.
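The real encoder uses fixed tables from the MP three specification, but the general idea of Huffman coding, short bit strings for common values and longer ones for rare values, can be shown with a toy Python sketch:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    freq = Counter(symbols)
    # One leaf per distinct value; the tiebreak integer keeps heapq
    # from ever having to compare the dictionaries.
    heap = [(count, i, {value: ""}) for i, (value, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        count1, _, codes1 = heapq.heappop(heap)
        count2, _, codes2 = heapq.heappop(heap)
        # Prefix the two least frequent subtrees with 0 and 1, then merge.
        merged = {v: "0" + c for v, c in codes1.items()}
        merged.update({v: "1" + c for v, c in codes2.items()})
        heapq.heappush(heap, (count1 + count2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Four 78s, two 0s, one 12, one 5: the frequent 78 gets the shortest code.
codes = huffman_codes([78, 78, 78, 78, 0, 0, 12, 5])
```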
So Huffman coding is a process where you start
with a large number of possibilities and you begin to
(34:58):
narrow them down. Some people describe it as the
coding equivalent of twenty questions. So you ask your first
question, like animal, vegetable, or mineral. You get an answer,
say animal. Well, that first answer eliminates a ton of
other possibilities and narrows the focus; anything that doesn't
pertain to animal, you can automatically discount because you already
(35:20):
know it can't apply to that answer. With MP three compression,
this means making certain of the number of bits representing
a granule, because remember, I mentioned that in the MP three
format you have frames, and each frame has one thousand,
one hundred fifty two samples and consists of two granules
with five hundred seventy six samples each. So when you answer the first question,
(35:43):
it eliminates a lot of other possibilities and narrows the focus.
So like with animal, vegetable, mineral, if I say animal,
you're gonna not ask any questions that have to do
with minerals or vegetables only because it wouldn't make sense.
You know, those aren't gonna apply. Same thing with m
P three's, except this time it means making certain the
number of bits representing a granule. Remember their two granules
(36:05):
per frame in the MP three format. You want
to make sure that the number of bits representing that
granule matches the chosen bit rate for the compression. So
if after going through this process, the encoder says, hey,
this granule has more bits than what's allowed, it's too
many bits, we gotta get rid of some of these,
then the encoder can adjust the scale factor band so that
(36:27):
there's less precision, meaning that multiplier, in other words,
that I talked about earlier, and thus reduce the amount
of data needed to represent that particular granule. If a
granule comes in under the bit rate, the encoder can
increase the precision to reduce noise and fill that granule
(36:48):
out properly so that it matches the actual threshold. After all this,
the pairs of granules become frames within the MP three files,
and the only other component in an MP three file apart
from these frames is the I D three metadata. And
this is pretty simple. This is like a header and
it comes before all the frames in the audio file
(37:09):
and contains information about the file itself, which can
include stuff like the title of a song, an artist name,
an album title, other stuff like that. It can also
include copyright information as well as information about the file itself,
such as whether it's a stereo recording or a
mono recording. So when you use a decoder like an
(37:29):
MP three player, it takes this compressed information, these
representations that the music has been reduced to, and
it converts that Huffman data back into the quantized format,
scales the data back up to its original size or
close approximation. Remember, the decompressed version may actually be
(37:53):
off by a significant amount depending upon each individual granule.
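A rough sketch of that decoder-side reversal, with made-up names and none of the real MP three bookkeeping:

```python
def decode_band(stored, quantizer, scale_factor):
    # Undo the scale factor and the quantizer to recover an approximation
    # of the band's original quantity.
    return stored * quantizer / scale_factor

# Each tuple: (stored value, quantizer, scale factor) for one band.
bands = [(78, 100, 1.0), (784, 100, 10.0)]
samples = [decode_band(s, q, sf) for s, q, sf in bands]
# The first band comes back as 7800.0, carrying that error of forty;
# the second, stored with more precision, comes back as 7840.0 exactly.
```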
And all of that data gets combined into a new
PCM sample that can be played back to you. And
that's all there is to it. Nothing could be easier.
All right, that took a lot out of me. So
I got really technical, and I apologize if I lost
(38:14):
any of you out there, or for those of you
who have a lot of experience working on compression algorithms,
for oversimplifying in several cases. But now we've got a
full episode about this, and I hope you have a
better understanding of how a big sound file can be
reduced to a smaller sound file. Next time, I'll just
say magic. It will make everyone happier. If you guys
(38:36):
have any questions for me, or comments or suggestions, anything
like that, send me a message. My email is tech
Stuff at how stuff works dot com, or you can
drop me a line on Facebook or Twitter. The handle at
both of those is tech Stuff H S W, and
I'll talk to you guys again really soon. For more
(39:00):
on this and thousands of other topics, visit how
stuff works dot com.