
January 18, 2017 40 mins

How does a computer program decide which bits of an audio file to keep and which to ditch during compression? We learn about the mp3 algorithm and Jonathan talks about math a lot.




Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:04):
Get in touch with technology with TechStuff from how
stuff works dot com. Hey there, and welcome to TechStuff.
I'm your host, Jonathan Strickland. And in a recent episode
I explored how digital audio works and gave kind of
a brief history on the MP3 file format. I

(00:24):
warned you back then that that was part one of
a three part series, and today we're gonna explore part two.
So I hadn't forgotten about it. We're back to it,
and today we're gonna do a deeper dive into
MP3s: how do they compress audio? And how
can you take a file filled with information and make
it a smaller size? What do you have to give

(00:45):
up in order to make files smaller? And today we're
gonna try and unravel the technical mystery behind the
MP3. And I am not going to lie to you, people.
This is gonna get a bit, you know, mathy.
And I was an English major, so you mathematicians out there,
get ready with your corrections because I'm probably gonna make

(01:07):
some over generalizations for the purposes of my own sanity.
There does get to a point where to really get
into the technical details, it would likely be impossible
for me to describe it in a way that would
make sense and be accurate. And I have given
my producer Dylan the mandate that, should I get too

(01:31):
cryptic and incomprehensible with my explanation, that he is to
intervene in a way that he sees fit. Just not
in the face, Dylan. Not in the face. It's the moneymaker, man.
I gotta take care of it. So let's remember
that the heart of digital information is the bit that's

(01:52):
either a zero or a one. The basic unit of
information for digital formats: zeros and ones. Now we can
use those zeros and ones to describe all sorts of information,
from text to audio, to video and really pretty much
anything you can think of that's represented digitally. Ultimately, when
you get down to it, it's a bunch of zeros

(02:14):
and ones. So let's say you start off with your
uncompressed audio file. You've got this enormous audio file in
front of you. It's made up of zeros and ones.
How do you make that file smaller? So in the
real world, we can compress stuff, right, we can apply
physical pressure to things. Think about packing a suitcase. You
can make sure you get that extra outfit in if

(02:36):
you just press it down hard enough and get that
zipper zipped before it can burst open. But once you
get to a certain level of compression, you cannot make
things smaller, at least not without hurting yourself or whatever
it is you're trying to compress. Digital files are a
little different because you cannot physically cram the zeros and
ones closer together. It doesn't work like that. These are

(02:58):
abstract things. You can't make them smaller, right. You can't
decrease the font. It doesn't work that way. The numbers
represent two different states. So if you want to create
a smaller audio file containing the recording that was in
a larger audio file, you have to start getting creative now.
In the last part of this series, I talked about

(03:20):
how the MP3 compression algorithm was born from an
applied research institution in Germany, and the team behind the
MP3 wanted to find a way to compress audio,
specifically music, for transmission over phone lines. Eventually, this evolved
into the Moving Picture Experts Group Audio Layer 3 compression methodology,
better known as the MP3, and there are also the MPEG-

(03:44):
2 and MPEG-4 standards. MPEG-2, by the way,
is the basis of compression on DVDs, although the actual
DVD format is really a modification of MPEG-2, and
MPEG-4 is a compression strategy for audio and video
that's frequently used in lots of different capacities, including
streaming media services. So by the late nineteen seventies, researchers

(04:05):
began to explore the possibility of leveraging psychoacoustics to
figure out how to compress audio. And psychoacoustics refers to
the way we perceive sound, and also the
physiological effects of sound on us. So this involves not
just our physical sense of hearing, but also our
brains and the way our brains interpret sound. So, for example,

(04:28):
there's a psychoacoustic phenomenon that's called the Haas effect,
H-A-A-S. And I think it's pretty interesting. So
here's how the Haas effect works. If you hear the
exact same sound coming from different directions, but the two
sounds arrive within thirty to forty milliseconds of each other,
your brain will be convinced that you really only heard

(04:50):
one sound and it came from the direction that hit
you first. So let's say a sound's coming from directly
in front of you and to your left, and you
get both of them within that thirty to forty millisecond range,
and you hear the one coming from ahead of you
first. To you, you're convinced that you only heard that

(05:10):
sound once and it came from dead on straight ahead
of you. Your brain kind of discounts the one that
came off from the left, although it can reinforce it,
which ends up being really useful if you're planning out
p A systems for stage shows. I'm not joking. That
really is the way that people plan those things out.
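If it helps to see that rule of thumb spelled out, here's a tiny sketch in code. The thirty to forty millisecond window comes straight from the description above; the function name and the return values are made up purely for illustration.

```python
# Minimal sketch of the Haas (precedence) effect described above.
# The numbers and names here are illustrative, not from any real audio library.

def perceived_direction(first_arrival_ms, second_arrival_ms,
                        first_direction, second_direction,
                        fusion_window_ms=35.0):
    """Return what a listener would report for two copies of the same
    sound arriving at slightly different times."""
    gap = abs(second_arrival_ms - first_arrival_ms)
    if gap <= fusion_window_ms:
        # Within roughly 30 to 40 ms the two copies fuse into one event,
        # localized to whichever copy arrived first.
        return first_direction if first_arrival_ms <= second_arrival_ms else second_direction
    # Outside that window the listener hears two distinct sounds.
    return "two separate sounds"

print(perceived_direction(0.0, 25.0, "front", "left"))   # -> front
print(perceived_direction(0.0, 80.0, "front", "left"))   # -> two separate sounds
```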
It's pretty neat. Humans perceive sounds in a way that's

(05:31):
not necessarily representational of all the sounds surrounding us. You
can think of your brain as the filter between your
understanding and what reality actually is. A lot of stuff
goes on where it ends up getting rid of information
that your brain just says, you know what, he or
she doesn't need that, it's just gonna confuse things. We're

(05:52):
gonna dump it. And that's kind of how it works.
It's all on an unconscious level. It's not like you're
actively working to do this. So let's say you're in
a relatively busy hallway, and there could be a lot
of sounds in that hallway, stuff that's going on constantly
around you. Maybe there are doors opening and closing, maybe
there are footsteps going up and down the hallway. Maybe someone's

(06:14):
shoes are squeaking against the linoleum floor. People are chattering
away in there. But you are having a conversation with someone,
so you turn your focus on that person and other
sounds seemingly fade away. They're still present, but they're not important.
So in this example, you would actually call those other
sounds a distraction and you would really focus on the conversation.

(06:35):
That also shows how we're able to consciously direct our
sense, our perception, of hearing. So both of these factors
come into play. Now. One thing that MP three encoding
takes advantage of is something called masking, and there are
a couple of different variations of the masking effect. One
of them is called frequency masking. So let's say you've

(06:57):
got two sound frequencies that are similar; perhaps they're just
a few hertz apart. Remember, frequencies are measured in hertz,
which is really the number of oscillations per second. So
let's say you've got a sound that's at, I don't know,
one thousand hertz, and another one that's at one

(07:19):
thousand and ten hertz. Now, the human ear is
precise enough to be able to tell the difference between
two sounds that are at least two hertz apart from
each other. That's how precise our resolution of hearing is;
it's at that level. But if you get two sounds
played at the same time and they are that close

(07:40):
together in frequency, and one of those frequencies is played
at a greater volume than the other, our brains will
pick up on the louder sound and ignore the quieter sound,
even though both of them are present. What becomes important
at that point is the amplitude. Now, the further apart
in frequencies you get, the less that has an effect.
So if you get far enough apart where they are

(08:02):
two pitches, one of them noticeably louder than the other,
but they're far enough apart, you will hear both of them.
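To make the idea concrete, here's a rough sketch of the kind of yes-or-no question a masking check might ask. The specific thresholds are invented for this example; the real psychoacoustic model in an MP3 encoder is far more detailed and frequency-dependent.

```python
# Rough illustration of frequency masking. The thresholds here are made up
# for the example; a real psychoacoustic model is far more sophisticated.

def is_masked(quiet_freq_hz, quiet_level_db, loud_freq_hz, loud_level_db,
              max_gap_hz=50.0, min_level_diff_db=10.0):
    """Decide whether the quieter tone could be dropped because a nearby,
    louder tone would mask it anyway."""
    close_in_frequency = abs(quiet_freq_hz - loud_freq_hz) <= max_gap_hz
    much_quieter = (loud_level_db - quiet_level_db) >= min_level_diff_db
    return close_in_frequency and much_quieter

# A 1,010 Hz tone sitting just above a much louder 1,000 Hz tone: masked.
print(is_masked(1010, 40, 1000, 70))   # True
# Same level difference, but the tones are far apart: you hear both.
print(is_masked(5000, 40, 1000, 70))   # False
```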
It only works if the two pitches are relatively close together,
and there's not a universal formula for frequency masking. As
you get closer to the boundaries of human hearing, frequency
masking becomes easier. So if it's a really low pitch
or a really high pitch, it's easier to get away

(08:23):
with it. Once you start getting into what is thought
of as the sweet spot for human hearing, which
is generally considered to be between two and five kilohertz,
you need a greater difference in volume or a smaller
difference in frequency in order for masking to work. Frequency
masking at any rate. But then there's also temporal masking,

(08:46):
and you might say, okay, I got it. Temporal that
means time. Indeed it does, my friend. This describes the
effect of a short but loud sound masking a softer
sound for a short time. Weird thing is the loud
sound can actually mask sounds that precede it slightly, not
by a whole lot, but a little bit. MP3

(09:06):
compression takes advantage of both frequency and temporal masking when
it's trying to determine which data needs to be included
and which data can be dumped, because it won't affect
your perception of whatever the audio file is in
the first place. So you also probably remember I talked
about the physical limitation to what we humans can hear,

(09:26):
no matter what our brains might be up to. So
this doesn't have to do with our brains, you know,
filtering through the information that's coming in. This has to
do with the physical limitations of the human ear. In
the last episode of the series, I said typical human hearing.
Keep in mind, typical; there are exceptions. It covers the
range of frequencies between about twenty hertz and twenty

(09:48):
kilohertz, or twenty thousand hertz. So, twenty to twenty thousand.
Higher frequencies represent higher pitches in sound, lower frequencies lower pitches, right?
And as you get older, your ability to perceive those
higher frequencies starts to diminish. So most adults actually have
an upper range closer to sixteen kilohertz, not twenty.

(10:11):
Kids, they can hear those higher pitches. You may have
heard the story about how some convenience stores experimented with
getting rid of teenage loiterers by projecting out
the super high pitches that adults could not hear
but kids could, and it discouraged kids from hanging out
at the convenience store and loitering. I love that

(10:35):
idea so much. Anyway, that's because I'm old and my
hearing is terrible. Well, remember I also mentioned you can
detect changes in pitch at two hertz increments. If you
get below a two hertz change, like if it's just
a one hertz difference between two frequencies, it's too low
a resolution for us to detect. To us, it will

(10:56):
sound exactly the same. So if you were to hear
a frequency at one thousand and one hertz, or one point
zero zero one kilohertz, and one at one point zero zero
two kilohertz, you wouldn't notice the difference. They would
sound exactly the same to you. So if you're gonna

(11:17):
take audio and compress it, one step you could consider
is eliminating anything that's outside the actual range of frequencies
that we can hear, or simplifying any changes in frequency
that are smaller than two hertz. If you take
all that data and you say it is physically impossible
for a human to perceive this, get rid of that information,

(11:38):
then in theory it wouldn't have any effect on the
rest of the recording. But how do you go further than that? Right,
how do you create a method so that you can
really compress this file? You want a method that will
preserve the important sounds while potentially ignoring all the unimportant
or incidental sounds. And you want it to be automatic because

(11:58):
if you have to do it manually, then that's going
to take countless hours just to edit a single sound file.
So that was the challenge that the MP three research
team faced as a group. Now, their solution, which ultimately
created even more challenges, was to come up with what
was essentially a simulated human ear and brain. They needed

(12:22):
to replicate the experience of perceiving music so that an
algorithm could evaluate every sound in an audio file and
judge if it in fact was relevant enough to include
in the final compressed version. If a sound were imperceptible,
then it wouldn't make sense to include it in the
MP three file. So by leaving out all the irrelevant data,

(12:44):
they can make the audio information take up less bandwidth.
The file itself would be smaller because you just dumped
everything that wasn't important. So the team used an algorithm
called low complexity adaptive transform coding, or LC-ATC,
as the foundation for their research. This
was kind of their starting point, and this is an

(13:06):
approach that tries to do away with redundancy as much
as possible. And it also incorporates adaptation to perceptual requirements. Also,
MP3s owe a lot to the MPEG Layer 2 standard.
So Layer 2 obviously came out before Layer 3,
and so a lot of the features of Layer 3
are really legacy features from Layer 2.

(13:31):
In other words, the MP3 group kind of got stuck
with them because otherwise they would have had a problem
with backwards compatibility. So the result is kind of a
clunky arrangement under the hood, and some of the features
may make very little sense when I go through them,
but some of that is because it's a hold over
from an earlier compression strategy, which isn't terribly satisfying as

(13:53):
an answer. But the reason many parts of the MP3
compression algorithm are the way they are is because
that's the way we've always done it. So next I'm
gonna dive into the phases of compression. But before I
do that, let's all take a deep breath and take
a moment to thank our sponsor. And we're back. So

(14:22):
there are two big phases we'll need to talk about
with MP3 compression. The first phase is analysis and
the second phase is the actual compression itself. And after
that there's the process of decoding an MP3 for playback.
But that's way simpler once we get an understanding of
how the encoding process actually happens. So let's begin with analysis.

(14:45):
This is the part where the standard has to figure
out which frequencies within an audio recording
are important or perceptible. So how does a program, an
encoder, figure out what we can hear and what
we cannot hear? All right, time to get technical. So

(15:06):
you start off with your pulse code modulation audio file
or PCM file. And you might remember I talked about
PCM audio in the first episode of this series, but
just in case you don't, it's a lossless digital audio file.
The actual format could be a WAV or an
AIFF or something along those lines, but the important thing

(15:26):
to keep in mind is that it is uncompressed. Now,
that means those files tend to be pretty big. This
is our raw material that we want to take and
squish down to a more manageable, transferable size. And in
our last episode in this series, I also mentioned
that the standard for CD audio is a sample
rate of forty four point one kilohertz, and we

(15:50):
learned that you need a sample rate twice the frequency
of the highest frequency in your recording, and since human
hearing tops out at around twenty kilohertz, the standard
for CDs is forty four point one kilohertz. The
MP3 standard can support lots of different sample rates,
but forty four point one kilohertz is pretty much
the common standard. So you've got a number of samples

(16:12):
with your audio file, and that number will depend upon
how long the audio file is. You've got forty four
thousand one hundred samples per second, actually twice that for stereo,
but for the purposes of this discussion, let's kind of
stick with mono sounds so that I don't start having
math coming out of my ears. And we're still in
the very easy, simple part as far as math goes.

(16:34):
We haven't gotten to the complicated stuff yet, all right,
So you've got forty four thousand, one hundred samples per second.
To compress it into an MP3 format, the algorithm
first groups all of these samples into collections called frames.
So take those forty four thousand one hundred per second, and
then you start saying, okay, we're gonna group you in batches.

(16:56):
Each batch is called a frame and each frame contains
one thousand, one hundred fifty two samples. Now that's specifically to
maintain backwards compatibility with MPEG Layer 2, which established that
one thousand, one hundred fifty two number. But we're not
talking about MPEG Layer 2. We're talking about MPEG Layer 3,
and that means we have to get a little

(17:18):
more complicated. So each frame consists of two subgroups called granules.
So each granule has five hundred seventy six samples. Five hundred seventy
six times two is one thousand, one hundred fifty two, so five hundred seventy
six samples per granule. Now, technically MP3 encoders only
work on one granule at a time, but they may

(17:39):
reference the granules immediately before and immediately after the current
one in order to see how the audio within the
file changes over time. All right, so now you've got
your granules of five hundred seventy six samples each. Then
the MP3 encoder runs the samples through a filter bank,
which sorts the sound into thirty two frequency ranges. Are

(18:02):
you are you crazy about the numbers yet, Dylan? Are you?
Dylan's nodding. It gets worse from here. So you
have thirty two frequency ranges, which is another nod to
the Layer 2 method, which used those thirty two ranges
for encoding purposes. But we're not talking about Layer 2 here, no,
we're talking MP3. Gosh darn it. That means we

(18:24):
take those thirty two ranges and we subdivide them by
a factor of eighteen. That means we have five hundred
seventy six bands of frequencies, each band containing one five hundred seventy sixth
of the frequency range of the original sample. So what
that actually means, and this is actually pretty easy.

(18:44):
The bands are not limited to a specific number for
their frequency range. Right. The bands don't mean that
on band number one it goes from twenty hertz
up to a certain range and on band five hundred
seventy six it tops out at twenty kilohertz. That's not what
it means. They're dependent upon the original audio. So if

(19:05):
the original audio contains sounds within a narrow range of frequencies,
the five hundred seventy six bands will be more precise. But if the
original recording has a vast range of frequencies, the bands
are less precise. So another way to think about this
is with a pizza. So let's say you get extra
large pizza and you cut it into eight equal slices.

(19:27):
And then you get a small pizza and you cut
that into eight equal slices. Well, in both cases you
have with each slice one eighth of a pizza. But
the extra large pizza pizza slice is bigger than the
small pizza pizza slice. It all depends on the size
of the pizza. So in this case, it depends upon

(19:48):
the range of frequencies. And Dylan, do you think
we could go for some pizza, you know, just
put the episode on hold and go get pizza? Dylan's nodding.
It's great for audio. Yeah, so, pizza. We'll be
right back. Okay, that was good pizza. Now, oh man,
I got a whole bunch more notes. Okay, well, let's

(20:08):
let's go ahead and do the rest of this.
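Before picking the thread back up, here's a small sketch of the bookkeeping covered so far: forty-four thousand one hundred samples per second, frames of one thousand one hundred fifty-two samples, two granules of five hundred seventy-six samples each, and thirty-two sub-bands subdivided by eighteen. The filter bank below just slices the granule into chunks; a real encoder uses a polyphase filter bank, which this sketch does not attempt.

```python
# Sketch of the framing described above: frames of 1,152 samples, two granules
# of 576 samples each, and 32 sub-bands split by a factor of 18 (32 * 18 = 576).
# The "filter bank" here only slices the granule; the real encoder uses a
# polyphase filter bank, which this example does not implement.

SAMPLES_PER_FRAME = 1152
SAMPLES_PER_GRANULE = 576
SUBBANDS = 32
LINES_PER_SUBBAND = 18   # 32 * 18 = 576

def frames(pcm_samples):
    """Yield lists of 1,152 samples; the last, short frame is padded with zeros."""
    for start in range(0, len(pcm_samples), SAMPLES_PER_FRAME):
        frame = pcm_samples[start:start + SAMPLES_PER_FRAME]
        frame += [0] * (SAMPLES_PER_FRAME - len(frame))
        yield frame

def granules(frame):
    """Split a frame into its two 576-sample granules."""
    return frame[:SAMPLES_PER_GRANULE], frame[SAMPLES_PER_GRANULE:]

def fake_filter_bank(granule):
    """Stand-in for the 32-band filter bank: just chop the granule into 32 chunks of 18."""
    return [granule[i * LINES_PER_SUBBAND:(i + 1) * LINES_PER_SUBBAND]
            for i in range(SUBBANDS)]

one_second_of_mono = [0.0] * 44100            # pretend this came from a PCM (WAV) file
frame_list = list(frames(one_second_of_mono))
g1, g2 = granules(frame_list[0])
print(len(frame_list), len(g1), len(fake_filter_bank(g1)))   # 39 frames, 576, 32
```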
All right, So you've got your sound divided up into
those five hundred seventy six sub bands of frequencies, you know,
the thing I compared to pizza slices earlier. Now you
get two different mathematical processes applied to this data. One
is the fast Fourier transform, or FFT, and

(20:28):
the other is the modified discrete cosine transform, or
MDCT. Now I am not going to dive
deeply into how these transforms work because frankly, they are
beyond my mathematical understanding. But I know what they do.
I just cannot explain the process like how they do

(20:49):
what they do. So I'm going to give you the
explanation of what they do, what the outcome of each
of these transform processes happens to be. But I'm not
going to be able to tell you the actual mathematical
steps involved in each, because I don't math so good, guys.
But let's start with the fast Fourier transform. So
transform is kind of what it sounds like. It's all

(21:09):
about transforming information in some way. So in this particular case,
the FFT transforms the frequency bands we just
talked about into data that can be further analyzed by
a psychoacoustic model that's in the encoder. So this is
that simulated human ear and brain we were talking about earlier.

(21:30):
So what the encoder does is it analyzes each band
of data and looks for signs that it represents audio
that wouldn't be perceived by a human. So it's
looking for any potential masking possibilities. So are there
collections of frequencies that are grouped close together, and is
one of those frequencies louder than the others, you might

(21:51):
be able to do away with those softer frequencies because
of frequency masking. The encoder will also look at whether
or not the audio has a lot of complexity to it,
if it has a lot of changes, or if it's
just relatively steady or simple audio. Any transient sounds that
are present in the audio might allow for temporal masking,

(22:11):
so it'll analyze those as well and see if that's
a possibility. So really what it's looking for is, you know,
just any really loud sounds that stand out above the
rest of the recording. That's what the FFT
is doing. So what about the modified discrete cosine transform? Well,
this is happening in parallel with the FFT

(22:32):
and the samples get sorted into different patterns called windows,
and the criterion for sorting all has to do
with whether the sample represents a steady sound or varied sound.
So if you have a simple, steady sound, that goes
into a long window. If there's a lot of variation
in the sound, like there are a lot of consonants

(22:53):
in a vocal line or it's like a drum solo
or something like that, it would get sorted into a
series of three short windows, and each short window
contains one hundred ninety two samples. That amounts to about four milliseconds,
so four thousandths of a second, in three patterned windows.
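Here's a toy version of that sorting step. The transient test is a crude energy comparison invented for this example; a real encoder leans on the psychoacoustic model's output to decide between long and short windows.

```python
# Toy long-versus-short window selection. A real encoder bases this choice on
# the psychoacoustic model; the energy-jump test below is only a stand-in.

SHORT_WINDOW = 192          # 576 / 3, as described above
TRANSIENT_RATIO = 4.0       # invented threshold for this illustration

def energy(samples):
    return sum(s * s for s in samples)

def choose_windows(granule):
    """Return one 576-sample long window, or three 192-sample short windows
    if the sound changes abruptly within the granule."""
    thirds = [granule[i * SHORT_WINDOW:(i + 1) * SHORT_WINDOW] for i in range(3)]
    energies = [energy(t) for t in thirds]
    abrupt_change = max(energies) > TRANSIENT_RATIO * (min(energies) + 1e-12)
    return thirds if abrupt_change else [granule]

steady = [0.1] * 576                           # steady, tone-like signal -> long window
drum_hit = [0.0] * 384 + [0.9] * 192           # sudden attack -> short windows
print(len(choose_windows(steady)), len(choose_windows(drum_hit)))   # 1 3
```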

(23:15):
So you've got these windows now, either long windows for
simple sounds or short windows for the more complex sounds.
And then the modified discrete cosine transform kicks into gear.
It looks at each long window or set of three
short windows and converts them into a set of spectral values.
To some of you, that probably sounds meaningless. So let's
talk about spectral analysis for a second. First, I was

(23:38):
very disappointed to learn that spectral analysis doesn't involve a
psychologist talking to a ghost about its emotional state, so bummer.
But spectral analysis is when you look at a spectrum
of information, like a spectrum of frequencies or related information
like energy states. That's what this transform does. It takes

(23:58):
data that originally represents a slice of time in a
sound waveform. That's what a sample is: an
instant of time in a waveform. And it converts that
into information representing sound as energy across a range of frequencies. Now,
you can plot out spectral information in a lot of
different ways, but one common method is to use brightness

(24:21):
to indicate energy levels. Higher energy levels are brighter patches
in your visual representation of spectral data. High frequencies would
appear at the top of a spectral view, like imagine
a box, and at the top of the box that's
where you would find high frequencies, at the bottom of
the box that's where you find low frequencies, and it's

(24:41):
just lots of patches of color. The really bright patches
of color represent very high energy frequencies, so they could
be high or low in actual frequency, but we're
talking about energy levels, not whether it's a high or low pitch.
Looking left to right represents the passing of time, and

(25:02):
looking along any vertical points shows you the actual frequency
or pitch, and then the respective energy level is the brightness.
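If you want to see how such a view gets built, here's a bare-bones sketch that slices a signal into windows and measures energy per frequency with an ordinary FFT. It's only meant to illustrate the picture being described; it's not the specific FFT and MDCT arrangement inside an MP3 encoder.

```python
# Bare-bones spectral view: each column is one slice of time, each row is a
# frequency, and the value is energy. Uses a plain FFT purely for illustration.
import numpy as np

def spectral_view(samples, sample_rate=44100, window_size=576):
    columns = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        spectrum = np.abs(np.fft.rfft(window)) ** 2   # energy per frequency bin
        columns.append(spectrum)
    freqs = np.fft.rfftfreq(window_size, d=1.0 / sample_rate)
    return freqs, np.array(columns).T    # rows = frequency, columns = time

t = np.arange(44100) / 44100.0
tone = np.sin(2 * np.pi * 1000 * t)               # a 1 kHz test tone
freqs, view = spectral_view(tone)
brightest_bin = view[:, 0].argmax()               # "brightest" patch in the first column
print(round(freqs[brightest_bin]))                # roughly 1000 Hz, limited by bin spacing
```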
So it's kind of like looking at sound as a wave,
but instead of being a wave, you're looking at information
that indicates frequency range and energy level. That representation is
actually kind of analogous to how we hear audio. So

(25:22):
an encoder can analyze the spectral view and start to
filter out the data we wouldn't perceive due to psychoacoustics. Now,
after all that processing, the encoder looks at the frequency
sub bands and the levels of spectral intensity for each
and that information can then be used for the next phase,
which is compression. But right now I think we could

(25:45):
all stand a little decompression, So let's take another quick
break to thank our sponsor. All right, so now you're
ready to compress your analyzed audio. Good for you, and
by you I mean encoders. This has to be simpler

(26:08):
than that analysis segment, right, I mean that got a
little crazy with all the different bands and sub bands
and windows and frames and granules. Sadly it gets more complicated,
all right. So there are two layers of compression going
on with MPEG Layer 3. One of those layers depends

(26:30):
upon the psychoacoustic analysis and the other doesn't. So why
would you use two layers with different strategies like that? Well,
the reason is that one strategy is great for complex
audio with lots of components, but not so great with
simpler sounds, and the other strategy is kind of the opposite.
So the psychoacoustic approach is the one that's really good
for complicated sounds. If if you've got a lot of

(26:53):
volume changes, lots of different frequencies, it's just complicated and
rich sound, you've got a lot of opportunity to look
for masking and other acoustic elements that limit the actual
sounds that people perceive. So it means there are a
lot of chances for you to fudge by dropping
all the stuff that people probably wouldn't notice anyway. And

(27:16):
if you take a piece that's got a lot of
elements at varying volumes, there are likely several opportunities to
do this. But if you're talking about relatively straightforward
audio with few components, few changes in volume, there's really
not a whole lot of data you can ditch without
it actually affecting the quality of the audio in a
perceptible way. And this is part of what Brandenburg, that

(27:40):
guy I was talking about in our first episode in
this series, discovered when he was
working with the MP3 standard and he was listening
back to that Suzanne Vega a cappella track Tom's Diner. He
was listening to a compressed version of it, and he
said it was terrible. He said it ruined the quality
of the audio. And part of that is because that

(28:01):
particular song is fairly simple. There's just not a lot
of opportunity to take advantage of masking and other tricks
without potentially compromising the quality. So they decided to also
incorporate some traditional compression strategies, which work better with
those types of recordings. So the MP3 format takes
advantage of both the traditional approach and the psychoacoustic approach,

(28:25):
and that allows the encoder to compress files into a smaller
size without just following a single strategy, like it doesn't
have to do a one size fits all for all
elements of audio. Now, combining those two strategies requires a
little more mathematical gymnastics. So let's go back to those
five hundred seventy six frequency bins. You know, those sub bands

(28:47):
we talked about earlier. You've got to quantize those suckers.
What does that mean? It means assigning a quantity to
each frequency bin; you have to give it
a quantity of some sort so that you can end
up judging how much you can get away with dropping data.
So to do this, the encoder sorts those five hundred seventy six

(29:09):
bins into twenty two scale factor bands. How you doing
over there, Dylan? Just checking in on you. Okay, Dylan's
got a thousand yard stare going. I hope
you guys are doing okay over there? All right, So
before smoke starts coming out of your ears, let me
explain what the scale factor bands are all about. The
whole purpose of the scale factor bands is to determine

(29:32):
how the information will be stored within the compressed state.
So you want to get away with as little data
as possible before affecting sound quality. So if you can
say the same thing in a shorter space without affecting
the quality of what it is you're saying, you go
with it. Brevity is the soul of compression. So if

(29:54):
we were talking about language, I would say it's more
efficient to say it's raining outside, or even just it's raining,
because you would assume that it would be outside where
the rain is happening, and it would be inefficient for
me to say it's coming down like cats and dogs
out there. It's not as efficient as saying it's raining.

(30:16):
So if you can get away with shorter statements without
affecting the actual quality, and you could argue that by
switching from it's coming down like cats and dogs out
there to it's raining changes the quality, and that could
be a valid argument. But if you can get away
with shorter without affecting quality, you do it. So each

(30:37):
scale factor band is represented by a quantity. Then the
encoder divides that quantity by a given number called the quantizer,
which is the same across the entire frequency spectrum for
that recording. The resulting number is then rounded up or
down to a whole number. And here's an important point.

(31:00):
Individual scale factor bands can be scaled up or down
for more or less precision to represent the actual value
of those bands. So what the heck does all that mean? Well,
the purpose of dividing and rounding is just to simplify
the data to reduce the amount you need in order
to store the information. So let's go with a totally

(31:20):
hypothetical example. Let's say you've got a scale factor band
and you've decided you're representing that scale factor band with
the quantity seven eight four zero, that is, seven thousand, eight hundred forty,
and you've chosen the number one hundred to quantize your data,
meaning that you will divide each scale factor band's

(31:41):
quantity by one hundred. So this is seven thousand, eight
hundred forty. You divide it by one hundred. And
the scale factor for this particular band you have determined
is one point zero. That means that once you get
that result where you've divided the quantity by the quantizer,
you multiply by one. That means there's no change. Multiply

(32:03):
by one you get the same number. More on that
in a bit. Okay. So you take that seven thousand,
eight hundred forty, you divide it by one hundred. That gives
you seventy eight point four. Well, now you have to
round that number, so you round it down to seventy eight. Now,
when you have a decoder and you're ready to play
back the information, it comes across this quantity the seventy eight,

(32:24):
and it knows what the quantizer number was, so it
multiplies by one hundred to get back to seven thousand,
eight hundred. So the replicated number is actually forty off
from the original number. The original number again with seven thousand,
eight hundred forty, the replicated number is seven thousand, eight hundred. Now,
those inconsistencies manifest as noise in the actual playback. So

(32:48):
if you wanted to increase the precision of any given
scale factor band, you could do so by changing the
scale factor number. So in that example, just now, I
said the number was one point zero, meaning there's no
change to that result. But I could have said it
was ten, which means we would multiply the quantized number
by ten. So we would take that seven thousand, eight
hundred forty divided by one hundred you get seventy eight

(33:10):
point four, then multiplied by ten to get seven hundred eighty four.
So when the decoder decompresses the file, it would reverse
this whole thing. It would divide by the scale factor of ten and multiply by the
quantizer of one hundred. You would end up getting seven thousand, eight hundred
forty again, which means that you wouldn't introduce any noise
to the file. You would have a perfect representation. But
in some cases, the encoder may determine that any noise

(33:33):
that you generate wouldn't be noticed or it wouldn't impact
the quality of the audio enough for it to be
a problem because of other factors for that particular scale
factor band, like maybe it's really quiet, or maybe it's
really complex. So in those cases, you could reduce the
scale factor number by making it something else like point
one instead of one point oh. So that means you

(33:54):
would multiply the quantized number by point one, So the
seventy eight point four would become seven point eight four,
and then you have to round it to get a
whole integer, so you get eight; seven point eight four
rounds up to eight. Now, when a decoder decompresses
the audio, it multiplies eight by one hundred, that quantizer
that we've talked about so much, and actually

(34:17):
at this point the multiplier would have to be a thousand, because
it's also taking into account the scale factor, so it's
multiplying it by a thousand, not just a hundred. So
you would get a number that would pop up to
eight thousand. And remember, the original was seven thousand, eight
hundred forty. So you look at the difference between these two,
the original seven thousand, eight hundred forty, the new number is

(34:37):
eight thousand. There's a pretty big difference there. That change
might introduce enough noise for it to be a problem.
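Here's that whole worked example as a few lines of code, using the same hypothetical numbers from above. Real MP3 quantization is nonlinear and the quantizer and scale factors are chosen per band by the encoder, so treat this as the spirit of the arithmetic rather than the actual formula.

```python
# The worked example from above, in code. Real MP3 quantization is nonlinear
# and chosen per scale factor band; this only mirrors the simplified arithmetic.

def encode(value, quantizer, scale_factor):
    return round(value / quantizer * scale_factor)

def decode(stored, quantizer, scale_factor):
    return stored * quantizer / scale_factor

original = 7840
for scale in (1.0, 10.0, 0.1):
    stored = encode(original, 100, scale)
    rebuilt = round(decode(stored, 100, scale))
    print("scale", scale, "-> stored", stored, "rebuilt", rebuilt, "error", rebuilt - original)

# scale 1.0  -> stored 78,  rebuilt 7800, error -40   (a little noise)
# scale 10.0 -> stored 784, rebuilt 7840, error 0     (more precision, more data)
# scale 0.1  -> stored 8,   rebuilt 8000, error 160   (less precision, less data)
```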
So how does the encoder determine if a scale factor
band is meeting the proper criteria? How can it tell
if there is too much noise or if the
noise falls below the threshold? Well, it goes through what
is called a Huffman coding process. At this point, Dylan

(35:01):
is currently just staring at the wall and drool is
coming out. The Huffman coding process converts scale factor bands
into binary strings, and the process goes through a series
of tables to determine if the data within the scale
factor band requires more or less precision to describe the
sound without affecting the audio quality. So, Huffman coding is

(35:22):
a process. And when you start with a large number
of possibilities and you begin to narrow it down, uh.
Some people describe it as the coding equivalent of twenty questions.
So you ask your first question like animal, vegetable or mineral.
You get an answer so animal. While that first answer
eliminates a ton of other possibilities and narrows the focus

(35:42):
like anything that doesn't pertain to animal, you can automatically
discount because you already know it can't apply to that answer.
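For the curious, here's a compact sketch of Huffman coding in general: frequent values get short bit strings, rare values get long ones. Note that the MP3 standard uses a fixed set of predefined Huffman tables rather than building a tree from the data the way this sketch does.

```python
# Generic Huffman coding sketch: common symbols get short codes, rare ones get
# long codes. MP3 uses predefined tables from the standard; this just shows the idea.
import heapq
from collections import Counter

def huffman_codes(symbols):
    counts = Counter(symbols)
    # Each heap entry: (count, tie_breaker, {symbol: code_so_far})
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

quantized_values = [0, 0, 0, 0, 1, 1, 2, 0, 1, 3]   # pretend scale-factor-band data
codes = huffman_codes(quantized_values)
encoded = "".join(codes[v] for v in quantized_values)
print(codes)   # something like {0: '0', 1: '11', 2: '100', 3: '101'}
print(len(encoded), "bits instead of", len(quantized_values) * 2, "with fixed two-bit codes")
```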
With MP3 compression, this means making certain the number
of bits representing a granule fits, because remember I mentioned that
in the MP3 format you have frames.

(36:02):
Each frame has one thousand, one hundred fifty two samples
and consists of two granules with five hundred seventy six samples each.
when you answer the first question, it eliminates a lot
of other possibilities and narrows the focus. So like with animal, vegetable, mineral,
if I say animal, you're gonna not ask any questions
that have to do with minerals or vegetables only because

(36:22):
it wouldn't make sense. You know, those aren't gonna apply.
Same thing with MP3s, except this time it
means making certain the number of bits representing a granule.
Remember, there are two granules per frame with the MP3 layer.
You want to make sure that the number of bits
representing that granule matches the chosen bit rate for the compression.

(36:43):
So if after going through this process, the encoder says, hey,
this granule has more bits than what's allowed. It's too
many bits, we gotta get rid of some of these,
the encoder can adjust the scale factor band so that
there's less precision, meaning that multiplier, in other words, that
bit I talked about earlier, and thus reduce the amount
of data needed to represent that particular granule. If a

(37:07):
granule comes in under the bit rate, the encoder can
increase the precision to reduce noise and fill that granule
out properly so it matches the actual threshold. After all this,
the pairs of granules become frames within the MP3 file.
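That adjust-and-recheck cycle is essentially a loop, and here's a schematic version of it. The bit-counting function and the single precision knob are stand-ins I've invented; a real encoder counts bits from its Huffman tables and adjusts individual scale factor bands.

```python
# Schematic version of the rate loop described above. The bit counting and the
# single "precision" knob are stand-ins; a real encoder adjusts individual scale
# factor bands and counts bits from its Huffman tables.

def bits_needed(quantized_values):
    """Stand-in cost model: bigger quantized values take more bits to write down."""
    return sum(max(1, v.bit_length()) for v in quantized_values)

def encode_granule(band_values, bit_budget):
    precision = 1.0
    while True:
        quantized = [round(v * precision) for v in band_values]
        if bits_needed(quantized) <= bit_budget or precision <= 1.0 / 1024:
            return quantized, precision
        precision /= 2          # too many bits: lower precision, accept more noise

bands = [784, 312, 55, 9, 403, 77]             # hypothetical per-band quantities
quantized, precision = encode_granule(bands, bit_budget=30)
print(quantized, "at precision", precision)
```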
And the only other component in an MP three file

(37:27):
apart from these frames is the ID3 metadata.
This is pretty simple. This is like a header, and
it comes before all the frames in the audio file
and contains information about the file itself, which can
include stuff like the title of a song, an artist name,
an album title, other stuff like that. It can also
include copyright information as well as information about the file itself,

(37:50):
such as whether or not it's a stereo recording or
a mono recording. So when you use a decoder like
an MP3 player, it takes this compressed information, these
representations that the music has been reduced to,
and it converts that Huffman data back into the quantized format,

(38:12):
scales the data back up to its original size or
close approximation. Remember, the uncompressed version may actually be
off by a significant amount depending upon each individual granule.
And all of that data gets recombined into a new
PCM sample that can be played back to you.
And that's all there is to it. Nothing could be easier.
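As a recap of the playback side, here's an outline in code that mirrors the order of operations just described: undo the Huffman coding, undo the quantization, then hand the result to the inverse transforms that rebuild PCM samples. Every function body here is a deliberately trivial stand-in, not real MP3 math.

```python
# Outline of the decode path, with trivially simple stand-ins for each stage so
# the data flow is visible end to end: undo the Huffman coding, undo the
# quantization, then rebuild time-domain samples for playback.

def huffman_decode(bitstring, code_table):
    """Walk the bit string, emitting a symbol whenever a code matches."""
    reverse = {code: sym for sym, code in code_table.items()}
    out, current = [], ""
    for bit in bitstring:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return out

def dequantize(values, quantizer=100, scale_factor=1.0):
    return [v * quantizer / scale_factor for v in values]

def to_pcm(spectral_values):
    # Placeholder for the inverse transforms (MDCT and filter bank) that turn
    # spectral values back into time-domain samples.
    return spectral_values

code_table = {78: "0", 3: "10", 12: "11"}          # hypothetical tiny table
frame_bits = "0" + "10" + "11" + "0"
samples = to_pcm(dequantize(huffman_decode(frame_bits, code_table)))
print(samples)    # [7800.0, 300.0, 1200.0, 7800.0]
```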

(38:35):
All right, that took a lot out of me, so
I got really technical, and I apologize if I lost
any of you out there, or for those of you
who have a lot of experience working on compression algorithms,
for oversimplifying in several cases. But now we've got a
full episode about this, and I hope you have a
better understanding of how a big sound file can be

(38:55):
reduced to a smaller sound file. Next time, I'll just
say magic. It will make everyone happier. But I hope
you guys appreciated this. In the next episode in this
series it will be far less technical. I'm going to
be more historical. I'm going to talk about the progression
of the MP3 player, how it came about, how

(39:16):
it evolved, and how the iPod ended up becoming the
dominant brand in a sea of MP3 players, and
then maybe kind of explore where MP3 players are today,
like how many are there, how big is the market?
Are people still buying them? That kind of question.
If you guys have any questions for me, or comments

(39:37):
or suggestions anything like that, send me a message. My
email is tech Stuff at how stuff works dot com,
or you can drop me a line on Facebook or Twitter,
the handle at both of those is tech stuff h
s w, and I'll talk to you guys again really
soon. For more on this and thousands of other topics,

(40:01):
visit how stuff works dot com.
