Decoding MP3s, Volume 2: A Portrait of the Audio as a Compressed Bit Stream

The other day, I wrote about the process of how I arrived at decoding MP3s in Python. Today, I’ve decided to create a rough outline for the next few MP3 posts. The idea is to keep the next few posts light. I’m just going to pull from my notes, skip fact-checking, and hit “post”. I’ll start today with a very basic outline of what MP3s are made of. I’ll save the more intricate details for later posts, by which time I hope to actually understand what I’m saying.

First, I’m going to try to quickly give you a sense of how MP3s become what they are.
A piece of audio is split into chunks (aka frames) and a discrete cosine transform converts these time-domain samples into frequency-domain samples. (Strictly speaking, MP3 runs the audio through a filter bank and then a modified DCT, or MDCT, but the idea is the same.) At this point we have the component frequencies of the audio as well as how much of each frequency is present in the signal. To put it another way, if the audio data were a fully-cooked meal, the DCT returns the ingredient list. These frequency samples are arrays of floating-point numbers.
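To make the “ingredient list” idea concrete, here is a minimal sketch using a plain, naive DCT-II. This is an assumption made for illustration: a real MP3 codec uses a filter bank plus the MDCT, and the function name `dct_ii` and the toy signal are made up for this example.

```python
import math

def dct_ii(samples):
    """Naive O(n^2) DCT-II: time-domain samples in, frequency coefficients out."""
    n = len(samples)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(samples))
        for k in range(n)
    ]

n = 32
# A toy "meal" cooked from two ingredients: a lot of basis function 3
# and a little of basis function 10.
signal = [
    1.0 * math.cos(math.pi / n * (i + 0.5) * 3)
    + 0.25 * math.cos(math.pi / n * (i + 0.5) * 10)
    for i in range(n)
]

coeffs = dct_ii(signal)

# The two largest coefficients recover the ingredient list.
strongest = sorted(range(n), key=lambda k: abs(coeffs[k]), reverse=True)[:2]
print(strongest)
```

The two strongest coefficients land on the two basis functions the signal was built from, and the ratio between them matches the 1.0 to 0.25 mix.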
It’s around here that the “lossy”-ness of MP3 comes into play, and it does so in a couple of ways. One way is by flat-out removing certain samples. Frequencies outside the range of human hearing (above ~20kHz and below ~20Hz) are removed, and silent/quiet sections of audio are encoded using fewer bits. These are both fairly intuitive: if we can’t hear it, don’t encode it. Another introduction of loss is a little more subtle. To encode these samples using fewer bits, the values are quantized, meaning scaled by a certain number (aka a scaling factor) and turned into integers. Whatever fractional component is still present is dropped, and that dropped fraction is the loss. Picking the scaling factor carefully keeps the loss small while increasing compression.
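Here is a rough sketch of that scale-and-drop-the-fraction step. It assumes a simple uniform quantizer, which is not what MP3 actually does (the real thing uses a non-linear power law plus per-band scale factors), and the names `quantize` and `dequantize` are invented for the example.

```python
def quantize(values, scale):
    """Scale each value and drop the fractional part (truncating toward zero)."""
    return [int(v * scale) for v in values]

def dequantize(ints, scale):
    """Undo the scaling. The dropped fraction is gone for good."""
    return [q / scale for q in ints]

samples = [0.12, -0.57, 0.031]
scale = 10  # hypothetical scaling factor

quantized = quantize(samples, scale)
restored = dequantize(quantized, scale)

for original, back in zip(samples, restored):
    print(original, "->", back, "error:", abs(original - back))
```

A bigger scale keeps more detail but produces bigger integers that cost more bits, which is the trade-off the encoder is juggling.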
After all this happens, the values are Huffman encoded and written starting wherever the previous section of audio data ended.

Hopefully that overview balanced brevity and depth. I certainly left a lot out. At the very least I think I mentioned all the terms I wanted to.

Basic Sections

At the highest level, MP3s can be broken out into two parts: ID3 tags and the actual MP3 frames. ID3 tags contain metadata that’s useful for the listener (artist, track, and album names, among other things), but it’s of no importance to the sound. So in the context of decoding MP3 data, ID3 tags can be skipped. That leaves the bulk of the file, which is the MP3 frames themselves.
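Skipping the tag at the front of a file could look something like the sketch below. It assumes an ID3v2 tag, which starts with the bytes "ID3" and stores its length as four 7-bit ("synchsafe") bytes. The function name `id3v2_size` is made up, and older ID3v1 tags (128 bytes starting with "TAG" at the end of the file) are not handled here.

```python
def id3v2_size(data):
    """Return how many bytes a leading ID3v2 tag occupies, or 0 if there is none."""
    if len(data) < 10 or data[:3] != b"ID3":
        return 0
    flags = data[5]
    # The size field is 4 bytes with only the low 7 bits of each byte used.
    size = 0
    for byte in data[6:10]:
        size = (size << 7) | (byte & 0x7F)
    # ID3v2.4 may append a 10-byte footer, signalled by flag bit 0x10.
    footer = 10 if flags & 0x10 else 0
    return 10 + size + footer  # 10-byte header + tag body + optional footer

# A fake tag header claiming a 257-byte body, followed by padding.
fake_file = b"ID3\x04\x00\x00\x00\x00\x02\x01" + bytes(300)
offset = id3v2_size(fake_file)
print(offset)  # the first MP3 frame would be searched for from this offset
```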

MP3 Frames

Each frame has three main components: a header, side information, and main data. The header and side information contain details on how to read the main data and the main data contains the actual quantized, Huffman-encoded frequency samples that will someday turn into music for your ears.

Frame Headers

The frame header contains information on the number of channels, bit rate, sampling frequency, and a few other fields that are important for decoding the data around it. The bit rate and sampling frequency (plus a padding bit) tell you how many bytes are in each frame. The number of channels tells you how many chunks of things you’ll need to decode per frame (lots of for i in range(0, channels) stuff going on).
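As a sketch of how those fields come out of the 4-byte header, the snippet below handles MPEG-1 Layer III only and rejects everything else. The function name `parse_header` and the dictionary keys are my own, and the CRC, mode extension and emphasis bits are ignored.

```python
# Lookup tables for MPEG-1 Layer III (index 0 = free format, 15 = invalid).
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112,
                 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES = [44100, 48000, 32000, None]

def parse_header(four_bytes):
    word = int.from_bytes(four_bytes, "big")
    if word >> 21 != 0x7FF:                      # 11 sync bits, all ones
        raise ValueError("no frame sync")
    version = (word >> 19) & 0b11                # 0b11 = MPEG-1
    layer = (word >> 17) & 0b11                  # 0b01 = Layer III
    if version != 0b11 or layer != 0b01:
        raise ValueError("not MPEG-1 Layer III")
    bitrate = BITRATES_KBPS[(word >> 12) & 0xF]
    sample_rate = SAMPLE_RATES[(word >> 10) & 0b11]
    if bitrate is None or sample_rate is None:
        raise ValueError("free-format or reserved value")
    padding = (word >> 9) & 1
    channel_mode = (word >> 6) & 0b11            # 0b11 = mono, anything else = 2 channels
    return {
        "bitrate_kbps": bitrate,
        "sample_rate": sample_rate,
        "channels": 1 if channel_mode == 0b11 else 2,
        # Frame length in bytes, header included.
        "frame_length": 144 * bitrate * 1000 // sample_rate + padding,
    }

header = parse_header(bytes([0xFF, 0xFB, 0x90, 0x00]))
print(header)
```

For a 128 kbps, 44.1 kHz stream this gives frames of 417 bytes, or 418 when the padding bit is set.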

Side Information

Side information contains more information about how to decode the main data. The important information here is broken up into granules, which up to this point I’ve thought of as “groups of bytes”. So far my thinking has not needed to evolve, so I will continue to think of granules as just groups of bytes in the main data that can be decoded together. I will try to note when/if this mental model needs updating in later posts. The frame’s side information describes the scaling factors for each granule and channel (how many bits they take up and whether or not they are shared between granules; the scaling factors themselves sit at the start of the main data), which Huffman table to use to decode the main data (there are 15 different tables to choose from), and some offset information (e.g. where to start reading and how much to read).
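Here is a partial sketch of reading the side information for an MPEG-1 frame, where it is 17 bytes for mono and 32 bytes for stereo. It only pulls out the first few fields per granule and channel and skips the rest (including the Huffman table selection), so treat it as a starting point. `BitReader` and `parse_side_info` are names I made up.

```python
class BitReader:
    """Reads big-endian bit fields out of a bytes object."""
    def __init__(self, data):
        self.data = data
        self.pos = 0  # position in bits

    def read(self, nbits):
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value

def parse_side_info(data, channels):
    r = BitReader(data)
    info = {"main_data_begin": r.read(9)}   # backwards offset, in bytes
    r.read(5 if channels == 1 else 3)       # private bits, ignored
    # One "scale factors are shared" flag per scale factor band group, per channel.
    info["scfsi"] = [[r.read(1) for _ in range(4)] for _ in range(channels)]
    info["granules"] = []
    for _ in range(2):                      # two granules per frame
        granule = []
        for _ in range(channels):
            granule.append({
                "part2_3_length": r.read(12),   # bits of main data for this chunk
                "big_values": r.read(9),
                "global_gain": r.read(8),
                "scalefac_compress": r.read(4),
            })
            r.read(26)  # window switching, table selection, etc. (skipped in this sketch)
        info["granules"].append(granule)
    return info, r.pos

mono_side_info = bytes([0x02, 0x80]) + bytes(15)
info, bits_used = parse_side_info(mono_side_info, channels=1)
print(info["main_data_begin"], bits_used // 8)
```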

Main Data

Last but not least, there’s the main data. Once we have all the details from the header and side information, reading this stuff should, in theory, be straightforward. However, my MP3 decoder is 1) not done and 2) screws up specifically at decoding the main data, so either I’m not good at straightforward tasks or reading the main data isn’t straightforward. Maybe a little of both. At the moment there’s really only one interesting aspect of the main data that I can talk about, and that’s the bit reservoir. For reasons that aren’t entirely clear to me, while so much else is variable (scaling factors, bit rate), the number of time samples per frame is constant. Sometimes these samples can be encoded using N bits, sometimes they require M bits, and that means the main data can end wherever. But instead of starting to write the next frame header, the MP3 encoder will start to write the next frame’s main data. This is the essence of the bit reservoir. Where one main data ends, another begins, and not a single bit goes to waste. I can’t say I understand why the MP3 specification goes this route instead of just starting the next header, but the spec is $160 and I’m unemployed, so wild conjecture is all I’ve got.
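From the decoder's side, the bit reservoir can be sketched roughly like this. The assumption is that each frame's side information carries a backwards byte offset (main_data_begin, a 9-bit field, so at most 511) saying how far back into earlier frames' main data the current frame's data starts. The function name `gather_main_data` and the toy byte strings are invented for the example.

```python
MAX_RESERVOIR = 511  # main_data_begin is 9 bits, so it can't point further back

def gather_main_data(frames):
    """frames: list of (main_data_begin, main data bytes stored in that frame).

    Returns, per frame, the bytes available from where that frame's main data
    starts, or None if it points at data we never saw (e.g. after a seek).
    """
    reservoir = b""
    available = []
    for main_data_begin, payload in frames:
        if main_data_begin > len(reservoir):
            available.append(None)
        else:
            start = len(reservoir) - main_data_begin
            available.append(reservoir[start:] + payload)
        # Keep only as much history as a later frame could ever reach back into.
        reservoir = (reservoir + payload)[-MAX_RESERVOIR:]
    return available

frames = [
    (0, b"AAAABB"),  # frame 1 only needs AAAA; the trailing BB belongs to frame 2
    (2, b"BBCC"),    # frame 2 starts 2 bytes back, inside frame 1's leftover space
]
chunks = gather_main_data(frames)
print(chunks)
```

How many of those available bytes a frame actually consumes is decided by the lengths in its side information, which this sketch leaves out.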

So that’s what’s in an MP3. Sorry for explaining it poorly, but poorly is the only way I understand it.