Living with (Data) Loss
mp3 and its cousins are a fact of life... here's how to get the most out of them
If you do audio for the Web, broadcast, or movie theaters, sooner or later you'll have to deal with some form of lossy data compression. But you don't have to buy into the mp3 myths and hype.
If you understand how those algorithms actually work - how they decide what data to lose - your tracks can sound a lot better. This is the second tutorial in a four-part series, which started with Hearing What's Not There. That article covered some quirks that have evolved in human hearing, quirks that make lossy compression possible.
But since I write about how sound relates to picture, that piece starts with a few non-audio quirks... and how they make stage magic possible. They're disclosed in a recent paper from the journal Nature Reviews Neuroscience. Doctors and some well-known magicians (and others) got together to explore the subject. It's worth clicking on and reading... or at least, downloading to your desktop for later perusal.
Here's a very quick sample. One of the 'and other' authors of the paper is a professional thief. Researchers explain part of his technique:
To steal a watch directly from the wrist of a mark, the pickpocket might first squeeze the wrist while the watch is still on... it makes a high contrast somatosensory impression that adapts the touch receptors in the skin, making them less sensitive to the subsequent light touches that are required to unbuckle and remove the watch.
...you have been warned. Well, lossy compression uses similar high-level impressions to change how your ears hear. But instead of squeezing your wrist, it uses parts of the music you're already listening to!
As I said, this is a four-part series. I'll be happiest if you read the parts in order, but it's up to you. You can learn a lot from today's tutorial, even without understanding the principles that make it work:
- What mp3 and similar algorithms are doing 'under the hood'
- How to choose the best settings for your particular sound (some of them aren't intuitive)
- Some audio demos of what mp3 actually takes away... with a graphic analogy of how you can do the tests yourself
Magic Part II... The Takeaway
If you couldn't hear it in the first place, is it really missing? There's a song about stars at night being big and bright (clap four times) in the heart of Texas. But while that state is certainly nice, we know those same stars are also over Ohio and New York. In fact, you could imagine standing in Times Square and seeing a beautiful night sky.
But while you can imagine it, you'll probably never see it. Having all those bright signs around makes your eyes less sensitive to starlight, and the atmosphere over Times Square further confuses things by reflecting 'earthlight' back at you.
As you learned in my last blog entry, human ears get similarly desensitized when they hear other sounds at nearby frequencies (the nerves can't handle the data). Add that to the natural 'blurring' of even the best playback systems and acoustics (equivalent to those atmospheric reflections), and it's no wonder we're sometimes blind to certain details in a recording.
I'll assume you read that blog entry, or already knew how spectral and temporal masking work. If not (nyah, nyah): you'll just have to trust me.
Framed!
When you run a signal through an mp3 or similar encoder, the algorithm first breaks the audio into frames, lasting up to a few milliseconds each.
These frames have nothing to do with video frames in the same file. Their length is determined primarily by the data rate - or amount of compression - you've chosen. Lower rate files, with more compression, use longer frames.
Each frame is boosted so its loudest wave reaches 0 dBFS. This is to take advantage of every bit during processing. The amount of boost is noted with the frames, so they can be restored to their original volume on playback.
(That's why normalizing or boosting a raw audio file doesn't make compression any more efficient. A louder file might be easier for users to hear after it's been decoded, but that's a different issue.)
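To make that boost-and-restore step concrete, here's a toy sketch in Python. This is not the real mp3 scale-factor mechanism, just the idea: boost each frame to full scale, note the gain alongside it, and undo the boost on playback. The function names and the 64-sample frame are my own inventions.

```python
import math

def normalize_frame(frame):
    """Boost a frame so its loudest sample reaches 0 dBFS (full scale = 1.0),
    returning the gain that was applied so playback can undo it."""
    peak = max(abs(s) for s in frame)
    if peak == 0.0:
        return frame, 1.0                      # silent frame: nothing to boost
    gain = 1.0 / peak
    return [s * gain for s in frame], gain

def restore_frame(boosted, gain):
    """Undo the encoder's boost on playback."""
    return [s / gain for s in boosted]

# A quiet frame: a 440 Hz tone peaking around -12 dBFS (amplitude 0.25)
frame = [0.25 * math.sin(2 * math.pi * 440 * n / 44100) for n in range(64)]
boosted, gain = normalize_frame(frame)
restored = restore_frame(boosted, gain)

assert abs(max(abs(s) for s in boosted) - 1.0) < 1e-9   # now peaks at 0 dBFS
assert all(abs(a - b) < 1e-9 for a, b in zip(frame, restored))
```

Note that the gain travels with the frame; nothing about the original file's overall level matters, which is exactly why pre-normalizing the file buys you nothing.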
The algorithm looks at each frame and measures how much energy the frame has at different frequencies. The number of frequencies is a trade-off: more bands allow tighter masking, but require sharper filters that respond more slowly. The mp3 format uses up to 576 bands; other compression systems have more or fewer.
- If a particular band is silent during the frame, the process notes it and doesn't waste any more data there.
- If a band is loud, it reduces the number of bits. The loud signal will mask noises at the same frequency.
- If a band is soft, it's processed with more bits, unless there's a masking sound in an adjacent band. Then it assumes the band won't be heard, and deletes it entirely.
- The resulting audio is run through a data packer similar to WinZip or Stuffit. Normal audio is too complex to compress well in these systems, but they do a good job with the simpler data-reduced frames.
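The four rules above can be sketched as a toy bit allocator. Everything here is invented for illustration - the thresholds, the bit counts, and the crude one-DFT-bin-per-band analysis - since a real encoder uses a proper filterbank and a psychoacoustic model. But the decision logic mirrors the bullets:

```python
import cmath, math

def band_energies(frame, n_bands):
    """Crude per-band energy for one frame: one DFT bin per band,
    purely as a stand-in for the encoder's real filterbank."""
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * b * n / N)
                    for n in range(N))) / N
            for b in range(n_bands)]

def allocate_bits(energies, loud=0.1, silent=1e-4):
    """Mirror the four rules: skip silent bands, spend few bits on loud ones,
    delete soft bands masked by a loud neighbor, give the rest full depth."""
    bits = []
    for i, e in enumerate(energies):
        neighbors = energies[max(0, i - 1):i] + energies[i + 1:i + 2]
        if e < silent:
            bits.append(0)                       # silent: no data at all
        elif e >= loud:
            bits.append(6)                       # loud: masks its own noise
        elif any(n >= loud for n in neighbors):
            bits.append(0)                       # soft but masked: deleted
        else:
            bits.append(12)                      # soft and exposed: full depth
    return bits

# A frame with a loud tone in band 3 and a faint, unmasked tone in band 8
N = 64
frame = [math.sin(2 * math.pi * 3 * n / N) + 0.01 * math.sin(2 * math.pi * 8 * n / N)
         for n in range(N)]
bits = allocate_bits(band_energies(frame, 16))

assert bits[3] == 6      # loud band: fewer bits, noise is masked
assert bits[8] == 12     # soft but exposed band: full depth
assert bits[2] == 0      # silent band: skipped entirely
```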
Your First Choices
The most critical setting in a lossy compression scheme, including mp3 encoders, is the bitrate. Lower bitrates mean longer frames, increasing the chance that masking sounds won't last the whole time. The result is noise and a flangey or chirping effect.
Which bitrate you consider low, and how much noise or distortion is acceptable, depends on the application. But if you do things right, broadcast-quality sound can be achieved at 128 kbps. One of the most important factors is which encoder you use. Even in a standardized format like mp3, there are multiple trade-offs that program designers have to make.
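Some rough arithmetic shows how quickly a low bitrate starves each frame. The 26 ms figure is roughly what an mp3 frame covers at a 44.1 kHz sample rate; the helper function itself is just for illustration.

```python
def bits_per_frame(bitrate_bps, frame_ms):
    """How many bits the encoder can spend on one frame."""
    return bitrate_bps * frame_ms / 1000

# At 128 kbps, a ~26 ms frame gets a reasonable budget...
budget_128 = bits_per_frame(128_000, 26)   # 3328 bits
# ...while at 32 kbps the same span of audio gets a quarter of that,
# for the same number of bands and masking decisions.
budget_32 = bits_per_frame(32_000, 26)     # 832 bits
```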
Commercial encoders are usually better designed in this respect than freeware. Because they've also paid licensing fees to the Fraunhofer Institut - inventors of the mp3 format - commercial publishers may have had more access to the inner workings of the system. But at least one free encoder, the open-source LAME library, is also very good.
It makes sense to use a high-quality encoder. Other things will help as well:
- If you have to encode at a low bitrate, get rid of high frequencies first. Apply a low-pass filter at 8 kHz to 12 kHz (or use a good sample-rate converter to lower the rate to 22 kHz, which filters sounds above 10 kHz). The moderate dullness this imparts will be less objectionable than low bitrate noises.
- Don't try to help the high-frequency filtering by boosting just below the Nyquist Limit, even though many encoders or sample rate converters give you this choice with a "Preserve Highs" option. It wastes precious bits on unimportant sounds, and can increase the chance of flanging or chirping.
- Don't use extreme broadcast-style level compression, particularly multiband compression. This makes it harder for the algorithm to tell the difference between important sounds and those that can be lost.
- Speech is harder to encode than music because it changes faster. The most common distortion at low bitrates is a reverberation-like noise tail on the words. It can be lessened by lowering the number of bands in the encoder, which lets the internal filters respond faster.
Most encoders don't let you control the number of filters, but many let you select a "speech" optimization. That's what it's doing.
- Higher background noise levels also increase problems with encoding. Start with the cleanest possible recording.
- The above note does not mean you can fix a noisy recording by running it through a noise reduction plug-in first. The two algorithms fight each other.
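The low-pass advice in the first tip above can be sketched with a single-pole filter. This is far gentler than the steep filter or sample-rate converter you'd actually use before encoding, and the function is my own toy, but it shows the principle: content well below the cutoff sails through, while the extreme highs are knocked down before they can waste bits.

```python
import math

def lowpass(samples, cutoff_hz, rate_hz):
    """Single-pole (RC-style) low-pass filter: a very crude stand-in for
    the steeper filtering you would really apply before encoding."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / rate_hz
    alpha = dt / (rc + dt)
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)     # exponential smoothing toward the input
        out.append(y)
    return out

rate = 44_100
tone = lambda f: [math.sin(2 * math.pi * f * n / rate) for n in range(rate // 10)]

low  = lowpass(tone(1_000),  10_000, rate)   # well below cutoff: barely touched
high = lowpass(tone(18_000), 10_000, rate)   # far above cutoff: attenuated

assert max(abs(s) for s in low[1000:])  > 0.9   # skip the start-up transient
assert max(abs(s) for s in high[1000:]) < 0.6
```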
Stereo or joint stereo
Most algorithms expect the left and right channels of a stereo pair to be similar. This is usually true in music. A joint stereo mode encodes only major differences between the channels, particularly at high frequencies, freeing up more of the bitrate for better quality. But ambiences and crowd sounds can be very different on the left and right, if the space isn't reverberant and there are lots of spread-out sources. With these sounds, "joint stereo" pushes things toward the center.
Variable bitrate
This option, also known as VBR, can both reduce file size and improve the sound. The algorithm uses a different bitrate for each frame, depending on how many bits that frame actually needs. This avoids wasting bits on pauses or easy-to-encode passages.
VBR works best on simpler or slower-moving sources, including a lot of new age or classical music. It presents little advantage on faster and highly processed sounds, such as most pop styles, because the maximum bitrate must be used for most frames.
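A toy VBR policy might look like the sketch below. Real encoders judge each frame with a psychoacoustic model rather than raw energy, and every threshold here is invented, but it captures the idea: quiet or easy frames get the floor rate, busy frames get the ceiling.

```python
import math

def choose_bitrate(frame, floor=32_000, ceiling=320_000):
    """Toy VBR policy: spend bits in rough proportion to frame energy.
    The 1e-3 silence threshold and 0.3 'busy' RMS are arbitrary."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    if rms < 1e-3:
        return floor                         # a pause: cheapest rate
    scale = min(rms / 0.3, 1.0)              # crude complexity estimate
    return int(floor + scale * (ceiling - floor))

quiet = [0.0] * 512                          # silence between phrases
loud  = [0.5 * math.sin(2 * math.pi * 440 * n / 44100) for n in range(512)]

assert choose_bitrate(quiet) == 32_000       # pause: minimum rate
assert choose_bitrate(loud) > 200_000        # busy frame: near the ceiling
```

This also shows why VBR gains little on dense pop material: when nearly every frame scores as "busy," nearly every frame gets the maximum rate anyway.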
Lossy: The Next Generation
When you convert a compressed file back to 16-bit linear audio, something will be missing. If you encode it again, the algorithm has a harder time finding details that can be safely deleted. Noise and distortion build up with each subsequent pass.
If you must go through multiple encodings, stay with the highest bitrates possible. If the final release format will be at a low bitrate, don't apply it until the last step.
There is some evidence that multiple generations through the same compressor sound worse than the same number of generations through a variety of algorithms.
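You can model this generational buildup with nothing fancier than coarse quantization. Treating each encode/decode pass as rounding to a grid - a different grid each time, standing in for the fact that intervening processing never puts the signal back exactly where the last pass left it - shows the error growing. This is a gross simplification of what an encoder actually discards, but the accumulation effect is real:

```python
import math

def quantize(samples, levels):
    """Simulate one lossy generation as rounding to a coarse grid."""
    step = 2.0 / (levels - 1)
    return [round(s / step) * step for s in samples]

def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

original = [math.sin(2 * math.pi * n / 100) for n in range(1000)]

one_pass = quantize(original, 256)
two_pass = quantize(quantize(original, 256), 201)   # second pass, different grid

# The second generation lands farther from the original than the first did.
assert rms_error(original, two_pass) > rms_error(original, one_pass)
```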
What have they done to my song?
Want to hear exactly what the mp3 algorithm takes away from voice or music, when you do it properly? No hype, no simulation... but a scientific experiment you can replicate on your desktop. It's at my website.
Next time: how lossless encoders shrink files without sacrificing any data.