
Living with (Data) Loss

mp3 and its cousins are a fact of life... here's how to get the most out of them

[Image: a starry night sky over Times Square]

If you do audio for the Web, broadcast, or movie theaters, sooner or later you'll have to deal with some form of lossy data compression. But you don't have to buy into the mp3 myths and hype.

If you understand how those algorithms actually work - how they decide what data to lose - your tracks can sound a lot better. This is the second tutorial in a four-part series, which started with Hearing What's Not There. That article covered some quirks that have evolved in human hearing. They make lossy compression possible.

But since I write about how sound relates to picture, that piece starts with a few non-audio quirks... and how they make stage magic possible. They're disclosed in a recent paper in the journal Nature Reviews Neuroscience, where doctors and some well-known magicians (and others) got together to explore the subject. It's worth clicking on and reading... or at least, downloading to your desktop for later perusal.

Here's a very quick sample. One of those 'other' authors of the paper is a professional thief. The researchers explain part of his technique:

To steal a watch directly from the wrist of a mark, the pickpocket might first squeeze the wrist while the watch is still on... it makes a high contrast somatosensory impression that adapts the touch receptors in the skin, making them less sensitive to the subsequent light touches that are required to unbuckle and remove the watch.

...you have been warned. Well, lossy compression uses similar high-level impressions to change how your ears hear. But instead of squeezing your wrist, it uses parts of the music you're already listening to!

As I said, this is a four-part series. I'll be happiest if you read the parts in order, but it's up to you. You can learn a lot from today's tutorial, even without understanding the principles that make it work.

On to part two.

Magic Part II... The Takeaway

If you couldn't hear it in the first place, is it really missing?

There's a song about stars at night being big and bright (clap four times) in the heart of Texas. But while that state is certainly nice, we know those same stars are also over Ohio and New York. In fact, you could imagine standing in Times Square and seeing a beautiful night sky.

But while you can imagine it, you'll probably never see it. Having all those bright signs around makes your eyes less sensitive to starlight, and the atmosphere over Times Square further confuses things by reflecting 'earthlight' back at you.

As you learned in my last blog entry, human ears get similarly desensitized when they hear other sounds at nearby frequencies (the nerves can't handle the data). Add that to the natural 'blurring' of even the best playback systems and acoustics (equivalent to those atmospheric reflections), and it's no wonder we're sometimes blind to certain details in a recording.

I'll assume you read that blog entry, or already knew how spectral and temporal masking work. If not (nyah, nyah): you'll just have to trust me.

Framed!

When you run a signal through an mp3 or similar encoder, the algorithm first breaks the audio into frames, lasting up to a few milliseconds each.

These frames have nothing to do with video frames in the same file. Their length is determined primarily by the data rate - or amount of compression - you've chosen. Lower rate files, with more compression, use longer frames.
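If you like to think in code, here's a minimal sketch of the framing idea in Python. The function name, the NumPy details, and the frame length are mine for illustration - each format fixes its own framing rules:

```python
import numpy as np

def split_into_frames(samples: np.ndarray, frame_len: int) -> np.ndarray:
    """Chop a mono signal into consecutive frames, zero-padding the tail."""
    pad = (-len(samples)) % frame_len          # samples needed to fill the last frame
    padded = np.concatenate([samples, np.zeros(pad, dtype=samples.dtype)])
    return padded.reshape(-1, frame_len)       # one row per frame

# frame_len here is illustrative, not taken from any standard
frames = split_into_frames(np.random.randn(44100), 1152)
```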

Each frame is boosted so its loudest wave reaches 0 dBFS. This is to take advantage of every bit during processing. The amount of boost is noted with the frames, so they can be restored to their original volume on playback.

(That's why normalizing or boosting a raw audio file doesn't make compression any more efficient. A louder file might be easier for users to hear after it's been decoded, but that's a different issue.)
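In code, the per-frame boost looks roughly like this - a sketch only, since real encoders store quantized scale factors rather than a floating-point gain:

```python
import numpy as np

def encode_frame_gain(frame: np.ndarray):
    """Boost a frame so its loudest sample reaches full scale (0 dBFS),
    and keep the gain so playback can undo the boost."""
    peak = np.max(np.abs(frame))
    gain = 1.0 / peak if peak > 0 else 1.0
    return frame * gain, gain                  # normalized frame, plus the stored gain

def decode_frame_gain(frame: np.ndarray, gain: float) -> np.ndarray:
    """Restore the frame to its original volume on playback."""
    return frame / gain
```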

The algorithm looks at each frame and measures how much energy the frame has at different frequencies. The number of bands is a trade-off: more bands allow tighter masking, but require sharper filters that respond more slowly. The mp3 format uses up to 576 bands; other compression systems have more or fewer.
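Here's the flavor of that measurement in Python. Equal-width FFT bands are my simplification - real codecs use polyphase filterbanks and MDCTs, with bands spaced to suit the ear - but the bookkeeping is similar:

```python
import numpy as np

def band_energies(frame: np.ndarray, n_bands: int) -> np.ndarray:
    """Measure how much energy a frame has in each of n_bands
    frequency bands (a stand-in for the encoder's filterbank)."""
    power = np.abs(np.fft.rfft(frame)) ** 2    # power spectrum of the frame
    bands = np.array_split(power, n_bands)     # group neighboring bins into bands
    return np.array([b.sum() for b in bands])

# a loud band can mask quiet details in the bands next to it
energies = band_energies(np.random.randn(1152), 32)
```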

The common mp3 algorithm uses this scheme. How good it sounds depends on how well the encoder has been written, and on the bitrate chosen. The newer AAC algorithm couples it with a quick look at adjacent frames to see if temporal masking will hide even more details. For a given bitrate, a good AAC will sound better than a good mp3.

Your First Choices

The most critical setting in a lossy compression scheme, including mp3 encoders, is the bitrate. Lower bitrates mean longer frames, increasing the chance that masking sounds won't last the whole time. The result is noise and a flangey or chirping effect.

Which bitrate you consider low, and how much noise or distortion is acceptable, depends on the application. But if you do things right, broadcast-quality sound can be achieved at 128 kbps. One of the most important factors is which encoder you use. Even in a standardized format like mp3, there are multiple trade-offs that program designers have to make.

Commercial encoders are usually better designed in this respect than freeware. Because they've also paid licensing fees to the Fraunhofer Institut - inventors of the mp3 format - commercial publishers may have had more access to the inner workings of the system. But at least one free encoder, the open-source LAME library, is also very good.

It makes sense to use a high-quality encoder. Beyond that, the encoder might give you some other choices as well:

Stereo or joint stereo: Most algorithms expect the left and right channels of a stereo pair to be similar. This is usually true in music. A joint stereo mode encodes only the major differences between the channels, particularly at high frequencies, freeing up more of the bitrate for better quality. But ambiences and crowd sounds can be very different on the left and right, if the space isn't reverberant and there are lots of spread-out sources. With these sounds, "joint stereo" pushes things toward the center. (There's a code sketch of the underlying idea after this list.)

Variable bitrate: This option, also known as VBR, can both reduce file size and improve the sound. The algorithm uses a different bitrate for each frame, depending on how many bits that frame needs. This avoids wasting bits on pauses or easy-to-encode passages.

VBR works best on simpler or slower-moving sources, including a lot of new age or classical music. It presents little advantage on faster and highly processed sounds, such as most pop styles, because the maximum bitrate must be used for most frames.
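Back to joint stereo for a moment: one flavor of it is mid/side coding. When left and right are similar, the side channel is nearly silent and costs almost nothing to encode; a wide, uncorrelated ambience makes it large. This is a sketch of the principle, not mp3's actual joint-stereo modes (which also include intensity coding):

```python
import numpy as np

def to_mid_side(left: np.ndarray, right: np.ndarray):
    """Rotate left/right into mid (what the channels share)
    and side (how they differ)."""
    mid = (left + right) / 2.0
    side = (left - right) / 2.0
    return mid, side

def from_mid_side(mid: np.ndarray, side: np.ndarray):
    """Exact inverse: recover the original left and right channels."""
    return mid + side, mid - side
```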

Lossy: The Next Generation

When you convert a compressed file back to 16-bit linear audio, something will be missing. If you encode it again, the algorithm has a harder time finding details that can be safely deleted. Noise and distortion build up with each subsequent pass.

If you must go through multiple encodings, stay with the highest bitrates possible. If the final release format will be at a low bitrate, don't apply it until the last step.

There is some evidence that multiple generations through the same compressor sound worse than the same number of generations through a variety of algorithms.
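If you'd like to hear the buildup for yourself, here's one way to run the experiment, assuming the open-source LAME command-line encoder is installed on your machine (the function and file names are mine):

```python
import subprocess

def reencode_generations(wav_in: str, generations: int, kbps: int = 128) -> str:
    """Round-trip a WAV file through mp3 several times, so you can
    listen to the artifacts build up generation by generation."""
    current = wav_in
    for g in range(1, generations + 1):
        mp3 = f"gen{g}.mp3"
        wav = f"gen{g}.wav"
        subprocess.run(["lame", "-b", str(kbps), current, mp3], check=True)  # encode
        subprocess.run(["lame", "--decode", mp3, wav], check=True)           # decode
        current = wav            # the next pass starts from the decoded audio
    return current               # path to the final-generation WAV

# e.g. reencode_generations("original.wav", 5) - then compare gen1.wav to gen5.wav
```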

What have they done to my song?

Want to hear exactly what the mp3 algorithm takes away from voice or music, when you do it properly? No hype, no simulation... but a scientific experiment you can replicate on your desktop. It's at my website.
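The heart of that experiment is a null test: decode the mp3, line it up with the original, and subtract. Here's a bare-bones version - the names are mine, and finding the codec delay (`offset`) by ear or by cross-correlation is the fiddly part:

```python
import numpy as np

def whats_missing(original: np.ndarray, decoded: np.ndarray, offset: int = 0) -> np.ndarray:
    """Subtract the decoded mp3 from the original signal. What remains
    is the material the encoder threw away, plus any noise it added.
    offset compensates for the delay most codecs introduce."""
    aligned = decoded[offset:]
    n = min(len(original), len(aligned))
    return original[:n] - aligned[:n]
```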

Next time: how lossless encoders shrink files without sacrificing any data.