If a triangle plays in a forest...


Hearing What's Not There

Sometimes, making data disappear can be acceptable

Ever wonder how magicians make a large object disappear, or a woman's dress instantly change color? According to a study in Nature Reviews Neuroscience, cognitive scientists have been wondering as well. The scholarly, footnoted article explains magic tricks in terms of the visual and neurological quirks they rely on. It credits The Amazing Randi, The Great Tomsoni, and Teller (of "Penn and...") as co-authors. It's visually oriented, but abracadabra: here on the audio side, we've been doing that kind of research - and benefiting from it - for years. You can benefit, too.

As an example of the article's visual orientation, consider one line from its introduction:

Much as early filmmakers experimented with editing techniques to determine which technique would communicate their intent most effectively, magicians have explored the techniques that most effectively divert attention or exploit the shortcomings of human vision...

But this is an audio blog, so I'll cover an equivalent audio technique: How we use neurological research and mathematical tricks to make most of the data in an audio file disappear... without noticeably affecting the sound.

In other words, it's about getting the most from mp3, AAC, and other techniques. This often means not doing what the audio programs' menus seem to suggest. If you've got anything to do with moving audio on the Internet, these tricks will be helpful. But some of them also apply to sound for movies or broadcast. (Did you know that the Dolby Digital theatrical format uses more radical compression than most mp3 music?)

I don't want to get ahead of myself. This is a big topic, so I'm releasing this article in four parts over the next ten days or so.

On to part one.

Magic Part I... The Mask

How your brain gets around a neural traffic jam

Our ears are not particularly precise sensors.

That's not for lack of trying: each ear has close to 30,000 nerves on the basilar membrane, tuned to respond to different pitches. But those membranes aren't like giant organ keyboards, with a specific nerve impulse for every tone we ever hear. That would be too much data for the brain to process efficiently.

Instead, when we hear a tone at a particular frequency, a group of nerves centered on that pitch fires. How many nerves go off depends on the volume, among other factors: a loud sound triggers more nerves. The brain interprets these group firings as a specific pitch and volume.

The nerves aren't grouped linearly. They're more concentrated at frequencies where sounds tend to be important, and spread out where they aren't. In other words, the first audio data compression systems were human. They evolved in our eardrums and auditory cortex.
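If you want to see that nonlinear grouping in numbers, psychoacousticians describe it with the Bark scale, which counts off the ear's critical bands. Here's a quick Python sketch using Zwicker's published approximation (the frequency pairs are just examples I picked):

    import math

    def hz_to_bark(f):
        """Zwicker's approximation of the Bark (critical-band) scale."""
        return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

    # How many critical bands does each octave get?
    for lo, hi in [(100, 200), (1000, 2000), (8000, 16000)]:
        width = hz_to_bark(hi) - hz_to_bark(lo)
        print(f"{lo}-{hi} Hz: about {width:.1f} critical bands")

The mid-range octave gets more than four critical bands of resolution; the entire top octave gets fewer than three, and the 100-200 Hz octave only about one.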

The ear's varying sensitivity has been known for years, and has been measured across very large populations. It's generally called the Threshold of Hearing. Sounds above the threshold get heard. Those below it don't.

You can express it with a graph:

[Figure Not01: the Threshold of Hearing curve]

Low frequencies are on the left, mids in the middle, highs on the right. (The calibrations are logarithmic because that's how we hear pitch.)

The vertical axis is in decibels, calibrated relative to the frequency where most people's ears are most sensitive, around 3.5 kHz. You could consider 0 dB on this chart to be true 0 dB SPL - the nominal threshold of hearing - or any other convenient level, depending on the individual and the circumstance.

The important thing isn't the calibrations; it's what happens at the heavy brown line. That's how the Threshold varies with frequency, for most humans. (The line is pretty accurate, given my drawing abilities; there are more rigorous ones elsewhere on the Web.)

At 3.5 kHz, the short, green bar at 15 dB is louder than the threshold. It gets heard. But the red bars at 50 Hz and 15 kHz are ignored, even though they're louder, unless you have statistically exceptional hearing. In fact, most people can't detect a very high or very low pitch until it gets some 40 dB louder than one they could comfortably hear in the mid-range!
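If you'd like to play with the curve itself, the psychoacoustics literature offers a standard closed-form approximation of the threshold in quiet (Terhardt's formula, the starting point for most perceptual coders). This Python sketch evaluates it at the three frequencies in my drawing:

    import math

    def threshold_in_quiet_db(f_hz):
        """Terhardt's approximation of the threshold of hearing, in dB SPL."""
        k = f_hz / 1000.0
        return (3.64 * k ** -0.8
                - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
                + 1e-3 * k ** 4)

    for f in (50, 3500, 15000):
        print(f"{f:>5} Hz: {threshold_in_quiet_db(f):6.1f} dB SPL")

It comes out around +40 dB at 50 Hz, -5 dB at 3.5 kHz, and +50 dB at 15 kHz: the same 40-to-50 dB penalty for the extremes that the drawing shows.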

There seems to be good evolutionary reason for this. While roaring predators are louder than human speech, the most important parts of intelligibility are around 3.5 kHz. That's where it would be most advantageous to understand your neighbor's shouts, even if there's a tiger nearby.

The darned line keeps moving

How many nerves are involved for any particular tone also depends on volume, and is constantly being adjusted by our brains and by muscles in the ear. It's necessary: a nearby jet plane hits your ear with about 10,000,000,000 times more pressure than the quietest tones used in a hearing test. But it's also a form of data compression.
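Decibels were invented to tame ratios like that one. Whether you read the figure as a power ratio or a pressure ratio changes the conversion by a factor of two; here's the arithmetic either way:

    import math

    ratio = 10_000_000_000                # the 10,000,000,000:1 figure above
    print(10 * math.log10(ratio), "dB")   # as a power ratio:    100.0 dB
    print(20 * math.log10(ratio), "dB")   # as a pressure ratio: 200.0 dB

Either way, it's an enormous range for one biological sensor to cover.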

All this efficiency comes with a sacrifice. Because louder sounds use larger groups of nerves, and the threshold is constantly being adjusted, softer sounds at nearby frequencies can't get through at the same time. Neural pathways that would normally respond to them are already busy.

The effect can be thought of like this:

[Figure Not02: a loud tone dragging the threshold up around it]

When something loud enough comes along (blue bar, about 40 dB at 2 kHz), it drags the threshold with it. The green bar from our previous drawing - and a slightly louder one I added at 1 kHz - don't get heard, even though they're above the normal threshold.

The actual amount of masking varies with the frequency, volume, and overall timbres of the sounds, but it's always there. It gets broader at the extremes of the audible band, where nerve bundles are more spread out. A 250-Hz sound, 25 dB above the threshold, ties up so much neural activity that a simultaneous 200-Hz sound that's 10 dB softer actually disappears.
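Perceptual coders model this with a "spreading function": each loud component lifts the threshold in a skirt around itself, measured in Bark. Real spreading functions are level-dependent and account for tonality; this Python toy uses fixed triangular slopes (27 dB/Bark below the masker, 15 dB/Bark above - illustrative values I picked, not a calibrated model) just to show the shape:

    import math

    def hz_to_bark(f):
        """Zwicker's Bark-scale approximation, as before."""
        return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

    def masked_threshold_db(masker_hz, masker_db, probe_hz,
                            lower_slope=27.0, upper_slope=15.0):
        """Toy triangular spreading function: how far (in dB) the masker
        lifts the threshold at the probe frequency."""
        dz = hz_to_bark(probe_hz) - hz_to_bark(masker_hz)
        slope = upper_slope if dz >= 0 else lower_slope
        return max(0.0, masker_db - slope * abs(dz))

    # The blue bar: a 40 dB tone at 2 kHz. Probe the skirt around it:
    for probe in (1500, 1800, 2000, 2500, 3000):
        lift = masked_threshold_db(2000, 40, probe)
        print(f"{probe} Hz: threshold lifted to ~{lift:4.1f} dB")

Notice the skirt is asymmetrical: masking spreads farther above the masker's frequency than below it.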

After-images (and pre-images) in your ear

One of the magic tricks described in the Nature Reviews article is The Great Tomsoni's Coloured Dress Change. Tomsoni's assistant appears in a white dress, which he says he'll turn red. Her white spotlight goes out and a red one comes on... making the dress look red as well. He makes a joke, the audience laughs, and he tells the booth to change the lights back. When the spot turns white, her dress is made of red fabric!

Read the article to see how it's done. I write audio tutorials.

But I'll give you a hint, based on how we hear. Nerves are chemical, and chemicals have to recover after they've fired. The result is a time-based masking as well.

[Figure Not03: temporal masking, before and after a loud tone]

In this drawing, frequency doesn't matter. A long, loud tone is sounded (blue bar, lasting 180 ms, or about 6 frames), and it drags the threshold up to match. But look at what happens in the 50 ms or so after the tone: the nerves are still recovering, so the threshold stays up. In fact, the brain even forgets nearby pitches that happened 20 ms or so before the tone, because the pathways get overwhelmed!
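You can sketch temporal masking the same way. The 20 ms and 50 ms windows below come straight from the description above; the linear decay is my simplification (measured recovery curves are closer to exponential):

    def temporal_mask_db(masker_db, dt_ms, pre_ms=20.0, post_ms=50.0):
        """Toy pre/post-masking: how much a masker lifts the threshold
        dt_ms before (negative) or after (positive) it sounds."""
        if dt_ms < -pre_ms or dt_ms > post_ms:
            return 0.0                                # outside both windows
        if dt_ms < 0:
            return masker_db * (1 + dt_ms / pre_ms)   # pre-masking ramp
        return masker_db * (1 - dt_ms / post_ms)      # post-masking decay

    # A 60 dB tone ends at t = 0 ms; probe the moments around it:
    for dt in (-30, -10, 0, 15, 30, 45, 60):
        print(f"{dt:+4d} ms: threshold lifted ~{temporal_mask_db(60, dt):5.1f} dB")

Anything under that lifted threshold is a sound your brain never gets to hear.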

What it all means

These two effects - simultaneous masking and temporal masking - are the basis behind perceptual encoders like mp3, AAC, and Dolby Digital. Our hearing mechanisms can't detect certain sounds, so bits in a compressed audio file don't get wasted on them. Masking is also the basis behind most noise reduction algorithms, but we'll save that for a future tutorial series.
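The payoff comes at the quantizer. A rough rule of thumb says each bit of resolution buys about 6 dB of signal-to-noise, so an encoder only needs enough bits per band to push quantization noise under the masked threshold; a band that's already below the mask gets no bits at all. A hedged Python sketch of the idea (the band levels are made up for illustration):

    import math

    def bits_needed(signal_db, mask_db, db_per_bit=6.02):
        """Bits required in one band to keep quantization noise
        below the masked threshold (the ~6 dB-per-bit rule of thumb)."""
        headroom = signal_db - mask_db
        if headroom <= 0:
            return 0                    # already inaudible: spend nothing
        return math.ceil(headroom / db_per_bit)

    # Made-up per-band levels: (signal dB, masked threshold dB)
    for i, (sig, mask) in enumerate([(70, 20), (55, 45), (30, 38), (62, 60)]):
        print(f"band {i}: {bits_needed(sig, mask)} bits")

Loud, exposed bands get most of the bits; masked bands get none. That's where most of the file-size savings come from.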

Of course, you know there's a lot of bad perceptual encoding going on. Sounds get thrown away that the brain should be hearing, and we miss them. And bad choices during the encoding can add artifacts that make things even worse. But it's not the encoding's fault... it's the user's.

Next article: What these compression algorithms actually do, and how to make them do it more efficiently. It's usually not what's on the encoder menus.