The Secret Lives of MP3 Files

There are many myths and much confusion about the proper way to create audio files using an MP3 encoder, particularly for spoken-word podcasts. The following is a compilation of various topics I have posted over the past few years.

Sample Rates

The sample rate of an MP3 file specifies how the input to the encoder is analyzed. Sample rates are not the same as bit rates (see below), which describe the bandwidth of the encoder's output. In other words, there isn't a fixed relationship between sample rate and bit rate. For more information on MP3 technology and history, see this entry in Wikipedia.

I recommend using a 44,100Hz sample rate for all podcasts. If you want a super-high quality recording (better than CD) and you plan to encode at 320kbps or more, then you *might* want to use 88,200Hz, but that’s not really in podcast territory, at least not today.

Avoid any sample rate that isn’t a multiple of 11,025Hz. Many hardware and software MP3 players including those common web-page ones using Adobe/Macromedia Flash won’t play anything that isn’t a multiple of 11,025.
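The multiples-of-11,025 rule is easy to automate. Here's a minimal Python sketch (the function name is mine, not from any tool) that flags rates those finicky players will reject:

```python
# Flag sample rates that aren't multiples of 11,025 Hz, which some
# hardware and Flash-based players refuse to play.
COMPAT_BASE = 11_025

def is_widely_playable(rate_hz: int) -> bool:
    """True if the rate is an integer multiple of 11,025 Hz."""
    return rate_hz % COMPAT_BASE == 0

for rate in (8_000, 11_025, 22_050, 32_000, 44_100, 48_000):
    print(rate, is_widely_playable(rate))
```

Of the common rates, only 11,025, 22,050, and 44,100 pass the check.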

If possible, record at the same sample rate you’ll use for your MP3. If you have to resample from 48,000Hz to 44,100Hz, for example — the audio tracks within many video formats are recorded at 48,000Hz — you will generate undesirable artifacts. Downsampling from 88,200Hz to 44,100Hz isn’t perfect, but because it’s a simple 2:1 transformation, the artifacts are somewhat less objectionable than when the resampler has to interpolate.
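You can see why 88,200Hz to 44,100Hz is the gentler conversion by reducing the ratio to lowest terms. A quick Python sketch (the function name is illustrative):

```python
from math import gcd

def resample_ratio(src_hz: int, dst_hz: int) -> tuple[int, int]:
    """Reduce src:dst to lowest terms; small numbers mean a simple conversion."""
    g = gcd(src_hz, dst_hz)
    return src_hz // g, dst_hz // g

print(resample_ratio(88_200, 44_100))  # (2, 1): just drop every other sample
print(resample_ratio(48_000, 44_100))  # (160, 147): heavy interpolation needed
```

A 2:1 ratio means simple decimation; 160:147 forces the resampler to interpolate nearly every output sample, which is where the artifacts come from.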

Why not go lower, perhaps 22,050Hz? After all, the uncompressed WAV or AIFF files will then be half the size. Some point out that this lower sample rate will reduce the amount of data given to the MP3 encoder and therefore require somewhat less compression. Per the Nyquist theorem, a 22,050Hz sample rate can reproduce frequencies of up to 11,025Hz, which is more than good enough for voice recordings, particularly those recorded from telephone lines (~3,500Hz max). But we have found a few hardware and software MP3 players that don’t accept anything except 44,100Hz. Furthermore, most pre-encoding recordings are (and should be) made at 44,100Hz.

There is no important vocal information in the octave from 11,000Hz to 22,000Hz — but rather just unwanted noise. Instead of using a lower sample rate to remove the unnecessary top octave of audio, I use a low-pass filter at 11,000Hz in my very last step before encoding. I prefer this to depending on the Nyquist Theorem to accomplish this task. By presenting audio in this unnecessary octave to the encoder, I’d be asking the encoder to attempt to preserve it, whereas what I really want is for it to be ignored. Using the low-pass filter, I reduce the data to the encoder and therefore achieve a better result. The more unnecessary information you feed to your MP3 encoder (which typically has a fixed-bitrate output) the more encoding-stream bandwidth it must utilize to reproduce it.
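For the curious, the low-pass step can be sketched in pure Python as a windowed-sinc FIR filter. This is only an illustration of the principle, not a mastering-grade filter (all the names are mine); any real audio tool's filter will do a better job:

```python
import math

def lowpass_fir(cutoff_hz: float, fs: float, numtaps: int = 101) -> list[float]:
    """Windowed-sinc low-pass FIR coefficients (Hamming window)."""
    fc = cutoff_hz / fs                     # normalized cutoff, cycles/sample
    m = numtaps - 1
    taps = []
    for n in range(numtaps):
        x = n - m / 2
        # ideal sinc kernel (handle the center tap where x == 0)
        h = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        h *= 0.54 - 0.46 * math.cos(2 * math.pi * n / m)   # Hamming window
        taps.append(h)
    s = sum(taps)                           # normalize for unity gain at DC
    return [t / s for t in taps]

def convolve(signal: list[float], taps: list[float]) -> list[float]:
    """Direct-form FIR filtering (slow but dependency-free)."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, t in enumerate(taps):
            if 0 <= i - j < len(signal):
                acc += t * signal[i - j]
        out.append(acc)
    return out

fs = 44_100
taps = lowpass_fir(11_000, fs)

def tone(freq_hz: float) -> list[float]:
    return [math.sin(2 * math.pi * freq_hz * n / fs) for n in range(2000)]

def rms(x: list[float]) -> float:
    mid = x[150:-150]                       # skip the filter's edge transients
    return math.sqrt(sum(v * v for v in mid) / len(mid))

print(rms(convolve(tone(1_000), taps)))    # ~0.707: passband, unchanged
print(rms(convolve(tone(16_000), taps)))   # near zero: stopband, removed
```

The point of the demonstration: content below 11kHz passes through at full level, while the unnecessary top octave is attenuated to almost nothing before the encoder ever sees it.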

Use only constant-bit-rate (CBR) encoding for podcasts, not variable-bit-rate (VBR). The latter is valuable only for higher-res MP3s (192kbps and up), and some players can’t render VBR files.

Bit Rates

The bit rate of an MP3 file specifies the output stream: the number of bits generated per second of audio. Bit rates are not the same as sample rates (see above).

When I started IT Conversations in June of 2003, I encoded the files at 32kbps (about 15MB per hour of audio). At that time I got more complaints about the size of the files than about their quality. But things changed quickly! A year later, by popular demand, I upgraded to 48kbps and received few complaints. For a while Adam Curry encoded his Daily Source Code program at 192kbps, but he soon yielded to pressure and cut it back to 96kbps. Even at that rate an hour-long show takes about 40MB on an iPod or other device, and while the files can be downloaded overnight, it can take quite a while if you’re not on broadband.
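The arithmetic behind those file sizes is simple: a CBR stream produces a fixed number of bits per second regardless of content. A quick Python sketch (the helper name is mine):

```python
def mp3_size_mb(bitrate_kbps: int, minutes: float) -> float:
    """CBR MP3 size: bits per second times seconds, converted to megabytes."""
    return bitrate_kbps * 1000 * minutes * 60 / 8 / 1_000_000

print(mp3_size_mb(32, 60))   # 14.4 MB/hour -- roughly the "about 15MB" above
print(mp3_size_mb(96, 60))   # 43.2 MB/hour -- the "about 40MB" hour-long show
print(mp3_size_mb(64, 60))   # 28.8 MB/hour at the IT Conversations rate
```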

When high-capacity iPods and similar devices became commonplace we upgraded the IT Conversations standard to 64kbps. Contrary to what many people say, you can hear the difference quite clearly on the lower-quality voice-only audio we get from recording interviews by telephone. Under some circumstances, the difference between 48kbps and 64kbps is more noticeable with voice than with music because the waveform of the human voice is more complex than that of most instruments, even many instruments together. Furthermore, since so much non-classical music contains intentional distortion, unless you’re familiar with a particular recording, the music itself tends to mask the codec’s artifacts. If you’re listening to a speaker, particularly one whose voice you know, the artifacts can be quite annoying.

That 64kbps rate is high enough to eliminate most of the artifacts that occur due to MP3 encoding after substantial digital processing of our mostly-telephone interviews.

Mono or Stereo?

I want to correct a bit of misinformation that’s floating around the Podosphere: that you should avoid the use of stereo when encoding MP3 podcasts that are mostly speech. If you just want the quick answer:

Encode mostly-speech podcasts using Joint Stereo with the Intensity Stereo (IS) option. Use the same configuration for mostly-music podcasts, but fall back to Stereo if you hear annoying artifacts in the resulting MP3 files.

Now for the longer explanation…

Assuming you’re encoding in Constant Bit Rate (CBR) format (recommended), the bit rate will determine the size of your output file relative to the length in time of the source. That ratio will be fixed and independent of whether you encode in mono or stereo. Stereo never generates a larger MP3; if anything, choosing stereo costs you quality, not file size.

If you select Dual Channel encoding (not recommended), half the bits in the file will be used to encode each channel, and the resulting sound will indeed be of lower quality. A dual-channel stereo 64kbps file sounds about as good as a 32kbps mono file. It’s what the naysayers are warning you about.

But think about the information in a stereo signal, which consists of left (L) and right (R) channels. Rather than treat them separately, the encoder can also derive the sum or mix (L+R) and difference (L-R) signals. If the signal is pure mono, L-R is zero. The encoder can detect this and use virtually all the bits available to encode just the L+R signal. As the difference (L-R) increases and decreases, the encoder diverts more or fewer bits to encode the separation data.
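The sum/difference derivation is easy to express in code. A sketch of the idea in Python (the names are mine, and real encoders work on frequency-domain data per frame, not raw samples like this):

```python
def to_mid_side(left: list[float], right: list[float]):
    """Derive the sum (L+R) and difference (L-R) signals from a stereo pair."""
    mid  = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side

def to_left_right(mid: list[float], side: list[float]):
    """Invert: L = (mid+side)/2, R = (mid-side)/2 -- a lossless roundtrip."""
    left  = [(m + s) / 2 for m, s in zip(mid, side)]
    right = [(m - s) / 2 for m, s in zip(mid, side)]
    return left, right

# A pure-mono signal has identical channels, so the side signal is all
# zeros and virtually all bits can go to encoding the mid (L+R) signal.
mono = [0.1, -0.3, 0.5, 0.2]
mid, side = to_mid_side(mono, mono)
print(side)  # [0.0, 0.0, 0.0, 0.0]
```

No information is lost in the transformation; the encoder simply gets to spend its bits where the signal actually is.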

MP3 is one of a family of perceptual codecs, which means that it uses a model of human hearing to discard components of the audio that are unnecessary because they would be masked by other components anyway. It turns out that the majority of the data in an audio signal is relatively unimportant to recreating the original, and one such component is the L-R difference between the two tracks.

Taking advantage of the phenomenon, MP3 supports three stereo-encoding algorithms, all of which segregate L+R data:

  • Independent Channel (IC) stereo
  • Intensity Stereo (IS)
  • Middle/Side (MS) stereo

If you select the Stereo option in your encoder configuration, you get only the older IC encoding. If you select Joint Stereo you then have the option of selecting either IS, MS or both in addition to IC. In my opinion, the best combination for low-res (128kbps and below) MP3s, particularly if they contain mostly speech, is Joint Stereo with the IS option. MS stereo is designed for bit rates above 128kbps.

MP3 files are organized into frames, each of which corresponds to a short (less than a second) chunk of time. The encoder is allowed to use a different scheme for each frame. For example, if you start with a stereo musical intro (moderate L-R content), the encoder will generate stereo frames using the mode(s) you’ve selected. (It may even switch between IC, IS and MS from one frame to the next, which is one reason why I recommend disabling MS mode. Otherwise, if you listen carefully you can hear the mode switches.) At the end of the intro comes your speech (in which L-R is zero) and the encoder can allocate virtually all of each frame’s bits to the L+R (mono) signal. If you play music, the encoder switches back to stereo as required.
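If you're curious which mode a file was encoded in, the channel mode is visible in each frame's 4-byte header. A minimal sketch, assuming the standard MPEG-1 header layout (the helper function and the synthetic header bytes are mine):

```python
# Channel mode occupies the top two bits of the 4th header byte.
CHANNEL_MODES = {0b00: "Stereo (IC)", 0b01: "Joint Stereo",
                 0b10: "Dual Channel", 0b11: "Mono"}

def channel_mode(header: bytes) -> str:
    """Decode the channel mode from a 4-byte MP3 frame header."""
    if len(header) < 4 or header[0] != 0xFF or (header[1] & 0xE0) != 0xE0:
        raise ValueError("not an MP3 frame header (missing sync bits)")
    return CHANNEL_MODES[header[3] >> 6]

# A synthetic MPEG-1 Layer III header with mode bits set to Joint Stereo.
header = bytes([0xFF, 0xFB, 0x90, 0b01_000000])
print(channel_mode(header))  # Joint Stereo
```

A real parser would also decode the bitrate, sample-rate, and (for Joint Stereo) the mode-extension bits that select IS and/or MS per frame, but the point here is just that the mode is a per-frame decision.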

When L-R=0 (i.e., the left and right channels are the same), the encoder can actually generate mono frames even though you’ve selected stereo encoding. In this case you’ll have no loss of quality whatsoever. But this requires that the L and R channels be identical, so here’s what I do: After editing, mixing and mastering a program, I select the speech-only part of the file and copy/paste the left track to the right track. That way I know they’re bit-for-bit identical. I’m giving the encoder some help in doing its job. Okay, maybe I’m compulsive, but I believe I can hear the difference.
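The copy/paste trick can be illustrated in a few lines of Python (the sample values and region boundaries are made up for the example):

```python
left  = [0.12, -0.34, 0.56, -0.07, 0.21]
right = [0.11, -0.35, 0.57, -0.06, 0.20]  # close to left, but not identical

speech = slice(1, 4)             # the speech-only region (illustrative)
right[speech] = left[speech]     # paste left over right: bit-for-bit identical

side = [l - r for l, r in zip(left[speech], right[speech])]
print(side)  # [0.0, 0.0, 0.0] -- L-R is exactly zero, nothing left to encode
```

With the channels merely similar, L-R is small but nonzero and still costs bits; after the paste it is exactly zero over the speech region.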

6 thoughts on “The Secret Lives of MP3 Files”

  1. Pingback: Geek News Central
  2. Doug,

    Thanks so much for this! I’ve been looking for some relevant information on encoding for some time now, and this totally fits the bill!

    I’m really into audiobooks, and have the equivalent to a season pass to Audible. Unfortunately, my car’s MP3/CD player is finicky as hell, and complains about 90% of what’s fed into it.

    I’ve found that converting from AA to MP3 and breaking the 150+MB file into many smaller files will allow it to play acceptably, but I had no clue what to set Goldwave’s conversion settings to.

    Kbps is old hat, I understood that before, but the Hz options (8000, 11025, 12000, 16000, 22050, 24000, 32000, 44100, and 48000) have always been a mystery to me. They don’t seem to yield much different file sizes; the Kbps seems to be the only significant factor there (I notice a difference of a few kb. Oddly, 16000 yields a slightly LARGER file than 22050).

    This is spoken word, sampled from a 64kbps, clean original. On a related note, with reference to your data about artifacts, am I better off starting with a CD-quality file and converting down, or starting with the desired format and simply breaking it up?

  3. Zen, You’re always better off starting with the uncompressed original. If you decode an MP3 then re-encode, you’ll lose some quality. As you discovered, the sample rate has no effect on the MP3 filesize. Only the bitrate matters in this regard. You’ll find that sample rates that are multiples of 11025 play on more players. 22050 and 44100 are almost universal. Some devices (and Flash players) won’t play 24000, 32000, 48000, etc.

  4. Guys, the Hz figure is the sampling interval for the original audio before compression. If you have a low bitrate compressed file that you “up convert” using a higher sample rate, then that’s why the filesize doesn’t change – it’s resampling the same signal, just faster!
    To see the effect in action, try DOWN converting a PCM/WAV file at 44.1kHz (or higher if your compressor allows it), and you’ll see a significant reduction in the filesize as the sample rate is dropped. It won’t be as “large” as the effect of the bitrate selected, but it will vary.
    HTH.

  5. Whosoever thinks there is nothing important in the top octave of audio should think again. I can always tell the difference between an audio file that is cut off at 11kHz and one that is cut off at 16kHz – this problem occurs a lot with Youtube files unless downloaded in HD (1280×960 or better) and I can always in such cases tell the difference between the HD and the 640×480 videos in terms of the audio – in the latter (in cases where undue compression has been imposed by Youtube on all but HD because they haven’t got the enhanced license) the top end is missing, and I can hear the effect – a muddying of the sound – all too easily!
