There are many myths and much confusion about the proper way to create audio files using an MP3 encoder, particularly for spoken-word podcasts. The following is a compilation of various topics I have posted over the past few years.
The sample rate of an MP3 file specifies how the input to the encoder is analyzed. Sample rates are not the same as bit rates (see below), which determine the bandwidth of the encoder’s output. In other words, there isn’t a fixed relationship between sample rate and bit rate. For more information on MP3 technology and history, see this entry in Wikipedia.
I recommend using a 44,100Hz sample rate for all podcasts. If you want a super-high quality recording (better than CD) and you plan to encode at 320kbps or more, then you *might* want to use 88,200Hz, but that’s not really in podcast territory, at least not today.
Avoid any sample rate that isn’t a multiple of 11,025Hz. Many hardware and software MP3 players, including the common web-page players based on Adobe/Macromedia Flash, won’t play anything that isn’t a multiple of 11,025Hz.
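That player limitation is an observation from the field rather than anything in the MP3 spec, but the rule itself is simple enough to sketch:

```python
# Sketch: flag sample rates that aren't multiples of 11,025Hz, since some
# players (e.g., older Flash-based web players) reject them outright.
def is_player_safe(sample_rate_hz: int) -> bool:
    return sample_rate_hz % 11025 == 0

for rate in (11025, 22050, 44100, 48000, 88200, 96000):
    print(rate, is_player_safe(rate))
```

Note that the common "video" rates of 48,000Hz and 96,000Hz both fail this check.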
If possible, record at the same sample rate you’ll use for your MP3. If you have to resample from 48,000Hz to 44,100Hz, for example — the audio tracks within many video formats are recorded at 48,000Hz — you will generate undesirable artifacts. Downsampling from 88,200Hz to 44,100Hz isn’t perfect, but because it’s a simple 2:1 transformation, the artifacts are somewhat less objectionable than when the resampler has to interpolate.
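The difference between the two conversions comes down to the resampling ratio, which you can see by reducing it to lowest terms:

```python
# Sketch: a resampling ratio that reduces to small integers (like 2:1) lets
# the resampler simply drop every other sample after low-pass filtering.
# A ratio like 160:147 forces it to interpolate new sample values between
# the original ones, which is where audible artifacts creep in.
from fractions import Fraction

def resample_ratio(src_hz: int, dst_hz: int) -> Fraction:
    return Fraction(src_hz, dst_hz)

print(resample_ratio(88200, 44100))  # 2 -> simple 2:1 decimation
print(resample_ratio(48000, 44100))  # 160/147 -> fractional interpolation
```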
Why not go lower, perhaps 22,050Hz? After all, the uncompressed WAV or AIFF files will then be half the size. Some point out that this lower sample rate will reduce the amount of data given to the MP3 encoder and therefore require somewhat less compression. Per the Nyquist Theorem, a 22,050Hz sample rate can reproduce frequencies of up to 11,025Hz, which is more than good enough for voice recordings, particularly those recorded from telephone lines (~3,500Hz max). But we have found a few hardware and software MP3 players that don’t accept anything except 44,100Hz. Furthermore, most pre-encoding recordings are (and should be) made at 44,100Hz.
There is no important vocal information in the octave from 11,000Hz to 22,000Hz — but rather just unwanted noise. Instead of using a lower sample rate to remove the unnecessary top octave of audio, I use a low-pass filter at 11,000Hz in my very last step before encoding. I prefer this to depending on the Nyquist Theorem to accomplish this task. By presenting audio in this unnecessary octave to the encoder, I’d be asking the encoder to attempt to preserve it, whereas what I really want is for it to be ignored. Using the low-pass filter, I reduce the data to the encoder and therefore achieve a better result. The more unnecessary information you feed to your MP3 encoder (which typically has a fixed-bitrate output) the more encoding-stream bandwidth it must utilize to reproduce it.
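The low-pass step isn’t tied to any particular editing tool; here’s a minimal numpy sketch of an 11,000Hz windowed-sinc low-pass you could apply as that final pre-encoding step (the filter design here is my own illustration, not a specific product’s filter):

```python
import numpy as np

SAMPLE_RATE = 44100
CUTOFF_HZ = 11000  # remove the unnecessary top octave before encoding

def lowpass_kernel(cutoff_hz, sample_rate, num_taps=101):
    """Windowed-sinc FIR low-pass filter kernel."""
    fc = cutoff_hz / sample_rate               # normalized cutoff (cycles/sample)
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2 * fc * n) * np.hamming(num_taps)
    return h / h.sum()                         # unity gain at DC

def lowpass(signal, cutoff_hz=CUTOFF_HZ, sample_rate=SAMPLE_RATE):
    return np.convolve(signal, lowpass_kernel(cutoff_hz, sample_rate), mode="same")

# Demo: a 1kHz tone (voice territory) passes nearly unchanged, while a
# 15kHz tone (the unwanted top octave) is strongly attenuated.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
speech_band = np.sin(2 * np.pi * 1000 * t)
noise_band = np.sin(2 * np.pi * 15000 * t)
print(np.std(lowpass(speech_band)) / np.std(speech_band))  # close to 1.0
print(np.std(lowpass(noise_band)) / np.std(noise_band))    # close to 0.0
```

Whatever implementation you use, the point is the same: content above the cutoff never reaches the encoder, so no bits are spent on it.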
Use only constant-bit-rate (CBR) for Podcasts, not variable-bit-rate (VBR). The latter is only valuable for higher-res MP3s (192kbps and more), and can’t be rendered on some players.
The bit rate of an MP3 file specifies the output stream: the number of bits generated per second of audio. Bit rates are not the same as sample rates (see above).
When I started IT Conversations in June of 2003, I encoded the files at 32kbps (about 15MB per hour of audio). At that time I got more complaints that the files were too large than about the quality. But things changed quickly! A year later, by popular demand, I upgraded to 48kbps and received few complaints. For a while Adam Curry encoded his Daily Source Code program at 192kbps, but he soon yielded to pressure and cut it back to 96kbps. Even at that rate an hour-long show takes about 40MB on an iPod or other device, and while the files can be downloaded overnight, it can take quite a while if you’re not on broadband.
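Those size figures follow directly from the bit rate, since a CBR stream is just bits per second times duration. A quick sketch of the arithmetic:

```python
# Sketch: CBR file size is bit rate x duration, nothing more.
def mp3_size_mb(bitrate_kbps: int, minutes: float) -> float:
    bits = bitrate_kbps * 1000 * minutes * 60
    return bits / 8 / 1_000_000  # megabytes

print(mp3_size_mb(32, 60))   # 14.4 MB/hour ("about 15MB")
print(mp3_size_mb(96, 60))   # 43.2 MB/hour ("about 40MB")
print(mp3_size_mb(64, 60))   # 28.8 MB/hour at the IT Conversations rate
```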
When high-capacity iPods and similar devices became commonplace we upgraded the IT Conversations standard to 64kbps. Contrary to what many people say, you can hear the difference quite clearly on the lower-quality voice-only audio we get from recording interviews by telephone. Under some circumstances, the difference between 48kbps and 64kbps is more noticeable with voice than with music because the waveform of the human voice is more complex than that of most instruments, even many instruments together. Furthermore, since so much non-classical music contains intentional distortion, unless you’re familiar with a particular recording, the music itself tends to mask the codec’s artifacts. If you’re listening to a speaker, particularly one whose voice you know, the artifacts can be quite annoying.
That 64kbps rate is high enough to eliminate most of the artifacts that MP3 encoding introduces after the substantial digital processing we apply to our mostly-telephone interviews.
Mono or Stereo?
I want to correct a bit of misinformation that’s floating around the Podosphere: that you should avoid stereo when encoding podcasts that are mostly speech. If you just want the quick answer:
Encode mostly speech podcasts using Joint Stereo with the Intensity Stereo (IS) option. Use the same configuration for mostly music podcasts, but fall back to Stereo if you hear annoying artifacts in the resulting MP3 files.
Now for the longer explanation…
Assuming you’re encoding in Constant Bit Rate (CBR) format (recommended), the bit rate will determine the size of your output file relative to the length in time of the source. That ratio will be fixed and independent of whether you encode in mono or stereo. Stereo never generates a larger MP3. If anything, it reduces the quality instead.
If you select Dual Channel encoding (not recommended), half the bits in the file will be used to encode each channel, and the resulting sound will indeed be of lower quality. A dual-channel stereo 64kbps file sounds about as good as a 32kbps mono file. It’s what the naysayers are warning you about.
But think about the information in a stereo signal, which consists of left (L) and right (R) channels. Rather than treat them separately, the encoder can also derive the sum or mix (L+R) and difference (L-R) signals. If the signal is pure mono, L-R is zero. The encoder can detect this and use virtually all the bits available to encode just the L+R signal. As the difference (L-R) increases and decreases, the encoder diverts more or fewer bits to encode the separation data.
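A tiny numpy sketch of that sum/difference transform (just the channel math, not the actual MP3 bit allocation):

```python
import numpy as np

# Sketch: for a pure-mono signal the difference channel is all zeros, so
# in principle nearly all of the encoder's bits can go to the sum channel.
left = np.array([0.1, 0.5, -0.3, 0.2])
right = left.copy()            # identical channels = pure mono

mid = left + right             # L+R (the mix)
side = left - right            # L-R (the separation data)

print(side)                    # all zeros for mono input

# The original channels are always recoverable from the pair:
print((mid + side) / 2)        # == left
print((mid - side) / 2)        # == right
```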
MP3 is one of a family of perceptual codecs, which means that it uses a model of human hearing to discard components of the audio that are unnecessary because they would be masked by other components anyway. It turns out that the majority of the data in an audio signal is relatively unimportant to recreating the original, and one such component is the L-R difference between the two tracks.
Taking advantage of this phenomenon, MP3 supports three stereo-encoding algorithms, all of which segregate L+R data:
- Independent Channel (IC) stereo
- Intensity Stereo (IS)
- Middle/Side (MS) stereo
If you select the Stereo option in your encoder configuration, you get only the older IC encoding. If you select Joint Stereo you then have the option of selecting either IS, MS or both in addition to IC. In my opinion, the best combination for low-res (128kbps and below) MP3s, particularly if they contain mostly speech, is Joint Stereo with the IS option. MS stereo is designed for bit rates above 128kbps.
MP3 files are organized into frames, each of which corresponds to a short (about 26 milliseconds at 44,100Hz) chunk of time. The encoder is allowed to use a different scheme for each frame. For example, if you start with a stereo musical intro (moderate L-R content), the encoder will generate stereo frames using the mode(s) you’ve selected. (It may even switch between IC, IS and MS from one frame to the next, which is one reason why I recommend disabling MS mode. Otherwise, if you listen carefully you can hear the mode switches.) At the end of the intro comes your speech (in which L-R is zero) and the encoder can allocate virtually all of each frame’s bits to the L+R (mono) signal. If you play music, the encoder switches back to stereo as required.
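An MPEG-1 Layer III frame holds 1,152 PCM samples per channel, so the frame duration, and therefore the timescale on which the encoder can switch modes, follows from the sample rate:

```python
# Sketch: an MPEG-1 Layer III frame carries 1,152 samples per channel,
# so at 44,100Hz each frame covers roughly 26ms of audio.
SAMPLES_PER_FRAME = 1152

def frame_duration_ms(sample_rate_hz: int) -> float:
    return SAMPLES_PER_FRAME / sample_rate_hz * 1000

print(frame_duration_ms(44100))  # ~26.1 ms
```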
When L-R=0 (i.e., the left and right channels are the same), the encoder can actually generate mono frames even though you’ve selected stereo encoding. In this case you’ll have no loss of quality whatsoever. But this requires that the L and R channels be identical, so here’s what I do: After editing, mixing and mastering a program, I select the speech-only part of the file and copy/paste the left track to the right track. That way I know they’re bit-for-bit identical. I’m giving the encoder some help in doing its job. Okay, maybe I’m compulsive, but I believe I can hear the difference.
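In a DAW this is a copy/paste between tracks; expressed as array manipulation it looks like the sketch below (the array shape and region bounds are hypothetical, just to illustrate the idea):

```python
import numpy as np

# Sketch of the copy/paste trick: force the speech-only region of a stereo
# recording to be bit-for-bit identical in both channels, so the encoder
# can emit effectively mono frames there.
def force_mono_region(stereo, start, end):
    """stereo: (num_samples, 2) array; copy left onto right in [start, end)."""
    out = stereo.copy()
    out[start:end, 1] = out[start:end, 0]
    return out

# Hypothetical program: samples 200-800 are the speech-only segment.
audio = np.random.default_rng(0).normal(size=(1000, 2))
fixed = force_mono_region(audio, 200, 800)
print(np.array_equal(fixed[200:800, 0], fixed[200:800, 1]))  # True
```

The stereo intro and outro outside the region are left untouched, so the music still gets its L-R separation.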