The demonstration consists of two parts. In the first, short speech recordings are obscured by noise meant to resemble the sound of a crowded room with several people talking at the same time, such as a bar (but one with no music). The purpose of this first part is to show that one can usually understand what is said, even at a remarkably high noise level. In the second part, a speech recording is quantized to very few levels, which, simply put, means that a very large part of the information in the signal is removed. Again, speech perception proves to be quite insensitive to this loss of signal quality.

About the speech recordings
Part 1: Obscured speech
Below are links to sound files containing recordings of speech obscured by noise. There are three different recordings, and for each recording there are four sound files with different noise levels. This is how the sounds were created:

· The speech recordings were normalized to 0 dB, meaning that the highest amplitudes in each recording were coded with the highest possible sample value.

· The noise recording was also normalized to 0 dB.

· To produce an obscured speech recording with a given relative noise level, the normalized speech was mixed with normalized noise whose amplitude was set higher than, lower than or equal to that of the speech.

The dB levels in the examples give the speech signal level relative to the noise signal: 0 dB means that the two levels are equal, and +1 dB means that the speech level is 1 dB above the noise. Note that these are peak levels, not RMS levels. A sketch of the procedure is given below.

Listen to the sounds by clicking on the links, and see at what noise level you can make out what is said. For each example there is a link to a written transcript of the recording.
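To make the mixing procedure concrete, here is a minimal NumPy sketch of peak normalization and mixing at a given relative level. It is an illustration under my own assumptions, not the code used to produce the demo files: the function names are invented, and the noise clip is assumed to be at least as long as the speech.

    import numpy as np

    def normalize_peak(x: np.ndarray) -> np.ndarray:
        """Scale the signal so its largest absolute sample is 1.0 (0 dB peak)."""
        return x / np.max(np.abs(x))

    def mix_at_relative_level(speech: np.ndarray, noise: np.ndarray,
                              speech_db: float) -> np.ndarray:
        """Mix peak-normalized speech with peak-normalized noise.

        speech_db is the speech level relative to the noise, in dB:
        0 dB means equal peak levels, +1 dB means the speech peaks
        1 dB above the noise (peak levels, not RMS).
        """
        speech = normalize_peak(speech)
        noise = normalize_peak(noise)[:len(speech)]   # assumes noise is long enough
        gain = 10.0 ** (speech_db / 20.0)             # dB -> amplitude ratio
        mixed = gain * speech + noise
        return mixed / np.max(np.abs(mixed))          # renormalize to avoid clipping

For example, mix_at_relative_level(speech, noise, 1.0) would reproduce the +1 dB condition, with the speech peaking 1 dB above the noise.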
Example 1
What did he say? Check the transcript.

Example 2
What did he say? Check the transcript.

Example 3
This last one is a bit harder, so the speech levels are higher than in the two previous examples. What did he say? Check the transcript.

Part 2: Quantized speech
All digitized sound is quantized; the number of levels is limited by how many bits are used to store each sample. The number of quantization levels is the number of different signal amplitudes that can be represented. The sound files in examples 1-3 use 16 bits, which gives 2^16 = 65536 quantization levels. The following demonstration shows that far fewer levels suffice for speech signals.

The sound files contain the same spoken sentence, quantized with 1, 2, 3 and 16 bits, respectively. As you can hear, four quantization levels (2 bits) are enough to understand most of what is said.
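As an illustration of what requantization does, here is a minimal NumPy sketch of a uniform quantizer. This is my own sketch rather than the demo's actual processing, and it assumes input samples lie in [-1, 1].

    import numpy as np

    def quantize(x: np.ndarray, bits: int) -> np.ndarray:
        """Uniformly requantize a signal in [-1, 1] to 2**bits levels."""
        levels = 2 ** bits               # e.g. 16 bits -> 2**16 = 65536 levels
        step = 2.0 / levels              # width of one quantization step
        # Snap each sample to the center of its step; clip so x == 1.0
        # does not land on a nonexistent level above the top step.
        q = np.floor(x / step) * step + step / 2
        return np.clip(q, -1.0 + step / 2, 1.0 - step / 2)

Here quantize(x, 2) leaves only four possible amplitude values, which is the 2-bit condition in the sound files above.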
Conclusion
Apparently, speech is a very robust means of communication: both high levels of noise and severe distortion can be tolerated. This is partly due to redundancy, an abundance of information in the speech signal. Our knowledge of syntax, semantics and other linguistic properties enables us to “fill in the blanks” when part of a word or sentence is obscured or missing.

Among the applications that make use of the redundancy in speech are compression schemes, like those used in GSM phones and in IP telephony. The objective is to minimize the amount of data that must be coded and transmitted by discarding as much of the original speech signal as possible. Data rates as low as a few kilobits per second can be achieved, with surprisingly good sound quality.

Last updated 2003