The demonstration consists of two parts. In the first, short speech recordings are obscured by noise meant to resemble the sound of a crowded room with several people talking at the same time, such as a bar (but one with no music). The purpose of this first part is to show that one can usually understand what is said, even at a remarkably high noise level. In the second part, a speech recording is quantized to very few levels, which, simply put, means that a very large part of the information in the signal is removed. Again, speech perception proves to be quite insensitive to this loss of signal quality.

About the speech recordings
Part 1: Obscured speech
Below are links to sound files containing recordings of speech obscured by noise. There are three different recordings, and for each recording there are four sound files with different noise levels. This is how the sounds were created:

· The speech recordings were normalized to 0 dB, meaning that the highest amplitudes in each recording were coded with the highest possible sample value.

· The noise recording was also normalized to 0 dB.

· To produce an obscured speech recording with a given relative noise level, the normalized speech was mixed with normalized noise whose amplitude was set higher than, lower than or equal to that of the speech.

The dB levels in the examples give the speech signal level relative to the noise signal: 0 dB means that the two levels are equal, and +1 dB means that the speech level is 1 dB above the noise. Note that these are peak levels, not RMS levels. A sketch of the procedure is given below.

Listen to the sounds by clicking on the links, and see at what noise level you can make out what is said. For each example there is a link to a written transcript of the recording.
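To make the mixing procedure concrete, here is a minimal NumPy sketch of peak normalization and mixing at a given relative level. It is an illustration under my own assumptions, not the code used to produce the demo files: the function names are invented, and the noise clip is assumed to be at least as long as the speech.

    import numpy as np

    def normalize_peak(x: np.ndarray) -> np.ndarray:
        """Scale the signal so its largest absolute sample is 1.0 (0 dB peak)."""
        return x / np.max(np.abs(x))

    def mix_at_relative_level(speech: np.ndarray, noise: np.ndarray,
                              speech_db: float) -> np.ndarray:
        """Mix peak-normalized speech with peak-normalized noise.

        speech_db is the speech level relative to the noise, in dB:
        0 dB means equal peak levels, +1 dB means the speech peaks
        1 dB above the noise (peak levels, not RMS).
        """
        speech = normalize_peak(speech)
        noise = normalize_peak(noise)[:len(speech)]   # assumes noise is long enough
        gain = 10.0 ** (speech_db / 20.0)             # dB -> amplitude ratio
        mixed = gain * speech + noise
        return mixed / np.max(np.abs(mixed))          # renormalize to avoid clipping

For example, mix_at_relative_level(speech, noise, 1.0) would reproduce the +1 dB condition, with the speech peaking 1 dB above the noise.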
Example 1
What did he say? Check the transcript.

Example 2
What did he say? Check the transcript.

Example 3
This last one is a bit harder, so the speech levels are higher than in the two previous examples. What did he say? Check the transcript.

Part 2: Quantized speech
All digitized sound is quantized; the number of levels is limited by how many bits are used to store each sample. The number of quantization levels is the number of different signal amplitudes that can be represented. The sound files in examples 1-3 use 16 bits, which gives 2^16 = 65536 quantization levels. The following demonstration shows that far fewer levels suffice for speech signals.

The sound files contain the same spoken sentence, quantized with 1, 2, 3 and 16 bits, respectively. As you can hear, four quantization levels (2 bits) are enough to understand most of what is said.
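As an illustration of what requantization does, here is a minimal NumPy sketch of a uniform quantizer. This is my own sketch rather than the demo's actual processing, and it assumes input samples lie in [-1, 1].

    import numpy as np

    def quantize(x: np.ndarray, bits: int) -> np.ndarray:
        """Uniformly requantize a signal in [-1, 1] to 2**bits levels."""
        levels = 2 ** bits               # e.g. 16 bits -> 2**16 = 65536 levels
        step = 2.0 / levels              # width of one quantization step
        # Snap each sample to the center of its step; clip so x == 1.0
        # does not land on a nonexistent level above the top step.
        q = np.floor(x / step) * step + step / 2
        return np.clip(q, -1.0 + step / 2, 1.0 - step / 2)

Here quantize(x, 2) leaves only four possible amplitude values, which is the 2-bit condition in the sound files above.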
Conclusion
Apparently, speech is a very robust means of communication: both high levels of noise and severe distortion can be tolerated. This is partly due to redundancy, an abundance of information in the speech signal. Our knowledge of syntax, semantics and other linguistic properties enables us to “fill in the blanks” when part of a word or sentence is obscured or missing.

Among the applications that make use of the redundancy in speech are compression schemes, like those used in GSM phones and in IP telephony. The objective is to minimize the amount of data that must be coded and transmitted by discarding as much of the original speech signal as possible. Data rates as low as a few kilobits per second can be achieved, with surprisingly good sound quality.

Last updated 2003