Introduction
It might be that in 50 years' time, we’ll have a family android who will converse with us about the weather or even Manchester United’s mid-season performance. If this is the case, then an important component of this icon of the future will be its ability to recognise speech the same way humans do. For the moment though, 'speech recognition' is an important emerging technology that is playing a key role in automotive telematics, mobile phone technology, conferencing systems and similar telecom applications. This article discusses some of the obstacles that such systems need to overcome in order to move towards a human level of performance, and how DSP noise reduction can help optimise the performance of such systems.
The Main Principles of Speech Recognition
Speech recognition is the process of converting a talker’s sampled speech into the sequence of words representing what the talker has said. The basic building block of speech is the phoneme. There is one phoneme for every basic sound in the language. For example, the word 'cat' is constructed from three phonemes: 'k', 'a' and 't'. A speech recognition engine needs to construct the sequence of phonemes in the speech before it can produce the sequence of words. This is typically carried out in a number of distinct stages.
First, each short segment of speech is analysed and its important acoustic characteristics are placed into a feature vector. The feature vector is compared to a database of feature vectors for the various phonemes, in order to find the closest match. This process is repeated for each short segment of speech to produce a sequence of phonemes.
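The matching step described above can be illustrated with a minimal Python sketch. It assumes a tiny, made-up database of two-dimensional reference vectors and a plain Euclidean distance; the names PHONEME_TEMPLATES, closest_phoneme and decode_phonemes are hypothetical, and a real engine would use far richer features and statistical models rather than a single template per phoneme.

```python
import math

# Hypothetical reference vectors, one per phoneme (illustrative values only).
PHONEME_TEMPLATES = {
    "k": (0.9, 0.1),
    "a": (0.2, 0.8),
    "t": (0.7, 0.3),
}

def closest_phoneme(feature_vector, templates=PHONEME_TEMPLATES):
    """Return the phoneme whose reference vector is nearest to feature_vector."""
    return min(templates, key=lambda p: math.dist(feature_vector, templates[p]))

def decode_phonemes(feature_vectors):
    """Repeat the closest-match search for each short segment of speech."""
    return [closest_phoneme(v) for v in feature_vectors]

# Three made-up feature vectors, one per short segment of the word 'cat'.
segments = [(0.88, 0.12), (0.25, 0.75), (0.68, 0.33)]
phonemes = decode_phonemes(segments)
print(phonemes)
```

With these illustrative values, each segment lies closest to the template of the phoneme it is meant to represent, so the decoded sequence spells out 'k', 'a', 't'.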
The next stage uses a pronunciation dictionary to create a number of possible word sequences. A pronunciation dictionary contains a list of words and, for each word, the sequence of phonemes corresponding to its pronunciation. Using this dictionary in reverse, the phoneme sequences are put together to make known words. A single sequence of phonemes can, however, correspond to a number of different word sequences that have the same pronunciation. For example, 'car key' is pronounced the same as 'khaki'. Consequently, this stage will result in a number of alternative word sequences.
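As a rough illustration of using a pronunciation dictionary in reverse, the Python sketch below enumerates every way a phoneme sequence can be split into known words. The three-word lexicon and the phoneme symbols ('aa', 'ii') are assumptions made up for this example, as are the names PRONUNCIATIONS and word_sequences.

```python
# Hypothetical pronunciation dictionary: word -> sequence of phoneme symbols.
PRONUNCIATIONS = {
    "car": ("k", "aa"),
    "key": ("k", "ii"),
    "khaki": ("k", "aa", "k", "ii"),
}

def word_sequences(phonemes, lexicon=PRONUNCIATIONS):
    """Return every word sequence whose pronunciations concatenate to phonemes."""
    phonemes = tuple(phonemes)
    if not phonemes:
        return [[]]  # one way to cover an empty sequence: no words at all
    results = []
    for word, pronunciation in lexicon.items():
        # Does this word's pronunciation match the start of the sequence?
        if phonemes[:len(pronunciation)] == pronunciation:
            # If so, cover the remaining phonemes recursively.
            for rest in word_sequences(phonemes[len(pronunciation):], lexicon):
                results.append([word] + rest)
    return results

candidates = word_sequences(["k", "aa", "k", "ii"])
print(candidates)
```

For this input the sketch finds two alternatives, 'car key' and 'khaki', which is exactly the kind of ambiguity that has to be resolved later.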
A language model then examines the context, and possibly the grammar, of the suggested strings of words to narrow the possibilities down to a word sequence that makes sense, the recognised word sequence.
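One simple way to picture this narrowing step is a bigram model that scores each candidate by how likely each word is to follow the previous one. The sketch below is only an assumption about how such a model might look: the probabilities are invented for illustration, and BIGRAM_PROB, UNSEEN_PROB, log_score and most_plausible are hypothetical names.

```python
import math

# Invented bigram probabilities P(next word | previous word), for illustration only.
BIGRAM_PROB = {
    ("lost", "my"): 0.2,
    ("my", "car"): 0.05,
    ("car", "key"): 0.1,
    ("my", "khaki"): 0.001,
}
UNSEEN_PROB = 1e-6  # small fallback probability for word pairs not in the table

def log_score(words):
    """Sum the log-probabilities of each adjacent word pair in the sequence."""
    return sum(
        math.log(BIGRAM_PROB.get(pair, UNSEEN_PROB))
        for pair in zip(words, words[1:])
    )

def most_plausible(candidates):
    """Pick the candidate word sequence with the highest score."""
    return max(candidates, key=log_score)

candidates = [
    ["lost", "my", "car", "key"],
    ["lost", "my", "khaki"],
]
recognised = most_plausible(candidates)
print(" ".join(recognised))
```

With these made-up numbers, 'lost my car key' scores higher than 'lost my khaki'. Note that summing log-probabilities tends to favour shorter sequences, so a practical model would need to account for sequence length.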
To summarise, a typical speech recognition engine breaks down the speech into a sequence of feature vectors, capturing the important acoustic characteristics of the speech. The feature vectors are converted into a sequence of phonemes, which are built up into suggested sequences of words. These word sequences are then narrowed down to the recognised sentence.
Overcoming Background Noise
One of the major obstacles to achieving high-performance speech recognition is 'noise'. In an in-car situation, this noise comes from a number of sources: the road, the engine, the radio, the wind and maybe even the passengers. On a mobile phone, this noise might be background music, traffic, wind or passers-by talking.
Noise is a problem because it affects the acoustic characteristics extracted from the speech to make the sequence of feature vectors. This then introduces errors in the feature vectors and their corresponding phonemes.
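The effect can be sketched in Python with a toy closest-match classifier. Everything here is an assumption for illustration: the two-dimensional templates, the size of the noise offset, and the names TEMPLATES and closest_phoneme are hypothetical.

```python
import math

# Hypothetical reference vectors for two phonemes (illustrative values only).
TEMPLATES = {
    "k": (0.9, 0.1),
    "t": (0.7, 0.3),
}

def closest_phoneme(feature_vector):
    """Return the phoneme whose reference vector is nearest to feature_vector."""
    return min(TEMPLATES, key=lambda p: math.dist(feature_vector, TEMPLATES[p]))

clean = (0.68, 0.33)          # made-up feature vector for a clean 't'
noise_offset = (0.15, -0.15)  # made-up distortion introduced by background noise
noisy = tuple(c + n for c, n in zip(clean, noise_offset))

clean_match = closest_phoneme(clean)
noisy_match = closest_phoneme(noisy)
print(clean_match, noisy_match)
```

In this toy case the clean vector matches 't', but the noise shifts it closer to the template for 'k', so the wrong phoneme is passed on to the later stages.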
Early attempts to apply noise reduction software techniques to enhance speech recognition performance had limited success since, in most cases, these noise reduction technologies had been developed to improve human-to-human communication systems. With such technology, there is always some misidentification of 'noise' and 'speech'. Noise that is misidentified as speech will be transmitted, leading to speech-like artefacts that can sound like a babbling brook and are very disturbing for a human listener. On the other hand, speech that is misidentified as noise will be removed, potentially causing the speech to sound distorted.
Achieving the optimal performance from a noise reduction technology involves a trade-off between introducing watery artefacts and causing speech distortion. In human-to-human communication, watery artefacts are usually less acceptable than losing small parts of speech, particularly since the brain, to some extent, tends to fill in the missing bits of speech to make sense of the output. On the other hand, in a speech recognition system, even a small amount of speech distortion can result in words being unrecognisable, while watery artefacts are often ignored. Consequently, it is usually necessary to design noise reduction technology specifically for enhancing the performance of speech recognition systems.
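To make the trade-off concrete, the Python sketch below uses basic spectral subtraction, a common textbook technique chosen here purely for illustration; the article does not say which noise reduction method it has in mind. The magnitudes, the noise estimate and the parameter names over_subtraction and floor are all assumptions.

```python
def spectral_subtract(noisy_mag, noise_mag, over_subtraction=1.0, floor=0.05):
    """Subtract a scaled noise estimate from each frequency bin's magnitude.

    A larger over_subtraction removes more residual noise (fewer artefacts)
    but also removes more speech energy (more distortion). The floor keeps
    each bin at a small fraction of its noisy magnitude instead of going negative.
    """
    cleaned = []
    for y, n in zip(noisy_mag, noise_mag):
        residual = y - over_subtraction * n
        cleaned.append(max(residual, floor * y))
    return cleaned

# Made-up magnitudes: bins 0 and 3 carry speech, bins 1 and 2 are mostly noise.
noisy = [1.0, 0.3, 0.25, 0.9]
noise_estimate = [0.2, 0.2, 0.2, 0.2]

gentle = spectral_subtract(noisy, noise_estimate, over_subtraction=1.0)
aggressive = spectral_subtract(noisy, noise_estimate, over_subtraction=2.0)
print(gentle)
print(aggressive)
```

In this toy example the aggressive setting pushes the noise-dominated bins down to the floor, but it also cuts the speech bins more deeply than the gentle setting does. Following the argument above, a system tuned for a recogniser might lean towards the gentler end of such a parameter, while one tuned for human listeners might accept more speech loss to suppress artefacts.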