"Robust Combination of Neural Networks and Hidden Markov Models for Speech Recognition"
Edmondo Trentin - PhD Thesis
Università di Firenze (DSI), Feb 27, 2001
Advisor: Prof. Marco Gori
This Thesis is in memory of Bernhard "Massimo Pigone" Flury
Keywords: Automatic speech recognition; hidden Markov model; artificial neural network; hybrid system; adaptive amplitude of activation functions; gradient ascent; maximum likelihood training; divergence problem; weight-grouping; robustness to noise.
Abstract: In spite of their ability to classify short-time acoustic-phonetic units, such as individual phonemes or a few isolated words, artificial neural networks (ANNs) historically failed as a general framework for automatic speech recognition (ASR). The failure emerges when dealing with long sequences of acoustic observations, such as those required to represent words from a large dictionary or whole sentences (continuous speech recognition). This is mainly due to ANNs' inability to model long-term dependencies, even when recurrent architectures are considered. In the early Nineties this fact led to the idea of combining hidden Markov models (HMMs) and ANNs within a unifying, novel model, broadly known as the hybrid ANN/HMM. A number of significant, different hybrid paradigms for ASR were proposed in the literature. Unfortunately, state-of-the-art ANN/HMM hybrid systems do not fully exploit the potential advantages of combining the underlying paradigms, mainly for the following reasons: (1) in most hybrid architectures ANNs play only a marginal role, while the kernel of the recognizer still relies on a standard HMM; (2) in the major cases of connectionist probability estimation for an underlying HMM, namely Bourlard and Morgan's model and its many derivatives, the lack of a globally-defined, mathematically motivated training scheme strongly weakens the theoretical framework and substantially reduces the improvement in performance in the field; (3) no existing hybrid system sufficiently exploited the generalization capabilities of ANNs to tackle the problem of ASR in noisy environments under different, adverse conditions. The search for theoretically motivated as well as effective solutions to these problems motivated the development of a novel ANN/HMM hybrid paradigm.
We introduced novel algorithms, based on a gradient-ascent technique aimed at maximizing the likelihood of the acoustic observations given the model, for the global training of a hybrid ANN/HMM system. The ANN is used as a non-parametric estimator of the emission probabilities associated with the individual states of the HMM. The approach is clearly related to the hybrid systems proposed by Bourlard & Morgan and by Bengio, respectively, with the twofold goal of combining the benefits of both within a unifying framework and of overcoming their limitations. Once the generic (BML) training algorithm is given, a critical point has to be considered, referred to as the "divergence problem": how can we ensure that the ANN outputs actually estimate likelihoods? The main answers to this question are: (1) imposing a probabilistic constraint on the ANN weights (SWS-ML algorithm); (2) factorizing the emission probabilities via Bayes' theorem, i.e. performing connectionist estimation of state-posterior probabilities (Bayes algorithm); (3) extremizing a maximum a posteriori training criterion instead of the bare likelihood (MAP algorithm). Improved learning and generalization capabilities, i.e. robustness of the overall recognition system to noise, are finally pursued via a novel, gradient-driven "soft" weight-grouping technique, which relies on the introduction of adaptive amplitudes of the activation functions. Three variants of the technique are developed (unique amplitude, layer-by-layer, unit-by-unit). Experiments were carried out on noisy ASR tasks, namely connected-digit recognition, in two distinct setups: (i) office noise was added to clean signals from the SPK database at different SNRs; (ii) the VODIS II/SpeechDatCar database, collected in a real car environment, was used.
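To make the adaptive-amplitude idea concrete, here is a minimal sketch (not the thesis code; all names such as AdaptiveAmplitudeLayer and fit_step are illustrative) of the unit-by-unit variant: each unit i computes y_i = lambda_i * tanh(a_i), and the amplitude lambda_i is trained by gradient descent alongside the weights, so that all weights feeding unit i are rescaled together through a single shared parameter, which is the "soft" grouping effect. A squared-error criterion is used here purely for illustration, in place of the likelihood-based criteria of the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

class AdaptiveAmplitudeLayer:
    """One layer with a trainable amplitude lambda_i per unit (unit-by-unit variant)."""

    def __init__(self, n_in, n_out):
        self.W = rng.normal(scale=0.5, size=(n_out, n_in))
        self.b = np.zeros(n_out)
        self.lam = np.ones(n_out)          # one adaptive amplitude per unit

    def forward(self, x):
        self.a = self.W @ x + self.b       # pre-activations a_i
        self.t = np.tanh(self.a)
        return self.lam * self.t           # y_i = lambda_i * tanh(a_i)

    def fit_step(self, x, target, eta=0.1):
        y = self.forward(x)
        err = y - target                   # dE/dy for E = 0.5 * ||y - target||^2
        # Amplitude gradient: dE/dlambda_i = err_i * tanh(a_i)
        self.lam -= eta * err * self.t
        # Weight gradient: dE/dW_ij = err_i * lambda_i * (1 - tanh(a_i)^2) * x_j
        delta = err * self.lam * (1.0 - self.t ** 2)
        self.W -= eta * np.outer(delta, x)
        self.b -= eta * delta
        return 0.5 * float(err @ err)

layer = AdaptiveAmplitudeLayer(3, 2)
x = np.array([0.5, -1.0, 0.2])
target = np.array([0.3, -0.4])
losses = [layer.fit_step(x, target) for _ in range(200)]
print(losses[0], losses[-1])   # loss decreases as weights and amplitudes adapt
```

The layer-by-layer and unique-amplitude variants follow the same scheme with `self.lam` replaced by a single scalar shared across one layer or across the whole network, respectively.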
The results confirmed a dramatic improvement in recognition performance over the standard HMM (as well as over the Bourlard & Morgan paradigm), highlighting that the unit-by-unit variant is best suited to the particular architecture/problem under consideration. In a broader perspective, once the whole ASR system relies on an ANN, techniques aimed at improving the convergence and generalization properties of the latter (e.g., regularization theory) may be successfully applied to the ANN/HMM hybrid in a straightforward manner.