"Robust Combination of Neural Networks and Hidden
Markov Models for Speech Recognition"Edmondo Trentin - PhD ThesisUniversita' di Firenze (DSI), Feb 27, 2001Advisor: Prof. Marco GoriThis Thesis is in memory of Bernhard "Massimo Pigone" Flury |

**Keywords:** Automatic speech recognition; hidden Markov model;
artificial neural network; hybrid system; adaptive amplitude of
activation functions; gradient ascent; maximum likelihood training;
divergence problem; weight-grouping; robustness to noise.

**Abstract:**
In spite of their ability to classify short-time acoustic-phonetic
units, such as individual phonemes or a few isolated words, artificial
neural networks (ANNs) historically failed as a general framework for
automatic speech recognition (ASR). Failure emerges in dealing with
long sequences of acoustic observations, like those required in order
to represent words from a large dictionary or whole sentences
(continuous speech recognition). This is mainly due to the lack of
ability to model long-term dependencies in ANNs, even when recurrent
architectures are considered. In the early Nineties this fact led to
the idea of combining hidden Markov models (HMMs) and ANNs within a
unifying, novel model, broadly known as hybrid ANN/HMM. A number of
significant, different hybrid paradigms for ASR were proposed in the
literature.
Unfortunately, state-of-the-art ANN/HMM hybrid systems do not fully
exploit the potential advantages offered by the combination of the
underlying paradigms, mainly for the following reasons: (1) in most
hybrid architectures ANNs play only a marginal role, while the kernel
of the recognizer still relies on a standard HMM; (2) in the main
instances of connectionist probability estimation for an underlying
HMM, namely Bourlard and Morgan's model and its many derivatives, the
lack of a globally-defined, mathematically motivated training scheme
strongly weakens the theoretical framework and substantially reduces
the improvement in performance in the field; (3) no existing hybrid
system has sufficiently stressed the generalization capabilities of
ANNs in order to tackle the problem of ASR in noisy environments under
different, adverse conditions.
The search for theoretically motivated as well as effective solutions
to these problems formed the motivation for the development of a novel
ANN/HMM hybrid paradigm. We introduced novel algorithms based on a
gradient-ascent technique, aimed at maximizing the likelihood of
acoustic observations given the model, for global training of a hybrid
ANN/HMM system. The ANN is used as a non-parametric estimator of
emission probabilities associated with individual states of the
HMM. The approach is clearly related to the hybrid systems proposed by
Bourlard & Morgan and by Bengio, respectively, with the twofold goal
of combining the benefits from both within a unifying framework, and
of overcoming their limitations.
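The core idea above — an ANN supplying the emission probabilities for the states of an HMM, with the whole system trained by gradient ascent on the observation likelihood — can be sketched as follows. All shapes, layer sizes, and the softplus output below are illustrative assumptions, not the thesis' exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_feats, n_hidden = 3, 13, 8

# Toy MLP weights: the net stands in for per-state Gaussian mixtures.
W1 = rng.normal(0, 0.1, (n_hidden, n_feats))
W2 = rng.normal(0, 0.1, (n_states, n_hidden))

def emissions(x):
    """ANN output: one non-negative emission score b_q(x) per HMM state q."""
    h = np.tanh(W1 @ x)
    return np.log1p(np.exp(W2 @ h))  # softplus keeps scores positive

def forward_likelihood(X, A, pi):
    """Standard HMM forward recursion, with emissions supplied by the ANN."""
    alpha = pi * emissions(X[0])
    for x in X[1:]:
        alpha = (alpha @ A) * emissions(x)
    return alpha.sum()  # likelihood of the whole observation sequence

A = np.full((n_states, n_states), 1.0 / n_states)  # toy transition matrix
pi = np.full(n_states, 1.0 / n_states)             # toy initial distribution
X = rng.normal(size=(20, n_feats))                 # 20 acoustic frames

L = forward_likelihood(X, A, pi)
# Global training would now take gradient-ascent steps on W1, W2
# (and the HMM parameters) so as to increase L.
```

Because the likelihood is differentiable in the ANN weights through the forward recursion, a single gradient-ascent loop can update the connectionist and Markovian parameters jointly, which is precisely what distinguishes global training from the decoupled schemes criticized above.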
Once the generic (BML) training algorithm is given, a critical point
has to be addressed, referred to as the "divergence problem": how can
we ensure that the ANN outputs actually estimate likelihoods? Major
answers to this question are: (1) imposing a probabilistic constraint
on the ANN weights (SWS-ML algorithm); (2) factorizing the emission
probabilities via Bayes' theorem, i.e. performing connectionist
state-posterior probability estimation (Bayes algorithm); (3)
maximizing a Maximum A Posteriori training criterion instead of the
bare likelihood (MAP algorithm).
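The factorization in answer (2) can be made concrete with a small sketch. If a net with a softmax output is trained to estimate state posteriors P(q|x), Bayes' theorem gives p(x|q) = P(q|x) p(x) / P(q); since p(x) is the same for every state at a given frame, dividing the posteriors by the state priors yields scaled emissions usable for decoding. The single-layer softmax net and the prior values below are toy assumptions, not the thesis' model:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_feats = 3, 13
W = rng.normal(0, 0.1, (n_states, n_feats))

def posteriors(x):
    """Toy softmax net: outputs P(q | x), summing to 1 by construction."""
    z = W @ x
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

priors = np.array([0.5, 0.3, 0.2])  # P(q), e.g. relative state frequencies

x = rng.normal(size=n_feats)
scaled_emissions = posteriors(x) / priors  # proportional to p(x | q)
```

The softmax output guarantees a valid probability distribution over states, which is what keeps the emission estimates from diverging: the net cannot inflate all outputs simultaneously.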
Improved learning and generalization capabilities, i.e. robustness to
noise of the overall recognition system, are finally pursued via a
novel, gradient-driven "soft" weight-grouping technique. The latter
relies on the introduction of adaptive amplitudes of activation
functions. Three instances of the algorithms are developed (unique
amplitude, layer-by-layer, unit-by-unit). Experiments were carried
out on noisy ASR tasks, namely connected-digit recognition,
in two distinct setups: (i) office noise was added to clean signals
from the SPK database at different SNRs; (ii) the VODIS
II/SpeechDatCar database, collected in a real car environment, was
used. The results confirmed a dramatic improvement in recognition
performance over the standard HMM (as well as over the Bourlard &
Morgan paradigm), highlighting that the unit-by-unit case
is best suited to the particular architecture/problem under
consideration. In a broader perspective, once the whole ASR system
relies on an ANN, techniques aimed at improving convergence and
generalization properties of the latter (e.g., regularization theory)
may be successfully applied to the ANN/HMM hybrid in a straightforward
manner.
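The adaptive-amplitude idea behind the soft weight-grouping technique can be sketched as an activation f(a) = λ·tanh(a), where λ is learned by gradient ascent alongside the weights and may be shared by the whole net, by a layer, or kept per unit (the three instances above). The function names and the per-unit values below are illustrative assumptions:

```python
import numpy as np

def act(a, lam):
    """tanh activation scaled by an adaptive amplitude lam."""
    return lam * np.tanh(a)

def dact_dlam(a, lam):
    """Partial derivative of the activation w.r.t. its amplitude lam."""
    return np.tanh(a)

a = np.array([0.5, -1.0, 2.0])  # net inputs of three units

lam_unique = 1.0                          # one amplitude for the whole net
lam_per_unit = np.array([1.0, 0.8, 1.2])  # unit-by-unit amplitudes

y = act(a, lam_per_unit)
# A gradient-ascent step on an objective J would update each amplitude as
#   lam += eta * dJ/dy * dact_dlam(a, lam)
# so units whose amplitude shrinks toward zero are effectively pruned,
# which is the "soft" grouping effect.
```

With lam = 1 this reduces to the ordinary tanh activation, so the technique strictly generalizes the standard network; the amplitudes add one, one-per-layer, or one-per-unit extra trainable parameters depending on the chosen instance.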

**Download:**

- Part I, in Postscript format: TrentinPhD.1.ps (53K)

- Part II, in Postscript format: TrentinPhD.2.ps (1840K)
