This book discusses large margin and kernel methods for speech and speaker recognition.
Speech and Speaker Recognition: Large Margin and Kernel Methods collects research on recent advances in large margin and kernel methods, as applied to the field of speech and speaker recognition. It presents the theoretical and practical foundations of these methods, from support vector machines to large margin methods for structured learning. It also provides examples of large margin based acoustic modelling for continuous speech recognizers, where the grounds for practical large margin sequence learning are set. Large margin methods for discriminative language modelling and text independent speaker verification are also addressed in this book.
Key Features:
This book will be of interest to researchers, practitioners, engineers, and scientists in speech processing and machine learning fields.
Dr Joseph Keshet, IDIAP, Switzerland
Dr Keshet received his B.Sc. and M.Sc. in electrical engineering from Tel-Aviv University, Tel-Aviv, Israel, in 1994 and 2002, respectively. He received his Ph.D. from the Hebrew University of Jerusalem, Israel, in 2007. From 1994 to 2002, he was with the Israeli Defense Forces (Intelligence Corps), where he was in charge of advanced research activities in the field of speech coding. Since 2007, he has been a research scientist in speech recognition at IDIAP Research Institute, Martigny, Switzerland.
Dr Samy Bengio, Google, California, US
Dr Bengio received his M.Sc. and Ph.D. degrees in Computer Science from the University of Montreal in 1989 and 1993, respectively. Between 1999 and 2006, he was a senior researcher in statistical machine learning at IDIAP Research Institute, where he supervised PhD students and postdoctoral fellows working on many areas of machine learning. He is the author or co-author of more than 160 international publications, including 30 journal papers. He has organized several international workshops (such as the MLMI series) and served on the organizing committees of several well known conferences (such as NIPS). Since early 2007, he has been a research scientist in machine learning at Google, in Mountain View, California.
Samy Bengio and Joseph Keshet
One of the most natural communication tools used by humans is their voice. It is hence natural that a lot of research has been devoted to analyzing and understanding human uttered speech for various applications. The most obvious one is automatic speech recognition, where the goal is to transcribe a recorded speech utterance into its corresponding sequence of words. Other applications include speaker recognition, where the goal is to determine either the claimed identity of the speaker (verification) or who is speaking (identification), and speaker segmentation or diarization, where the goal is to segment an acoustic sequence in terms of the underlying speakers (such as during a dialog).
Although an enormous amount of research has been devoted to speech processing, there appears to be some form of local optimum in terms of the fundamental tools used to approach these problems. The aim of this book is to introduce the speech research community to radically different approaches based on more recent kernel based machine learning methods. In this introduction, we first briefly review the predominant speech processing approach, based on hidden Markov models, as well as its known problems; we then introduce the best known kernel based approach, the Support Vector Machine (SVM), and finally outline the various contributions of this book.
1.1 The Traditional Approach to Speech Processing
Most speech processing problems, including speech recognition, speaker verification, speaker segmentation, etc., proceed with basically the same general approach, which is described here in the context of speech recognition, as this is the field that has attracted most of the research in the last 40 years. The approach is based on the following statistical framework.
A sequence of acoustic feature vectors is extracted from a spoken utterance by a front-end signal processor. We denote the sequence of acoustic feature vectors by $\bar{x} = (x_1, x_2, \ldots, x_T)$, where $x_t \in X$ and $X \subset \mathbb{R}^d$ is the domain of the acoustic vectors. Each vector is a compact representation of the short-time spectrum. Typically, each vector covers a period of 10 ms and there are approximately $T = 300$ acoustic vectors in a 10 word utterance. The spoken utterance consists of a sequence of words $\bar{v} = (v_1, \ldots, v_N)$. Each of the words belongs to a fixed and known vocabulary $V$, that is, $v_i \in V$. The task of the speech recognizer is to predict the most probable word sequence $\bar{v}'$ given the acoustic signal $\bar{x}$. Speech recognition is formulated as a Maximum a Posteriori (MAP) decoding problem as follows:
$$\bar{v}' = \arg\max_{\bar{v}} P(\bar{v} \mid \bar{x}) = \arg\max_{\bar{v}} \frac{p(\bar{x} \mid \bar{v})\, P(\bar{v})}{p(\bar{x})}, \tag{1.1}$$
where we used Bayes' rule to decompose the posterior probability in Equation (1.1). The term $p(\bar{x} \mid \bar{v})$ is the probability of observing the acoustic vector sequence $\bar{x}$ given a specified word sequence $\bar{v}$ and it is known as the acoustic model. The term $P(\bar{v})$ is the probability of observing a word sequence $\bar{v}$ and it is known as the language model. The term $p(\bar{x})$ can be disregarded, since it does not depend on $\bar{v}$ and is therefore constant under the max operation.
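As a rough illustration of the decision rule in Equation (1.1), the sketch below picks, among a handful of candidate word sequences, the one that maximizes the sum of the acoustic and language model log-scores. The candidate list and all score values are hypothetical, and a real recognizer searches a far larger hypothesis space rather than enumerating candidates.

```python
import math

def map_decode(candidates, acoustic_loglik, lm_logprob):
    """Return the word sequence maximizing log p(x|v) + log P(v).

    The term log p(x) is omitted because it is the same for every candidate.
    """
    return max(candidates, key=lambda v: acoustic_loglik[v] + lm_logprob[v])

# Hypothetical candidate transcriptions of one utterance (illustrative only).
candidates = [
    ("recognize", "speech"),
    ("wreck", "a", "nice", "beach"),
    ("recognize", "peach"),
]
# Hypothetical acoustic log-likelihoods log p(x|v).
acoustic_loglik = {candidates[0]: -120.0, candidates[1]: -118.5, candidates[2]: -125.0}
# Hypothetical language model log-probabilities log P(v).
lm_logprob = {
    candidates[0]: math.log(1e-4),
    candidates[1]: math.log(1e-7),
    candidates[2]: math.log(1e-6),
}

best = map_decode(candidates, acoustic_loglik, lm_logprob)
acoustic_only = max(candidates, key=acoustic_loglik.get)
print("MAP choice:          ", " ".join(best))
print("Acoustic-only choice:", " ".join(acoustic_only))
```

With these made-up numbers the acoustic model alone slightly prefers the second candidate, while the language model prior tips the MAP decision toward the first.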
The acoustic model is usually estimated by a Hidden Markov Model (HMM) (Rabiner and Juang 1993), a kind of graphical model (Jordan 1999) that represents the joint probability of an observed variable and a hidden (or latent) variable. In order to understand the acoustic model, we now describe the basic HMM decoding process. By decoding we mean the calculation of the $\arg\max_{\bar{v}}$ in Equation (1.1). The process starts with an assumed word sequence $\bar{v}$. Each word in this sequence is converted into a sequence of basic spoken units called phones using a pronunciation dictionary. Each phone is represented by a single HMM, where the HMM is a probabilistic state machine typically composed of three states (which are the hidden or latent variables) in a left-to-right topology. Assume that $Q$ is the set of all states, and let $\bar{q}$ be a sequence of states, that is $\bar{q} = (q_1, q_2, \ldots, q_T)$, where it is assumed there exists some latent random variable $q_t \in Q$ for each frame $x_t$ of $\bar{x}$. To summarize, the sequence of words $\bar{v}$ is converted into a sequence of phones $\bar{p}$ using a pronunciation dictionary, and the sequence of phones is converted to a sequence of states, with in general at least three states per phone. The goal now is to find the most probable sequence of states.
Formally, the HMM is defined as a pair of random processes $\bar{q}$ and $\bar{x}$, where the following first order Markov assumptions are made:
1. $P(q_t \mid q_1, q_2, \ldots, q_{t-1}) = P(q_t \mid q_{t-1})$;
2. $p(x_t \mid x_1, \ldots, x_{t-1}, x_{t+1}, \ldots, x_T, q_1, \ldots, q_T) = p(x_t \mid q_t)$.
The HMM is a generative model and can be thought of as a generator of acoustic vector sequences. During each time unit (frame), the model can change state with probability $P(q_t \mid q_{t-1})$, also known as the transition probability. Then, at every time step, an acoustic vector is emitted with probability $p(x_t \mid q_t)$, sometimes referred to as the emission probability. In practice the sequence of states is not observable; hence the model is called hidden. The probability of the state sequence $\bar{q}$ given the observation sequence $\bar{x}$ can be found using Bayes' rule as follows:
$$P(\bar{q} \mid \bar{x}) = \frac{p(\bar{x}, \bar{q})}{p(\bar{x})},$$
where the joint probability of a vector sequence $\bar{x}$ and a state sequence $\bar{q}$ is calculated simply as a product of the transition probabilities and the output probabilities:
$$p(\bar{x}, \bar{q}) = \prod_{t=1}^{T} P(q_t \mid q_{t-1})\, p(x_t \mid q_t), \tag{1.2}$$
where we assumed that $q_0$ is constrained to be a non-emitting initial state. The emission density distributions $p(x_t \mid q_t)$ are often estimated using diagonal covariance Gaussian Mixture Models (GMMs) for each state $q_t$, which model the density of a $d$-dimensional vector $x$ as follows:
$$p(x) = \sum_i w_i\, \mathcal{N}(x; \mu_i, \sigma_i), \tag{1.3}$$
where $w_i \in \mathbb{R}$ is positive with $\sum_i w_i = 1$, and $\mathcal{N}(\cdot\,; \mu_i, \sigma_i)$ is a Gaussian with mean $\mu_i \in \mathbb{R}^d$ and standard deviation $\sigma_i \in \mathbb{R}^d$. Given the HMM parameters in the form of the transition probability and emission probability (as GMMs), the problem of finding the most probable state sequence is solved by maximizing $p(\bar{x}, \bar{q})$ over all possible state sequences using the Viterbi algorithm (Rabiner and Juang 1993).
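The maximization of the joint probability over state sequences can be sketched as follows. This is a minimal log-domain Viterbi implementation in plain Python, assuming the emission log-densities have already been computed for every frame and state (for instance by a GMM per state); the three-state left-to-right model and all probability values are hypothetical.

```python
import math

def safe_log(p):
    """log(p), with log(0) mapped to -inf so forbidden transitions stay forbidden."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(log_init, log_trans, log_emit):
    """Most probable state sequence and its log joint probability.

    log_init[j]     : log P(q_1 = j | q_0), leaving the non-emitting initial state
    log_trans[i][j] : log P(q_t = j | q_{t-1} = i)
    log_emit[t][j]  : log p(x_t | q_t = j)
    """
    n_states = len(log_init)
    # delta[j]: best log joint probability of any partial path ending in state j.
    delta = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        new_delta, ptr = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: delta[i] + log_trans[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] + log_trans[best_i][j] + log_emit[t][j])
        delta = new_delta
        backptr.append(ptr)
    # Trace the best path back from the best final state.
    last = max(range(n_states), key=lambda j: delta[j])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical three-state left-to-right HMM and four frames of emission scores.
init = [1.0, 0.0, 0.0]
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.6, 0.4],
         [0.0, 0.0, 1.0]]
emit = [[0.9, 0.05, 0.05],
        [0.2, 0.7, 0.1],
        [0.1, 0.7, 0.2],
        [0.1, 0.1, 0.8]]

path, log_joint = viterbi(
    [safe_log(p) for p in init],
    [[safe_log(p) for p in row] for row in trans],
    [[safe_log(p) for p in row] for row in emit],
)
print(path, math.exp(log_joint))
```

Working in the log domain turns the product in Equation (1.2) into a sum, which avoids numerical underflow on utterances with hundreds of frames.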
In the training phase, the model parameters are estimated. Assume one has access to a training set of $m$ examples $T_{\text{train}} = \{(\bar{x}_i, \bar{v}_i)\}_{i=1}^{m}$. Training of the acoustic model and the language model can be done in two separate steps. The acoustic model parameters include the transition probabilities and the emission probabilities, and they are estimated by a procedure known as the Baum-Welch algorithm (Baum et al. 1970), which is a special case of the Expectation-Maximization (EM) algorithm, when applied to HMMs. This algorithm provides a very efficient procedure to estimate these probabilities iteratively. The parameters of the HMMs are chosen to maximize the probability of the acoustic vector sequence $p(\bar{x})$ given a virtual HMM composed as the concatenation of the phone HMMs that correspond to the underlying sequence of words $\bar{v}$. The Baum-Welch algorithm monotonically converges in polynomial time (with respect to the number of states and the length of the acoustic sequences) to local stationary points of the likelihood function.
Language models are used to estimate the probability of a given sequence of words, $P(\bar{v})$. The language model is often estimated by n-grams (Manning and Schutze 1999), where the probability of a sequence of $N$ words $(v_1, v_2, \ldots, v_N)$ is estimated as follows:
$$P(\bar{v}) \approx \prod_{t=1}^{N} P(v_t \mid v_{t-1}, v_{t-2}, \ldots, v_{t-n+1}), \tag{1.4}$$
where each term can be estimated on a large corpus of written documents by simply counting the occurrences of each n-gram. Various smoothing and back-off strategies have been developed for the case of large $n$, where most n-grams would be poorly estimated even using very large text corpora.
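The counting procedure can be illustrated for the simplest non-trivial case, a bigram model ($n = 2$). The sketch below is only an assumption-laden toy: the three-sentence corpus and the start-of-sentence symbol `<s>` are hypothetical, and no smoothing is applied, so any unseen bigram receives probability zero.

```python
from collections import Counter

START = "<s>"  # hypothetical start-of-sentence symbol, not part of the text above

def train_bigram(sentences):
    """Maximum likelihood bigram estimates P(v_t | v_{t-1}) from raw counts."""
    context_counts, bigram_counts = Counter(), Counter()
    for words in sentences:
        padded = [START] + words
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}

def sequence_prob(model, words):
    """Product of bigram terms; unseen bigrams contribute probability 0."""
    prob = 1.0
    padded = [START] + words
    for prev, cur in zip(padded, padded[1:]):
        prob *= model.get((prev, cur), 0.0)
    return prob

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
model = train_bigram(corpus)

seen = sequence_prob(model, ["the", "cat", "sat"])       # 2/3 * 1/2 * 1/2
unseen = sequence_prob(model, ["the", "cat", "barked"])  # contains an unseen bigram
print(seen, unseen)
```

The zero assigned to the second sentence, which is perfectly plausible English, is exactly the estimation problem that smoothing and back-off strategies are meant to address.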
1.2 Potential Problems of the Probabilistic Approach
Although most state-of-the-art approaches to speech recognition are based on the use of HMMs and GMMs, also called Continuous Density HMMs (CD-HMMs), they have several drawbacks, some of which we discuss hereafter.
Consider the logarithmic form of Equation (1.2),
$$\log p(\bar{x}, \bar{q}) = \sum_{t=1}^{T} \log P(q_t \mid q_{t-1}) + \sum_{t=1}^{T} \log p(x_t \mid q_t). \tag{1.5}$$
There is a known structural problem when mixing densities $p(x_t \mid q_t)$ and probabilities $P(q_t \mid q_{t-1})$: the global likelihood is mostly influenced by the emission distributions and almost not at all by the transition probabilities, hence temporal aspects are poorly taken into account (Bourlard et al. 1996; Young 1996). This happens mainly because the variance of the densities of the emission distribution depends on $d$, the actual dimension of the acoustic features: the higher $d$, the higher the expected variance of $p(\bar{x} \mid \bar{q})$, while the variance of the transition distributions mainly depends on the number of states of the HMM. In practice, one can observe a ratio of about 100 between these variances; hence when selecting the best sequence of words for a given acoustic sequence, only the emission distributions are effectively taken into account. Although the latter may be very well estimated using GMMs, they do not take into account most temporal dependencies between them (which are supposed to be modeled by the transitions).
While the EM algorithm is very well known and efficiently implemented for HMMs, it can only converge to local optima, and hence optimization may greatly vary according to initial parameter settings. For CD-HMMs the Gaussian means and variances are often initialized using K-Means, which is itself also known to be very sensitive to initialization.
Not only is EM known to be prone to local optima, it is basically used to maximize the likelihood of the observed acoustic sequence, in the context of the expected sequence of words. Note however that the performance of most speech recognizers is estimated using measures other than the likelihood. In general, one is interested in minimizing the number of errors in the generated word sequence. This is often done by computing the Levenshtein distance between the expected and the obtained word sequences, and is often known as the word error rate. There might be a significant difference between the best HMM models according to the maximum likelihood criterion and the word error rate criterion.
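To make the evaluation measure concrete, here is a minimal sketch of a word error rate computation based on the Levenshtein distance over words. It assumes unit cost for substitutions, insertions and deletions, normalizes by the length of the reference, and requires a non-empty reference; the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    n, m = len(reference), len(hypothesis)
    # prev[j]: edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            substitution = prev[j - 1] + (reference[i - 1] != hypothesis[j - 1])
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur[j] = min(substitution, deletion, insertion)
        prev = cur
    return prev[m] / n

reference = "the cat sat on the mat".split()
hypothesis = "the cat sit on mat".split()
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.3f}")  # one substitution and one deletion over six words
```

Nothing in this quantity refers to the likelihood of the acoustic sequence, which is why a model that is optimal under maximum likelihood need not be optimal under this criterion.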
Hence, throughout the years, various alternatives have been proposed. One line of research has been centered around proposing more discriminative training algorithms for HMMs. That includes Maximum Mutual Information Estimation (MMIE) (Bahl et al. 1986), Minimum Classification Error (MCE) (Juang and Katagiri 1992), Minimum Phone Error (MPE) and Minimum Word Error (MWE) (Povey and Woodland 2002). All these approaches, although proposing better training criteria, still suffer from most of the drawbacks described earlier (local minima, useless transitions).
The last 15 years of research in the machine learning community have seen the introduction of so-called large margin and kernel approaches, of which the SVM is the best known example. An important role of this book is to show how these recent efforts from the machine learning community can be used to improve research in the speech processing domain. Hence, the next section is devoted to a brief introduction to SVMs.
1.3 Support Vector Machines for Binary Classification
The best known kernel based machine learning approach is the SVM (Vapnik 1998). While it was not developed in particular for speech processing, most of the chapters in this book propose kernel methods that are in one way or another inspired by the SVM.
Let us assume we are given a training set of $m$ examples $T_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{m}$ where $x_i \in \mathbb{R}^d$ is a $d$-dimensional input vector and $y_i \in \{-1, 1\}$ is the target class. The simplest binary classifier one can think of is the linear classifier, where we are looking for parameters $(w \in \mathbb{R}^d, b \in \mathbb{R})$ such that
$$\hat{y}(x) = \operatorname{sign}(w \cdot x + b). \tag{1.6}$$
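A minimal sketch of the decision rule in Equation (1.6) is given below. The toy data set and both parameter settings are hypothetical, and the convention of mapping a score of exactly zero to the class +1 is an assumption made here for definiteness.

```python
def predict(w, b, x):
    """Linear classifier: the sign of the affine score w . x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1  # assumed tie-breaking: score 0 -> +1

# Hypothetical two-dimensional training set of (input, label) pairs.
data = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((-1.0, -1.0), -1), ((-2.0, 0.0), -1)]

# Two different hypothetical parameter settings.
params_a = ((1.0, 1.0), 0.0)
params_b = ((1.0, 0.0), -0.5)

for w, b in (params_a, params_b):
    correct = all(predict(w, b, x) == y for x, y in data)
    print(w, b, "classifies all training points correctly:", correct)
```

Both settings classify every training point correctly, so the training data alone do not single out one linear classifier.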
When the training set is linearly separable, there is potentially an infinite number of solutions $(w \in \mathbb{R}^d, b \in \mathbb{R})$ for which (1.6) classifies every training example correctly. Hence, the SVM approach looks for the one that maximizes the margin between the two classes, where the margin can be defined as the sum of the smallest distances between the separating hyper-plane and points of each class. This concept is illustrated in Figure 1.1.
This can be expressed by the following optimization problem:
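$$\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 \quad \forall i. \tag{1.7}$$
While this problem is difficult to solve directly, the following dual formulation is computationally more efficient:
$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j \quad \text{subject to} \quad \alpha_i \ge 0 \;\; \forall i, \quad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{1.8}$$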
One problem with this formulation is that if the problem is not linearly separable, there might be no solution to it. Hence one can relax the constraints by allowing errors with an additional hyper-parameter C that controls the trade-off between maximizing the margin and minimizing the number of training errors (Cortes and Vapnik 1995), as follows:
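$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \forall i, \tag{1.9}$$
which has the dual formulation
$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j \quad \text{subject to} \quad 0 \le \alpha_i \le C \;\; \forall i, \quad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{1.10}$$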
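The role of the hyper-parameter C can be illustrated by evaluating the relaxed objective for a fixed classifier. The sketch below assumes the standard soft-margin primal of Cortes and Vapnik (1995), namely half the squared norm of w plus C times the sum of slacks, where for fixed (w, b) the smallest feasible slack of example i is max(0, 1 - y_i (w . x_i + b)). The data and parameters are hypothetical, and the code only evaluates the objective; it does not solve the optimization problem.

```python
def soft_margin_objective(w, b, data, C):
    """Return (objective, slacks) for the assumed soft-margin primal.

    objective = 0.5 * ||w||^2 + C * sum_i slack_i
    slack_i   = max(0, 1 - y_i * (w . x_i + b))
    """
    norm_sq = sum(wi * wi for wi in w)
    slacks = []
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        slacks.append(max(0.0, 1.0 - y * score))
    return 0.5 * norm_sq + C * sum(slacks), slacks

# Hypothetical data: the last point lies on the wrong side of the hyper-plane.
data = [((2.0, 0.0), 1), ((0.5, 0.0), 1), ((-2.0, 0.0), -1), ((0.5, 1.0), -1)]
w, b = (1.0, 0.0), 0.0

objective_small_c, slacks = soft_margin_objective(w, b, data, C=1.0)
objective_large_c, _ = soft_margin_objective(w, b, data, C=10.0)
print(slacks)             # zero slack means the margin constraint holds
print(objective_small_c)  # margin term plus lightly weighted violations
print(objective_large_c)  # same classifier, violations weighted ten times more
```

A slack between 0 and 1 marks a point inside the margin but still correctly classified, while a slack above 1 marks a training error; increasing C makes both kinds of violation more expensive relative to the margin term.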
In order to look for nonlinear solutions, one can easily replace $x$ by some nonlinear function $\Phi(x)$. It is interesting to note that $x$ only appears in dot products in (1.10). It has thus been proposed to replace all occurrences of $\Phi(x_i) \cdot \Phi(x_j)$ by some kernel function $k(x_i, x_j)$. As long as $k(\cdot, \cdot)$ is a positive semi-definite kernel, and hence the reproducing kernel of some reproducing kernel Hilbert space (RKHS), one can guarantee that there exists some function $\Phi(\cdot)$ such that
$$k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j).$$
Thus, even if $\Phi(x)$ projects $x$ in a very high (possibly infinite) dimensional space, $k(x_i, x_j)$ can still be efficiently computed.
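The identity between a kernel and a dot product in feature space can be checked numerically on a small case. The sketch below uses the homogeneous polynomial kernel of degree 2 on two-dimensional inputs, for which an explicit three-dimensional feature map is known, and also defines a Gaussian kernel as an example whose feature space is infinite dimensional. The choice of kernels, the bandwidth parameter `sigma` and the test points are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

def phi(x):
    """Explicit feature map for poly2_kernel on 2-D inputs."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel; its feature map cannot be written out explicitly."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x, z = (1.0, 2.0), (3.0, -1.0)
via_kernel = poly2_kernel(x, z)         # computed in the 2-D input space
via_features = dot(phi(x), phi(z))      # computed in the 3-D feature space
print(via_kernel, via_features)
print(gaussian_kernel(x, z))
```

For the polynomial kernel the saving is small, but the same pattern applies when the feature space is too large to enumerate, as with the Gaussian kernel.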
Problem (1.10) can be solved using off-the-shelf quadratic optimization tools. Note however that the underlying computational complexity is at least quadratic in the number of training examples, which can often be a serious limit for most speech processing applications. After solving (1.10), the resulting SVM solution takes the form of
$$\hat{y}(x) = \operatorname{sign}\left( \sum_{i=1}^{m} y_i \alpha_i \, k(x_i, x) + b \right), \tag{1.11}$$
where most $\alpha_i$ are zero except those corresponding to examples in the margin or misclassified, often called support vectors (hence the name of SVMs).
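The form of the resulting classifier can be sketched as a sum over support vectors only. The code below assumes the usual kernel expansion, the sign of the sum of alpha_i y_i k(x_i, x) plus b, and uses a hypothetical two-point training set with a linear kernel, for which the values alpha = 0.25 and b = 0 can be verified by hand to solve the separable dual problem.

```python
def linear_kernel(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))

def svm_decision(x, support_vectors, alphas, labels, b, kernel):
    """Classify x using only the examples with non-zero alpha."""
    score = b + sum(
        alpha * y * kernel(sv, x)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    )
    return 1 if score >= 0 else -1  # assumed tie-breaking: score 0 -> +1

# Hypothetical training set of two points, both of which are support vectors.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]  # satisfies sum_i alpha_i y_i = 0
b = 0.0

prediction_pos = svm_decision((2.0, 0.5), support_vectors, alphas, labels, b, linear_kernel)
prediction_neg = svm_decision((-3.0, 1.0), support_vectors, alphas, labels, b, linear_kernel)
print(prediction_pos, prediction_neg)
```

Because the sum runs only over support vectors, the cost of classifying a new point grows with their number rather than with the size of the full training set.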
1.4 Outline
The book has four parts. The first part, Foundations, covers important aspects of extending the binary SVM to speech and speaker recognition applications. Chapter 2 provides a detailed review of efficient and practical solutions to large scale convex optimization problems one encounters when using large margin and kernel methods with the enormous datasets used in speech applications. Chapter 3 presents an extension of the binary SVM to multiclass, hierarchical and categorical classification. Specifically, the chapter presents a more complex setting in which the possible labels or categories are many and organized.
(Continues...)
Excerpted from Automatic Speech and Speaker Recognition Copyright © 2009 by John Wiley & Sons, Ltd. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.