Springer, 2017. — 100 p.
The goal of a phone recognition system (PRS) is to derive the sequence of basic sound units from the speech signal. Most state-of-the-art PRSs are developed using spectral features such as Mel-frequency cepstral coefficients (MFCCs). Spectral features mainly represent the gross shape of the vocal tract; they do not capture information about the excitation source or about the positioning and movement of the articulators. However, the production of each sound unit is characterized by articulatory and excitation source features in addition to vocal tract features. A sound unit cannot be produced without an appropriate source of excitation. The rate of vibration of the vocal folds varies from one phone to another, depending on the inherent characteristics of each phone and on coarticulation effects from adjacent phones. The positioning and movement of the articulators also change from one sound unit to another. A unique combination of articulator configuration in the vocal tract and a specific source of excitation results in the production of a particular sound unit.
In this work, articulatory and excitation source features are explored for improving the performance of PRSs. The articulatory features (AFs) are derived from the spectral features using feedforward neural networks (FFNNs). Five AF groups are considered: manner, place, roundness, frontness, and height. Five AF-based tandem PRSs are developed, each combining the spectral features with the AFs derived from the FFNN of one AF group. The phone-level accuracy contributed by each AF group is analyzed systematically. Hybrid PRSs are then developed by combining the evidence from the AF-based tandem PRSs using a weighted combination approach. The use of AFs in addition to spectral features is observed to improve the performance of PRSs.
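The tandem and hybrid ideas described above can be illustrated with a minimal sketch. It is not the book's implementation: the function names, the 13-dimensional spectral vectors, and the choice to combine frame-level phone scores by a normalized weighted sum are all assumptions made here for illustration. The abstract does not state at which level the evidence is combined or how the weights are chosen.

```python
import numpy as np

def tandem_features(spectral, af_posteriors):
    """Append AF posteriors to spectral features, frame by frame.

    spectral:      (frames, spectral_dims) array, e.g. MFCC vectors (assumed layout)
    af_posteriors: (frames, af_classes) array, e.g. FFNN outputs for one AF group
    """
    spectral = np.asarray(spectral, dtype=float)
    af_posteriors = np.asarray(af_posteriors, dtype=float)
    if spectral.shape[0] != af_posteriors.shape[0]:
        raise ValueError("spectral and AF streams must have the same number of frames")
    return np.hstack([spectral, af_posteriors])

def weighted_combination(system_scores, weights):
    """Combine per-system phone scores with a normalized weighted sum.

    system_scores: list of (frames, phones) arrays, one per tandem PRS (assumed layout)
    weights:       one non-negative weight per system (hypothetical values)
    """
    scores = np.stack([np.asarray(s, dtype=float) for s in system_scores])
    w = np.asarray(weights, dtype=float)
    if w.shape[0] != scores.shape[0]:
        raise ValueError("need exactly one weight per system")
    w = w / w.sum()                         # normalize so the weights sum to 1
    return np.tensordot(w, scores, axes=1)  # result has shape (frames, phones)

# Usage with made-up numbers: 4 frames, 13 spectral dims, 5 AF classes.
mfcc = np.zeros((4, 13))
manner_post = np.full((4, 5), 0.2)
tandem = tandem_features(mfcc, manner_post)

# Two hypothetical systems scoring one frame over two phones.
combined = weighted_combination([[[0.9, 0.1]], [[0.2, 0.8]]], weights=[3.0, 1.0])
best_phone = combined.argmax(axis=1)
```

In practice the weights would be tuned on held-out data; the values above are placeholders.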
The excitation source information is derived by processing the linear prediction (LP) residual of the speech signal. Using excitation source information improves the performance of PRSs. The robustness of the proposed excitation source features is demonstrated on speech samples corrupted by white noise and babble noise. PRSs developed using the combination of vocal tract and excitation source features are more robust to noise than PRSs developed using vocal tract features alone. The performance of tandem PRSs also improves when excitation source features are used in addition to spectral features. The performance of PRSs developed using articulatory and excitation source features is analyzed and compared across the read, extempore, and conversation modes of speech. The use of articulatory and excitation source features improves performance in all three modes.
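For readers unfamiliar with the LP residual, the following is a minimal sketch of how it can be computed for a single speech frame using the autocorrelation method. The function names and the default LP order of 10 are assumptions for illustration. The abstract does not specify the analysis settings or how the residual is further processed into features.

```python
import numpy as np

def lp_coefficients(frame, order=10):
    """Estimate LP coefficients by solving the autocorrelation normal equations.

    Returns the inverse-filter coefficients [1, a_1, ..., a_p], so that the
    prediction error is e[n] = s[n] + sum_k a_k * s[n - k].
    Note: an all-zero (silent) frame makes the system singular.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation values r[0..order]
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz matrix with entries R[i, j] = r[|i - j|]
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[idx], -r[1:order + 1])
    return np.concatenate(([1.0], a))

def lp_residual(frame, order=10):
    """Inverse-filter the frame with its own LP coefficients to get the residual."""
    a = lp_coefficients(frame, order)
    return np.convolve(np.asarray(frame, dtype=float), a)[:len(frame)]
```

The residual is what remains after the vocal tract contribution modeled by the LP filter is removed, which is why it is commonly treated as an approximation of the excitation source.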
This book is mainly intended for researchers working in the area of speech recognition. It is also useful for young researchers who want to pursue research in speech processing with an emphasis on articulatory and excitation source features. It may therefore be recommended as a text or reference book for a postgraduate-level advanced speech processing course.
Literature Review
Articulatory Features for Phone Recognition
Excitation Source Features for Phone Recognition
Articulatory and Excitation Source Features for Phone Recognition in Read, Extempore and Conversation Modes of Speech
Summary and Conclusion
Appendix A: MFCC Features
Appendix B: Pattern Recognition Models