Taylor P. Text-to-Speech Synthesis

  • PDF file
  • 4.95 MB
Cambridge University Press, 2009. — 642 p.
Speech processing technology has been a mainstream area of research for more than 50 years. The ultimate goal of speech research is to build systems that mimic (or potentially surpass) human capabilities in understanding, generating and coding speech for a range of human-to-human and human-to-machine interactions.
In the area of speech coding a great deal of success has been achieved in creating systems that significantly reduce the overall bit rate of the speech signal (from rates on the order of 100 kilobits per second to rates on the order of 8 kilobits per second or less), while maintaining speech intelligibility and quality at levels appropriate for the intended applications. The heart of the modern cellular industry is the 8 kilobit per second speech coder, embedded in VLSI logic on the more than 2 billion cellphones in use worldwide at the end of 2007.
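To make the quoted figures concrete, the short sketch below works through the bit-rate arithmetic, assuming telephone-band 16-bit linear PCM as the uncoded reference; the exact parameters of any particular coder will differ.

```python
# Back-of-the-envelope bit-rate arithmetic (assumed parameters, for illustration only).
sample_rate_hz = 8_000        # telephone-band sampling rate
bits_per_sample = 16          # linear PCM resolution
raw_bps = sample_rate_hz * bits_per_sample   # 128,000 bit/s, i.e. "on the order of 100 kbit/s"
coded_bps = 8_000                            # a typical cellular speech coder

print(f"raw PCM:           {raw_bps / 1000:.0f} kbit/s")
print(f"coded speech:      {coded_bps / 1000:.0f} kbit/s")
print(f"compression ratio: {raw_bps / coded_bps:.0f}x")
```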
In the area of speech recognition and understanding by machines, steady progress has enabled systems to become part of everyday life in the form of call centers for the airline, financial, medical and banking industries, help desks for large businesses, form and report generation for the legal and medical communities, and dictation machines that enable individuals to enter text into machines without having to explicitly type the text. Such speech recognition systems were made available to the general public as long ago as 15 years (in 1992 AT&T introduced the Voice Recognition Call Processing system, which automated operator-assisted calls, handling more than 1.2 billion requests each year with error rates below 0.5%), and such systems have penetrated almost every major industry since that time. Simple speech understanding systems have also been introduced into the marketplace and have had varying degrees of success for help desks (e.g., the How May I Help You system introduced by AT&T for customer care applications) and for stock trading applications (IBM system), among others.
It is the area of speech generation that has proven the hardest speech technology area in which to obtain any viable degree of success. For more than 50 years researchers have struggled with the problem of trying to mimic the physical processes of speech generation via articulatory models of the human vocal tract, or via terminal analog synthesis models of the time-varying spectral and temporal properties of speech. In spite of the best efforts of some outstanding speech researchers, the quality of synthetic speech generated by machine was unnatural most of the time and unacceptable for human use in most real-world applications. In the late 1970s the idea of generating speech by concatenating basic speech units (in most cases diphone units, which represent pieces of pairs of phonemes) was investigated and shown to be practical once researchers learned how to reliably excise diphones from human speech. After more than a decade of investigation into how to optimally concatenate diphones, the resulting synthetic speech was often highly intelligible (a big improvement over earlier systems) but regrettably remained highly unnatural. Hence concatenative speech synthesis systems remained lab curiosities and were not employed in real-world applications such as reading email, user interactions in dialogue systems, etc. The really big breakthrough in speech synthesis came in the late 1980s when Yoshinori Sagisaka at ATR in Japan made the leap from single diphone tokens as the basic unit set for speech synthesis to multiple diphone tokens, extracted from carefully designed and read speech databases. Sagisaka realized that in the limiting case, where you had thousands of tokens of each possible diphone of the English language, you could literally concatenate the correct sequence of diphones and produce natural-sounding human speech. The new problem that arose was deciding exactly which of the thousands of diphones should be used at each diphone position in the speech being generated. History has shown that, like most large-scale computing problems, there are solutions that make the search for the optimum sequence of diphones (from a large, virtually infinite database) possible in reasonable time and memory. The rest is now history, as a new generation of speech researchers investigated virtually every aspect of the so-called unit selection method of concatenative speech synthesis, showing that high-quality (both intelligible and natural) synthetic speech could be obtained from such systems for virtually any task application.
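The search alluded to above is usually framed as a dynamic-programming (Viterbi-style) minimisation of a per-unit target cost plus a between-unit join cost. The following is a minimal sketch under that framing; the feature fields (dur, f0, f0_start, f0_end) and the cost functions are made-up toys, not any particular system's design.

```python
# Minimal unit-selection search sketch: Viterbi-style dynamic programming over
# candidate diphone units (toy features and costs, for illustration only).
import math

def target_cost(unit, spec):
    """Toy mismatch between a candidate unit and the target specification
    (here just duration and pitch; real systems weight many more features)."""
    return abs(unit["dur"] - spec["dur"]) + abs(unit["f0"] - spec["f0"])

def join_cost(prev_unit, unit):
    """Toy discontinuity measure at the concatenation point
    (here just the pitch mismatch across the join)."""
    return abs(prev_unit["f0_end"] - unit["f0_start"])

def select_units(candidates, specs):
    """Pick one candidate per diphone position so that the sum of target costs
    plus join costs between neighbours is minimal.
    candidates[i]: list of stored units for position i; specs[i]: target spec."""
    n = len(specs)
    best = [[math.inf] * len(candidates[i]) for i in range(n)]  # cheapest cost so far
    back = [[None] * len(candidates[i]) for i in range(n)]      # backpointers
    for j, u in enumerate(candidates[0]):
        best[0][j] = target_cost(u, specs[0])
    for i in range(1, n):
        for j, u in enumerate(candidates[i]):
            tc = target_cost(u, specs[i])
            for k, prev in enumerate(candidates[i - 1]):
                cost = best[i - 1][k] + join_cost(prev, u) + tc
                if cost < best[i][j]:
                    best[i][j], back[i][j] = cost, k
    # Trace back the cheapest path.
    j = min(range(len(candidates[-1])), key=lambda c: best[-1][c])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]
```

In a real system the candidate lists come from an indexed speech database and the costs combine many prosodic and spectral features, but the dynamic-programming structure of the search is essentially the one shown here.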
Once the problem of generating natural-sounding speech from a sequence of diphones was solved (in the sense that a practical demonstration of the feasibility of such high-quality synthesis was made with unit selection synthesis systems), the remaining long-standing problem was the conversion from ordinary printed text to the proper sequence of diphones, along with associated prosodic information about sound duration, loudness, emphasis, pitch, pauses, and other so-called suprasegmental aspects of speech. The problem of converting from text to a complete linguistic description of the associated sounds is one that has been studied almost as long as synthesis itself; much progress has been made in almost every aspect of the linguistic description of speech, as well as in the acoustic generation of high-quality sounds.
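As a rough illustration of what that text-to-linguistic-description step involves, the sketch below strings together deliberately simplistic placeholder stages (normalisation, pronunciation, prosody prediction). Every function and field name here is a toy stand-in, not a description of Taylor's pipeline or of any real front end.

```python
# Toy sketch of a text-to-speech front end: text -> diphone sequence with prosody.
from dataclasses import dataclass

@dataclass
class SoundSpec:
    diphone: str        # e.g. "h-e"
    duration_ms: float  # predicted segment duration
    f0_hz: float        # predicted pitch target

def normalise(text: str) -> list[str]:
    # Toy normalisation: lower-case and split; real systems expand numbers,
    # abbreviations, dates, etc.
    return text.lower().split()

def pronounce(word: str) -> list[str]:
    # Toy letter-to-sound rule: one "phone" per letter; real systems use a
    # pronunciation lexicon plus trained letter-to-sound rules.
    return list(word)

def predict_prosody(diphones: list[str]) -> list[tuple[float, float]]:
    # Toy prosody: flat 80 ms durations and a gently falling pitch contour.
    n = len(diphones)
    return [(80.0, 120.0 - 20.0 * i / max(n - 1, 1)) for i in range(n)]

def text_to_specs(text: str) -> list[SoundSpec]:
    phones = [p for w in normalise(text) for p in pronounce(w)]
    diphones = [f"{a}-{b}" for a, b in zip(phones, phones[1:])]
    return [SoundSpec(d, dur, f0)
            for d, (dur, f0) in zip(diphones, predict_prosody(diphones))]

print(text_to_specs("Hello world")[:3])
```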
It is the success of unit selection speech synthesis systems that has motivated the research of Paul Taylor, the author of this book on text-to-speech synthesis systems. Paul Taylor has been in the thick of research on speech synthesis systems for more than 15 years, having worked at ATR in Japan on the CHATR synthesizer (the system that actually demonstrated near-perfect speech quality on some subset of the sentences that were input), at the Centre for Speech Technology Research at the University of Edinburgh on the Festival synthesis system, and as Chief Technical Officer of Rhetorical Systems, also in Edinburgh.
Based on decades of research and the extraordinary progress over the past decade, Taylor has put together a book which attempts to tie it all together and to document and explain the processes involved in a complete text-to-speech synthesis system. The first nine chapters of the book address the problem of converting printed text to a sequence of sound units (which characterize the acoustic properties of the resulting synthetic sentence), and an accompanying description of the associated prosody which is most appropriate for the sentence being spoken. The remaining eight chapters (not including the conclusion) provide a review of the associated signal processing techniques for representing speech units and for seamlessly tying them together to form intelligible and natural speech sounds. This is followed by a discussion of the three generations of synthesis methods, namely articulatory and terminal analog synthesis methods, simple concatenative methods using a single representation for each diphone unit, and the unit selection method based on multiple representations for each diphone unit. There is a single chapter devoted to a promising new synthesis approach, namely a statistical method based on the popular Hidden Markov Model (HMM) formulation used in speech recognition systems.
According to the author, "Speech synthesis has progressed remarkably in recent years, and it is no longer the case that state-of-the-art systems sound overtly mechanical and robotic." Although this statement is true, there remains a great deal more to be accomplished before speech synthesis systems are indistinguishable from a human speaker. Perhaps the most glaring need is expressive synthesis that not only imparts the message corresponding to the printed text input, but also imparts the emotion associated with the way a human might speak the same sentence in the context of a dialogue with another human being. We are still a long way from such emotional or expressive speech synthesis systems.
This book is a wonderful addition to the literature in speech processing and will be a must-read for anyone wanting to understand the blossoming field of text-to-speech synthesis.
Communication and Language
The Text-to-Speech Problem
Text Segmentation and Organisation
Text Decoding
Prosody Prediction from Text
Phonetics and Phonology
Pronunciation
Synthesis of Prosody
Signals and Filters
Acoustic Models of Speech Production
Analysis of Speech Signals
Synthesis Techniques based on Vocal Tract Models
Synthesis by Concatenation and Signal Processing Modification
Markov Model Synthesis
Unit Selection Synthesis
Further Issues
Conclusions