Running head: Text to Speech

Text to Speech Technology

Professor:

ABSTRACT

Adding expressive speech capabilities to machines through text-to-speech is an important field of current research. This paper presents an overview of the speech synthesis approach, its applications, and its advancement toward modern technology. It begins with a description of how such systems work, examines the use of text-to-speech software, and applies this technology to the DMCS project to find evidence of the benefits of text-to-speech applications for people engaged in different fields, and the level of accuracy that can be expected. Applications of speech synthesis technology in various fields are then explored. The document concludes with potential uses of the technology in various fields and its likely main uses in the future.

TEXT TO SPEECH – INTRODUCTION Text-To-Speech (TTS) synthesis is a widely used technology that should be able to read any text aloud, whether it was typed into the computer by an operator or scanned and submitted to an Optical Character Recognition (OCR) system. To be more precise, systems that simply concatenate isolated words or parts of sentences, denoted as Voice Response Systems, are only applicable when a limited vocabulary is required (typically a few hundred words), and when the sentences to be pronounced respect a very restricted structure, as is the case for the announcement of arrivals in train stations, for instance. In the context of TTS synthesis, it is impossible to record and store all the words of a language (Dutoit, 1996). It is thus more suitable to define Text-To-Speech as the automatic production of speech, through a grapheme-to-phoneme transcription of the sentences to utter (Keller, 2000).

Speech recognition is an alternative to traditional methods of interacting with machines such as computers and many handheld devices. An effective system can replace, or reduce the reliance on, standard keyboard and mouse input (Wendy, Ridha, Victor, Omar, Gilbert, Wei Chern & Guofeng, 2005). Nowadays many devices talk back to you. Not long ago it was common only to hear phone dialog systems speaking to you, but now an increasing number of personal devices (laptops, smartphones, GPS navigation systems, and game devices) talk to you, too. For this reason, much work has been done to introduce applications that use text-to-speech technology to assist and entertain the user (AT&T Labs Research).

How do these devices work, and what makes them able to talk to us? Text-to-speech (TTS) is the technology that makes it possible. The application takes text and from it produces artificial, machine-generated speech. TTS has been around for many years, though only in the past few years has artificial speech reached a high level of naturalness. According to AT&T Labs, “Better sounding speech combined with the explosive popularity of small mobile devices with even smaller screens has increased consumer demand for TTS” (AT&T Labs Research). Such a system could be used to replace standard input devices such as keyboards and touch pads. According to Venkataramani, in electronics-based applications in various fields and in IT, the design has a wide market opportunity ranging from mobile service providers to ATM makers (Venkataramani, 2006). The targeted users of the system include value added service (VAS) providers, mobile operators, home and office security device providers, ATM manufacturers, mobile phone and Bluetooth headset manufacturers, telephone service providers, manufacturers of instruments for disabled persons, PC users, students, drivers, and business people. As mentioned earlier, this technology frees people to multitask while using their devices. The benefits also reach people with special needs. For people with low vision, TTS can read text from files, books, and websites, making all information accessible. According to AT&T Labs Research, “Stephen Hawking is a famous example (he prefers his own instantly recognizable version of TTS)”. Students learning a new language can use it to improve their pronunciation or listening skills (AT&T Labs Research). Corporations like TTS because the technology can provide information effectively over the telephone, and mobile operators may also benefit, since they can introduce new applications that use TTS.

In this paper we discuss the use of text-to-speech software and the evidence of benefits of text-to-speech applications for people engaged in different fields. The aim is to find out the usefulness of TTS and the effect it is making in the lives of all the target users mentioned above. The technology can assist people who have few keyboard skills or little experience, people who are slow typists, and those who do not have the time or resources to develop keyboard skills. It can also help people who are dyslexic or have problems with character or word use and manipulation in textual form, and people with physical disabilities which affect their data entry, or their ability to read and therefore check what they have entered (Kirriemuir, 2003). According to Kirriemuir, “Speech recognition systems used by the general public e.g. phone-based automated timetable information, or ticketing purchasing, can be used immediately – the user makes contact with the system, and speaks in response to commands and questions” (Kirriemuir, 2003). On the other hand, systems built on computers are meant for individual use, such as personal word processing, and usually require some level of “training” before use.

TTS – THE INSIDE Text-to-speech systems are usually made up of two distinct parts: a front-end and a back-end. The front-end performs two major tasks. First, it converts text containing special symbols such as numbers and abbreviations into the corresponding written-out words; this process is called text normalization or pre-processing, and is sometimes known as tokenization (Snehi, 2006). In the second phase, the front-end produces a phonetic transcription for every word, and divides and marks the text into prosodic units such as phrases, clauses, and sentences. The process of assigning each word its phonetic transcription is known as text-to-phoneme conversion. The output of the front-end is the phonetic transcriptions together with the prosody information. The back-end works as a speech synthesizer, converting this symbolic representation into sound (Jonathon, Sharon & Dennis, 1987).
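The two front-end tasks described above can be sketched in a few lines of code. This is a minimal, illustrative sketch only: the tiny abbreviation, digit, and letter-to-phoneme tables below are hypothetical stand-ins for the large, context-sensitive resources a real front-end would use.

```python
import re

# Toy expansion tables (illustrative entries only).
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
# Hypothetical letter-to-phoneme table standing in for a real
# grapheme-to-phoneme model.
G2P = {"c": "K", "a": "AE", "t": "T"}

def normalize(text):
    """Text normalization: expand abbreviations and spell out digits."""
    tokens = []
    for tok in text.lower().split():
        if tok in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[tok])
        elif tok.isdigit():
            tokens.extend(DIGITS[d] for d in tok)
        else:
            tokens.append(re.sub(r"[^a-z]", "", tok))
    return tokens

def to_phonemes(word):
    """Naive letter-by-letter text-to-phoneme conversion."""
    return [G2P.get(ch, ch.upper()) for ch in word]

print(normalize("Dr. Smith lives at 42 Elm St."))
print(to_phonemes("cat"))  # ['K', 'AE', 'T']
```

The normalized, phonetized output (plus prosody marks, omitted here) is what the front-end would hand to the back-end synthesizer.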

Synthesizer technologies

As discussed above, the synthesizer is a significant part of the process of converting text to speech: it is responsible for converting written text into sound. The most significant qualities of a speech synthesis system are naturalness and understandability. Naturalness describes how close the generated sound is to the human voice, and understandability is the ease with which the output is understood by the user. A speech synthesizer is considered ideal if it is both natural and understandable, and speech synthesis systems are designed by trying to maximize both of these characteristics (Rubin, Baer & Mermelstein, 1981).
The synthesizer converts the words into sound, and this sound is stored in an audio file format as a waveform. The two main technologies for generating these speech waveforms are concatenative synthesis and formant synthesis. Each has its strengths and weaknesses, and the intended use usually determines which approach is chosen (Rubin, Baer & Mermelstein, 1981).
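As a small illustration of the storage step, the sketch below writes synthesizer output samples to a standard WAV file using Python's built-in wave module; the 440 Hz test tone merely stands in for real synthesized speech.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # 16 kHz, a common rate for speech

def write_wav(path, samples, rate=SAMPLE_RATE):
    """Store floating-point samples (-1.0..1.0) as a 16-bit mono WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples)
        wav.writeframes(frames)

# A 0.2 s, 440 Hz tone standing in for synthesizer output.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(int(0.2 * SAMPLE_RATE))]
write_wav("tone.wav", tone)
```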

Concatenative synthesis As the name suggests, concatenative synthesis is based on the concatenation of segments of recorded sound. Normally, the sound produced by concatenative synthesis is very close to natural human speech. However, natural differences in speaking styles, together with the nature of the techniques used for segmenting the waveforms, sometimes cause audible glitches in the output. There are several sub-types of concatenative synthesis; three important ones are discussed below (Pollet & Breen, 2008).

Unit selection synthesis Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Usually, this segmentation is done with automatic alignment tools, with some manual correction afterwards, using visual representations such as the waveform (Schroder, n.d).
An index of the units in the database is then created based on the segmentation and on acoustic parameters such as the fundamental frequency (pitch) and duration. When a specific utterance needs to be produced, it is created by determining the best chain of candidate units from the database; this process of unit selection gives the technique its name. Unit selection is typically implemented using a weighted decision tree (Schroder, n.d).
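The selection process above can be sketched as minimizing two weighted costs: a target cost (how well a candidate matches the requested prosody) and a join cost (how smoothly it concatenates with the previous unit). The database entries and cost functions below are hypothetical, and a greedy search stands in for the full lattice search real systems use.

```python
# Each phone has several recorded candidates, stored as (pitch_hz, duration_ms).
# All values are illustrative, not from a real voice database.
DATABASE = {
    "K":  [(110, 80), (150, 60)],
    "AE": [(115, 120), (180, 90)],
    "T":  [(112, 70), (160, 55)],
}

def target_cost(candidate, wanted_pitch):
    """How far a candidate's pitch is from what the front-end requested."""
    return abs(candidate[0] - wanted_pitch)

def join_cost(prev, cur):
    """Penalize pitch discontinuities at the concatenation point."""
    return abs(prev[0] - cur[0])

def select_units(phones, wanted_pitch=112):
    """Greedy unit selection: per phone, pick the candidate minimizing
    target cost plus join cost to the previously chosen unit.
    (Real systems search all paths, e.g. Viterbi over a unit lattice.)"""
    chosen = []
    for phone in phones:
        best = min(
            DATABASE[phone],
            key=lambda c: target_cost(c, wanted_pitch)
                          + (join_cost(chosen[-1], c) if chosen else 0))
        chosen.append(best)
    return chosen

print(select_units(["K", "AE", "T"]))
```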

Unit selection applies only a small amount of digital signal processing (DSP) to the recorded speech, which preserves naturalness. DSP often makes recorded speech sound less natural, although some processing is needed at the points of concatenation to smooth the waveform. The sound produced by unit selection is often very close to a real human voice, especially in the contexts for which the TTS system has been tuned. To provide maximum naturalness, unit selection sometimes requires the speech databases to be very large, in some systems gigabytes of recorded data representing dozens of hours of speech. Recently, researchers have proposed various automated methods to detect unnatural segments in unit-selection synthesis systems (Thakur & Stato, 2011).

Di-phone synthesis Di-phone synthesis uses a small database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonetics of the language, and only one example of each diphone is kept in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques (Thakur & Stato, 2011). Di-phone synthesis suffers from the sonic glitches of concatenative synthesis and its output tends toward the robotic sound of formant synthesis, while retaining few of the advantages of either approach other than its small size. Although it continues to be used in research, because a variety of freely available software implementations exist, its use in commercial applications is declining (Moulines & Charpentier, 2003).
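The diphone inventory idea can be illustrated by decomposing a phoneme string into the sound-to-sound units that would be looked up in the recorded database. The phoneme symbols and the silence marker below are illustrative stand-ins for a real inventory.

```python
def to_diphones(phones):
    """Split a phoneme sequence into diphones: each unit spans from the
    middle of one sound to the middle of the next, with silence at the
    utterance edges marked '_'."""
    padded = ["_"] + phones + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

# "cat" as /K AE T/ yields four diphones to retrieve from the database:
print(to_diphones(["K", "AE", "T"]))  # ['_-K', 'K-AE', 'AE-T', 'T-_']
```

Because each transition is stored only once, the database stays small; prosody must then be imposed by signal processing rather than by choosing among candidates, which is where the robotic quality comes from.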

Formant synthesis Formant synthesis is a method in which the synthesized speech is generated using acoustic modeling techniques instead of human speech samples. Significant parameters such as fundamental frequency, voicing, and noise levels are varied over time to create an artificial speech waveform. This technique is sometimes called rules-based synthesis; many concatenative systems also use rules-based components. Systems using formant synthesis produce artificial, robotic-sounding speech that would never be mistaken for a human voice. Nevertheless, formant synthesis has advantages over the concatenative technique, since maximum naturalness is not always required (Thakur & Stato, 2011). Formant-synthesized speech can be reliably intelligible even at very high speeds, avoiding the acoustic glitches that are common in concatenative systems. Formant synthesis systems are usually smaller programs than concatenative systems because they do not carry a database of speech samples, so they can be used in embedded systems, where memory and microprocessor power are limited. Because they have complete control of all aspects of the output speech, formant-based systems can produce a wide variety of prosodies and intonations, conveying not just questions and statements but also a range of tones and emotions (Thakur & Stato, 2011).
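A classic building block of formant synthesis is a second-order digital resonator that shapes a glottal source so energy concentrates around one formant frequency. The sketch below cascades two such resonators over an impulse train; the formant frequencies and bandwidths are rough textbook figures for the vowel /a/ and are illustrative only.

```python
import math

RATE = 16000  # samples per second

def resonator(source, freq, bandwidth, rate=RATE):
    """Second-order digital resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2],
    with coefficients derived from the formant frequency and bandwidth."""
    r = math.exp(-math.pi * bandwidth / rate)
    theta = 2 * math.pi * freq / rate
    a1, a2 = 2 * r * math.cos(theta), -r * r
    y1 = y2 = 0.0
    out = []
    for x in source:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

# Impulse train at 100 Hz models the glottal source; F1 ~ 700 Hz and
# F2 ~ 1220 Hz are rough values for the vowel /a/.
source = [1.0 if n % (RATE // 100) == 0 else 0.0 for n in range(RATE // 4)]
speech = source
for f, bw in [(700, 130), (1220, 70)]:
    speech = resonator(speech, f, bw)
```

Because the whole signal is computed from a handful of parameters rather than retrieved from recordings, the same loop can run at any speaking rate and on very small devices.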

HMM-based synthesis HMM-based synthesis is based on hidden Markov models and is sometimes called statistical parametric synthesis. In this technique, the frequency spectrum (the vocal tract), the fundamental frequency (the vocal source), and the duration of speech are modeled simultaneously by HMMs. Speech waveforms are then generated from the HMMs themselves based on the maximum likelihood criterion (Yoshimura, Tokuda, Masuko, Kobayashi & Kitamura, 1999).
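The parameter-generation step can be caricatured as follows: each state stores the mean of an acoustic parameter and a duration, and (ignoring the dynamic delta features real systems add) the maximum-likelihood trajectory is simply each state's mean repeated for its duration. All numbers below are illustrative, not trained model values.

```python
# Toy statistical parametric synthesis: each state holds the mean log-F0
# and a duration in frames. Values are invented for illustration.
states = [
    {"mean_logf0": 4.7, "frames": 3},   # e.g. vowel onset
    {"mean_logf0": 4.8, "frames": 5},   # steady portion
    {"mean_logf0": 4.6, "frames": 2},   # offset
]

def generate_trajectory(states):
    """Without delta features, the ML parameter trajectory is a piecewise-
    constant sequence of state means."""
    traj = []
    for st in states:
        traj.extend([st["mean_logf0"]] * st["frames"])
    return traj

print(generate_trajectory(states))
```

In a real system the generated trajectories (spectrum, F0, duration) would then drive a vocoder to produce the waveform.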

Sine wave synthesis Sine wave synthesis is a technique for synthesizing speech by replacing the formants with pure tone whistles (Yoshimura, Tokuda, Masuko, Kobayashi & Kitamura, 1999).
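This replacement is easy to sketch: sum one sinusoid per formant track. The static formant values below are rough illustrations for the vowel /a/, not measured data.

```python
import math

RATE = 16000

def sine_wave_speech(formant_tracks, rate=RATE):
    """Sum one pure tone per formant. Each track is a per-sample list of
    (frequency_hz, amplitude) pairs, so formants may glide over time."""
    n_samples = len(formant_tracks[0])
    phases = [0.0] * len(formant_tracks)
    out = []
    for n in range(n_samples):
        sample = 0.0
        for i, track in enumerate(formant_tracks):
            freq, amp = track[n]
            phases[i] += 2 * math.pi * freq / rate
            sample += amp * math.sin(phases[i])
        out.append(sample / len(formant_tracks))
    return out

# 100 ms of three static "whistles" at rough /a/ formant frequencies.
n = RATE // 10
tracks = [[(700, 1.0)] * n, [(1220, 0.7)] * n, [(2600, 0.4)] * n]
samples = sine_wave_speech(tracks)
```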

Applications of Synthetic Speech Speech synthesis technology has proved to be a great advancement in information technology and is being used in various fields nowadays. Applications have moved from low-end talking calculators to 3D applications such as talking heads. Small-scale implementations mostly reuse earlier applications, but some large applications, such as reading machines for the blind or e-mail readers, need an unlimited vocabulary and therefore a full text-to-speech system. Speech synthesis systems are also becoming more affordable for ordinary customers, which makes them more suitable for daily use. Better availability of such systems may also improve employment possibilities for people with communication difficulties (Synthetic speech, 2011).

High-quality speech synthesis can help provide a foundation for tomorrow's literacy tools. Teaching materials could be given attractive, game-like interfaces with a speech synthesis mechanism for reinforcing learning and clarifying the correspondences between written and spoken words (Synthetic speech, 2011).

Perhaps the most important and useful application of text-to-speech synthesis is providing reading and communication aids for the blind. Before this technology, special audio books were used in which the content of a book was kept as audio recordings. Making such a spoken copy of a large book was time consuming and very costly. The first reading machines were expensive, and their use was limited to libraries and similar places. The systems are now mostly software based, so it is easy to build a reading machine for any computer environment at comparatively low expense, though there is still room for improvement (Synthetic speech, 2011).

Speech synthesis is also being used to read web pages and other media through an ordinary telephone interface, or with keypad control on a personal computer. Advancements in computing environments also help to add new features to reading aids, such as information about how a newspaper article is structured. However, it may still sometimes be impossible to extract the correct information (Jonathon, Sharon & Dennis, 1987).

Synthesized speech is also making advances in education. A computer with a speech synthesizer can act as a tutor available all day. It can be programmed for special tasks such as spelling and pronunciation instruction in different languages, and it can be used with interactive educational applications. Speech synthesis can be very helpful for people who have reading problems such as dyslexia, because some children may feel embarrassed about asking a teacher for help (Jonathon, Sharon & Dennis, 1987). It is almost impossible to learn to read and write without spoken help, and with proper computer software, informal training for these problems is easy and economical to arrange. A good speech synthesizer used with a word processor is also helpful for proofreading documents: many users find it easier to detect grammatical and stylistic problems while listening than while reading (Jonathon, Sharon & Dennis, 1987).

The multimedia sector has used speech synthesis in devices such as telephone enquiry systems for decades, but the quality has long been far from good enough for ordinary users. Today, quality and cost have reached a level at which normal customers can afford the technology and adopt it for daily use. E-mail use has grown to an everyday level in the last few years; however, it is sometimes impossible to read messages, for example when abroad, due to security restrictions. Synthetic speech lets users listen to their e-mail messages over an ordinary telephone line, and it can also be used to speak out text messages received on mobile phones. For a fully interactive multimedia system, an automatic speech recognition system is also needed. The automatic recognition of fluent speech is still being researched, but the quality of current systems is at least good enough to accept control commands such as yes/no, ok/cancel, or on/off (Jonathon, Sharon & Dennis, 1987).

In short, speech synthesis can be used in all kinds of human-machine interaction. In warning and alarm systems, synthesized speech can give more accurate information about the specific situation, and using speech instead of warning lights or buzzers is often a better way to get something noticed. A speech synthesizer may also be used to read out pop-up messages, reminders, and desktop notifications from a computer, such as printer activity or the arrival of e-mail (Cryer & Home, 2008).

Speech synthesis systems can also ease familiarization with and training in a new language, for example with novel sound sequences. Learners can begin with speech sequences that are produced slowly and increase the speed as their ability improves; advanced learners may also experiment with reproduction speeds. English-speaking learners of French, for example, need training to assimilate French rhythm. Dictionaries and grammars are increasingly available nowadays in audio format, and such systems become useful adjunct tools. Given their ability to produce natural-sounding speech from almost any text, they are becoming as essential on one's personal computer as the electronic dictionary (Cryer & Home, 2008).

DISASTER MANAGEMENT COMMUNICATION SYSTEM (DMCS) Means of communication and interaction are growing rapidly, and recent technological advancements have enabled users to interact and communicate in much better ways. In today's world, disasters, most often natural disasters, and terrorism are so frequent that they can hit people at any moment. Timely communication and alerting provide simple and efficient relief at such times, so there is a strong need for an efficient disaster management system for public places, busy business environments, and important commercial areas, one that will keep working under adverse conditions (Meissner, Luckenbach, Risse, Kirste & Kirchner, 2002).

It has been observed that during disasters such as storms, floods, earthquakes, or fires, communication devices such as phones and mobiles may lose connectivity. Crowds must be well protected, and protection must be provided instantly by the police or disaster management forces. Such forces must be well trained and equipped with the best machines to help them interact and communicate with the public and provide rescue against calamity in the region they serve (Meissner, Luckenbach, Risse, Kirste & Kirchner, 2002).

Wireless communication systems are gaining importance these days. They use satellite signals, usually for long-distance communications. A portable repeater system that can be operated from a vehicle is often recommended to provide communications around the area of a disaster. Reliability is an important factor for disaster management communication systems, as the need for such systems extends to remote and sometimes even inaccessible areas, such as beneath deep water or at high altitude. People get lost while walking through jungles, and many people meet with accidents during earthquakes, avalanches, or landslides. A disaster management communication system must therefore be reliable enough to work at all times. For such situations a wide-area network with satellite signals is usually recommended, and more such technologies are still being researched (Meissner, Luckenbach, Risse, Kirste & Kirchner, 2002).

Satellite imagery makes an important contribution to disaster management: snapshots reveal the severity of a disaster and allow zooming in on any region to get a good view of a particular place. Advances in communication have made our surroundings a global village in which we can watch tsunami coverage live on our television sets, and meteorological departments' forecasts can save lives against such possible disasters. Similarly, the Internet lets us get information immediately about any accident that has happened to those close to us, and the movement of aircraft, spacecraft, or rocket launchers can be tracked immediately. An effective disaster management system can prevent much loss of life and property (Meissner, Luckenbach, Risse, Kirste & Kirchner, 2002). Involving speech synthesis techniques in disaster management systems would be very effective. Readings recorded by sensors or monitors about changing weather or an expected natural disaster would be converted into speech, and this voice message would be communicated to users either through mobile phone alerts or through voice announcements at various places. Another use of speech synthesis would be to convert a voice announcement into different regional languages: if the system had to announce an alert, it would announce it in the respective regional language in each region of the country (Meissner, Luckenbach, Risse, Kirste & Kirchner, 2002).
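The alert path just described can be sketched as follows. The alert templates, language set, and the speak() stub are all hypothetical: a real DMCS would fill in its own message wording and hand each string to an actual TTS engine routed to phone alerts or public-address speakers.

```python
# Hypothetical per-language alert templates.
ALERT_TEMPLATES = {
    "en": "Warning: {event} expected in {region}. Please move to safety.",
    "fr": "Attention : {event} attendu dans {region}. Mettez-vous en sécurité.",
}

def build_alerts(event, region, languages=("en", "fr")):
    """Turn a sensor-derived event into alert text in each regional language."""
    return {lang: ALERT_TEMPLATES[lang].format(event=event, region=region)
            for lang in languages}

def speak(text, lang):
    # Placeholder: a real system would call a TTS engine here and route
    # the audio to cell-phone alerts or loudspeaker announcements.
    print(f"[{lang}] {text}")

for lang, msg in build_alerts("flood", "District 7").items():
    speak(msg, lang)
```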

In the future, if speech recognition techniques reach a sufficient level, synthesized speech may become involved in language interpreters and various other communication technologies, such as videoconferencing, videophones, or talking mobile phones. If speech can be recognized, it can be transmitted as an ASCII string and then resynthesized back into speech, saving a large amount of transmission capacity. Talking mobile phones can significantly increase usability in situations where it is difficult or even dangerous to look at visual information: it is obviously safer to listen to the output of a mobile phone than to read it, for example while driving a car. The application field for speech synthesis is becoming wider all the time, which also brings more funds into research and development (Meissner, Luckenbach, Risse, Kirste & Kirchner, 2002).

PLANNED FUTURE ENHANCEMENTS Despite all the bright sides and uses of synthesis technologies discussed above, it must be stated that current capabilities are still limited. Under good conditions we have a reliable capacity for a formal reading style, but systems providing truly expressive speech are still unavailable today. A few research teams are working on the expression of surprise, anxiety, excitement, or disappointment; many such laboratories cooperate within the European COST 258 action (http://www.unil.ch/IMM/docs/LAIP/COST_258/cost258.htm). It is expected that within a few years further steps will be taken towards greater naturalness of artificial voices, encouraged by the inspiring results of Harmonic plus Noise Modeling (HNM) of speech. Speech synthesis can then be used even more effectively for understanding and assisting human communication in many new ways (Keller, 2000).

References

AT&T Labs Research. (n.d.). At&t natural voices text-to-speech. Retrieved from http://www.research.att.com/projects/Natural_Voices/?fbid=p2J7LeYnwzA
Cryer, H., & Home, S. (2008). Exploring the use of synthetic speech by blind and partially sighted people. RNIB Centre for Accessible Information (CAI), 1, 1-10.
Dutoit, T. (1996). An Introduction to Text-To-Speech Synthesis. Kluwer Academic Publishers.
Jonathon, A., Sharon, H. M., & Dennis, K. (1987). From Text to Speech: The MITalk System. Cambridge University Press.
Keller, E., & Keller, B. (2000). New Uses for Speech Synthesis. Laboratoire d'analyse informatique de la parole (LAIP), 1, 1-4.
Kirriemuir, J. (2003, March 30). Speech recognition technologies. Retrieved from www.jisc.ac.uk/media/documents/techwatch/tsw_03-03.rtf
Meissner, A., Luckenbach, T., Risse, T., Kirste, T., & Kirchner, H. (2002). Design Challenges for an Integrated Disaster Management Communication and Information System. The First IEEE Workshop on Disaster Recovery Networks, 1, 1-7.
Moulines, E., & Charpentier, F. (2003). Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9(5-6), 1.
Pollet, V., & Breen, A. (2008). Synthesis by Generation and Concatenation of Multiform Segments. Interspeech, 1, 1825-1828.
Rubin, P., Baer, T., & Mermelstein, P. (1981). An articulatory synthesizer for perceptual research. Journal of the Acoustical Society of America, 70(2), 321-328.
Schroder, M. (2011, January 1). Unit selection speech synthesis. Lecture conducted at DFKI.
Snehi, J. (2006). Computer peripherals and interfacing. (1st ed., p. 57). New Delhi: Laxmi Publications. Retrieved from www.laxmipublications.com
Synthetic Speech. (2011, May 18). RNIB. Retrieved May 1, 2012, from http://www.rnib.org.uk/professionals/accessibleinformation/accessibleformats/audio/speech/pages/synthetic_speech.aspx/
Thakur, S. K., & Stato, K. J. (2011). Study of various kinds of speech synthesizer technologies and expression for expressive text to speech conversion system. International Journal of Advanced Engineering Sciences and Technologies, 8(2), 301-305. Retrieved from http://www.ijaest.iserp.org/archieves/14-Jul-1-15-11/Vol-No.8-Issue-No.2/29.IJAEST-Vol-No-8-Issue-No-2-Study-of-Various-kinds-of-Speech-Synthesizer-Technologies-and-Expression-For-Expressive-Text-To-Speech-Conversion-System-301-305.pdf
Venkataramani, B. (2006). Nios ii embedded processor design contest—outstanding designs. Informally published manuscript, National Institute of Technology, Trichy, Retrieved from http://www.scribd.com/doc/47105194/Speech-to-text
Wendy, T. K., Ridha, K., Victor, S., Omar, F., Gilbert, E., Wei Chern, C., & Guofeng, H. (2005). Speech recognition technology for disabilities education.Journal of Educational Technology Systems, 33(2), p173-184. Retrieved from http://baywood.metapress.com/link.asp?target=contribution&id=K6K878K259Y7R9R2
Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., & Kitamura, T. (1999). Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. Proceedings of Eurospeech, 2347-2350.

Similar Documents

Free Essay

Sample of Persuasive Speech Text

...SPEECH TEXT INTRODUCTION The existence of Internet has obviously given us huge impacts in many aspects including in Education. Today’s students are all lucky since they are all born in high-technology era with numerous of gadgets and things to ease their lives. It is undeniable that Internet has helped us in many ways consistently with the modernised people in this modernisation ages. However, is the education today relies on Internet merely? Is today’s students are too lazy that they trained their brains to take whatever they learned, without the urges to find more? Well, an Irish poet, William Butler Yeats once said, “Education is not the filling of a pail, but the lighting of a fire.” which indicates today’s phenomenon, where many students seems to seek for shortcuts in searching for answers rather than using their minds to be creative in finding solutions. Are you a part of that people? In this case, I am strongly believed that Google Translate is not the best alternative to improve students’ English skills. How many of you used to use Google Translate while doing your assignments? Is anybody here have not used it even once? So today, let me show you why do I believe that Google Translate is not the best alternative for us. Let me give you a picture of the impacts of using Google Translate. BODY First and foremost, students have high degree of tendency to rely merely on Google Translate in understanding and doing assignments. This is due to the fact that...

Words: 970 - Pages: 4

Free Essay

Text-to-Speech Synthesis of Two-Syllable Filipino Words

...CONCATENATIVE TEXT-TO-SPEECH SYNTHESIS OF TWO-SYLLABLE FILIPINO WORDS Lourdes T. Tupas, Rowena Cristina L. Guevara, Ph.D., and Melvin Co Digital Signal Processing Laboratory Department of Electrical and Electronics Engineering University of the Philippines, Diliman ABSTRACT In concatenative-based speech synthesizers, one of the most important problems is proper union of speech units to achieve an intelligible and natural-sounding synthetic speech. For that purpose, speech units need to be processed and concatenated so that discontinuities at concatenation points are minimized. Another possible solution to this is by using a larger speech unit to decrease the number of concatenation points. In this project, which utilized two-syllable Filipino words, the speech unit is syllable. Characterization of these Filipino words is done to differentiate words of the same spelling but of different meanings. This characterization took note of the pitch, duration of utterance of each syllable in the word, and the first three formant frequencies. A digital signal processing (DSP) block is also implemented. It accepts two-syllable text and outputs all the possible utterances of that word; this block is the text-to-speech synthesizer. A two-interval forced choice test was conducted to evaluate the level of naturalness of the synthesized speech. Words of the same spelling but of different meanings are distinguished using the prosody and intelligibility test. 1. INTRODUCTION ...

Words: 2642 - Pages: 11

Free Essay

How Does Shakespeare Use Representations of Speech and Other Dramatic Techniques to Convey the Relationship Between Angelo and Isabella in the Passage Below and One Place Elsewhere in the Text?

...The passage opens by Isabella using a speech representations, she refers to Angelo as ‘my lord’ this mode of address reinforces the difference in class and also the respect that Isabella must give to him, not only because he is in charge but also he could potentially save Claudio’s life. A common thought that has been displayed in many plays such as Measure for Measure is that men are weak and women are the strong willed ones who cannot be lustful but for a man it is more acceptable and for a dramatic twist Angelos turns round to Isabella and states a declarative ‘we are all frail’, he reinforces his point by again stating ‘women are frail too’. To Angelo’s short remarks Isabella replies with a paragraph agreeing with his thoughts, this can be seen as a dramatic technique as Isabella is strong willed and stubborn yet she agrees as states that even she is weak - which could potentially encourage him - but her agreeing can also help bring to light just how much power he has. ‘call us ten times frail; for we are soft as our complexions are’. He has so much power that Isabella must agree with what he says if its harmless. Shakespeare uses the dramatic technique of presenting Isabella as either naive or a really dedicated soon to be nun, but not letting Isabella know that Angelo is talking actually talking about wanting to sleep with her. Shakespeare highlights that Isabella hasn’t been aware of Angelo's intentions as previously they spoke and Isabella offered ‘gifts’ but in her...

Words: 1056 - Pages: 5

Premium Essay

Text And Discourse Analysis

...Text and Discourse: the differentiation of concepts Bibliographic Description: Popova, E. S. Text and Discourse: the differentiation of concepts [Text] / E. S. Popova // Young Scientist. - 2014. - №6. - pp. 641-643. In the early 1970s an attempt was made to differentiate between the categories of text and discourse. Discourse was to be treated as "text plus the situation", and text, correspondingly, was defined as "discourse minus the situation". To this day, the question of the relationship between text and discourse remains controversial in modern linguistics. However, it is of fundamental importance for the interpretation of these concepts, which, incidentally, is also far from settled. Undoubtedly, only that text and discourse...

Words: 1730 - Pages: 7

Free Essay

Speech Recognition

...Speech Recognition The world of information technology is constantly making improvements and advancements. Throughout the past decade or so, we have experienced a whole new realm of technology, much of which was never even deemed imaginable. We have seen the development of and continuous improvements in smart phones, whether it is Wi-Fi connections, 3G, or even 4G. We have seen the enhancement of computer software and operating systems, such as the new OS X Lion developed by Apple. While these extraordinary advancements have left many people wondering what is next, I believe the answer and next “big thing” will be the perfection of speech recognition. Speech recognition, also known as voice recognition or voice command, is a type of software which recognizes words spoken by the user and can interpret these words as a command. This is essentially a computer with thought-processing ability. However, this piece of technology has never been very efficient and, in many cases, has been avoided. It is often difficult for a user to speak slowly and clearly enough for the system to recognize what is being said, causing frustration and wasted time. It is also difficult for the software to recognize the wide array of accents which people have. According to a speech recognition research company called Type Well, speech recognition is only about 60% accurate. This shows that the development of an efficient and usable speech recognition product is still a few years away. Although the perfection...

Words: 642 - Pages: 3

Free Essay

None

...pipeline concatenation of speech-to-text recognition (SR), text-to-text translation (MT) and text-to-speech synthesis (SS). This paper proposes and illustrates an evaluation methodology for this noisy channel which tries to quantify the relative amount of degradation in translation quality due to each of the contributing modules. A small pilot experiment involving word-accuracy rate for the SR, and a fidelity evaluation for the MT and SS modules is proposed in which subjects are asked to paraphrase translated and/or synthesised sentences from a tourist’s phrasebook. Results show (as expected) that MT is the “noisiest” channel, with SS contributing least noise. The concatenation of the three channels is worse than could be predicted from the performance of each as individual tasks. 1. Introduction Evaluation is without doubt a major aspect of language engineering, including Machine Translation (MT). Although it is still true that no consensus exists regarding the best way to evaluate software, there is general agreement about some of the factors that must be taken into account when deciding what form an evaluation should take. MT evaluation has been much studied in recent years, so much so that it has been light-heartedly claimed that MT evaluation “is a better founded subject than machine translation” (Wilks, 1994:1). If this is no longer strictly true, it is because MT is arguably in pretty good shape, at least text-to-text MT of restricted texts or for restricted purposes...
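The word-accuracy rate used for the SR module above is conventionally computed as one minus the word error rate, i.e. the word-level Levenshtein (edit) distance between the reference transcript and the recognizer's hypothesis, divided by the reference length. A minimal sketch of that standard metric follows; the paper's exact scoring scheme is not specified here, so this illustrates the conventional definition only.

```python
def word_accuracy(reference, hypothesis):
    """Word-accuracy rate: 1 - WER, where WER is the word-level
    Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return 1.0 - dp[-1][-1] / len(ref)

# One substitution ("the" -> "a") in a five-word reference
print(word_accuracy("where is the train station",
                    "where is a train station"))  # 0.8
```

Chaining SR, MT and SS means errors like the substitution above propagate into translation and synthesis, which is why the concatenated pipeline scores worse than each module in isolation.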

Words: 5225 - Pages: 21

Free Essay

Purdue Owl: the Rhetorical Situation

...tribute to strong, well-organized writing. This presentation is suitable for the beginning of a composition course or the assignment of a writing project in any class. Contributors: Ethan Sproat, Dana Lynn Driscoll, Allen Brizee Last Edited: 2012-04-27 10:46:02 Example 1: “I Have a Dream” Speech A lot of what was covered above may still seem abstract and complicated. To illustrate how diverse kinds of texts have their own rhetorical situations, consider the following examples. First, consider Dr. Martin Luther King’s famous “I Have a Dream” speech. Because this speech is famous, it should be very easy to identify the basic elements of its particular rhetorical situation. Text The text in question is a 17-minute speech written and delivered by Dr. King. The basic medium of the text was an oral speech that was broadcast by both loudspeakers at the event and over radio and television. Dr. King drew on years of training as a minister and public speaker to deliver the speech. He also drew on his extensive education and the tumultuous history of racial prejudices and civil rights in the US. Audiences at the time either heard his speech in person or over radio or television broadcasts. Part of the speech near the end was improvised around the repeated phrase “I have a dream.” Author Dr. Martin Luther King, Jr. was the most iconic leader of the American Civil Rights Movement...

Words: 2614 - Pages: 11

Premium Essay

Investigating the Presentation of Speech

...Investigating the presentation of speech, writing and thought in spoken British English: A corpus-based approach Dan McIntyre a, Carol Bellard-Thomson b, John Heywood c, Tony McEnery c, Elena Semino c and Mick Short c a Liverpool Hope University College, UK, b University of Kent at Canterbury, UK, c Lancaster University, UK Abstract In this paper we describe the Lancaster Speech, Writing and Thought Presentation (SW&TP) Spoken Corpus. We have constructed this corpus to investigate the ways in which speakers present speech, thought and writing in contemporary spoken British English, with the associated aim of comparing our findings with the patterns revealed by the previous Lancaster corpus-based investigation of SW&TP in written texts. We describe the structure of the corpus and the archives from which its composite texts are taken. These are the spoken section of the British National Corpus, and archives currently housed in the Centre for North West Regional Studies (CNWRS) at Lancaster University. We discuss the decisions that we made concerning the selection of suitable extracts from the archives, the re-transcription that was necessary in order to use the original CNWRS archive texts in our corpus, and the problems associated with the original archived transcripts. Having described the sources of our corpus, we move on to consider issues surrounding the mark-up of our data with TEI-conformant SGML, and the problems associated with capturing in electronic form the CNWRS...

Words: 10539 - Pages: 43

Free Essay

Com114 Business Outline

...COM 114 WRITTEN EXAM AND SPEECH REQUIREMENTS Part 1: Written Exam The written exam, 50 multiple choice questions, is based on the COM 114 text, Effective Presentations, 1st or 2nd edition (2011, 2012), by Dr. Melanie Morgan and Jane Natt. Copies of the text are available in campus bookstores, from online sources, and from previous COM 114 students. Students must score 70% (35 out of 50 correct) or more to pass. One hour is allotted for the exam. Students will be informed of the results of the written exam via email within 24 hours, and then assigned a speech date and time if applicable. STAR students will have their results at the end of the exam session they attend, and will be assigned their speaking date and time before leaving the location. At the beginning of each chapter in the text is a list of “Chapter Objectives”. Every question on the exam is based on one of those objectives. To be successful on the exam, a student will need to be able to recognize, define and apply the text material in many different situations. Part 2: Persuasive Speech The persuasive speech is discussed at some length in the text in Chapters 9 through 12. Evaluators will expect students to be acquainted with the format of this type of speech. A persuasive speech urges some specific course of action. As a persuasive speech, the presentation should (1) show that a problem exists and that it is significant to the audience, (2) show how the consequences of the problem are significant to the audience,...

Words: 675 - Pages: 3

Premium Essay

Pragmatics Analysis

...Moreover, he defines it as “the study of meaning in relation to speech situations” (p.6). Hatim and Munday (2004) maintain that Koller introduces different relations of equivalence based on the source text and...

Words: 3138 - Pages: 13

Free Essay

Great Speeches in Time

...conveying their message shows their excellence. By use of tone and by sharing personal stories or alluding to other texts, great speakers convey their messages in many ways. In his speech, Martin Luther King Jr. creates a distinct and professional style by use of literary devices to convey his message. MLK’s use of literary devices provides emphasis and strengthens his hopeful, unified, and well-prepared speech. MLK often uses allusions to other texts within his speech in order to get a point across, such as referencing historical texts to bring out feelings of patriotism in the audience. MLK’s message is one of unity and equality, which is eloquently proposed to the audience through his skillful use of literary devices. By unifying his speech in this manner and presenting different texts as equal in value, MLK also approaches the audience with the idea that they are all equally important as well. RFK’s speech on the death of MLK has a very somber and mournful tone, but still proves its greatness despite the speaker’s unpreparedness. As the speech was unprepared, RFK does not purposefully use as many literary devices as MLK did, though he does on occasion use repetition and allusion to other texts and events. RFK’s message is one of passing the torch; one can kill the messenger, but never the message. Jimmy Valvano’s speech is similar to RFK’s in that it had not been previously prepared,...

Words: 664 - Pages: 3

Premium Essay

Comparing Speeches 'I Have A Dream And Nobel Lecture'

...explicit examples in the text. First, Wiesel and Dr. King both use metaphors and imagery to draw the listeners deeper into the story and to prove a point: “...millions of Negro slaves …seared in the flames of withering injustice.”, “at first in whispers, then more loudly…. “each time more vigorously, more fervently”. In both, the examples use metaphors/imagery to make a point. Next, Wiesel goes more into depth in his speech, giving complex details, etc., while Dr. King tried to keep his speech short and simple. It seemed that while Dr. King was trying to persuade his audience, Wiesel was trying to inform his audience about what he likes, how he got there, etc. In the end, Dr. King’s speech is most likely trying to give the listeners an emotional impact, while Wiesel just wants to explain his life, which has little or no impact on the hearts of the audience....

Words: 467 - Pages: 2

Free Essay

Phonetics and Articulation

...SUMMARY: Distinguish between articulatory, acoustic, and auditory phonetics. Phonetics is the study of speech sound and consists of articulatory phonetics, auditory phonetics, and acoustic phonetics. Phonetics actually provides a language for people to discuss speech sound. Every language has a vocabulary. Articulatory phonetics is the production of speech sound. Auditory phonetics is the perception of speech sound. Acoustic phonetics deals with the physical properties of the speech signal. All three are different, but play an important role in speech. “Phonetics is a branch of linguistics that studies the material aspects of speech sound” (Phonetics 7).[1] The material aspects of sounds are made of physical production, transportation and comprehension of the sound. Another aspect of sound has to do with the function of sound in a language. The American English language does not transcribe all sounds in a one-to-one basis. “There are many instances, though, when we need an internationally comprehensible code for the detailed transcription of sounds, such as in linguistic research, as well as in foreign language teaching” (Phonetics 7).[2] The International Phonetic Association has created a special alphabet for this need. There are three different physical aspects of sound. These can be described as the “articulatory aspect of the speaker, the acoustic aspect of the channel, and the auditory aspect of the hearer” (Phonetics 7).[3] “Articulatory phonetics researches...
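As a concrete example of the acoustic side described above, the fundamental frequency (pitch) of a voiced sound is one physical property that can be measured directly from the speech signal. The sketch below uses a simple autocorrelation method on a synthetic vowel-like signal; real phonetic analysis tools use more robust algorithms, so treat this purely as an illustration of the idea.

```python
import numpy as np

def estimate_f0(signal, sr, fmin=60, fmax=400):
    """Estimate fundamental frequency via the autocorrelation peak
    within the plausible range of human pitch periods."""
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag range for fmax..fmin
    lag = lo + np.argmax(corr[lo:hi])
    return sr / lag

# Synthetic vowel-like signal: 150 Hz fundamental plus one harmonic
sr = 16000
t = np.arange(int(0.5 * sr)) / sr
vowel = np.sin(2 * np.pi * 150 * t) + 0.4 * np.sin(2 * np.pi * 300 * t)
print(estimate_f0(vowel, sr))  # close to 150 Hz
```

The same signal could also be analyzed for formant frequencies, which is where articulatory and acoustic phonetics meet: the articulators shape the vocal tract, and the vocal tract shape determines the formants.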

Words: 977 - Pages: 4

Premium Essay

Race in Obama’s America

...the situation of African Americans presented in texts 1, 2 and 3. Outline of text 1: Text 1 is a commentary published on the USA Today website on August 14, 2009, and it is written by lawyer and columnist Yolanda Young. Yolanda Young’s main statement is that there has been a paradigm shift in the culture and that African American women are now being noticed. In her commentary she compares her own experiences of being a black woman before and after the election of Barack Obama. To support her main statement she uses facts from different research studies. At the end of the text she concludes that it has become easier to maintain a positive self-image now that such a beautiful woman lives in the White House. Outline of text 2: Text 2 is an excerpt from a speech given by Attorney General Eric Holder on February 18, 2009 at the U.S. Justice Department. The speech is to be found on the website of the Justice Department. Eric Holder’s main statement is that “One cannot truly understand America without understanding the historical experience of black people in this nation”. He thinks that too many Americans see race as a taboo, and that you cannot solve the problem if you do not feel at ease with it. In the speech Eric Holder refers to two historical events, one of them being the Gettysburg Address. He says that the people in this room have a moral obligation to the nation and must live up to the Gettysburg Address. At the end of his speech he talks about the future and how he wants all...

Words: 803 - Pages: 4

Premium Essay

Compare the Attitudes of the Writers and Speakers Towards Alcohol

...Compare the attitudes of the writers and speakers towards alcohol All three texts share the common topic: alcohol. Text A is an extract from the novel Lucky Jim by Kingsley Amis and is therefore in the written domain. Its purpose is to entertain, as it belongs to the genre of literary fiction, and it has a public yet educated audience, judging by the use of low-frequency lexis. Text A has an overall negative attitude, not necessarily towards the consumption of alcohol but towards the after-effects of drinking it to excess. Text B, in contrast, is part of a speech by the politician Tony Blair and is therefore in the spoken domain. Its purpose is to inform the general public about the dangers and problems associated with excessive alcohol consumption. Although it shares some similarities with text A in the sense that they both display negative attitudes towards alcohol, text B is a lot more general, as text A only focuses on one downside of alcohol consumption. Text C, however, is slightly different, as it seems to have no clear purpose. It is a private conversation, and is thus in the spoken domain, and is set in a pub. One way in which text C differs from texts A and B is that there is a contrast of attitudes within the extracts. Whereas Shaun wants to “get hammered”, Richard and Mark seem more reluctant to get drunk. All three texts use language to convey attitudes about alcohol. Text A’s use of low-frequency lexis portrays Dixon’s negative attitude towards the after-effects of being drunk. For example...

Words: 593 - Pages: 3