BYBLOS: The BBN continuous speech recognition system:-
This paper describes BYBLOS, the BBN continuous speech recognition system. The system, designed for large-vocabulary applications, integrates acoustic, phonetic, lexical, and linguistic knowledge sources to achieve high recognition performance. The basic approach is the extensive use of robust, context-dependent models of phonetic coarticulation using hidden Markov models (HMMs). The paper describes the components of the BYBLOS system, including the signal-processing front end, dictionary, phonetic model training system, word model generator, grammar, and decoder. Recognition experiments demonstrate consistently high word recognition performance on continuous speech across speakers, task domains, and grammars of varying complexity. In speaker-dependent mode, where 15 minutes of speech is required for training on a speaker, 98.5% word accuracy has been achieved in continuous speech for a 350-word task, using grammars with perplexity ranging from 30 to 60. With only 15 seconds of training speech, the system achieves 97% word accuracy using a grammar.
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1169748
Audio-visual modeling for bimodal speech recognition:-
Audio-visual speech recognition is a novel extension of acoustic speech recognition and has received a lot of attention in the last few decades. The main motivation behind bimodal speech recognition is the bimodal character of the speech perception and production systems of human beings. The effect of the modeling parameters of hidden Markov models (HMMs) on the recognition accuracy of the bimodal speech recognizer is analyzed, a comparative analysis of the different HMMs that can be used in bimodal speech recognition is presented, and finally a novel model, experimentally verified to perform better than the others, is proposed. The geometric visual features are also compared and analyzed for their importance in bimodal speech recognition. One of the unique characteristics of this bimodal speech recognition system is its novel fusion strategy for the acoustic and visual features, which takes into account the different sampling rates of the two signals. Compared to acoustic-only recognition, the audio-visual scheme achieves much higher recognition accuracy, especially in the presence of noise.
http://www.mendeley.com/research/audiovisual-modeling-bimodal-speech-recognition/
New Methods in Continuous Mandarin Speech Recognition:-
We describe new methods for speaker-independent, continuous Mandarin speech recognition based on the IBM HMM-based continuous speech recognition system (1-3). First, we treat tones in Mandarin as attributes of certain phonemes, instead of syllables. Second, instantaneous pitch is treated as a variable in the acoustic feature vector, in the same way as cepstra or energy. Third, by designing a set of word-segmentation rules to convert continuous Chinese text into segmented text, an effective trigram language model is trained (4). By applying these new methods, a speaker-independent, very-large-vocabulary continuous Mandarin dictation system is demonstrated. Decoding results showed that its performance is similar to the best results for US English.
http://www.isca-speech.org/archive/eurospeech_1997/e97_1543.html
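The second idea above, treating instantaneous pitch as just another dimension of the acoustic feature vector, can be sketched as follows. This is a minimal illustration, not the IBM implementation; the log-pitch transform, the unvoiced-frame floor, and the array shapes are all assumptions.

```python
import numpy as np

def augment_with_pitch(cepstra, pitch_hz, floor_hz=60.0):
    """Append a log-pitch dimension to each cepstral frame.

    cepstra:  (T, D) array of per-frame cepstral coefficients
    pitch_hz: (T,) array of instantaneous pitch estimates (0 for unvoiced)
    """
    # A floor gives unvoiced frames a well-defined (low) log-pitch value.
    logf0 = np.log(np.maximum(pitch_hz, floor_hz))
    return np.hstack([cepstra, logf0[:, None]])

frames = np.random.randn(100, 12)   # 100 frames of 12 cepstra (toy data)
f0 = np.full(100, 180.0)            # toy constant pitch track
feats = augment_with_pitch(frames, f0)  # shape (100, 13)
```

Downstream, the augmented vector is modeled exactly like the cepstra and energy, which is the point of the paper's approach.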
Using MLP Features in SRI's Conversational Speech Recognition System:-
We describe the development of a speech recognition system for conversational telephone speech (CTS) that incorporates acoustic features estimated by multilayer perceptrons (MLPs). The acoustic features are based on frame-level phone posterior probabilities, obtained by merging two different MLP estimators, one based on PLP-Tandem features, the other based on hidden activation TRAPs (HATs) features. This paper focuses on the challenges arising when incorporating these nonstandard features into a full-scale speech-to-text (STT) system, as used by SRI in the Fall 2004 DARPA STT evaluations. First, we developed a series of time-saving techniques for training feature MLPs on 1800 hours of speech. Second, we investigated which components of a multipass, multi-front-end recognition system are most profitably augmented with MLP features for best overall performance. The final system achieved a 2% absolute (10% relative) WER reduction over a comparable baseline system that did not include Tandem/HATs MLP features.
http://www.isca-speech.org/archive/interspeech_2005/i05_2141.html
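The abstract says the frame-level phone posteriors from the two MLP estimators are merged, but does not spell out how. A common way to combine two posterior estimates is a weighted log-linear (geometric-mean) merge followed by renormalization; the sketch below shows that generic approach, not SRI's exact method, and the weight is an assumption.

```python
import numpy as np

def merge_posteriors(p1, p2, w=0.5):
    """Merge two per-frame phone posterior estimates by a weighted
    geometric mean (log-linear combination), then renormalize.

    p1, p2: (T, K) arrays of posteriors over K phone classes
    w:      weight on the first estimator, in [0, 1]
    """
    eps = 1e-10  # guard against log(0)
    logp = w * np.log(p1 + eps) + (1 - w) * np.log(p2 + eps)
    p = np.exp(logp)
    return p / p.sum(axis=-1, keepdims=True)  # rows sum to 1 again
```

A log-linear merge tends to sharpen agreement between the two estimators: classes that both networks rate highly dominate the combined posterior.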
Hidden Markov Models for Speech Recognition System:-
The use of hidden Markov models for speech recognition has become predominant in the last several years, as evidenced by the number of published papers and talks at major speech conferences. The reasons this method has become so popular are the inherent statistical (mathematically precise) framework; the ease and availability of training algorithms for estimating the parameters of the models from finite training sets of speech data; the flexibility of the resulting recognition system in which one can easily change the size, type, or architecture of the models to suit particular words, sounds, and so forth; and the ease of implementation of the overall recognition system. In this expository article, we address the role of statistical methods in this powerful technology as applied to speech recognition and discuss a range of theoretical and practical issues that are as yet unsolved in terms of their importance and their effect on performance for different system implementations.
http://www.jstor.org/pss/1268779
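The decoding side of an HMM recognizer can be illustrated with the Viterbi algorithm, which finds the single most likely state sequence given the model and the observations. This is a generic log-space sketch of the textbook algorithm, not any particular system's decoder.

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most likely state path for one observation sequence.

    log_A:  (N, N) log transition probabilities
    log_B:  (T, N) per-frame log emission likelihoods
    log_pi: (N,)   log initial-state probabilities
    """
    T, N = log_B.shape
    delta = log_pi + log_B[0]          # best score ending in each state
    back = np.zeros((T, N), dtype=int)  # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # (N, N): previous -> current
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    # Trace the best path backwards from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Working in log space avoids the numerical underflow that plagues long products of probabilities, which is one of the practical issues the article discusses.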
Speech recognition using multilayer perceptron:-
Speech is a very powerful and fast tool for communication, which is why the problem of automatic speech recognition has long fascinated computer scientists. Artificial neural networks (ANNs) have been developed to model the functioning of the human brain. They are very powerful pattern classifiers and hence can be used to recognize speech patterns. This paper discusses our team's work on the application of ANNs to the speech recognition task. We have used a particular class of neural networks, multilayer perceptrons (MLPs), which use the error-backpropagation algorithm for setting the weights. After data acquisition, the speech signal is preprocessed and fed to an MLP for classification. The task is to recognize Urdu digits from zero to nine from a mono-speaker database.
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1215948
Design Of Phonetically Rich Sentences For Hindi Speech Database:-
Speech recognition is a pattern recognition process. Pattern recognition consists of two phases: training and testing. In the case of supervised pattern recognition, the training data and the associated class labels are used to estimate the parameters of the classifier. As speech recognition is a complex pattern recognition task, supervised training is employed. Consequently, an inventory of speech data along with class labels of the units of speech is needed for training a speech recognition system. A segmented and labelled speech database is useful in other areas as well. For example, a database containing rare phonemes in a multitude of phonemic contexts will aid in developing a highly intelligible and natural-sounding speech output system in that language. The database aids a systematic study of the acoustic-phonetic correlates of a language. The speech database can also be used for speaker recognition tasks. Standard speech databases are available for languages such as English, German, Japanese, etc. It is not practical to use databases of other languages for machine recognition of speech in an Indian language. There is a need to develop such a general-purpose database for Indian languages. Previous efforts in this direction were limited in scope ([1, 2]). A technical proposal for a project aimed at the creation of a phonetically rich, multi-speaker, continuous speech database for Hindi was presented in [3]. The aim of this paper is to report the progress made in this Hindi speech database project, specifically the design of phonetically rich Hindi sentences. The rest of the paper is organized as follows. An overview of the database project is given in section 2. The units of representation of speech are discussed in section 3. The desired characteristics of such sentences and the design strategies adopted to achieve the goal are dealt with in section 4. A summary of the work is given in section 5.
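The paper's actual design strategies are detailed in its section 4; as a generic illustration, selecting a phonetically rich subset of candidate sentences is often cast as a set-cover problem and attacked greedily: repeatedly pick the sentence that adds the most still-uncovered phonemes. The sketch below uses invented toy data and is not the paper's algorithm.

```python
def greedy_select(sentences, target_phones):
    """Greedily pick the sentence covering the most still-uncovered
    phonemes, until the target inventory is covered or no sentence helps.

    sentences:     dict mapping sentence id -> set of phonemes it contains
    target_phones: set of phonemes the corpus should cover
    """
    covered, chosen = set(), []
    remaining = dict(sentences)
    while not target_phones <= covered and remaining:
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        if not (remaining[best] - covered):
            break  # no remaining candidate adds new coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered
```

In practice the coverage target would include phonemes in multiple contexts (e.g. triphones), not just the bare phoneme inventory, which makes greedy selection all the more useful since exact set cover is NP-hard.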
Speech Recognition System of Arabic Digits based on A Telephony Arabic Corpus:-
Automatic recognition of spoken digits is one of the difficult tasks in the field of computer speech recognition. Spoken digit recognition is required in many applications, such as speech-based telephone dialing, airline reservation, and automatic directories for retrieving or sending information. These applications take numbers and alphabets as input. Arabic is a Semitic language that differs from European languages such as English; one of these differences is how the ten digits, zero through nine, are pronounced. In this research, spoken Arabic digits are investigated from the speech recognition point of view. The system is designed to recognize isolated whole-word speech. The Hidden Markov Model Toolkit (HTK) is used to implement the isolated word recognizer with phoneme-based HMM models. In the training and testing phases of this system, isolated digit data sets are taken from the telephony Arabic speech corpus, SAAVB. This standard corpus was developed by KACST and is classified as a noisy speech database. The resulting HMM-based speech recognition system was tested on automatic Arabic digit recognition and achieved a 93.72% overall correct rate of digit recognition.
Keywords: Arabic, digits, SAAVB, HMM, Recognition, Telephony corpus.
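At decoding time, isolated whole-word recognition reduces to scoring the utterance against each digit's HMM and picking the best-scoring one. The sketch below shows that idea with a log-space forward algorithm; it is a generic illustration, not HTK's implementation, and the interface (precomputed per-model emission scores) is an assumption made for brevity.

```python
import numpy as np

def log_forward(log_A, log_B, log_pi):
    """Total log-likelihood of one utterance under one HMM
    (forward algorithm in log space).

    log_A:  (N, N) log transition probabilities
    log_B:  (T, N) per-frame log emission likelihoods for this utterance
    log_pi: (N,)   log initial-state probabilities
    """
    alpha = log_pi + log_B[0]
    for t in range(1, len(log_B)):
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[t]
    return np.logaddexp.reduce(alpha)

def recognize(models):
    """models: dict digit -> (log_A, log_B, log_pi), where log_B already
    holds this utterance's emission scores under that digit's model.
    Returns the digit whose HMM scores the utterance highest."""
    return max(models, key=lambda d: log_forward(*models[d]))
```

Unlike Viterbi decoding, the forward algorithm sums over all state paths, which is the quantity a maximum-likelihood classifier of whole words actually wants.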
A Speech Recognition System for Urdu Language:-
This paper presents a speech processing and recognition system for individually spoken Urdu language words. The speech feature extraction was based on a dataset of 150 samples collected from 15 different speakers. The data was pre-processed using normalization and by transformation into the frequency domain via the discrete Fourier transform. The speech recognition models, feed-forward neural networks, were developed in MATLAB and exhibited reasonably high training and testing accuracies. Details of the MATLAB implementation are included in the paper for use by other researchers in this field. Our ongoing work involves the use of linear predictive coding and cepstrum analysis for alternative neural models. Potential applications of the proposed system include telecommunications, multimedia, and voice-activated tele-customer services.
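The preprocessing pipeline described (amplitude normalization followed by a discrete Fourier transform) can be sketched as below. The frame length and normalization choices are illustrative assumptions, not the paper's exact MATLAB settings, and Python is used here for consistency with the other sketches in this survey.

```python
import numpy as np

def dft_features(signal, n_fft=256):
    """Amplitude-normalize a speech sample and extract magnitude-spectrum
    features via the discrete Fourier transform."""
    x = np.asarray(signal, dtype=float)
    x = x / (np.max(np.abs(x)) + 1e-12)           # amplitude normalization
    spec = np.abs(np.fft.rfft(x, n=n_fft))        # frequency-domain transform
    return spec / (np.linalg.norm(spec) + 1e-12)  # unit-length feature vector
```

The resulting fixed-length vector is what a feed-forward network with a fixed input layer needs; the magnitude spectrum also discards phase, which carries little word identity information.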
Speech Corpus Development for a Speaker Independent Spontaneous Urdu Speech Recognition System:-
Urdu, the national language of Pakistan, has over 100 million speakers in Pakistan and other regions [1]. This paper presents the development of a spoken language corpus for Urdu, specifically for a Lahore suburban accent. A spoken language corpus is defined as “any collection of speech recordings which is accessible in computer readable form and which comes with annotation and documentation sufficient to allow re-use of the data in-house, or by scientists in other organizations” [2]. As noted in the literature review section that follows, this work represents one of the few speech corpora available for Urdu. The speech corpus has been released freely under an open content license and is envisioned to play a significant role in Urdu speech processing research in the future. It contains speech from 82 adult native Urdu speakers with the Lahore suburban dialect, ranging in age from 20 to 55 years. The corpus was specifically designed to be used for speaker-independent spontaneous speech recognition using the CMU Sphinx Open Source Toolkit for Speech Recognition [3]. The next section gives an overview of the current state of speech corpora development for Urdu, and also looks at some standards for spoken language resources. After a description of the methodology adopted for this work, key corpus statistics are reported, and critical issues encountered and resolved during the development process are discussed.
Urdu Word Segmentation:-
All language processing applications require input text to be tokenized into words for further processing. Languages like English normally use white space or punctuation marks to identify word boundaries, though with some complications; e.g., the word “e.g.” uses a period in between, and thus the period does not indicate a word boundary. However, many Asian languages like Thai, Khmer, Lao, and Dzongkha do not mark word boundaries and thus do not use white space to consistently mark word endings. This makes the process of tokenizing input into words very challenging for such languages. Urdu is spoken by more than 100 million people, mostly in Pakistan and India. It is an Indo-Aryan language, written using the Arabic script from right to left in the Nastalique writing style (Hussain, 2003). Nastalique is a cursive writing system that also does not have a concept of space. Thus, though space is used in typing the language, it serves other purposes, as discussed later in this paper. This entails that space cannot be used as a reliable delimiter for words. Therefore, Urdu shares the word segmentation challenge for language processing with other Asian languages. This paper explains the problem of word segmentation in Urdu. It gives details of work done to investigate the linguistic typology of words and the motivation for using space in Urdu. The paper then presents an algorithm developed to automatically process the input to produce consistent word segmentation, and finally discusses the results and future directions.
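The baseline idea behind dictionary-driven word segmentation can be illustrated with greedy longest-match (maximum matching): scan the space-free text left to right and always take the longest lexicon word starting at the current position. The paper's own algorithm is more involved; the sketch below is the generic baseline, and the Latin-letter lexicon is used purely for illustration.

```python
def max_match(text, lexicon, max_len=12):
    """Greedy longest-match segmentation of space-free text.

    text:    string with no reliable word delimiters
    lexicon: set of known words
    max_len: longest word length to try (should cover the lexicon)
    """
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when no lexicon word starts at position i.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words
```

Greedy longest match is known to fail on ambiguous boundaries (a long wrong word can swallow the start of the next one), which is one reason practical segmenters add statistical disambiguation on top.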