Ling 290 Final Flashcards
(95 cards)
How does ASR work?
System receives acoustic input from a speaker through a microphone, analyzes it using a pattern, model or algorithm, and produces an output, usually in the form of text
Automatic speech recognition (ASR)
Independent, machine-based process of decoding/transcribing oral speech
What’s the difference between:
Speech recognition
Voice recognition
Speech understanding
Speech understanding/identification: determining the meaning of what is said (not transcription)
Speech recognition: ability of a machine to recognize what is being said (WHAT)
Voice recognition: ability of a machine to recognize who is speaking from their voice and speaking style (WHO)
Describe the first ASR system and who invented it
In the early 1950s, Davis, Biddulph and Balashek at Bell Telephone Laboratories built a system called AUDREY. It could recognize isolated digits from 0 to 9 spoken by a single speaker; it was speaker dependent and required extensive training.
Describe the early ASR system template and its faults
Template-based recognition: pattern matching that compares the speaker's input to stored acoustic templates/patterns
Faults:
-not good for large-vocabulary recognition
-can't match speech sounds if they are of different lengths
Who were the first ones to use a computer for an ASR?
Forgie and Forgie in 1959
What did researchers experiment with in the 60s to improve ASR?
Researchers experimented with time-normalization techniques (dynamic time warping, DTW) to minimize differences in speaking rate
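A minimal Python sketch of DTW, assuming each utterance is reduced to one number per frame (a toy stand-in for real acoustic feature vectors). It shows how time normalization addresses the "different length" fault of plain template matching: a slowly spoken version of a template still aligns with it at zero cost. The function names, templates and frame values are hypothetical.

```python
import math
from typing import Dict, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Minimal cumulative frame distance over all monotonic alignments of a and b."""
    n, m = len(a), len(b)
    # cost[i][j]: best cost of aligning the first i frames of a with the first j of b
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # distance between two frames
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b's frame is reused (b stretched)
                cost[i][j - 1],      # b advances while a's frame is reused (a stretched)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


def recognize(utterance: Sequence[float], templates: Dict[str, Sequence[float]]) -> str:
    """Template matching: return the label of the template closest under DTW."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))


templates = {
    "rise-fall": [1, 2, 3, 4, 3, 2, 1],
    "fall-rise": [4, 3, 2, 1, 2, 3, 4],
}
slow = [1, 1, 2, 2, 3, 3, 4, 4, 3, 3, 2, 2, 1, 1]  # same contour, spoken twice as slowly
print(dtw_distance(slow, templates["rise-fall"]))  # 0.0
print(recognize(slow, templates))  # rise-fall
```

A frame-by-frame comparison could not even be computed here (14 frames vs 7); DTW lets one frame of the template absorb several frames of the slower input.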
What were the three milestones for ASR in the 70s?
1- focus on recognition of continuous speech
2- development of large vocab speech recognizers
3- speaker independent systems (so it could recognize a range of voices)
What was the first commercial ASR system?
VIP-100 (won a US national award)
Describe the SUR project (speech understanding research)
ARPA ran it (1971-1976) with the goal of creating a system that could understand the connected speech of many speakers with a 1,000-word vocabulary (in a low-noise environment) at an error rate of less than 10%
What was the successful product of the SUR project?
Harpy; it showed the benefits of data-based statistical models over template-based ones and was a first step toward hidden Markov modeling (HMM)
What is HMM? (One ASR model of 1980s)
Hidden Markov modeling= based on complex statistical/probabilistic analyses
Represents language units such as morphemes as a sequence of states, with transition probabilities between the states
Uses the highest probability to predict the best answer (the most likely sequence of states, and hence of units)
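HMM recognizers commonly find the highest-probability state sequence with the Viterbi algorithm. A minimal sketch, assuming a hypothetical three-state left-to-right model for the word "cat", made-up probabilities, and two invented coarse acoustic labels ("burst", "voiced"):

```python
from typing import Dict, List, Sequence, Tuple


def viterbi(
    observations: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> Tuple[List[str], float]:
    """Return the most probable state sequence and its probability."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back: List[Dict[str, str]] = [{}]  # back[t][s]: predecessor of s on that path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the previous state that maximizes the path probability into s
            prev = max(states, key=lambda r: best[t - 1][r] * trans_p[r][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace the winning path backwards from the most probable final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical left-to-right word model: /k/ -> /a/ -> /t/, each state may repeat
states = ["/k/", "/a/", "/t/"]
start_p = {"/k/": 1.0, "/a/": 0.0, "/t/": 0.0}
trans_p = {
    "/k/": {"/k/": 0.5, "/a/": 0.5, "/t/": 0.0},
    "/a/": {"/k/": 0.0, "/a/": 0.6, "/t/": 0.4},
    "/t/": {"/k/": 0.0, "/a/": 0.0, "/t/": 1.0},
}
emit_p = {
    "/k/": {"burst": 0.8, "voiced": 0.2},
    "/a/": {"burst": 0.1, "voiced": 0.9},
    "/t/": {"burst": 0.7, "voiced": 0.3},
}
path, prob = viterbi(["burst", "voiced", "voiced", "burst"], states, start_p, trans_p, emit_p)
print(path)  # ['/k/', '/a/', '/a/', '/t/']
```

Real systems multiply many small numbers, so they usually work with log probabilities to avoid underflow, and they use trained emission models rather than a lookup table.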
Good and bad aspects of HMM?
Good: can analyze both temporal and spectral variations of speech signals and can decode continuous speech
Bad: requires extensive training, a large amount of memory, and huge computational power for model-parameter storage and likelihood evaluation
What is ANN? (Second ASR of 1980s)
Artificial neural network; first introduced in the 1950s and reintroduced in the 1980s
The network consists of interconnected processing units arranged in layers, with connection weights that are determined from training data
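To show what "weights determined on the basis of training data" means, here is a minimal Python sketch of the smallest possible case: a single processing unit (a perceptron) trained with the classic perceptron rule, rather than a multi-layer network. The two frame features (energy, zero-crossing rate), their values and the voiced/unvoiced labels are hypothetical toy data.

```python
from typing import List, Sequence, Tuple


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Single unit: weighted sum plus bias, passed through a step function."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(
    data: Sequence[Tuple[Sequence[float], int]], epochs: int = 20, lr: float = 1.0
) -> Tuple[List[float], float]:
    """Learn weights from labelled examples with the perceptron update rule."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in data:
            error = label - predict(weights, bias, x)  # -1, 0 or +1
            if error != 0:
                mistakes += 1
                # Nudge the weights toward the correct answer for this example
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights, bias


# Hypothetical frame features: [energy, zero-crossing rate]; label 1 = voiced, 0 = unvoiced
frames = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.2, 0.9], 0), ([0.1, 0.7], 0)]
weights, bias = train_perceptron(frames)
print(predict(weights, bias, [0.85, 0.15]))  # 1
print(predict(weights, bias, [0.15, 0.8]))   # 0
```

A network used in ASR stacks many such units in layers and trains all the weights together, but the idea is the same: the weights are not written by hand, they come out of the training examples.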
Good and bad aspects of ANN?
Good: classification of static patterns (including noisy acoustic data) and useful for recognizing isolated speech units
Bad: systems based only on ANNs don't work very well; they need to be paired with HMMs
First and second steps in commercialization of ASR? (1980-1990s)
ASR was used in telephone networks, portable speech recognizers were offered to the public, and ASR was integrated into applications ranging from PC dictation systems to air traffic control training systems
What were the 3 focuses of ASR in the 1990s? Plus two extras
1- larger vocab
2- spontaneous speech recognition
3- working in noisy environments
Plus
- human-to-human speech recognition
- visual speech recognition (based on lip positions, movements, etc.)
Three areas of further progress of ASR in 2000s?
1- development of new algorithms
2- advances in noisy speech recognition
3- integration of speech recognition into mobile technologies like cellphones
Speech recognition systems can be characterized by which 3 dimensions?
1- speaker dependence (speaker-dependent, speaker-independent, adaptive)
2- speech continuity (isolated/discrete word recognition systems, connected word recognition systems, continuous speech recognition systems, word spotting systems)
3- vocabulary size
3 types of errors in ASR?
Errors in discrete speech recognition (deletion, insertion, substitution and rejection errors)
Errors in continuous speech recognition (same as discrete speech + splits and fusions)
Errors in word spotting (false rejects, i.e. a target word is missed, and false alarms, i.e. a word is wrongly identified as a target word)
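Deletions, insertions and substitutions can be counted automatically by aligning the recognizer's output against a reference transcript with minimum edit distance. A minimal Python sketch; the function names and example sentences are hypothetical, and it does not model rejections, splits/fusions or word-spotting errors. The ratio it also computes is the common word error rate (WER) metric.

```python
from typing import Dict, Sequence


def align_errors(reference: Sequence[str], hypothesis: Sequence[str]) -> Dict[str, int]:
    """Count substitutions, deletions and insertions in a minimum-edit alignment."""
    n, m = len(reference), len(hypothesis)
    # d[i][j]: fewest edits turning reference[:i] into hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete every reference word
    for j in range(m + 1):
        d[0][j] = j  # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + mismatch,  # match or substitution
                d[i - 1][j] + 1,             # deletion
                d[i][j - 1] + 1,             # insertion
            )
    # Walk back from the bottom-right cell, labelling each edit on the way
    counts = {"substitution": 0, "deletion": 0, "insertion": 0}
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = 1 if i > 0 and j > 0 and reference[i - 1] != hypothesis[j - 1] else 0
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + mismatch:
            counts["substitution"] += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    return sum(align_errors(reference, hypothesis).values()) / len(reference)


ref = "please call my office now".split()
hyp = "please fall office now today".split()
print(align_errors(ref, hyp))  # {'substitution': 1, 'deletion': 1, 'insertion': 1}
print(word_error_rate(ref, hyp))  # 0.6
```

Here "call" became "fall" (substitution), "my" was dropped (deletion) and "today" was added (insertion).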
Difference between a direct, indirect and intent error in ASR?
Direct= human misspeaks/stutters
Indirect= ASR system incorrectly identifies what the speaker said
Intent= speaker decides to restate what's just been said
What are three reasons why ASR is good for learning a language?
- practice
- motivation
- feelings of communicating, not just repeating phrases and words
What are the two options of ASR for automatic rating of pronunciation? 3 ways to reach these goals?
Two options= give a global pronunciation rating OR identify specific errors
3 ways of achieving this= the ASR system must identify word boundaries, accurately match the speech to the correct targets, and compare the two to see what was done right/wrong
Describe spoken CALL dialogue systems
Used by software programs for practicing spoken language
The program provides one line of dialogue and then the learner chooses and speaks one of two responses
The ASR system can recognize which response has been spoken (even if it contains errors) and the computer then responds; if the response is wrong, the learner can try again
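One simple way a program could decide which of the two offered responses the learner attempted, even when the transcript contains errors, is to compare the recognizer's text output with each expected response and accept the closest one only if it is similar enough. A minimal sketch using Python's standard-library `difflib.SequenceMatcher`; it assumes the ASR step has already produced a text transcript, and the function names, the 0.6 threshold and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_response(recognized: str, expected: Sequence[str],
                   threshold: float = 0.6) -> Optional[str]:
    """Return the expected response closest to the transcript, or None
    if even the best match falls below the (hypothetical) threshold."""
    best = max(expected, key=lambda option: similarity(recognized, option))
    return best if similarity(recognized, best) >= threshold else None


options = ["I would like a coffee, please.", "No, thank you."]
print(match_response("I would like coffee please", options))  # I would like a coffee, please.
print(match_response("banana", options))  # None
```

Returning None gives the program a way to reject an unrecognizable answer and ask the learner to try again; real systems may instead restrict the recognizer itself to the expected responses.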