speech perception Flashcards
- Understand why designing computer speech recognition systems is difficult.
- Computers can’t match people’s ability to recognize speech.
- Computers perform well when a person speaks slowly and clearly, when there is no background noise, and when they are listening for a few predetermined words or phrases
- Humans can perceive speech even when confronted with phrases they have never heard, despite background noise, sloppy pronunciation, and speakers with different dialects and accents.
Acoustic signal
produced by air that is pushed up from the lungs past the vocal cords and into the vocal tract.
vowels
produced by vibration of the vocal cords
- Each vowel has a characteristic series of ‘formants’ (resonant frequencies)
- The first formant has the lowest frequency, the second has the next highest, etc.
formants
Formants: the frequencies at which these peaks of energy occur in the acoustic signal
- formant transitions: rapid shifts in frequency preceding or following formants (associated with consonants)
consonants
Consonants are produced by a constriction, or closing, of the vocal tract and by air flowing around the articulators.
- The articulators are the tongue, lips, teeth, jaw, and soft palate; consonant sounds are created by their movement and the resulting airflow.
phonemes
smallest unit of speech that, when changed, changes the meaning of a word
In English there are 47 phonemes.
spectrogram
- spectrogram indicates the pattern of frequencies and intensities over time that make up the acoustic signal.
- Frequency is indicated on the vertical axis
- time (ms) is indicated on the horizontal axis;
- intensity is indicated by darkness, with darker areas indicating greater intensity; in this example the lower frequencies (300–700 Hz) are the most intense.
- Dark bands (smudges) in the spectrogram are the formants.
- bend in the band is formant transition
- The vertical lines in the spectrogram are pressure oscillations caused by vibrations of the vocal cords.
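The axes described above can be made concrete with a small sketch: a spectrogram is a short-time Fourier transform, with frequency on the vertical axis, time on the horizontal axis, and intensity as magnitude. This is a minimal NumPy illustration on a synthetic two-component signal (the 500 Hz and 1500 Hz "formants" are made-up values for illustration, not real speech).

```python
import numpy as np

def spectrogram(signal, win_len=256, hop=128):
    """Short-time Fourier transform magnitude: rows = frequency, cols = time."""
    window = np.hanning(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window
        # One column of the spectrogram: intensity at each frequency bin
        frames.append(np.abs(np.fft.rfft(frame)))
    # Transpose so frequency runs along the vertical axis, time horizontal
    return np.array(frames).T

# Synthetic "vowel": two formant-like components at 500 Hz and 1500 Hz
fs = 8000
t = np.arange(0, 0.5, 1 / fs)
sig = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)

S = spectrogram(sig)
freqs = np.fft.rfftfreq(256, 1 / fs)
peak = freqs[np.argmax(S.mean(axis=1))]  # darkest band ~ first "formant"
```

Darker areas of a printed spectrogram correspond to the larger magnitudes in `S`; here the 500 Hz component dominates because it has the larger amplitude, mirroring the "lower frequencies more intense" note above.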
lack of invariance or variability problem:
no simple relationship between a particular phoneme and the acoustic signal
acoustic signal for a particular phoneme is variable.
variability from different speakers
Speakers differ in pitch, accent, speaking rate, and pronunciation -> this variable acoustic signal must be transformed into familiar words
- Coarticulation: articulators are constantly moving as we talk, the shape of the vocal tract associated with a particular phoneme is influenced by the sounds that both precede and follow that phoneme. This overlap between the articulation of neighbouring phonemes is called coarticulation.
variability from context
even though we perceive the same /d/ sound in /di/ and /du/, the formant transitions, which are the acoustic signals associated with these sounds, are very different.
-Thus, the context in which a specific phoneme occurs can influence the acoustic signal that is associated with that phoneme.
Categorical perception
a wide range of acoustic cues results in the perception of a limited number of sound categories
- demonstrated using a property called voice onset time (VOT): the delay between the beginning of a sound and the onset of vocal cord vibration
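As a rough illustrative sketch (not the actual experimental procedure), categorical perception along a VOT continuum behaves like a step function: a continuous range of acoustic values maps onto only two percepts. The 35 ms boundary and the /ba/–/pa/ pair here are hypothetical choices for illustration; real boundaries vary by consonant pair and listener.

```python
# Illustrative sketch of categorical perception along a VOT continuum.
# The 35 ms category boundary is a hypothetical value for illustration.

def perceived_category(vot_ms, boundary_ms=35.0):
    """Map a continuous voice onset time (ms) onto a discrete percept."""
    return "/ba/" if vot_ms < boundary_ms else "/pa/"

# A continuum of acoustic cues collapses into just two perceived categories
continuum = [0, 10, 20, 30, 40, 50, 60]
percepts = [perceived_category(v) for v in continuum]
# Listeners report an abrupt switch at the boundary, not a gradual blend
```

The point of the sketch is the shape of the mapping: stimuli that differ by 10 ms on the same side of the boundary sound identical, while an equal 10 ms difference across the boundary sounds like a different phoneme.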
multimodal
speech perception is multimodal; our perception of speech can be influenced by information from a number of different senses.
McGurk effect
although auditory information is the major source of information for speech perception, visual information can also exert a strong influence on what we hear
audio-visual speech perception
- The influence of vision on speech perception is called audio-visual speech perception.
- The McGurk effect is one example (e.g., people routinely use information provided by a speaker’s lip movements to help understand speech in a noisy environment).
Experiment
The McGurk effect
-Visual stimulus shows a speaker saying “ga-ga.”
- Auditory stimulus has a speaker saying “ba-ba.”
- Observer watching and listening hears “da-da”, which is perceptually between “ga” and “ba.”
- Observer with eyes closed will hear “ba-ba.”
- The link between vision and speech has been shown to have a physiological basis.
- Calvert et al. showed that the same brain areas are activated for lip reading and speech perception.
“top-down” processing affects speech perception
- Philip Rubin and coworkers (1976), for example, presented a series of short words, or nonwords, and asked listeners to respond by pressing a key as rapidly as possible whenever they heard a sound that began with /b/.
- participants took 631 ms to respond to the nonwords and 580 ms to respond to the real words.
- Thus, when a phoneme was at the beginning of a real word, it was identified about 8 percent faster ((631 − 580) / 631 ≈ 8%).
-speech perception is determined both by the nature of the acoustic signal (bottom-up processing) and by context that produces expectations in the listener (top-down processing).
phonemic restoration effect:
The ability to “fill in” a phoneme that has been obscured by noise; the effect was experienced even by students and staff in the psychology department who knew that the /s/ was missing.
- can be influenced by the meaning of words following the missing phoneme
The segmentation problem -
there are no physical breaks in the continuous acoustic signal.
speech segmentation
The perception of individual words in a conversation is called speech segmentation.
How we perceive breaks in words
-knowledge: Top-down processing, including knowledge a listener has about a language, affects perception of the incoming speech stimulus
- meaning: knowledge of the meaning of the sounds can change the perceptual organization of those sounds.
transitional probabilities
- the chances that one sound will follow another sound in a language
statistical learning
- The process of learning about transitional probabilities and about other characteristics of language is called statistical learning.
- Research has shown that infants as young as 8 months of age are capable of statistical learning.
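A minimal sketch of how transitional probabilities can be computed from a stream of syllables, in the spirit of statistical-learning studies: within a "word" one syllable reliably follows another, while transitions across word boundaries are less predictable. The syllable "words" and their ordering below are made up for illustration.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """P(next | current): the chance that one sound follows another."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

# Hypothetical stream built from two made-up "words" in varying order
words = ["bi da ku", "pa do ti"]
order = [0, 1, 0, 0, 1, 1, 0, 1]
stream = " ".join(words[i] for i in order).split()
tp = transitional_probabilities(stream)
# Within a word, "da" always follows "bi" -> TP = 1.0
# Across a word boundary, "pa" follows "ku" only sometimes -> TP = 0.75
```

High transitional probabilities mark syllables that belong together; the dips at word boundaries are exactly the statistical cue a listener (or infant) could use to segment the continuous stream into words.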
The pop-out effect
shows that higher-level information such as listeners’ knowledge can improve speech perception.
- Found that after experiencing the pop-out effect, subjects became better at understanding other degraded sentences that they were hearing for the first time.
broca’s aphasia
- Broca’s area is located in the frontal lobe, so Broca’s aphasia results from frontal lobe damage.
- Patients with slow, laboured, ungrammatical speech caused by damage to Broca’s area are diagnosed as having Broca’s aphasia.
- They have difficulty forming complete sentences and understanding some types of sentences.