Lecture 8 - Categorical Perception and Learning

COGS 101B Exam 1 (27 flashcards)

Statistical Learning

Through mere exposure, we seem to learn
what kinds of things go with other kinds of things.

We learn contingencies over time.

Here the lines start to blur between associative and non-associative learning.


Through perceptual learning, we seem to BUILD and STORE
specific stimulus distinctions.

• These stimulus features can be used to identify and
categorize different types of things.

• Once established, these feature categories become the basis for top-down perceptual processing (e.g. recognizing feathers on male and female chicks).

• we have a storehouse of representations of different objects and can apply them top-down to the environment


Example: Perceiving breaks between words

The segmentation problem

how do you find the breaks when there's only a continuous signal?

- there are no physical breaks in the continuous acoustic signal of speech. [High computational complexity]

– Top-down processing, including knowledge a listener has about a language,
affects perception of the incoming speech stimulus (parse the speech as it's coming in).

– Segmentation is affected by context, meaning, and our knowledge of word structure.


non-associative learning

helps us see how we respond to and distinguish stimuli in the environment

perceptual learning - we become better and better at telling things apart


associative learning

finding contingencies: between two different stimuli (classical conditioning) or between a response and an outcome (operant conditioning)

building contingencies between two things (e.g. learning language)


What kind of learning reviewed so far seems specifically
useful for speech segmentation?

Statistical learning

helps us know when the breaks are coming: knowing the probabilities of when certain syllables tend to follow other syllables


Saffran, Aslin & Newport (1996)
demonstrated that

infants can detect word boundaries using
differences in transitional probabilities. [innate tendency]
- we have the innate ability to track different contingencies

• A continuous stream of sounds becomes segmented.

• And this should apply to natural speech.



High likelihood: PRE --> TTY (within word)
High likelihood: BA --> BY (within word)
Low likelihood: TTY --> BA (word boundary)
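The boundary-finding idea above can be sketched in a few lines: count how often each syllable follows each other syllable and look for the dips. This is a minimal illustration, not the actual Saffran, Aslin & Newport stimuli; the three "words" and their syllables are made up for the example.

```python
import random
from collections import Counter

# Illustrative mini-lexicon: three two-syllable "words".
random.seed(0)
words = [["pre", "tty"], ["ba", "by"], ["go", "lly"]]

# Build a continuous stream with no pauses, like the speech signal.
stream = []
for _ in range(300):
    stream += random.choice(words)

pairs = Counter(zip(stream, stream[1:]))   # counts of syllable pairs
firsts = Counter(stream[:-1])              # counts of leading syllables

def transitional_probability(s1, s2):
    """P(s2 | s1): how often s2 immediately follows s1 in the stream."""
    return pairs[(s1, s2)] / firsts[s1]

# Within-word transitions are reliable; transitions across a word
# boundary are not. The dip marks where the break is.
print(transitional_probability("pre", "tty"))  # high (within word)
print(transitional_probability("tty", "ba"))   # low (across a boundary)
```

The learner never needs explicit word boundaries; tracking the pair statistics is enough to make the boundaries fall out.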


Perceiving features

In order to track probabilities, we need to first distinguish

basic features (e.g. syllables) of the stimulus.

have to be able to ID syllables and be able to build those categories up


Some feature detection seems to be innate.


• Frogs have ‘bug’ detectors: group of cells that detect the size and shape and movement pattern of bugs that induces them to flick out their tongues (Lettvin et al., 1959).

• Visual system has simple and complex edge detectors (straight lines, edges); these appear as early in development as the system can be tested
(Hubel & Wiesel, 1959, 1962).

• Babies have phonetic discrimination for all language
sounds up to 10 months of age.


But all of these feature detectors seem to be shaped by both experience and 'top-down' influences.

we have all these innate abilities to detect things in the environment, but we can shape them with top-down knowledge
- experience-dependent plasticity

• Critical periods (e.g. phonetic discrimination)

• Mere exposure and discrimination training

- we can form many many different types of representations


How do we (as babies) initially discriminate the different
phonemes (speech sounds) that make up syllables?

Acoustic Speech Waveform
[da] [di] [du]

babies can make discrimination from the acoustic signals that make up syllables

we pull out phonemes (smallest perceived sound from a sound signal)

phonemes can be attached to vowel sounds, which creates a syllable; those syllables create words


Sound spectrograms

are often used to show changes in frequency
and intensity for speech.

– These are plotted by frequency (and amplitude) over time.

– Formants are the enhanced
(darker) bands of frequencies.



Formants are produced by a constriction of
the vocal tract (using the articulators).
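The spectrogram description above (frequency and amplitude plotted over time, with formants as dark high-energy bands) can be sketched with a short-time FFT. The 500 Hz tone here stands in for a single steady formant; the sample rate and window size are illustrative assumptions.

```python
import numpy as np

fs = 8000                                   # sample rate in Hz (illustrative)
t = np.arange(0, 0.5, 1 / fs)
signal = np.sin(2 * np.pi * 500 * t)        # stand-in for one steady formant

# Slice the signal into short overlapping windows and take the magnitude
# FFT of each; rows index time slices, columns index frequency bins.
win = 256
frames = [signal[i:i + win] * np.hanning(win)
          for i in range(0, len(signal) - win, win // 2)]
spectrogram = np.abs(np.fft.rfft(frames, axis=1))

# The energy peak sits at the "formant" frequency in every time slice.
freqs = np.fft.rfftfreq(win, 1 / fs)
print(freqs[spectrogram.mean(axis=0).argmax()])
```

In real speech the dark bands move over time (formant transitions); here the peak stays fixed because the input is a single steady tone.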


Formant transitions

rapid changes in frequency preceding or following
consonants as you're producing a sound

when you produce a "duh" or "buh"

This results in production of the basic unit of
speech sound – the phone.



Phone

the basic unit of speech sound in the acoustic speech signal



Phoneme

the thing you understand: the smallest unit of perceived speech stimulus that changes the meaning of a word (bad vs pad). These are defined by your language.

if you change the phoneme, you're changing the meaning of the word that it's attached to


The variability problem

there is no simple
correspondence between the acoustic signal (phones) and perceived phonemes.
- no one thing in the signal that you can "key in on"

Perceiving features in speech… is hard


Variability from context:

the acoustic signal associated with a phoneme
varies with acoustic context.

what the phoneme or phone is being attached to




overlap between
articulation of neighboring
phonemes causes variation in formant transitions. Yet, we still perceive the same /d/.

while you're articulating one phone, it's attached to other phones, and you're trying to articulate that next phone as well

you're articulating all those things together at once

you're always pairing that acoustic info with other acoustic info


Variability from different speakers:

– Speakers differ in pitch,
accent, speed in speaking, and pronunciation.

– This acoustic signal must be
transformed into familiar
phonemes and words.



One way we deal with the
variability problem is through
categorical perception.

(one of the ways)

it guides us through the variability

– This occurs when a continuum of stimulus energies (a lot of acoustic signal coming at you) is perceived as a limited number of sound categories (you don't hear a continuous stimulus; it's broken into categories).

– This can be accomplished through the use of acoustic cues (sets different syllables and phonemes apart).


acoustic cue


– An example of this comes from experiments on voice onset time (VOT): time delay between when a sound starts and when voicing (vocal cord vibrating) begins.

• Stimuli are /ba/ (short VOT)
and /pa/ (long VOT)


CogLab #40


You (n = 224) heard 9 different synthetic speech stimuli with a range of VOTs from short (0 ms) to long (80 ms).

• Task: What do you hear? (pa
or ba – identification).

• dependent on the critical period: 10-12 months of exposure to these phonemes

• Thus, we experience perceptual constancy for the phonemes within a given range of VOT.
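The identification pattern above can be sketched as a hard category boundary along the VOT continuum. The 25 ms boundary below is an illustrative assumption, not the measured CogLab value.

```python
# Toy sketch of categorical perception along the VOT continuum.
BOUNDARY_MS = 25  # assumed phonetic boundary, for illustration only

def perceive(vot_ms):
    """Map a continuous voice onset time onto a discrete phoneme category."""
    return "/ba/" if vot_ms < BOUNDARY_MS else "/pa/"

stimuli = list(range(0, 81, 10))        # 9 synthetic stimuli, 0-80 ms
print([(v, perceive(v)) for v in stimuli])

# Perceptual constancy: 0 ms and 20 ms sound like the same /ba/, while
# the physically closer 20 ms vs 30 ms pair straddles the boundary and
# sounds categorically different.
```

The listener never reports the incremental physical change; only crossing the boundary changes the percept.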


Perhaps, as babies, we perceive basic speech information by*:

• Using innate (species-specific) perceptual abilities to identify phones by acoustic cues (e.g. VOT).

• Relying on mere exposure to allow these categories to become (and remain) clear.

• Once we have those categories we can track which sounds go together to form words using statistical learning.

• Later, improving performance when speaking using discrimination training (with operant conditioning).

- highly dependent on the environment: feedback that helps train the system


Phonetic boundary

As you increase VOT, listeners do not hear the incremental changes. Instead
they hear a sudden change from /da/ to /ta/

if two stimuli are on opposite sides of the phonetic boundary, you hear two different things

a strong constraint


Is there a theoretical model that shows how this might be done (and is biologically plausible)?

McClelland & Rumelhart's (1981) Interactive Activation Model

developed a connectionist
model which may account for
some patterns in language

• Originally developed for printed language (but can be used for acoustics as well).

• Start off: Feature detectors are activated when they match the stimulus. (Note: they can be spatially sensitive.)
- sensitive to a certain line of a certain orientation: if it's part of a letter, that letter node becomes active (e.g. T)
- excitatory connection: excites the T node (activates the "T" words)
- inhibitory connection: "we're not an L" (don't activate the "L" words)

• They excite letter nodes when the detected feature is part of the represented object (otherwise inhibit).

• Letter nodes excite word nodes if they are a part of the word representation (otherwise inhibit).

• All letter stimuli are evaluated individually.


• R and K are equally likely letters
in the fourth position, based
purely on features. The D doesn’t
match at the feature level.

all the letters that are part of WORK activate those nodes, because they're highly likely (we can track that)

individual letters are primed or pre-activated

• The “WORK” node is already
activated and sends feedback to K to pre-activate it (priming?).

• We might explain this
behaviorally, noting that R has a low probability of following R, so K is the more likely fourth letter.
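The WORK example can be sketched as a tiny piece of top-down feedback: word nodes consistent with the letter evidence feed activation back to the letter they contain at the open position. The mini-lexicon and the simple counting rule are illustrative assumptions, not the model's actual nodes or parameters.

```python
# Tiny sketch of top-down feedback in an interactive-activation style model.
LEXICON = ["WORK", "WORD", "WORM", "FORK", "WEAK"]  # illustrative lexicon

def word_feedback(context, candidates, lexicon=LEXICON):
    """Each word node consistent with the evidence feeds activation back
    to its letter at the open position (priming / pre-activation)."""
    return {c: sum(1 for w in lexicon if w == context + c) for c in candidates}

# Features in the 4th position fit R and K equally well; D doesn't match
# at the feature level, so its letter node never becomes a candidate.
print(word_feedback("WOR", ["R", "K"]))  # → {'R': 0, 'K': 1}
```

The "WORK" node supports K while no word supports "WORR", so K gets pre-activated even though the bottom-up features alone cannot decide between R and K.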