Lectures Flashcards

1
Q

different types of computational modelling have radically different…

A

assumptions about the nature of cognition

2
Q

most forms of computational modelling…

A

involve some form of simulating a cognitive process

ie. input -> “model” -> behavioural output

3
Q

models differ in their level of analysis

A

Marr’s levels:
- neural (implementational)
- algorithmic
- computational

4
Q

how does computational modelling aid in understanding human behaviour?

A

by establishing a concrete definition of a cognitive process

5
Q

origins of modelling

A

computer simulations have been popular since the early years of psychology

the importance of computation was recognized at an early stage, ie. Turing (1950)

Wiener (1948) and Shannon (1949) developed early mathematical theories of information and communication

Society for Computation in Psychology

6
Q

Wiener and Shannon

A

Wiener (1948) and Shannon (1949)

developed early mathematical theories of information and communication

7
Q

Society for Computation in Psychology

A

formed in 1971

one of the early subgroups of cognitive psychology

prof is a member

8
Q

2 types of analytical models

A

1. recognition memory experiment
2. signal detection theory

9
Q

recognition memory experiment

A

participants are presented with a list of words

then presented with pictures of those words

tested on whether words are old or new

people sometimes falsely accept items that never occurred

10
Q

signal detection theory

A

measurement of the difference between two distinct patterns

the first pattern is the one you're supposed to pay attention to

the second pattern is the random noise that interferes with a person/machine's ability to collect and process info

essentially looks at how easy/difficult it is for someone to process info and respond to it while also being exposed to background noise/distractions
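
as a small worked illustration (not from the lecture), sensitivity in signal detection theory is commonly summarized as d' = z(hit rate) - z(false-alarm rate); a minimal sketch in Python, with hypothetical rates:

from scipy.stats import norm

# hypothetical rates from a recognition task with background noise
hit_rate = 0.85          # 'old' responses to old items
false_alarm_rate = 0.20  # 'old' responses to new items

d_prime = norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)
print(d_prime)  # ~1.88: how well the signal separates from the noise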

11
Q

the primary model type we’ll look at in this course…

A

simulation models

the model's output isn't deterministic

there is underlying randomness in the model (typically implemented with random number generators)
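
a trivial sketch of this built-in randomness (all names and parameters invented for illustration):

import random

def simulate_response(strength, noise_sd=0.3, criterion=0.5):
    # noisy memory-strength model: the same input can produce different outputs
    evidence = strength + random.gauss(0, noise_sd)
    return "old" if evidence > criterion else "new"

print([simulate_response(0.6) for _ in range(5)])  # varies from run to run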

12
Q

mind as computer

A

Pylyshyn (1984)

the mind takes in information from the senses

integrates it to create perceptual experience and behaviour

13
Q

knowledge acquisition: Plato vs Chomsky

A

Plato: knowledge is innate and recollected, rather than gained from experience

Chomsky: we are born with innate knowledge and learning mechanisms

14
Q

poverty of the stimulus

A

we cannot possibly hear every form of language we produce in the course of learning it

we produce more language than we experience

and all possible language is even greater than the language we produce

15
Q

the difference between ‘language experienced’ and ‘language produced’ is accounted for through…

A

innate knowledge

16
Q

possible solution: Simon (1969)

A

discussing the path taken by an ant on a beach, Simon noted that the ant's path is “irregular, complex, hard to describe. But its complexity is really a complexity in the surface of the beach, not a complexity in the ant.”

17
Q

big data and natural language processing

A

collection of large text sources has changed how we think about studying language

possible to propose learning mechanisms and train them on realistic data

a model can be “born” into a realistic language environment

we then gain insights into cognition and language performance by examining how the model learns and functions

such a model is also a powerful natural language processing tool

18
Q

T/F: virtual environments are approaching real world complexity levels

A

true

19
Q

language learning: bi-directional benefit

A

we benefit from using large, realistic text sources because we can train models on them

the models give us insight into cognition/language performance/learning

also become powerful natural language processing tools

20
Q

corpus-driven modelling

A

identifies strong tendencies for words/grammatical constructions to pattern together in particular ways

while other theoretically possible combos rarely occur

21
Q

corpus-driven modelling allows for…

A

connections between lexical experience and lexical behaviour

22
Q

first corpus ever

A

the Brown corpus of Kučera and Francis

1967

consisted of about 1 million words, sampled from different genres

23
Q

examples of text-based resources now available for use for corpus-driven modelling

A

Grade 1-12 textbooks

Scientific journal articles

Newspaper articles

Wikipedia

TV and movie subtitles

Books

Urban Dictionary

Reddit

24
Q

distributional models of semantics

A

a usage-based model of meaning

based on the assumption that the statistical distribution of linguistic items in context plays a key role in characterizing their semantic behaviour

distributional models build semantic representations by extracting co-occurrences from corpora

25
internal versus external theories of cognition
internal: involves attending internally to thoughts, memories and mental imagery
external: involves attending to stimuli in the external environment
(brain, body, environment)
26
organization of long term memory
long term memory splits into: explicit/declarative (conscious) and implicit (unconscious)
explicit/declarative splits into: semantic memory (facts, concepts) and episodic memory (events, experiences)
implicit splits into: priming and procedural memory (skills, tasks)
27
explicit/declarative memory splits into...
1. semantic memory (facts, concepts)
2. episodic memory (events, experiences)
28
implicit memory splits into...
1. priming
2. procedural memory (skills, tasks)
29
semantic memory
refers to what you know: facts, concepts
30
how is semantic memory tied to language?
not necessarily tied to language, but intimately connected
language is a general organizing principle of memory
31
lexical semantic memory
memory of word meanings
32
study of semantic memory examines...
storage and retrieval
33
modern theories of semantics
based in experience
the environment serves as a model/constraint
34
2 branches of "based in experience" theories of semantics
1. grounded/embodied theories - our perceptual world (and our brains, which are embodied) is our main source of information for understanding the world around us
2. text-based machine learning
35
frontal lobe
language processing, emotional regulation, executive functioning, planning, organizing, memory, impulse control, problem solving, selective focus, decision making, behavioural control
36
temporal lobe
episodic memory (involved in comprehension, storage and retrieval of memory)
hearing - the first area that processes speech info, turning it into a linguistic code
memory acquisition
some visual perception
categorization of objects
comprehension
memory retrieval
37
perisylvian region
area of the brain responsible for language
composed of:
- primary auditory cortex
- Wernicke's area
- angular gyrus
- arcuate fasciculus
- primary motor cortex
- Broca's area
38
Wernicke's area
constructs a representation of meaning for linguistic info
damage to this area from stroke = fluent/receptive aphasia
- loss of the ability to understand and create meaningful language
- speech is grammatically correct but has incorrect meaning
39
Broca's area
responsible for linguistic production
damage to this area from stroke = non-fluent/productive aphasia
- loss of the ability to produce fluent language
- but can still understand language
40
Wernicke's location
posterior temporal lobe
many connections to the primary auditory cortex
heavily connected to Broca's area
41
Wernicke's = important for...
storage and retrieval of word representations, meanings, grammar
42
Broca's location
posterior inferior frontal region
next to the primary motor cortex (responsible for the muscles used to produce speech)
sometimes called the motor speech area
43
arcuate fasciculus
connection between Wernicke's and Broca's areas
important for BOTH phonological and lexical-semantic processing
44
early theory of semantic memory - devised by Collins & Quillian
hierarchical networks
45
hierarchical networks
Collins & Quillian suggest our info in memory is organized hierarchically
- can be represented by a tree
- superordinate info at the top
- as you continue down the network, you get more subordinate info
46
what kind of info is at the bottom of the tree in hierarchical networks?
actual instances of a category
47
if information is stored in the brain in the way suggested by hierarchical networks, then there should be a corresponding connection between...
the amount of time it takes you to find connections between these properties
direct connections will be faster
think about it like walking from point to point
48
living thing: example of hierarchical network
living thing - connects via the propositions "is" and "can" to "living" and "grow"
living thing - connects via "is a" to either: 1. plant 2. animal
plant - connects via "is a" to: 1. tree 2. flower
these eventually link to specific examples - pine, oak, rose, daisy
49
how did Collins and Quillian test if the timing of their network in validating closeness of associations actually applies to human processes?
gave people a sentence that was true or false
had them say whether it was true or false
ie. 'a canary can sing', 'can fly', 'has skin' - looking at properties progressively higher up in the network
turns out that increasingly high properties take longer to validate
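a toy sketch of this distance-based prediction (the tree, properties, and hop-counting are illustrative assumptions, not the lecture's materials):

# hypothetical hierarchical network: child -> parent, plus properties stored per node
parents = {"canary": "bird", "bird": "animal", "animal": None}
properties = {"canary": {"can sing"}, "bird": {"can fly"}, "animal": {"has skin"}}

def hops_to_verify(concept, prop):
    # walk upward until the property is found; more hops = slower verification
    hops, node = 0, concept
    while node is not None:
        if prop in properties[node]:
            return hops
        node, hops = parents[node], hops + 1
    return None

print(hops_to_verify("canary", "can sing"))  # 0 hops: fastest to validate
print(hops_to_verify("canary", "has skin"))  # 2 hops: slowest to validate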
50
are Collins & Quillian's findings supported in all categories?
no, not validated in all categories
a good first step, but not exhaustive
51
2 pieces of theoretical refinement: Smith, Shoben & Rips
1. proposed that items can be represented as a SET OF FEATURES - each concept is described by a set of features that define it
2. meaning can be described as a position in a geometric space - vectors
52
vectors
look at how similar and different certain vectors are
use trigonometry to calculate the angles between different vectors
once you have the numerical similarity between the vectors, you can plot how they are distributed in space
53
vector cosine
calculated using trigonometry that examines the angle between two vectors
comes up with a value between 1 and -1
1 = the same (very similar)
-1 = opposite
(0 = unrelated)
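a minimal sketch of the computation (plain NumPy; the word vectors are invented):

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: 1 = same direction, -1 = opposite
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical feature vectors for two words
dog = np.array([0.9, 0.1, 0.8])
wolf = np.array([0.8, 0.2, 0.7])
print(cosine_similarity(dog, wolf))  # close to 1: very similar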
54
multidimensional scaling
uses the vector cosines to place words in a 2D space
visually shows their similarity
more similar items will be closer to each other within the space
helps visualize how we connect things in our minds
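a sketch of this step, assuming scikit-learn's MDS and an invented dissimilarity matrix (1 - cosine):

import numpy as np
from sklearn.manifold import MDS

# hypothetical pairwise dissimilarities between four words
dissim = np.array([
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
])
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dissim)
print(coords)  # 2D coordinates: similar words land near each other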
55
what are features?
classical approaches propose that they are properties of categories
ie. features of cars: "has wheels", "used for transportation", "has doors", "has an engine"
56
uninterpretable features: multidimensional scaling models
multidimensional scaling models don't carry interpretable features
the locations of things in space don't map onto features like "has wheels" or "has a door"
you can't say that location x in a matrix means that word y has a door
57
how do machine learning models construct features?
from text
not typically based in the perceptual environment
some are interpretable, others are not
58
in neural networks, all info is distributed across...
the WIDTH of the network
if you damage the network, all information decays together (it's not like you just lose a chunk of it)
59
topic models
probabilistically places words matched on whether that word has a feature or not
ie. a probability value that a certain word is a living thing, or is red, or can move, etc.
good for information organization; can categorize info well
60
topic models are good at ______ but aren't really used as a _______
good at information organization/categorization but aren't really used as a theory of cognition
61
Rogers and McClelland worked on what kind of model
neural network
62
basic idea behind Rogers and McClelland's neural network
based on an interest in how children acquire language
take propositions (sentences) derived from a representation network
give the model a word (canary) and a proposition (can)
then have an output layer with all sorts of possible options
want the model to produce certain options, and not produce others - ie. want it to produce 'sing', 'grow', 'fly' but not 'swim'
if the model gets something wrong, it uses back-propagation to adjust the weights so that next time it's less likely to make the same mistake
it can do this because it's a supervised network (we know what we want the network to produce, so we know when it's wrong)
by the end of the training cycle, the model produces the correct output
63
models of Collins & Quillian versus models of Rogers & McClelland
Collins & Quillian: hierarchical networks
Rogers & McClelland: neural networks
64
supervised networks
we know what we want the network to produce, so we know when it is wrong
this allows for back-propagation/error-driven learning
ie. these neural networks are supervised
65
back-propagation
error-driven learning, possible in supervised networks where we know the output that we want the model to produce
at first, the network will produce "noise" (the wrong things)
but since we know what we want it to produce, we can CHANGE THE CONNECTION WEIGHTS so that next time it's incrementally more likely to produce the correct activations
do this hundreds of thousands, even millions of times
eventually the network will produce the right activation
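a minimal back-propagation sketch in plain NumPy (the toy patterns, layer sizes, and learning rate are all invented for illustration):

import numpy as np

rng = np.random.default_rng(0)

X = np.array([[1.0, 0.0], [0.0, 1.0]])  # hypothetical input patterns
Y = np.array([[1.0, 0.0], [0.0, 1.0]])  # desired (supervised) outputs

W1 = rng.normal(0, 0.5, (2, 4))         # input -> hidden weights
W2 = rng.normal(0, 0.5, (4, 2))         # hidden -> output weights
lr = 0.5                                # learning rate: small change per trial

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for trial in range(5000):               # many trials over the same materials
    h = sigmoid(X @ W1)                 # forward pass
    out = sigmoid(h @ W2)
    err = out - Y                       # error signal, known because supervised
    d_out = err * out * (1 - out)       # backward pass: propagate the error
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out)            # nudge each weight slightly
    W1 -= lr * (X.T @ d_h)

print(np.round(sigmoid(sigmoid(X @ W1) @ W2), 2))  # approaches the desired outputs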
66
error-driven learning is really just...
reinforcement learning
67
each arrow in a network...
represents a different weight/numerical value
which is adjusted depending on how incorrect the network is
68
do we want a high or low learning rate?
low learning rate, so that small changes are made with each input
this means that a lot of learning trials are required
the model generally must be trained multiple times on the same corpus
69
other term for backpropagation
the backward pass
70
what comes out as the output is essentially just the...
most activated node in the hidden layers
71
2 main approaches to neural networks
1. localist network:
- each node represents only one entity
- people tend to think these are neurologically implausible
2. distributed representation:
- info is spread across the nodes, instead of being confined to one node
- preferred, because more similar to how the brain functions
72
issue with the whole 'input = output from many many hidden layers' thing
results in a kind of black box model
what exactly is happening in the hidden layers is unclear
can't "get into the head of the model" - can't map it onto what humans do in experimental tasks
led to deep belief networks (networks that feed into other networks... each layer is trained separately, so you don't have to go all the way back to the first layer)
73
three names associated with back-propagation
Rumelhart, Hinton & Williams (1986)
74
the trajectory of learning followed by Rogers and McClelland's model maps onto...
learning trajectories of children as they acquire language
in the beginning, the model produces noise (outputs are all equally likely, close together in 2D space)
but with training, the outputs begin to split apart and are weighted differently (just like how kids begin to learn words)
75
closed versus open models
closed models:
- restrict the model to working with the training materials
- assume all of the knowledge about the world is contained in the training materials
- allow for clarity in the resulting explanation
open models:
- use millions of samples
- noise is eventually reduced through greater levels of experience
- better than closed
76
Rogers and McClelland models = based on what assumption? open or closed networks?
based on the SIMPLIFICATION ASSUMPTION
they are closed networks
"the more detail we incorporate, the harder the model is to understand"
- think of the growing complexity and non-interpretability of ChatGPT
77
simplification assumption
linked to closed models
suggests that when you're training a model you should give it simple training data
because complicated materials make it unclear whether the model is succeeding/failing because of the quality of the data
simple data provides researchers with clarity regarding how good the model is
78
closed models and ecological validity
closed models have low ecological validity
not reflective of tasks that humans actually perform
language is very noisy, with lots of info all the time
so using simple training materials doesn't reflect the task that humans face when they're learning
79
open models require _____ information
more
80
BEAGLE model on 300 versus 300 000 propositions
300 propositions = closed model
- only takes 300 trials to learn the propositions
- can cluster info right away
- not error-driven
- presents sentences as more structured than they are in reality
300 000 propositions = open model
- derived from a large corpus of language
- takes much longer to train, about 300 000 trials
81
why does it take the larger BEAGLE model longer to learn?
because the learning corpus and the actual corpus are different (open model)
the actual corpus has more noise and nuance
it therefore takes longer to settle and to produce the correct output
because open models learn from actual sentences, it takes more examples of info to come up with the correct structure
82
Current NLP Machine Learning Wars
people keep building bigger models, competing against each other
BERT, RoBERTa, GPT-2, T5, Turing NLG, GPT-3
GPT-3 is winning
83
NLP
natural language processing
84
is ChatGPT a good model for the brain?
not really
it contains way more info than the human brain does
not really an applicable model with which to assess human cognition
85
LLM
large language model
ie. ChatGPT, and models from Facebook and Google
86
perceptual symbol systems
proposed by Barsalou as a general theory of cognition
classic view: amodal symbols in cognition
amodal systems have NO CONNECTION to the perceptual environment
87
amodal systems have no connection to...
the perceptual environment
an amodal symbol system transduces a partial perceptual experience into a completely new representational language that is INHERENTLY NON-PERCEPTUAL
88
3 problems with amodal approach
1. neurological evidence:
- findings show that damage to the sensory-motor cortex impairs processing of certain modality-based categories (ie. birds)
2. failure of transduction:
- no system can elegantly go from perception to symbols
3. symbol grounding problem:
- how does the system know what it's computing?
89
an alternative to amodal systems
neural representations
90
neural representations
not a physical copy of the perceptual experience
instead, a RECORD OF THE NEURAL ACTIVATION that arises during perception
similar to representations of imagery
likely stored in CONVERGENCE ZONES: these integrate info in sensory-motor maps to represent knowledge
never completely transduced; perceptual traces are reconstructed
91
8 examples of semantic memory tasks
many different behaviours are studied:
1. word similarity
2. false memory
3. free association
4. semantic priming
5. verbal fluency
6. sentence comprehension
7. discourse comprehension
8. feature judgments
92
semantic memory models: word similarity
most common type of data used for these models
used in model development and model evaluation
give people two words and get them to RATE HOW SIMILAR THEY ARE on a scale
collect ratings from people and average them
compare this number to a computational model that's also learning these words
93
semantic memory models: verbal fluency
used in more applied situations, ie. diagnosing conditions like Alzheimer's or schizophrenia
give people a category and ask them to generate as many things as possible from that category
compare the model's output to the output of humans - see if the person fits the model made for schizophrenia, for example
94
models and dementia
models can examine how language use changes prior to diagnosis
because they're based on data from people in the years leading up to their diagnosis, we can quantitatively see how their memory systems are changing
models = a tool to understand how the memory systems of people with dementia change over time
95
representation types: network models
words are connected within a semantic network (ie. 'release' connects to 'capture' connects to 'pirate' connects to 'sailor' connects to 'anchor')
generate a representation of each item based on the nodes it's connected to
96
how are network models typically derived?
from free association data
give people a word (like 'car') and get them to generate features associated with these items
this is how the semantic networks/network models are generated
97
Turk problems
an issue with network models: they explain human behaviours using other human behaviours
Turk problems arise when the representational input is derived directly from human behavioural data
COMPLEXITY OF THE MODEL = HIDDEN WITHIN THE REPRESENTATION
98
who coined the term 'Turk problem'?
Jones, Hills, Todd
99
are back-propagation models feature models?
yes!
features are the activation values of the hidden layer
the activation of the hidden layer can be used as a featural representation of a word
100
important changes occurring in 1990's-2000's that helped progress Big Data and Natural Language Processing
pre-1990's - didn't have large enough language corpora to train models on
but with the internet, larger texts were gathered
2000's - further movement to digitize existing/old texts
large corpora of text brought in a different domain of modelling
COLLECTION OF LARGE TEXT HAS CHANGED HOW WE THINK ABOUT STUDYING LANGUAGE
101
large corpora have changed how we think about studying language...
now possible to PROPOSE LEARNING MECHANISMS and to TRAIN ON REALISTIC DATA
a model can be "born" into a realistic language environment
we gain insights into cognition and language performance by examining how it learns/functions
102
T/F: virtual environments are approaching real world complexity levels
true
103
NLPs not only help us understand cognition and language performance, but also...
are powerful natural language processing tools
104
quantification of the natural language environment: Herbert Simon's take
Herbert Simon said "the apparent complexity of our behaviour over time is largely a reflection of the complexity of the environment in which we find ourselves"
behaviour is adaptive: we shape our cognition to the requirements of our environment
- the cognitive system is built such that we can change our behaviours to match the needs of our environment
105
classic goal in the cognitive sciences
quantification of the natural language environment
106
quantification of the natural language environment: William Estes' take
William Estes stated that theories of behaviour should shift "the burden of explanation from hypothesized processes in the organism to statistical properties of environmental events"
ie. we should look at how people learn from the environment and respond to it
he was particularly interested in mathematical properties
107
distributional models
these types of models learn the meanings of words from the distribution of how they're used in language
aka embedding models
they learn the meaning of words from co-occurrence statistics
108
first major distributional model
Landauer & Dumais (1997): the Latent Semantic Analysis model
Landauer and Dumais wanted to move away from the algorithms of the time, which were simply cued with specific words and returned the documents with the most overlap
they wanted a more MEANING-BASED approach
- get rid of the polysemy effect
- introduce recognition of synonymic meanings
109
LSA works by...
1. examine a large corpus of text
2. extract information about how words are used
3. this information is based on frequency of usage for particular words
4. build a vector that represents the meaning of the word in terms of its similarity to other words
5. decompose the matrix into a smaller number of features
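a toy sketch of steps 4-5, assuming NumPy's SVD and an invented word-by-document count matrix:

import numpy as np

# hypothetical word-by-document counts (rows: words, columns: documents)
words = ["canary", "sing", "solvent", "mixture"]
counts = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 2.0],
    [0.0, 0.0, 2.0, 3.0],
])

# decompose the matrix, keeping a smaller number of features (k = 2)
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
vectors = U[:, :k] * S[:k]  # each row = the reduced semantic vector for a word

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vectors[0], vectors[1]))  # canary vs sing: high similarity
print(cosine(vectors[0], vectors[2]))  # canary vs solvent: low similarity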
110
is LSA error-free?
yes, there's no error signal in the model's learning (unlike neural networks)
it simply accumulates information in memory and uses that to drive the model
it does not use a predictive process to hone its learning - it treats each lexical experience equally
111
LSA: supervised or unsupervised?
unsupervised
it just learns the structure of the dataset
112
4 things we need for distributional models
1. input - a corpus for the model to learn
2. processing - a learning algorithm, by which info is gleaned from the input, extracted, and stored in memory
3. memory - a feature space, a representation of where we keep info about the word's meaning
4. output - the task problem
113
distributional models: processing/learning mechanism details
neural embedding models take a sentence
they sequentially activate each word on its own
we want the model to predict the words that surround that word in that context
the predictions are in the output layer
see if the predictions are correct
back-propagate to increase accuracy
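an illustration with gensim's Word2Vec, a neural embedding model of this predict-and-backpropagate type (the tiny corpus is invented; real models train on millions of sentences):

from gensim.models import Word2Vec

sentences = [
    ["canary", "can", "sing"],
    ["canary", "can", "fly"],
    ["fish", "can", "swim"],
]
# sg=1 selects the skip-gram architecture: predict surrounding words from each word
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1, epochs=200)
print(model.wv.similarity("canary", "fish"))  # cosine between learned embeddings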
114
problem with ChatGPT
it's too complex
too many layers - we don't really know what's happening
it's a "black box"
115
Firth quote about word co-occurrences
"you shall know a word by the company it keeps"
116
context (source of text) for distributional models
many different possibilities: paragraphs, documents, books, authors, etc.
117
when processing a sentence, distributional models pre-process. how?
pre-processing modifies the sentences/inputs to improve processing:
1. stop list
2. subsampling
118
stop list
a stop list of high frequency function words
any word included on the stop list is removed from the sentence
119
subsampling
first, a frequency distribution is computed (custom to the corpus in question)
this creates a probability distribution - words with very high frequencies are probabilistically skipped
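for illustration, the subsampling rule from Mikolov et al.'s word2vec is one common implementation (the threshold t is a tunable parameter, not a value from the lecture):

import random

def keep_word(word, freq, total, t=1e-3):
    # probabilistically keep a word; very frequent words are usually skipped
    f = freq[word] / total             # relative frequency in this corpus
    p_keep = min(1.0, (t / f) ** 0.5)  # rare words: ~1.0; 'the': near 0
    return random.random() < p_keep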
120
if you don't use stop lists or subsampling to get rid of certain words, then...
the model is quickly overwhelmed
every single word will be understood to be similar to "the"
121
are there parallel processes to stop lists/subsampling in real people?
yes
eye tracking studies show that when people read a page, they generally skip function words
122
which is better? stop list or subsampling?
subsampling
it gives you more control over what the model is processing, and it's controlled by parameters
more training flexibility
123
example of sentence before and after stop list/subsampling
before: "if the solvent is insoluble the mixture can be decanted"
after: "solvent insoluble mixture decanted"
124
after pre-processing...
the remaining words are examined, specifically their occurrences with each other word in the corpus
each pair that is found modifies the count in the matrix (strength increases with each pair found)
this is done word by word: find all the pairs for one word first, then move on to the next word...
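a minimal sketch of this counting step (pure Python; the pre-processed sentences are invented):

from collections import defaultdict
from itertools import combinations

sentences = [
    ["solvent", "insoluble", "mixture", "decanted"],
    ["solvent", "mixture", "filtered"],
]

cooccur = defaultdict(int)
for sent in sentences:
    for w1, w2 in combinations(sent, 2):  # each pair found in the sentence
        cooccur[(w1, w2)] += 1            # strength increases with each pair
        cooccur[(w2, w1)] += 1            # keep the matrix symmetric

print(cooccur[("solvent", "mixture")])    # 2: co-occurred in both sentences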
125
fundamental component of the processing of these distributional models...
similarity between words
126
typical similarity metric in distributional semantics
cosine
use a vector cosine: gives a value between 1 (very similar) and -1 (opposite); 0 = unrelated
the value reflects how aligned the vectors are in the feature space
highly aligned featural representations = a high similarity value
127
to determine if our model actually captures any semantic info...
we examine its performance with a word similarity task:
- get people to rate how similar a pair of words are on a scale
- this gives a set of values pertaining to the relations between words
- TAKE THE COSINE SIMILARITY OF EACH WORD PAIR (between 1 and -1)
TAKE THE CORRELATION between the cosine values the model has produced and the similarity values that people are producing
use these values to see how similar the model's and people's results are
ideally you want a positive correlation
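a minimal evaluation sketch using SciPy's spearmanr (the human ratings and model cosines are invented):

from scipy.stats import spearmanr

# hypothetical averaged human ratings (1-7 scale) and model cosines for the same word pairs
human_ratings = [6.8, 5.9, 2.1, 1.4]
model_cosines = [0.81, 0.65, 0.20, 0.05]

rho, p = spearmanr(human_ratings, model_cosines)
print(rho)  # positive correlation = the model captures human similarity judgments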