Week 7 - Distributional Semantics Flashcards

1
Q

Semantic Processing

A

The computer needs to “understand” what words mean in a given context

2
Q

Distributional Hypothesis

A

The hypothesis that we can infer the meaning of a word from the context it occurs in

Assumes contextual information alone constitutes a viable representation of linguistic items, in contrast to formal linguistics and the formal theory of grammar

3
Q

Distributional Semantic Model

A

Generates a high-dimensional feature vector to characterise each linguistic item

Subsequently, the semantic similarity between the linguistic items can be quantified in terms of vector similarity

4
Q

Linguistic Items

A

words (or word senses), phrases, text pieces (windows of words), sentences, documents, etc…

5
Q

Semantic space

A

The high-dimensional space computed by the distributional semantic model, also called the embedding space, (latent) representation space, etc…

6
Q

Vector distance function

A

Used to measure how dissimilar the vectors corresponding to two linguistic items are

7
Q

Vector similarity function

A

Used to measure how similar the vectors corresponding to two linguistic items are

8
Q

Examples of vector distance/similarity function

A

Euclidean Distance
Cosine Similarity
Inner Product Similarity

9
Q

Euclidean Distance

A

Given two d-dimensional vectors p and q:

sqrt( sum( (p_i - q_i)^2 for i = 1..d ) )
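
A minimal sketch of this formula in Python (NumPy assumed; the example vectors are made up for illustration):

```python
import numpy as np

def euclidean_distance(p: np.ndarray, q: np.ndarray) -> float:
    # Square root of the sum of squared component-wise differences
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Two 3-dimensional vectors
print(euclidean_distance(np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])))  # ~1.414
```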

10
Q

Inner Product Function

A

Given two d-dimensional vectors p and q:

sum( p_i * q_i for i = 1..d )
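
A minimal sketch of this formula in Python (NumPy assumed; example vectors are illustrative):

```python
import numpy as np

def inner_product(p: np.ndarray, q: np.ndarray) -> float:
    # Sum of the component-wise products p_i * q_i
    return float(np.dot(p, q))

print(inner_product(np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])))  # 4.0
```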

11
Q

Cosine Function

A

Given two d-dimensional vectors p and q:

sum( p_i * q_i for i = 1..d )

divided by

sqrt( sum( p_i^2 for i = 1..d ) ) * sqrt( sum( q_i^2 for i = 1..d ) )
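
A minimal sketch of this formula in Python (NumPy assumed; example vectors are illustrative):

```python
import numpy as np

def cosine_similarity(p: np.ndarray, q: np.ndarray) -> float:
    # Inner product of p and q divided by the product of their Euclidean norms
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

# Vectors pointing in the same direction have cosine similarity 1.0
print(cosine_similarity(np.array([1.0, 0.0, 2.0]), np.array([2.0, 0.0, 4.0])))  # 1.0
```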

12
Q

Vector Space Model

A

Count-based.

An algebraic model for representing a piece of text (referred to as a document) as a vector of indexed terms (e.g. words, phrases)

In the document vector, each feature value represents the count of an indexed term appearing in the relevant piece of text

Collecting many document vectors and storing them as matrix rows (or columns) results in the document-term matrix.

The context of a word may be treated as a mini-document
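
A minimal count-based sketch in Python (plain NumPy; the toy documents are made up for illustration):

```python
import numpy as np

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

# Vocabulary of indexed terms
vocab = sorted({w for d in docs for w in d.split()})
term_index = {t: j for j, t in enumerate(vocab)}

# Document-term matrix: one row per document, one column per term,
# each entry is the count of the term in that document
X = np.zeros((len(docs), len(vocab)), dtype=int)
for i, d in enumerate(docs):
    for w in d.split():
        X[i, term_index[w]] += 1

print(vocab)
print(X)
```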

13
Q

VSM term weighting schemes

A

Binary Weight

Term Frequency (tf)

Term Frequency Inverse Document Frequency (tf-idf)

14
Q

VSM binary weighting

A

Each element in the document-term matrix indicates the presence (1) or absence (0) of a word in a document

15
Q

VSM Term Frequency Weighting

A

Each element in the document-term matrix is the number of times a word appears in a document, called the term frequency (tf)

16
Q

Inverse Document Frequency

A

Considers how much information the word provides, i.e. whether it is common or rare across all documents

idf(k) = log(M / m(k))

Where:
M - total number of documents in the collection
m(k) - Number of documents in the collection that contain word k
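
A quick worked example (the log base is not fixed by this card; base 10 is assumed here): with M = 10,000 documents and a word k appearing in m(k) = 100 of them, idf(k) = log(10,000 / 100) = log(100) = 2, while a word that appears in every document gets idf(k) = log(1) = 0.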

17
Q

Term Frequency Inverse Document Frequency

A

For document i and word k
t(i,k) = tf(i,k) * idf(k)

idf(k) = log(M / m(k))

Where:
M - total number of documents in the collection
m(k) - Number of documents in the collection that contain word k

idf considers how much information the word provides, i.e. whether it is common or rare across all documents
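
A minimal sketch of these formulas in Python (NumPy; the toy term-frequency matrix is made up, and the natural log is assumed):

```python
import numpy as np

# Toy term-frequency matrix: rows = documents i, columns = words k
tf = np.array([[2, 1, 0],
               [1, 0, 3],
               [0, 0, 1]], dtype=float)

M = tf.shape[0]                      # total number of documents
m_k = np.count_nonzero(tf, axis=0)   # number of documents containing each word k
idf = np.log(M / m_k)                # idf(k) = log(M / m(k))

tfidf = tf * idf                     # t(i, k) = tf(i, k) * idf(k)
print(tfidf)
```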

18
Q

VSM for word similarity

A

Construct two vectors using the VSM (Vector space model)

Use cosine (or inner product) similarity to compute the similarity between the word vectors

Two approaches for getting word vectors:
- Based on documents
- Based on local context

19
Q

Context based word similarity

A

Instead of using a document-term matrix, use a word-term matrix, populating it with:

the co-occurrence of a term and a word within the given context windows of the term, as observed in a text collection.
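
A minimal sketch of building such a word-term co-occurrence matrix in Python (NumPy; the toy corpus and the window size of ±2 words are illustrative assumptions):

```python
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the log"]
window = 2  # context window of +-2 words around the target word

vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Word-term matrix: rows = target words, columns = context terms
C = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    words = sent.split()
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                C[idx[target], idx[words[j]]] += 1

# Word similarity = cosine similarity between the rows for "cat" and "dog"
cat, dog = C[idx["cat"]], C[idx["dog"]]
print(np.dot(cat, dog) / (np.linalg.norm(cat) * np.linalg.norm(dog)))
```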

20
Q

Context Engineering

A

How to choose the context for context-based word similarity

Options:
- The whole document that contains a word
- All words in a wide window of the target word
- Content words only (no stop words) in a wide window of the target word
- Content words only in a narrower window of the target word
- Content words in a narrower window of the target word, which are further selected using some lexical processing tools

21
Q

Document-Term Matrix

A

A matrix with columns of terms, and rows of documents

22
Q

Context Window Size

A

In the context of context engineering and context-based word similarity:

  • Instead of the entire document, use smaller contexts, e.g. a paragraph or a window of ±4 words

Shorter windows (1-3 words) - focus on syntax
Longer windows (4-10 words) - capture semantics

23
Q

Benefits of low-dimensional dense vectors

A

- Easier to use as features in machine learning models
- Less noisy

24
Q

Latent Semantic Indexing

A

Mathematically, it is a (truncated) singular value decomposition of the document-term matrix X = U D V^T

From the SVD result:
document vectors: the rows of UD
term vectors: the rows of VD

Between-document similarity: U D^2 U^T
Between-term similarity: V D^2 V^T

The dimension k is normally set to a low value, e.g. 50-1000 given 20k-50k terms

25
Q

Singular Value Decomposition

A

Decomposes a matrix into three matrix components U, D, and V.

If there are m documents and n terms, and X is an m x n matrix:
X = U D V^T

Each row of V is a k-dimensional vector related to a term (V has n rows, one per term)

Each row of U is a k-dimensional vector related to a document (U has m rows, one per document)

The dimension k is normally set to a low value, e.g. 50-1000 given 20k-50k terms

26
Q

Truncated SVD

A

Choose a value of k and truncate the SVD, keeping only the k largest singular values (and the corresponding columns of U and V)

Each row of UD provides a k-dimensional feature vector to characterise the row object

Each row of VD provides a k-dimensional feature vector to characterise the column object

Can be applied to any data matrix, not necessarily only the document-term matrix
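
A minimal sketch of truncated SVD on a small document-term matrix, following the UD / VD convention from these cards (NumPy; the matrix values and k = 2 are illustrative):

```python
import numpy as np

# Toy document-term matrix: m = 4 documents, n = 5 terms
X = np.array([[2, 1, 0, 0, 1],
              [1, 2, 0, 0, 0],
              [0, 0, 3, 1, 0],
              [0, 0, 1, 2, 1]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                      # keep only the k largest singular values
U_k = U[:, :k]             # m x k
D_k = np.diag(s[:k])       # k x k
V_k = Vt[:k, :].T          # n x k

doc_vectors = U_k @ D_k    # each row: k-dimensional vector for a document (UD)
term_vectors = V_k @ D_k   # each row: k-dimensional vector for a term (VD)

# Between-document similarities: U D^2 U^T, i.e. doc_vectors @ doc_vectors.T
print(doc_vectors @ doc_vectors.T)
print(term_vectors.shape)  # (5, 2)
```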

27
Q

Predictive word embedding models

A

Perform prediction tasks based on word co-occurrence information, e.g.:
- Whether a word appears in the context of a target word
- How many times a word appears in the context texts of a target word

Training examples are words and their contexts in a text corpus

Include:
- (general) continuous bag-of-words model
- skip-gram model
- GloVe model
- …

28
Q

Continuous bag-of-words (CBOW) model

A

Assuming there are V words in the vocabulary, we are dealing with a V-class classification task.

The input of each sample contains C context words

The objective is to learn a word embedding matrix W of V rows and N columns (N is a hyperparameter), i.e. a word embedding for every word in the vocabulary

Inputs to the model are one-hot encodings over the vocabulary, where the non-zero element represents the context word being input.

Feature extraction component (h - output of the hidden layer):
Copies the word embedding vectors for the context words from the rows of the embedding matrix, and averages them

Multi-class classification component:
Takes h as the feature input and assigns it to one of the word classes in the vocabulary using logistic regression (a linear classification model trained using the cross-entropy loss)

W' is an N x V matrix which denotes the multi-class classification weight matrix of the logistic regression model

Like skip-gram, it is based on whether two words appear in each other's context; it doesn't directly take into account the number of times two words appear in each other's context.
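
A minimal sketch of the CBOW forward pass and loss described above (NumPy; the vocabulary size, embedding size, word indices, and random weights are illustrative assumptions, and no training loop is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, C = 10, 4, 3                 # vocabulary size, embedding size, number of context words

W = rng.normal(size=(V, N))        # word embedding matrix (V x N)
W_prime = rng.normal(size=(N, V))  # multi-class classification weights (N x V)

context_ids = [2, 5, 7]            # indices of the C context words (their one-hot inputs)
target_id = 3                      # index of the target word to predict

# Feature extraction: copy the embedding rows of the context words and average them
h = W[context_ids].mean(axis=0)    # shape (N,)

# Multi-class classification: softmax over the V word classes
scores = h @ W_prime
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Cross-entropy loss for the true target word
loss = -np.log(probs[target_id])
print(loss)
```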

29
Q

Skip-gram model

A

Like the continuous bag-of-words model but flipped: it predicts the context words of a target word, instead of predicting the target word from its context

Like CBOW, it is based on whether two words appear in each other's context; it doesn't directly take into account the number of times two words appear in each other's context.
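
A matching sketch with the roles flipped, so the target word's embedding is used to predict each context word (same illustrative shapes and random weights as the CBOW sketch above):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                       # vocabulary size, embedding size
W = rng.normal(size=(V, N))        # word embedding matrix (V x N)
W_prime = rng.normal(size=(N, V))  # output weight matrix (N x V)

target_id = 3                      # input: the target word
context_ids = [2, 5, 7]            # outputs: the context words to be predicted

h = W[target_id]                   # the target word's embedding row
scores = h @ W_prime
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Loss: one cross-entropy term per context word
loss = -sum(np.log(probs[c]) for c in context_ids)
print(loss)
```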

30
Q

GloVe Model

A

Unlike CBOW and skip-gram, it utilises the frequency with which a word appears in another word's context in a given text corpus

31
Q

What to do with a semantic space

A

Clustering
- grouping similar words together

Data visualisation
- mapping the semantic space to a two (or three) dimensional space

Support other NLP tasks
- To be used as the input of machine learning models, e.g. neural networks for solving NLP tasks
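
A minimal sketch of the clustering use case (the word list and random placeholder vectors are made up; scikit-learn's KMeans is one possible choice, not something prescribed by these cards):

```python
import numpy as np
from sklearn.cluster import KMeans

words = ["cat", "dog", "car", "van", "paris", "london"]
# Placeholder embeddings; in practice these come from a distributional semantic model
vectors = np.random.default_rng(0).normal(size=(len(words), 50))

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)
for word, label in zip(words, labels):
    print(label, word)
```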

32
Q

Advantages and disadvantages of distributional semantics

A

Advantages:

  • Very practical in terms of processing
  • Effective in capturing word meaning and relations, and supporting the training of neural language models

Disadvantages:

  • It is still an open issue whether statistical co-occurrences alone are enough to address deeper semantic questions
  • Semantic similarity is still a vague notion. For instance, the association between “car” and “van” is different from that between “car” and “wheel” (semantic similarity vs semantic relatedness)
  • What type of semantic information can be captured from context, and what part of the meaning of words remains unreachable without complementary knowledge?
33
Q

Disadvantages of distributional semantics

A