Word2Vec Flashcards

1
Q

Is Word2Vec a dense or sparse embedding?

A

Dense

2
Q

Are dense or sparse embeddings better?

A

Dense embeddings work better in practice, although it is not entirely clear why

3
Q

What does static embedding mean?

A

It means each word is represented by a single, fixed embedding that does not change

4
Q

How does contextual embedding differ from static embedding?

A

The vector for the word will vary depending on its context

5
Q

What does self-supervision mean?

A

It means the model creates its own training signal from running text (neighbouring words act as the labels), so a large hand-tagged dataset is not needed

6
Q

What does Word2Vec do in simple terms?

A

It trains a binary classifier to predict whether word A is likely to show up near word B; we then use the learned embedding weights

7
Q

What model is used by Word2Vec?

A

The skip-gram model (with negative sampling, SGNS)

8
Q

What is the skip-gram model?

A

It takes target words and their neighbouring context words as positive examples, randomly samples other words to create negative examples, trains a logistic regression classifier to distinguish the two, and uses the learned weights as the embeddings
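To make this concrete, here is a minimal numpy sketch of one skip-gram negative-sampling update. The matrix names W and C follow the notation in the later cards; the vocabulary size, dimensions, learning rate and word ids are made-up illustration values, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 10_000, 100, 5                    # k = negative samples per positive
W = rng.normal(scale=0.01, size=(vocab_size, dim))     # target embeddings
C = rng.normal(scale=0.01, size=(vocab_size, dim))     # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target, context, noise_ids, lr=0.025):
    """One logistic-regression SGD update for a (target, context) pair plus k noise words."""
    w = W[target].copy()
    c_pos = C[context].copy()
    c_neg = C[noise_ids]                        # fancy indexing returns a copy
    g_pos = sigmoid(c_pos @ w) - 1.0            # positive pair: push towards label 1
    g_neg = sigmoid(c_neg @ w)                  # noise pairs:   push towards label 0
    W[target]    -= lr * (g_pos * c_pos + g_neg @ c_neg)
    C[context]   -= lr * g_pos * w
    C[noise_ids] -= lr * np.outer(g_neg, w)

# Hypothetical usage: word ids 42 (target) and 7 (context), plus 5 sampled noise ids
sgns_step(42, 7, rng.integers(0, vocab_size, size=5))
```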

9
Q

What does a skip-gram model store?

A

A target embedding matrix W for target words and a context embedding matrix C for context and noise words

10
Q

How do we avoid a bias towards common words when selecting noise words?

A

We sample noise words according to a weighted unigram frequency (word counts raised to a power, typically α = 0.75), which reduces the advantage of very common words relative to rarer ones
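A small sketch of that weighting, using made-up counts: raising counts to the power α (0.75 is the value commonly used with Word2Vec) flattens the distribution, so very frequent words are picked less often relative to rarer ones.

```python
import numpy as np

counts = {"the": 50_000, "and": 30_000, "potato": 120, "embedding": 40}   # toy counts
alpha = 0.75                                    # exponent commonly used with Word2Vec

words = list(counts)
freqs = np.array([counts[w] for w in words], dtype=float)
p = freqs**alpha / (freqs**alpha).sum()         # P_alpha(w) = count(w)^a / sum count(w')^a

rng = np.random.default_rng(0)
noise_words = rng.choice(words, size=5, p=p)    # sample k noise words per positive example
```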

11
Q

How are Word2Vec embeddings learnt?

A

They are learnt by minimising a loss function using stochastic gradient descent (SGD)

12
Q

What does the loss function do in Word2Vec?

A

It aims to maximise the probability that the target word co-occurs with the positive context examples and to minimise the probability that it co-occurs with the negative (noise) examples
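For reference, a sketch of that loss for a single training example, with illustrative variable names; minimising it by SGD (as in the previous card) pushes sigma(c_pos · w) towards 1 and sigma(c_neg · w) towards 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c_pos, c_negs):
    """Negative-sampling loss for one target embedding w (1-D numpy arrays throughout).

    Minimising this maximises sigma(c_pos @ w) for the positive context word
    and sigma(-c_neg @ w) for each sampled noise word.
    """
    pos_term = np.log(sigmoid(c_pos @ w))
    neg_terms = sum(np.log(sigmoid(-c_neg @ w)) for c_neg in c_negs)
    return -(pos_term + neg_terms)
```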

13
Q

Visually, what are we trying to do with Word2Vec?

A

We are trying to adjust the weights so that the target word's embedding moves closer to (becomes more associated with) the positive examples and further away from the negative examples

14
Q

How are the target matrix and context matrix initialised?

A

They are randomly initialised, typically with Gaussian noise

15
Q

How is the final word embedding matrix obtained?

A

Target Matrix (W) + Context Matrix (C)
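Putting the last two cards together in a short sketch, with toy sizes; some implementations keep only W rather than the sum.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100

W = rng.normal(scale=0.01, size=(vocab_size, dim))   # target matrix, random (Gaussian) init
C = rng.normal(scale=0.01, size=(vocab_size, dim))   # context matrix, random (Gaussian) init

# ... train with SGD as in the earlier cards ...

embeddings = W + C    # final word embedding matrix: one row per vocabulary word
```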

16
Q

What are some other static embeddings?

A

fastText and Global Vectors (GloVe)

17
Q

What words appear in a context window when it is small?

A

Words that are similar to the target word, such as words that might appear together in a list

18
Q

What words appear in a context window when it is larger?

A

Words that are associated with the target rather than similar to it, so a larger window captures longer-distance topical relationships

19
Q

What do analogies mean in relation to Word2Vec?

A

It looks at how A is to B as C is to …

20
Q

How are analogies computed using Word2Vec?

A

We compute the offset vector that takes you from A to B in the embedding space, apply that offset to word C, and find which words are most similar to the result
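A minimal sketch of that offset computation using cosine similarity; the vocabulary list and embedding matrix E are hypothetical stand-ins, and the input words are dropped from the candidates (see the next card).

```python
import numpy as np

def analogy(a, b, c, vocab, E, topn=3):
    """Return words d such that a : b :: c : d, using the offset b - a applied to c."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = E[idx[b]] - E[idx[a]] + E[idx[c]]           # apply the A -> B offset to C
    sims = (E @ target) / (np.linalg.norm(E, axis=1) * np.linalg.norm(target) + 1e-9)
    ranked = np.argsort(-sims)                           # most similar first
    exclude = {idx[a], idx[b], idx[c]}                   # don't return the input words
    return [vocab[i] for i in ranked if i not in exclude][:topn]
```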

21
Q

When using Word2Vec with analogies, what needs to be excluded?

A

Morphological variants of the input words (i.e. we don't want answers like potato → potato or potatoes, but we do want potato → brown)

22
Q

What are the problems with bias and embeddings?

A

The biases of the time the training data was written will be encoded in the embeddings, which can be problematic given that the world constantly changes

23
Q

What is allocation harm?

A

It is where bias in algorithms results in unfair real-world outcomes (e.g. a credit check that results in denial due to some underlying bias)

24
Q

What is bias amplification?

A

It is where embeddings exaggerate patterns in the training data, making the encodings even more biased than the original training resource; implicit bias (racial, sexist, ageist) can be captured and amplified

25
Q

What is representational harm?

A

It is where harm is caused by a system demeaning or ignoring some social groups

26
Q

What is debiasing?

A

It is a way of manipulating embeddings to remove unwelcome stereotypes; it may reduce bias but will not eliminate it
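One simple debiasing scheme, sketched here as an assumption about the kind of manipulation the card has in mind, is hard projection: remove the component of each embedding along an estimated bias direction. As the card says, this reduces rather than eliminates the bias.

```python
import numpy as np

def project_out(E, bias_direction):
    """Remove the component of every embedding along a (unit-normalised) bias direction."""
    b = bias_direction / np.linalg.norm(bias_direction)
    return E - np.outer(E @ b, b)

# Hypothetical usage: estimate a bias direction from a pair of word vectors,
# e.g. bias_dir = E[idx["he"]] - E[idx["she"]], then E_debiased = project_out(E, bias_dir)
```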

27
Q

When do two words have first-order co-occurrence?

A

When they are typically near each other

28
Q

When do two words have second-order co-occurrence?

A

When they have similar neighbours
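A tiny sketch contrasting the two notions on a made-up mini-corpus: first-order co-occurrence checks whether two words appear near each other, while second-order co-occurrence checks whether their neighbour (context-count) vectors are similar.

```python
import numpy as np

corpus = "the hot potato was eaten , the hot soup was eaten".split()   # toy corpus
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a +/- 2 word window
M = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - 2), min(len(corpus), i + 3)):
        if j != i:
            M[idx[w], idx[corpus[j]]] += 1

def first_order(a, b):
    return M[idx[a], idx[b]] > 0            # do they appear near each other?

def second_order(a, b):
    va, vb = M[idx[a]], M[idx[b]]
    return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))   # similar neighbours?

print(first_order("hot", "potato"))     # True: they co-occur directly
print(second_order("potato", "soup"))   # high: they share neighbours (the, hot, was, eaten)
```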