Week 7 - Distributional Semantics Flashcards

1
Q

Semantic Processing

A

The computer needs to “understand” what words mean in a given context

2
Q

Distributional Hypothesis

A

The hypothesis that we can infer the meaning of a word from the context it occurs in

Assumes contextual information alone constitutes a viable representation of linguistic items, in contrast to formal linguistics and the formal theory of grammar

3
Q

Distributional Semantic Model

A

Generates a high-dimensional feature vector to characterise each linguistic item

Subsequently, the semantic similarity between the linguistic items can be quantified in terms of vector similarity

4
Q

Linguistic Items

A

words (or word senses), phrases, text pieces (windows of words), sentences, documents, etc…

5
Q

Semantic space

A

The high-dimensional space computed by the distributional semantic model, also called the embedding space, (latent) representation space, etc…

6
Q

Vector distance function

A

Used to measure how dissimilar the vectors corresponding to two linguistic items are

7
Q

Vector similarity function

A

Used to measure how similar the vectors corresponding to two linguistic items are

8
Q

Examples of vector distance/similarity function

A

Euclidean Distance
Cosine Similarity
Inner Product Similarity

9
Q

Euclidean Distance

A

Given two d-dimensional vectors p and q:

sqrt( sum( (p_i - q_i)^2 for i = 1..d ) )
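
A minimal sketch of this formula in Python (NumPy assumed; the example vectors are made up for illustration):

```python
import numpy as np

def euclidean_distance(p: np.ndarray, q: np.ndarray) -> float:
    # Square root of the sum of squared component-wise differences
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Two 3-dimensional vectors
print(euclidean_distance(np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])))  # ~1.414
```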

10
Q

Inner Product Function

A

Given two d-dimensional vectors p and q:

sum( p_i * q_i for i = 1..d )
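
A minimal sketch of this formula in Python (NumPy assumed; example vectors are illustrative):

```python
import numpy as np

def inner_product(p: np.ndarray, q: np.ndarray) -> float:
    # Sum of the component-wise products p_i * q_i
    return float(np.dot(p, q))

print(inner_product(np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])))  # 4.0
```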

11
Q

Cosine Function

A

Given two d-dimensional vectors p and q:

sum( p_i * q_i for i = 1..d )

divided by

sqrt( sum( p_i^2 for i = 1..d ) ) * sqrt( sum( q_i^2 for i = 1..d ) )
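
A minimal sketch of this formula in Python (NumPy assumed; example vectors are illustrative):

```python
import numpy as np

def cosine_similarity(p: np.ndarray, q: np.ndarray) -> float:
    # Inner product of p and q divided by the product of their Euclidean norms
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

# Vectors pointing in the same direction have cosine similarity 1.0
print(cosine_similarity(np.array([1.0, 0.0, 2.0]), np.array([2.0, 0.0, 4.0])))  # 1.0
```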

12
Q

Vector Space Model

A

Count-based.

An algebraic model for representing a piece of text (referred to as a document) as a vector of indexed terms (e.g. words, phrases)

In the document vector, each feature value represents the count of an indexed term appearing in the relevant piece of text

Collecting many document vectors and storing them as matrix rows (or columns) results in the document-term matrix.

The context of a word may be treated as a mini-document
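
A minimal count-based sketch in Python (plain NumPy; the toy documents are made up for illustration):

```python
import numpy as np

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

# Vocabulary of indexed terms
vocab = sorted({w for d in docs for w in d.split()})
term_index = {t: j for j, t in enumerate(vocab)}

# Document-term matrix: one row per document, one column per term,
# each entry is the count of the term in that document
X = np.zeros((len(docs), len(vocab)), dtype=int)
for i, d in enumerate(docs):
    for w in d.split():
        X[i, term_index[w]] += 1

print(vocab)
print(X)
```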

13
Q

VSM term weighting schemes

A

Binary Weight

Term Frequency (tf)

Term Frequency Inverse Document Frequency (tf-idf)

14
Q

VSM binary weighting

A

Each element in the document-term matrix indicates the presence (1) or absence (0) of a word in a document

15
Q

VSM Term Frequency Weighting

A

Each element in the document-term matrix is the number of times a word appears in a document, called the term frequency (tf)

16
Q

Inverse Document Frequency

A

Considers how much information the word provides, i.e. whether it is common or rare across all documents

idf(k) = log(M / m(k))

Where:
M - total number of documents in the collection
m(k) - Number of documents in the collection that contain word k
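
A quick worked example (the log base is not fixed by this card; base 10 is assumed here): with M = 10,000 documents and a word k appearing in m(k) = 100 of them, idf(k) = log(10,000 / 100) = log(100) = 2, while a word that appears in every document gets idf(k) = log(1) = 0.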

17
Q

Term Frequency Inverse Document Frequency

A

For document i and word k
t(i,k) = tf(i,k) * idf(k)

idf(k) = log(M / m(k))

Where:
M - total number of documents in the collection
m(k) - Number of documents in the collection that contain word k

idf considers how much information the word provides, i.e. whether it is common or rare across all documents
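
A minimal sketch of these formulas in Python (NumPy; the toy term-frequency matrix is made up, and the natural log is assumed):

```python
import numpy as np

# Toy term-frequency matrix: rows = documents i, columns = words k
tf = np.array([[2, 1, 0],
               [1, 0, 3],
               [0, 0, 1]], dtype=float)

M = tf.shape[0]                      # total number of documents
m_k = np.count_nonzero(tf, axis=0)   # number of documents containing each word k
idf = np.log(M / m_k)                # idf(k) = log(M / m(k))

tfidf = tf * idf                     # t(i, k) = tf(i, k) * idf(k)
print(tfidf)
```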

18
Q

VSM for word similarity

A

Construct two vectors using the VSM (Vector space model)

Use cosine (or inner product) similarity to compute the similarity between the word vectors

Two approaches for getting word vectors:
- Based on documents
- Based on local context

19
Q

Context based word similarity

A

Instead of using a document-term matrix, use a word-term matrix, populating it with:

the co-occurrence of a term and a word within the given context windows of the term, as observed in a text collection.
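
A minimal sketch of building such a word-term co-occurrence matrix in Python (NumPy; the toy corpus and the window size of ±2 words are illustrative assumptions):

```python
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the log"]
window = 2  # context window of +-2 words around the target word

vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Word-term matrix: rows = target words, columns = context terms
C = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    words = sent.split()
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                C[idx[target], idx[words[j]]] += 1

# Word similarity = cosine similarity between the rows for "cat" and "dog"
cat, dog = C[idx["cat"]], C[idx["dog"]]
print(np.dot(cat, dog) / (np.linalg.norm(cat) * np.linalg.norm(dog)))
```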

20
Q

Context Engineering

A

How to choose the context for context-based word similarity

Options:
- The whole document that contains a word
- All words in a wide window of the target word
- Content words only (no stop words) in a wide window of the target word
- Content words only in a narrower window of the target word
- Content words in a narrower window of the target word, which are further selected using some lexical processing tools

21
Q

Document-Term Matrix

A

A matrix with columns of terms, and rows of documents

22
Q

Context Window Size

A

In the context of context engineering and context-based word similarity:

  • Instead of the entire document, use smaller contexts, e.g. a paragraph or a window of ±4 words

Shorter windows (1-3 words) - focus on syntax
Longer windows (4-10 words) - capture semantics

23
Q

Benefits of low-dimensional dense vectors

A

- Easier to use as features in machine learning models
- Less noisy

24
Q

Latent Semantic Indexing

A

Mathematically, it is a (truncated) singular value decomposition of the document-term matrix X = U D V^T

From the SVD result:
document vectors: the rows of UD
term vectors: the rows of VD

Between-document similarity: U D^2 U^T
Between-term similarity: V D^2 V^T

The dimension k is normally set to a low value, e.g. 50-1000 given 20k-50k terms

25
Q

Singular Value Decomposition

A

Decomposes a matrix into three matrix components U, D, and V.

If there are m documents and n terms, and X is an m x n matrix:
X = U D V^T

Each row of V is a k-dimensional vector related to a term (V has n rows, one per term)

Each row of U is a k-dimensional vector related to a document (U has m rows, one per document)

The dimension k is normally set to a low value, e.g. 50-1000 given 20k-50k terms

26
Q

Truncated SVD

A

Choose a value of k and truncate the SVD, keeping only the k largest singular values (and the corresponding columns of U and V)

Each row of UD provides a k-dimensional feature vector to characterise the row object

Each row of VD provides a k-dimensional feature vector to characterise the column object

Can be applied to any data matrix, not necessarily only the document-term matrix
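
A minimal sketch of truncated SVD on a small document-term matrix, following the UD / VD convention from these cards (NumPy; the matrix values and k = 2 are illustrative):

```python
import numpy as np

# Toy document-term matrix: m = 4 documents, n = 5 terms
X = np.array([[2, 1, 0, 0, 1],
              [1, 2, 0, 0, 0],
              [0, 0, 3, 1, 0],
              [0, 0, 1, 2, 1]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                      # keep only the k largest singular values
U_k = U[:, :k]             # m x k
D_k = np.diag(s[:k])       # k x k
V_k = Vt[:k, :].T          # n x k

doc_vectors = U_k @ D_k    # each row: k-dimensional vector for a document (UD)
term_vectors = V_k @ D_k   # each row: k-dimensional vector for a term (VD)

# Between-document similarities: U D^2 U^T, i.e. doc_vectors @ doc_vectors.T
print(doc_vectors @ doc_vectors.T)
print(term_vectors.shape)  # (5, 2)
```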

27
Q

Predictive word embedding models

A

Perform prediction tasks based on word co-occurrence information, e.g.:
- Whether a word appears in the context of a target word
- How many times a word appears in the context texts of a target word

Training examples are words and their contexts in a text corpus

Include:
- (general) continuous bag-of-words model
- skip-gram model
- GloVe model
- …

28
Q

Continuous bag-of-words (CBOW) model

A

Assuming there are V words in the vocabulary, we are dealing with a V-class classification task.

The input of each sample contains C context words

The objective is to learn a word embedding matrix W of V rows and N columns (N is a hyperparameter), i.e. a word embedding for every word in the vocabulary

Inputs to the model are one-hot encodings over the vocabulary, where the non-zero element represents the context word being input.

Feature extraction component (h - output of the hidden layer):
Copies the word embedding vectors for the context words from the rows of the embedding matrix, and averages them

Multi-class classification component:
Takes h as the feature input and assigns it to one of the word classes in the vocabulary using logistic regression (a linear classification model trained using the cross-entropy loss)

W' is an N x V matrix which denotes the multi-class classification weight matrix of the logistic regression model

Like skip-gram, it is based on whether two words appear in each other's context; it doesn't directly take into account the number of times two words appear in each other's context.
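
A minimal sketch of the CBOW forward pass and loss described above (NumPy; the vocabulary size, embedding size, word indices, and random weights are illustrative assumptions, and no training loop is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, C = 10, 4, 3                 # vocabulary size, embedding size, number of context words

W = rng.normal(size=(V, N))        # word embedding matrix (V x N)
W_prime = rng.normal(size=(N, V))  # multi-class classification weights (N x V)

context_ids = [2, 5, 7]            # indices of the C context words (their one-hot inputs)
target_id = 3                      # index of the target word to predict

# Feature extraction: copy the embedding rows of the context words and average them
h = W[context_ids].mean(axis=0)    # shape (N,)

# Multi-class classification: softmax over the V word classes
scores = h @ W_prime
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Cross-entropy loss for the true target word
loss = -np.log(probs[target_id])
print(loss)
```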

29
Q

Skip-gram model

A

Like the continuous bag-of-words model but flipped: it predicts the context words of a target word, instead of predicting the target word from its context

Like CBOW, it is based on whether two words appear in each other's context; it doesn't directly take into account the number of times two words appear in each other's context.
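
A matching sketch with the roles flipped, so the target word's embedding is used to predict each context word (same illustrative shapes and random weights as the CBOW sketch above):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                       # vocabulary size, embedding size
W = rng.normal(size=(V, N))        # word embedding matrix (V x N)
W_prime = rng.normal(size=(N, V))  # output weight matrix (N x V)

target_id = 3                      # input: the target word
context_ids = [2, 5, 7]            # outputs: the context words to be predicted

h = W[target_id]                   # the target word's embedding row
scores = h @ W_prime
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Loss: one cross-entropy term per context word
loss = -sum(np.log(probs[c]) for c in context_ids)
print(loss)
```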

30
Q

GloVe Model

A

Unlike CBOW and skip-gram, it utilises the frequency with which a word appears in another word's context in a given text corpus

31
Q

What to do with a semantic space

A

Clustering
- grouping similar words together

Data visualisation
- mapping the semantic space to a two (or three) dimensional space

Support other NLP tasks
- To be used as the input of machine learning models, e.g. neural networks for solving NLP tasks
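
A minimal sketch of the clustering use case (the word list and random placeholder vectors are made up; scikit-learn's KMeans is one possible choice, not something prescribed by these cards):

```python
import numpy as np
from sklearn.cluster import KMeans

words = ["cat", "dog", "car", "van", "paris", "london"]
# Placeholder embeddings; in practice these come from a distributional semantic model
vectors = np.random.default_rng(0).normal(size=(len(words), 50))

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)
for word, label in zip(words, labels):
    print(label, word)
```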

32
Q

Advantages and disadvantages of distributional semantics

A

Advantages:

  • Very practical in terms of processing
  • Effective in capturing word meaning and relations, and supporting the training of neural language models

Disadvantages:

  • It is still an open issue whether statistical co-occurrences alone are enough to address deeper semantic questions
  • Semantic similarity is still a vague notion. For instance, the association between “car” and “van” is different from that between “car” and “wheel” (semantic similarity vs semantic relatedness)
  • What type of semantic information can be captured from context, and what part of the meaning of words remains unreachable without complementary knowledge?
33
Q

Disadvantages of distributional semantics

A