Week 7 - Distributional Semantics Flashcards
Semantic Processing
The computer needs to “understand” what words mean in a given context
Distributional Hypothesis
The hypothesis that we can infer the meaning of a word from the context it occurs in
Assumes contextual information alone constitutes a viable representation of linguistic items, in contrast to formal linguistics and the formal theory of grammar
Distributional Semantic Model
Generate a high-dimensional feature vector to characterise a linguistic item
Subsequently, the semantic similarity between the linguistic items can be quantified in terms of vector similarity
Linguistic Items
words (or word senses), phrases, text pieces (windows of words), sentences, documents, etc…
Semantic space
The high-dimensional space computed by the distributional semantic model, also called the embedding space, (latent) representation space, etc…
Vector distance function
Used to measure how dissimilar two vectors, and hence their corresponding linguistic items, are
Vector similarity function
Used to measure how similar two vectors, and hence their corresponding linguistic items, are
Examples of vector distance/similarity function
Euclidean Distance
Cosine Similarity
Inner Product Similarity
Euclidean Distance
Given two d-dimensional vectors p and q:
dist(p, q) = sqrt( sum_{i=1..d} (p_i - q_i)^2 )
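A minimal NumPy sketch of this distance (the example vectors are illustrative, not from the notes):

```python
import numpy as np

def euclidean_distance(p, q):
    # square root of the sum of squared element-wise differences
    return np.sqrt(np.sum((p - q) ** 2))

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 0.0, 3.0])
print(euclidean_distance(p, q))  # ~2.236
```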
Inner Product Function
Given two d-dimensional vectors p and q:
inner(p, q) = sum_{i=1..d} p_i * q_i
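A minimal NumPy sketch, reusing the same illustrative vectors:

```python
import numpy as np

def inner_product(p, q):
    # sum of element-wise products
    return np.dot(p, q)

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 0.0, 3.0])
print(inner_product(p, q))  # 11.0
```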
Cosine Function
Given two d-dimensional vectors p and q:
cos(p, q) = ( sum_{i=1..d} p_i * q_i ) / ( sqrt( sum_{i=1..d} p_i^2 ) * sqrt( sum_{i=1..d} q_i^2 ) )
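A minimal NumPy sketch of cosine similarity, again with illustrative vectors:

```python
import numpy as np

def cosine_similarity(p, q):
    # inner product normalised by the two vector lengths
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 0.0, 3.0])
print(cosine_similarity(p, q))  # ~0.815
```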
Vector Space Model
Count-based
An algebraic model for representing a piece of text (referred to as a document) as a vector of indexed terms (e.g. words, phrases)
In the document vector, each feature value represents the count of an indexed term appearing in a relevant piece of text
Collecting many document vectors and storing them as matrix rows (or columns) yields the document-term matrix.
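A small sketch of how a count-based document-term matrix could be built; the toy documents and the plain-Python tokenisation are illustrative assumptions:

```python
import numpy as np

# Toy corpus (illustrative documents, not from the notes)
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

# Vocabulary of indexed terms
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Document-term matrix: rows = documents, columns = terms, values = counts
dtm = np.zeros((len(docs), len(vocab)), dtype=int)
for i, d in enumerate(docs):
    for w in d.split():
        dtm[i, index[w]] += 1

print(vocab)
print(dtm)
```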
Might treat the context of a word as a mini-document
VSM term weighting schemes
Binary Weight
Term Frequency (tf)
Term Frequency Inverse Document Frequency (tf-idf)
VSM binary weighting
Each element in the document-term matrix is the binary presence (or absence) of a word in a document
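A sketch of binary weighting, assuming a count-based document-term matrix is already available (the counts are made up):

```python
import numpy as np

# Count-based document-term matrix (illustrative values)
counts = np.array([[2, 0, 1],
                   [0, 3, 1],
                   [1, 1, 0]])

# Binary weighting: 1 if the term occurs in the document, else 0
binary = (counts > 0).astype(int)
print(binary)
```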
VSM Term Frequency Weighting
Each element in the document-term matrix is the number of times a word appears in a document, called term frequency (tf)
Inverse Document Frequency
Considers how much information the word provides, i.e. if it’s common or rare across all documents
idf(k) = log(M / m(k))
Where:
M - total number of documents in the collection
m(k) - Number of documents in the collection that contain word k
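A sketch of the idf computation over an assumed count matrix (illustrative values):

```python
import numpy as np

# Count-based document-term matrix (illustrative values)
counts = np.array([[2, 0, 1],
                   [0, 3, 1],
                   [1, 1, 0]])

M = counts.shape[0]            # total number of documents
m = (counts > 0).sum(axis=0)   # number of documents containing each word
idf = np.log(M / m)            # idf(k) = log(M / m(k))
print(idf)
```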
Term Frequency Inverse Document Frequency
For document i and word k
tfidf(i,k) = tf(i,k) * idf(k)
idf(k) = log(M / m(k))
Where:
M - total number of documents in the collection
m(k) - Number of documents in the collection that contain word k
idf considers how much information the word provides, i.e. if it’s common or rare across all documents
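A sketch combining tf and idf into tf-idf weights, again over an assumed count matrix:

```python
import numpy as np

# Term-frequency document-term matrix (illustrative values)
tf = np.array([[2, 0, 1],
               [0, 3, 1],
               [1, 1, 0]])

M = tf.shape[0]
m = (tf > 0).sum(axis=0)
idf = np.log(M / m)

# tfidf(i,k) = tf(i,k) * idf(k), broadcast across all documents
tfidf = tf * idf
print(tfidf)
```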
VSM for word similarity
Construct two vectors using the VSM (Vector space model)
Use cosine (or inner product) similarity to compute the similarity between the word vectors
Two approaches for getting word vectors:
Based on documents
Based on local context
Context based word similarity
Instead of using a document-term matrix, use a word-term matrix, populating it with:
the co-occurrence of a term and a word within the given context windows of the term, as observed in a text collection.
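A sketch of building such a word co-occurrence matrix; the toy sentences and the ±2-word window are assumptions for illustration:

```python
import numpy as np

# Toy text collection (illustrative)
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]
window = 2  # assumed context window of +-2 words

vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# Word-term matrix: co-occurrence counts within the context window
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for s in sentences:
    for i, w in enumerate(s):
        for j in range(max(0, i - window), min(len(s), i + window + 1)):
            if j != i:
                cooc[index[w], index[s[j]]] += 1

print(vocab)
print(cooc)
```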
Context Engineering
How to choose the context for context based word similarity
Options:
- Whole document that contains a word
- All words in a wide window of the target word
- Content words only (no stop words) in a wide window of the target word
- Content words only in a narrower window of the target word
- Content words in a narrower window of the target word, which are further selected using lexical processing tools
Document-Term Matrix
A matrix with columns of terms, and rows of documents
Context Window Size
In the context of Context Engineering and Context based word similarity:
- Instead of the entire document, use smaller contexts, e.g. a paragraph or a window of ±4 words
Shorter window (1-3 words) - focus on syntax
Longer window (4-10 words) - capture semantics
Benefits of low-dimensional dense vectors
- Easier to use as features in machine learning models
- Less noisy
Latent Semantic Indexing
Mathematically, it is a (truncated) singular value decomposition (SVD) of the document-term matrix, X ≈ U D V^T
From the SVD result, compute:
document vectors: U D
term vectors: V D
Between-document similarity: U D^2 U^T
Between-term similarity: V D^2 V^T
The dimension k is normally set to a low value, e.g. 50-1000, given 20k-50k terms
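A sketch of LSI via NumPy's SVD on a tiny illustrative document-term matrix (the matrix values and k=2 are assumptions):

```python
import numpy as np

# Document-term matrix (illustrative counts): rows = documents, columns = terms
X = np.array([[2.0, 0.0, 1.0, 0.0],
              [0.0, 3.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 2.0]])

# SVD: X = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top-k dimensions (k much smaller than the number of terms)
k = 2
U_k, D_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

doc_vectors = U_k @ D_k            # document vectors: U D
term_vectors = V_k @ D_k           # term vectors: V D
doc_sim = U_k @ D_k**2 @ U_k.T     # between-document similarity: U D^2 U^T
term_sim = V_k @ D_k**2 @ V_k.T    # between-term similarity: V D^2 V^T

print(doc_vectors.shape, term_vectors.shape)
```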