Test 6 Deep Learning Flashcards
(36 cards)
2012
AlexNet - Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton - revolutionized computer vision by blowing the competition out of the water in the ImageNet competition
Bag of Words - feature vector size
100,000 (the vocabulary size - the feature vector has one entry per word in the vocabulary)
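A minimal sketch of bag-of-words featurization, assuming a hypothetical toy vocabulary (a real vocabulary, and hence the vector, would be on the order of 100,000 entries):

```python
from collections import Counter

# Hypothetical toy vocabulary; real systems use ~100,000 words.
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def bag_of_words(tokens):
    """Return a count vector of size len(vocab); out-of-vocabulary words are dropped."""
    vec = [0] * len(vocab)
    for word, count in Counter(tokens).items():
        if word in word_to_idx:
            vec[word_to_idx[word]] = count
    return vec

print(bag_of_words("the cat sat on the mat".split()))
# [2, 1, 1, 1, 1, 0]
```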
language model
a probability distribution over strings (sequences of words)
what is the probability of a sequence of words?
P(w_1:M) = product (big pi) over m of P(w_m | w_1:m-1), e.g. P(w1, w2, w3) = P(w1) P(w2 | w1) P(w3 | w1, w2)
What is the issue with computing the entire probability of a word given the previous sequence
for long sequences (large m) it is not feasible: most long histories never occur in the training data, so their conditional probabilities cannot be estimated
n-gram
an n-gram model approximates the probability of a word using only the n-1 previous words: P(w_m | w_1:m-1) ≈ P(w_m | w_m-[n-1]:m-1)
Maximum likelihood estimator for an n-gram
P(w_m | w_m-[n-1]:m-1) = C(w_m-[n-1]:m-1 w_m) / Σ_w C(w_m-[n-1]:m-1 w)
which is the same as
C(w_m-[n-1]:m-1 w_m) / C(w_m-[n-1]:m-1)
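A minimal sketch of the maximum likelihood estimate for a bigram model (n = 2), assuming a hypothetical toy corpus; the estimate is just the ratio of counts, as in the formula above:

```python
from collections import Counter

# Hypothetical toy corpus.
corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams and the unigram prefixes that have a successor.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p_mle(w, prev):
    """MLE of P(w | prev) = C(prev w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("cat", "the"))  # C(the cat) = 2, C(the) = 3 -> 0.666...
```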
Perplexity
Weighted average branching factor -
PP(w_1:M) = P(w_1:M)^(-1/M), i.e. the Mth root of 1 / P(w_1:M)
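A minimal sketch of computing perplexity, assuming `probs` holds hypothetical per-word conditional probabilities of a sequence; the computation is done in log space to avoid underflow on long sequences:

```python
import math

# Hypothetical per-word probabilities P(w_m | w_1:m-1) for a 4-word sequence.
probs = [0.2, 0.5, 0.1, 0.25]

# Perplexity = P(w_1:M)^(-1/M), computed via log probabilities.
log_prob = sum(math.log(p) for p in probs)
perplexity = math.exp(-log_prob / len(probs))
print(perplexity)  # ~4.47: on average ~4.5 plausible choices per word
```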
branching factor
the number of possible next words that can follow a word
Greedy Decoding
w_m = argmax_w P(w | w_1:m-1) - choose the most likely next word at each step
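A minimal sketch of one greedy step, assuming `next_word_probs` is a hypothetical model distribution over the vocabulary given the prefix:

```python
# Hypothetical next-word distribution given some prefix w_1:m-1.
next_word_probs = {"mat": 0.5, "dog": 0.3, "moon": 0.2}

# Greedy decoding: always emit the single most likely word.
w_m = max(next_word_probs, key=next_word_probs.get)
print(w_m)  # "mat"
```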
Sampling Decoding
Partition [0, 1] into intervals whose lengths equal each word's probability, sample from a [0, 1] uniform distribution, and emit the word whose interval contains the sample
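A minimal sketch of sampling via cumulative intervals, reusing the hypothetical distribution from above:

```python
import random

next_word_probs = {"mat": 0.5, "dog": 0.3, "moon": 0.2}

def sample_word(probs):
    """Pick the word whose sub-interval of [0, 1] contains a uniform draw."""
    u = random.random()
    cumulative = 0.0
    for word, p in probs.items():
        cumulative += p
        if u < cumulative:
            return word
    return word  # guard against floating-point round-off

print(sample_word(next_word_probs))
```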
Top-K Sampling
Sample only from the K most likely entries, after renormalizing their probabilities
Top-p sampling
sample from the smallest set of entries whose cumulative probability exceeds p, after renormalizing
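A minimal sketch of both truncation schemes on a hypothetical distribution; in each case the kept probabilities are renormalized before sampling:

```python
def top_k_filter(probs, k):
    """Keep the k most likely words and renormalize."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}

def top_p_filter(probs, p_threshold):
    """Keep the smallest set whose cumulative probability exceeds p, then renormalize."""
    kept, cumulative = {}, 0.0
    for w, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[w] = p
        cumulative += p
        if cumulative > p_threshold:
            break
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}

next_word_probs = {"mat": 0.5, "dog": 0.3, "moon": 0.15, "sky": 0.05}
print(top_k_filter(next_word_probs, 2))    # {'mat': 0.625, 'dog': 0.375}
print(top_p_filter(next_word_probs, 0.7))  # {'mat': 0.625, 'dog': 0.375}
```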
Beam Search
Keep K hypotheses at each step
Extend each hypothesis with its K most likely next words (bringing the total to K^2 candidates)
Keep the K highest-scoring of the K^2 hypotheses - rinse and repeat
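A minimal beam search sketch over a hypothetical bigram model `next_probs`; each step expands every hypothesis by its K best continuations, then keeps the K best of the K^2 candidates (hypotheses are scored by summed log probabilities):

```python
import math

# Hypothetical next-word distributions keyed by the last word of the prefix.
next_probs = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "mat": 0.2},
    "a":   {"dog": 0.6, "cat": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "mat": {"sat": 0.5, "ran": 0.5},
}

def beam_search(k, steps):
    beams = [(["<s>"], 0.0)]  # (hypothesis, log probability)
    for _ in range(steps):
        candidates = []
        for words, score in beams:
            # Extend each hypothesis with its k most likely next words.
            dist = next_probs[words[-1]]
            for w, p in sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]:
                candidates.append((words + [w], score + math.log(p)))
        # Keep the k highest-scoring of the (up to) k^2 candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

for words, score in beam_search(k=2, steps=3):
    print(" ".join(words[1:]), math.exp(score))
# the cat sat 0.21
# a dog ran  0.192
```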
What is the distance between any two words in one-hot representation?
√2 in Euclidean distance (2 in Hamming distance, since exactly two coordinates differ) - all words are equidistant
Distributional Hypothesis
Words that occur in similar contexts tend to have similar meanings
Cosine Similarity Formula
u^T v / (||u|| ||v||)
Cosine Similarity Geometric Interpretation
The cosine of the angle between the two vectors (equivalently, the dot product of the two normalized vectors)
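A minimal numpy sketch of the formula above, on hypothetical vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    """u^T v / (||u|| ||v||): 1 for parallel vectors, 0 for orthogonal ones."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 4.0, 0.0])
print(cosine_similarity(u, v))                       # 1.0 (same direction)
print(cosine_similarity(u, np.array([0.0, 0.0, 3.0])))  # 0.0 (orthogonal)
```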
Word Co-Occurrence Matrix
For each pair of words, count how many times one appears within a fixed context window of the other - the matrix is vocab x vocab
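A minimal sketch of building a vocab x vocab co-occurrence matrix from a hypothetical toy corpus with a symmetric window of 1 (real matrices use a much larger corpus and window):

```python
import numpy as np

corpus = "the cat sat on the mat".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

window = 1  # count neighbors up to this many positions away, on both sides
C = np.zeros((len(vocab), len(vocab)), dtype=int)
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            C[idx[w], idx[corpus[j]]] += 1

print(vocab)
print(C)  # the row for "the" counts its neighbors "cat", "on", and "mat"
```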
Word Vector
one row of the co-occurrence matrix: the vector of co-occurrence counts for a single word
What is the issue with taking cosine similarity of word vectors?
They're very sparse and high-dimensional, so most similarities come out close to zero
Word Embedding
C (the co-occurrence matrix) can be approximated with C ≈ EU^T, where E is of size v x d and U is of size v x d (with d much smaller than v)
What is the rank of the co-occurence matrix vs the embedding matrices?
The co-occurrence matrix is (generally) full rank; the approximation EU^T has rank at most d
How can you calculate E and U^T?
Take the singular value decomposition C = USV^T and keep only the first d singular components (truncated SVD), e.g. E = U_d S_d and U = V_d
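A minimal sketch with numpy's SVD on a hypothetical symmetric co-occurrence matrix; keeping the top d singular components gives a rank-d approximation C ≈ EU^T (the split E = U_d S_d, U = V_d is one common convention, not the only one):

```python
import numpy as np

# Hypothetical symmetric co-occurrence matrix C (vocab x vocab).
C = np.array([
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 2],
    [2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0],
], dtype=float)

d = 2  # embedding dimension; real systems use d in the hundreds
U, S, Vt = np.linalg.svd(C)

# Keep the top-d singular components: C ≈ E @ Ud.T with E = U_d S_d, Ud = V_d.
E = U[:, :d] * S[:d]   # v x d embedding matrix
Ud = Vt[:d, :].T       # v x d
print(np.round(E, 2))
print("rank-d reconstruction error:", np.linalg.norm(C - E @ Ud.T))
```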