Test 6 Deep Learning Flashcards

(36 cards)

1
Q

2012

A

AlexNet - Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton - revolutionized computer vision by blowing the competition out of the water in the ImageNet competition

2
Q

Bag of Words - feature vector size

A

100,000 (vocabulary size)

3
Q

language model

A

a probability distribution over strings

4
Q

what is the probability of a sequence of words?

A

P(w_1:M) = the product (big Pi) over m of P(w_m | w_1:m-1)

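A minimal sketch of this chain-rule factorization, assuming a hypothetical cond_prob(word, history) stand-in for a language model's conditional:

```python
# Chain rule for a word sequence: P(w_1:M) = product over m of P(w_m | w_1:m-1).
def sequence_probability(words, cond_prob):
    prob = 1.0
    for m, word in enumerate(words):
        prob *= cond_prob(word, words[:m])  # P(w_m | w_1:m-1)
    return prob

# Toy usage: a "model" that ignores history and assigns uniform probability.
vocab = ["the", "cat", "sat"]
uniform = lambda word, history: 1.0 / len(vocab)
print(sequence_probability(["the", "cat", "sat"], uniform))  # (1/3)^3
```
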
5
Q

What is the issue with computing the probability of a word given the entire previous sequence?

A

for long sequences (large m) it is not feasible

6
Q

n-gram

A

an n-gram is a sequence of n words; an n-gram model conditions the probability of a word only on the n-1 previous words

7
Q

Maximum likelihood estimator for an n-gram

A

P(w_m | w_m-[n-1]:m-1) = C(w_m-[n-1]:m-1 w_m) / Sum over w of C(w_m-[n-1]:m-1 w)

which is the same as

C(w_m-[n-1]:m-1 w_m) / C(w_m-[n-1]:m-1)

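A toy count-based illustration of this estimator (the corpus and helper names are mine):

```python
from collections import Counter

def ngram_mle(tokens, n):
    """MLE: C(prefix + w_m) divided by the sum over w of C(prefix + w)."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    prefixes = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    return {gram: count / prefixes[gram[:-1]] for gram, count in ngrams.items()}

tokens = "the cat sat on the mat the cat ran".split()
bigrams = ngram_mle(tokens, 2)
print(bigrams[("the", "cat")])  # C(the cat) / C(the .) = 2/3
```
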
8
Q

Perplexity

A

Weighted average branching factor -

the M-th root of 1/P(w_1:M), i.e. P(w_1:M)^(-1/M)

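A quick numeric sketch, assuming per-token log-probabilities are available; the uniform 1/10 example ties perplexity back to branching factor:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = (1 / P(w_1:M))^(1/M), computed from per-token log-probs."""
    M = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / M)

# A model that assigns every word probability 1/10 has perplexity 10:
# a uniform 10-way branching factor.
print(perplexity([math.log(0.1)] * 20))  # ~10.0
```
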
9
Q

branching factor

A

the number of possible next words that can follow a word

10
Q

Greedy Decoding

A

w_m = argmax_w P(w | w_1:m-1) - choose the most likely next word

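A one-line sketch over an assumed toy next-word distribution:

```python
# Greedy decoding step: w_m = argmax_w P(w | w_1:m-1).
next_word_probs = {"mat": 0.5, "hat": 0.3, "dog": 0.2}  # assumed toy conditional
print(max(next_word_probs, key=next_word_probs.get))    # "mat"
```
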
11
Q

Sampling Decoding

A

Cover [0, 1] with intervals whose lengths equal each word's probability, then draw from a uniform distribution on [0, 1] and pick the word whose interval contains the draw

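A sketch of that interval trick, assuming a toy word-to-probability dict:

```python
import random

def sample_next_word(probs):
    """Intervals of length P(w) tile [0, 1]; return the word whose interval
    contains a uniform draw."""
    u = random.uniform(0.0, 1.0)
    cumulative = 0.0
    for word, p in probs.items():
        cumulative += p
        if u <= cumulative:
            return word
    return word  # guard against floating-point round-off

print(sample_next_word({"mat": 0.5, "hat": 0.3, "dog": 0.2}))
```
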
12
Q

Top-K Sampling

A

Consider the K most likely entries

13
Q

Top-p sampling

A

consider the smallest set of entries whose cumulative probability exceeds p

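A combined sketch of Top-K and Top-p truncation over an assumed toy distribution, with renormalization after truncation:

```python
def top_k_filter(probs, k):
    """Keep the k most likely words, then renormalize."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}

def top_p_filter(probs, p_threshold):
    """Keep the smallest set whose cumulative probability reaches p, then renormalize."""
    kept, cumulative = {}, 0.0
    for w, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[w] = p
        cumulative += p
        if cumulative >= p_threshold:
            break
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}

probs = {"mat": 0.5, "hat": 0.3, "dog": 0.15, "car": 0.05}
print(top_k_filter(probs, 2))    # {'mat': 0.625, 'hat': 0.375}
print(top_p_filter(probs, 0.9))  # keeps mat, hat, dog
```
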
14
Q

Beam Search

A

Generate K hypotheses.
Extend each hypothesis by K continuations (bringing the total to K^2).

Keep the best K of the K^2 hypotheses - rinse and repeat.

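A minimal beam-search sketch, assuming a hypothetical step_fn(prefix) that returns P(word | prefix); real decoders also stop hypotheses at an end-of-sequence token.

```python
import math

def beam_search(step_fn, beam_size, max_len):
    """step_fn(prefix) -> {word: P(word | prefix)}. Returns the best hypothesis."""
    beams = [([], 0.0)]  # (prefix, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            probs = step_fn(prefix)
            # Extend each hypothesis by its beam_size best continuations
            # (beam_size^2 candidates in total).
            best = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:beam_size]
            candidates += [(prefix + [w], score + math.log(p)) for w, p in best]
        # Keep only the beam_size best hypotheses, then repeat.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]

# Toy usage: a "model" that prefers to repeat "a" once it has produced one.
def step_fn(prefix):
    return {"a": 0.7, "b": 0.3} if prefix and prefix[-1] == "a" else {"a": 0.4, "b": 0.6}

print(beam_search(step_fn, beam_size=2, max_len=3))  # (['b', 'b', 'b'], ...)
```
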
15
Q

What is the distance between any two words in one-hot representation?

A

2 - all words are equidistant

16
Q

Distributional Hypothesis

A

Words that occur in similar contexts tend to have similar meanings

17
Q

Cosine Similarity Formula

A

u^T v / (||u|| ||v||)

18
Q

Cosine Similarity Geometric Interpretation

A

The cosine of the angle between the two vectors (the dot product of the normalized vectors)
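A direct transcription of the formula with numpy (assumed available):

```python
import numpy as np

def cosine_similarity(u, v):
    """u^T v / (||u|| ||v||): the cosine of the angle between u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 0.0])
print(cosine_similarity(u, 2 * u))                      # 1.0: same direction
print(cosine_similarity(u, np.array([0.0, 0.0, 3.0])))  # 0.0: orthogonal
```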

19
Q

Word Co-Occurrence Matrix

A

Count how many times each pair of words co-occurs within a specific context window - size is vocab x vocab

20
Q

Word Vector

A

a single row of the co-occurrence matrix - the co-occurrence counts for one word
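A small sketch of building the vocab x vocab matrix with a symmetric context window; each row is then one word vector:

```python
import numpy as np

def cooccurrence_matrix(tokens, window=2):
    """Count, for every word pair, how often they appear within `window` positions."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)), dtype=np.int64)  # vocab x vocab
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                C[index[w], index[tokens[j]]] += 1
    return C, vocab

tokens = "the cat sat on the mat".split()
C, vocab = cooccurrence_matrix(tokens)
print(vocab)
print(C[vocab.index("cat")])  # the word vector (co-occurrence row) for "cat"
```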

21
Q

What is the issue with taking cosine similarity of word vectors?

A

They are pretty sparse, so cosine similarities end up close to zero

22
Q

Word Embedding

A

C (the co-occurrence matrix) can be approximated as C = EU^T, where E is of size v x d and U is of size v x d

23
Q

What is the rank of the co-occurence matrix vs the embedding matrices?

A

Full rank vs maximum rank d

24
Q

How can you calculate E and U^T?

A

Do singular value decomposition and consider the first d components
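A sketch of that truncated-SVD construction; how the singular values are split between E and U is a choice I am assuming here:

```python
import numpy as np

def embed_by_svd(C, d):
    """Approximate C ~ E @ U.T using the d largest singular components."""
    W, S, Vt = np.linalg.svd(C.astype(float), full_matrices=False)
    E = W[:, :d] * S[:d]   # v x d (singular values folded into E)
    U = Vt[:d, :].T        # v x d
    return E, U

C = np.random.default_rng(0).poisson(1.0, size=(6, 6))  # stand-in co-occurrence counts
E, U = embed_by_svd(C, d=2)
print(np.linalg.norm(C - E @ U.T))  # error of the rank-d approximation
```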

25
Q

How do you train a fully connected architecture for word embeddings?

A

Given a corpus, let V = {w | C(w) >= C_min} with |V| = v. Split the corpus into overlapping windows of size M. Initialize a random embedding matrix. Represent w_1:M as [e_1, e_2, ..., e_M], so x^T has size 1 x Md. Use one-hot w_M+1 as the target.

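A minimal PyTorch sketch of this recipe (the framework and the specific sizes are my assumptions, not the cards'): embed M context words, flatten to a length-Md vector, and predict w_M+1 with a fully connected layer.

```python
import torch
import torch.nn as nn

v, d, M = 1000, 32, 4  # assumed vocabulary size, embedding dim, window size

model = nn.Sequential(
    nn.Embedding(v, d),   # randomly initialized embedding matrix, size v x d
    nn.Flatten(),         # [e_1, ..., e_M] -> x of size Md
    nn.Linear(M * d, v),  # score every word in V as the next word
)
loss_fn = nn.CrossEntropyLoss()  # target is the index of w_M+1 (one-hot in effect)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step on random stand-ins for corpus windows.
windows = torch.randint(0, v, (8, M))   # batch of 8 windows w_1:M
targets = torch.randint(0, v, (8,))     # the next word w_M+1 for each window
loss = loss_fn(model(windows), targets)
loss.backward()
optimizer.step()
print(float(loss))
```
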
26
Q

Why is contextualized word representation necessary?

A

Words have multiple meanings, so we need representations that reflect the context, leading to BERT.

27
Q

2018

A

BERT - Bidirectional Encoder Representations from Transformers.

28
Q

What are the input and output of the RNN language model, represented both as a function of the representation and as a sequence?

A

Inputs: {x[m]}, m = 1..M-1, given by Emb(w_1:M-1). Targets: {y[m]}, m = 1..M-1, given by one-hot(w_2:M).

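A short sketch of that input/target shift for a batch of token ids (PyTorch assumed; the one-hot targets appear as class indices):

```python
import torch
import torch.nn as nn

v, d = 1000, 32                    # assumed vocabulary size and embedding dim
emb = nn.Embedding(v, d)

w = torch.randint(0, v, (1, 10))   # a toy sentence w_1:M with M = 10 token ids
x = emb(w[:, :-1])                 # inputs  x[m] = Emb(w_1:M-1),  m = 1..M-1
y = w[:, 1:]                       # targets y[m] = w_2:M (indices stand in for one-hots)
print(x.shape, y.shape)            # torch.Size([1, 9, 32]) torch.Size([1, 9])
```
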
29
Q

What are the RNN limitations for machine translation?

A

Nearby-context bias, fixed context size, and slow sequential data processing.

30
Q

2015

A

Bahdanau, Cho, and Bengio - "Neural Machine Translation by Jointly Learning to Align and Translate" - the attentional seq2seq model, with dynamic focus on relevant parts of the input sequence.

31
Q

How is the hidden state of the decoder generated?

A

Project the hidden state of the decoder at the previous step onto the hidden states of the encoder: S^T h[n-1].

32
Q

Target Language Decoding

A

Autoregressive greedy decoding: feed the embedding of the start token to the decoder to get y~[1], then feed Emb(argmax(y~[1])) to get y~[2], and so on until argmax(y~[n]) is the end-of-sequence token.

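A sketch of the autoregressive loop, with decoder_step, emb and the special-token ids as hypothetical stand-ins for the real decoder:

```python
def greedy_translate(decoder_step, emb, sos_id, eos_id, max_len=50):
    """decoder_step(prev_word_embedding, state) -> (probs over target vocab, state)."""
    output, state, token = [], None, sos_id     # start by feeding Emb(<sos>)
    for _ in range(max_len):
        probs, state = decoder_step(emb(token), state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax(y~[n])
        if token == eos_id:                     # stop once <eos> is produced
            break
        output.append(token)
    return output

# Toy usage with a fake 4-word target vocabulary where id 3 is <eos>.
fake_emb = lambda t: t
def fake_step(prev, state):
    n = (state or 0) + 1
    return ([0.1, 0.2, 0.6, 0.1] if n < 3 else [0.0, 0.0, 0.0, 1.0]), n

print(greedy_translate(fake_step, fake_emb, sos_id=0, eos_id=3))  # [2, 2]
```
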
33
Q

What do you calculate for self-attention and how often?

A

For every x in {x[m]}, m = 1..M, i.e. X, compute a query q = W_q x and a key k = W_k x. For the entire X, compute K = W_k X and the values V = W_v X; then the query-key match r = K^T q, a = softargmax(r), and c = Va.

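A numpy sketch of these quantities for a single query position (dimensions assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
M, d_in, d_k = 5, 8, 4              # sequence length, input dim, projection dim

X = rng.normal(size=(d_in, M))      # columns are x[1], ..., x[M]
Wq = rng.normal(size=(d_k, d_in))
Wk = rng.normal(size=(d_k, d_in))
Wv = rng.normal(size=(d_k, d_in))

x = X[:, 0]                         # one position's input
q = Wq @ x                          # its query
K = Wk @ X                          # keys for the entire X
V = Wv @ X                          # values for the entire X
r = K.T @ q                         # query-key match scores
a = np.exp(r) / np.exp(r).sum()     # softargmax(r)
c = V @ a                           # self-attention output for this position
print(c.shape)                      # (4,)
```
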
34
Q

Describe the unique properties of c, the self-attention vector

A

It is permutation-invariant with respect to X, but permutation-equivariant with respect to x.

35
Q

How do you encode positional information in a set?

A

A sequence can be cast to an augmented set by adding index information to each input representation: additive positional encoding, {x[m]}, m = 1..M -> {x[m] + Enc(m)}, m = 1..M.

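A sketch of additive positional encoding using the sinusoidal Enc(m) from the transformer paper (the cards only require some index-dependent Enc(m)):

```python
import numpy as np

def sinusoidal_encoding(M, d):
    """Enc(m) for m = 1..M, stacked as an M x d matrix of sin/cos features."""
    positions = np.arange(M)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))  # one frequency per sin/cos pair
    enc = np.zeros((M, d))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc

M, d = 10, 16
X = np.random.default_rng(0).normal(size=(M, d))  # {x[m]}, one row per position
print((X + sinusoidal_encoding(M, d)).shape)      # {x[m] + Enc(m)}: (10, 16)
```
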
36
Q

2017

A

Vaswani et al. - "Attention Is All You Need" - replaced recurrence and convolutions with self-attention mechanisms for sequence modeling.