Test 6 Deep Learning Flashcards

(36 cards)

1
Q

2012

A

AlexNet - Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton - revolutionized computer vision by blowing the competition out of the water in the ImageNet competition

2
Q

Bag of Words - feature vector size

A

100,000 (vocabulary size)

3
Q

language model

A

a probability distribution over strings

4
Q

what is the probability of a sequence of words?

A

P(w_1:M) = the product (big Pi) over m of P(w_m | w_1:m-1)

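A minimal sketch of this chain-rule factorization, assuming a hypothetical cond_prob(word, history) stand-in for a language model's conditional:

```python
# Chain rule for a word sequence: P(w_1:M) = product over m of P(w_m | w_1:m-1).
def sequence_probability(words, cond_prob):
    prob = 1.0
    for m, word in enumerate(words):
        prob *= cond_prob(word, words[:m])  # P(w_m | w_1:m-1)
    return prob

# Toy usage: a "model" that ignores history and assigns uniform probability.
vocab = ["the", "cat", "sat"]
uniform = lambda word, history: 1.0 / len(vocab)
print(sequence_probability(["the", "cat", "sat"], uniform))  # (1/3)^3
```
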
5
Q

What is the issue with computing the probability of a word given the entire previous sequence?

A

for long sequences (large m) it is not feasible

6
Q

n-gram

A

an n-gram is a sequence of n words; an n-gram model conditions the probability of a word only on the n-1 previous words

7
Q

Maximum likelihood estimator for an n-gram

A

P(w_m | w_m-[n-1]:m-1) = C(w_m-[n-1]:m-1 w_m) / Sum over w of C(w_m-[n-1]:m-1 w)

which is the same as

C(w_m-[n-1]:m-1 w_m) / C(w_m-[n-1]:m-1)

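A toy count-based illustration of this estimator (the corpus and helper names are mine):

```python
from collections import Counter

def ngram_mle(tokens, n):
    """MLE: C(prefix + w_m) divided by the sum over w of C(prefix + w)."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    prefixes = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    return {gram: count / prefixes[gram[:-1]] for gram, count in ngrams.items()}

tokens = "the cat sat on the mat the cat ran".split()
bigrams = ngram_mle(tokens, 2)
print(bigrams[("the", "cat")])  # C(the cat) / C(the .) = 2/3
```
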
8
Q

Perplexity

A

Weighted average branching factor -

the M-th root of 1/P(w_1:M), i.e. P(w_1:M)^(-1/M)

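A quick numeric sketch, assuming per-token log-probabilities are available; the uniform 1/10 example ties perplexity back to branching factor:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = (1 / P(w_1:M))^(1/M), computed from per-token log-probs."""
    M = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / M)

# A model that assigns every word probability 1/10 has perplexity 10:
# a uniform 10-way branching factor.
print(perplexity([math.log(0.1)] * 20))  # ~10.0
```
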
9
Q

branching factor

A

the number of possible next words that can follow a word

10
Q

Greedy Decoding

A

w_m = argmax_w P(w | w_1:m-1) - choose the most likely next word

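A one-line sketch over an assumed toy next-word distribution:

```python
# Greedy decoding step: w_m = argmax_w P(w | w_1:m-1).
next_word_probs = {"mat": 0.5, "hat": 0.3, "dog": 0.2}  # assumed toy conditional
print(max(next_word_probs, key=next_word_probs.get))    # "mat"
```
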
11
Q

Sampling Decoding

A

Cover [0, 1] with intervals whose lengths equal each word's probability, then draw from a uniform distribution on [0, 1] and pick the word whose interval contains the draw

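A sketch of that interval trick, assuming a toy word-to-probability dict:

```python
import random

def sample_next_word(probs):
    """Intervals of length P(w) tile [0, 1]; return the word whose interval
    contains a uniform draw."""
    u = random.uniform(0.0, 1.0)
    cumulative = 0.0
    for word, p in probs.items():
        cumulative += p
        if u <= cumulative:
            return word
    return word  # guard against floating-point round-off

print(sample_next_word({"mat": 0.5, "hat": 0.3, "dog": 0.2}))
```
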
12
Q

Top-K Sampling

A

Consider the K most likely entries

13
Q

Top-p sampling

A

consider the smallest set of entries whose cumulative probability exceeds p

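A combined sketch of Top-K and Top-p truncation over an assumed toy distribution, with renormalization after truncation:

```python
def top_k_filter(probs, k):
    """Keep the k most likely words, then renormalize."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}

def top_p_filter(probs, p_threshold):
    """Keep the smallest set whose cumulative probability reaches p, then renormalize."""
    kept, cumulative = {}, 0.0
    for w, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[w] = p
        cumulative += p
        if cumulative >= p_threshold:
            break
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}

probs = {"mat": 0.5, "hat": 0.3, "dog": 0.15, "car": 0.05}
print(top_k_filter(probs, 2))    # {'mat': 0.625, 'hat': 0.375}
print(top_p_filter(probs, 0.9))  # keeps mat, hat, dog
```
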
14
Q

Beam Search

A

Generate K hypotheses.
Extend each hypothesis by K continuations (bringing the total to K^2).

Keep the best K of the K^2 hypotheses - rinse and repeat.

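A minimal beam-search sketch, assuming a hypothetical step_fn(prefix) that returns P(word | prefix); real decoders also stop hypotheses at an end-of-sequence token.

```python
import math

def beam_search(step_fn, beam_size, max_len):
    """step_fn(prefix) -> {word: P(word | prefix)}. Returns the best hypothesis."""
    beams = [([], 0.0)]  # (prefix, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            probs = step_fn(prefix)
            # Extend each hypothesis by its beam_size best continuations
            # (beam_size^2 candidates in total).
            best = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:beam_size]
            candidates += [(prefix + [w], score + math.log(p)) for w, p in best]
        # Keep only the beam_size best hypotheses, then repeat.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]

# Toy usage: a "model" that prefers to repeat "a" once it has produced one.
def step_fn(prefix):
    return {"a": 0.7, "b": 0.3} if prefix and prefix[-1] == "a" else {"a": 0.4, "b": 0.6}

print(beam_search(step_fn, beam_size=2, max_len=3))  # (['b', 'b', 'b'], ...)
```
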
15
Q

What is the distance between any two words in one-hot representation?

A

2 - all words are equidistant

16
Q

Distributional Hypothesis

A

Words that occur in similar contexts tend to have similar meanings

17
Q

Cosine Similarity Formula

A

u^T v / (||u|| ||v||)

18
Q

Cosine Similarity Geometric Interpretation

A

The cosine of the angle between the two vectors (the dot product of the normalized vectors)
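A direct transcription of the formula with numpy (assumed available):

```python
import numpy as np

def cosine_similarity(u, v):
    """u^T v / (||u|| ||v||): the cosine of the angle between u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 0.0])
print(cosine_similarity(u, 2 * u))                      # 1.0: same direction
print(cosine_similarity(u, np.array([0.0, 0.0, 3.0])))  # 0.0: orthogonal
```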

19
Q

Word Co-Occurrence Matrix

A

Count how many times each pair of words co-occurs within a specific context window - size is vocab x vocab

20
Q

Word Vector

A

a single row of the co-occurrence matrix - the co-occurrence counts for one word
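A small sketch of building the vocab x vocab matrix with a symmetric context window; each row is then one word vector:

```python
import numpy as np

def cooccurrence_matrix(tokens, window=2):
    """Count, for every word pair, how often they appear within `window` positions."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)), dtype=np.int64)  # vocab x vocab
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                C[index[w], index[tokens[j]]] += 1
    return C, vocab

tokens = "the cat sat on the mat".split()
C, vocab = cooccurrence_matrix(tokens)
print(vocab)
print(C[vocab.index("cat")])  # the word vector (co-occurrence row) for "cat"
```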

21
Q

What is the issue with taking cosine similarity of word vectors?

A

They are pretty sparse, so cosine similarities end up close to zero

22
Q

Word Embedding

A

C (the co-occurrence matrix) can be approximated as C = EU^T, where E is of size v x d and U is of size v x d

23
Q

What is the rank of the co-occurence matrix vs the embedding matrices?

A

Full rank vs maximum rank d

24
Q

How can you calculate E and U^T?

A

Do singular value decomposition and consider the first d components
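A sketch of that truncated-SVD construction; how the singular values are split between E and U is a choice I am assuming here:

```python
import numpy as np

def embed_by_svd(C, d):
    """Approximate C ~ E @ U.T using the d largest singular components."""
    W, S, Vt = np.linalg.svd(C.astype(float), full_matrices=False)
    E = W[:, :d] * S[:d]   # v x d (singular values folded into E)
    U = Vt[:d, :].T        # v x d
    return E, U

C = np.random.default_rng(0).poisson(1.0, size=(6, 6))  # stand-in co-occurrence counts
E, U = embed_by_svd(C, d=2)
print(np.linalg.norm(C - E @ U.T))  # error of the rank-d approximation
```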

25
Q

How do you train a fully connected architecture for word embeddings?

A

Given a corpus, let V = {w | C(w) >= C_min} with |V| = v. Split the corpus into overlapping windows of size M. Initialize a random embedding matrix. Represent w_1:M as [e_1, e_2, ..., e_M], so x^T has size 1 x Md. Use one-hot w_M+1 as the target.

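A minimal PyTorch sketch of this recipe (the framework and the specific sizes are my assumptions, not the cards'): embed M context words, flatten to a length-Md vector, and predict w_M+1 with a fully connected layer.

```python
import torch
import torch.nn as nn

v, d, M = 1000, 32, 4  # assumed vocabulary size, embedding dim, window size

model = nn.Sequential(
    nn.Embedding(v, d),   # randomly initialized embedding matrix, size v x d
    nn.Flatten(),         # [e_1, ..., e_M] -> x of size Md
    nn.Linear(M * d, v),  # score every word in V as the next word
)
loss_fn = nn.CrossEntropyLoss()  # target is the index of w_M+1 (one-hot in effect)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step on random stand-ins for corpus windows.
windows = torch.randint(0, v, (8, M))   # batch of 8 windows w_1:M
targets = torch.randint(0, v, (8,))     # the next word w_M+1 for each window
loss = loss_fn(model(windows), targets)
loss.backward()
optimizer.step()
print(float(loss))
```
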
26
Q

Why is contextualized word representation necessary?

A

Words have multiple meanings, so we need representations that reflect the context, leading to BERT.

27
Q

2018

A

BERT - Bidirectional Encoder Representations from Transformers.

28
Q

What are the input and output of the RNN language model, represented both as a function of the representation and as a sequence?

A

Inputs: {x[m]}, m = 1..M-1, given by Emb(w_1:M-1). Targets: {y[m]}, m = 1..M-1, given by one-hot(w_2:M).

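A short sketch of that input/target shift for a batch of token ids (PyTorch assumed; the one-hot targets appear as class indices):

```python
import torch
import torch.nn as nn

v, d = 1000, 32                    # assumed vocabulary size and embedding dim
emb = nn.Embedding(v, d)

w = torch.randint(0, v, (1, 10))   # a toy sentence w_1:M with M = 10 token ids
x = emb(w[:, :-1])                 # inputs  x[m] = Emb(w_1:M-1),  m = 1..M-1
y = w[:, 1:]                       # targets y[m] = w_2:M (indices stand in for one-hots)
print(x.shape, y.shape)            # torch.Size([1, 9, 32]) torch.Size([1, 9])
```
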
29
Q

What are the RNN limitations for machine translation?

A

Nearby-context bias, fixed context size, and slow sequential data processing.

30
Q

2015

A

Bahdanau, Cho, and Bengio - "Neural Machine Translation by Jointly Learning to Align and Translate" - the attentional seq2seq model, with dynamic focus on relevant parts of the input sequence.

31
Q

How is the hidden state of the decoder generated?

A

Project the hidden state of the decoder at the previous step onto the hidden states of the encoder: S^T h[n-1].

32
Q

Target Language Decoding

A

Autoregressive greedy decoding: feed the embedding of the start token to the decoder to get y~[1], then feed Emb(argmax(y~[1])) to get y~[2], and so on until argmax(y~[n]) is the end-of-sequence token.

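A sketch of the autoregressive loop, with decoder_step, emb and the special-token ids as hypothetical stand-ins for the real decoder:

```python
def greedy_translate(decoder_step, emb, sos_id, eos_id, max_len=50):
    """decoder_step(prev_word_embedding, state) -> (probs over target vocab, state)."""
    output, state, token = [], None, sos_id     # start by feeding Emb(<sos>)
    for _ in range(max_len):
        probs, state = decoder_step(emb(token), state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax(y~[n])
        if token == eos_id:                     # stop once <eos> is produced
            break
        output.append(token)
    return output

# Toy usage with a fake 4-word target vocabulary where id 3 is <eos>.
fake_emb = lambda t: t
def fake_step(prev, state):
    n = (state or 0) + 1
    return ([0.1, 0.2, 0.6, 0.1] if n < 3 else [0.0, 0.0, 0.0, 1.0]), n

print(greedy_translate(fake_step, fake_emb, sos_id=0, eos_id=3))  # [2, 2]
```
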
33
Q

What do you calculate for self-attention and how often?

A

For every x in {x[m]}, m = 1..M, i.e. X, compute a query q = W_q x and a key k = W_k x. For the entire X, compute K = W_k X and the values V = W_v X; then the query-key match r = K^T q, a = softargmax(r), and c = Va.

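A numpy sketch of these quantities for a single query position (dimensions assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
M, d_in, d_k = 5, 8, 4              # sequence length, input dim, projection dim

X = rng.normal(size=(d_in, M))      # columns are x[1], ..., x[M]
Wq = rng.normal(size=(d_k, d_in))
Wk = rng.normal(size=(d_k, d_in))
Wv = rng.normal(size=(d_k, d_in))

x = X[:, 0]                         # one position's input
q = Wq @ x                          # its query
K = Wk @ X                          # keys for the entire X
V = Wv @ X                          # values for the entire X
r = K.T @ q                         # query-key match scores
a = np.exp(r) / np.exp(r).sum()     # softargmax(r)
c = V @ a                           # self-attention output for this position
print(c.shape)                      # (4,)
```
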
34
Q

Describe the unique properties of c, the self-attention vector

A

It is permutation-invariant with respect to X, but permutation-equivariant with respect to x.

35
Q

How do you encode positional information in a set?

A

A sequence can be cast to an augmented set by adding index information to each input representation: additive positional encoding, {x[m]}, m = 1..M -> {x[m] + Enc(m)}, m = 1..M.

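A sketch of additive positional encoding using the sinusoidal Enc(m) from the transformer paper (the cards only require some index-dependent Enc(m)):

```python
import numpy as np

def sinusoidal_encoding(M, d):
    """Enc(m) for m = 1..M, stacked as an M x d matrix of sin/cos features."""
    positions = np.arange(M)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))  # one frequency per sin/cos pair
    enc = np.zeros((M, d))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc

M, d = 10, 16
X = np.random.default_rng(0).normal(size=(M, d))  # {x[m]}, one row per position
print((X + sinusoidal_encoding(M, d)).shape)      # {x[m] + Enc(m)}: (10, 16)
```
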
36
Q

2017

A

Vaswani et al. - "Attention Is All You Need" - replaced recurrence and convolutions with self-attention mechanisms for sequence modeling.