Quiz 5 Flashcards

1
Q

MDP: State

A

The possible scenarios of the world the agent can be in.

2
Q

MDP: Actions

A

Set of actions the agent can take based on its state

3
Q

MDP: Environment

A
  • Environment produces a state which the agent can perceive
  • Gives rewards to agent for actions it takes
  • Environment may be unknown, non-linear, stochastic and complex
4
Q

Dynamic programming methods for solving MDPs

A

Bellman Optimality Equation - Update value matrix at each iteration by applying the Bellman equation until convergence.
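
A minimal value-iteration sketch in Python, assuming the MDP's model is given as `P[s][a]`, a list of `(prob, next_state, reward)` tuples (a hypothetical format for this sketch):

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-6):
    """P[s][a]: list of (prob, next_state, reward) tuples (assumed format)."""
    V = np.zeros(n_states)
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            # Bellman optimality backup: value of the best action under the current estimate
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            )
        if np.max(np.abs(V_new - V)) < tol:  # stop at convergence
            return V_new
        V = V_new
```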

5
Q

RL: Why is setting the data-gathering policy to be the same as the greedy training policy a bad idea?

A
  1. A greedy training policy has little incentive to explore less rewarding states that may eventually lead to higher reward
  2. Breaks the IID assumption (consecutive samples are highly correlated)
6
Q

State value function (V-function)

A

“Expected discounted sum of rewards from state s”
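
In standard notation (policy π, discount factor γ, reward r_t):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s\right]
```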

7
Q

State-action value function (Q-value)

A

“Expected cumulative reward upon taking action a in state s”
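
In the same notation as the V-function above:

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\; a_{0} = a\right]
```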

8
Q

RL: 4 challenges of RL

A
  1. Evaluative feedback - need trial and error to find the right action
  2. Delayed feedback - actions may not lead to immediate reward
  3. Non-stationary - Data distribution of visited states changes when policy changes
  4. Fleeting nature of time and online data
9
Q

RL: Components of DQN

A
  1. Experience replay
  2. Epsilon greedy search
  3. Q-update
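
A minimal sketch of the epsilon-greedy component (plain NumPy; `q_values` is assumed to be the Q-value vector for the current state):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=np.random.default_rng()):
    """With probability epsilon take a random action (explore), else the best one (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit
```
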
10
Q

MDP: Model

A

The transition function: given a state s and an action a, the probability that the agent will end up in the new state 𝑠′.

11
Q

MDP: Policy

A

A mapping that gives an action for each state the agent is in. RL attempts to find the optimal policy, which maximizes the reward.

12
Q

MDP: Markovian property

A

Only the present matters: the next state depends only on the current state (and action), not on the history of how the agent got there.

13
Q

Bellman’s Equation

A

The true utility of a state is its immediate reward plus all discounted future rewards (utility)
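
One common way to write it (utility U, reward R, transition model T, discount γ):

```latex
U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')
```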

14
Q

Difference between value iteration and policy iteration

A

VI: Finds optimal value functions + policy extraction (just once)
PI: Policy evaluation + policy improvement (repeated)

15
Q

Experience replay

A

Agent keeps memory bank that stores past experience. Instead of using immediate experience, sample from memory buffer.
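
A minimal replay-buffer sketch in Python (the class name and the transition format are assumptions of this sketch):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity=10_000):
        self.memory = deque(maxlen=capacity)  # old experience falls out automatically

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling decorrelates consecutive experience
        return random.sample(self.memory, batch_size)
```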

16
Q

REINFORCE (policy gradient)

A
  1. Define a parameterized policy
  2. Generate trajectories based on the policy, collecting states, actions and rewards
  3. Compute the objective function (expected sum of rewards over all time steps)
  4. Compute the gradient of the objective with respect to the policy params
  5. Update the policy params
  6. Repeat until convergence (see the sketch after this list)
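
A minimal REINFORCE sketch in PyTorch, assuming `policy` maps a state tensor to action log-probabilities and `trajectory` is a list of (state, action, reward) tuples (both assumptions of this sketch):

```python
import torch

def reinforce_update(policy, optimizer, trajectory, gamma=0.99):
    # 1) Discounted return G_t at every time step of the trajectory
    returns, G = [], 0.0
    for _, _, r in reversed(trajectory):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # 2) Loss = -sum_t log pi_theta(a_t | s_t) * G_t  (negated for gradient ascent)
    loss = 0.0
    for (state, action, _), G_t in zip(trajectory, returns):
        loss = loss - policy(state)[action] * G_t

    # 3) Update the policy parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
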
17
Q

Drawbacks of policy gradients

A

Coarse rewards: the whole trajectory receives a single reward signal, so credit can’t be assigned to the subset of actions that were actually good or bad.

18
Q

How does experience replay solve the problem of correlated data?

A

By randomly sampling from the replay buffer, the training data becomes less correlated. This helps to stabilize and accelerate the learning process.

19
Q

Diff between Q-learning and Deep Q-Networks

A

How Q-values are represented.

Q-learning uses a table over discrete states and actions.

DQN uses NN to approximate Q-values.

20
Q

VI: Time complexity per iteration

A

O(|S|^2 |A|)

21
Q

VI / Q-learning - How do they differ in how they perform updates?

A

Q loops over actions as well as states

22
Q

Why do policy iteration?

A

Policy converges faster

23
Q

Deep Q-learning - What 2 things to do for stability during learning

A
  1. Freeze Q_old and update Q_new parameters
  2. Set Q_old <- Q_new at regular intervals
24
Q

Loss for Deep Q-learning

A

MSE Loss
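
Putting the previous two cards together, a minimal sketch of the Deep Q-learning loss with a frozen target network (PyTorch; `q_new`/`q_old` are assumed to be networks mapping states to per-action Q-values, and the batch tensors come from the replay buffer):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_new, q_old, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch  # assumed tensor batch

    # Q_new(s, a) for the actions actually taken
    q_sa = q_new(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # TD target uses the frozen network Q_old; no gradients flow through it
    with torch.no_grad():
        target = rewards + gamma * q_old(next_states).max(dim=1).values * (1 - dones)

    return F.mse_loss(q_sa, target)

# At regular intervals: q_old.load_state_dict(q_new.state_dict())
```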

25
Q

Dependency of value/policy iteration

A

Must know transition and reward functions

26
Q

2 strategies if transition and reward function unknown

A
  1. Estimate transition / reward function.
  2. Estimate Q-values from data (DQNs, etc)
27
Q

What 2 components of trad RL does policy gradient not require?

A
  1. Environment model
  2. Reward function
28
Q

Policy gradient: likelihood ratio policy gradient

A

increases the (log) probability of the trajectories with high reward and decreases the (log) probability of the trajectories with low reward
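
The standard likelihood-ratio form of the gradient (trajectory τ with total reward R(τ)):

```latex
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[ R(\tau)\, \nabla_{\theta} \log \pi_{\theta}(\tau) \right]
```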

29
Q

Key difference between TD Learning and SARSA

A

TD: Action in next state can be any action. Update is based on expected value over all possible next actions.
SARSA: Action in the next state is one actually taken in the environment. Update is based on the Q-value of the action actually chosen.

30
Q

Define Q-value

A

Estimate of the total future reward you might get for taking an action in a given state.

31
Q

Define few-shot learning

A

Build models and feature representations that will generalize or transfer to a new set of categories where we only have 1 to 5 examples per category.

32
Q

Define semi-supervised learning

A

Build an OK model, predict on unlabeled data, and feed the high-confidence predictions back into the training data.

33
Q

Benefit of doing semi-supervised learning with DL

A

Can do SSL in one pipeline, end-to-end.

34
Q

How is SSL done end-to-end in DL (type of data)

A

Labeled + unlabeled examples both included in batch

35
Q

How to get loss from unlabeled examples?

A

Create 2 augmentations (weak/strong). Use the weak prediction as a pseudo-label/baseline for the strong prediction. Calculate a loss between them and backpropagate.
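
A minimal sketch of that weak/strong consistency loss in PyTorch (in the spirit of FixMatch-style methods; `model`, `weak_aug`, `strong_aug`, and the confidence threshold are assumptions of this sketch):

```python
import torch
import torch.nn.functional as F

def unlabeled_loss(model, x_unlabeled, weak_aug, strong_aug, threshold=0.95):
    # Weak view: its prediction becomes a pseudo-label (no gradients)
    with torch.no_grad():
        probs = model(weak_aug(x_unlabeled)).softmax(dim=1)
        conf, pseudo_labels = probs.max(dim=1)
        mask = (conf >= threshold).float()  # keep only confident pseudo-labels

    # Strong view: trained to match the pseudo-label; backprop through this branch
    strong_logits = model(strong_aug(x_unlabeled))
    per_example = F.cross_entropy(strong_logits, pseudo_labels, reduction="none")
    return (per_example * mask).mean()
```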

36
Q

Few-shot: What substitute layer is shown to be effective compared to fully connected?

A

Cosine layer

37
Q

Few-shot: Define N-Way K-Shot Task

A

A task with N classes (ways), each having only K examples (shots).

38
Q

Few-shot: Why does cosine similarity work better than fully connected?

A

Scale invariant - only cares about difference in angle. FC layer may be too sensitive to magnitude.
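
A minimal cosine-classifier layer in PyTorch (the fixed `scale` factor is an assumption of this sketch; some variants learn it):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Drop-in replacement for a fully connected classification layer."""
    def __init__(self, feat_dim, n_classes, scale=10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, feat_dim))
        self.scale = scale

    def forward(self, features):
        # Normalize features and class weights: scores depend only on the angle,
        # not on the magnitude of either vector
        f = F.normalize(features, dim=1)
        w = F.normalize(self.weight, dim=1)
        return self.scale * f @ w.t()
```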

39
Q

Define meta-learning

A

Set up a set of smaller tasks (with train/test data) that prepares the learner for the new task it will see in actual test.

40
Q

Ways to define meta-learner inspired by trad ML

A

KNN - Matching networks
Gaussian - Prototypical networks
Gradient descent - Meta-learner LSTM

41
Q

Ways to define meta-learner inspired by black-box DL

A

MANN
SNAIL

42
Q

Define autoencoders

A

Encoder/decoder architecture that compresses input into low-dimensional embedding then upsamples back to original image.
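
A minimal autoencoder sketch in PyTorch (the layer sizes are arbitrary assumptions; trained with a reconstruction loss such as MSE, no labels required):

```python
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, embed_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, embed_dim))
        self.decoder = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)           # low-dimensional embedding for downstream tasks
        return self.decoder(z), z     # reconstruction + embedding
```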

43
Q

What’s the point of autoencoders

A

Embedding can be useful for downstream tasks.

No need for labels.

44
Q

Examples of autoencoder tasks

A
  1. Jigsaw
  2. Colorise
  3. Rotation
45
Q

Meta-learn: Key idea of MAML

A

Just learn an initialization using SGD.

46
Q

Meta-learn: How MAML works

A

Take small batch of train/test, predict and backprop, do this for 4-10 steps. Then, just do normal gradient descent afterwards. Use learned params as “smart” initialization.
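
A first-order approximation of that loop as a hedged sketch in PyTorch (`tasks` is assumed to be a list of ((x_train, y_train), (x_test, y_test)) pairs; full MAML would also backpropagate through the inner updates):

```python
import copy
import torch

def fomaml_step(model, loss_fn, tasks, meta_optimizer, inner_lr=0.01, inner_steps=5):
    meta_optimizer.zero_grad()
    for (x_tr, y_tr), (x_te, y_te) in tasks:
        # Inner loop: adapt a copy of the shared initialization to this small task
        learner = copy.deepcopy(model)
        for _ in range(inner_steps):
            grads = torch.autograd.grad(loss_fn(learner(x_tr), y_tr), learner.parameters())
            with torch.no_grad():
                for p, g in zip(learner.parameters(), grads):
                    p -= inner_lr * g

        # Outer loop: the adapted learner's test loss drives the update of the initialization
        grads = torch.autograd.grad(loss_fn(learner(x_te), y_te), learner.parameters())
        for p, g in zip(model.parameters(), grads):
            p.grad = g if p.grad is None else p.grad + g
    meta_optimizer.step()  # the updated params act as the "smart" initialization
```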

47
Q

How to generate labels from unlabeled data using clustering

A

In parallel: a randomly initialized CNN and K-means
1. The CNN predicts labels
2. K-means clusters the CNN’s features and turns the cluster assignments into labels
3. Compute the loss between the two and backprop.

48
Q

Autoencoder: Jigsaw - loss function

A

cross-entropy

49
Q

Autoencoder: Rotation

A

cross-entropy

50
Q

Autoencoder: Colorization

A

MSE

51
Q

Instance discrimination - Inputs and outputs

A

Input: + / - examples
Output: Model that can discriminate between classes

52
Q

Instance discrimination: How is loss measured

A

Contrastive loss

53
Q

Instance discrimination: Define contrastive loss

A

Similarity (dot product) between augmentation 1 and augmentation 2 of the positive example, contrasted against similarity with the negative/augmented examples.
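
A minimal InfoNCE-style contrastive loss in PyTorch (`z1`/`z2` are embeddings of the two positive augmentations, `z_neg` the embeddings of negative examples; all assumed L2-normalized):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, z_neg, temperature=0.07):
    pos = (z1 * z2).sum(dim=1, keepdim=True) / temperature   # (B, 1) positive similarities
    neg = z1 @ z_neg.t() / temperature                        # (B, N) negative similarities
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(z1.size(0), dtype=torch.long, device=z1.device)  # positive is index 0
    return F.cross_entropy(logits, labels)  # pull positives together, push negatives apart
```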

54
Q

Instance discrimination: Inputs and outputs

A

Input:
Positive example: 2 augmentations
Negative example: 1 augmentation

Output:
Contrastive loss

54
Q

Instance discrimination: Point of using momentum encoders

A

Slow down the learner, stabilize learning.
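
A minimal momentum (EMA) update sketch in PyTorch, in the spirit of MoCo-style momentum encoders (the function and encoder names are assumptions of this sketch):

```python
import torch

@torch.no_grad()
def momentum_update(key_encoder, query_encoder, m=0.999):
    # Key encoder trails the query encoder slowly, which stabilizes the contrastive targets
    for k_param, q_param in zip(key_encoder.parameters(), query_encoder.parameters()):
        k_param.mul_(m).add_(q_param, alpha=1 - m)
```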

55
Q

Generative model - key idea

A

Use maximum likelihood and unlabeled dataset to create a model.

56
Q

3 types of generative models

A
  1. Tractable density
  2. Variational
  3. Direct
57
Q

GM: Tractable density

A

Simplify joint distribution and learn those params

58
Q

GM: Variational

A

Learn distributions that approximate the true joint distribution

59
Q

GM: Direct

A

Learn to generate samples from data distribution without modeling it.

60
Q

GAN: 2 types of models

A

Generator + Discriminator

61
Q

GAN: loss function (example of image detection)

A

cross-entropy (real or fake?)

62
Q

GAN: Generator objective

A

Minimize discriminator’s performance

1 - D(G(z))

63
Q

GAN: Discriminator objective

A

Maximize prediction of real images as real - D(x)

Minimize predicting fake images as real - equivalently, maximize 1 - D(G(z))
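
The generator and discriminator objectives combine into the standard minimax game:

```latex
\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right]
```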

64
Q

GAN: Why does the max-max game work better than min-max?

A

Generator’s objective function doesn’t have good gradient properties.

65
Q

Variational autoencoders (VAE) - what assumption does it require

A

Gaussian distribution

66
Q

VAE - Why can’t we calculate maximum likelihood directly

A

The likelihood contains an intractable integral over the latent variable.

67
Q

VAE: What is the alternative to calculating maximum likelihood

A

variational lower bound
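
The standard form of the variational (evidence) lower bound, with encoder q_φ(z|x) and decoder p_θ(x|z):

```latex
\log p_{\theta}(x) \;\geq\; \mathbb{E}_{q_{\phi}(z \mid x)}\left[\log p_{\theta}(x \mid z)\right] - D_{\mathrm{KL}}\left(q_{\phi}(z \mid x)\,\|\,p(z)\right)
```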

68
Q

GAN: Example of training failing

A

Generator learns to memorize and output samples of your training data.

69
Q

VAE: Output of encoder

A

Mu and sigma - output parameters of a distribution

70
Q

VAE: How is mu/sigma from encoder used

A

Sample from it to generate example to feed to decoder

71
Q

VAE: Output of decoder

A

Mu and sigma of original data’s distribution (X)

72
Q

VAE: How is mu/sigma from decoder used

A

Sample from it to generate/reconstruct examples of the original data X.

73
Q

VAE: reparameterization trick

A

Moves the sampling step outside of the computation graph, so that gradients can flow back through the graph all the way to the encoder.
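
A minimal reparameterization sketch in PyTorch (the encoder is assumed to output `mu` and `log_var`):

```python
import torch

def reparameterize(mu, log_var):
    # Randomness lives in eps, outside the graph; gradients flow through mu and sigma
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + std * eps  # z = mu + sigma * eps, eps ~ N(0, I)
```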