Quiz 5 Flashcards

1
Q

MDP: State

A

The possible scenarios of the world the agent can be in.

2
Q

MDP: Actions

A

Set of actions the agent can take based on its state

3
Q

MDP: Environment

A
  • The environment produces a state which the agent can perceive
  • It gives rewards to the agent for the actions it takes
  • The environment may be unknown, non-linear, stochastic, and complex
4
Q

Dynamic programming methods for solving MDPs

A

Value iteration: update the value table at each iteration by applying the Bellman optimality equation until convergence (policy iteration is the other standard dynamic-programming method).
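
A minimal value-iteration sketch, assuming a small tabular MDP where P[s][a] is a list of (prob, next_state, reward) tuples and gamma is the discount factor; the names are illustrative, not from the course:

```python
def value_iteration(P, gamma=0.9, tol=1e-6):
    """P[s][a] -> list of (prob, next_state, reward); returns the state-value table V."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: best expected one-step reward plus discounted future value
            best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop once no state value changes by more than the tolerance
            return V
```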

5
Q

RL: Why is setting the data-gathering policy to be the same as the greedy training policy a bad idea?

A
  1. A greedy policy has little incentive to explore less immediately rewarding states that may ultimately lead to higher reward
  2. It breaks the IID assumption: the gathered data is correlated with the current policy
6
Q

State value function (V-function)

A

“Expected discounted sum of rewards from state s”
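
In standard notation (with policy π and discount factor γ):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s\right]
```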

7
Q

State-action value function (Q-value)

A

“Expected cumulative reward upon taking action a in state s”
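
In the same notation as the V-function above (note that V^π(s) is the expectation of Q^π(s, a) over the policy's actions):

```latex
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\; a_{0} = a\right]
```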

8
Q

RL: 4 challenges of RL

A
  1. Evaluative feedback - trial and error is needed to find the right action
  2. Delayed feedback - actions may not lead to immediate reward
  3. Non-stationarity - the data distribution of visited states changes when the policy changes
  4. Fleeting nature of time and online data
9
Q

RL: Components of DQN

A
  1. Experience replay
  2. Epsilon greedy search
  3. Q-update
10
Q

MDP: Model

A

The transition function: given a state and an action, the probability that the agent ends up in a new state s′.
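
In symbols:

```latex
T(s, a, s') = P\!\left(s_{t+1} = s' \mid s_{t} = s,\; a_{t} = a\right)
```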

11
Q

MDP: Policy

A

A mapping that gives an action for each state the agent can be in. RL attempts to find the optimal policy, which maximizes the expected reward.

12
Q

MDP: Markovian property

A

Only the present matters: the next state depends only on the current state (and action), not on the history of how the agent got there.

13
Q

Bellman’s Equation

A

The true utility of a state is its immediate reward plus all discounted future rewards (utility)
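
One common way to write this, with reward R(s), transition model T, and discount factor γ:

```latex
V(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, V(s')
```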

14
Q

Difference between value iteration and policy iteration

A

VI: Finds optimal value functions + policy extraction (just once)
PI: Policy evaluation + policy improvement (repeated)
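
A compact policy-iteration sketch for contrast with the value-iteration code above, assuming the same illustrative P[s][a] -> [(prob, next_state, reward)] model:

```python
def policy_iteration(P, gamma=0.9, tol=1e-6):
    """Alternate policy evaluation and greedy policy improvement until the policy stops changing."""
    pi = {s: next(iter(P[s])) for s in P}  # arbitrary initial policy
    V = {s: 0.0 for s in P}
    while True:
        # Policy evaluation: iterate the Bellman equation for the *current* policy
        while True:
            delta = 0.0
            for s in P:
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
                delta, V[s] = max(delta, abs(v - V[s])), v
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the evaluated V
        stable = True
        for s in P:
            best_a = max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best_a != pi[s]:
                pi[s], stable = best_a, False
        if stable:
            return pi, V
```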

15
Q

Experience replay

A

The agent keeps a memory bank (replay buffer) that stores past experiences. Instead of training on only the most recent experience, it samples mini-batches from the buffer.
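
A minimal replay-buffer sketch (class and method names here are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences fall off the front

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)
```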

16
Q

REINFORCE (policy gradient)

A
  1. Define a parameterized policy
  2. Generate trajectories by following the policy, collecting states, actions, and rewards
  3. Compute the objective function (expected sum of rewards over all time steps)
  4. Compute the gradient of the objective with respect to the policy parameters (see the sketch after this list)
  5. Update the policy parameters
  6. Repeat until convergence
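
A sketch of steps 3 to 5 for a single trajectory, assuming a PyTorch policy network `policy_net` that maps a batch of states to action logits and that `states` is a list of tensors; names are illustrative, not a definitive implementation:

```python
import torch
import torch.nn.functional as F

def reinforce_step(policy_net, optimizer, states, actions, rewards, gamma=0.99):
    """One REINFORCE update: ascend the gradient of E[sum_t log pi(a_t|s_t) * G_t]."""
    # Discounted return G_t from each time step to the end of the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    logits = policy_net(torch.stack(states))              # shape (T, num_actions)
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs[torch.arange(len(actions)), torch.tensor(actions)]

    loss = -(chosen * returns).sum()                      # negative because optimizers minimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```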
17
Q

Drawbacks of policy gradients

A

Coarse reward signal: credit cannot be assigned to the subset of actions within a trajectory that were actually good or bad.

18
Q

How does experience replay solve the problem of correlated data?

A

By randomly sampling from the replay buffer, the training data becomes less correlated. This helps to stabilize and accelerate the learning process.

19
Q

Diff between Q-learning and Deep Q-Networks

A

How Q-values are represented.

Q-learning uses a table over discrete states and actions.

DQN uses a neural network to approximate Q-values.

20
Q

VI: Time complexity per iteration

A

O(|S|^2 |A|)

21
Q

VI vs. Q-learning: how do their updates differ?

A

The Q-learning update loops over actions as well as states (it maintains a value for every state-action pair, not just every state).

22
Q

Why do policy iteration?

A

The policy often converges faster than the values do.

23
Q

Deep Q-learning: what 2 things are done for stability during learning?

A
  1. Freeze Q_old and update Q_new parameters
  2. Set Q_old <- Q_new at regular intervals
24
Q

Loss for Deep Q-learning

A

MSE Loss
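
Concretely, the squared TD error against the frozen target network from the previous card:

```latex
L(\theta) = \left( r + \gamma \max_{a'} Q_{\text{old}}(s', a') - Q_{\text{new}}(s, a; \theta) \right)^{2}
```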

25
Dependency of value/policy iteration
Must know transition and reward functions
26
2 strategies if transition and reward function unknown
1. Estimate transition / reward function. 2. Estimate Q-values from data (DQNs, etc)
27
What 2 components of trad RL does policy gradient not require?
1. Environment model 2. Reward function
28
Policy gradient: likelihood ratio policy gradient
increases the (log) probability of the trajectories with high reward and decreases the (log) probability of the trajectories with low reward
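
The likelihood-ratio (score function) form of the gradient, with trajectory return R(τ):

```latex
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[ R(\tau) \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}) \right]
```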
29
Key difference between TD Learning and SARSA
TD: Action in next state can be any action. Update is based on expected value over all possible next actions. SARSA: Action in the next state is one actually taken in the environment. Update is based on the Q-value of the action actually chosen.
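
Writing both updates with learning rate α and discount γ (the first line is the Q-learning form of the off-policy TD update; replacing the max with an expectation over the policy's next actions gives Expected SARSA, matching the "expected value" wording above):

```latex
\text{Q-learning: } Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right]
\text{SARSA: } Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma\, Q(s', a') - Q(s,a) \right], \quad a' = \text{action actually taken in } s'
```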
30
Define Q-value
Estimate of the cumulative (discounted) reward you expect to get from taking an action in a given state.
31
Define few-shot learning
Build models and feature representations that will generalize or transfer to a new set of categories where we only have 1 to 5 examples per category.
32
Define semi-supervised learning
Build reasonably good models on the labeled data, predict labels for the unlabeled data, and feed the high-confidence predictions back into the training data.
33
Benefit of doing semi-supervised learning with DL
SSL can be done in one pipeline, end-to-end.
34
How is SSL done end-to-end in DL (type of data)
Labeled + unlabeled examples both included in batch
35
How to get loss from unlabeled examples?
Apply two augmentations (weak and strong) to each unlabeled example. Use the prediction on the weak augmentation as the target for the prediction on the strong augmentation, compute a loss between the two, and backpropagate.
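
A rough sketch of this weak/strong consistency loss (in the spirit of FixMatch; the confidence threshold and the `model`, `weak_aug`, `strong_aug` names are illustrative assumptions, with the thresholding echoing the "high confidence" idea from the semi-supervised cards above):

```python
import torch
import torch.nn.functional as F

def unlabeled_loss(model, x_unlabeled, weak_aug, strong_aug, threshold=0.95):
    """Pseudo-label from the weak view, train the strong view to match it."""
    with torch.no_grad():
        probs = model(weak_aug(x_unlabeled)).softmax(dim=-1)
        conf, pseudo_labels = probs.max(dim=-1)       # confidence and hard pseudo-label
        mask = (conf >= threshold).float()            # keep only confident predictions

    strong_logits = model(strong_aug(x_unlabeled))
    per_example = F.cross_entropy(strong_logits, pseudo_labels, reduction="none")
    return (per_example * mask).mean()                # gradient flows only through the strong view
```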
36
Few-shot: What substitute layer is shown to be effective compared to fully connected?
Cosine layer
37
Few-shot: Define N-Way K-Shot Task
A task with N classes (ways), each having only K (a small number of) labeled examples (shots).
38
Few-shot: Why does cosine similarity work better than fully connected?
Scale invariant - only cares about difference in angle. FC layer may be too sensitive to magnitude.
39
Define meta-learning
Set up a set of smaller tasks (with train/test data) that prepares the learner for the new task it will see in actual test.
40
Ways to define meta-learner inspired by trad ML
KNN → Matching Networks; Gaussian → Prototypical Networks; Gradient descent → Meta-Learner LSTM
41
Ways to define meta-learner inspired by black-box DL
MANN, SNAIL
42
Define autoencoders
Encoder/decoder architecture that compresses the input into a low-dimensional embedding and then upsamples it back to reconstruct the original image.
43
What's the point of autoencoders
Embedding can be useful for downstream tasks. No need for labels.
44
Examples of autoencoder tasks
1. Jigsaw 2. Colorise 3. Rotation
45
Meta-learn: Key idea of MAML
Just learn a good initialization using SGD (one from which a few gradient steps adapt well to a new task).
46
Meta-learn: How MAML works
Take a small batch of train/test data, predict and backprop for about 4-10 steps; then just do normal gradient descent afterwards. Use the learned parameters as a "smart" initialization.
47
How to generate labels from unlabeled data using clustering
Run a randomly initialized CNN and K-means in parallel: 1. The CNN predicts labels. 2. K-means clusters the features and turns the clusters into pseudo-labels. 3. Compute the loss between the two and backpropagate.
48
Autoencoder: Jigsaw - loss function
cross-entropy
49
Autoencoder: Rotation
cross-entropy
50
Autoencoder: Colorization
MSE
51
Instance discrimination - Inputs and outputs
Input: positive / negative examples. Output: a model that can discriminate between classes.
52
Instance discrimination: How is loss measured
Contrastive loss
53
Instance discrimination: Define contrastive loss
Based on similarity (dot product) scores: the similarity between the positive example's two augmentations should be high relative to the similarity between a positive augmentation and the negative augmentations.
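
A standard way to write this (InfoNCE-style), with q and k⁺ the two augmentations of the positive example, k⁻ the negative augmentations, and temperature τ:

```latex
\mathcal{L} = -\log \frac{\exp\!\left(q \cdot k^{+} / \tau\right)}{\exp\!\left(q \cdot k^{+} / \tau\right) + \sum_{k^{-}} \exp\!\left(q \cdot k^{-} / \tau\right)}
```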
54
Instance discrimination: Inputs and outputs
Input: positive example (2 augmentations), negative example (1 augmentation). Output: contrastive loss.
55
Instance discrimination: Point of using momentum encoders
Slow down the learner, stabilize learning.
56
Generative model - key idea
Use maximum likelihood on an unlabeled dataset to learn a model of the data distribution.
57
3 types of generative models
1. Tractable density 2. Variational 3. Direct
58
GM: Tractable density
Simplify joint distribution and learn those params
59
GM: Variational
Learn distributions that approximate the true joint distribution
60
GM: Direct
Learn to generate samples from data distribution without modeling it.
61
GAN: 2 types of models
Generator + Discriminator
62
GAN: loss function (example of image detection)
cross-entropy (real or fake?)
63
GAN: Generator objective
Minimize the discriminator's performance on generated samples, i.e. minimize 1 - D(G(z)).
64
GAN: Discriminator objective
Maximize the prediction on real images, D(x); minimize the prediction of fake images as real, i.e. maximize 1 - D(G(z)).
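
Together, the two objectives form the minimax game:

```latex
\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]
```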
65
GAN: Why does the max-max game work better than min-max?
The generator's objective function doesn't have good gradient properties: log(1 - D(G(z))) saturates early in training when the discriminator easily rejects generated samples, so maximizing log D(G(z)) instead gives stronger gradients.
66
Variational autoencoders (VAE) - what assumption does it require
Gaussian distributions (for the latent prior and for the distributions produced by the encoder and decoder).
67
VAE - Why can't we calculate maximum likelihood directly
The likelihood contains an intractable integral over the latent variable: p(x) = ∫ p(x|z) p(z) dz.
68
VAE: What is the alternative to calculating maximum likelihood
variational lower bound
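
The variational lower bound (ELBO), with encoder q_φ(z|x), decoder p_θ(x|z), and prior p(z):

```latex
\log p_{\theta}(x) \;\ge\; \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log p_{\theta}(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_{\phi}(z \mid x) \,\|\, p(z)\right)
```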
69
GAN: Example of training failing
Generator learns to memorize and output samples of your training data.
70
VAE: Output of encoder
Mu and sigma: the parameters of a (Gaussian) distribution over the latent variable z.
71
VAE: How is mu/sigma from encoder used
Sample a latent z from that distribution and feed it to the decoder.
72
VAE: Output of decoder
Mu and sigma of original data's distribution (X)
73
VAE: How is mu/sigma from decoder used
Sample from it to generate an example in the original data space (the reconstruction of X).
74
VAE: reparameterization trick
Moves the random sampling step outside of the computation graph, so that gradients can flow all the way back to the encoder.
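
In symbols, instead of sampling z directly from N(μ, σ²), sample the noise separately:

```latex
z = \mu + \sigma \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)
```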