Intro - Lectures 1 and 2 Flashcards

1
Q

What are the three types of machine learning and how are they structured/represented?

A

Supervised Learning - Function approximation: given example pairs (x, y), learn f such that y = f(x), so y can be predicted for a new x
Unsupervised Learning - Clustering/description: given only x's, find a concise description f(x) of the data
Reinforcement Learning - Superficially looks like supervised learning, but it is a method for decision making: instead of being given x and y, we are given x and some other value z, and must learn both f and y for y = f(x)

2
Q

What are the components of a Markov Decision Process and some forms they can take?

A
States: S
Model: T(s,a,s') ~ Pr(s' | s,a)
Actions: A(s), A
Reward: R(s), R(s,a), R(s,a,s')
Policy: Pi(s) -> a
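
A minimal sketch of how these components might be represented in code, assuming a small made-up MDP (the states, actions, probabilities, and rewards below are invented for illustration):

```python
# States S
states = ["s0", "s1", "s2"]

# Actions A (here the same action set A in every state)
actions = ["left", "right"]

# Model T(s, a, s') ~ Pr(s' | s, a): transition probabilities
T = {
    ("s0", "left"):  {"s0": 1.0},
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s1", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s2": 0.8, "s1": 0.2},
    ("s2", "left"):  {"s1": 1.0},
    ("s2", "right"): {"s2": 1.0},
}

# Reward R(s) (the R(s,a) or R(s,a,s') forms would key on more arguments)
R = {"s0": 0.0, "s1": 0.0, "s2": 1.0}

# Policy Pi(s) -> a
policy = {"s0": "right", "s1": "right", "s2": "right"}
```
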
3
Q

What is the Markovian property?

A

Only the present matters. The next state depends only on the current state (how you got there doesn’t matter)

4
Q

How can you get around the Markovian property when past actions/states do matter?

A

Include all necessary/relevant past information in the current state

5
Q

What is the solution to an MDP?

A

The policy function, which maps states to actions: Pi(s) -> a

6
Q

What is the MDP policy Pi*?

A

Pi* is the optimal policy: the policy that maximizes long-term expected reward

7
Q

What is the difference between planning and RL policy?

A

Planning aims to develop a concrete (multi-action) plan to achieve an objective. A policy instead answers, in each state, “what action should I take now?”

8
Q

What is one issue with delayed rewards in MDPs? Hint: this problem has a name.

A

Rewards may arrive long after the actions that earned them, so we must determine which states and actions were responsible for the outcomes we saw (and minor changes can matter a great deal). This is referred to as the (temporal) credit assignment problem.

9
Q

What assumptions are made in the sequence of rewards for MDPs?

A

Infinite horizon (which gives stationarity of the optimal policy)
Stationary preferences over utilities of sequences, i.e. if
U(s0, s1, s2, …) > U(s0, s1', s2', …) then
U(s1, s2, …) > U(s1', s2', …)

10
Q

What is the purpose of gamma in an MDP?

A

Gamma is the discount rate, in the range [0.0, 1.0), used to guarantee convergence of the infinite sum of rewards

11
Q

What is the bounded sum of rewards for an MDP given the max reward R_max and the discount rate gamma?

A

R_max/(1-gamma)
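
A quick sketch of where the bound comes from, assuming every reward satisfies R(s) <= R_max and 0 <= gamma < 1 (it is just the geometric series):

Sum_{t=0..inf} gamma^t * R(s_t) <= Sum_{t=0..inf} gamma^t * R_max = R_max * 1/(1-gamma) = R_max/(1-gamma)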

12
Q

What is the difference between utility and reward?

A

Utility is the long-term expected value of an action (the accumulated, discounted rewards that follow)
Reward is the immediate feedback received for taking an action

13
Q

What is the Bellman equation?

A

The Bellman equation describes the utility of a state in a discounted MDP

U(s) = R(s) + gamma * max_a Sum_s' [ T(s,a,s') * U(s') ]
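
A minimal sketch of one Bellman backup in Python, assuming the dictionary representation of T, R, and a current utility estimate U from the earlier MDP example (all names are illustrative):

```python
def bellman_backup(s, U, T, R, actions, gamma=0.9):
    """U(s) = R(s) + gamma * max_a Sum_s' [ T(s,a,s') * U(s') ]"""
    best = max(
        sum(p * U[s_next] for s_next, p in T[(s, a)].items())
        for a in actions
    )
    return R[s] + gamma * best
```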

14
Q

How can we solve Bellman’s equation directly?

A

Value Iteration, Policy Iteration

15
Q

How do you perform value iteration?

A

Start with arbitrary utilities
Update utilities based on neighbors using Bellman’s equation
Repeat until convergence
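
A sketch of these steps in Python, reusing the assumed states/actions/T/R dictionaries from the MDP example above; the tolerance is an arbitrary stopping choice:

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    U = {s: 0.0 for s in states}        # start with arbitrary utilities
    while True:
        new_U = {}
        for s in states:                # update from neighbors via Bellman's equation
            best = max(
                sum(p * U[s_next] for s_next, p in T[(s, a)].items())
                for a in actions
            )
            new_U[s] = R[s] + gamma * best
        if max(abs(new_U[s] - U[s]) for s in states) < tol:
            return new_U                # repeat until convergence
        U = new_U
```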

16
Q

How do you perform policy iteration?

A

Start - Initialize an arbitrary policy Pi_0
Evaluate - For timestep t, given Pi_t, calculate U_t using Bellman's equation with the action fixed by the policy, which gives n linear equations in n unknowns
Improve - Pi_{t+1}(s) = argmax_a Sum_s' [ T(s,a,s') * U_t(s') ]
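
A sketch of policy iteration under the same assumed representation; the evaluate step solves the n linear equations in matrix form, (I - gamma * T_pi) U = R, with numpy:

```python
import numpy as np

def policy_iteration(states, actions, T, R, gamma=0.9):
    idx = {s: i for i, s in enumerate(states)}
    pi = {s: actions[0] for s in states}        # Pi_0: arbitrary initial policy
    while True:
        # Evaluate: U_t(s) = R(s) + gamma * Sum_s' T(s, Pi_t(s), s') * U_t(s')
        T_pi = np.zeros((len(states), len(states)))
        for s in states:
            for s_next, p in T[(s, pi[s])].items():
                T_pi[idx[s], idx[s_next]] = p
        r = np.array([R[s] for s in states])
        U = np.linalg.solve(np.eye(len(states)) - gamma * T_pi, r)
        # Improve: Pi_{t+1}(s) = argmax_a Sum_s' [ T(s,a,s') * U_t(s') ]
        new_pi = {
            s: max(actions,
                   key=lambda a, s=s: sum(p * U[idx[s_next]]
                                          for s_next, p in T[(s, a)].items()))
            for s in states
        }
        if new_pi == pi:                        # policy stopped changing
            return pi, {s: U[idx[s]] for s in states}
        pi = new_pi
```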

17
Q

Compare policy iteration to value iteration

A

Policy iteration is more expensive per iteration (it solves/inverts an n x n linear system) but may require fewer iterations
Both are guaranteed to converge

18
Q

What is the relationship between the three forms of the Bellman equation?

A

Watch Lesson 1.29

19
Q

In an RL context the agent acts as ___ and the environment acts as ___

A

policy (pi); the MDP
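
A sketch of that split as an interaction loop. The env object and its reset()/step() methods are hypothetical stand-ins for the MDP, not an interface defined in the lectures:

```python
def run_episode(env, policy, max_steps=100):
    # The agent supplies the policy pi; the environment plays the MDP,
    # applying the model T and handing back rewards from R.
    s = env.reset()                  # environment provides a start state
    total_reward = 0.0
    for _ in range(max_steps):
        a = policy[s]                # agent: pi(s) -> a
        s, r, done = env.step(a)     # environment: next state and reward
        total_reward += r
        if done:
            break
    return total_reward
```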

20
Q

What types of behavior can AI agents learn?

A

Plan - Fixed sequence of actions (stochasticity causes problems)*
Conditional Plan - A plan with if/else statements (can handle some stochasticity)
Stationary Policy/Universal Plan - Mapping from state to action for every state (can handle any stochasticity, but very large)

*Dynamic replanning - Recreate a new plan if something goes wrong

21
Q

How do we evaluate a learner?

A
  1. Value of returned policy
  2. Computational complexity (how much time needed)
  3. Experience/sample complexity (how much data needed)