CS7642_Week1 Flashcards

1
Q

What are the four (4) components (the “Problem” portion) of an MDP?

A

States, the model (i.e. the transition function), actions, and rewards
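
In symbols (standard notation, not tied to any particular lecture), the problem part of an MDP is the tuple below; the reward is written here as a function of state, though R(s, a) and R(s, a, s') are also common:

```latex
% States S, actions A, model T (transition probabilities), reward R.
\langle S, A, T, R \rangle, \qquad
T(s, a, s') = \Pr(s_{t+1} = s' \mid s_t = s,\, a_t = a), \qquad
R : S \to \mathbb{R}
```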

2
Q

What two requirements must be met for a valid MDP?

A
  1. It must satisfy the Markov property, i.e. only the present matters (in symbols below).
  2. It must be stationary, i.e. the physics of the world (the “model”), the actions available, etc. must not change as a function of time.
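
In symbols (standard notation), the Markov property says the next state depends only on the current state and action:

```latex
\Pr(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0) = \Pr(s_{t+1} \mid s_t, a_t)
```
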
3
Q

What is a policy, and what is an optimal policy?

A

A policy is a mapping from states to actions. An optimal policy is one that maps states to actions in such a way that it maximizes the expected cumulative (discounted) reward the agent will receive over time.
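
In symbols (standard notation, with gamma the discount factor):

```latex
% A policy maps states to actions; the optimal policy maximizes expected discounted reward.
\pi : S \to A, \qquad
\pi^* = \arg\max_{\pi} \; \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t R(s_t) \mid \pi \right]
```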

4
Q

What is the difference between a plan and a policy?

A

A policy is a mapping from states to actions. A plan is simply a sequence of actions. It’s an important distinction because the stochastic nature of the world means that just because you plan to go somewhere doesn’t guarantee you’ll get there. A policy is more general in that it simply says “if I’m in a state, what action do I take?”
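
A toy sketch of the distinction, with made-up grid states and action names (nothing here is course code):

```python
# A plan is a fixed sequence of actions; a policy maps every state to an action.
plan = ["up", "up", "right"]        # brittle: assumes each action lands where intended

policy = {                          # robust: answers "what do I do HERE?" for any state
    (0, 0): "up",
    (0, 1): "up",
    (0, 2): "right",
    (1, 2): "right",
}

state = (1, 2)                      # stochasticity pushed us off the planned path
action = policy[state]              # the policy still tells us what to do
print(action)                       # -> "right"
```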

5
Q

What is the credit assignment problem?

A

It’s the problem of figuring out which states and actions from far in the past deserve the credit (or blame) for where we are, and the reward we receive, now.

6
Q

What is the “Stationarity of Preferences”?

A

The idea that if I prefer one sequence of states over another today, then I will hold the same preference tomorrow, and the day after, and so on. This is important because it leads to the natural decision to simply add (discounted) rewards together to calculate the utility (i.e. value) of a sequence of states.
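
In symbols (standard notation), stationary preferences justify the discounted additive utility of a sequence of states:

```latex
U(s_0, s_1, s_2, \dots) = \sum_{t=0}^{\infty} \gamma^t R(s_t)
```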

7
Q

What does “discounting” rewards achieve? What must be true of the discount factor gamma for this condition to be met?

A

It lets us add up an infinite series of rewards and still get a finite number, so infinite-horizon utilities are well defined. The discount factor must satisfy 0 <= gamma < 1.
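
The reason is the geometric series bound: if every reward is at most R_max and 0 <= gamma < 1, the infinite sum stays finite:

```latex
\sum_{t=0}^{\infty} \gamma^t R(s_t) \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max}
\;=\; \frac{R_{\max}}{1 - \gamma}
```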

8
Q

What is the difference between a reward and utility (i.e. value)?

A

R != V, i.e. reward only encompasses immediate gratification. Value tells us how good it is to be in some state based on the immediate reward + the discounted future value (i.e. it accounts for DELAYED rewards).
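
This is exactly what the Bellman equation expresses (one common form, with reward written as a function of state):

```latex
% Value = immediate reward + discounted value of the best thing that can follow.
U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')
```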

9
Q

Why does value iteration work?

A

Because each Bellman backup folds in one more step of the true rewards and dynamics, so with every sweep the estimates capture more of the real problem. Formally, the backup is a contraction for gamma < 1, so the values are guaranteed to converge to the true values, and once the relative ordering of actions is correct, acting greedily on them gives the optimal policy.
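
A minimal sketch of value iteration for a finite MDP. The dict-based encoding (T[s][a] as a list of (next_state, probability) pairs, R[s] a scalar) is just an assumed toy representation for illustration:

```python
# Minimal value-iteration sketch for a finite MDP (illustrative encoding, not course code).
# T[s][a] is a list of (next_state, probability) pairs; R[s] is the reward for state s.

def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                      # arbitrary starting guess
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: fold in one more step of the real rewards/dynamics.
            best = max(sum(p * V[s2] for s2, p in T[s][a]) for a in actions)
            new_v = R[s] + gamma * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:                               # contraction => guaranteed convergence
            return V
```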

10
Q

Does policy iteration always converge?

A

Yes. In a finite MDP there are only finitely many policies, and each iteration produces a policy at least as good as the previous one, so policy iteration converges to the optimal policy in a finite number of iterations.
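
A minimal sketch of policy iteration under the same assumed toy encoding of T and R as the value-iteration example above; the loop stops when the greedy policy stops changing:

```python
# Minimal policy-iteration sketch (T[s][a]: list of (next_state, prob); R[s]: scalar reward).
# Finitely many policies + monotone improvement => the loop must terminate.

def policy_iteration(states, actions, T, R, gamma=0.9, eval_sweeps=200):
    pi = {s: actions[0] for s in states}              # arbitrary initial policy
    while True:
        # 1) Policy evaluation (iterative, approximate): value of following pi everywhere.
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in T[s][pi[s]])
                 for s in states}
        # 2) Policy improvement: act greedily with respect to V.
        new_pi = {s: max(actions,
                         key=lambda a: sum(p * V[s2] for s2, p in T[s][a]))
                  for s in states}
        if new_pi == pi:                              # policy stopped changing: done
            return pi, V
        pi = new_pi
```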

11
Q

What does the ‘Q’ in Q values stand for?

A

Quality: more specifically, how good it is to take a particular action in a particular state.
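
In symbols (standard notation): Q(s, a) is the value of taking action a in state s and then acting optimally afterwards:

```latex
Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \, \max_{a'} Q(s', a')
```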

12
Q

Why do we bother having V functions AND Q functions? Isn’t V enough?

A

Q functions are very useful for learning from samples of data, because we don’t need access to the transition function. The value function does need it: value iteration requires us to compute the expected value of the next state for each update.
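
For example, a single experience tuple (s, a, r, s') is enough to nudge a Q estimate, with no model required (this is the standard Q-learning-style update, alpha being a learning rate):

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]
```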
