CS7642_Week1 Flashcards

1
Q

What are the four (4) components (the “Problem” portion) of an MDP?

A

States, the model (i.e. the transition function), actions, and rewards
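
In symbols (standard notation, not tied to any particular lecture), the problem part of an MDP is the tuple below; the reward is written here as a function of state, though R(s, a) and R(s, a, s') are also common:

```latex
% States S, actions A, model T (transition probabilities), reward R.
\langle S, A, T, R \rangle, \qquad
T(s, a, s') = \Pr(s_{t+1} = s' \mid s_t = s,\, a_t = a), \qquad
R : S \to \mathbb{R}
```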

2
Q

What two requirements must be met for a valid MDP?

A
  1. It must satisfy the Markov property, i.e. only the present matters (in symbols below).
  2. It must be stationary, i.e. the physics of the world (the “model”), the actions available, etc. must not change as a function of time.
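
In symbols (standard notation), the Markov property says the next state depends only on the current state and action:

```latex
\Pr(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0) = \Pr(s_{t+1} \mid s_t, a_t)
```
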
3
Q

What is a policy, and what is an optimal policy?

A

A policy is a mapping from states to actions. An optimal policy is one that maps states to actions in such a way that it maximizes the expected cumulative (discounted) reward the agent will receive over time.
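
In symbols (standard notation, with gamma the discount factor):

```latex
% A policy maps states to actions; the optimal policy maximizes expected discounted reward.
\pi : S \to A, \qquad
\pi^* = \arg\max_{\pi} \; \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t R(s_t) \mid \pi \right]
```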

4
Q

What is the difference between a plan and a policy?

A

A policy is a mapping from states to actions. A plan is simply a sequence of actions. It’s an important distinction because the stochastic nature of the world means that just because you plan to go somewhere doesn’t guarantee you’ll get there. A policy is more general in that it simply says “if I’m in a state, what action do I take?”
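
A toy sketch of the distinction, with made-up grid states and action names (nothing here is course code):

```python
# A plan is a fixed sequence of actions; a policy maps every state to an action.
plan = ["up", "up", "right"]        # brittle: assumes each action lands where intended

policy = {                          # robust: answers "what do I do HERE?" for any state
    (0, 0): "up",
    (0, 1): "up",
    (0, 2): "right",
    (1, 2): "right",
}

state = (1, 2)                      # stochasticity pushed us off the planned path
action = policy[state]              # the policy still tells us what to do
print(action)                       # -> "right"
```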

5
Q

What is the credit assignment problem?

A

It’s the problem of figuring out which states and actions from far in the past deserve the credit (or blame) for where we are, and the reward we receive, now.

6
Q

What is the “Stationarity of Preferences”?

A

The idea that if I prefer one sequence of states over another today, then I will hold the same preference tomorrow, and the day after, and so on. This is important because it leads to the natural decision to simply add (discounted) rewards together to calculate the utility (i.e. value) of a sequence of states.
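
In symbols (standard notation), stationary preferences justify the discounted additive utility of a sequence of states:

```latex
U(s_0, s_1, s_2, \dots) = \sum_{t=0}^{\infty} \gamma^t R(s_t)
```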

7
Q

What does “discounting” rewards achieve? What must be true of the discount factor gamma for this condition to be met?

A

It lets us add up an infinite series of rewards and still get a finite number, so infinite-horizon utilities are well defined. The discount factor must satisfy 0 <= gamma < 1.
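
The reason is the geometric series bound: if every reward is at most R_max and 0 <= gamma < 1, the infinite sum stays finite:

```latex
\sum_{t=0}^{\infty} \gamma^t R(s_t) \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max}
\;=\; \frac{R_{\max}}{1 - \gamma}
```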

8
Q

What is the difference between a reward and utility (i.e. value)?

A

R != V, i.e. reward only encompasses immediate gratification. Value tells us how good it is to be in some state based on the immediate reward + the discounted future value (i.e. it accounts for DELAYED rewards).
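
This is exactly what the Bellman equation expresses (one common form, with reward written as a function of state):

```latex
% Value = immediate reward + discounted value of the best thing that can follow.
U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')
```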

9
Q

Why does value iteration work?

A

Because each Bellman backup folds in one more step of the true rewards and dynamics, so with every sweep the estimates capture more of the real problem. Formally, the backup is a contraction for gamma < 1, so the values are guaranteed to converge to the true values, and once the relative ordering of actions is correct, acting greedily on them gives the optimal policy.
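
A minimal sketch of value iteration for a finite MDP. The dict-based encoding (T[s][a] as a list of (next_state, probability) pairs, R[s] a scalar) is just an assumed toy representation for illustration:

```python
# Minimal value-iteration sketch for a finite MDP (illustrative encoding, not course code).
# T[s][a] is a list of (next_state, probability) pairs; R[s] is the reward for state s.

def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                      # arbitrary starting guess
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: fold in one more step of the real rewards/dynamics.
            best = max(sum(p * V[s2] for s2, p in T[s][a]) for a in actions)
            new_v = R[s] + gamma * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:                               # contraction => guaranteed convergence
            return V
```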

10
Q

Does policy iteration always converge?

A

Yes. In a finite MDP there are only finitely many policies, and each iteration produces a policy at least as good as the previous one, so policy iteration converges to the optimal policy in a finite number of iterations.
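
A minimal sketch of policy iteration under the same assumed toy encoding of T and R as the value-iteration example above; the loop stops when the greedy policy stops changing:

```python
# Minimal policy-iteration sketch (T[s][a]: list of (next_state, prob); R[s]: scalar reward).
# Finitely many policies + monotone improvement => the loop must terminate.

def policy_iteration(states, actions, T, R, gamma=0.9, eval_sweeps=200):
    pi = {s: actions[0] for s in states}              # arbitrary initial policy
    while True:
        # 1) Policy evaluation (iterative, approximate): value of following pi everywhere.
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in T[s][pi[s]])
                 for s in states}
        # 2) Policy improvement: act greedily with respect to V.
        new_pi = {s: max(actions,
                         key=lambda a: sum(p * V[s2] for s2, p in T[s][a]))
                  for s in states}
        if new_pi == pi:                              # policy stopped changing: done
            return pi, V
        pi = new_pi
```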

11
Q

What does the ‘Q’ in Q values stand for?

A

Quality: more specifically, how good it is to take a particular action in a particular state.
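
In symbols (standard notation): Q(s, a) is the value of taking action a in state s and then acting optimally afterwards:

```latex
Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \, \max_{a'} Q(s', a')
```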

12
Q

Why do we bother having V functions AND Q functions? Isn’t V enough?

A

Q functions are very useful for learning from samples of data, because we don’t need access to the transition function. The value function does need it: value iteration requires us to compute the expected value of the next state for each update.
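
For example, a single experience tuple (s, a, r, s') is enough to nudge a Q estimate, with no model required (this is the standard Q-learning-style update, alpha being a learning rate):

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]
```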
