week4 Flashcards
what is the Bellman optimality equation?
this means that everything taken after the first (s, a) has to itself be the optimal thing to do
Wikipedia: Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)
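For reference, one standard way of writing it (a sketch assuming the usual MDP notation: \mathcal{R}_s^a expected reward, \mathcal{P}_{ss'}^a transition probability, \gamma discount factor):

v_*(s) = \max_a \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_*(s') \Big)

q_*(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \max_{a'} q_*(s', a')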
what is dynamic programming
it solves problems by decomposing them into sub-problems that can be solved separately. DP works successfully in problems that have two properties:
what are the two properties that problems need so that they can be solved by DP?
• Optimal substructure:
• Principle of optimality: the optimal solution can be decomposed into sub-problems.
• In MDPs, this is satisfied by the Bellman Optimality Equation.
• Overlapping sub-problems:
• Sub-problems may occur many times, and solutions can be cached and reused.
• In MDPs, this is satisfied by information in the value function v(s).
what is Iterative Policy Evaluation?
it is a DP algorithm that evaluates a given policy π (it calculates the values vπ(s) for all states s). it performs prediction: it estimates how good it is to follow a policy π in a given MDP
what is the loop for the algo to work, e.g. what are the steps?
- at each iteration k+1, for all states s ∈ S, update v_{k+1}(s) from v_k(s') using a Bellman expectation backup (synchronous backups)
look at slide 11 for a good understanding of iterative policy evaluation; it's drawn as a backup tree (see the sketch below)
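A minimal sketch of this loop in Python, assuming a small finite MDP stored as P[s][a] = list of (prob, next_state, reward) triples and a policy pi[s][a] = probability of action a in state s (hypothetical names, not from the slides):

import numpy as np

def iterative_policy_evaluation(P, pi, gamma=1.0, theta=1e-8):
    # Evaluate policy pi with synchronous Bellman expectation backups.
    # P[s][a]: list of (prob, next_state, reward); pi[s][a]: prob of action a in s (assumed format).
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        v_new = np.zeros(n_states)
        for s in range(n_states):
            for a, action_prob in enumerate(pi[s]):
                for prob, s_next, reward in P[s][a]:
                    # weight each successor by policy probability and transition probability
                    v_new[s] += action_prob * prob * (reward + gamma * v[s_next])
        if np.max(np.abs(v_new - v)) < theta:   # stop once the values have stopped changing
            return v_new
        v = v_new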
look at slides 13 to 15
they give you an understanding of how iterative policy evaluation with the Bellman equation is used in a small gridworld
Q? When we reach k = 10, is it sensible to keep using a random policy? Can’t
we do better?
no, it is not sensible; we can do better, because the values already point at the best possible route, so we should act greedily with respect to them
what is a greedy policy?
it is where, after the first iteration, we modify our policy using the values v(s) we are calculating:
π'(s) = argmax_a qπ(s, a)
what is policy improvement?
it is the process of making a new policy that improves on an original policy by acting greedily with respect to vπ
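A minimal sketch of acting greedily with respect to vπ, using the same assumed MDP format as the evaluation sketch above: compute qπ(s, a) one step ahead from vπ and take the argmax.

import numpy as np

def greedy_policy(P, v, gamma=1.0):
    # Return a deterministic policy that is greedy w.r.t. the value function v.
    n_states = len(P)
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = []
        for a in range(len(P[s])):
            # q_pi(s, a) = expected immediate reward + discounted value of the successor
            q.append(sum(prob * (reward + gamma * v[s_next])
                         for prob, s_next, reward in P[s][a]))
        policy[s] = int(np.argmax(q))   # pi'(s) = argmax_a q_pi(s, a)
    return policy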
i am a bit confused by this: how do we obtain π*?
Process: we have a policy and evaluate how good it is (we complete Policy Evaluation). Then, we change our policy to act better (greedily) according to vπ (Policy Improvement). We evaluate again how good this new policy is (another round of Policy Evaluation), to improve it again later (Policy Improvement again). Rinse and repeat. (A sketch of this loop is given below.)
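A sketch of the full loop (policy iteration), reusing the two hypothetical helpers above; the policy is stored as action probabilities so it can be fed back into evaluation:

import numpy as np

def policy_iteration(P, n_actions, gamma=1.0):
    # Alternate Policy Evaluation and greedy Policy Improvement until the policy stops changing.
    n_states = len(P)
    pi = np.ones((n_states, n_actions)) / n_actions      # start from the uniform random policy
    while True:
        v = iterative_policy_evaluation(P, pi, gamma)    # Policy Evaluation
        greedy = greedy_policy(P, v, gamma)              # Policy Improvement
        pi_new = np.eye(n_actions)[greedy]               # deterministic policy as one-hot rows
        if np.array_equal(pi_new, pi):                   # stable policy => pi = pi*
            return pi_new, v
        pi = pi_new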
How does Policy Improvement work?
In the Small Gridworld example, the greedy policy is already optimal (π′ = π∗) at k = 3, but in general many more iterations of these two steps are needed. However, policy iteration always converges to π∗.
policy improvement: when do we stop?
when there is convergence: improvement is still possible while max_a qπ(s, a) > vπ(s), and we stop once max_a qπ(s, a) = vπ(s)
i think it is just saying that when acting greedily can no longer raise the action-value qπ(s, a) above vπ(s), the Bellman optimality equation is satisfied, so vπ(s) = v∗(s) and π is already optimal (see the equations below)
need to look over this more to get a better understanding of it
look over the slides for the greedy search, slides 22-24
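Written out (a sketch of the standard argument, usual notation): improvement stops when the greedy action is no better than the current policy,

q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) = v_\pi(s) \quad \text{for all } s,

and then the Bellman optimality equation is satisfied, so v_\pi(s) = v_*(s) for all s and π is an optimal policy.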
when thinking about DP, what is it talking about?
it is a way of solving a problem. the problem has to have two inherent properties: you have to be able to break the problem down into steps, and the sub-problems have to be solvable over and over. the first part is shown by the Bellman equation, which gives you how to break a problem down into steps (think of it as a walk from here to somewhere); the second part is the value function, which gives you the value at the middle points so that you can solve from different starting points
when acting greedily, what are we trying to maximise?
qπ(s, a), i.e. the value of some state s with some action a applied
need to understand the case where you are given an MDP with complete knowledge of the environment
this is on slides 20-27 and in the lecture video
what is model-free prediction?
it is when no one tells you the transition dynamics or reward function of the environment, and you still have to estimate the value function
what is the backward view TD(λ)?
video 4, 1:33
the backward view updates every state at each time step, in proportion to the TD error and that state's eligibility trace
when λ = 1, the sum of offline updates is identical for the forward view and the backward view
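For reference, the standard backward-view TD(λ) update with eligibility traces (α is the step size, δ_t the TD error):

E_0(s) = 0, \qquad E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)

\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)

V(s) \leftarrow V(s) + \alpha \, \delta_t \, E_t(s) \quad \text{for all } s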