week4 Flashcards
what is the Bellman optimality equation?
this means that everything taken after the first (s, a) has to itself be the optimal thing to do
Wikipedia: Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)
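For reference, one standard way of writing it (a sketch assuming the usual MDP notation: \mathcal{R}_s^a expected reward, \mathcal{P}_{ss'}^a transition probability, \gamma discount factor):

v_*(s) = \max_a \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_*(s') \Big)

q_*(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \max_{a'} q_*(s', a')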
what is dynamic programming
it solves problems by decomposing them into sub-problems that can be solved separately. DP works successfully in problems that have two properties:
what are the two properties that problems need so that they can be solved by DP?
• Optimal substructure:
• Principle of optimality: the optimal solution can be decomposed into sub-problems.
• In MDPs, this is satisfied by the Bellman Optimality Equation.
• Overlapping sub-problems:
• Sub-problems may occur many times, and solutions can be cached and reused.
• In MDPs, this is satisfied by information in the value function v(s).
what is Iterative Policy Evaluation?
it is a DP algorithm that evaluates a given policy π (it calculates the values vπ(s) for all states s). it performs prediction: it estimates how good it is to follow a policy π in a given MDP
what is the loop for the algo to work, e.g. what are the steps?
- at each iteration k+1, for all states s ∈ S, update v_{k+1}(s) from v_k(s') using a Bellman expectation backup (synchronous backups)
look at slide 11 for a good understanding of iterative policy evaluation; it's drawn as a backup tree (see the sketch below)
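A minimal sketch of this loop in Python, assuming a small finite MDP stored as P[s][a] = list of (prob, next_state, reward) triples and a policy pi[s][a] = probability of action a in state s (hypothetical names, not from the slides):

import numpy as np

def iterative_policy_evaluation(P, pi, gamma=1.0, theta=1e-8):
    # Evaluate policy pi with synchronous Bellman expectation backups.
    # P[s][a]: list of (prob, next_state, reward); pi[s][a]: prob of action a in s (assumed format).
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        v_new = np.zeros(n_states)
        for s in range(n_states):
            for a, action_prob in enumerate(pi[s]):
                for prob, s_next, reward in P[s][a]:
                    # weight each successor by policy probability and transition probability
                    v_new[s] += action_prob * prob * (reward + gamma * v[s_next])
        if np.max(np.abs(v_new - v)) < theta:   # stop once the values have stopped changing
            return v_new
        v = v_new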
look at slides 13 to 15
they give you an understanding of how iterative policy evaluation with the Bellman equation is used in a small gridworld
Q? When we reach k = 10, is it sensible to keep using a random policy? Can’t
we do better?
no, it is not sensible; we can do better, because the values already point at the best possible route, so we should act greedily with respect to them
what is a greedy policy?
it is where, after the first iteration, we modify our policy using the values v(s) we are calculating:
π'(s) = argmax_a qπ(s, a)
what is policy improvement?
it is the process of making a new policy that improves on an original policy by acting greedily with respect to vπ
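A minimal sketch of acting greedily with respect to vπ, using the same assumed MDP format as the evaluation sketch above: compute qπ(s, a) one step ahead from vπ and take the argmax.

import numpy as np

def greedy_policy(P, v, gamma=1.0):
    # Return a deterministic policy that is greedy w.r.t. the value function v.
    n_states = len(P)
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = []
        for a in range(len(P[s])):
            # q_pi(s, a) = expected immediate reward + discounted value of the successor
            q.append(sum(prob * (reward + gamma * v[s_next])
                         for prob, s_next, reward in P[s][a]))
        policy[s] = int(np.argmax(q))   # pi'(s) = argmax_a q_pi(s, a)
    return policy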
i am a bit confused by this: how do we obtain π*?
Process: we have a policy and evaluate how good it is (we complete Policy Evaluation). Then, we change our policy to act better (greedily) according to vπ (Policy Improvement). We evaluate again how good this new policy is (another round of Policy Evaluation), to improve it again later (Policy Improvement again). Rinse and repeat. (A sketch of this loop is given below.)
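A sketch of the full loop (policy iteration), reusing the two hypothetical helpers above; the policy is stored as action probabilities so it can be fed back into evaluation:

import numpy as np

def policy_iteration(P, n_actions, gamma=1.0):
    # Alternate Policy Evaluation and greedy Policy Improvement until the policy stops changing.
    n_states = len(P)
    pi = np.ones((n_states, n_actions)) / n_actions      # start from the uniform random policy
    while True:
        v = iterative_policy_evaluation(P, pi, gamma)    # Policy Evaluation
        greedy = greedy_policy(P, v, gamma)              # Policy Improvement
        pi_new = np.eye(n_actions)[greedy]               # deterministic policy as one-hot rows
        if np.array_equal(pi_new, pi):                   # stable policy => pi = pi*
            return pi_new, v
        pi = pi_new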
How does Policy Improvement work?
In the Small Gridworld example, the greedy policy is already optimal (π′ = π∗) at k = 3, but in general many more iterations of these two steps are needed. However, policy iteration always converges to π∗.
policy improvement: when do we stop?
when there is convergence: improvement is still possible while max_a qπ(s, a) > vπ(s), and we stop once max_a qπ(s, a) = vπ(s)
i think it is just saying that when acting greedily can no longer raise the action-value qπ(s, a) above vπ(s), the Bellman optimality equation is satisfied, so vπ(s) = v∗(s) and π is already optimal (see the equations below)
need to look over this more to get a better understanding of it
look over the slides for the greedy search, slides 22-24
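Written out (a sketch of the standard argument, usual notation): improvement stops when the greedy action is no better than the current policy,

q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) = v_\pi(s) \quad \text{for all } s,

and then the Bellman optimality equation is satisfied, so v_\pi(s) = v_*(s) for all s and π is an optimal policy.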
when thinking about DP, what is it talking about?
it is a way of solving a problem. the problem has to have two inherent properties: you have to be able to break the problem down into steps, and the sub-problems have to be solvable over and over. the first part is shown by the Bellman equation, which gives you how to break a problem down into steps (think of it as a walk from here to somewhere); the second part is the value function, which gives you the value at the middle points so that you can solve from different starting points
when acting greedily, what are we trying to maximise?
qπ(s, a), i.e. the value of some state s with some action a applied
need to understand the case where you are given an MDP with complete knowledge of the environment
this is on slides 20-27 and in the lecture video
what is model-free prediction?
it is when no one tells you the transition dynamics or reward function of the environment, and you still have to estimate the value function
what is the backward view TD(λ)?
video 4, 1:33
the backward view updates every state at each time step, in proportion to the TD error and that state's eligibility trace
when λ = 1, the sum of offline updates is identical for the forward view and the backward view
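For reference, the standard backward-view TD(λ) update with eligibility traces (α is the step size, δ_t the TD error):

E_0(s) = 0, \qquad E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)

\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)

V(s) \leftarrow V(s) + \alpha \, \delta_t \, E_t(s) \quad \text{for all } s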