Week 4 Flashcards

(23 cards)

1
Q

What is the Bellman optimality equation?

A

It means that whatever we do after the first state-action pair (s, a) must itself be optimal.

Wikipedia: Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)
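For reference, one standard way to write the Bellman optimality equation; the notation R^a_s, P^a_{ss'} and the discount γ are assumptions here and may differ from the slides.

```latex
% Bellman optimality equations for state values and action values
v_*(s)    = \max_{a} \Big[ \mathcal{R}^{a}_{s} + \gamma \sum_{s'} \mathcal{P}^{a}_{ss'} \, v_*(s') \Big]
q_*(s, a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s'} \mathcal{P}^{a}_{ss'} \max_{a'} q_*(s', a')
```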

2
Q

What is dynamic programming?

A

It solves problems by decomposing them into sub-problems that can be solved separately. DP works successfully in problems that have two properties: optimal substructure and overlapping sub-problems.

3
Q

What are the two properties a problem needs so that DP can be applied?

A

• Optimal substructure:
  • Principle of optimality: the optimal solution can be decomposed into sub-problems.
  • In MDPs, this is satisfied by the Bellman Optimality Equation.
• Overlapping sub-problems:
  • Sub-problems may occur many times, and solutions can be cached and reused.
  • In MDPs, this is satisfied by information in the value function v(s).
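A generic illustration (not from the slides) of the two properties: a memoised Fibonacci, where each value decomposes into smaller sub-problems (optimal substructure) and the same sub-problems recur and are cached (overlapping sub-problems), much like the value function v(s) caches results in an MDP.

```python
from functools import lru_cache

@lru_cache(maxsize=None)            # the cache plays the role of the value function
def fib(n: int) -> int:
    # optimal substructure: fib(n) is built from solutions to smaller sub-problems
    if n < 2:
        return n
    # overlapping sub-problems: fib(n-1) and fib(n-2) recur, but each is computed only once
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # fast, because every sub-problem is solved exactly once
```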

4
Q

What is Iterative Policy Evaluation?

A

It is a DP algorithm that evaluates a given policy π (calculates the values vπ(s) for all states s). It performs prediction: it estimates how good it is to follow a policy π in a given MDP.
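A minimal sketch of iterative policy evaluation for a tabular MDP, assuming arrays P (transition probabilities), R (expected rewards) and pi (the policy) with the shapes described in the docstring; these names and shapes are assumptions for illustration, not the course's code.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-6):
    """Estimate v_pi for a known tabular MDP.

    P[s, a, s'] : transition probabilities
    R[s, a]     : expected immediate rewards
    pi[s, a]    : probability of taking action a in state s
    """
    v = np.zeros(P.shape[0])
    while True:
        # synchronous backup: compute v_{k+1} for every state from v_k
        q = R + gamma * P @ v                  # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * v[s']
        v_new = np.sum(pi * q, axis=1)         # average over the policy's action probabilities
        if np.max(np.abs(v_new - v)) < theta:  # stop once a full sweep barely changes anything
            return v_new
        v = v_new
```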

5
Q

What is the loop for the algorithm (iterative policy evaluation) to work, i.e. what are the steps?

A
1. Initialise the value of every state to zero: v_0(s) = 0 for all states s ∈ S.
2. At each iteration k+1, for all states s ∈ S, update v_{k+1}(s) from v_k(s′) using the Bellman expectation equation (a synchronous backup), as written out below.
3. Repeat until the values converge to vπ.
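The update applied in step 2, written out (same assumed notation as in the Bellman optimality card):

```latex
% one synchronous sweep of iterative policy evaluation
v_{k+1}(s) = \sum_{a} \pi(a \mid s) \Big[ \mathcal{R}^{a}_{s} + \gamma \sum_{s'} \mathcal{P}^{a}_{ss'} \, v_k(s') \Big]
```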

6
Q

Look at slide 11 for a good understanding of iterative policy evaluation.

A

It is a tree (a backup diagram).

7
Q

Look at slides 13 to 15.

A

They give you an understanding of how iterative policy evaluation with the Bellman equation is used in a small gridworld.

8
Q

When we reach k = 10, is it sensible to keep using a random policy? Can't we do better?

A

It is not sensible; we can do better, because we want to follow the best possible route rather than keep acting randomly.

9
Q

What is a greedy policy?

A

It is where, after the first iteration, we modify our policy using the values v(s) we are calculating:

π′(s) = argmax_a qπ(s, a)
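Acting greedily is a one-step lookahead over the current value estimate (same assumed notation as before):

```latex
\pi'(s) = \operatorname*{arg\,max}_{a} q_\pi(s, a),
\qquad
q_\pi(s, a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s'} \mathcal{P}^{a}_{ss'} \, v_\pi(s')
```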

10
Q

What is policy improvement?

A

It is the process of making a new policy that improves on an original policy by acting greedily with respect to vπ.

11
Q

I am a bit confused by this: how do we obtain π∗?

A

Process: we have a policy and evaluate how good it is (we complete Policy
Evaluation). Then, change our policy to act better (greedily) according to vπ
(Policy Improvement). We evaluate again how good this new policy is (another
round of Policy Evaluation), to improve it again later (Policy Improvement
again). Rinse and repeat.
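A compact sketch of that evaluate-improve loop (policy iteration), reusing the hypothetical policy_evaluation helper and the tabular P, R arrays assumed in the Iterative Policy Evaluation card:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate Policy Evaluation and greedy Policy Improvement until the policy is stable."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from a uniform random policy
    while True:
        v = policy_evaluation(P, R, pi, gamma)             # Policy Evaluation
        q = R + gamma * P @ v                              # one-step lookahead q_pi(s, a)
        new_pi = np.eye(n_actions)[np.argmax(q, axis=1)]   # Policy Improvement: act greedily w.r.t. v_pi
        if np.array_equal(new_pi, pi):                     # rinse and repeat until nothing changes
            return new_pi, v
        pi = new_pi
```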

12
Q

How does Policy Improvement work?

A
In the Small Gridworld example, π′ = π∗ already at k = 3, but in general many more iterations of these two steps are needed. However, policy iteration always converges to π∗.
13
Q

Policy improvement: when do we stop?

A

When there is convergence, i.e. when acting greedily no longer improves the value:

max_a qπ(s, a) = vπ(s)
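Why that condition is enough (a standard argument, notation assumed): once acting greedily no longer improves anything, the Bellman optimality equation holds, so the current policy is already optimal.

```latex
q_\pi(s, \pi'(s)) = \max_{a} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)
\quad \Rightarrow \quad
v_\pi(s) = \max_{a} q_\pi(s, a) \;\; \text{for all } s, \;\; \text{hence } v_\pi = v_*
```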

14
Q

Need to look over this more to get a better understanding of it.

A

I think it is just saying that when the action-value function qπ(s, a) for the greedy action equals vπ(s), the policy cannot be improved any further.

15
Q

Need to look over the slides for the greedy search.

16
Q

When thinking about DP, what is it talking about?

A

It is a way of solving a problem. The problem has to have two inherent properties: you have to be able to break it down into sub-problems, and those sub-problems have to recur so they can be solved over and over. The first is shown by the Bellman equation, which tells you how to break a problem down into steps (think of it as breaking the journey from here to there into single steps); the second part is the value function, which gives you the value at intermediate points so that you can solve from different starting points.

17
Q

When acting greedily, what are we trying to maximise?

A

It is qπ(s, a): the value of being in some state s and applying some action a.

18
Q

Need to understand the case where you are given an MDP with complete knowledge of the environment.

A

This is on slides 20-27 and in the lecture video.

19
Q

What is model-free prediction?

A

It is prediction when no one tells you the transition dynamics or reward function of the environment.
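For example, a minimal TD(0) sketch that estimates vπ from sampled transitions alone; the env object with reset/step and its return values are assumptions for illustration, not a specific library's API.

```python
import numpy as np

def td0_prediction(env, policy, n_states, episodes=1000, alpha=0.1, gamma=0.9):
    """Estimate v_pi from experience only: no transition or reward model is used."""
    v = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()                      # assumed interface: returns the initial state
        done = False
        while not done:
            a = policy(s)                    # sample an action from the policy being evaluated
            s_next, r, done = env.step(a)    # assumed interface: observe a sampled transition
            # TD(0) update: move v(s) towards the one-step bootstrapped target
            target = r + gamma * (0.0 if done else v[s_next])
            v[s] += alpha * (target - v[s])
            s = s_next
    return v
```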

20
Q

What is the backward view of TD(λ)?

21
Q

λ = 1

A

The sum of offline updates is identical for the forward view and the backward view.

22
Q

TD(λ): what is it?

A

Basically, it ranges from TD to MC, depending on λ.
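The forward-view λ-return that makes this interpolation precise (standard definition): λ = 0 keeps only the one-step TD target, while λ = 1 recovers the Monte-Carlo return.

```latex
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)},
\qquad
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{\,n-1} R_{t+n} + \gamma^{\,n}\, v(S_{t+n})
```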

23
Q

What are the limits of TD(λ)?

A

There are two of them, the forward view and the backward view; it is about finding the best trade-off between TD and MC.