Week 4 - Handling Uncertainty Flashcards

(18 cards)

1
Q

Certainty assumptions in normal searches

A

Actions are deterministic
The current state is fully observable

However, this doesn't apply in many real-world scenarios.

e.g. a Mars rover may intend to travel somewhere, but there is a probability of it crashing (not deterministic)

The current state isn't fully observable - a sensor may be wrong

2
Q

stochastic

A

The result of an action is uncertain (i.e. probabilistic)

3
Q

how does non-determinism affect state transitions

A

State transitions are no longer consistent: with some probability an outcome other than the intended one may occur

4
Q

what do Markov chains model

A

Markov chains model probabilistic transitions between states, where the next state is determined only by the current state and the transition probabilities, NOT by the history of states
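
Not from the notes - a minimal Python sketch of a Markov chain, assuming a made-up two-state weather example, where the next state is sampled using only the current state:

```python
import random

# Hypothetical two-state weather chain: the next state depends only on
# the current state (Markov property), not on the history of states.
T = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state):
    """Sample the next state from the current state's transition distribution."""
    next_states = list(T[state])
    probs = [T[state][s] for s in next_states]
    return random.choices(next_states, weights=probs)[0]

state = "sunny"
for _ in range(5):
    state = step(state)
    print(state)
```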

5
Q

what is the Markov property

A

The next state is determined only by the current state (via the probabilistic transitions between states), not by the history of states

6
Q

important distinction about Markov chains

A

A Markov chain DOESN'T USE ACTIONS - it only models transitions between STATES

7
Q

what are MDPs (Markov Decision Processes)

A

An MDP is a sequential decision-making problem for a fully observable but stochastic environment.

We carry on with the notion of Markov chains, but this time actions determine the probabilistic transitions between states, and states have rewards

8
Q

what are the assumptions of MDPs

A

Assumptions:
the Markov property holds
the probability distribution is stationary (the transition probabilities don't change over time)

9
Q

what do we need to model MDPs so we can solve them

A

The set of all states the world can be in

T(s, a, s') - the transition model: if we are in state s and take action a, the probability that we reach state s'

The initial state of the problem

The reward function R(s) for a given state
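
A minimal sketch of how these four ingredients could be written down in Python - the states, probabilities and rewards here are made-up illustrations, not from the notes:

```python
# Hypothetical tiny MDP: the four ingredients needed to solve it.
states = ["s0", "s1", "goal"]            # set of all states
initial_state = "s0"                      # initial state of the problem

# Reward function R(s) for each state
R = {"s0": -0.04, "s1": -0.04, "goal": 1.0}

# Transition model T(s, a, s') = P(s' | s, a):
# e.g. action "go" from s0 succeeds with probability 0.8, otherwise we stay put.
T = {
    ("s0", "go"): {"s1": 0.8, "s0": 0.2},
    ("s1", "go"): {"goal": 0.8, "s1": 0.2},
}
```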

10
Q

what is an MDP solution

A

An MDP solution is not a plan (a sequence of actions), because the environment is stochastic and actions may fail (have unintended consequences).

Hence an MDP solution is a policy
that tells us:
for each state, what is the (optimal) action to take
and hence the optimal policy is the one with the highest expected utility

The executed behaviour can differ between runs of the same policy, due to the stochastic nature of the environment

11
Q

formula for expected value (expected utility)

A

E[U] = \sum_{c} P(c) \cdot U(c)

where c ranges over the values our variable can take
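
A quick sketch of the formula in Python, assuming a made-up gamble that pays 10 with probability 0.3 and 0 otherwise:

```python
# E[U] = sum over values c of P(c) * U(c)
outcomes = {10: 0.3, 0: 0.7}   # maps value U(c) -> probability P(c), made-up numbers

expected_utility = sum(p * u for u, p in outcomes.items())
print(expected_utility)  # 0.3 * 10 + 0.7 * 0 = 3.0
```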

12
Q

Bayes' rule

A

Conditional probability: P(A \mid B) = P(A \cap B) / P(B)

Bayes' rule follows by rewriting the joint: P(A \mid B) = P(B \mid A) \, P(A) / P(B)
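
A small worked sketch with made-up numbers, combining the rule with the law of total probability to get P(B):

```python
# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B), with made-up numbers.
p_A = 0.01             # prior P(A)
p_B_given_A = 0.9      # P(B | A)
p_B_given_not_A = 0.1  # P(B | not A)

# Law of total probability: P(B) = P(B|A) P(A) + P(B|~A) P(~A)
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))  # ~0.083
```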

13
Q

just read

A

E[U \mid s, a] = \sum_{s' \in nei(s)} P(s' \mid s, a) \, U(s')

(the expected utility of taking action a in state s: a probability-weighted sum of the utilities of the neighbouring states s' we could end up in)
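
A minimal sketch of this sum for a single (state, action) pair - the neighbouring states, probabilities and utilities are made-up values:

```python
# E[U | s, a] = sum over s' in nei(s) of P(s' | s, a) * U(s')
P = {"s1": 0.8, "s0": 0.2}   # P(s' | s, a) over the neighbours of s (made up)
U = {"s0": 0.0, "s1": 0.5}   # current utility estimates U(s') (made up)

expected = sum(P[s_next] * U[s_next] for s_next in P)
print(expected)  # 0.8 * 0.5 + 0.2 * 0.0 = 0.4
```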

14
Q

properties of an optimal policy

A

complete (covers all states)
optimal (gives the best action for each state s)
stationary (the action depends only on the current state)
proper (guaranteed to reach a terminal state)

15
Q

explain how value iteration works

A

How value iteration works:
Set all states to have utility 0, except terminal states
Do one round of value iteration, i.e. apply the Bellman update to each state once
Keep iterating (applying the Bellman update) until U_{i+1}(s) = U_i(s) for all states (the utilities no longer change when you iterate again)
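
A minimal value-iteration sketch over a made-up three-state MDP (the states, rewards, transition model and discount factor are all illustrative assumptions, not from the notes):

```python
# Value iteration: repeatedly apply the Bellman update until utilities stop changing.
gamma = 0.9
states = ["s0", "s1", "goal"]
terminal = {"goal"}
R = {"s0": -0.04, "s1": -0.04, "goal": 1.0}

# Transition model: T[(s, a)] = {s': P(s' | s, a)}
T = {
    ("s0", "go"):   {"s1": 0.8, "s0": 0.2},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"):   {"goal": 0.8, "s1": 0.2},
    ("s1", "stay"): {"s1": 1.0},
}

def actions(s):
    """Actions available in state s (those listed in the transition model)."""
    return [a for (st, a) in T if st == s]

U = {s: 0.0 for s in states}   # all utilities start at 0 ...
for s in terminal:
    U[s] = R[s]                # ... except terminal states, which keep their reward

while True:
    U_next = dict(U)
    for s in states:
        if s in terminal:
            continue
        # Bellman update: R(s) + gamma * max_a sum_s' P(s'|s,a) * U_i(s')
        U_next[s] = R[s] + gamma * max(
            sum(p * U[s2] for s2, p in T[(s, a)].items()) for a in actions(s)
        )
    # stop when utilities no longer change (up to a small tolerance)
    if all(abs(U_next[s] - U[s]) < 1e-6 for s in states):
        U = U_next
        break
    U = U_next

print(U)
```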

16
Q

how to extract the optimal policy when values have converged

A

Once the values have converged, extract the optimal policy by picking, for each state, the action with the highest expected utility.
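
Continuing the value-iteration sketch above (this reuses states, terminal, T, actions and the converged U from that block), the policy can be read off like this:

```python
# Extract the policy: for each non-terminal state, pick the action whose
# expected utility under the converged values U is highest.
policy = {}
for s in states:
    if s in terminal:
        continue
    policy[s] = max(
        actions(s),
        key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)].items()),
    )
print(policy)  # e.g. {'s0': 'go', 's1': 'go'}
```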

17
Q

Bellman's equation

A

U_{i+1}(s) = R(s) + \gamma \cdot \max_{a \in A} \sum_{s'} P(s' \mid s, a) \cdot U_i(s')

18
Q

expected utility of a policy

A

U^{\pi}(s) = E\left[ \sum_{t=0}^{\infty} \gamma^{t} R(S_t) \right]

It's saying the discount factor increases in power each time you take an action,

i.e. R(S_0) + \gamma R(S_1) + \gamma^{2} R(S_2) + \dots
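
A tiny numeric sketch of that discounted sum for one made-up trajectory, showing the power of γ growing with each step:

```python
# U^pi(s) = E[ sum_t gamma^t * R(S_t) ] -- return of one sampled trajectory
gamma = 0.9
rewards = [-0.04, -0.04, -0.04, 1.0]   # made-up rewards R(S_0), R(S_1), ...

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # -0.04 - 0.04*0.9 - 0.04*0.81 + 1.0*0.729 ≈ 0.62
```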