Week 4 - Handling Uncertainty Flashcards
(18 cards)
Certainty assumptions in normal searches
. actions are deterministic
. current state is fully observable
However this doesn't apply in many real-world scenarios
i.e. a Mars rover may intend to travel somewhere but there is a probability of crashing (not deterministic)
and the current state isn't fully observable - a sensor may be wrong
stochastic
the result of an action is uncertain (probabilistic)
how does non-determinism affect state transitions
state transitions are no longer certain: with some probability an outcome other than the intended one may occur
what do Markov chains model
Markov chains model probabilistic transitions between states, where the next state depends only on the current state (via the transition probabilities), NOT THE HISTORY OF STATES
what is the Markov property
the next state is determined only by the current state (via the probabilistic transition between states), not by the history of states
important distinction about Markov chains
THEY DON'T USE ACTIONS, ONLY STATES AND TRANSITIONS BETWEEN THEM
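A minimal sketch in Python of a tiny Markov chain (the weather states and probabilities below are made up for illustration): sampling the next state uses only the current state's row of the transition table, never the history.

```python
import random

# Hypothetical transition probabilities P(next state | current state).
# Note: a Markov chain has no actions, only states and transitions.
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current):
    """Sample the next state using only the current state (Markov property)."""
    outcomes = transitions[current]
    return random.choices(list(outcomes), weights=list(outcomes.values()))[0]

state = "sunny"
for _ in range(5):
    state = next_state(state)
    print(state)
```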
what are MDPs (Markov decision processes)
an MDP is a sequential decision-making problem for a fully observable but stochastic environment
we carry on with the notion of Markov chains, but this time the transitions between states depend probabilistically on the chosen action, and states have rewards
what are the assumptions of MDPs
assumptions:
. the Markov property holds
. the probability distribution is stationary (the probabilities don't change over time)
what do we need to model MDPs so we can solve them
. set of all states the world can be in
. T(s, a, s') - transition model that tells us: if we are in state s and take action a, what is the probability we end up in state s'
. initial state of the problem
. reward function R(s) for a given state
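A minimal sketch of how these four ingredients could be written down in Python (the states, probabilities and rewards are invented purely for illustration):

```python
# Hypothetical 3-state MDP, for illustration only.
states = ["s0", "s1", "goal"]          # set of all states the world can be in
initial_state = "s0"                   # initial state of the problem

# T(s, a, s'): probability of ending up in s' after taking action a in state s.
T = {
    ("s0", "go"):   {"s1": 0.8, "s0": 0.2},   # the action may fail and leave us in s0
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"):   {"goal": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
}

# R(s): reward for being in a given state.
R = {"s0": -0.1, "s1": -0.1, "goal": 1.0}
```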
what is an MDP solution
an MDP solution is not a plan (sequence of actions), as the environment is stochastic and so actions may fail (have unintended outcomes)
hence an MDP solution is a policy
that tells us:
for each state, what is the (optimal) action to take
and hence the optimal policy is the one with the highest expected utility
the states actually visited can differ each time the policy is executed, due to the stochastic nature of the environment
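A minimal sketch (reusing the hypothetical MDP above) of a policy as a state-to-action mapping; running it twice can visit different states because each transition is sampled:

```python
import random

# Hypothetical transition model, as in the earlier sketch.
T = {
    ("s0", "go"): {"s1": 0.8, "s0": 0.2},
    ("s1", "go"): {"goal": 0.9, "s0": 0.1},
}

# A policy maps every non-terminal state to an action.
policy = {"s0": "go", "s1": "go"}

def run_policy(policy, state="s0", max_steps=20):
    """Follow the policy, sampling each transition, until a terminal state."""
    trajectory = [state]
    for _ in range(max_steps):
        if state == "goal":
            break
        outcomes = T[(state, policy[state])]
        state = random.choices(list(outcomes), weights=list(outcomes.values()))[0]
        trajectory.append(state)
    return trajectory

print(run_policy(policy))   # e.g. ['s0', 's1', 'goal']
print(run_policy(policy))   # may differ, e.g. ['s0', 's0', 's1', 'goal']
```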
formula for expected value (expected utility)
E[U] = \sum_{c} P(c) \cdot U(c)
where c ranges over the values the variable can take
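A quick worked example with made-up numbers: if the variable can take value c_1 with probability 0.7 and utility 10, or c_2 with probability 0.3 and utility -20, then

E[U] = 0.7 \cdot 10 + 0.3 \cdot (-20) = 7 - 6 = 1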
bayes rule
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
(strictly this is the definition of conditional probability; Bayes' rule rewrites it as P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)})
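A quick worked example with made-up numbers: if P(A \cap B) = 0.12 and P(B) = 0.3, then P(A \mid B) = 0.12 / 0.3 = 0.4.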
expected utility of taking action a in state s (just read)
E[U \mid s, a] = \sum_{s' \in nei(s)} P(s' \mid s, a) \, U(s')
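A minimal sketch of this sum in Python, with hypothetical transition probabilities and utilities:

```python
# Hypothetical values for one state-action pair, for illustration only.
P = {"s1": 0.8, "s2": 0.2}   # P(s' | s, a) for the neighbours s' of s under action a
U = {"s1": 5.0, "s2": -1.0}  # current utility estimates U(s')

# E[U | s, a] = sum over s' of P(s' | s, a) * U(s')
expected_utility = sum(P[s_next] * U[s_next] for s_next in P)
print(expected_utility)      # 0.8*5.0 + 0.2*(-1.0) = 3.8
```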
properties of an optimal policy
. complete (covers all states)
. optimal (gives the best action for each state s)
. stationary (the action depends only on the current state, not on the time step)
. proper (eventually reaches a terminal state)
explain how value iteration works
set all states to have utility 0, except terminal states
do one round of value iteration, i.e. apply Bellman's update to each state once
keep iterating until U_{i+1}(s) = U_i(s) for all states (the utilities no longer change when you iterate again)
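A minimal sketch of these steps in Python on a tiny hypothetical MDP (states, transitions, rewards and gamma are all invented for illustration; convergence is checked with a small tolerance rather than exact equality):

```python
# Tiny hypothetical MDP, for illustration only.
states = ["s0", "s1", "goal"]
terminal = {"goal"}
actions = {"s0": ["go", "stay"], "s1": ["go", "stay"]}
T = {  # T(s, a) -> {s': P(s' | s, a)}
    ("s0", "go"):   {"s1": 0.8, "s0": 0.2},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"):   {"goal": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
}
R = {"s0": -0.1, "s1": -0.1, "goal": 1.0}
gamma = 0.9

# step 1: utilities start at 0, except terminal states (which keep their reward)
U = {s: (R[s] if s in terminal else 0.0) for s in states}

# steps 2-3: apply Bellman's update to every state, repeat until nothing changes
while True:
    U_next = dict(U)
    for s in states:
        if s in terminal:
            continue
        U_next[s] = R[s] + gamma * max(
            sum(p * U[s2] for s2, p in T[(s, a)].items())
            for a in actions[s]
        )
    if all(abs(U_next[s] - U[s]) < 1e-6 for s in states):   # converged
        U = U_next
        break
    U = U_next

print(U)   # converged utilities, e.g. {'s0': ..., 's1': ..., 'goal': 1.0}
```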
how to extract the optimal policy when values have converged
once the values have converged, for each state pick the action with the highest expected utility, i.e. the action maximising \sum_{s'} P(s' \mid s, a) \, U(s')
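A minimal sketch of this extraction step in Python, assuming converged utilities U and the same hypothetical transition model as in the value-iteration sketch:

```python
# Hypothetical converged utilities and transition model, for illustration only.
U = {"s0": 0.55, "s1": 0.79, "goal": 1.0}
actions = {"s0": ["go", "stay"], "s1": ["go", "stay"]}
T = {
    ("s0", "go"):   {"s1": 0.8, "s0": 0.2},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"):   {"goal": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
}

# for each state, pick the action with the highest expected utility
policy = {
    s: max(actions[s],
           key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)].items()))
    for s in actions
}
print(policy)   # {'s0': 'go', 's1': 'go'} for these made-up numbers
```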
Bellman's equation
U_{i+1}(s) = R(s) + \gamma \cdot \max_{a \in A} \sum_{s'} P(s' \mid s, a) \cdot U_i(s')
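A quick worked update with made-up numbers: suppose R(s) = -0.1, \gamma = 0.9, and two actions, where a_1 reaches utilities 5 and 1 with probabilities 0.8 and 0.2, and a_2 reaches utility 2 for certain. Then

U_{i+1}(s) = -0.1 + 0.9 \cdot \max(0.8 \cdot 5 + 0.2 \cdot 1,\ 1.0 \cdot 2) = -0.1 + 0.9 \cdot 4.2 = 3.68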
expected utility of a policy
U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(S_t)\right]
it's saying the discount factor's power increases with each time step, i.e. a reward received t steps in the future is weighted by \gamma^t, so rewards further in the future count for less
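For example (made-up numbers): with \gamma = 0.9 and a reward of 1 at every step, the first few terms of the sum are 1 + 0.9 + 0.81 + 0.729 + \ldots; if the reward stayed at 1 forever the whole sum would converge to 1 / (1 - \gamma) = 10.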