Week 3 Flashcards
(28 cards)
What is reinforcement learning?
Learning what to do by trial and error: the agent is not told which actions to take, but must discover which actions yield the most reward by interacting with an environment that gives it scalar reward signals.
What are the main points of RL?
• No supervisor, only a reward signal
• Feedback is delayed
• Sequential samples, not iid (time matters)
• Actions influence future observations
iid: independently and identically distributed.
What is a reward and how does it relate to the RL problem?
Rt is a scalar that indicates how well the agent is doing at step t, so the agent's job is to maximise the cumulative reward (the sum of the rewards across an episode).
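Written as a formula (a sketch; T here is my symbol for the final time step of the episode, not notation from the slides):

```latex
\text{maximise} \quad \mathbb{E}\!\left[\sum_{t=1}^{T} R_t\right]
```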
What is sequential decision making and how does it affect the RL process?
- Goal: select actions to maximize total future reward
- Actions may have long term consequences
- Reward may be delayed
- Immediate vs. long-term reward
This is about how you get a reward and the problem with sequential decision making: it may be better to sacrifice immediate reward to gain more long-term reward.
What is the reinforcement learning loop?
The loop runs over time steps t, t+1, t+2, ..., t+n. At each time step t the agent executes an action At and receives back an observation and a scalar reward from the environment.
The environment:
• Receives action At
• Emits observation Ot+1 (not necessarily equal to St+1)
• Emits scalar reward Rt+1
• t increments at every step
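A minimal Python sketch of this loop; the toy environment and random agent below are made-up placeholders, not anything from the course:

```python
import random

# Toy environment: reward is +1 with probability 0.5 when action == 1, else 0.
class ToyEnv:
    def reset(self):
        return 0                                   # initial observation O_1
    def step(self, action):
        reward = 1.0 if action == 1 and random.random() < 0.5 else 0.0
        observation = random.randint(0, 3)         # next observation O_{t+1}
        done = random.random() < 0.1               # episode ends with prob 0.1
        return observation, reward, done

class RandomAgent:
    def act(self, observation):
        return random.choice([0, 1])               # pick action A_t (here: at random)

env, agent = ToyEnv(), RandomAgent()
obs, total_reward = env.reset(), 0.0
for t in range(100):                               # time steps t, t+1, t+2, ...
    action = agent.act(obs)                        # agent executes A_t
    obs, reward, done = env.step(action)           # env emits O_{t+1} and R_{t+1}
    total_reward += reward                         # accumulate the scalar reward
    if done:
        break
print("episode return:", total_reward)
```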
What happens when you have a forward model in RL, and what if you do not?
If you have a forward model, this loop happens in the agent's "brain" (the agent can simulate the environment internally); if not, the loop has to run in the real environment.
What is the difference between a forward model and just playing the game?
I think it is just where the loop happens: with a forward model it runs inside the agent's internal model, and when playing the game it runs in the real environment.
What is history?
History is the sequence of observations, actions and rewards: Ht = O1, R1, A1, ..., At-1, Ot, Rt.
Talk about state.
State is a summary of the information used to determine what happens next (the following cards cover the Markov state, full observability and partial observability).
The Markov state
A Markov state contains all useful information from the history.
• "The future is independent of the past given the present" (if St is known, Ht can be thrown away)
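Formally, the standard statement of the Markov property for a state St:

```latex
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]
```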
what is full observability?
agents observe the full envioment, this is a markov decision process (MDP), chess
What is partial observability?
The agent observes only part of the environment; this is a partially observable Markov decision process (POMDP). Example: poker.
Look at slides 11-13.
What are a policy, a value function and a model? What are their differences and how do they fit in with RL?
A policy maps states to actions (how the agent behaves), a value function estimates the expected future reward from a state (how good each state is), and a model is the agent's prediction of what the environment will do next. Value iteration and policy iteration are two methods for solving MDPs:
https://medium.com/@m.alzantot/deep-reinforcement-learning-demysitifed-episode-2-policy-iteration-value-iteration-and-q-978f9e89ddaa
Need to read over this web page.
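A tiny sketch of how the three things can be represented for a made-up two-state problem (all names and numbers are illustrative, not from the course):

```python
# Illustrative only: two states "A"/"B", two actions "left"/"right".
policy = {"A": "right", "B": "left"}          # policy: state -> action (behaviour)

value_function = {"A": 0.8, "B": 0.3}         # value function: state -> expected future reward

def model(state, action):
    """Model: predicts the next-state distribution and reward (made-up numbers)."""
    if state == "A" and action == "right":
        return {"B": 1.0}, 1.0                # P(s' | s, a), R(s, a)
    return {"A": 1.0}, 0.0

next_dist, reward = model("A", policy["A"])
print(next_dist, reward, value_function["A"])
```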
Markov Decision Processes
What is the Markov property?
It means that only the present matters: the current state has all the relevant information from the previous states built in.
Is the Markov property stationary?
Yes; this means that the model and the different actions that you can take stay the same over time.
What are the 4 things that make up an MDP?
- States: S
- Model/transitions: T(S, A, S') = Pr(S' | S, A)
- Actions: A
- Rewards: R(S), R(S, A), or R(S, A, S')
What is the solution of an MDP?
It is a policy, π(s) -> a.
π* is the optimal policy: it maximises your long-term reward.
If you have a maze, what are the 3 steps?
- The maze defines the S, M, A, R (states, model/transitions, actions, rewards)
- Policy (this defines the best action to take in each state)
- Value function (gives a value to each of the states; see the sketch below)
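A minimal value-iteration sketch on a made-up 4-cell maze (the states, rewards and the 0.9 discount are illustrative assumptions, not from the slides), showing how the value function and a greedy policy fall out of the S, M, A, R definition:

```python
# Tiny deterministic "maze": 4 cells in a row, goal at cell 3, discount 0.9.
states = [0, 1, 2, 3]
actions = ["left", "right"]
gamma = 0.9

def transition(s, a):
    """Deterministic model T(s, a) -> s' (cell 3 is terminal/absorbing)."""
    if s == 3:
        return 3
    return min(s + 1, 3) if a == "right" else max(s - 1, 0)

def reward(s, a, s_next):
    """R(s, a, s'): +1 for stepping onto the goal cell, 0 otherwise."""
    return 1.0 if s != 3 and s_next == 3 else 0.0

# Value iteration: repeatedly apply the Bellman optimality backup.
V = {s: 0.0 for s in states}
for _ in range(50):
    V = {s: max(reward(s, a, transition(s, a)) + gamma * V[transition(s, a)]
                for a in actions) if s != 3 else 0.0
         for s in states}

# Greedy policy: pick the action with the highest backed-up value in each state.
policy = {s: max(actions, key=lambda a: reward(s, a, transition(s, a))
                 + gamma * V[transition(s, a)])
          for s in states if s != 3}
print(V)       # e.g. {0: 0.81, 1: 0.9, 2: 1.0, 3: 0.0}
print(policy)  # {0: 'right', 1: 'right', 2: 'right'}
```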
Why do we need a discount in an MDP?
Without a discount, the sum of rewards over a never-ending (infinite-horizon) process could be infinite; a discount factor γ < 1 keeps the return finite and says that immediate rewards are worth more than rewards far in the future.
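The standard discounted return, with discount factor γ in [0, 1]:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```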
What is the definition of a value function?
The state-value function v(s) of an MRP is the expected return starting from state s:
v(s) = E[Gt | St = s]
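A minimal sketch of computing v(s) for a tiny made-up MRP (the transition matrix, rewards and discount below are illustrative assumptions), solving the Bellman equation v = R + γPv as a linear system:

```python
import numpy as np

# Tiny made-up MRP: 3 states, row-stochastic transition matrix P, rewards R, discount 0.9.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
R = np.array([1.0, 2.0, 0.0])     # expected immediate reward when leaving each state
gamma = 0.9

# Bellman expectation equation for an MRP: v = R + gamma * P v
# => (I - gamma * P) v = R, solved directly as a linear system.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)   # expected return from each state
```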