Dynamic Programming Flashcards
(5 cards)
What is Dynamic Programming? How does it fit into the solution of an MDP?
A method to solve complex problems by breaking them down into subproblems, solving those smaller problems, and combining their solutions
In reinforcement learning, it leverages value functions to systematically compute good policies when the MDP's model (transition probabilities and rewards) is known
DP breaks into two main routines. What are they? What is the combination of those routines called, and what is its aim?
Policy Evaluation: given a policy, compute its state-value function
Policy Improvement: given an estimated state-value function, derive a better policy
By alternating these (Policy Iteration) or blending them in each sweep (Value Iteration), we converge to the optimal value function and optimal policy
What does solving the state-value function for |S| states essentially amount to? How can we solve it?
A system of |S| linear equations whose unknowns are the values of the states
Instead of solving the linear system directly, we use successive approximation:
1. Initialize the values for each state arbitrarily
2. Iteratively update each state's value with the Bellman expectation backup
3. Stop when the largest update is smaller than a given threshold
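The three steps above can be sketched in plain Python. The transition table `P`, the states, and the rewards below are a hypothetical toy MDP invented for illustration, not a standard benchmark:

```python
# Hypothetical 3-state, 2-action MDP: P[s][a] is a list of
# (probability, next_state, reward) tuples. State 2 is absorbing.
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 0.0)]},
    1: {0: [(1.0, 2, 1.0)], 1: [(1.0, 0, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
n_states, n_actions, gamma = 3, 2, 0.9

def policy_evaluation(policy, theta=1e-8):
    """Successive approximation of V^pi; policy[s][a] is the
    probability of taking action a in state s."""
    V = [0.0] * n_states                      # 1. arbitrary initialization
    while True:
        delta = 0.0
        for s in range(n_states):             # 2. one sweep of Bellman backups
            v = sum(policy[s][a] * sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a])
                    for a in range(n_actions))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:                     # 3. stop below the threshold
            return V

# Evaluate the deterministic policy "always take action 0".
V = policy_evaluation([[1.0, 0.0]] * n_states)
```

With this toy model the values converge to V = [0.9, 1.0, 0.0]: state 1 earns reward 1 and reaches the terminal state, and state 0 earns that discounted by gamma.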
What is Policy Iteration? What are the processes involved in it?
Policy Iteration alternates full evaluation and full improvement
- Initialize an arbitrary policy
- Do Policy Evaluation (by iterative evaluation until convergence) followed by Policy Improvement
- Stop when the policy does not change from one iteration to the next
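The loop above can be sketched as follows, again on a hypothetical toy MDP (the transition table `P` is invented for illustration):

```python
# Hypothetical 3-state, 2-action MDP: P[s][a] = list of (prob, next_state, reward).
# State 2 is absorbing with zero reward.
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 0.0)]},
    1: {0: [(1.0, 2, 1.0)], 1: [(1.0, 0, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
n_states, n_actions, gamma = 3, 2, 0.9

def q_value(s, a, V):
    """One-step lookahead: expected reward plus discounted successor value."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def evaluate(policy, theta=1e-8):
    """Full Policy Evaluation: iterate Bellman backups until convergence."""
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            v = q_value(s, policy[s], V)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def improve(V):
    """Policy Improvement: act greedily w.r.t. the current value estimate."""
    return [max(range(n_actions), key=lambda a: q_value(s, a, V))
            for s in range(n_states)]

def policy_iteration():
    policy = [1] * n_states              # arbitrary initial (deterministic) policy
    while True:
        V = evaluate(policy)             # full evaluation
        new_policy = improve(V)          # full improvement
        if new_policy == policy:         # stop: policy is stable
            return policy, V
        policy = new_policy

policy, V = policy_iteration()
```

Starting from the deliberately bad policy "always take action 1", a single evaluation/improvement round already flips every state to action 0, and the next round confirms the policy is stable.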
What is Value Iteration?
It fuses evaluation and improvement into a single update: each sweep backs up the maximum over actions (the Bellman optimality backup), so no separate evaluation-to-convergence phase is needed
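A minimal sketch of this fused update, on the same kind of hypothetical toy MDP used above (the transition table `P` is invented for illustration):

```python
# Hypothetical 3-state, 2-action MDP: P[s][a] = list of (prob, next_state, reward).
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 0.0)]},
    1: {0: [(1.0, 2, 1.0)], 1: [(1.0, 0, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
n_states, n_actions, gamma = 3, 2, 0.9

def value_iteration(theta=1e-8):
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            # Evaluation and improvement fused: back up the best action directly.
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in range(n_actions))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # Extract the greedy policy from the converged values.
    policy = [max(range(n_actions),
                  key=lambda a: sum(p * (r + gamma * V[s2])
                                    for p, s2, r in P[s][a]))
              for s in range(n_states)]
    return V, policy

V, policy = value_iteration()
```

Note the contrast with Policy Iteration: there is no explicit policy during the sweeps; the policy is only read off at the end from the converged value function.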