Monte Carlo Flashcards

(6 cards)

1
Q

How do Monte Carlo methods work in reinforcement learning?

A

They solve reinforcement learning problems using sample episodes, averaging the returns observed from actual or simulated experience. The focus is on episodic tasks, where each trial eventually terminates, ensuring that returns are well-defined.
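As a concrete illustration, here is a minimal sketch of first-visit Monte Carlo prediction. It assumes, purely for illustration, that each episode is a list of (state, reward) pairs, where the reward is the one received on leaving that state:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns over sample episodes."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for episode in episodes:
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = gamma * G + reward
            # First-visit: only record the return at the earliest occurrence of the state.
            if state not in (s for s, _ in episode[:t]):
                returns_sum[state] += G
                returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Example usage with two tiny hand-made episodes:
episodes = [[("A", 0), ("B", 1)], [("A", 1)]]
print(first_visit_mc_prediction(episodes))   # -> {'B': 1.0, 'A': 1.0}
```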

2
Q

What is the fundamental difference between dynamic programming and Monte Carlo methods in reinforcement learning, and what are the implications of that difference for their usage?

A

Dynamic programming requires a complete model of the environment's dynamics (transition probabilities and expected rewards), whereas Monte Carlo methods learn directly from sample episodes. Because MC methods need no explicit model, they are often easier to apply in complex domains where the dynamics are unknown or hard to specify.

3
Q

What is the goal of Monte Carlo Methods? What is the challenge for deterministic policies according to this definition?

A

It is to estimate the action-value function of a state-action pair under a given policy by averaging the returns observed following visits to that pair.

If the policy is deterministic, only one action per state is visited, so no returns are collected for the alternative actions.
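A small sketch of this averaging for state-action pairs (every-visit, for brevity); the (state, action, reward) episode format is an assumption for illustration. Under a deterministic policy, only one action per state ever appears in these averages:

```python
from collections import defaultdict

def mc_action_values(episodes, gamma=1.0):
    """Estimate q(s, a) by averaging returns observed after each (state, action) pair."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            returns_sum[(state, action)] += G       # every-visit averaging for brevity
            returns_count[(state, action)] += 1
    # Note: under a deterministic policy only one action per state ever shows up here,
    # so estimates for the alternative actions are simply never formed.
    return {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}
```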

4
Q

What is Monte Carlo Control?

A

The overall idea is to combine policy evaluation with policy improvement in an iterative scheme (see the sketch below):
- Policy Evaluation: use Monte Carlo methods to estimate the action-value function for the policy
- Policy Improvement: update the current policy to be greedy with respect to the estimated value function
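A rough sketch of this evaluation/improvement loop, assuming a hypothetical generate_episode(policy) helper that returns (state, action, reward) triples and a list of possible actions:

```python
from collections import defaultdict

def mc_control(generate_episode, actions, num_iterations=1000, gamma=1.0):
    q = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {}                                    # state -> greedy action

    for _ in range(num_iterations):
        # Policy evaluation: nudge q towards the returns observed in one sampled episode.
        # generate_episode must pick some action for states not yet in `policy` (e.g. at random).
        episode = generate_episode(policy)
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            returns_count[(state, action)] += 1
            n = returns_count[(state, action)]
            q[(state, action)] += (G - q[(state, action)]) / n   # incremental sample average

        # Policy improvement: act greedily with respect to the current estimates.
        for state, _, _ in episode:
            policy[state] = max(actions, key=lambda a: q[(state, a)])

    return policy, q
```

The incremental update keeps a running sample average of the returns, which is equivalent to storing and averaging all observed returns.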

5
Q

What is Monte Carlo ES (Exploring Starts)? Why is it important?

A

Monte Carlo Exploring Starts (ES) is an algorithm for executing Monte Carlo control in which each episode begins from a state-action pair chosen so that every pair has a non-zero probability of being selected as the start.

The random initialization is critical because, if the policy is deterministic or even nearly greedy, some actions might never be chosen in a given state. With exploring starts you force the system to explore all available state-action pairs, preventing the policy from settling on a suboptimal set of actions.
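A tiny sketch of the exploring-starts start selection; states, actions, and run_episode_from are hypothetical placeholders for the task at hand:

```python
import random

def exploring_start(states, actions):
    """Choose the initial state-action pair uniformly, so every pair can start an episode."""
    return random.choice(states), random.choice(actions)

def generate_episode_with_es(states, actions, policy, run_episode_from):
    # The first state and action are forced by the exploring start;
    # the remainder of the episode follows the current policy.
    s0, a0 = exploring_start(states, actions)
    return run_episode_from(s0, a0, policy)
```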

6
Q

What is a requirement for on-policy methods without exploring starts? What is the final algorithm?

A

That the policy is made soft: every action in every state is selected with a non-zero probability (for example, an ε-greedy policy). This ensures that all state-action pairs continue to receive updates and are explored.

  1. Generate an episode following the current soft policy
  2. For every state-action pair encountered, update the estimate by averaging the returns
  3. Improve the policy locally for all states in the episode by shifting probability mass towards the action with the highest estimated action value, while keeping the policy soft (see the sketch below)
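A possible sketch of the ε-soft improvement step from point 3; the layout of the q dictionary and the actions list are assumptions for illustration:

```python
def epsilon_soft_improvement(q, states_in_episode, actions, epsilon=0.1):
    """Shift most probability mass onto the greedy action, keeping every action non-zero."""
    policy = {}
    n = len(actions)
    for state in states_in_episode:
        greedy = max(actions, key=lambda a: q.get((state, a), 0.0))
        # Greedy action gets 1 - eps + eps/|A|; every other action gets eps/|A|.
        policy[state] = {
            a: (1 - epsilon + epsilon / n) if a == greedy else epsilon / n
            for a in actions
        }
    return policy
```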