Chapter 23 - Reinforcement learning Flashcards

1
Q

What is reinforcement learning really about?

A

Learning from one's own mistakes and experience.

2
Q

What is a weakness of supervised learning?

A

A weakness is that the training data must be representative of all possible data. In many cases it is; in many other cases it is not. Take chess: the set of training data is tiny compared to the set of all possible situations. All it takes is one situation that is significantly different from the training data, and the agent fails badly.

3
Q

How does RL differ from the “making complex decisions” case?

A

In that case, we gave the agent an MDP to solve. In RL, the agent is IN the MDP. The agent does not know the transition model, and it does not know the reward function. Yet somehow, it must be able to learn from its experiences.

We can view the RL problem like this:
“Consider playing a game without knowing the rules. After a certain number of moves, the referee tells us that we lose.”

4
Q

Define sparse rewards

A

Sparse rewards refer to cases where the agent receives very few rewards compared to the number of actions/moves it takes. Sparse rewards are typically associated with only being told whether we win or lose.

5
Q

How do we speed up learning in RL?

A

Use more rewards, especially intermediate rewards.

6
Q

What are the main types of RL?

A

Model-based

Model-free

7
Q

Elaborate on model-based RL

A

In model-based RL, the agent uses a transition model of the environment to help interpret the reward signals and to make decisions.

The agent may or may not know the transition model. If it is unknown, the agent must learn it by observing the outcomes of its actions. Alternatively, the agent may know the model but not how to associate states with utility; in chess, for instance, the agent knows the rules but not the utility of positions.

Model-based RL often ends up trying to learn a utility function.
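
A minimal sketch of the “learn the transition model by observing outcomes” idea, assuming the agent has logged hypothetical (state, action, next_state) triples; the states and actions below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical experience log: (state, action, next_state) triples the agent observed.
experience = [
    ("s1", "right", "s2"),
    ("s1", "right", "s2"),
    ("s1", "right", "s1"),
    ("s2", "up", "s3"),
]

# Count the observed outcomes of each (state, action) pair.
counts = defaultdict(lambda: defaultdict(int))
for s, a, s_next in experience:
    counts[(s, a)][s_next] += 1

# Maximum-likelihood estimate of the transition model P(s' | s, a).
transition_model = {}
for (s, a), outcomes in counts.items():
    total = sum(outcomes.values())
    transition_model[(s, a)] = {s_next: n / total for s_next, n in outcomes.items()}

print(transition_model[("s1", "right")])   # roughly {'s2': 0.67, 's1': 0.33}
```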

8
Q

Elaborate on model-free RL

A

In model-free RL, the agent never knows or tries to learn the transition model. Instead, it learns a more direct representation of how to behave.

There are two types of model-free RL:
1) Action-utility learning. The agent learns a Q-function and uses it to pick the action with the highest Q-value (see the sketch below).

2) Policy search. The agent learns a policy directly, acting as a reflex agent.
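
A minimal sketch of action-utility selection, assuming a tabular Q-function stored as a plain dict; the states, actions, and Q-values are hypothetical:

```python
# Hypothetical tabular Q-function: (state, action) -> estimated Q-value.
Q = {
    ("s1", "left"): 0.2,
    ("s1", "right"): 0.8,
    ("s1", "up"): 0.5,
}

def best_action(q_table, state, actions):
    """Pick the action with the highest Q-value in the given state."""
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

print(best_action(Q, "s1", ["left", "right", "up"]))   # -> 'right'
```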

9
Q

What is the difference between passive and active RL?

A

Passive RL has a fixed policy, and the task is to learn the utilities of the states. Basically, passive RL is about figuring out whether a given policy is good or not.

Active RL also involves figuring out what to do.

10
Q

In passive RL, which of these does the agent know?
1) Transition model
2) Reward function

A

Neither

11
Q

What are trials?

A

We say the agent executes a set of trials in the environment using its policy.

During each trial, the agent starts at the initial state and experiences a sequence of transitions until it reaches a terminal state. Its percepts supply both the current state and the reward received for the transition.

The objective is to learn the expected utility of each state.
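
A minimal sketch of executing one trial under a fixed policy, using a hypothetical toy chain environment with a simple state-based reward (the environment, states, and rewards are invented for illustration):

```python
class ChainEnv:
    """Hypothetical toy chain: states 0..3, where state 3 is terminal.

    Rewards are state-based for simplicity: -0.04 in every non-terminal
    state and +1 in the terminal state.
    """

    def reset(self):
        self.state = 0
        return self.state, -0.04           # initial percept: state and its reward

    def step(self, action):
        if action == "right":              # only "right" makes progress in this toy model
            self.state += 1
        reward = 1.0 if self.state == 3 else -0.04
        done = self.state == 3
        return self.state, reward, done


def run_trial(env, policy):
    """Execute one trial under a fixed policy and return the (state, reward) sequence."""
    state, reward = env.reset()
    trial = [(state, reward)]
    done = False
    while not done:
        state, reward, done = env.step(policy[state])
        trial.append((state, reward))
    return trial


policy = {0: "right", 1: "right", 2: "right"}   # the fixed policy of passive RL
print(run_trial(ChainEnv(), policy))
# [(0, -0.04), (1, -0.04), (2, -0.04), (3, 1.0)]
```

The (state, reward) sequence produced here is the input assumed by the direct utility estimation sketch further down.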

12
Q

What formula is used for the trials?

A
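
Presumably the expected utility of a state when following the fixed policy π, which is the quantity each trial provides samples of:

U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t \, R(S_t, \pi(S_t), S_{t+1})\right], \quad S_0 = s

where γ is the discount factor and R is the (unknown) reward function.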
13
Q

If the utility of a state is defined as the expected total reward from that state onward, what do we call it?

A

Direct utility estimation

14
Q

Elaborate on direct utility estimation

A

Direct utility estimation uses the idea that the utility of a state is the expected total reward from that state onward.

Each trial provides a sample of this quantity for each state visited.

This value is also called the reward-to-go.

So, what happens is this:
The agent runs the trials. Every time it reaches a state in a trial, we get a new sample of that state's reward-to-go. We fold this sample into a running average for that state, so the current average utility estimate is always available.

We expect the utility estimates to converge to the true values if we perform infinitely many trials. However, because direct utility estimation ignores the connections between states (the Bellman equations), it converges very slowly.
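
A minimal sketch of the running-average update, assuming trials are lists of (state, reward) pairs as in the hypothetical run_trial sketch above (the reward paired with a state is treated as the reward received in that state):

```python
from collections import defaultdict

utility = defaultdict(float)   # running-average utility estimate per state
visits = defaultdict(int)      # number of reward-to-go samples seen per state

def update_from_trial(trial, gamma=1.0):
    """Fold the reward-to-go samples from one trial into the running averages.

    `trial` is a list of (state, reward) pairs, e.g. as produced by the
    hypothetical run_trial sketch above.
    """
    reward_to_go = 0.0
    # Walk the trial backwards so each state's reward-to-go accumulates in one pass.
    for state, reward in reversed(trial):
        reward_to_go = reward + gamma * reward_to_go
        visits[state] += 1
        # Incremental running average: U(s) <- U(s) + (sample - U(s)) / n
        utility[state] += (reward_to_go - utility[state]) / visits[state]

update_from_trial([(0, -0.04), (1, -0.04), (2, -0.04), (3, 1.0)])
print(dict(utility))   # roughly {3: 1.0, 2: 0.96, 1: 0.92, 0: 0.88}
```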
