Exploration Flashcards

1
Q

What does Littman say is the thing that separates reinforcement learning from all other types of machine learning?

A

Exploration

2
Q

In the k-armed bandit scenario, should we use maximum likelihood to make our decisions? Why or why not? What about maximum confidence? Minimum confidence?

A

We shouldn’t use any of these strategies exclusively:
Maximum likelihood is bad because, through bad luck, we may settle on a suboptimal lever and never discover the better option (illustrated in the sketch below)
Maximum confidence is bad because we will never break from it (we always pull the arm we are most confident about, which only makes us more confident in it)
Minimum confidence is bad because we will keep learning more about the game but never use the information
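The first failure mode is easy to see in a toy simulation. The sketch below is illustrative only (the two Bernoulli arms with means 0.4 and 0.6 and the ε-greedy comparison are not from the lecture): acting purely on the maximum-likelihood estimates sometimes locks onto the worse arm after unlucky early pulls, while mixing in a little random exploration avoids this on average.

```python
import random

# Toy two-armed Bernoulli bandit (arm means 0.4 and 0.6 are made up for
# illustration). With epsilon = 0 the agent always pulls the arm with the
# highest maximum-likelihood estimate; with epsilon > 0 it occasionally
# explores a random arm instead.
def average_reward(epsilon, true_means=(0.4, 0.6), pulls=2000, runs=200):
    total = 0.0
    for seed in range(runs):
        rng = random.Random(seed)
        counts = [1, 1]  # one initial pull of each arm
        sums = [1.0 if rng.random() < m else 0.0 for m in true_means]
        for _ in range(pulls):
            if rng.random() < epsilon:
                arm = rng.randrange(2)                          # explore
            else:
                estimates = [s / c for s, c in zip(sums, counts)]
                arm = estimates.index(max(estimates))           # exploit the ML estimate
            reward = 1.0 if rng.random() < true_means[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward
            total += reward
    return total / (runs * pulls)

print("pure maximum likelihood:", round(average_reward(0.0), 3))  # stuck on the 0.4 arm in many runs
print("with some exploration:  ", round(average_reward(0.1), 3))  # close to the best arm's 0.6
```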

3
Q

What is the Gittins index (intuitively)? Does it work for everything?

A

Given your current information, how high would a fixed payoff have to be to convince you to take it rather than keep trying to gather more information?

No; it only works for k-armed bandits

4
Q

What are some example metrics (optimization criteria) for the k-armed bandit problem?

A
  1. Identify the optimal arm in the limit
  2. Maximize discounted expected reward (this is what the Gittins index optimizes)
  3. Maximize expected reward over a finite horizon
  4. Identify a near-optimal arm (within ε) with high probability (1 − δ) in a number of pulls T(k, ε, δ) that is polynomial in k, 1/ε, 1/δ
  5. Nearly maximize reward (within ε) with high probability (1 − δ) in time T(k, ε, δ), polynomial as above
  6. Pull a non-near-optimal arm (more than ε worse than optimal) no more than T(k, ε, δ) times, with high probability (1 − δ)
5
Q

What is the Hoeffding bound?

A

It gives a confidence interval, parameterized by a confidence level δ ∈ (0, 1), for the true mean m of a bounded (say, [0, 1]-valued) quantity, based on n observations and the empirical mean m’.

The interval is: [m’ − z_δ/√n, m’ + z_δ/√n], where z_δ = √((1/2) ln(2/δ)); it contains the true mean m with probability at least 1 − δ.
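A minimal sketch of computing this interval, assuming observations bounded in [0, 1] (the usual bandit reward setting); the function name is made up for illustration:

```python
import math

def hoeffding_interval(observations, delta):
    """Interval that contains the true mean with probability at least 1 - delta,
    assuming each observation lies in [0, 1]."""
    n = len(observations)
    m_hat = sum(observations) / n                  # empirical mean m'
    z = math.sqrt(0.5 * math.log(2.0 / delta))     # z_delta
    half_width = z / math.sqrt(n)
    return m_hat - half_width, m_hat + half_width

# Example: 50 Bernoulli observations of an arm, at 95% confidence.
samples = [1, 0, 1, 1, 0] * 10
print(hoeffding_interval(samples, delta=0.05))     # roughly (0.41, 0.79) around the 0.6 estimate
```

The half-width shrinks like 1/√n, so quadrupling the number of observations halves the interval.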

6
Q

What are some issues with random exploration as an optimization criterion/approach in MDPs? What would be better?

A
  1. We may never reach a desired state
  2. There may be trap states, and our criterion should take this into account

A mistake bound is better: the number of ε-suboptimal action choices is bounded by Poly(1/ε, 1/δ, n, k) with probability at least 1 − δ

7
Q

Describe the deterministic Rmax algorithm

A
  1. Keep track of the MDP (the transitions and rewards observed so far)
  2. Give any unknown state-action pair the reward Rmax
  3. Solve the MDP
  4. Take the action prescribed by π*
8
Q

How do we guarantee Rmax will make no more mistakes for a deterministic MDP?

A

Visit every state-action pair (so the deterministic model is fully known)

9
Q

When should we stop visiting unknown states in Rmax for a deterministic MDP?

A

Once the discounted reward obtainable from the known states is better than any discounted reward that could possibly be obtained by reaching the unknown states (even assuming the unknown states pay Rmax)

10
Q

How many mistakes could we make while trying to get more information in Rmax for a deterministic MDP?

A

n, where n is the number of states: reaching an unknown state-action pair takes at most n steps, after which we learn something new

11
Q

What is the maximum number of mistakes we can make with Rmax for a deterministic MDP?

A

kn², where n is the number of states and k is the number of actions: there are kn state-action pairs to learn, and each may cost up to n mistakes to reach

12
Q

What is the lower bound on the number of mistakes for any algorithm solving a deterministic MDP?

A

kn² (matching the Rmax bound above)

13
Q

Describe the GENERAL Rmax algorithm

A
  1. Keep track of the MDP (estimated from the observed transitions and rewards)
  2. Give any unknown state-action pair the reward Rmax, where "unknown" means a state-action pair that has been visited fewer than c times
  3. Solve the MDP
  4. Take the action prescribed by π* (a minimal sketch is below)
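A minimal sketch of this loop, under assumptions not in the card: finite state and action lists, an environment callback step(s, a) -> (next_state, reward), a known reward bound RMAX, and value iteration as the MDP solver. Unknown pairs are modeled as an optimistic Rmax-paying self-loop, which is what drives the exploration.

```python
import random
from collections import defaultdict

def rmax(step, states, actions, RMAX, c=5, gamma=0.9, steps=300, start=0):
    counts = defaultdict(int)                            # visits of each (s, a)
    next_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
    reward_sum = defaultdict(float)

    def q_value(s, a, V):
        if counts[(s, a)] < c:                      # unknown: optimistic Rmax self-loop
            return RMAX + gamma * V[s]
        n = counts[(s, a)]
        r_hat = reward_sum[(s, a)] / n              # empirical reward
        return r_hat + gamma * sum(k / n * V[s2] for s2, k in next_counts[(s, a)].items())

    def solve():                                    # value iteration on the learned model
        V = {s: 0.0 for s in states}
        for _ in range(200):
            V = {s: max(q_value(s, a, V) for a in actions) for s in states}
        return V

    s = start
    for _ in range(steps):
        V = solve()
        a = max(actions, key=lambda act: q_value(s, act, V))  # greedy w.r.t. the optimistic model
        s2, r = step(s, a)
        counts[(s, a)] += 1
        next_counts[(s, a)][s2] += 1
        reward_sum[(s, a)] += r
        s = s2
    V = solve()
    return {s: max(actions, key=lambda act: q_value(s, act, V)) for s in states}

# Toy usage: a 3-state chain where action 1 moves right and landing on state 2 pays 1.
def chain_step(s, a):
    s2 = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 2 else 0.0)

print(rmax(chain_step, states=[0, 1, 2], actions=[0, 1], RMAX=1.0))  # learns to always move right
```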
14
Q

What is the explore or exploit lemma?

A

If all transitions are either accurately estimated or unknown (and treated optimistically), and we optimize, then the resulting “optimal” policy is either near-optimal in the real MDP or will reach an unknown state quickly.

15
Q

What is the simulation lemma?

A

If we have an α-good estimate of the MDP, then optimizing reward in the estimate yields a policy that is also α-good in the real MDP

16
Q

Briefly describe the KWIK algorithm class

A

Knows What It Knows (KWIK) algorithms track what they have learned: they give an affirmative answer only when they know it will be correct, and otherwise output “I don’t know” and request the correct answer (to learn from). A minimal sketch is below.
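A minimal, hypothetical sketch of a KWIK-style learner for a deterministic mapping over a finite input set (the memorization idea; class and function names here are made up): it answers only from exact memory, so every affirmative answer is guaranteed correct, and the number of “I don’t know” responses is bounded by the number of distinct inputs.

```python
DONT_KNOW = None  # the KWIK "I don't know" output (often written as ⊥)

class MemorizingKWIK:
    """Learns a deterministic mapping by memorizing observed (input, label) pairs."""

    def __init__(self):
        self.memory = {}

    def predict(self, x):
        # Answer affirmatively only when we are certain to be right.
        return self.memory.get(x, DONT_KNOW)

    def observe(self, x, y):
        # Called with the true label after a DONT_KNOW prediction.
        self.memory[x] = y

# Usage: the learner is never wrong, and says DONT_KNOW at most once per distinct input.
learner = MemorizingKWIK()
true_labels = {"a": 1, "b": 0}
for x in ["a", "b", "a", "b"]:
    guess = learner.predict(x)
    if guess is DONT_KNOW:
        learner.observe(x, true_labels[x])
        print(x, "-> don't know (requesting the answer)")
    else:
        print(x, "->", guess)
```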