Week 8 - reinforcement learning Flashcards

(7 cards)

1
Q

formula for Qt(a)
(the estimate of action a's average reward in a k-armed bandit)

A

Qt(a) = (sum of rewards received from a before t) / (number of times a was taken before t)
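As a minimal sketch of the formula above (the function name and list-of-rewards layout are illustrative, not from the notes):

```python
def sample_average(rewards):
    """Sample-average estimate Qt(a): the mean reward received from
    action a before time t. An untried action defaults to 0, a common
    convention in k-armed bandit setups."""
    if not rewards:
        return 0.0
    return sum(rewards) / len(rewards)
```

For example, `sample_average([1.0, 0.0, 2.0])` gives 1.0.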

2
Q

exploitation vs exploration

A

Exploitation means:
“I know a₂ gives good rewards, so I’ll keep using it.”

Exploration means:
“Maybe a₁ just had bad luck before. If I try it again, it might actually be better than a₂.”

3
Q

just read

A

The challenge is balancing the two:

If you exploit too early, you might miss better options.

If you explore too much, you waste time on suboptimal choices.

4
Q

how greedy method works

A

How the greedy method works:

At the beginning it initialises every action's estimate to 0.

If several actions tie for the maximum estimate, it chooses uniformly at random among those tied actions.

Since every estimate is 0 at the beginning (everything ties), the first pick is random.

If the reward it receives pushes that action's estimate above 0, it keeps exploiting that action and won't stop picking it until (by chance) the action's estimate falls below 0.

Therefore greedy has a bad balance: it just exploits and does little to no exploration.
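The selection step described above can be sketched as follows (a hypothetical helper, assuming estimates are kept in a list indexed by action):

```python
import random

def greedy_action(q_estimates):
    """Pick the action with the highest estimate,
    breaking ties uniformly at random among the tied actions."""
    best = max(q_estimates)
    tied = [a for a, q in enumerate(q_estimates) if q == best]
    return random.choice(tied)
```

With all estimates initialised to 0, every action ties, so the first call is a uniform random pick; once one estimate exceeds the rest, that action is chosen every time.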

5
Q

how epsilon greedy works

A

How epsilon-greedy works:

It has the same initial setup as greedy, but now:

with probability ε (epsilon), it chooses uniformly at random from among all the actions (exploration)

with probability 1 - ε, it behaves greedily (exploitation)

There is now a balance between exploration and exploitation, meaning you are eventually likely to find the optimal action.
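A minimal sketch of the rule above (names are illustrative; ε = 0.1 is just a common default, not from the notes):

```python
import random

def epsilon_greedy_action(q_estimates, epsilon=0.1):
    """With probability epsilon pick a uniformly random action (explore);
    otherwise pick greedily, breaking ties at random (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_estimates))
    best = max(q_estimates)
    return random.choice([a for a, q in enumerate(q_estimates) if q == best])
```

Setting ε = 0 recovers plain greedy; ε = 1 is pure random exploration.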

6
Q

UCB formula

A

At = argmax over a of [ Qt(a) + c * sqrt( ln t / Nt(a) ) ]

where Nt(a) is the number of times action a has been taken before t, and c > 0 controls the degree of exploration.
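The UCB rule can be sketched like this (a hedged illustration; the convention of treating untried actions as maximal is standard but the function itself is hypothetical):

```python
import math

def ucb_action(q_estimates, counts, t, c=2.0):
    """Pick argmax over a of [ Qt(a) + c * sqrt(ln t / Nt(a)) ].
    An action with Nt(a) == 0 is treated as maximal and tried first."""
    best_a, best_score = 0, float("-inf")
    for a, (q, n) in enumerate(zip(q_estimates, counts)):
        if n == 0:
            return a  # untried action: its bound is considered infinite
        score = q + c * math.sqrt(math.log(t) / n)
        if score > best_score:
            best_a, best_score = a, score
    return best_a
```

Note how an action taken only once gets a much larger bonus term than one taken ten times, so equal estimates favour the less-tried action.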

7
Q

just read

A

The key idea with UCB: if an action looked bad (its estimate was low) and it was ignored for a long time, its confidence bound keeps rising until we try it again and test it.

This is how UCB explores, just as ε-greedy explores through its random picks.
