Week 8 - reinforcement learning Flashcards
(7 cards)
formula for qt(A)
the estimate of the average reward of action A in a k-armed bandit:
Qt(A) = (sum of rewards received when A was taken before t) / (number of times A was taken before t)
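A minimal sketch of this sample-average estimate (the list name and default value of 0 for an untried action are assumptions, not from the card):

```python
# Sample-average estimate Qt(A).
# rewards_for_A holds the rewards received each time A was taken before step t.
def q_estimate(rewards_for_A):
    if not rewards_for_A:            # A never taken yet: fall back to a default of 0
        return 0.0
    return sum(rewards_for_A) / len(rewards_for_A)

print(q_estimate([1.0, 0.0, 2.0]))   # -> 1.0
```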
exploitation vs exploration
Exploitation means:
“I know a₂ gives good rewards, so I’ll keep using it.”
Exploration means:
“Maybe a₁ just had bad luck before. If I try it again, it might actually be better than a₂.”
just read
The challenge is balancing the two:
If you exploit too early, you might miss better options.
If you explore too much, you waste time on suboptimal choices.
how greedy method works
so essentially the greedy method:
at the beginning it initialises every action's value estimate to 0
if there's a tie for the maximum estimate, it randomly chooses between the tied actions
since every estimate is 0 at the beginning (everything ties), it randomly picks one of the actions
if the reward pushes that action's estimate above 0, it keeps exploiting it and won't stop picking that action until (by chance) its estimate drops back below the estimates of the other actions (which are still 0)
Therefore greedy has a bad balance - it almost always exploits and does little to no exploration (a code sketch follows below)
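A short sketch of the greedy method as described above; the reward function `bandit(a)` and the incremental update are assumptions for illustration, not part of the card:

```python
import random

def greedy_run(bandit, n_actions, steps):
    """Greedy sketch: estimates start at 0, pick the argmax, break ties randomly."""
    Q = [0.0] * n_actions              # value estimates, all initialised to 0
    N = [0] * n_actions                # how many times each action was taken
    for _ in range(steps):
        best = max(Q)
        a = random.choice([i for i, q in enumerate(Q) if q == best])  # random tie-break
        r = bandit(a)                  # assumed: returns a reward for action a
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # incremental sample-average update
    return Q, N
```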
how epsilon greedy works
how epsilon greedy works:
it has the same initial setup as greedy, but this time:
with probability ε (epsilon) it chooses uniformly at random from among all the actions (exploration)
with probability 1 - ε it behaves greedily (exploitation)
There is now a balance between exploration and exploitation, meaning you are eventually likely to find the optimal action (sketch below)
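A minimal epsilon-greedy sketch under the same assumptions as the greedy sketch (hypothetical `bandit(a)` reward function):

```python
import random

def epsilon_greedy_run(bandit, n_actions, steps, epsilon=0.1):
    """Epsilon-greedy sketch: explore with probability epsilon, exploit otherwise."""
    Q = [0.0] * n_actions
    N = [0] * n_actions
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(n_actions)                                # explore
        else:
            best = max(Q)
            a = random.choice([i for i, q in enumerate(Q) if q == best])  # exploit
        r = bandit(a)                  # assumed reward function
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # incremental sample-average update
    return Q, N
```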
UCB formula
At = arg max_a [ Qt(a) + c * sqrt( ln t / Nt(a) ) ]
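A minimal sketch of this selection rule (the function name, the choice of c, and taking untried actions first are assumptions for illustration):

```python
import math

def ucb_select(Q, N, t, c=2.0):
    """Pick argmax_a [ Q[a] + c * sqrt(ln t / N[a]) ]; t is the current time step."""
    untried = [a for a, n in enumerate(N) if n == 0]
    if untried:
        return untried[0]              # N[a] = 0 makes the bound infinite, so try it
    scores = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(len(Q))]
    return scores.index(max(scores))
```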
just read
oh so essentially the thing with UCB: if an action looked bad, i.e. its estimate was low and it was ignored for a long time, its confidence bound would keep rising (ln t grows while Nt(a) stays fixed) until it looked worth trying again and we would test it
this way both UCB and ε-greedy give us exploration