Reinforcement Learning Flashcards

1
Q

Value Function

A

Value function: expected return (cumulative discounted reward) starting from state s

v(s) = E[R_t + g R_{t+1} + g^2 R_{t+2} + … | S_t = s]

v(s) = E[R_t + g v(S_{t+1}) | S_t = s]  (Bellman equation)
     = R_s + g \sum_{s'} P_{ss'} v(s')

State-value function: the same expectation when actions are chosen by a fixed policy pi:

v_pi(s) = E[R_t + g R_{t+1} + g^2 R_{t+2} + … | S_t = s, pi]
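Since the Bellman equation is linear in v, it can be solved in closed form, v = (I - g P)^{-1} R. A minimal sketch for a hypothetical 3-state Markov reward process (the transition matrix and rewards below are made up for illustration):

```python
import numpy as np

# Hypothetical 3-state Markov reward process (illustration only)
P = np.array([[0.5, 0.5, 0.0],   # P[s, s'] = transition probability s -> s'
              [0.1, 0.6, 0.3],
              [0.0, 0.0, 1.0]])  # state 2 is absorbing
R = np.array([1.0, 2.0, 0.0])    # expected immediate reward in each state
g = 0.9                          # discount factor

# Bellman equation v = R + g P v  <=>  (I - g P) v = R
v = np.linalg.solve(np.eye(3) - g * P, R)
print(v)
```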

2
Q

Q-function (action-value function)

A

Q_pi(s, a) = E[R_0 + g R_1 + g^2 R_2 + … | S_0 = s, A_0 = a, pi], i.e., the expected return starting from state s, taking initial action a first, and then following policy pi
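Given a model, Q_pi follows from v_pi in a single one-step backup, Q_pi(s, a) = R(s, a) + g \sum_{s'} P(s' | s, a) v_pi(s'). A small sketch (the model and the state values are hypothetical, assumed already known):

```python
import numpy as np

# Hypothetical model: 2 states, 2 actions (illustration only)
P = np.array([[[0.8, 0.2],   # P[s, a, s'] = transition probabilities
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.9, 0.1]]])
R = np.array([[1.0, 0.0],    # R[s, a] = expected immediate reward
              [0.0, 2.0]])
g = 0.9
v_pi = np.array([3.0, 4.0])  # state values under some policy pi (assumed given)

# One-step backup: Q_pi(s, a) = R(s, a) + g * sum_{s'} P(s'|s,a) v_pi(s')
Q_pi = R + g * P @ v_pi
print(Q_pi)                  # shape (2, 2): one value per (state, action) pair
```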

3
Q

Bellman’s recursion

A

Value of a state = reward at that state + discounted expected value of the successor states, averaged over the transition probabilities induced by pi

v_pi(s) = R(s) + g \sum_{s'} P_pi(s' | s) v_pi(s')
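The recursion can also be applied repeatedly as an update until the values stop changing (iterative policy evaluation). A sketch, reusing the hypothetical 3-state process from the value-function card:

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],   # hypothetical P_pi[s, s'] under policy pi
              [0.1, 0.6, 0.3],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])    # expected immediate rewards
g = 0.9

v = np.zeros(3)                  # any initialization works
for _ in range(1000):
    v_new = R + g * P @ v        # apply Bellman's recursion as an update
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(v)                         # matches the closed-form solve
```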

4
Q

Value Iteration

A

Repeatedly apply the Bellman optimality backup

v_{k+1}(s) = max_a [ R(s, a) + g \sum_{s'} P(s' | s, a) v_k(s') ]

starting from an arbitrary v_0. The iterates converge to the optimal v* because the backup is a g-contraction (see the Banach fixed point card).
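A minimal sketch over the hypothetical 2-state, 2-action model from the Q-function card:

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.2, 0.8]],    # hypothetical P[s, a, s']
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s, a]
g = 0.9

v = np.zeros(2)
for _ in range(1000):
    v_new = np.max(R + g * P @ v, axis=1)  # Bellman optimality backup
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
pi = np.argmax(R + g * P @ v, axis=1)      # greedy policy w.r.t. v*
print(v, pi)
```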
5
Q

Banach Fixed Point Theorem

A

A contraction mapping on a complete metric space has a unique fixed point, and iterating the mapping from any starting point converges to that fixed point.

Relevance to RL: the Bellman (optimality) backup T is a g-contraction in the max-norm, ||T v - T u||_inf <= g ||v - u||_inf for g < 1, so value iteration converges to the unique fixed point v* from any initialization.
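A quick numerical illustration (not a proof) of the contraction property, using the hypothetical 2-state model from the neighboring cards:

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.2, 0.8]],   # hypothetical P[s, a, s']
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
g = 0.9

def T(v):                                  # Bellman optimality backup
    return np.max(R + g * P @ v, axis=1)

rng = np.random.default_rng(0)
for _ in range(5):
    v, u = rng.normal(size=2), rng.normal(size=2)
    ratio = np.max(np.abs(T(v) - T(u))) / np.max(np.abs(v - u))
    print(round(ratio, 3), "<=", g)        # the ratio never exceeds g
```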
6
Q

Bellman’s optimality for the optimal value function (v*) and policy (pi*)

A

The optimal value function comes from acting greedily with respect to the optimal Q-function:

v*(s) = max_a Q*(s, a)

The optimal policy takes the action that maximizes Q*:

pi*(s) = argmax_a Q*(s, a)

But how is Q* obtained? It satisfies the Bellman optimality equation

Q*(s, a) = R(s, a) + g \sum_{s'} P(s' | s, a) max_{a'} Q*(s', a')

which can be solved by value iteration when a model is available, or learned model-free, e.g., by Q-learning.
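A minimal tabular Q-learning step, sketching how Q* can be learned from sampled transitions without a model (the sizes, hyperparameters, and the example transition are all placeholders):

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, g = 0.1, 0.9                 # learning rate and discount (assumed values)

def q_learning_update(Q, s, a, r, s_next, done):
    """Move Q(s, a) toward the target r + g * max_a' Q(s', a')."""
    target = r + (0.0 if done else g * Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])

# e.g., after observing transition (s=0, a=1, r=1.0, s'=2, non-terminal):
q_learning_update(Q, 0, 1, 1.0, 2, done=False)
print(Q[0])
```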

7
Q

Policy

A

A mapping from states to actions: deterministic, a = pi(s), or stochastic, pi(a | s). It defines the agent’s behavior.

8
Q

Goal of RL

A

Find a policy that maximizes the expected cumulative (discounted) reward.

9
Q

Monte Carlo Tree Search

A

Tree: nodes are states; edges are actions leading to successor states.

Search algorithm (one iteration, repeated many times):
SELECT: descend the tree using the current tree policy (e.g., UCB) until reaching a node that still has untried actions
EXPAND: add a child node for one untried action
SIMULATE: run a rollout from the new node to the end of the episode, e.g., with random actions
BACKUP: propagate the rollout’s return up the tree, updating the Q(s, a) and visit statistics of every ancestor

SELECT, EXPAND, SIMULATE, BACKUP, repeat — see the sketch below.
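A compact, self-contained UCT-style sketch on a made-up deterministic single-player game (the game, horizon, iteration count, and exploration constant are all hypothetical, chosen only to make the four phases concrete):

```python
import math, random

# Hypothetical game (illustration only): start at position 0; on each of
# 5 turns move -1 or +1; reward 1 if you finish at +5, else 0.
ACTIONS = (-1, +1)
HORIZON = 5

def step(state, action):
    pos, t = state
    return (pos + action, t + 1)

def is_terminal(state):
    return state[1] == HORIZON

def reward(state):
    return 1.0 if state[0] == HORIZON else 0.0

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}            # action -> child Node
        self.visits, self.value = 0, 0.0

    def ucb1(self, c=1.4):            # tree-policy score used by SELECT
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root_state, iterations=500):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # SELECT: descend via UCB1 while the node is fully expanded
        while not is_terminal(node.state) and len(node.children) == len(ACTIONS):
            node = max(node.children.values(), key=Node.ucb1)
        # EXPAND: add a child for one untried action
        if not is_terminal(node.state):
            a = random.choice([a for a in ACTIONS if a not in node.children])
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        # SIMULATE: random rollout from the new node to the end of the game
        state = node.state
        while not is_terminal(state):
            state = step(state, random.choice(ACTIONS))
        result = reward(state)
        # BACKUP: propagate the rollout result to all ancestors
        while node is not None:
            node.visits += 1
            node.value += result
            node = node.parent
    # act on the root action with the most visits
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print(mcts((0, 0)))  # expected to print 1: always moving right reaches the goal
```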

10
Q

Nucleus Sampling (Top-p Sampling)

A

Sample from the smallest set of tokens whose probability masses sum to at least p, after renormalizing the probabilities within that set.
e.g., word1 = 0.6, word2 = 0.3, word3 = 0.05, …
If p = 0.9, sample only between words #1 and #2 (0.6 + 0.3 >= 0.9); the low-probability tail is cut off.
Used, e.g., for open-ended text generation.
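A minimal numpy sketch (the probabilities are made-up stand-ins for a real model’s next-token distribution):

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=np.random.default_rng()):
    """Sample from the smallest token set whose total probability >= p."""
    order = np.argsort(probs)[::-1]                       # most probable first
    sorted_p = np.asarray(probs)[order]
    cutoff = np.searchsorted(np.cumsum(sorted_p), p) + 1  # size of the nucleus
    nucleus = order[:cutoff]
    renorm = sorted_p[:cutoff] / sorted_p[:cutoff].sum()  # renormalize inside it
    return rng.choice(nucleus, p=renorm)

# e.g., probs = [0.5, 0.4, 0.06, 0.04], p = 0.9:
# only tokens 0 and 1 can ever be drawn
print(nucleus_sample([0.5, 0.4, 0.06, 0.04], p=0.9))
```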

11
Q

Beam Search

A

Given a beam width w, keep the top-w candidate sequences (initially single tokens). At each step, predict the next token for every candidate conditioned on its tokens so far, score each extension by the joint probability of the whole sequence, keep only the top-w extensions, and repeat until the sequences are complete.

Say y_2 is the next token to be predicted given y_1.

Feed y_1 into the network to obtain p(y_2 | x, y_1), and combine it with p(y_1 | x) to get the joint conditional:

p(y_1, y_2 | x) = p(y_1 | x) p(y_2 | x, y_1)

Parameter: beam width.
Used, e.g., for translation or transcription.
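A compact sketch; in practice the products of probabilities are accumulated as sums of log-probabilities to avoid underflow, as below. The “model” here is a made-up stand-in for a real network’s p(y_t | x, y_1..y_{t-1}):

```python
import math

def beam_search(next_token_logprobs, vocab, width=3, length=4, bos="<s>"):
    beams = [([bos], 0.0)]                   # (sequence, cumulative log-prob)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            logps = next_token_logprobs(seq) # dict: token -> log p(token | seq)
            for tok in vocab:
                # log p(y_1..y_t | x) = log p(y_1..y_{t-1} | x) + log p(y_t | ...)
                candidates.append((seq + [tok], score + logps[tok]))
        # keep only the top-`width` partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams

# Toy "model": prefers repeating the previous token (purely illustrative)
def toy_logprobs(seq):
    vocab = ["a", "b"]
    return {t: math.log(0.8 if t == seq[-1] else 0.2) if seq[-1] in vocab
            else math.log(0.5) for t in vocab}

for seq, score in beam_search(toy_logprobs, ["a", "b"], width=2, length=3):
    print(" ".join(seq), round(score, 3))
```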

12
Q

Discount factor

A

Return: G_t = R_t + g R_{t+1} + g^2 R_{t+2} + …
g near 0 → short-sighted (only immediate rewards matter)
g near 1 → far-sighted, i.e., higher weight on future rewards
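A tiny numeric illustration (the reward stream is made up): the same ten rewards of 1 produce very different returns at the two extremes:

```python
rewards = [1.0] * 10
for g in (0.1, 0.9):
    G = sum(g**t * r for t, r in enumerate(rewards))  # G_t = sum_k g^k R_{t+k}
    print(f"g = {g}: return = {G:.2f}")
# g = 0.1 -> 1.11 (myopic); g = 0.9 -> 6.51 (far-sighted)
```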

13
Q

Value function vs State-value function vs action-value function

A

v(s): expected return starting from state s; the generic recursive definition, with no policy specified.
v_pi(s) (state-value function): expected return from s when actions are chosen by policy pi.
Q_pi(s, a) (action-value function): expected return from s when the first action is a and pi is followed thereafter; v_pi(s) is recovered by averaging Q_pi(s, a) over pi’s action choices.