Two main challenges:
- Large number of samples required
  - The variance of the gradient estimator scales unfavorably with the time horizon
  - Use value functions to reduce the variance, at the cost of some bias
- Difficulty of obtaining stable and steady improvement despite the non-stationarity of the incoming data
  - Use trust region optimization for both the policy and the value function
Proposals:
- GAE as an effective variance reduction scheme
- Trust region optimization method for the value function
Preliminaries
The goal in the problem formulation is to maximize the expected total reward $\sum_{t=0}^{\infty} r_t$.
Policy gradient methods maximize this objective by repeatedly estimating the gradient
$$g = \mathbb{E}\left[\sum_{t=0}^{\infty} \psi_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
where $\psi_t$ can be one of the following:

| Function | Description |
| --- | --- |
| $\sum_{t=0}^{\infty} r_t$ | Total reward of the trajectory |
| $\sum_{t'=t}^{\infty} r_{t'}$ | Reward following the action at time $t$ |
| $\sum_{t'=t}^{\infty} r_{t'} - b(s_t)$ | Reward following $t$, minus a baseline |
| $Q^{\pi}(s_t, a_t)$ | State-action value function at $t$ |
| $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$ | Advantage function at $t$ |
| $r_t + V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$ | TD residual |
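To make the table concrete, here is a minimal numpy sketch computing a few of the $\psi_t$ choices on a single trajectory. The rewards and value estimates are toy numbers of my own, not from the paper:

```python
import numpy as np

r = np.array([1.0, 0.0, 2.0, 1.0])        # toy rewards r_0..r_3
V = np.array([0.5, 0.4, 1.2, 0.3, 0.0])   # toy value estimates V(s_0)..V(s_4)

total_reward = r.sum()                     # psi_t = sum_t r_t (same value for every t)
reward_to_go = np.cumsum(r[::-1])[::-1]    # psi_t = sum_{t' >= t} r_{t'}
baselined    = reward_to_go - V[:-1]       # subtract a state baseline, here b(s_t) = V(s_t)
td_residual  = r + V[1:] - V[:-1]          # psi_t = r_t + V(s_{t+1}) - V(s_t)

print(total_reward, reward_to_go, baselined, td_residual)
```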
Using the advantage function yields almost the lowest possible variance. Intuitively, a step in the policy gradient direction should increase the probability of better-than-average actions and decrease the probability of worse-than-average actions.
See Greensmith et al. (2004) for an analysis of the variance of these estimators.
This paper introduces $\gamma$ to reduce variance by down-weighting rewards in the future at the cost of introducing bias. This is similar to the discount factor but is treated here as a variance reduction parameter.
The new policy gradient estimate becomes:
$$g^{\gamma} = \mathbb{E}\left[\sum_{t=0}^{\infty} A^{\pi,\gamma}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
Advantage Function Estimation
Assume an estimate $\hat{A}_t$ of the discounted advantage function $A^{\pi, \gamma}$. The empirical policy gradient estimate over a batch of episodes is then
$$\hat{g} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{\infty} \hat{A}_t^n \nabla_\theta \log \pi_\theta(a_t^n \mid s_t^n)$$
where $N$ is the number of episodes and the superscript $n$ indexes the episode.
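In practice $\hat{g}$ is usually obtained by differentiating a surrogate loss whose gradient matches the estimator. A minimal PyTorch-style sketch, assuming a small categorical policy and toy data (none of these choices come from the paper):

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2                                   # assumed toy sizes
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))

# Flattened batch of timesteps gathered from N episodes (all toy data here).
states     = torch.randn(128, obs_dim)
actions    = torch.randint(0, n_actions, (128,))
advantages = torch.randn(128)                               # stand-in for the A_hat_t^n values

log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
surrogate = -(log_probs * advantages).mean()  # gradient matches -g_hat up to a constant factor
surrogate.backward()                          # gradients now populate policy.parameters()
```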
TD residual of the value function with discount factor $\gamma$: $\delta^V_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
This is an unbiased estimate of $A^{\pi, \gamma}$ if the value function is exact, i.e. $V = V^{\pi,\gamma}$.
$$\hat{A}_t^{(1)} = \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$$
$$\begin{aligned}
\hat{A}_t^{(2)} &= \delta_t^V + \gamma \delta_{t+1}^V \\
&= r_t + \gamma V(s_{t+1}) - V(s_t) + \gamma\left(r_{t+1} + \gamma V(s_{t+2}) - V(s_{t+1})\right) \\
&= -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})
\end{aligned}$$
$$\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}^V = -V(s_t) + r_t + \dots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})$$
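A small sanity check, with made-up trajectory data, that the two forms of $\hat{A}_t^{(k)}$ above agree:

```python
import numpy as np

gamma, k, t = 0.99, 3, 0                       # toy settings
r = np.array([1.0, 0.0, 2.0, 1.0, 0.5])        # toy rewards
V = np.array([0.5, 0.4, 1.2, 0.3, 0.8, 0.0])   # toy value estimates V(s_0)..V(s_5)

delta = r + gamma * V[1:] - V[:-1]             # TD residuals delta_t^V

# Form 1: sum_{l=0}^{k-1} gamma^l * delta_{t+l}
form1 = sum(gamma**l * delta[t + l] for l in range(k))
# Form 2: -V(s_t) + r_t + ... + gamma^{k-1} r_{t+k-1} + gamma^k V(s_{t+k})
form2 = -V[t] + sum(gamma**l * r[t + l] for l in range(k)) + gamma**k * V[t + k]

assert np.isclose(form1, form2)
```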
The bias becomes smaller as $k \rightarrow \infty$, since the biased term $\gamma^k V(s_{t+k})$ is discounted heavily.
The generalized advantage estimator $GAE(\gamma, \lambda)$ is defined as the exponentially-weighted average of these $k$-step estimators:
$$\begin{aligned}
\hat{A}_t^{GAE(\gamma,\lambda)} &= (1-\lambda)\hat{A}_t^{(1)} + \lambda\hat{A}_t^{GAE_2} \\
&= (1-\lambda)\hat{A}_t^{(1)} + \lambda\left((1-\lambda)\hat{A}_t^{(2)} + \lambda\hat{A}_t^{GAE_3}\right) \\
&= (1-\lambda)\hat{A}_t^{(1)} + \lambda(1-\lambda)\hat{A}_t^{(2)} + \lambda^2(1-\lambda)\hat{A}_t^{(3)} + \dots \\
&= (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda\hat{A}_t^{(2)} + \lambda^2\hat{A}_t^{(3)} + \dots\right)
\end{aligned}$$
where $\hat{A}_t^{GAE_k}$ denotes the same exponentially-weighted average built from the estimators $\hat{A}_t^{(k)}, \hat{A}_t^{(k+1)}, \dots$
Now, if we expand the $k$-step estimators we get:
$$\begin{aligned}
\hat{A}_t^{GAE(\gamma,\lambda)} &= (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda\hat{A}_t^{(2)} + \lambda^2\hat{A}_t^{(3)} + \dots\right) \\
&= (1-\lambda)\left(\delta_t^V + \lambda\left(\delta_t^V + \gamma\delta_{t+1}^V\right) + \lambda^2\left(\delta_t^V + \gamma\delta_{t+1}^V + \gamma^2\delta_{t+2}^V\right) + \dots\right) \\
&= (1-\lambda)\left(\delta_t^V\left(1 + \lambda + \lambda^2 + \dots\right) + \gamma\delta_{t+1}^V\left(\lambda + \lambda^2 + \lambda^3 + \dots\right) + \gamma^2\delta_{t+2}^V\left(\lambda^2 + \lambda^3 + \dots\right) + \dots\right) \\
&= (1-\lambda)\left(\delta_t^V\,\frac{1}{1-\lambda} + \gamma\delta_{t+1}^V\,\frac{\lambda}{1-\lambda} + \gamma^2\delta_{t+2}^V\,\frac{\lambda^2}{1-\lambda} + \dots\right) \\
&= \delta_t^V + \gamma\lambda\,\delta_{t+1}^V + (\gamma\lambda)^2\,\delta_{t+2}^V + \dots \\
&= \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^V
\end{aligned}$$
$$\begin{aligned}
\hat{A}_t^{GAE(\gamma,\lambda)} &= \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^V = \sum_{l=0}^{\infty} (\gamma\lambda)^l \left(r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l})\right) \\
&= (\gamma\lambda)^0\left[r_t + \gamma V(s_{t+1}) - V(s_t)\right] + (\gamma\lambda)^1\left[r_{t+1} + \gamma V(s_{t+2}) - V(s_{t+1})\right] + (\gamma\lambda)^2\left[r_{t+2} + \gamma V(s_{t+3}) - V(s_{t+2})\right] + \dots
\end{aligned}$$
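On a finite trajectory this sum is typically computed with a backward recursion $\hat{A}_t = \delta_t^V + \gamma\lambda\,\hat{A}_{t+1}$. A minimal sketch, assuming toy data and my own naming (not the paper's reference code):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """values holds V(s_0)..V(s_T), one more entry than rewards, so delta_t exists at the last step."""
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running   # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages[t] = running
    return advantages

rewards = np.array([1.0, 0.0, 2.0, 1.0])       # toy data
values  = np.array([0.5, 0.4, 1.2, 0.3, 0.0])  # V(s_0)..V(s_4); V(s_4)=0 treats the episode as terminal
print(gae(rewards, values))
```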
Reward Shaping Interpretation
TODO: check out Andrew Ng's paper on reward shaping (Ng, Harada & Russell, 1999, policy invariance under reward transformations)
Reward shaping refers to the following transformation of the reward function, where $\phi$ is an arbitrary potential function over states:
$$\tilde{r}(s, a, s') = r(s, a, s') + \gamma\phi(s') - \phi(s)$$
The discounted sum of shaped rewards along a trajectory starting at $s_t$ then satisfies (the intermediate $\phi$ terms telescope):
$$\sum_{l=0}^{\infty} \gamma^l \tilde{r}(s_{t+l}, a_{t+l}, s_{t+l+1}) = \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}, a_{t+l}, s_{t+l+1}) - \phi(s_t)$$
TODO: add the full expansion and how the term gets cancelled
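In the meantime, a quick numerical check (arbitrary toy $\phi$ and rewards, chosen here only for illustration) that the intermediate $\phi$ terms cancel:

```python
import numpy as np

gamma = 0.9
r   = np.array([1.0, 0.0, 2.0, 1.0])        # toy rewards
phi = np.array([0.3, 0.7, 0.1, 0.5, 0.0])   # toy phi(s_0)..phi(s_4); phi at the terminal state is 0

r_shaped = r + gamma * phi[1:] - phi[:-1]   # shaped rewards r_tilde

lhs = sum(gamma**l * r_shaped[l] for l in range(len(r)))
rhs = sum(gamma**l * r[l] for l in range(len(r))) - phi[0]
assert np.isclose(lhs, rhs)                 # intermediate gamma^l * phi(s_{t+l}) terms telescope away
```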