Two main challenges:
- Large number of samples required
  - The variance of the gradient estimator scales unfavorably with the time horizon
  - Use value functions to reduce the variance, at the cost of some bias
- Difficulty of obtaining stable and steady improvement despite the non-stationarity of the incoming data
  - Use trust region optimization for both the policy and the value function
Proposals:
- GAE as an effective variance reduction scheme
- Trust region optimization method for the value function
Preliminaries
The goal in the problem formulation is to maximize the expected total reward $\sum_{t=0}^{\infty} r_t$.
Policy gradient methods maximize this objective by repeatedly estimating the gradient
$$g = \mathbb{E}\left[\sum_{t=0}^{\infty} \psi_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
where $\psi_t$ can be one of the following:

| Function | Description |
| --- | --- |
| $\sum_{t=0}^{\infty} r_t$ | Total reward of the trajectory |
| $\sum_{t'=t}^{\infty} r_{t'}$ | Reward following the action at time $t$ |
| $\sum_{t'=t}^{\infty} r_{t'} - b(s_t)$ | Reward following $t$, minus a baseline |
| $Q^{\pi}(s_t, a_t)$ | State-action value function at $t$ |
| $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$ | Advantage function at $t$ |
| $r_t + V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$ | TD residual |
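To make the table concrete, here is a minimal numpy sketch computing a few of the $\psi_t$ choices on a single trajectory. The rewards and value estimates are toy numbers of my own, not from the paper:

```python
import numpy as np

r = np.array([1.0, 0.0, 2.0, 1.0])        # toy rewards r_0..r_3
V = np.array([0.5, 0.4, 1.2, 0.3, 0.0])   # toy value estimates V(s_0)..V(s_4)

total_reward = r.sum()                     # psi_t = sum_t r_t (same value for every t)
reward_to_go = np.cumsum(r[::-1])[::-1]    # psi_t = sum_{t' >= t} r_{t'}
baselined    = reward_to_go - V[:-1]       # subtract a state baseline, here b(s_t) = V(s_t)
td_residual  = r + V[1:] - V[:-1]          # psi_t = r_t + V(s_{t+1}) - V(s_t)

print(total_reward, reward_to_go, baselined, td_residual)
```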
Using the advantage function yields almost the lowest possible variance. Intuitively, a step in the policy gradient direction should increase the probability of better-than-average actions and decrease the probability of worse-than-average actions.
See Greensmith et al. (2004) for an analysis of the variance of these estimators.
This paper introduces $\gamma$ to reduce variance by down-weighting rewards in the future at the cost of introducing bias. This is similar to the discount factor but is treated here as a variance reduction parameter.
The new policy gradient estimate becomes:
$$g^{\gamma} = \mathbb{E}\left[\sum_{t=0}^{\infty} A^{\pi,\gamma}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
Advantage Function Estimation
Assume an estimate $\hat{A}_t$ of the discounted advantage function $A^{\pi, \gamma}$. The empirical policy gradient estimate over a batch of episodes is then
$$\hat{g} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{\infty} \hat{A}_t^n \nabla_\theta \log \pi_\theta(a_t^n \mid s_t^n)$$
where $N$ is the number of episodes and the superscript $n$ indexes the episode.
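In practice $\hat{g}$ is usually obtained by differentiating a surrogate loss whose gradient matches the estimator. A minimal PyTorch-style sketch, assuming a small categorical policy and toy data (none of these choices come from the paper):

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2                                   # assumed toy sizes
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))

# Flattened batch of timesteps gathered from N episodes (all toy data here).
states     = torch.randn(128, obs_dim)
actions    = torch.randint(0, n_actions, (128,))
advantages = torch.randn(128)                               # stand-in for the A_hat_t^n values

log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
surrogate = -(log_probs * advantages).mean()  # gradient matches -g_hat up to a constant factor
surrogate.backward()                          # gradients now populate policy.parameters()
```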
TD residual of the value function with discount factor $\gamma$: $\delta^V_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
This is an unbiased estimate of $A^{\pi, \gamma}$ if the value function is exact, i.e. $V = V^{\pi,\gamma}$.
$$\hat{A}_t^{(1)} = \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$$
$$\begin{aligned}
\hat{A}_t^{(2)} &= \delta_t^V + \gamma \delta_{t+1}^V \\
&= r_t + \gamma V(s_{t+1}) - V(s_t) + \gamma\left(r_{t+1} + \gamma V(s_{t+2}) - V(s_{t+1})\right) \\
&= -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})
\end{aligned}$$
$$\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}^V = -V(s_t) + r_t + \dots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})$$
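A small sanity check, with made-up trajectory data, that the two forms of $\hat{A}_t^{(k)}$ above agree:

```python
import numpy as np

gamma, k, t = 0.99, 3, 0                       # toy settings
r = np.array([1.0, 0.0, 2.0, 1.0, 0.5])        # toy rewards
V = np.array([0.5, 0.4, 1.2, 0.3, 0.8, 0.0])   # toy value estimates V(s_0)..V(s_5)

delta = r + gamma * V[1:] - V[:-1]             # TD residuals delta_t^V

# Form 1: sum_{l=0}^{k-1} gamma^l * delta_{t+l}
form1 = sum(gamma**l * delta[t + l] for l in range(k))
# Form 2: -V(s_t) + r_t + ... + gamma^{k-1} r_{t+k-1} + gamma^k V(s_{t+k})
form2 = -V[t] + sum(gamma**l * r[t + l] for l in range(k)) + gamma**k * V[t + k]

assert np.isclose(form1, form2)
```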
The bias becomes smaller as $k \rightarrow \infty$, since the biased term $\gamma^k V(s_{t+k})$ is discounted heavily.
The generalized advantage estimator $GAE(\gamma, \lambda)$ is defined as the exponentially-weighted average of these $k$-step estimators:
$$\begin{aligned}
\hat{A}_t^{GAE(\gamma,\lambda)} &= (1-\lambda)\hat{A}_t^{(1)} + \lambda\hat{A}_t^{GAE_2} \\
&= (1-\lambda)\hat{A}_t^{(1)} + \lambda\left((1-\lambda)\hat{A}_t^{(2)} + \lambda\hat{A}_t^{GAE_3}\right) \\
&= (1-\lambda)\hat{A}_t^{(1)} + \lambda(1-\lambda)\hat{A}_t^{(2)} + \lambda^2(1-\lambda)\hat{A}_t^{(3)} + \dots \\
&= (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda\hat{A}_t^{(2)} + \lambda^2\hat{A}_t^{(3)} + \dots\right)
\end{aligned}$$
where $\hat{A}_t^{GAE_k}$ denotes the same exponentially-weighted average built from the estimators $\hat{A}_t^{(k)}, \hat{A}_t^{(k+1)}, \dots$
Now, if we expand the $k$-step estimators we get:
$$\begin{aligned}
\hat{A}_t^{GAE(\gamma,\lambda)} &= (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda\hat{A}_t^{(2)} + \lambda^2\hat{A}_t^{(3)} + \dots\right) \\
&= (1-\lambda)\left(\delta_t^V + \lambda\left(\delta_t^V + \gamma\delta_{t+1}^V\right) + \lambda^2\left(\delta_t^V + \gamma\delta_{t+1}^V + \gamma^2\delta_{t+2}^V\right) + \dots\right) \\
&= (1-\lambda)\left(\delta_t^V\left(1 + \lambda + \lambda^2 + \dots\right) + \gamma\delta_{t+1}^V\left(\lambda + \lambda^2 + \lambda^3 + \dots\right) + \gamma^2\delta_{t+2}^V\left(\lambda^2 + \lambda^3 + \dots\right) + \dots\right) \\
&= (1-\lambda)\left(\delta_t^V\,\frac{1}{1-\lambda} + \gamma\delta_{t+1}^V\,\frac{\lambda}{1-\lambda} + \gamma^2\delta_{t+2}^V\,\frac{\lambda^2}{1-\lambda} + \dots\right) \\
&= \delta_t^V + \gamma\lambda\,\delta_{t+1}^V + (\gamma\lambda)^2\,\delta_{t+2}^V + \dots \\
&= \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^V
\end{aligned}$$
$$\begin{aligned}
\hat{A}_t^{GAE(\gamma,\lambda)} &= \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^V = \sum_{l=0}^{\infty} (\gamma\lambda)^l \left(r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l})\right) \\
&= (\gamma\lambda)^0\left[r_t + \gamma V(s_{t+1}) - V(s_t)\right] + (\gamma\lambda)^1\left[r_{t+1} + \gamma V(s_{t+2}) - V(s_{t+1})\right] + (\gamma\lambda)^2\left[r_{t+2} + \gamma V(s_{t+3}) - V(s_{t+2})\right] + \dots
\end{aligned}$$
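On a finite trajectory this sum is typically computed with a backward recursion $\hat{A}_t = \delta_t^V + \gamma\lambda\,\hat{A}_{t+1}$. A minimal sketch, assuming toy data and my own naming (not the paper's reference code):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """values holds V(s_0)..V(s_T), one more entry than rewards, so delta_t exists at the last step."""
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running   # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages[t] = running
    return advantages

rewards = np.array([1.0, 0.0, 2.0, 1.0])       # toy data
values  = np.array([0.5, 0.4, 1.2, 0.3, 0.0])  # V(s_0)..V(s_4); V(s_4)=0 treats the episode as terminal
print(gae(rewards, values))
```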
Reward Shaping Interpretation
TODO: check out Andrew Ng's paper on reward shaping (Ng, Harada & Russell, 1999, policy invariance under reward transformations)
Reward shaping refers to the following transformation of the reward function, where $\phi$ is an arbitrary potential function over states:
$$\tilde{r}(s, a, s') = r(s, a, s') + \gamma\phi(s') - \phi(s)$$
The discounted sum of shaped rewards along a trajectory starting at $s_t$ then satisfies (the intermediate $\phi$ terms telescope):
$$\sum_{l=0}^{\infty} \gamma^l \tilde{r}(s_{t+l}, a_{t+l}, s_{t+l+1}) = \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}, a_{t+l}, s_{t+l+1}) - \phi(s_t)$$
TODO: add the full expansion and how the term gets cancelled
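In the meantime, a quick numerical check (arbitrary toy $\phi$ and rewards, chosen here only for illustration) that the intermediate $\phi$ terms cancel:

```python
import numpy as np

gamma = 0.9
r   = np.array([1.0, 0.0, 2.0, 1.0])        # toy rewards
phi = np.array([0.3, 0.7, 0.1, 0.5, 0.0])   # toy phi(s_0)..phi(s_4); phi at the terminal state is 0

r_shaped = r + gamma * phi[1:] - phi[:-1]   # shaped rewards r_tilde

lhs = sum(gamma**l * r_shaped[l] for l in range(len(r)))
rhs = sum(gamma**l * r[l] for l in range(len(r))) - phi[0]
assert np.isclose(lhs, rhs)                 # intermediate gamma^l * phi(s_{t+l}) terms telescope away
```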