Paper Notes: Asynchronous Advantage Actor-Critic (A3C)
Asynchronous Methods for Deep Reinforcement Learning
Contributions:
- Use of asynchronous actor-learners with lock-free parameter updates in the style of Hogwild!
- Learners running in parallel explore different parts of the environment, so their updates are less likely to be correlated
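A minimal sketch of the Hogwild!-style update pattern, assuming PyTorch; the toy linear model, fake batch, and learning rate are placeholders for illustration, not anything from the paper. Each worker keeps a local copy of the network, computes gradients on it, and writes them straight into the shared parameters without locking.

```python
import torch
import torch.multiprocessing as mp


def worker(shared_model, steps=100):
    # Local copy of the network; gradients are computed here and pushed
    # onto the shared parameters without any locking (Hogwild!).
    local_model = torch.nn.Linear(4, 2)
    opt = torch.optim.SGD(shared_model.parameters(), lr=1e-2)
    for _ in range(steps):
        local_model.load_state_dict(shared_model.state_dict())  # pull latest params
        local_model.zero_grad()
        x = torch.randn(8, 4)                       # fake batch (placeholder)
        loss = local_model(x).pow(2).mean()         # toy loss (placeholder)
        loss.backward()
        for lp, sp in zip(local_model.parameters(), shared_model.parameters()):
            sp.grad = lp.grad                       # push local grads to shared params
        opt.step()                                  # lock-free update


if __name__ == "__main__":
    shared = torch.nn.Linear(4, 2)
    shared.share_memory()  # place parameters in shared memory
    workers = [mp.Process(target=worker, args=(shared,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```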
Algorithm:
- Policy: $\pi(a_t \mid s_t;\theta)$
- Value function: $V(s_t;\theta_v)$
- The policy and value function share the same parameters, except for a softmax output layer for the policy and a linear output layer for the value function (see the sketch after this list)
- Adding the entropy of the policy to the objective is suggested to improve exploration; this idea was later developed further in maximum-entropy methods such as soft actor-critic
- Optimization methods compared:
- SGD with momentum
- RMSProp without shared statistics
- RMSProp with statistics shared across threads (found to be the most robust in the paper)
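A minimal sketch of the shared-trunk network (softmax policy head, linear value head) and the entropy-regularized policy loss mentioned above, assuming PyTorch; the layer sizes and the entropy weight `beta` are illustrative placeholders, not the paper's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        # Shared trunk; only the two output heads differ.
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # softmax output
        self.value_head = nn.Linear(hidden, 1)           # linear output

    def forward(self, obs):
        h = self.trunk(obs)
        return F.softmax(self.policy_head(h), dim=-1), self.value_head(h)


def policy_loss(probs, actions, advantages, beta=0.01):
    # Policy-gradient term log pi(a|s) * advantage plus an entropy bonus
    # that encourages exploration (beta is an assumed weight).
    log_probs = torch.log(probs + 1e-8)
    entropy = -(probs * log_probs).sum(dim=-1)
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return -(chosen * advantages.detach() + beta * entropy).mean()
```

The value head is trained separately with a squared-error loss against the n-step return computed in the algorithm below.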
- Algorithm in detail:
- Initialize global shared parameters: $\theta$ and $\theta_v$
- Initialize global shared counter: $T=0$
- Initialize thread step counter: $t \leftarrow 1$
- Repeat in every thread (until $T > T_{max}$):
- Initialize gradients: $d\theta \leftarrow 0$ and $d\theta_v \leftarrow 0$
- Synchronize thread parameters with global parameters: $\theta^\prime = \theta$ and $\theta_v^\prime = \theta_v$
- $t_{start} = t$
- Get current state: $s_t$
- repeat:
- Sample action from policy: $a_t \sim \pi(a_t \mid s_t;\theta^\prime)$
- Store reward and next state: $r_t$ and $s_{t+1}$
- $t \leftarrow t + 1$
- $T \leftarrow T + 1$
- until terminal $s_t$ or $t - t_{start} = t_{max}$
- Initialize $R = 0$ for terminal $s_t$, otherwise $R = V(s_t;\theta_v^\prime)$ (bootstrap from the last state)
- for $i$ in $\{t-1, \dots, t_{start}\}$:
- $R = r_i + \gamma R$
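A minimal sketch of this backward return computation; the names `rewards`, `bootstrap_value`, and `gamma` are assumptions for illustration.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    # rewards: [r_{t_start}, ..., r_{t-1}] collected during the rollout.
    # bootstrap_value: 0 for a terminal state, else V(s_t; theta_v').
    R = bootstrap_value
    returns = []
    for r in reversed(rewards):   # i = t-1, ..., t_start
        R = r + gamma * R         # R = r_i + gamma * R
        returns.append(R)
    returns.reverse()             # back into time order
    return returns


# Example: 3-step rollout ending in a non-terminal state with V(s_t) = 1.0
print(n_step_returns([0.0, 0.0, 1.0], bootstrap_value=1.0, gamma=0.9))
# ~ [1.539, 1.71, 1.9]
```

Each $R_i$ then serves as the regression target for the value function and enters the advantage $R_i - V(s_i;\theta_v^\prime)$ used to accumulate the policy gradient $d\theta$; the accumulated $d\theta$ and $d\theta_v$ are applied asynchronously to the shared $\theta$ and $\theta_v$.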