Asynchronous Methods for Deep Reinforcement Learning

Contributions:

  • Uses asynchronous actor-learners whose gradient updates are applied to shared parameters lock-free, in the style of Hogwild! (a minimal sketch follows this list)
  • Learners running in parallel explore different parts of the environment, so their updates are less likely to be correlated
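
A minimal sketch of the Hogwild!-style idea, using a toy squared loss and plain SGD rather than the paper's actual setup; the names `shared_theta` and `worker` are illustrative. Several threads read the shared parameters and write their updates back without any locking.

```python
import threading
import numpy as np

# Shared parameter vector, read and written lock-free by all workers (Hogwild! style).
shared_theta = np.zeros(4)

def worker(seed, steps=1000, lr=0.01):
    rng = np.random.default_rng(seed)
    true_theta = np.array([1.0, -2.0, 0.5, 3.0])     # toy regression target
    for _ in range(steps):
        x = rng.normal(size=4)
        y = true_theta @ x
        theta = shared_theta.copy()                  # read (possibly stale) parameters
        grad = 2.0 * (theta @ x - y) * x             # gradient of the squared error
        shared_theta[:] -= lr * grad                 # in-place write, no lock taken

threads = [threading.Thread(target=worker, args=(seed,)) for seed in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(shared_theta)   # close to true_theta despite the unsynchronized writes
```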

Algorithm:

  • Policy: $\pi(a_t \mid s_t;\theta)$
  • Value function: $V(s_t;\theta_v)$
  • Policy and value function share all parameters except their output layers: a softmax output for the policy and a linear output for the value function (see the network sketch after the algorithm outline)
  • Adding the entropy of the policy to the objective function is suggested to improve exploration; this maximum-entropy idea later becomes central to the soft actor-critic algorithm
  • Optimization methods compared (see the shared-RMSProp sketch after the algorithm outline):
    • SGD with momentum
    • RMSProp without shared statistics
    • RMSProp with shared statistics
  • Algorithm in detail (a worker-loop sketch follows this outline):
    • Initialize global shared parameters: $\theta$ and $\theta_v$
    • Initialize global shared counter: $T=0$
    • Initialize thread step counter: $t \leftarrow 1$
    • Repeat for every thread
      • Initialize gradients: $d\theta \leftarrow 0$ and $d\theta_v \leftarrow 0$
      • Synchronize thread parameters with global parameters: $\theta^\prime = \theta$ and $\theta_v^\prime = \theta_v$
      • $t_{start} = t$
      • Get current state: $s_t$
      • repeat:
        • Sample an action from the policy: $a_t \sim \pi(a_t \mid s_t;\theta^\prime)$
        • Store reward and next state: $r_t$ and $s_{t+1}$
        • $t \leftarrow t + 1$
      • until terminal $s_t$ or $t - t_{start} == t_{max}$
      • Initialize the return: $R = 0$ for terminal $s_t$, otherwise bootstrap with $R = V(s_t;\theta_v^\prime)$
      • for $i \in \{t-1, \ldots, t_{start}\}$:
        • $R = r_i + \gamma R$
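
Network sketch: to make the shared-parameter architecture and the entropy bonus concrete, here is a minimal PyTorch sketch. It is not the paper's exact network; the class name `ActorCritic`, the hidden size, the entropy weight `beta`, and the 0.5 value-loss weight are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    """Policy and value function sharing all parameters except their output heads."""
    def __init__(self, obs_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.policy_head = nn.Linear(hidden_dim, n_actions)  # softmax output for the policy
        self.value_head = nn.Linear(hidden_dim, 1)            # linear output for the value

    def forward(self, obs):
        h = self.trunk(obs)
        return F.softmax(self.policy_head(h), dim=-1), self.value_head(h).squeeze(-1)

def actor_critic_loss(probs, value, action, R, beta=0.01):
    """Policy-gradient loss with an entropy bonus plus a value regression term.
    R is the n-step return; beta weights the entropy regularization."""
    advantage = R - value.detach()
    log_prob = torch.log(probs.gather(-1, action.unsqueeze(-1)).squeeze(-1) + 1e-8)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(-1)
    policy_loss = -(log_prob * advantage + beta * entropy)
    value_loss = F.mse_loss(value, R)
    return (policy_loss + 0.5 * value_loss).mean()
```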
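
Shared-RMSProp sketch: the variant with shared statistics applies the usual update $g \leftarrow \alpha g + (1-\alpha)\,\Delta\theta^2$, $\theta \leftarrow \theta - \eta\,\Delta\theta/\sqrt{g+\epsilon}$, but keeps the running average $g$ in memory shared by all threads. A small numpy sketch; the function and argument names are illustrative.

```python
import numpy as np

def shared_rmsprop_update(theta, g, dtheta, lr=7e-4, alpha=0.99, eps=0.1):
    """Apply one RMSProp step in place.

    theta  : shared parameter vector
    g      : running average of squared gradients (shared across all threads in the
             'shared statistics' variant; kept per thread otherwise)
    dtheta : accumulated gradient from one worker
    """
    g[:] = alpha * g + (1.0 - alpha) * dtheta ** 2
    theta[:] -= lr * dtheta / np.sqrt(g + eps)
```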
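
Worker-loop sketch: a hedged Python/PyTorch sketch of one cycle of a single worker, including the gradient-accumulation and asynchronous-update steps that follow the return computation in the paper. It assumes an actor-critic model like the `ActorCritic` sketch in this section, a gym-style `env.step(a)` returning `(next_state, reward, done, info)`, and an optimizer built over the global model's parameters; `t_max`, `beta`, and the 0.5 value-loss weight are illustrative hyperparameters.

```python
import torch

def a3c_worker_step(env, state, local_model, global_model, optimizer,
                    t_max=20, gamma=0.99, beta=0.01):
    # Synchronize thread-specific parameters with the global ones: theta' <- theta.
    local_model.load_state_dict(global_model.state_dict())

    log_probs, values, entropies, rewards = [], [], [], []
    done = False
    for _ in range(t_max):                                  # rollout of at most t_max steps
        probs, value = local_model(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        state, reward, done, _ = env.step(action.item())    # gym-style interface (assumed)
        log_probs.append(dist.log_prob(action))
        entropies.append(dist.entropy())
        values.append(value)
        rewards.append(reward)
        if done:                                            # stop at a terminal state
            break

    # Bootstrap: R = 0 for terminal s_t, otherwise R = V(s_t; theta'_v).
    if done:
        R = torch.zeros(())
    else:
        R = local_model(torch.as_tensor(state, dtype=torch.float32))[1].detach()

    # Walk the rollout backwards, accumulating R = r_i + gamma * R and the losses.
    policy_loss, value_loss = 0.0, 0.0
    for i in reversed(range(len(rewards))):
        R = rewards[i] + gamma * R
        advantage = R - values[i]
        value_loss = value_loss + advantage.pow(2)
        policy_loss = policy_loss - log_probs[i] * advantage.detach() - beta * entropies[i]

    # Compute gradients on the local copy, then apply them to the shared model.
    optimizer.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    for local_p, global_p in zip(local_model.parameters(), global_model.parameters()):
        global_p.grad = local_p.grad
    optimizer.step()
    return state, done
```

The caller is expected to reset the environment when `done` is returned; the outer loop over workers and the global counter $T$ are omitted here.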