Asynchronous Methods for Deep Reinforcement Learning

Contributions:

  • Uses asynchronous actor-learners whose gradient updates are applied to shared parameters lock-free, in the style of Hogwild! (a minimal sketch follows this list)
  • Learners running in parallel explore different parts of the environment, so their updates are less likely to be correlated
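
A minimal sketch of the Hogwild!-style idea, using a toy squared loss and plain SGD rather than the paper's actual setup; the names `shared_theta` and `worker` are illustrative. Several threads read the shared parameters and write their updates back without any locking.

```python
import threading
import numpy as np

# Shared parameter vector, read and written lock-free by all workers (Hogwild! style).
shared_theta = np.zeros(4)

def worker(seed, steps=1000, lr=0.01):
    rng = np.random.default_rng(seed)
    true_theta = np.array([1.0, -2.0, 0.5, 3.0])     # toy regression target
    for _ in range(steps):
        x = rng.normal(size=4)
        y = true_theta @ x
        theta = shared_theta.copy()                  # read (possibly stale) parameters
        grad = 2.0 * (theta @ x - y) * x             # gradient of the squared error
        shared_theta[:] -= lr * grad                 # in-place write, no lock taken

threads = [threading.Thread(target=worker, args=(seed,)) for seed in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(shared_theta)   # close to true_theta despite the unsynchronized writes
```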

Algorithm:

  • Policy: $\pi(a_t \mid s_t;\theta)$
  • Value function: $V(s_t;\theta_v)$
  • Policy and value function share all parameters except their output layers: a softmax output for the policy and a linear output for the value function (see the network sketch after the algorithm outline)
  • Adding the entropy of the policy to the objective function is suggested to improve exploration; this maximum-entropy idea later becomes central to the soft actor-critic algorithm
  • Optimization methods compared (see the shared-RMSProp sketch after the algorithm outline):
    • SGD with momentum
    • RMSProp without shared statistics
    • RMSProp with shared statistics
  • Algorithm in detail (a worker-loop sketch follows this outline):
    • Initialize global shared parameters: $\theta$ and $\theta_v$
    • Initialize global shared counter: $T=0$
    • Initialize thread step counter: $t \leftarrow 1$
    • Repeat for every thread
      • Initialize gradients: $d\theta \leftarrow 0$ and $d\theta_v \leftarrow 0$
      • Synchronize thread parameters with global parameters: $\theta^\prime = \theta$ and $\theta_v^\prime = \theta_v$
      • $t_{start} = t$
      • Get current state: $s_t$
      • repeat:
        • Sample an action from the policy: $a_t \sim \pi(a_t \mid s_t;\theta^\prime)$
        • Store reward and next state: $r_t$ and $s_{t+1}$
        • $t \leftarrow t + 1$
      • until terminal $s_t$ or $t - t_{start} == t_{max}$
      • Initialize the return: $R = 0$ for terminal $s_t$, otherwise bootstrap with $R = V(s_t;\theta_v^\prime)$
      • for $i \in \{t-1, \ldots, t_{start}\}$:
        • $R = r_i + \gamma R$
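
Network sketch: to make the shared-parameter architecture and the entropy bonus concrete, here is a minimal PyTorch sketch. It is not the paper's exact network; the class name `ActorCritic`, the hidden size, the entropy weight `beta`, and the 0.5 value-loss weight are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    """Policy and value function sharing all parameters except their output heads."""
    def __init__(self, obs_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.policy_head = nn.Linear(hidden_dim, n_actions)  # softmax output for the policy
        self.value_head = nn.Linear(hidden_dim, 1)            # linear output for the value

    def forward(self, obs):
        h = self.trunk(obs)
        return F.softmax(self.policy_head(h), dim=-1), self.value_head(h).squeeze(-1)

def actor_critic_loss(probs, value, action, R, beta=0.01):
    """Policy-gradient loss with an entropy bonus plus a value regression term.
    R is the n-step return; beta weights the entropy regularization."""
    advantage = R - value.detach()
    log_prob = torch.log(probs.gather(-1, action.unsqueeze(-1)).squeeze(-1) + 1e-8)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(-1)
    policy_loss = -(log_prob * advantage + beta * entropy)
    value_loss = F.mse_loss(value, R)
    return (policy_loss + 0.5 * value_loss).mean()
```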
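
Shared-RMSProp sketch: the variant with shared statistics applies the usual update $g \leftarrow \alpha g + (1-\alpha)\,\Delta\theta^2$, $\theta \leftarrow \theta - \eta\,\Delta\theta/\sqrt{g+\epsilon}$, but keeps the running average $g$ in memory shared by all threads. A small numpy sketch; the function and argument names are illustrative.

```python
import numpy as np

def shared_rmsprop_update(theta, g, dtheta, lr=7e-4, alpha=0.99, eps=0.1):
    """Apply one RMSProp step in place.

    theta  : shared parameter vector
    g      : running average of squared gradients (shared across all threads in the
             'shared statistics' variant; kept per thread otherwise)
    dtheta : accumulated gradient from one worker
    """
    g[:] = alpha * g + (1.0 - alpha) * dtheta ** 2
    theta[:] -= lr * dtheta / np.sqrt(g + eps)
```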
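
Worker-loop sketch: a hedged Python/PyTorch sketch of one cycle of a single worker, including the gradient-accumulation and asynchronous-update steps that follow the return computation in the paper. It assumes an actor-critic model like the `ActorCritic` sketch in this section, a gym-style `env.step(a)` returning `(next_state, reward, done, info)`, and an optimizer built over the global model's parameters; `t_max`, `beta`, and the 0.5 value-loss weight are illustrative hyperparameters.

```python
import torch

def a3c_worker_step(env, state, local_model, global_model, optimizer,
                    t_max=20, gamma=0.99, beta=0.01):
    # Synchronize thread-specific parameters with the global ones: theta' <- theta.
    local_model.load_state_dict(global_model.state_dict())

    log_probs, values, entropies, rewards = [], [], [], []
    done = False
    for _ in range(t_max):                                  # rollout of at most t_max steps
        probs, value = local_model(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        state, reward, done, _ = env.step(action.item())    # gym-style interface (assumed)
        log_probs.append(dist.log_prob(action))
        entropies.append(dist.entropy())
        values.append(value)
        rewards.append(reward)
        if done:                                            # stop at a terminal state
            break

    # Bootstrap: R = 0 for terminal s_t, otherwise R = V(s_t; theta'_v).
    if done:
        R = torch.zeros(())
    else:
        R = local_model(torch.as_tensor(state, dtype=torch.float32))[1].detach()

    # Walk the rollout backwards, accumulating R = r_i + gamma * R and the losses.
    policy_loss, value_loss = 0.0, 0.0
    for i in reversed(range(len(rewards))):
        R = rewards[i] + gamma * R
        advantage = R - values[i]
        value_loss = value_loss + advantage.pow(2)
        policy_loss = policy_loss - log_probs[i] * advantage.detach() - beta * entropies[i]

    # Compute gradients on the local copy, then apply them to the shared model.
    optimizer.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    for local_p, global_p in zip(local_model.parameters(), global_model.parameters()):
        global_p.grad = local_p.grad
    optimizer.step()
    return state, done
```

The caller is expected to reset the environment when `done` is returned; the outer loop over workers and the global counter $T$ are omitted here.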