Deep Deterministic Policy Gradients in TensorFlow
Introduction
Deep Reinforcement Learning has recently gained a lot of traction in the machine learning community due to the significant amount of progress that has been made in the past few years. Traditionally, reinforcement learning algorithms were constrained to tiny, discretized grid worlds, which seriously inhibited them from gaining credibility as being viable machine learning tools. Here’s a classic example from Richard Sutton’s book, which I will be referencing a lot.
After Deep Q-Networks [4] became a hit, people realized that deep learning methods could be used to solve high-dimensional problems. One of the subsequent challenges that the reinforcement learning community faced was figuring out how to deal with continuous action spaces. This is a significant obstacle, since most interesting problems in robotic control, etc., fall into this category. Logically, if you discretize your continuous action space too finely, you end up with the same curse of dimensionality problem as before. On the other hand, a naive discretization of the action space throws away valuable information concerning the geometry of the action domain.
Google DeepMind has devised a solid algorithm for tackling the continuous action space problem. Building off the prior work of [3] on Deterministic Policy Gradients, they have produced a policy-gradient actor-critic algorithm called Deep Deterministic Policy Gradients (DDPG) [5] that is off-policy and model-free, and that uses some of the deep learning tricks that were introduced along with Deep Q-Networks (hence the “deep”-ness of DDPG). In this blog post, we’re going to discuss how to implement this algorithm using Tensorflow and tflearn, and then evaluate it with OpenAI Gym on the pendulum environment. I’ll also discuss some of the theory behind it. Regrettably, I can’t start with introducing the basics of reinforcement learning since that would make this blog post much too long; however, Richard Sutton’s book (linked above), as well as David Silver’s course, are excellent resources to get going with RL.
Let’s Start With Some Theory
[Wait, I just want to skip to the Tensorflow part!]
Policy-Gradient Methods
Policy-Gradient (PG) algorithms optimize a policy end-to-end by computing noisy estimates of the gradient of the expected reward of the policy and then updating the policy in the gradient direction. Traditionally, PG methods have assumed a stochastic policy $\mu(a \mid s)$, which gives a probability distribution over actions. Ideally, the algorithm sees lots of training examples of high rewards from good actions and negative rewards from bad actions. Then, it can increase the probability of the good actions. In practice, you tend to run into plenty of problems with vanilla-PG; for example, getting one reward signal at the end of a long episode of interaction with the environment makes it difficult to ascertain exactly which action was the good one. This is known as the credit assignment problem. For RL problems with continuous action spaces, vanilla-PG is all but useless. You can, however, get vanilla-PG to work with some RL domains that take in visual inputs and have discrete action spaces with a convolutional neural network representing your policy (talk about standing on the shoulders of giants!). There are extensions to the vanilla-PG algorithm such as REINFORCE and Natural Policy Gradients that make the algorithm much more viable. For a first look into stochastic policy gradients, you can find an overview of the Stochastic Policy Gradient theorem in [3], an in-depth blog post by Andrej Karpathy here, and a nice explanation from OpenAI here.
Actor-Critic Algorithms
The Actor-Critic learning algorithm is used to represent the policy function independently of the value function. The policy function structure is known as the actor, and the value function structure is referred to as the critic. The actor produces an action given the current state of the environment, and the critic produces a TD (Temporal-Difference) error signal given the state and resultant reward. If the critic is estimating the action-value function $Q(s,a)$, it will also need the output of the actor. The output of the critic drives learning in both the actor and the critic. In Deep Reinforcement Learning, neural networks can be used to represent the actor and critic structures.
Off-Policy vs. On-Policy
Reinforcement Learning algorithms which are characterized as off-policy generally employ a separate behavior policy that is independent of the policy being improved upon; the behavior policy is used to simulate trajectories. A key benefit of this separation is that the behavior policy can operate by sampling all actions, whereas the estimation policy can be deterministic (e.g., greedy) [1]. Q-learning is an off-policy algorithm, since it updates the Q-values without making any assumptions about the actual policy being followed. Rather, the Q-learning algorithm simply states that the Q-value corresponding to state $s(t)$ and action $a(t)$ is updated using the Q-value of the next state $s(t + 1)$ and the action $a(t + 1)$ that maximizes the Q-value at state $s(t + 1)$.
On-policy algorithms directly use the policy that is being estimated to sample trajectories during training.
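To make the off-policy nature of Q-learning concrete, here is a minimal tabular sketch. The function name `q_learning_update`, the dictionary-based table, and the values of `alpha` and `gamma` are all assumptions made for illustration; they are not taken from the DDPG code discussed later.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step; Q maps (state, action) -> value."""
    # Off-policy target: bootstrap from the greedy action in s_next,
    # regardless of which action the behavior policy actually takes next.
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]

# Hypothetical two-action problem with string-labelled states.
Q = defaultdict(float)
actions = [0, 1]
Q[("s1", 1)] = 2.0
new_value = q_learning_update(Q, "s0", 0, r=1.0, s_next="s1", actions=actions)
```

The `max` over next actions is what makes the update independent of the behavior policy: the transition could have come from any exploration scheme.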
Model-free Algorithms
Model-free RL algorithms are those that make no effort to learn the underlying dynamics that govern how an agent interacts with the environment. In the case where the environment has a discrete state space and the agent has a discrete number of actions to choose from, a model of the dynamics of the environment is the 1-step transition matrix: $T(s(t + 1) \mid s(t), a(t))$. This stochastic matrix gives all of the probabilities for arriving at a desired state given the current state and action. Clearly, for problems with high-dimensional state and action spaces, this matrix is incredibly expensive in space and time to compute. Suppose your state space is the set of all possible 64 x 64 RGB images and your agent has 18 actions available to it. Even with a very conservative state count of $S \approx 68.7 \times 10^{9}$ (the true number of distinct images, $256^{64 \cdot 64 \cdot 3}$ at 8 bits per channel, is astronomically larger), the transition matrix’s size is $S \times S \times A \approx (68.7 \times 10^{9}) \times (68.7 \times 10^9) \times 18$, and at 32 bits per matrix element, that’s around $3.4 \times 10^{14}$ GB to store it in RAM!
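The storage estimate can be checked with a few lines of arithmetic. This is only a sketch: it assumes that the state count of about 68.7 x 10^9 corresponds to 256^3 colors times 64 x 64 pixel positions (which equals 2^36), and that a GB means 10^9 bytes.

```python
# Reproduce the back-of-the-envelope estimate from the text.
num_states = (256 ** 3) * (64 * 64)   # assumed origin of the ~68.7e9 figure
num_actions = 18
bytes_per_entry = 4                   # 32 bits per matrix element

total_bytes = num_states * num_states * num_actions * bytes_per_entry
total_gb = total_bytes / 1e9          # decimal gigabytes
print(f"{total_gb:.2e} GB")
```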
Rather than dealing with all of that, model-free algorithms directly estimate the optimal policy or value function through algorithms such as policy iteration or value iteration. This is much more computationally efficient. I should note that, if possible, obtaining and using a good approximation of the underlying model of the environment can only be beneficial. Be wary, though: using a bad approximation of a model of the environment will only bring you misery. Keep in mind, too, that model-free methods generally require a larger number of training examples.
The Meat and Potatoes of DDPG
At its core, DDPG is a policy gradient algorithm that uses a stochastic behavior policy for good exploration but estimates a deterministic target policy, which is much easier to learn. Policy gradient algorithms utilize a form of policy iteration: they evaluate the policy, and then follow the policy gradient to maximize performance. Since DDPG is off-policy and uses a deterministic target policy, this allows for the use of the Deterministic Policy Gradient theorem (which will be derived shortly). DDPG is an actor-critic algorithm as well; it primarily uses two neural networks, one for the actor and one for the critic. These networks compute action predictions for the current state and generate a temporal-difference (TD) error signal each time step. The input of the actor network is the current state, and the output is a single real value representing an action chosen from a continuous action space (whoa!). The critic’s output is simply the estimated Q-value of the current state and of the action given by the actor. The deterministic policy gradient theorem provides the update rule for the weights of the actor network. The critic network is updated from the gradients obtained from the TD error signal.
Sadly, it turns out that tossing neural networks at DPG results in an algorithm that behaves poorly, resisting all of your most valiant efforts to get it to converge. The following are most likely some of the key conspirators:

In general, training and evaluating your policy and/or value function with thousands of temporally-correlated simulated trajectories leads to the introduction of enormous amounts of variance in your approximation of the true Q-function (the critic). The TD error signal is excellent at compounding the variance introduced by your bad predictions over time. It is highly suggested to use a replay buffer to store the experiences of the agent during training, and then randomly sample experiences to use for learning in order to break up the temporal correlations within different training episodes. This technique is known as experience replay. DDPG uses this.

Directly updating your actor and critic neural network weights with the gradients obtained from the TD error signal that was computed from both your replay buffer and the output of the actor and critic networks causes your learning algorithm to diverge (or to not learn at all). It was recently discovered that using a set of target networks to generate the targets for your TD error computation regularizes your learning algorithm and increases stability. Accordingly, here are the equations for the TD target $y_i$ and the loss function for the critic network:
$$y_i = r_i + \gamma Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big) \tag{1}$$
$$L = \frac{1}{N} \sum_{i} \big(y_i - Q(s_i, a_i \mid \theta^{Q})\big)^2 \tag{2}$$
Here, a minibatch of size $N$ has been sampled from the replay buffer, with the $i$ index referring to the $i$-th sample, and $\gamma$ is the discount factor. The target for the TD error computation, $y_i$, is computed from the sum of the immediate reward and the discounted output of the target critic network evaluated at the action chosen by the target actor network; these networks have weights $\theta^{Q'}$ and $\theta^{\mu'}$ respectively. Then, the critic loss can be computed w.r.t. the output $Q(s_i, a_i \mid \theta^{Q})$ of the critic network for the $i$-th sample.
See [4] for more details on the use of target networks.
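As a rough illustration of the TD target and the critic loss, here is a plain-Python sketch that operates on lists of numbers standing in for network outputs. The function names `td_targets` and `critic_loss` are hypothetical, and skipping the bootstrap term at terminal states is an assumption about how episode ends are handled; it is not stated in the text above.

```python
def td_targets(rewards, terminals, next_q_values, gamma=0.99):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})).

    next_q_values stands in for the target critic evaluated at the
    target actor's action. Assumption: no bootstrapping when the
    transition ended the episode.
    """
    return [r if done else r + gamma * q_next
            for r, done, q_next in zip(rewards, terminals, next_q_values)]

def critic_loss(targets, q_values):
    """Mean squared TD error over the minibatch."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / n

# Made-up minibatch of size N = 2.
rewards = [1.0, 0.5]
terminals = [False, True]
target_q = [2.0, 3.0]   # stand-ins for the target critic's outputs
online_q = [2.5, 1.0]   # stand-ins for the online critic's outputs

y = td_targets(rewards, terminals, target_q, gamma=0.9)
loss = critic_loss(y, online_q)
```

Note that the targets are treated as constants: only the online critic's outputs would receive gradients from this loss.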
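Now, as mentioned above, the weights of the critic network can be updated with the gradients obtained from the loss function in Eq. 2. Also, remember that the actor network is updated with the Deterministic Policy Gradient. Here lies the crux of DDPG! Silver, et al., [3] proved that the stochastic policy gradient $\nabla_{\theta} \mu(a \mid s, \theta)$, which is the gradient of the policy’s performance, is equivalent to the deterministic policy gradient, which is given by:
$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\Big[\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_t,\, a=\mu(s_t)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_t}\Big] \tag{4}$$
where $J$ denotes the policy’s performance and the expectation is taken over the states visited under the behavior policy.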
Notice that the policy term in the expectation is not a distribution over actions. It turns out that all you need is the gradient of the output of the critic network w.r.t. the action, multiplied by the gradient of the output of the actor network w.r.t. its parameters, averaged over a minibatch. Simple!
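To see the chain rule behind the deterministic policy gradient in action, here is a toy sketch with a one-parameter actor and a hand-written critic. Everything in it is an assumption made for illustration: the linear actor `mu(s | theta) = theta * s`, the fixed critic `Q(s, a) = -(a - 2 s)^2`, the step size, and the function name `actor_gradient`. In DDPG both functions are neural networks and the critic is learned, not fixed.

```python
def actor_gradient(theta, states):
    """Minibatch average of dQ/da * da/dtheta for a toy actor and critic."""
    total = 0.0
    for s in states:
        a = theta * s                   # deterministic actor: mu(s | theta) = theta * s
        dq_da = -2.0 * (a - 2.0 * s)    # toy critic: Q(s, a) = -(a - 2 s)^2
        da_dtheta = s                   # gradient of the actor output w.r.t. theta
        total += dq_da * da_dtheta      # chain rule
    return total / len(states)

theta = 0.0
states = [-1.0, 0.5, 1.0, 2.0]
for _ in range(200):
    # Gradient ascent: move theta in the direction that increases Q.
    theta += 0.05 * actor_gradient(theta, states)
```

For this toy critic the best action in state `s` is `2 * s`, so the actor parameter should approach 2.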
I think the proof of the deterministic policy gradient theorem is quite illuminating, so I’d like to demonstrate it here before moving on to the code.
The expectation in the right-hand side of Eq. 6 is exactly what we want. In [3], it is shown that the stochastic policy gradient converges to Eq. 4 in the limit as the variance of the stochastic policy approaches 0. This is significant because it allows for all of the machinery for stochastic policy gradients to be applied to deterministic policy gradients.
Enough With The Chit Chat, Let’s See Some Code!
Let’s get started.
We’re writing code to solve the Pendulum environment in OpenAI gym, which has a low-dimensional state space and a single continuous action within [-2, 2]. The goal is to swing up and balance the pendulum.
The first part is easy. Set up a data structure to represent your replay buffer. I recommend using a deque from python’s collections library. The replay buffer will return a randomly chosen batch of experiences when queried.
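A minimal sketch of such a replay buffer follows. The class and method names (`ReplayBuffer`, `add`, `size`, `sample_batch`), the tuple layout, and the optional `seed` argument are assumptions for illustration, not necessarily what the full code uses.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, buffer_size, seed=None):
        # A bounded deque drops the oldest experience once it is full.
        self.buffer = deque(maxlen=buffer_size)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, terminal, next_state):
        self.buffer.append((state, action, reward, terminal, next_state))

    def size(self):
        return len(self.buffer)

    def sample_batch(self, batch_size):
        # Sample uniformly without replacement; return fewer items
        # if the buffer does not yet hold batch_size experiences.
        count = min(batch_size, len(self.buffer))
        batch = self.rng.sample(list(self.buffer), count)
        # Transpose into per-field tuples:
        # (states, actions, rewards, terminals, next_states).
        return tuple(zip(*batch))

# Usage with made-up integer "states".
buffer = ReplayBuffer(buffer_size=3, seed=0)
for t in range(5):
    buffer.add(t, 0.0, -1.0, False, t + 1)
states, actions, rewards, terminals, next_states = buffer.sample_batch(2)
```

Because the deque is bounded, only the three most recent experiences survive in this example, and each sampled state stays paired with its own next state.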
Okay, let’s define our actor and critic networks. We’re going to use tflearn to condense the boilerplate code.
I would suggest placing these two functions in separate Actor and Critic classes, as shown above. The hyperparameter and layer details for the networks are in the appendix of the DDPG paper [5]. The networks for the low-dimensional state-space problems are pretty simple, though. For the actor network, the output is a tanh layer scaled to be between $[-b, +b], b \in \mathbb{R}$. This is useful when your action space is on the real line but is bounded and closed, as is the case for the pendulum task.
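The scaled tanh output can be illustrated without any deep learning library. In this sketch, `scale_action` and `action_bound` are hypothetical names; in the actual network the same scaling would be applied to the output tensor of the last layer.

```python
import math

def scale_action(pre_activation, action_bound):
    """Squash with tanh into [-1, 1], then scale to [-action_bound, +action_bound]."""
    return action_bound * math.tanh(pre_activation)

# For the pendulum task the bound is 2, so every output lands in [-2, 2].
samples = [scale_action(x, action_bound=2.0) for x in (-100.0, -1.0, 0.0, 1.0, 100.0)]
```

However large the pre-activation gets, the action can never leave the valid range, so no clipping is needed on the network output itself.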
Notice that the critic network takes both the state and the action as inputs; however, the action input skips the first layer. This is a design decision that has experimentally worked well. Accommodating this with tflearn was a bit tricky.
Make sure to use Tensorflow placeholders, created by tflearn.input_data, for the inputs. Leaving the first dimension of the placeholders as None allows you to train on batches of experiences.
You can simply call these creation methods twice, once to create the actor and critic networks that will be used for training, and again to create your target actor and critic networks.
You can create a Tensorflow Op to update the target network parameters like so:
This looks a bit convoluted, but it’s actually a great display of Tensorflow’s flexibility. You’re defining a Tensorflow Op, update_target_network_params, that will copy the parameters of the online network with a mixing factor $\tau$. Make sure you’re copying over the correct Tensorflow variables by checking what is being returned by tf.trainable_variables(). You’ll need to define this Op for both the actor and critic.
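Stripped of the Tensorflow machinery, the soft update is just a weighted average of each pair of parameters. The sketch below uses plain lists of floats in place of network variables; the function name `soft_update` and the value `tau=0.001` are assumptions chosen for illustration.

```python
def soft_update(target_params, online_params, tau):
    """Return tau * online + (1 - tau) * target for each parameter pair."""
    return [tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(online_params, target_params)]

target = [0.0, 0.0]
online = [1.0, -2.0]
target = soft_update(target, online, tau=0.001)
```

With a small `tau`, the target parameters trail the online parameters slowly, which keeps the TD targets from shifting abruptly between updates.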
Let’s define the gradient computation and optimization Tensorflow operations. We’ll use ADAM as our optimization method. It’s sort of replaced SGD as the de facto standard, now that it’s implemented in so many plug-and-play deep-learning libraries such as tflearn and keras and tends to outperform it.
For the actor network…
Notice how the gradients are combined. tf.gradients() makes it quite easy to implement the Deterministic Policy Gradient equation (Eq. 4). I negate the action-value gradient since we want the actor to follow the action-value gradients uphill, while the optimizer takes descent steps along whatever gradients it is given. Tensorflow will take the sum and average of the gradients of your minibatch.
Then, for the critic network…
This is exactly Eq. 2. Make sure to grab the action-value gradients at the end there to pass to the policy network for gradient computation.
I like to encapsulate calls to my Tensorflow session to keep things organized and readable in my training code. For brevity’s sake, I’ll just show the ones for the actor network.
Now, let’s show the main training loop and we’ll be done!
You’ll want to add bookkeeping to the code and arrange things a little more neatly; you can find the full code here.
I used Tensorboard to view the total reward per episode and average max Q. Tensorflow is a bit low-level, so it’s definitely recommended to use tools like tflearn, keras, Tensorboard, etc., on top of it. I am quite pleased with TF though; it’s no surprise that it got so popular so quickly.
Looking Forward
Despite the fact that I ran this code on the courageous CPU of my Macbook Air, it converged relatively quickly to a good solution. Some ways to potentially get better performance (besides running the code on a GPU lol):

Use a priority algorithm for sampling from the replay buffer instead of uniformly sampling. See my summary of Prioritized Experience Replay.

Experiment with different stochastic policies to improve exploration.

Use recurrent networks to capture temporal nuances within the environment.
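To give a flavor of the first suggestion in the list above, here is a simplified sketch of priority-proportional sampling. It is not the full Prioritized Experience Replay algorithm: it samples with replacement, omits importance-sampling corrections, and the function name `sample_prioritized`, the `alpha` exponent, and the use of absolute TD errors as priorities are assumptions for illustration.

```python
import random

def sample_prioritized(priorities, batch_size, alpha=0.6, rng=random):
    """Sample buffer indices with probability proportional to priority ** alpha."""
    weights = [p ** alpha for p in priorities]
    return rng.choices(range(len(priorities)), weights=weights, k=batch_size)

# Made-up priorities, e.g. absolute TD errors of four stored experiences.
priorities = [0.1, 2.0, 0.5, 1.0]
indices = sample_prioritized(priorities, batch_size=8, rng=random.Random(0))
```

Setting `alpha` to 0 recovers uniform sampling, so the exponent controls how strongly the sampler favors high-priority experiences.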
The authors of DDPG also used convolutional neural networks to tackle control tasks of higher complexities. They were able to learn good policies with just pixel inputs, which is really cool.
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction.
[2] Reinforcement learning: An introduction, Fig 6.15
[3] Silver, et al. Deterministic Policy Gradients
[4] Mnih, et al. Human-level control through deep reinforcement learning
[5] Lillicrap, et al. Continuous control with Deep Reinforcement Learning
[6] Reinforcement learning: An introduction, Fig 6.13