Riedmiller, M. 2005. Neural Fitted Q Iteration – First Experiences with a Data Efficient Neural Reinforcement Learning Method. ECML 2005.

Summary

The key insight here is that neural networks react globally to local weight changes: an update driven by a single new data point can degrade the value estimates everywhere else in state space. For a neural network representing the Q-value function, the influence of a weight change for a new data point must therefore be constrained by re-presenting previous knowledge in the form of prior experiences. The proposed algorithm is a special case of experience replay.

In principle, classical Q-learning can be implemented directly in a neural network: an MSE loss is computed between the target (expected) Q-value and the Q-value the network generates for each transition. Applied naively, this online scheme requires tens of thousands of training examples because of the problem stated above.
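A minimal sketch of that per-transition squared error (the names are illustrative, not from the paper; `q` is any callable mapping a state-action pair to a scalar, and the immediate signal is treated as a cost to be minimized):

```python
def squared_td_error(q, transition, actions, gamma=0.95):
    s, a, c, s_next = transition
    # Bootstrapped target: immediate cost plus discounted best successor value
    # (minimum over actions, since costs are being minimized).
    target = c + gamma * min(q(s_next, a2) for a2 in actions)
    return (q(s, a) - target) ** 2  # one term of the MSE over the training set
```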

NFQ addresses this with off-line learning that considers the entire set of transition experiences at once: each iteration turns the stored transitions into a supervised pattern set and fits the network to it. The weights are updated with RPROP, an advanced batch supervised learning technique.
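A hedged sketch of one such NFQ-style iteration, written with PyTorch (the framework, helper names, and hyperparameters are assumptions, not from the paper): the frozen network generates targets for every stored transition, and the network is then fit to that fixed pattern set with full-batch RPROP.

```python
import torch
import torch.nn as nn

def nfq_iteration(q_net, D, actions, gamma=0.95, epochs=300):
    # 1. Build a fixed supervised pattern set from the frozen network:
    #    input = (state, action), target = cost + gamma * min_a' Q(s', a').
    inputs, targets = [], []
    with torch.no_grad():
        for (s, a, c, s_next) in D:
            q_next = min(float(q_net(torch.tensor(s_next + [a2], dtype=torch.float32)))
                         for a2 in actions)
            inputs.append(s + [a])
            targets.append(c + gamma * q_next)
    X = torch.tensor(inputs, dtype=torch.float32)
    y = torch.tensor(targets, dtype=torch.float32).unsqueeze(1)

    # 2. Fit the network to the whole pattern set with RPROP (full-batch updates).
    opt = torch.optim.Rprop(q_net.parameters())
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(q_net(X), y)
        loss.backward()
        opt.step()
    return q_net
```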

Strengths

NFQ is very flexible. One variant is to incrementally add new experiences to D, the set of transition experiences, as sketched below. This is useful when a reasonable set of experiences cannot be collected by controlling the system with purely random actions. A combination of this incremental collection and pre-training on sample paths seems natural.
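A sketch of that incremental variant under stated assumptions: it relies on the `nfq_iteration` helper from the previous sketch and on a gym-style environment with `reset()`/`step()` whose step signal is read as a cost; all of these names and the episode/step limits are illustrative, not from the paper.

```python
import torch

def greedy_action(q_net, s, actions):
    # Cost-minimizing action under the current Q estimate.
    return min(actions, key=lambda a: float(q_net(torch.tensor(s + [a], dtype=torch.float32))))

def incremental_nfq(q_net, env, actions, n_episodes=200, max_steps=100):
    D = []
    for _ in range(n_episodes):
        s, done, steps = list(env.reset()), False, 0
        while not done and steps < max_steps:
            a = greedy_action(q_net, s, actions)
            s_next, cost, done, _ = env.step(a)
            D.append((s, a, cost, list(s_next)))
            s, steps = list(s_next), steps + 1
        q_net = nfq_iteration(q_net, D, actions)  # re-fit on the grown set D
    return q_net
```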

Methods

For the mountain car setup, the authors used an MLP with 3 input neurons (2 state and 1 action), two hidden layers of 5 neurons each, and 1 output neuron, all with sigmoidal activation functions.
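The same 3-5-5-1 sigmoidal network written as a PyTorch module (the framework is an assumption; the layer sizes and activations follow the text):

```python
import torch.nn as nn

mountain_car_qnet = nn.Sequential(
    nn.Linear(3, 5), nn.Sigmoid(),   # 2 state inputs + 1 action input -> first hidden layer
    nn.Linear(5, 5), nn.Sigmoid(),   # second hidden layer of 5 units
    nn.Linear(5, 1), nn.Sigmoid(),   # single sigmoidal output unit
)
```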

RPROP and batch learning, both by Riedmiller