Kulkarni, et al., 2016

Summary

The first two sections of this paper provide a decent overview of recent advances in the hierarchical and intrinsic RL literature. The authors propose a learning framework for achieving complex goals in the face of sparse rewards. Drawing on Singh, et al., they separate intrinsic and extrinsic learning: extrinsic reward signals come from the environment, while intrinsic reward signals are generated by an internal critic.

There is a meta-controller that learns an approximation of the optimal goal policy $\pi_g(g|s)$ and a controller that learns an approximation of the optimal action policy $\pi_{ag}(a|g,s)$. The meta-controller operates on a slower timescale than the controller; it is concerned with selecting the optimal goal for the controller to work towards. The internal critic provides incremental feedback to the controller, while the meta-controller receives external feedback from the environment.
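For intuition, here is a minimal sketch of that two-timescale loop, assuming hypothetical `env`, `critic`, `meta_controller`, and `controller` objects; this is my reading of the framework, not the authors' code:

```python
# Sketch of the hierarchical control loop: the meta-controller picks goals on a
# slow timescale, the controller acts on a fast timescale until the internal
# critic says the goal is reached (all objects here are hypothetical).

def run_episode(env, meta_controller, controller, critic):
    s = env.reset()
    done = False
    total_extrinsic = 0.0
    while not done:
        # Meta-controller: choose a goal from the current state.
        g = meta_controller.select_goal(s)
        s0 = s
        extrinsic_return = 0.0
        # Controller: act until the goal is reached or the episode ends.
        while not done and not critic.goal_reached(s, g):
            a = controller.select_action(s, g)
            s_next, r_ext, done = env.step(a)
            r_int = critic.intrinsic_reward(s, a, s_next, g)  # internal feedback
            controller.store(s, g, a, r_int, s_next, done)
            extrinsic_return += r_ext
            s = s_next
        # The meta-controller is credited with the extrinsic reward accumulated
        # while the controller pursued goal g.
        meta_controller.store(s0, g, extrinsic_return, s, done)
        total_extrinsic += extrinsic_return
    return total_extrinsic
```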

Each controller is represented as a deep Q-network, and the usual DQN machinery is applied: separate experience-replay memories for the two levels and $\epsilon$-greedy exploration for both policies.
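As I read it, the controller's Q-network is trained on intrinsic rewards while the meta-controller's Q-network is trained on the extrinsic reward accumulated over a goal's lifetime, with one-step targets roughly of the form (notation mine):

$$
y_1 = r_{\text{int}} + \gamma \max_{a'} Q_1(s', a'; g), \qquad
y_2 = \sum_t r_{\text{ext},t} + \gamma \max_{g'} Q_2(s', g'),
$$

where the sum runs over the extrinsic rewards collected while the controller pursued goal $g$, and $s'$ is the state in which that goal terminated.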

The framework is tested in two settings: a stochastic decision-making problem and the Atari game Montezuma’s Revenge. The hierarchical agent significantly outperforms DQN on Montezuma’s Revenge (where DQN scores 0 points).

My Notes

  1. The main contribution of this paper is the use of deep Q-networks for hierarchical/intrinsically motivated RL. However, the theory of intrinsically motivated RL and hierarchical RL with sub-policies for learning incremental behaviors under sparse extrinsic feedback is not novel; hence, the overall impact of the paper is substantially reduced.
  2. DQN is used as the baseline in the results/figures, but the reason for choosing such a questionable baseline is unclear. The authors themselves mention that Gorila DQN achieves a better average reward of 4.16; it would be better to see a more recent DRL algorithm used as the baseline.
  3. The set of goals, the intrinsic critic, and the external reward all still need to be hand-crafted for every learning problem. This is not a flaw of the research, since the motivation for this work was to handle sparse rewards.
  4. The algorithm relies on epsilon-greedy exploration at both levels: an epsilon parameter is annealed for each policy. Even with the exploration decay schedule, the asymptotic variance of the total reward in the Montezuma’s Revenge experiment remains quite large.
  5. To solve Montezuma’s Revenge, they implemented a “custom object detector” to identify objects in the game such as the ladder and the key, but they only mention it in passing. The details of the internal critic for the Montezuma agent are also unclear. It appears that they defined relations such as “agent reaches ladder” and “agent reaches key” and used these for intrinsic rewards, but exactly how they did so is not spelled out (a guess at what such a critic might look like is sketched after this list). They also claim that their method does not require explicit encoding of the relations between objects.
  6. Overall, the idea of using hierarchical DQN to learn a temporal abstraction is a promising one, and it should be explored more. Unsupervised discovery of sub-tasks/goals is still an open problem in RL as well.
  7. An interesting avenue mentioned by the authors is the use of evolutionary methods to search the space of reward functions.
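Regarding note 5, here is a guess at what a relation-based internal critic might look like; the goal names, the `detect_objects` stand-in, and the bounding-box overlap test are all assumptions for illustration, not details taken from the paper:

```python
# Hypothetical sketch of a relation-based internal critic for Montezuma's
# Revenge. The object detector, goal set, and reward values are assumptions,
# not the authors' implementation.

GOALS = ["ladder", "key", "door"]  # hand-picked entities the agent can target

def detect_objects(frame):
    """Stand-in for the paper's custom object detector: returns a dict
    mapping object names (including 'agent') to bounding boxes."""
    raise NotImplementedError

def boxes_overlap(box_a, box_b):
    # Boxes are (x0, y0, x1, y1); overlap iff they intersect on both axes.
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

class InternalCritic:
    def goal_reached(self, frame, goal):
        # "Agent reaches <object>" modelled as bounding-box overlap.
        assert goal in GOALS
        objects = detect_objects(frame)
        return boxes_overlap(objects["agent"], objects[goal])

    def intrinsic_reward(self, frame, goal):
        # +1 when the chosen relation holds, 0 otherwise.
        return 1.0 if self.goal_reached(frame, goal) else 0.0
```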