Singh et al., 2004 (Intrinsically Motivated Reinforcement Learning)

Summary

In the search for more general AI agents, we must eventually abandon the practice of handcrafting reward functions in RL. Right now, state-of-the-art RL agents still require a “good” reward function to learn complex behaviors and, by extension, an approximation of the optimal policy. For example, consider the popular pendulum task, where the goal of the agent is to swing up and balance the pendulum. If the reward function is simply +1 for having balanced the pendulum by the end of the episode (goal achieved) and -1 otherwise (goal failed), the agent would never learn anything. With positive rewards arriving so infrequently, the agent has no motivation to explore different swing-up behaviors, since every action it tries looks equally bad. Hence, a more informative reward function is necessary, one that encourages the agent to develop intermediate “sub-behaviors” that allow it to achieve its goal.
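To make this concrete, here is a minimal Python sketch of the two styles of reward described above. The 0.1 rad threshold and the particular dense cost (similar in spirit to the classic Gym pendulum cost) are my own assumptions for illustration, not taken from the paper.

```python
def sparse_reward(theta, done):
    """Sparse reward: +1 only if the episode ends with the pendulum upright,
    -1 if it ends otherwise, 0 on every intermediate step. Nearly every
    trajectory an untrained agent produces ends in -1, so all actions look
    equally bad. (The 0.1 rad threshold is an arbitrary assumption.)"""
    if done:
        return 1.0 if abs(theta) < 0.1 else -1.0
    return 0.0

def shaped_reward(theta, theta_dot, torque):
    """Hand-crafted dense reward: penalize distance from upright, angular
    velocity, and control effort at every step, so partial progress toward
    the swing-up is rewarded."""
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2)
```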

One solution to this problem, as presented by the authors, is to focus on the acquisition of skills, or options, which the agent can compose to carry out hierarchical planning tasks. The authors argue that agents should have a sophisticated internal motivational system that does not have to be redesigned for different problems. Options are closed-loop “mini”-policies for taking action over a period of time. The authors present a learning framework based on semi-Markov decision processes, which are used to add temporal abstraction to RL. This framework uses the saliency of certain events that occur in the environment to generate intrinsic reward signals that stimulate curiosity in the agent. Eventually, the agent “loses interest” in an event as it loses its novelty (a.k.a. boredom), but it retains knowledge of the interaction. Extrinsic reward signals are also present and are generated by accomplishing goals. The authors demonstrate their framework in a small grid-world experiment in which the agent can interact with a few objects; some of the options it can learn involve turning a light switch on and off, kicking a ball, or making a bell ring.
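The paper ties the intrinsic reward for a salient event to how surprising that event still is to the agent, so the signal shrinks as the agent’s model of the event improves. As a crude stand-in for that “novelty decays into boredom” idea, a count-based bonus could look like the sketch below; the 1/count form and the `scale` parameter are my simplifications, not the paper’s actual mechanism (which uses the prediction error of the learned option model for the event).

```python
from collections import defaultdict

class NoveltyReward:
    """Crude stand-in for the paper's intrinsic reward: interest in a salient
    event decays as it becomes familiar (here, simply as it is observed more
    often). The paper itself drives this with option-model prediction error,
    not a raw count."""

    def __init__(self, scale=1.0):
        self.counts = defaultdict(int)
        self.scale = scale  # assumed bonus magnitude

    def __call__(self, salient_event):
        self.counts[salient_event] += 1
        # Reward shrinks toward zero as the event loses novelty ("boredom").
        return self.scale / self.counts[salient_event]

# Usage: the total reward is the extrinsic reward plus the intrinsic bonus.
intrinsic = NoveltyReward()
r_total = 0.0 + intrinsic("light_switch_on")  # large the first time, small later
```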

Notes

  • The policies of many options are updated simultaneously during an agent’s interaction with the environment. If an option could have produced the current action in the current state, its policy can be updated as well.
  • In general, options have to be provided by the system designer: the states from which an option can be initiated, the option’s termination condition, and the reward function that evaluates the option’s performance are all required. It is desirable to automate the discovery of new options. (A rough sketch of this structure, and of the simultaneous updates from the previous point, appears after this list.)
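The sketch below shows, in Python, what the designer currently has to hand-specify for each option, plus the simultaneous intra-option-style updates mentioned in the first bullet. The tabular representation, the hypothetical `q_tables` dictionary, and the learning-rate/discount values are illustrative assumptions, not the paper’s exact formulation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

@dataclass
class Option:
    """Everything the system designer currently has to provide by hand."""
    initiation_set: Set[int]            # states from which the option may be started
    policy: Dict[int, int]              # the option's closed-loop "mini"-policy: state -> action
    terminates: Callable[[int], bool]   # termination condition: state -> done?

def intra_option_updates(options: Dict[str, Option],
                         q_tables: Dict[str, Dict[Tuple[int, int], float]],
                         s: int, a: int, r: float, s_next: int,
                         alpha: float = 0.1, gamma: float = 0.99) -> None:
    """Update every option that could have produced action `a` in state `s`
    from this single environment transition."""
    for name, opt in options.items():
        if s in opt.initiation_set and opt.policy.get(s) == a:
            q = q_tables[name]
            if opt.terminates(s_next):
                target = r                      # option ends here; no bootstrap
            else:
                a_next = opt.policy.get(s_next, a)
                target = r + gamma * q.get((s_next, a_next), 0.0)
            q_sa = q.get((s, a), 0.0)
            q[(s, a)] = q_sa + alpha * (target - q_sa)
```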

This paper is a good starting point for looking into the area of RL that deals with the reward-function problem. The concept of a motivational system that is shared across tasks is one that has not been explored much and is not prevalent today.

Going back to my example of the pendulum task: an RL agent would require an internal motivational system that understood physics in order to explore the environment effectively and receive a reward from the salient event of balancing the pendulum. If the agent could understand from prior experience what balancing means (i.e., it uses a learned model of physics to form some understanding of the physical properties of an object in a balanced state), then it could motivate itself to select sequences of actions that bring the pendulum closer to a balanced state. The system designer could then use a much simpler and less informative reward signal with almost no hand-crafting.
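One concrete way to turn that learned notion of “closeness to balanced” into self-motivation is potential-based reward shaping (Ng et al., 1999). A sketch follows, assuming a hypothetical `learned_model.predict_balance(state)` interface that returns a score in [0, 1]; the interface and discount value are illustrative, not from the paper. Because the shaping term is potential-based, it leaves the policy that is optimal for the simple +1/-1 extrinsic reward unchanged.

```python
def balance_score(state, learned_model):
    """Hypothetical score in [0, 1] from a learned physics model: how close
    the object is to a balanced configuration. `predict_balance` is an
    assumed interface, not a real library call."""
    return learned_model.predict_balance(state)

def shaped_total_reward(r_extrinsic, state, next_state, learned_model, gamma=0.99):
    """Potential-based shaping: the agent rewards itself for the change in its
    own notion of 'closeness to balanced', so the designer's extrinsic reward
    can stay as simple as +1/-1 at the end of the episode."""
    phi_s = balance_score(state, learned_model)
    phi_next = balance_score(next_state, learned_model)
    return r_extrinsic + gamma * phi_next - phi_s
```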

An interesting experiment to test this would be to train an RL agent on a number of different tasks that involve balancing, and then to use transfer learning (sharing network weights?) to have the agent solve the pendulum task.
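A hedged sketch of that experiment in PyTorch: train a policy with a shared feature trunk on other balancing tasks, then copy (and optionally freeze) the trunk for the pendulum task. The network sizes, observation dimensions, and the freeze-vs-finetune choice are all assumptions for illustration.

```python
import torch.nn as nn

class PolicyNet(nn.Module):
    """Simple actor network: a shared feature trunk plus a task-specific head."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                   nn.Linear(128, 128), nn.ReLU())
        self.head = nn.Linear(128, act_dim)

    def forward(self, obs):
        return self.head(self.trunk(obs))

# Hypothetical experiment: copy the trunk weights learned on other balancing
# tasks into a fresh network for the pendulum, and train only the new head
# (or fine-tune everything at a lower learning rate).
source = PolicyNet(obs_dim=8, act_dim=2)   # assumed: pretrained on other balancing tasks
target = PolicyNet(obs_dim=8, act_dim=1)   # new pendulum task (different action dim)
target.trunk.load_state_dict(source.trunk.state_dict())
for p in target.trunk.parameters():
    p.requires_grad = False                # freeze the transferred "balancing" features
```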

Or, a form of (differentiable) memory could be used to store an approximately optimal set of action sequences related to balancing objects across a number of tasks. The idea is to capture the learned “skills” and transfer them to new tasks; the agent could use these experiences to accelerate learning on a novel balancing task. If the memory is associative, it could also store multiple physical skills beyond just balancing. From a practical standpoint, the prior experience would need to influence the Q-values of a given state and action through some sort of “bonus”.
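A sketch of what such a memory-driven bonus might look like: an associative store keyed on state embeddings from earlier balancing tasks, whose similarity-weighted value is added to the Q-value at action-selection time. The embedding, the exponential similarity kernel, and the `beta` scale are all hypothetical choices, not something proposed in the paper.

```python
import numpy as np

class SkillMemory:
    """Hypothetical associative memory of prior 'balancing' experience: stores
    (state-embedding, action, outcome-value) tuples and returns a bonus for
    (s, a) pairs that resemble previously successful ones."""

    def __init__(self, beta=0.5):
        self.keys = []      # state embeddings from earlier balancing tasks
        self.actions = []
        self.values = []    # how well that action worked (e.g., discounted return)
        self.beta = beta    # assumed bonus scale

    def store(self, embedding, action, value):
        self.keys.append(np.asarray(embedding))
        self.actions.append(action)
        self.values.append(value)

    def bonus(self, embedding, action):
        """Similarity-weighted bonus: large when (s, a) is close to a remembered
        successful balancing experience, near zero otherwise."""
        emb = np.asarray(embedding)
        matches = [(np.exp(-np.linalg.norm(emb - k)), v)
                   for k, a, v in zip(self.keys, self.actions, self.values)
                   if a == action]
        if not matches:
            return 0.0
        weights, values = zip(*matches)
        return self.beta * float(np.average(values, weights=weights))

# The bonus would then be added to the learned value at action-selection time:
# q_adjusted = q_estimate(s, a) + memory.bonus(embed(s), a)
```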