Mehdi Hellou

Robotics/AI student

Self-improving agent based on reinforcement learning

During the second year of my Master's in AI at Sorbonne University, I worked on a school project in which we had to read a scientific article on AI methods used in robotics, reproduce those methods, and compare our results with the paper's. The paper focuses on the use of Reinforcement Learning methods, namely the adaptive heuristic critic (AHC) learning architecture and Q-learning, in a complex simulated environment. The environment contained an agent, enemies, food and obstacles. The agent's goal was to survive in this environment by using its sensors to avoid the enemies and obstacles while collecting food.
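To make the learning rule concrete, here is a minimal sketch of the one-step Q-learning update that the project builds on. It is written in tabular form for clarity (our agent approximates Q with a neural network, as described below), and the learning rate and discount factor are illustrative values, not taken from the paper.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (illustrative value)
GAMMA = 0.9  # discount factor (illustrative value)

# Q[(state, action)] -> estimated discounted return
Q = defaultdict(float)

def q_learning_update(state, action, reward, next_state, actions):
    """Move Q(s, a) toward the target r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```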

During this project, we had to build the environment and its protagonists (the agent and the enemies) as well as the agent's sensors, using coarse coding and the Python GUI toolkit Tkinter. We also adjusted the characters' movements to be compatible with the Tkinter GUI. We then set up the learning part by implementing the Q-learning method with a neural network, which allowed the agent to choose its actions autonomously. The network had three layers: an input layer (145 neurons) whose values were given by the agent's sensors, a hidden layer (30 neurons), and an output layer (1 neuron) corresponding to the Q-value. Once the implementation was done, we compared our results, illustrated by the learning curves, with those from the paper.
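The sketch below shows how such a network can drive action selection: one 145-30-1 network per action (the per-action arrangement follows Lin's architecture), a forward pass producing a single Q-value, and a greedy choice over the candidate actions. The activation functions and the action set are illustrative assumptions, not details from our report.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_net():
    """A 145-30-1 network: 145 sensor inputs, 30 hidden units, 1 Q-value."""
    return {
        "W1": rng.normal(0.0, 0.1, (30, 145)), "b1": np.zeros(30),
        "W2": rng.normal(0.0, 0.1, (1, 30)),   "b2": np.zeros(1),
    }

def q_value(net, sensors):
    """Forward pass: sigmoid hidden layer, linear output (assumed activations)."""
    h = 1.0 / (1.0 + np.exp(-(net["W1"] @ sensors + net["b1"])))
    return (net["W2"] @ h + net["b2"]).item()

ACTIONS = ["forward", "turn_left", "turn_right", "backward"]  # hypothetical action set
nets = {a: init_net() for a in ACTIONS}  # one network per action

def choose_action(sensors):
    """Greedy policy: evaluate every action's network and keep the best."""
    return max(ACTIONS, key=lambda a: q_value(nets[a], sensors))
```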

Figure - Learning curves showing the number of resources eaten by the agent over 300 plays, with and without experience replay (blue and green curves, respectively).

The learning curves above depict the agent's performance over 300 training plays. As expected, the agent gradually learns to collect resources and to avoid being caught by the enemies. A short demo is provided below to illustrate the agent's performance and to show how we built the simulation.
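Since the figure contrasts training with and without experience replay, here is a minimal sketch of a replay buffer, assuming a simplified transition-level variant (Lin's original replays whole sequences of experiences); the class name and sizes are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of past (s, a, r, s') transitions. Replaying
    them lets the agent learn from each experience several times, which
    is what speeds up the with-replay curve in the figure above."""

    def __init__(self, capacity=10_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions drop out first

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        """Draw a random batch to replay through the usual Q-learning update."""
        return random.sample(list(self.memory), min(batch_size, len(self.memory)))
```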

  1. Long-Ji Lin. 1992. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8, 3–4 (May 1992), 293–321.
  2. Richard S. Sutton. 1984. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst. AAI8410337.
  3. Richard S. Sutton and Andrew G. Barto. 1990. Time-Derivative Models of Pavlovian Reinforcement.
  4. Christopher J. C. H. Watkins. 1989. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge.
  5. David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, Cambridge, MA, 318–362.