Year of defence: 2023

Manuscript available here

Abstract

Despite numerous improvements to the effectiveness of reinforcement learning (RL) methods in robotics, training from scratch still requires millions (or even tens of millions) of interactions with the environment to converge on high-performance behavior. One promising avenue for reducing this data requirement without sacrificing performance is transfer learning (TL). This thesis explores transfer learning in the context of RL, with the specific goal of transferring behaviors from one robot to another, even in the presence of morphological differences or different state-action spaces. In particular, this thesis presents a process for reusing past knowledge acquired by a robot (source) on a task to accelerate (or even avoid) the learning process of a different robot (target) on the same task. The proposed method relies first on an unsupervised pre-training phase that learns a robot-agnostic latent space from trajectories collected on a set of robots. A model can then be trained within this space to solve a given task, producing a task module that can be reused by any robot sharing this common feature space. In addition, this thesis tackles the problem of simulation-to-real-world adaptation when transferring a model trained in a simulator, with a focus on delay management, which is often overlooked in the current literature. Indeed, we show that models oblivious to delay drop significantly in performance when tested on a physical robot, where the hardware and sensory system inevitably introduce delay. The approach we develop is a simple but effective way to train agents to handle a user-defined range of delays. Through several robotic tasks and heterogeneous hardware platforms, both in simulation and on physical robots, this thesis demonstrates the benefits of these approaches in terms of improved learning efficiency and performance. More specifically, we report zero-shot generalization in some instances, where performance after transfer is preserved. In the worst case, performance is recovered after a short adaptation on the target robot, at a fraction of the training cost required to learn a policy with similar performance from scratch.
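To make the delay-management idea concrete, the sketch below shows one common way to expose an agent to a user-defined range of observation delays during training: an environment wrapper that resamples the delay at every episode and serves the agent an observation from several steps in the past. This is a minimal, hypothetical illustration of the general technique (the wrapper name, parameters, and the Pendulum-v1 example are assumptions, not the thesis implementation).

```python
import random
from collections import deque

import gymnasium as gym


class RandomDelayWrapper(gym.Wrapper):
    """Hypothetical sketch: delay observations by a random number of steps,
    resampled each episode from a user-defined range [min_delay, max_delay]."""

    def __init__(self, env, min_delay=0, max_delay=3):
        super().__init__(env)
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._buffer = deque()

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        # Sample this episode's delay from the user-defined range.
        delay = random.randint(self.min_delay, self.max_delay)
        # Pre-fill the buffer so the agent sees the initial observation
        # for the first `delay` steps.
        self._buffer = deque([obs] * (delay + 1), maxlen=delay + 1)
        return self._buffer[0], info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._buffer.append(obs)      # newest observation enters the queue...
        delayed_obs = self._buffer[0]  # ...the agent receives the oldest one
        return delayed_obs, reward, terminated, truncated, info


# Usage: any standard RL algorithm trained on the wrapped environment is
# forced to cope with delays anywhere in the chosen range.
env = RandomDelayWrapper(gym.make("Pendulum-v1"), min_delay=0, max_delay=3)
```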

Publications