Multi-view Robotic Manipulation Policy with Foundation Models

  • Type: Master's Thesis
  • Date: Available immediately
  • Supervision: Yitian Shi

Problem formulation

Developing robotic agents capable of executing diverse manipulation tasks from visual observations in unstructured real-world environments is a challenging problem. Beyond recognizing objects, the robot must understand their affordances, i.e., the actions it can perform with them, such as selecting grasping points, using objects as tools, or interacting with other objects. A further goal of introducing learning into robotic manipulation is to enable the robot to handle unseen objects effectively and to solve new tasks in novel environments.

Task definition

In this thesis, the primary objective is to comprehensively survey and evaluate state-of-the-art robotic manipulation policies that leverage multi-view visual inputs and foundation models to perform complex manipulation tasks in unstructured environments. We will integrate and evaluate pre-trained foundation models (e.g., DINO, CLIP) with respect to their ability to enhance the robot's understanding of the scene. Finally, experiments on a real robot are required to test policy generalization.
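As a rough illustration of the multi-view setup described above, the sketch below fuses per-view features into a single scene representation. The `encode_view` stub stands in for a frozen foundation-model backbone such as DINO or CLIP; all function names, shapes, and the fusion strategy here are illustrative assumptions, not part of any specific codebase:

```python
import numpy as np

def encode_view(image: np.ndarray, dim: int = 384) -> np.ndarray:
    """Stand-in for a frozen vision backbone (e.g., DINO or CLIP).

    A real pipeline would run the image through a pre-trained encoder;
    here a fixed random projection of the flattened pixels keeps the
    sketch self-contained and dependency-free.
    """
    rng = np.random.default_rng(0)  # fixed seed -> deterministic "weights"
    proj = rng.standard_normal((dim, image.size))
    return proj @ image.ravel()

def fuse_views(views: list) -> np.ndarray:
    """Fuse per-view features into one scene feature by mean pooling."""
    feats = np.stack([encode_view(v) for v in views])  # (n_views, dim)
    return feats.mean(axis=0)                          # (dim,)

# Three dummy RGB views of the same scene.
views = [np.zeros((64, 64, 3), dtype=np.float32) for _ in range(3)]
scene_feature = fuse_views(views)
print(scene_feature.shape)  # (384,)
```

In practice, mean pooling could be replaced by attention-based fusion or by lifting the per-view features into a shared 3D representation, as done in the referenced works below.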

You offer

• Solid knowledge and experience in computer vision and deep learning.

• Coding skills in Python and Linux.

• Experience in simulation is a plus.

We will offer

• A powerful robot for experiments.

• A powerful GPU server for training your models.

References

• Ze, Yanjie, et al. "GNFactor: Multi-task real robot learning with generalizable neural feature fields." Conference on Robot Learning. PMLR, 2023.

• Ke, Tsung-Wei, Nikolaos Gkanatsios, and Katerina Fragkiadaki. "3D Diffuser Actor: Policy diffusion with 3D scene representations." arXiv preprint arXiv:2402.10885 (2024).