Navigation and Manipulation with Vision-Language Models
- Sponsored by: TUM Computer Vision and Artificial Intelligence & Visual Geometry Group (University of Oxford)
- Project lead: Dr. Ricardo Acevedo Cabra
- Scientific leads: Dr. Yan Xia and Junyu Xie
- TUM co-mentor: TBA
- Term: Summer semester 2025
- Application deadline: Sunday 19.01.2025
Apply to this project here

Introduction
With the rapid advancement of Large Language Models (LLMs) [12, 2, 1, 11] and Vision-Language Models (VLMs) [8, 7, 3, 5], there has been a significant surge in research utilizing these models for navigation and manipulation over the past two years. These models are revolutionizing how robotic systems interpret visual information and follow natural language instructions. By combining the strengths of vision-based perception and language understanding, VLMs have the potential to reshape the way robots interact with their environments and communicate with humans. As human-robot collaboration continues to grow, the demand for systems that can seamlessly operate in complex environments using natural language guidance becomes increasingly vital.
Objectives
- To review and analyze existing methods that integrate vision and language models for navigation and manipulation in simulated and real-world environments.
- To develop navigation and manipulation frameworks that allow agents to understand and execute natural language instructions while perceiving and interacting with the surrounding environment (a minimal loop sketch is given after this list).
- To evaluate the efficiency and accuracy of the proposed frameworks using benchmark datasets.
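The second objective essentially amounts to a perception-language-action loop. The sketch below illustrates one possible structure in Python; `Observation`, `VLMPolicy`, and the environment interface (`reset`/`step`/`evaluate`) are hypothetical placeholders standing in for a simulator backend and a vision-language model wrapper, not an existing API.

```python
# Minimal sketch of a language-conditioned agent loop (hypothetical interfaces).
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    rgb: np.ndarray    # H x W x 3 camera image
    depth: np.ndarray  # H x W depth map


class VLMPolicy:
    """Hypothetical wrapper that maps (instruction, observation) to a discrete action."""

    def act(self, instruction: str, obs: Observation) -> str:
        # e.g. prompt a VLM with the current image and the instruction,
        # then parse its textual answer into an action token
        raise NotImplementedError


def run_episode(env, policy: VLMPolicy, instruction: str, max_steps: int = 500):
    """Roll out one language-conditioned episode in a simulated environment."""
    obs = env.reset()
    for _ in range(max_steps):
        # e.g. "move_forward", "turn_left", "grasp", "stop", depending on the task
        action = policy.act(instruction, obs)
        if action == "stop":
            break
        obs = env.step(action)
    # e.g. success rate / SPL for navigation, grasp success for manipulation
    return env.evaluate()
```

A concrete framework would replace these placeholders with a simulator backend (see the datasets below) and a prompting or fine-tuning scheme for the chosen VLM.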
Data Availability
Vision-language models for manipulation.
- Grasp-Anything-6D [9] is a large-scale dataset for language-driven 6-DoF grasp detection in 3D point clouds.
- GraspNet-1Billion [6] is a large-scale training dataset and a standard evaluation platform for the task of general robotic grasping. The dataset contains 97,280 RGB-D images with over one billion grasp poses.
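As a starting point for the manipulation track, GraspNet-1Billion ships with an official Python toolkit (graspnetAPI). The sketch below assumes that package and a placeholder dataset path; it loads one RGB-D frame and its 6-DoF grasp labels, though exact method signatures may differ between toolkit versions.

```python
# Hedged sketch: inspecting GraspNet-1Billion with the graspnetAPI package.
# Method names follow the package's documented examples but may differ
# between versions; the dataset root path is a placeholder.
from graspnetAPI import GraspNet

graspnet_root = "/path/to/graspnet"  # placeholder: local copy of the dataset

g = GraspNet(graspnet_root, camera="kinect", split="train")

# Load the RGB-D pair and the 6-DoF grasp labels for one scene/annotation.
rgb = g.loadRGB(sceneId=0, camera="kinect", annId=0)
depth = g.loadDepth(sceneId=0, camera="kinect", annId=0)
grasps = g.loadGrasp(sceneId=0, annId=0, format="6d",
                     camera="kinect", fric_coef_thresh=0.2)

print(rgb.shape, depth.shape, len(grasps))
```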
Vision-language models for navigation.
- Matterport3D dataset [4] is a large-scale RGB-D collection that features 10,800 panoramic views derived from 194,400 RGB-D images across 90 building-scale scenes.
- Habitat Matterport 3D [10] is a large-scale dataset of 3D indoor spaces. It consists of 1,000 high-resolution 3D scans (or digital twins) of building-scale residential, commercial, and civic spaces generated from real-world environments.
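For the navigation track, HM3D scenes can be loaded in the Habitat simulator. The sketch below assumes the habitat-sim package and a placeholder scene path; class and attribute names follow recent habitat-sim releases and may differ between versions.

```python
# Hedged sketch: stepping an agent through an HM3D scene with habitat-sim.
# Names follow recent habitat-sim releases and may differ between versions;
# the scene path is a placeholder.
import habitat_sim

backend_cfg = habitat_sim.SimulatorConfiguration()
backend_cfg.scene_id = "/path/to/hm3d/scene.glb"  # placeholder HM3D scene

# One RGB camera mounted on the agent.
rgb_spec = habitat_sim.CameraSensorSpec()
rgb_spec.uuid = "rgb"
rgb_spec.sensor_type = habitat_sim.SensorType.COLOR
rgb_spec.resolution = [480, 640]

agent_cfg = habitat_sim.agent.AgentConfiguration(sensor_specifications=[rgb_spec])
sim = habitat_sim.Simulator(habitat_sim.Configuration(backend_cfg, [agent_cfg]))

# Default discrete action space: move_forward, turn_left, turn_right.
obs = sim.step("move_forward")
print(obs["rgb"].shape)
sim.close()
```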
Requirements for the students
The students should fulfill the following requirements:
- Good background in mathematics and excellent grades;
- Self-motivation and a strong interest in publishing at top venues such as CVPR and NeurIPS;
- Practical Python or C++ programming skills;
- Familiarity with the PyTorch or TensorFlow deep learning framework;
- Attendance of at least one related computer vision course or seminar.
References
[1] Qwen Team. Qwen2 technical report. 2024.
[2] AI@Meta. Llama 3 model card. 2024.
[3] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
[4] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV), 2017.
[5] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
[6] Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11444–11453, 2020.
[7] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024.
[8] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
[9] Toan Nguyen, Minh Nhat Vu, Baoru Huang, An Vuong, Quan Vuong, Ngan Le, Thieu Vo, and Anh Nguyen. Language-driven 6-dof grasp detection using negative prompt guidance. arXiv preprint arXiv:2407.13842, 2024.
[10] Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021.
[11] Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
[12] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.