Real-Time Open-Vocabulary Object Detection
The results of this project will be uploaded here as a final report by mid-May 2025.
- Sponsored by: PreciTaste
- Project lead: Dr. Ricardo Acevedo Cabra
- Scientific lead: TUM-DI-LAB Alumni M.Sc. Sebastian Freytag and M.Sc. Mathias Sundholm
- TUM co-mentor: Dr. Alessandro Scagliotti
- Term: Winter semester 2024
- Application deadline: Sunday 21.07.2024

Motivation
PreciTaste, a computer vision tech company with offices in Munich, New York, and Mumbai, specializes in developing AI solutions tailored for the gastronomy and baking industry. One of its main products is vision-based inventory tracking, which enables restaurants and retail clients to monitor changes in their food inventory in real time. This solution facilitates dynamic scheduling and optimization of stock-ups, ensuring a steady supply of fresh food while reducing food waste.
A typical inventory might contain hundreds or even thousands of unique items. This poses a major challenge for standard object detection approaches with fixed classes: large, high-quality datasets must be maintained to achieve high accuracy. Moreover, closed-set object detectors run into difficulties when inventory items are added, removed, or undergo packaging changes, which requires labor-intensive relabeling and retraining.
Goal
To address these challenges, the project aims to design a model for open-vocabulary object detection, capable of identifying any object, even if its class was not seen during training. The goal is to develop a multimodal model that can query objects using a text description, an example image, or even a combination of both. One of our previous datalabs explored this concept and achieved promising results using a two-stage approach. However, this two-stage approach proved too slow for real-time inventory tracking. Therefore, the main goal will be to develop a model that runs in real time while maintaining comparable accuracy.
As part of this project, you will:
1. Explore the capabilities of the latest open-vocabulary object detection models.
2. Develop a method to query objects in real time using text descriptions or example images.
3. Evaluate the accuracy of your model using publicly available datasets and compare it to existing solutions.
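To make the querying idea concrete: open-vocabulary detectors such as those in the references typically replace a fixed classification head with similarity matching between region embeddings and query embeddings, where the queries can come from a text encoder, an image encoder, or both. The sketch below is a minimal, illustrative version of this matching step using toy embeddings; the function name, dimensions, and labels are made up for illustration and do not come from any specific model.

```python
import numpy as np

def match_queries(region_embeds, query_embeds, labels, threshold=0.5):
    """Assign each detected region its best-matching open-vocabulary query.

    region_embeds: (R, D) array, toy stand-ins for a detector's per-box features.
    query_embeds:  (Q, D) array of query embeddings (from text and/or example images).
    labels:        list of Q query names.
    Returns one (label, score) pair per region; label is None when no query
    passes the threshold (e.g. an empty tray slot).
    """
    # Cosine similarity: L2-normalize both sets, then take dot products.
    r = region_embeds / np.linalg.norm(region_embeds, axis=1, keepdims=True)
    q = query_embeds / np.linalg.norm(query_embeds, axis=1, keepdims=True)
    sims = r @ q.T  # (R, Q) similarity matrix
    best = sims.argmax(axis=1)
    results = []
    for i, j in enumerate(best):
        score = float(sims[i, j])
        results.append((labels[j], score) if score >= threshold else (None, score))
    return results

# Toy example: two regions, three queries. The "classes" are just embeddings,
# so new items can be added at query time without retraining.
regions = np.array([[1.0, 0.0], [0.0, 1.0]])
queries = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]])
labels = ["croissant", "sourdough loaf", "empty tray"]
print(match_queries(regions, queries, labels))
```

Because classification reduces to a similarity lookup, the query set can be edited at inference time, which is exactly what makes the approach attractive for inventories whose items change frequently.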
Requirements
1. Good understanding of deep learning theory
2. Proficient Python programming skills
3. Experience with deep learning frameworks such as PyTorch or TensorFlow
Apply to this project here
References
[1] Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan: YOLO-World: Real-Time Open-Vocabulary Object Detection https://arxiv.org/abs/2401.17270
[2] Dahun Kim, Anelia Angelova, Weicheng Kuo: Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers https://arxiv.org/abs/2305.07011
[3] Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby: Simple Open-Vocabulary Object Detection with Vision Transformers https://arxiv.org/abs/2205.06230