Visual Question Answering using Large Language Models

Results of this project are shown in the final report (PDF) below:

About the project:

PreciTaste is a computer vision tech company based in Munich, New York and Mumbai developing AI solutions for the gastronomy and baking industry. The recent advancement in large languages models (LLM) opens up new exciting possibilities to interact with software and data using natural language rather than predefined input queries. As vision models can infer information about each individual image, language models have capabilities to filter out and summarize information across whole datasets.
The combination of vision and language could allow for a natural way for even non-technical users to study and interact with image datasets by simply asking questions. A vision-language model would utilize the visual information inferred from images to provide the answer to the users questions about the visual content of the images and the dataset as a whole.

The goal of this project is to explore and study the capabilities of recent open source large language models and their abilities to understand and interact with image datasets when combined with vision models.

As part of this project you will:

  • Study capabilities of recent computer vision and large language models.
  • Develop a method to connect vision models to language models to create a chatbot that can answer questions about contents of an image dataset.
  • Evaluate the accuracy of your combined vision-language model on various visual question answering tasks using publicly available datasets.

To apply you should have:

  • Good understanding of deep learning theory
  • Proficient python programming skills
  • Experience with deep learning frameworks such as Pytorch or Tensorflow 

We are looking forward to your application :)

Datasets:
Pascal: https://pjreddie.com/projects/pascal-voc-dataset-mirror/
LVIS: https://www.lvisdataset.org/
VQA: https://visualqa.org/


Related papers:
VQA: Visual Question Answering
LLaMA: Open and Efficient Foundation Language Models
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models