Patient representation learning in biobank-scale datasets with LLM-guided hierarchical feature encoding
- Sponsored by: TUM Chair of Artificial Intelligence in Healthcare and Medicine & Helmholtz AI for Health (Helmholtz Zentrum München)
- Project lead: Dr. Ricardo Acevedo Cabra
- Scientific lead: Dr. Francesco Paolo Casale, Dr. Diyuan Lu
- TUM co-mentor: TBA
- Term: Summer semester 2026
- Application deadline: Sunday 25.01.2026
Apply to this project here

Motivation
Large-scale biobank and clinical datasets such as the UK Biobank comprise hundreds of heterogeneous features capturing lifestyle, clinical, and biological characteristics. However, most existing machine learning approaches either treat these variables as flat tabular inputs, overlooking their inherent semantic and biochemical structure, or rely on extensive feature selection procedures that risk introducing bias. As a result, both the interpretability and generalizability of learned patient representations remain limited.
Goal
The goal of this project is to develop a patient representation learning framework that organizes input features into meaningful semantic and biochemical groups using LLM-guided feature clustering, followed by self-supervised pretraining of cluster-specific encoders. The resulting cluster representations will then be fused into a unified patient embedding for downstream tasks such as disease prediction and survival analysis.
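As a rough illustration of the LLM-guided grouping step, the sketch below embeds short textual feature descriptions with a pretrained sentence-embedding model and clusters them into semantic groups. The specific embedding model, the example feature descriptions, and the use of k-means are illustrative assumptions rather than project decisions.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Illustrative feature descriptions (not actual UK Biobank field names).
feature_descriptions = [
    "systolic blood pressure",
    "diastolic blood pressure",
    "LDL cholesterol",
    "HDL cholesterol",
    "cigarettes smoked per day",
    "weekly alcohol intake",
]

# Embed each feature's textual description with a pretrained language model.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(feature_descriptions)

# Group semantically related features; the number of groups is a design choice.
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

for description, cluster_id in zip(feature_descriptions, cluster_ids):
    print(f"group {cluster_id}: {description}")
```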
Key Milestones
- Semantic and statistical grouping of biobank features using LLM-based embeddings.
- Self-supervised pretraining of cluster-specific encoders using objectives such as masked reconstruction.
- Hierarchical fusion of cluster embeddings using a lightweight Transformer model.
- Evaluation of learned patient representations for information retention, interpretability, and clinical relevance.
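To make the second and third milestones more concrete, the following minimal PyTorch sketch pairs cluster-specific encoders trained with a masked-reconstruction objective with a lightweight Transformer that fuses the per-cluster embeddings into a single patient representation. All dimensions, the masking rate, and the mean pooling are toy assumptions, not a prescribed design.

```python
import torch
import torch.nn as nn

class ClusterEncoder(nn.Module):
    """Encodes one feature cluster; trained with a masked-reconstruction loss."""
    def __init__(self, n_features: int, d_model: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.decoder = nn.Linear(d_model, n_features)

    def forward(self, x: torch.Tensor, mask: torch.Tensor):
        z = self.encoder(x * (1 - mask))           # hide the masked features
        recon = self.decoder(z)
        # Reconstruction error is measured only on the masked positions.
        loss = ((recon - x) ** 2 * mask).sum() / mask.sum().clamp(min=1)
        return z, loss

class PatientFusion(nn.Module):
    """Fuses per-cluster embeddings into one patient embedding."""
    def __init__(self, d_model: int = 64, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, cluster_embeddings: torch.Tensor) -> torch.Tensor:
        # cluster_embeddings: (batch, n_clusters, d_model)
        fused = self.transformer(cluster_embeddings)
        return fused.mean(dim=1)                   # unified patient embedding

# Toy usage: a batch of 8 patients, two feature clusters with 5 and 3 features each.
clusters = [torch.randn(8, 5), torch.randn(8, 3)]
encoders = [ClusterEncoder(5), ClusterEncoder(3)]
masks = [torch.bernoulli(torch.full_like(x, 0.15)) for x in clusters]
embeddings, losses = zip(*[enc(x, m) for enc, x, m in zip(encoders, clusters, masks)])
patient_embedding = PatientFusion()(torch.stack(embeddings, dim=1))
print(patient_embedding.shape)  # torch.Size([8, 64])
```

In the actual project, the cluster encoders would first be pretrained on biobank data with the reconstruction objective, with the fusion model and downstream prediction heads trained afterwards or jointly.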
Student Requirements
Students should have a solid background in machine learning and deep learning with Python, with hands-on experience in PyTorch or TensorFlow. Familiarity with tabular data analysis, biomedical data, representation learning, or self-supervised learning is highly desirable. You can find the project figure and Helmholtz Logo in the attachment. Hereby, we explicitly confirm the copyright clearance and written consent for use on the TUM-DI-LAB webpage.