Patient representation learning in biobank-scale datasets with LLM-guided hierarchical feature encoding
- Sponsored by: TUM Chair of Artificial Intelligence in Healthcare and Medicine & Helmholtz AI for Health (Helmholtz Zentrum München)
- Project lead: Dr. Ricardo Acevedo Cabra
- Scientific lead: Dr. Francesco Paolo Casale, Dr. Diyuan Lu
- TUM co-mentor: TBA
- Term: Summer semester 2026
- Application deadline: Sunday 25.01.2026
Apply to this project here

Motivation
Large-scale biobank and clinical datasets such as the UK Biobank comprise hundreds of heterogeneous features capturing lifestyle, clinical, and biological characteristics. However, most existing machine learning approaches either treat these variables as flat tabular inputs, overlooking their inherent semantic and biochemical structure, or rely on extensive feature selection procedures that risk introducing bias. As a result, both the interpretability and generalizability of learned patient representations remain limited.
Goal
The goal of this project is to develop a patient representation learning framework that organizes input features into meaningful semantic and biochemical groups using LLM-guided feature clustering, followed by self-supervised pretraining of cluster-specific encoders. The resulting cluster representations will then be fused into a unified patient embedding for downstream tasks such as disease prediction and survival analysis.
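As a rough illustration of the LLM-guided grouping step, the sketch below embeds short textual feature descriptions with a pretrained sentence-embedding model and clusters them into semantic groups. The specific embedding model, the example feature descriptions, and the use of k-means are illustrative assumptions rather than project decisions.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Illustrative feature descriptions (not actual UK Biobank field names).
feature_descriptions = [
    "systolic blood pressure",
    "diastolic blood pressure",
    "LDL cholesterol",
    "HDL cholesterol",
    "cigarettes smoked per day",
    "weekly alcohol intake",
]

# Embed each feature's textual description with a pretrained language model.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(feature_descriptions)

# Group semantically related features; the number of groups is a design choice.
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

for description, cluster_id in zip(feature_descriptions, cluster_ids):
    print(f"group {cluster_id}: {description}")
```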
Key Milestones
- Semantic and statistical grouping of biobank features using LLM-based embeddings.
- Self-supervised pretraining of cluster-specific encoders using objectives such as masked reconstruction.
- Hierarchical fusion of cluster embeddings using a lightweight Transformer model.
- Evaluation of learned patient representations for information retention, interpretability, and clinical relevance.
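To make the second and third milestones more concrete, the following minimal PyTorch sketch pairs cluster-specific encoders trained with a masked-reconstruction objective with a lightweight Transformer that fuses the per-cluster embeddings into a single patient representation. All dimensions, the masking rate, and the mean pooling are toy assumptions, not a prescribed design.

```python
import torch
import torch.nn as nn

class ClusterEncoder(nn.Module):
    """Encodes one feature cluster; trained with a masked-reconstruction loss."""
    def __init__(self, n_features: int, d_model: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.decoder = nn.Linear(d_model, n_features)

    def forward(self, x: torch.Tensor, mask: torch.Tensor):
        z = self.encoder(x * (1 - mask))           # hide the masked features
        recon = self.decoder(z)
        # Reconstruction error is measured only on the masked positions.
        loss = ((recon - x) ** 2 * mask).sum() / mask.sum().clamp(min=1)
        return z, loss

class PatientFusion(nn.Module):
    """Fuses per-cluster embeddings into one patient embedding."""
    def __init__(self, d_model: int = 64, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, cluster_embeddings: torch.Tensor) -> torch.Tensor:
        # cluster_embeddings: (batch, n_clusters, d_model)
        fused = self.transformer(cluster_embeddings)
        return fused.mean(dim=1)                   # unified patient embedding

# Toy usage: a batch of 8 patients, two feature clusters with 5 and 3 features each.
clusters = [torch.randn(8, 5), torch.randn(8, 3)]
encoders = [ClusterEncoder(5), ClusterEncoder(3)]
masks = [torch.bernoulli(torch.full_like(x, 0.15)) for x in clusters]
embeddings, losses = zip(*[enc(x, m) for enc, x, m in zip(encoders, clusters, masks)])
patient_embedding = PatientFusion()(torch.stack(embeddings, dim=1))
print(patient_embedding.shape)  # torch.Size([8, 64])
```

In the actual project, the cluster encoders would first be pretrained on biobank data with the reconstruction objective, with the fusion model and downstream prediction heads trained afterwards or jointly.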
Student Requirements
Students should have a solid background in machine learning and deep learning with Python, with hands-on experience in PyTorch or TensorFlow. Familiarity with tabular data analysis, biomedical data, representation learning, or self-supervised learning is highly desirable. You can find the project figure and Helmholtz Logo in the attachment. Hereby, we explicitly confirm the copyright clearance and written consent for use on the TUM-DI-LAB webpage.