Creating a single-cell atlas of human blood
Results of this project are explained in the final report.
- Sponsored by: TUM Chair of Mathematical Modeling of Biological Systems (MDSI Prof. Theis) and Institute of Computational Biology (Helmholtz Zentrum München)
- Project Lead: Dr. Ricardo Acevedo Cabra
- Scientific Lead: PhD candidate Karin Hrovatin, Dr. Malte Luecken, PhD candidate Christopher Lance, PhD candidate Lisa Sikkema
- TUM Co-Mentor: Prof. Massimo Fornasier
- Term: Winter semester 2022
Cells are the fundamental building blocks of tissues, organs and entire organisms. Breakthrough technologies for profiling RNA in individual cells (scRNA-seq) provide new insights into the functioning of and interaction between cells, which has led to a leap forward in biomedical science1–3. Despite this progress, current single-cell datasets typically contain many cells but only a few samples, and so fail to represent the diversity of human populations. To comprehensively understand a human biological tissue in health and disease, it is therefore necessary to capture population-level diversity by aggregating data across conditions, individuals and datasets into a single “reference”4. However, the creation of cross-dataset references is complicated by technical artifacts in the data (batch effects) that arise from differences in experimental design and make direct comparison of multiple datasets impossible. To enable cross-dataset analyses, approaches that integrate data and remove batch effects are used to construct so-called “integrated atlases”5,6. Leveraging these computational advances, multiple initiatives that aim at building integrated atlases have been started7-4. However, many organs still lack adequate atlases, including the blood, which plays a vital role in the immune system and for which a wide variety of single-cell datasets have been generated8–10. We have recently made a first attempt to integrate peripheral blood datasets into a single atlas. This initial atlas must be further evaluated and iteratively improved to construct a final reference atlas, as proposed below.
- Evaluate the quality of our initial peripheral blood mononuclear cell (PBMC) atlas, which includes scRNA-seq data of >8 million cells from >1500 healthy and diseased individuals, drawn from 22 PBMC datasets across different biological conditions
- Iteratively improve the PBMC reference atlas by fine-tuning the integration protocol
- Describe the molecular heterogeneity captured within the atlas and test how it can be used to provide context to new data by transfer learning
- Speed up methods used for large scRNA-seq data analysis
- Data preprocessing: Measurements obtained from scRNA-seq are imperfect and need quality assessment from different perspectives (e.g. removal of dying cells) with existing and new metrics. Furthermore, as metadata is collected in a study-specific manner and is not directly comparable, we will align the different terms used across datasets with standard terminologies.
- Evaluation and improvement of integration: There is no single best way to build an integrated atlas, so we will optimize this process through repeated rounds of integration evaluation followed by re-integration. For example, any batch effect removal also removes some biological variation. This trade-off can be fine-tuned through integration strength and improved preprocessing, and assessed by quantifying how well cross-dataset effects of disease (COVID-19) and cell type identity are conserved. Furthermore, it is unclear what should be chosen as the batch variable, i.e. the variable that is expected to cause significant technical artifacts. Therefore, we will examine the batch effect per covariate, such as biological samples, subsets of datasets, or whole datasets. Lastly, atlases can be built from all available data, or from a subset of datasets, such as healthy samples alone, followed by transfer-learning-based mapping of diseased data onto the healthy reference. We will compare the two approaches based on batch effect removal and biological information conservation metrics.
- Using the atlas: Finally, we will explore the atlas to examine the captured biological information in terms of the cell types present and the effects of external factors, such as disease, on cell function. We will contextualize new data by mapping it onto the atlas, for example to transfer cell type and state labels automatically.
- Speed up clustering, visualization, and other processing steps for large tabular data and nearest-neighbors graphs from the atlas by adapting existing methods developed for working with large data or defining and testing new approaches where necessary.
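The integration evaluation described above weighs batch effect removal against conservation of biological signal. As a minimal sketch of the batch-removal half only, the example below scores how well batches are mixed in each cell's neighborhood. Everything here is an illustrative assumption rather than the project's actual metric: the function name `knn_batch_entropy`, the toy 2D coordinates standing in for an integrated embedding, and the choice of `k`.

```python
import math
from collections import Counter


def knn_batch_entropy(points, batches, k=3):
    """Mean normalized entropy of batch labels among each cell's k nearest neighbors.

    Values near 1 mean neighborhoods contain all batches in equal shares;
    0 means every neighborhood comes from a single batch.
    Hypothetical helper for illustration, not a published metric implementation.
    """
    n_batches = len(set(batches))
    if n_batches < 2:
        raise ValueError("need at least two batches to measure mixing")
    max_entropy = math.log(n_batches)
    scores = []
    for i, p in enumerate(points):
        # Brute-force distances to all other cells (fine for a toy example only).
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        counts = Counter(batches[j] for _, j in dists[:k])
        entropy = -sum((c / k) * math.log(c / k) for c in counts.values())
        scores.append(entropy / max_entropy)
    return sum(scores) / len(scores)


batches = ["A", "B"] * 4

# Toy "unintegrated" embedding: the two batches form two distant clusters.
separated_points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10), (1, 1), (11, 11)]
# Toy "integrated" embedding: cells from both batches interleave along a line.
mixed_points = [(float(i), 0.0) for i in range(8)]

separated_score = knn_batch_entropy(separated_points, batches)
mixed_score = knn_batch_entropy(mixed_points, batches)
print(f"separated: {separated_score:.2f}, mixed: {mixed_score:.2f}")
```

A higher score for the interleaved layout indicates better batch mixing, but a score like this says nothing about whether cell type or disease signal survived, so it would need to be paired with a conservation metric. The brute-force neighbor search is quadratic in the number of cells; at the scale of millions of cells a precomputed or approximate nearest-neighbors graph would be needed instead.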
Taken together, in this project we will generate a single-cell reference atlas of blood capturing the variation of a large human population. This is of tremendous value for the research community investigating blood cell states in health and disease.
Students accepted to this project should attend online workshops at the LRZ (unless they have proven knowledge); the dates are TBA. More information will be provided to accepted students.