panCRISPR Toolbox - a deep learning approach to improve CRISPR/Cas experiments

This project took place in the winter term 2021. You can NOT apply to this project anymore!

Results of this project are explained in detail in the final report and presentation.

Goal: Build an integrative panCRISPR toolbox that can both predict guide RNA efficiency and measure off-target effects.

Background: CRISPR technology is a simple yet powerful tool for editing genomes. It allows researchers to easily alter DNA sequences and modify gene function. Its many potential applications include correcting genetic defects, treating diseases and preventing their spread, and improving crops. The two major components of the CRISPR/Cas editing system are the Cas protein and the guide RNA. The guide RNA contains a user-defined RNA sequence that guides the Cas protein to the desired target DNA segment and hence needs to be designed specifically for each target. Two key aspects of the design process are the efficiency and the specificity of the designed guide RNAs. Efficiency measures how well a designed guide RNA binds to the target DNA segment, while specificity measures how often the guide RNA binds to off-target locations in the DNA. Our panCRISPR toolbox will be able to design highly specific and efficient CRISPR guide RNAs in a fast and explainable manner using advanced deep learning methods.

Methods: Our goal is to build an integrative panCRISPR toolbox that can 1) predict guide RNA efficiency and 2) measure off-target effects. For efficiency prediction, we aim to extract the most promising features from existing approaches and to improve on them by implementing models of different complexity, such as MLPs, 1D CNNs and GNNs. Finding all possible off-target sites is the most time-consuming step in the guide RNA design process, since the user-defined RNA sequence needs to be mapped to the whole genome sequence. We aim to speed up this step by using either heuristics or pre-filtering approaches to reduce the number of guide RNAs to be investigated. Another possible way of reducing the pool of designed guide RNAs is the use of an AI agent (trained via genetic algorithms or reinforcement learning) that designs guides with high efficiency and specificity from scratch, using efficiency and specificity predictions as a reward signal or fitness function. Since the majority of CRISPR screening experiments that measure guide RNA efficiency and specificity are conducted in human cell lines, we aim to develop a model that transfers well from human to other eukaryotic species, such as mice or plants. For comparison, we will benchmark existing tools on cross-species CRISPR data and compare their results to those of our panCRISPR toolbox. Accurate predictions matter, but another important aspect of machine learning applications in biology is the interpretability of the models' decisions. To understand why certain guides are more efficient than others, we will apply suitable explainability methods such as feature importance via SHAP, saliency maps, layer-wise relevance propagation or GNNExplainer.
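To make the pre-filtering idea for the off-target search more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method used in the project: it assumes an SpCas9-style NGG PAM, scans only the forward strand, writes the guide as DNA (T instead of U), and uses a cheap check on the PAM-proximal "seed" region before counting mismatches over the full guide. The function names and the thresholds (`max_mismatches`, `seed_len`, `max_seed_mismatches`) are hypothetical.

```python
from typing import Iterator, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def candidate_sites(genome: str, guide_len: int = 20) -> Iterator[Tuple[int, str]]:
    """Yield (position, site) for every forward-strand site followed by an NGG PAM.

    Assumption: SpCas9-style PAM located directly 3' of the target site.
    """
    for i in range(len(genome) - guide_len - 2):
        pam = genome[i + guide_len : i + guide_len + 3]
        if pam[1:] == "GG":
            yield i, genome[i : i + guide_len]


def find_off_targets(
    guide: str,
    genome: str,
    max_mismatches: int = 3,       # hypothetical threshold
    seed_len: int = 8,             # hypothetical seed length
    max_seed_mismatches: int = 1,  # hypothetical threshold
) -> List[Tuple[int, str, int]]:
    """Return (position, site, mismatches) for sites that pass both filters."""
    seed = guide[-seed_len:]  # PAM-proximal end of the guide
    hits = []
    for pos, site in candidate_sites(genome, len(guide)):
        # Cheap pre-filter: discard sites whose seed region differs too much.
        if hamming(seed, site[-seed_len:]) > max_seed_mismatches:
            continue
        # Full comparison only for sites that survive the pre-filter.
        mismatches = hamming(guide, site)
        if mismatches <= max_mismatches:
            hits.append((pos, site, mismatches))
    return hits
```

Calling `find_off_targets(guide, genome)` with a 20-nt guide and a genome string returns the perfect match (if present) together with near matches. Because the seed check is a heuristic, it can discard sites that a full mismatch count would still accept; this is the usual trade-off between speed and sensitivity in such a pre-filter. A genome-scale search would additionally need the reverse strand and an indexed data structure instead of a linear scan.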

Data:

There are several resources for CRISPR screening experiments. One main resource we will use is the GenomeCRISPR database, which contains more than 700 000 guide RNAs used in ~500 different experiments performed in 421 different human cell lines: genomecrispr.dkfz.de. We will also use this GitHub repository, which collects several datasets from different publications: github.com/maximilianh/crisporPaper/tree/master/effData

Students accepted to this project should attend online workshops at the LRZ from 11.10. - 15.10.2021, unless they can demonstrate prior knowledge. More information will be provided to accepted students.