Deep Learning for de novo peptide sequencing
DL4DNPS
Abstract
The three major classes of molecules of life are DNA, RNA, and proteins. Unlike for DNA and RNA, there is to date no accurate, high-throughput sequencing technology for proteins. The closest technology is tandem mass spectrometry, which yields mass spectra of protein fragments called peptides. Highly accurate de novo peptide sequencing (DNPS), i.e. determining peptide amino acid sequences solely from tandem mass spectra, will make proteomics amenable to applications including genotyping, cancer surveillance, pathogen surveillance, immuno-oncology, metagenomics, and paleogenomics. Recently, algorithms leveraging machine learning, and deep learning in particular, have made promising progress on this problem [1-4]. However, their performance remains very poor (<15%) in the high-precision range (90%) that is required for practical applications, including clinical ones.
Here, we propose to develop two innovative, complementary approaches to DNPS. First, we frame the DNPS problem as a 1D-image translation task, which takes a discretized spectrum as input and returns peak labels that tag the ion series and contamination peaks. Second, we treat DNPS as a combinatorial optimization problem, for which we will investigate the use of genetic algorithms (GA). The two methods complement each other because the bin classification algorithm can be used both to define the fitness function of the GA and to generate guided mutations.
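As a rough illustration of how the two ideas could fit together, the following Python sketch bins a spectrum, labels each occupied bin, and uses the labeled bins as the fitness signal of a small genetic algorithm. Everything in it is an assumption made for illustration: the 0.1 m/z bin width, the reduced amino acid alphabet, the restriction to singly charged b- and y-ions, the function names, and the GA settings. The labels are derived from a known peptide and merely stand in for the output of a trained bin classifier, and the mutations are random rather than guided.

```python
import random

# Monoisotopic residue masses (Da) for a reduced, illustrative alphabet.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259,
}
ALPHABET = sorted(RESIDUE_MASS)
PROTON, WATER = 1.00728, 18.01056
BIN_WIDTH = 0.1  # assumed bin width in m/z units


def fragment_mz(peptide):
    """Return m/z values of the singly charged b- and y-ions of a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    ions, prefix = {}, 0.0
    for i, mass in enumerate(masses[:-1], start=1):
        prefix += mass
        ions[f"b{i}"] = prefix + PROTON
        ions[f"y{len(masses) - i}"] = total - prefix + WATER + PROTON
    return ions


def to_bin(mz):
    """Map an m/z value to the index of its bin in the discretized spectrum."""
    return int(mz / BIN_WIDTH)


def label_bins(peaks, peptide):
    """Label each occupied bin with an ion name or 'contamination'.

    Stand-in for the bin classifier: labels come from a known peptide.
    """
    ion_bins = {to_bin(mz): name for name, mz in fragment_mz(peptide).items()}
    return {to_bin(mz): ion_bins.get(to_bin(mz), "contamination") for mz in peaks}


def fitness(candidate, observed_bins):
    """Count the candidate's theoretical ions that fall into observed bins."""
    return sum(to_bin(mz) in observed_bins for mz in fragment_mz(candidate).values())


def mutate(peptide, rng):
    """Unguided point mutation: replace one residue with a random one."""
    i = rng.randrange(len(peptide))
    return peptide[:i] + rng.choice(ALPHABET) + peptide[i + 1:]


def evolve(observed_bins, length, generations=50, pop_size=30, seed=0):
    """Elitist GA: keep the fitter half, refill with mutated copies."""
    rng = random.Random(seed)
    pop = ["".join(rng.choice(ALPHABET) for _ in range(length))
           for _ in range(pop_size)]
    history = []  # best fitness per generation
    for _ in range(generations):
        pop.sort(key=lambda p: fitness(p, observed_bins), reverse=True)
        history.append(fitness(pop[0], observed_bins))
        elite = pop[: pop_size // 2]
        pop = elite + [mutate(rng.choice(elite), rng)
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=lambda p: fitness(p, observed_bins)), history


true_peptide = "PEVTK"
# Synthetic spectrum: all theoretical fragment peaks plus two contamination peaks.
peaks = list(fragment_mz(true_peptide).values()) + [250.0, 333.3]
labels = label_bins(peaks, true_peptide)
observed = set(labels)
best, history = evolve(observed, len(true_peptide))
print(best, history[-1], fitness(true_peptide, observed))
```

In this toy setting the fitness only counts matched bins. A guided mutation, as proposed above, could instead use the predicted ion labels to decide which residue positions to change, and a realistic fitness function would also need to account for the precursor mass and for variable peptide length.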
The algorithm will be trained on data generated by the ProteomeTools project [5], which used tandem mass spectrometry to systematically characterize ~1.4 million synthetic peptides covering all human gene products, including post-translational modifications. In total, the project generated >100 million high-quality reference tandem mass spectra.
[1] Ma, B. Novor: real-time peptide de novo sequencing software. J. Am. Soc. Mass Spectrom. 26, 1885–1894 (2015).
[2] Yang, H., Chi, H., Zeng, W.-F., Zhou, W.-J. & He, S.-M. pNovo 3: precise de novo peptide sequencing using a learning-to-rank framework. Bioinformatics 35, i183–i190 (2019).
[3] Tran, N. H., Zhang, X., Xin, L., Shan, B. & Li, M. De novo peptide sequencing by deep learning. Proc. Natl. Acad. Sci. 114, 8247–8252 (2017).
[4] Yilmaz, M., Fondrie, W. E., Bittremieux, W., Oh, S. & Noble, W. S. De novo mass spectrometry peptide sequencing with a transformer model. Preprint at bioRxiv 2022.02.07.479481 (2022). doi:10.1101/2022.02.07.479481.
[5] Zolg, D. et al. Building ProteomeTools based on a complete synthetic human proteome. Nat. Methods 14, 259–262 (2017).