dc.description.abstract
Mass spectrometry-based proteomics makes it possible to take a snapshot of the state of cells at the
protein level. In particular, it allows the entire proteome to be studied in a high-throughput
manner with high sensitivity. It can furthermore provide insights into peptidoforms,
which are peptides that have sequence variations or are post-translationally modified.
These play a major role in cell signaling and gene regulation, but can also be linked
to certain disorders. Studying peptidoforms and proteoforms therefore provides fine-grained
phenotypic insights that are of high biological relevance. However, identifying
peptidoforms with mass spectrometry-based proteomics is challenging for at least
two reasons. First, the underlying tandem mass spectra contain additional, rare, and
potentially hard-to-simulate patterns. Second, contextual knowledge in the form
of reference proteomes, which initially contain canonical reference proteins, cannot
be extended indefinitely to account for all potential peptidoforms. This is because
considering all possible combinations of modifications and sequence variations is not
tractable due to combinatorial complexity. To address these challenges, this thesis presents new deep learning
methods that improve the identification of peptidoforms in mass spectrometry-based
proteomics while being interpretable. First, I introduce AHLF, a new end-to-end
trained deep learning model that predicts modified peptides based on their
tandem mass (MS/MS) spectra. I show that AHLF’s prediction
score boosts peptide identification rates in phosphoproteomics and for
cross-linked peptides. AHLF is a temporal
convolutional neural network (TCN), which I trained on 19.2 million historical
peptide-to-spectrum matches. To interpret AHLF, I estimate feature importances per
peak and per spectrum. These peak-level importances show that AHLF is indeed focusing
on peptide-specific fragment ions. Additionally, I investigate the prediction performance
across data of varying quality, where the variation arises from instrument type,
resolution, and dissociation method. In the second part of this work, I present yHydra,
a foundation model that I trained on nearly 20 million peptides and MS/MS
spectra. I designed yHydra as a foundation model so that it can serve various
downstream tasks. In particular, I demonstrate that yHydra can perform closed,
open, and error-tolerant searching. Using yHydra, I demonstrate error-tolerant searching
of antibody peptide sequences and cross-species searching
of chimpanzee plasma samples. Lastly, I demonstrate the explainability of yHydra by
visualizing the learned joint embedding of peptides and spectra, which reveals a
manifold that is structured in concordance with the physicochemical properties of the embedded
peptides.
en