dc.contributor.author
Budach, Stefan
dc.date.accessioned
2021-03-29T14:08:47Z
dc.date.available
2021-03-29T14:08:47Z
dc.identifier.uri
https://refubium.fu-berlin.de/handle/fub188/30124
dc.identifier.uri
http://dx.doi.org/10.17169/refubium-29866
dc.description.abstract
Biological sequences - DNA, RNA and proteins - orchestrate the behavior of all living cells, and the effort to understand the mechanisms that govern and regulate the interactions among these molecules has motivated biological research for many years. The introduction of experimental protocols that analyze such interactions on a genome- or transcriptome-wide scale has also established the use of machine learning in our field to make sense of the vast amounts of generated data. Recently, deep learning, a branch of machine learning based on artificial neural networks, and especially convolutional neural networks (CNNs), has been shown to deliver promising results for predictive tasks and automated feature extraction. However, the resulting models are often very complex, which makes them hard to apply and interpret. Yet the ability to interpret which features a model has learned from the data is crucial for understanding and explaining new biological mechanisms.
This work therefore presents pysster, our open-source software library that enables researchers to more easily train, apply and interpret CNNs on biological sequence data. We evaluate and implement different feature interpretation and visualization strategies and show that the flexibility of CNNs allows for the integration of additional data beyond pure sequences to improve the biological interpretability of the learned features. We demonstrate this by building, among others, predictive models for transcription factor and RNA-binding protein binding sites and by supplementing these models with structural information in the form of DNA shape and RNA secondary structure. Features learned by the models are then visualized as sequence and structure motifs together with information about motif locations and motif co-occurrence. By further analyzing an artificial data set containing implanted motifs, we also illustrate how the hierarchical feature extraction process in a multi-layer deep neural network operates.
Finally, we present a larger biological application by predicting RNA-binding of proteins for transcripts for which experimental protein-RNA interaction data are not yet available. Here, the comprehensive interpretation options of CNNs made us aware of potential technical bias in the experimental eCLIP (enhanced crosslinking and immunoprecipitation) data that were used as a basis for the models. This allowed for subsequent tuning of the models and data to obtain more meaningful predictions in practice.
en
dc.format.extent
vii, 119 pages
dc.rights.uri
https://creativecommons.org/licenses/by-sa/4.0/
dc.subject
deep learning
en
dc.subject
interpretability
en
dc.subject
bioinformatics
en
dc.subject.ddc
500 Natural sciences and mathematics::570 Life sciences; biology::576 Genetics and evolution
dc.title
Explainable deep learning models for biological sequence classification
dc.contributor.gender
male
dc.contributor.firstReferee
Marsico, Annalisa
dc.contributor.furtherReferee
Lippert, Christoph
dc.date.accepted
2020-12-11
dc.identifier.urn
urn:nbn:de:kobv:188-refubium-30124-8
refubium.affiliation
Mathematik und Informatik
dcterms.accessRights.dnb
free
dcterms.accessRights.openaire
open access