Explainable deep learning models for biological sequence classification

Budach, Stefan

Metadata

dc.contributor.author

Budach, Stefan

dc.date.accessioned

2021-03-29T14:08:47Z

dc.date.available

2021-03-29T14:08:47Z

dc.date.issued

2021

dc.identifier.uri

https://refubium.fu-berlin.de/handle/fub188/30124

dc.identifier.uri

http://dx.doi.org/10.17169/refubium-29866

dc.description.abstract

Biological sequences (DNA, RNA and proteins) orchestrate the behavior of all living cells, and the effort to understand the mechanisms that govern and regulate the interactions among these molecules has motivated biological research for many years. The introduction of experimental protocols that analyze such interactions on a genome- or transcriptome-wide scale has also established machine learning in our field as a way to make sense of the vast amounts of generated data. Recently, deep learning, a branch of machine learning based on artificial neural networks, and especially convolutional neural networks (CNNs) have been shown to deliver promising results for predictive tasks and automated feature extraction. However, the resulting models are often very complex, which makes them hard to apply and interpret. Yet the ability to interpret which features a model has learned from the data is crucial for understanding and explaining new biological mechanisms. This work therefore presents pysster, our open source software library that enables researchers to more easily train, apply and interpret CNNs on biological sequence data. We evaluate and implement different feature interpretation and visualization strategies and show that the flexibility of CNNs allows additional data beyond pure sequences to be integrated to improve biological feature interpretability. We demonstrate this by building, among others, predictive models for transcription factor and RNA-binding protein binding sites and by supplementing these models with structural information in the form of DNA shape and RNA secondary structure. Features learned by the models are then visualized as sequence and structure motifs, together with information about motif locations and motif co-occurrence. By further analyzing an artificial data set containing implanted motifs, we also illustrate how the hierarchical feature extraction process in a multi-layer deep neural network operates.
Finally, we present a larger biological application: predicting the RNA binding of proteins for transcripts for which experimental protein-RNA interaction data are not yet available. Here, the comprehensive interpretation options of CNNs made us aware of potential technical bias in the experimental eCLIP (enhanced crosslinking and immunoprecipitation) data that were used as a basis for the models. This allowed for subsequent tuning of the models and data to obtain more meaningful predictions in practice.

dc.format.extent

vii, 119 Seiten

dc.language

eng

dc.rights.uri

https://creativecommons.org/licenses/by-sa/4.0/

dc.subject

deep learning

dc.subject

interpretability

dc.subject

bioinformatics

dc.subject.ddc

500 Naturwissenschaften und Mathematik::570 Biowissenschaften; Biologie::576 Genetik und Evolution

dc.title

Explainable deep learning models for biological sequence classification

dc.type

Dissertation

dcterms.format

Text

dc.contributor.gender

male

dc.contributor.firstReferee

Marsico, Annalisa

dc.contributor.furtherReferee

Lippert, Christoph

dc.date.accepted

2020-12-11

dc.identifier.urn

urn:nbn:de:kobv:188-refubium-30124-8

refubium.affiliation

Mathematik und Informatik

dcterms.accessRights.dnb

free

dcterms.accessRights.openaire

open access

This Item appears in the following Collection(s)

Dissertationen FU

Files in This Item

Dissertation_StefanBudach.pdf

Size: 32.21MB

Format: PDF

Checksum (MD5): 9ed877ce52390fbe61bcc1aec854d5d0

Refubium - Freie Universität Berlin Repository
