dc.contributor.author
Krützfeldt, Louisa-Marie
dc.contributor.author
Schubach, Max
dc.contributor.author
Kircher, Martin
dc.date.accessioned
2021-04-16T06:57:23Z
dc.date.available
2021-04-16T06:57:23Z
dc.identifier.uri
https://refubium.fu-berlin.de/handle/fub188/30373
dc.identifier.uri
http://dx.doi.org/10.17169/refubium-30114
dc.description.abstract
Regulatory regions, like promoters and enhancers, cover an estimated 5-15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding, and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training datasets, both learners and two approaches for negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell type, predicting cell-type-specific elements, and predicting elements' relative activity as measured in independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing the best results overall. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs achieved and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need for hyperparameter optimization.
en
dc.rights.uri
https://creativecommons.org/licenses/by/4.0/
dc.subject
Hep G2 Cells
en
dc.subject
Neural Networks, Computer
en
dc.subject
Regulatory Sequences, Nucleic Acid
en
dc.subject
Sequence Analysis, DNA
en
dc.subject.ddc
600 Technology, medicine, applied sciences::610 Medicine and health
dc.title
The impact of different negative training data on regulatory sequence predictions
dc.type
Scientific article
dcterms.bibliographicCitation.articlenumber
e0237412
dcterms.bibliographicCitation.doi
10.1371/journal.pone.0237412
dcterms.bibliographicCitation.journaltitle
PLOS ONE
dcterms.bibliographicCitation.number
12
dcterms.bibliographicCitation.originalpublishername
Public Library of Science (PLoS)
dcterms.bibliographicCitation.volume
15
refubium.affiliation
Charité - Universitätsmedizin Berlin
refubium.resourceType.isindependentpub
no
dcterms.accessRights.openaire
open access
dcterms.bibliographicCitation.pmid
33259518
dcterms.isPartOf.eissn
1932-6203