dc.contributor.author
Abbasi-Vineh, Mohammad Ali
dc.contributor.author
Rouzbahani, Shirin
dc.contributor.author
Kavousi, Kaveh
dc.contributor.author
Emadpour, Masoumeh
dc.date.accessioned
2025-09-24T11:14:00Z
dc.date.available
2025-09-24T11:14:00Z
dc.identifier.uri
https://refubium.fu-berlin.de/handle/fub188/49519
dc.identifier.uri
http://dx.doi.org/10.17169/refubium-49241
dc.description.abstract
A key barrier to applying deep learning (DL) to omics and other biological datasets is data scarcity, particularly when each gene or protein is represented by a single sequence. This fundamental challenge is especially relevant in research involving genetically constrained organisms, organelles, specialized cell types, and biological cycles and pathways. This study introduces a novel data augmentation strategy designed to facilitate the application of DL models to omics datasets. The approach uses a sliding window technique to generate a large number of overlapping subsequences with controlled overlaps and shared nucleotide features. A hybrid model combining Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers was applied to augmented datasets comprising genes and proteins from eight microalgae and higher plant chloroplasts. The data augmentation strategy made it possible to apply DL methods to these datasets and significantly improved model performance by avoiding common issues such as overfitting and non-representative sequence variations. The augmentation process is highly adaptable, providing flexibility across different types of biological data repositories. Furthermore, a complementary k-mer-based data augmentation strategy was introduced for unlabeled datasets, enhancing unsupervised analysis. Overall, these innovative strategies provide robust solutions for optimizing model training in studies of datasets with limited data availability.
en
dc.format.extent
19 pages
dc.rights.uri
https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject
Machine learning
en
dc.subject
Chloroplast genome
en
dc.subject.ddc
500 Natural sciences and mathematics::570 Life sciences; Biology::570 Life sciences; Biology
dc.title
Innovative data augmentation strategy for deep learning on biological datasets with limited gene representations focused on chloroplast genomes
dc.type
Scientific article
dcterms.bibliographicCitation.articlenumber
27079
dcterms.bibliographicCitation.doi
10.1038/s41598-025-12796-9
dcterms.bibliographicCitation.journaltitle
Scientific Reports
dcterms.bibliographicCitation.number
1
dcterms.bibliographicCitation.volume
15
dcterms.bibliographicCitation.url
https://doi.org/10.1038/s41598-025-12796-9
refubium.affiliation
Biologie, Chemie, Pharmazie
refubium.affiliation.other
Institut für Biologie

refubium.resourceType.isindependentpub
no
dcterms.accessRights.openaire
open access
dcterms.isPartOf.eissn
2045-2322
refubium.resourceType.provider
WoS-Alert