Innovative data augmentation strategy for deep learning on biological datasets with limited gene representations focused on chloroplast genomes

Abbasi-Vineh, Mohammad Ali; Rouzbahani, Shirin; Kavousi, Kaveh; Emadpour, Masoumeh

doi:10.1038/s41598-025-12796-9

Title:

Innovative data augmentation strategy for deep learning on biological datasets with limited gene representations focused on chloroplast genomes

Author(s):

Abbasi-Vineh, Mohammad Ali; Rouzbahani, Shirin; Kavousi, Kaveh; Emadpour, Masoumeh

Year of publication:

2025

Available Date:

2025-09-24T11:14:00Z

Abstract:

One key barrier to applying deep learning (DL) to omics and other biological datasets is data scarcity, particularly when each gene or protein is represented by a single sequence. This challenge is especially relevant in research involving genetically constrained organisms, organelles, specialized cell types, and biological cycles and pathways. This study introduces a novel data augmentation strategy designed to facilitate the application of DL models to omics datasets. The approach uses a sliding window technique to generate a large number of overlapping subsequences with controlled overlaps and shared nucleotide features. A hybrid model combining Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers was applied to augmented datasets comprising genes and proteins from the chloroplasts of eight microalgae and higher plants. The data augmentation strategy made it possible to apply DL methods to these datasets and significantly improved model performance by avoiding common issues such as overfitting and non-representative sequence variations. The augmentation process is highly adaptable and can be applied flexibly across different types of biological data repositories. Furthermore, a complementary k-mer-based data augmentation strategy was introduced for unlabeled datasets to enhance unsupervised analysis. Overall, these strategies provide robust solutions for training models on datasets with limited data availability.
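
The sliding-window idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the window length, the overlap, or how k-mers are used, so the function names `sliding_windows` and `kmer_counts`, the toy sequence, and all parameter values below are assumptions chosen for illustration only.

```python
from collections import Counter


def sliding_windows(seq, window, step):
    """Return overlapping subsequences of length `window`, advancing by `step`.

    Consecutive subsequences share `window - step` characters, so a smaller
    step gives more subsequences and a larger overlap.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, step)]


def kmer_counts(seq, k):
    """Count every overlapping k-mer in `seq` (a simple feature for unlabeled data)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy sequence, made up for this example (not taken from the study).
gene = "ATGGCTAGCTAGGCTA"

# Hypothetical settings: 8-nucleotide windows moved 2 positions at a time,
# so neighbouring subsequences overlap by 6 nucleotides.
subsequences = sliding_windows(gene, window=8, step=2)
counts = kmer_counts("ATGATG", k=3)

for s in subsequences:
    print(s)
print(dict(counts))
```

In a labeled setting, each subsequence would presumably inherit the label of the gene or protein it was cut from, which is how a single sequence per gene can yield many training examples.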

Identifier:

https://refubium.fu-berlin.de/handle/fub188/49519
http://dx.doi.org/10.17169/refubium-49241

Part of Identifier:

e-ISSN (online): 2045-2322

Language:

English

Keywords:

Machine learning
Small data
Chloroplast genome
Genomics

DDC-Classification:

570 Life sciences; biology

Publication Type:

Scientific article