Network-based feature selection methods for omics data have been developed in recent years. Their performance gain, however, has been shown to depend on the datasets, networks, and evaluation metrics used. The reproducibility and robustness of the resulting biomarkers still need to be improved. One of the major challenges in this endeavor is the curse of dimensionality.
To mitigate this issue, we proposed the Phenotype Relevant Network-based Feature Selection (PRNFS) framework. By employing a much smaller but phenotype-relevant network, we could exclude irrelevant information and select robust molecular signatures. The advantages of PRNFS were demonstrated in an application to lung cancer prognosis prediction. Specifically, we constructed epithelial-mesenchymal transition (EMT) networks and employed them for feature selection. We mapped multiple types of omics data onto these networks, one type at a time, to select single-omics signatures, and then integrated them into multi-omics signatures. We also introduced a multiplex network-based feature selection method that selects multi-omics signatures directly. Both single-omics and multi-omics EMT signatures were evaluated on TCGA data as well as on an independent multi-omics dataset.
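The general idea of restricting feature selection to a small phenotype-relevant network can be sketched as follows. This is a minimal illustration and not the PRNFS implementation: the function name `select_network_features`, the correlation-based relevance score, the single neighbour-smoothing step, and the parameters `k` and `alpha` are all assumptions introduced here for the example.

```python
import numpy as np

def select_network_features(X, y, genes, edges, k=2, alpha=0.5):
    """Rank genes inside a small phenotype-relevant network (illustrative only).

    X: (samples, genes) array; y: outcome vector; genes: list of column names;
    edges: iterable of (gene_a, gene_b) pairs defining the network.
    The scoring rule below is an assumption, not the method from the text.
    """
    # Keep only the columns whose gene is a node in the network.
    nodes = sorted({g for e in edges for g in e} & set(genes))
    idx = [genes.index(g) for g in nodes]
    Xn = X[:, idx]

    # Univariate relevance: absolute Pearson correlation with the outcome.
    Xc = Xn - Xn.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    score = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)

    # Adjacency matrix restricted to the retained nodes.
    pos = {g: i for i, g in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for a, b in edges:
        if a in pos and b in pos:
            A[pos[a], pos[b]] = A[pos[b], pos[a]] = 1.0

    # One smoothing step: mix each score with the mean score of its neighbours
    # (nodes without retained neighbours get a neighbour mean of zero).
    deg = A.sum(axis=1)
    neigh = (A @ score) / np.where(deg == 0, 1.0, deg)
    smoothed = (1 - alpha) * score + alpha * neigh

    # Return the k highest-scoring network genes.
    order = np.argsort(-smoothed)[:k]
    return [nodes[i] for i in order]
```

In this sketch, a gene outside the network is never considered, however strongly it correlates with the outcome, which is how the search space shrinks to a small fraction of the original dimensions.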
The results showed that EMT signatures achieved a significant performance gain, even though the EMT networks covered less than 2.5% of the original data dimensions. Frequently selected EMT features achieved an average AUC of 0.83 on TCGA data. Applying the EMT signatures to the independent dataset stratified the patients into significantly different prognostic groups. Multi-omics features outperformed single-omics features on both the TCGA data and the independent data.
Additionally, we tested the performance of several relational and non-relational databases for storing and retrieving omics data. Because biological data are characterized by large volume, high velocity, and wide variety, database systems must meet the needs of integrative omics data analysis. Based on the results, we provided several recommendations for building scalable omics data infrastructures.