During cancer development, malignant tumours accumulate genetic and epigenetic alterations that cause dysregulation of gene expression and cellular processes. Since the regulation of gene expression controls many cellular processes, understanding the transcriptome of malignant tumours provides insights into the biology of cancer. A key technology for the molecular analysis of whole cancer transcriptomes is next-generation sequencing (NGS) of RNA (RNA-seq) from bulk tumours. However, deriving information about cancer transcriptomes from RNA-seq data requires a variety of computational tools and analyses. The following work presents two cancer transcriptome studies that address the computational analysis of RNA-seq data from colorectal carcinomas (CRC) and medulloblastomas (MB) by applying statistical and machine learning (ML) methods.
CRC is a clinically challenging disease because only a fraction of tumours respond to available chemo- and targeted therapies. Functional loss of the tumour suppressor APC has been suggested to represent the initial mutation, activating Wnt signalling. Additional events include mutually exclusive mutations in the RAS/RAF proto-oncogenes as well as in the TGFβ, PI3K, and TP53 pathways. Routinely used biomarkers of resistance to the EGFR inhibitor cetuximab are RAS/RAF mutations that activate signalling downstream of EGFR. Still, a fraction of RAS/RAF-wild-type CRCs is resistant to cetuximab treatment. Addressing the need for a better molecular understanding of CRC in precision oncology, the OncoTrack consortium (Innovative Medicines Initiative) designed a multi-omics strategy integrating the establishment of a pre-clinical platform for CRC organoid and animal models. In the study presented below, we focused on the integrative analysis of gene expression and drug response data obtained from patient-derived xenografts (PDXs) treated with cetuximab. Applying statistical methods, we identified a signature of 241 genes associated with response to cetuximab, which allowed us to dissect the expression profiles of responding and non-responding CRC. We used a support vector machine (SVM), a supervised ML algorithm, to obtain a gene-expression-based classifier predictive of response to cetuximab. For this classifier, we selected 16 highly predictive genes using multiple SVM recursive feature elimination. The resulting classifier outperformed RAS/RAF mutations as a predictor of cetuximab response and performed well in RAS/RAF-wild-type CRC, which currently lacks biomarkers of cetuximab treatment outcome in clinical practice.
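To illustrate the general idea of SVM recursive feature elimination, the following is a minimal sketch, not the procedure used in the study. It assumes scikit-learn, a single elimination run rather than the multiple runs mentioned above, and purely synthetic data; the sample size, the regularisation constant, and the cross-validation scheme are illustrative assumptions. Only the numbers 241 and 16 are taken from the text.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the study data): 60 models x 241 genes.
rng = np.random.default_rng(0)
n_models, n_genes = 60, 241
X = rng.normal(size=(n_models, n_genes))
y = np.repeat([0, 1], n_models // 2)  # 0 = non-responder, 1 = responder
X[y == 1, :10] += 1.5                 # make the first 10 genes informative

# Linear SVM wrapped in RFE: fit, rank genes by squared weight,
# drop the lowest-ranked gene, and repeat until 16 genes remain.
model = make_pipeline(
    StandardScaler(),
    RFE(SVC(kernel="linear", C=1.0), n_features_to_select=16, step=1),
)

# Gene selection happens inside each fold, so the accuracy estimate
# is not inflated by selecting genes on the held-out samples.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

# Refit on all samples to obtain one final gene set.
model.fit(X, y)
selected = np.flatnonzero(model[-1].support_)

print("cross-validated accuracy per fold:", np.round(scores, 2))
print("selected gene indices:", selected)
```

Keeping the elimination step inside the cross-validation pipeline is a deliberate choice in this sketch: selecting genes on the full data set first and validating afterwards would leak information from the test folds.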
The second study addressed the molecular analysis of MB. MB, a tumour of the cerebellum, is the most common malignant brain tumour in children. Transcriptome profiling of MB using microarrays has revealed four tumour subgroups, namely WNT, SHH, Group 3 and Group 4, each related to distinct genetic alterations, molecular profiles, and clinical features. In WNT and SHH MB, recurrent mutations mainly activate the respective pathway, whereas in Group 3 and Group 4 MB, gross chromosomal alterations are more prevalent and tumours express a gene signature that is cell-type-related rather than pathway-related. Additional molecular complexity has been identified within these four main subgroups, which could be dissected further into subtypes. However, the gene regulatory networks that contribute to the molecular heterogeneity in MB are only partially known, and the role of long non-coding (lnc) genes remains poorly addressed in this disease. To gain further insights into the molecular biology of MB, the PedBrain project was founded within the ICGC framework. As a contribution to this project, we sequenced and computationally analysed 164 MB RNA-seq samples. Addressing the heterogeneity of MB, we identified and validated molecular subclusters within the four main subgroups. Subgroup- and subcluster-specific gene expression profiles were analysed by functional enrichments and gene regulatory networks (GRNs) inferred from gene expression data. These GRNs revealed commonalities and differences in gene regulation among subclusters and subgroups. By estimating the impact of transcription factors (TFs), we could unravel master regulators of subcluster-specific gene expression in a systematic fashion for the first time and highlight unknown regulators of Group 4 MB. Furthermore, we characterised lnc genes that were differentially expressed in MB. Among these genes, we identified 20 lnc genes that show brain-development-associated expression patterns, which is of interest due to the embryonic origin of MB. We identified a co-expression cluster that accumulates known cancer-related lnc genes and associated these genes with cancer-promoting protein biogenesis. Survival analyses revealed the lnc gene MEG3 as a prognostic marker in SHH and Group 4 subclusters, potentially acting as a tumour suppressor that negatively regulates cell cycle and TGFβ receptor expression.
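As an illustration of how a gene can be examined as a prognostic marker, the following minimal sketch computes Kaplan-Meier survival estimates for patients split at the median expression of one gene. The text does not state which survival method was used, so the Kaplan-Meier estimator and the median split are assumptions, and all expression values, follow-up times, and event flags below are invented toy numbers, not study data.

```python
from statistics import median


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  : follow-up time per patient
    events : 1 if the event (e.g. death) was observed, 0 if censored
    """
    order = sorted(zip(times, events))
    at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = 0
        leaving = 0
        # Collect all patients whose follow-up ends at time t.
        while i < len(order) and order[i][0] == t:
            deaths += order[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Invented toy values for one hypothetical gene (NOT study data).
expression = [5.2, 1.1, 4.8, 0.9, 6.0, 1.5, 3.9, 0.7]
times = [60, 14, 48, 20, 72, 9, 55, 30]  # months of follow-up
events = [0, 1, 0, 1, 0, 1, 1, 1]        # 1 = event, 0 = censored

# Split patients into high and low expressers at the median.
cutoff = median(expression)
high = [i for i, e in enumerate(expression) if e > cutoff]
low = [i for i, e in enumerate(expression) if e <= cutoff]

km_high = kaplan_meier([times[i] for i in high], [events[i] for i in high])
km_low = kaplan_meier([times[i] for i in low], [events[i] for i in low])

print("high expression:", km_high)
print("low expression:", km_low)
```

In practice the two curves would be compared with a formal test such as the log-rank test, and within each subgroup or subcluster separately; the sketch omits both.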