dc.contributor.author
Sieg, Miriam
dc.contributor.author
Richter, Gesa
dc.contributor.author
Schaefer, Arne S.
dc.contributor.author
Kruppa, Jochen
dc.date.accessioned
2020-02-12T10:40:08Z
dc.date.available
2020-02-12T10:40:08Z
dc.identifier.uri
https://refubium.fu-berlin.de/handle/fub188/26651
dc.identifier.uri
http://dx.doi.org/10.17169/refubium-26408
dc.description.abstract
BACKGROUND:
In methylation analyses such as epigenome-wide association studies, a large number of biomarkers are tested for association between the measured continuous outcome and different covariates. For a continuous covariate like smoking pack years (SPY), a measure of lifetime exposure to tobacco toxins, a spike at zero can occur: all non-smokers form a peak at zero, while the smokers are distributed over the remaining SPY values. A spike can also occur on the right side of the covariate distribution if a category such as "heavy smoker" is defined. Here, we focus on methylation data with a spike at the left or right end of the distribution of a continuous covariate. After the methylation data are generated, the analysis usually consists of preprocessing, quality control, and determination of differentially methylated sites, often run as a pipeline, that is, a chain of methods available in a single software package. These pipelines distinguish between categorical covariates, used for group comparisons, and continuous covariates, used for linear regression. The differential methylation analysis is often carried out internally by a linear regression without checking its inherent assumptions. A spike in the continuous covariate is then ignored and can bias the results.
RESULTS:
We reanalysed five data sets, four of them freely available from ArrayExpress, that include methylation data and smoking habits reported as smoking pack years. For this purpose, we developed an algorithm that checks for suspicious interactions between the values associated with the spike position and those at the non-spike positions of the covariate. The algorithm helps to decide whether a suspicious interaction is present and whether further investigation should be carried out. This is especially important because the information on differentially methylated sites is used for post-hoc analyses such as pathway analyses.
CONCLUSIONS:
Our algorithm helps to check the validity of the linear regression assumptions in a methylation analysis pipeline. These assumptions should also be considered for machine learning approaches. In addition, the algorithm can detect outliers in the continuous covariate. Used as a preprocessing step, it should therefore lead to more statistically robust results in methylation analysis.
en
dc.rights.uri
https://creativecommons.org/licenses/by/4.0/
dc.subject
Spike at zero
en
dc.subject
Outlier detection
en
dc.subject
High dimensional data
en
dc.subject.ddc
600 Technology, Medicine, Applied Sciences::610 Medicine and Health::610 Medicine and Health
dc.title
Detection of suspicious interactions of spiking covariates in methylation data
dc.type
Scientific article
dcterms.bibliographicCitation.articlenumber
36
dcterms.bibliographicCitation.doi
10.1186/s12859-020-3364-6
dcterms.bibliographicCitation.journaltitle
BMC Bioinformatics
dcterms.bibliographicCitation.originalpublishername
BMC
dcterms.bibliographicCitation.volume
21
refubium.affiliation
Charité - Universitätsmedizin Berlin
refubium.resourceType.isindependentpub
no
dcterms.accessRights.openaire
open access
dcterms.bibliographicCitation.pmid
32000657
dcterms.isPartOf.eissn
1471-2105