Detection of suspicious interactions of spiking covariates in methylation data

Miriam Sieg; Gesa Richter; Arne S Schaefer; Jochen Kruppa

doi:10.1186/s12859-020-3364-6

BMC Bioinformatics. 2020 Jan 30;21(1):36. doi: 10.1186/s12859-020-3364-6.

Authors

Miriam Sieg^{1,2}, Gesa Richter^{2,3}, Arne S Schaefer^{2,3}, Jochen Kruppa^{4,5}

Affiliations

¹ Charité - University Medicine, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Institute of Biometry and Clinical Epidemiology, Charitéplatz 1, Berlin, 10117, Germany.
² Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Straße 2, Berlin, 10178, Germany.
³ Department of Periodontology and Synoptic Dentistry, Institute of Dental, Oral and Maxillary Medicine, Charité - University Medicine, Charitéplatz 1, Berlin, 10117, Germany.
⁴ Charité - University Medicine, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Institute of Biometry and Clinical Epidemiology, Charitéplatz 1, Berlin, 10117, Germany. jochen.kruppa@charite.de.
⁵ Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Straße 2, Berlin, 10178, Germany. jochen.kruppa@charite.de.

Abstract

Background: In methylation analyses such as epigenome-wide association studies, a large number of biomarkers are tested for an association between the measured continuous outcome and different covariates. A continuous covariate such as smoking pack years (SPY), a measure of lifetime exposure to tobacco toxins, can show a spike at zero: all non-smokers form a peak at zero, while the smokers are distributed over the remaining SPY values. A spike can also occur on the right side of the covariate distribution if a category such as "heavy smoker" is defined. Here, we focus on methylation data with a spike at the left or the right of the distribution of a continuous covariate. After the methylation data are generated, the analysis usually consists of preprocessing, quality control, and determination of differentially methylated sites, often performed in pipeline fashion, meaning that the data are processed by a chain of methods available in one software package. The pipelines can distinguish between categorical covariates, i.e. for group comparisons, and continuous covariates, i.e. for linear regression. The differential methylation analysis is often done internally by a linear regression without checking its inherent assumptions. A spike in the continuous covariate is then ignored and can cause biased results.
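To make the bias concrete, the following minimal Python sketch simulates one methylation site under an assumed data-generating model. The cohort sizes, methylation levels, noise level, and the helper `ols_slope` are hypothetical choices for illustration and are not taken from the study. Among smokers the methylation level does not depend on SPY, but non-smokers sit at a different level. A naive regression over all samples then reports a dose effect that is absent among smokers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cohort: 60 non-smokers (SPY = 0) and 40 smokers (SPY in 1..50).
n_zero, n_pos = 60, 40
spy = np.concatenate([np.zeros(n_zero), rng.uniform(1, 50, n_pos)])
is_smoker = spy > 0

# Assumed model: smokers share one methylation level regardless of dose,
# while non-smokers (the spike at zero) sit 0.10 higher.
beta = 0.60 - 0.10 * is_smoker + rng.normal(0, 0.01, spy.size)


def ols_slope(x, y):
    """Slope of an ordinary least-squares line y ~ 1 + x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


# Naive pipeline-style fit over all samples versus a fit on smokers only.
slope_all = ols_slope(spy, beta)
slope_smokers = ols_slope(spy[is_smoker], beta[is_smoker])

print(f"slope, all samples:  {slope_all:.5f}")
print(f"slope, smokers only: {slope_smokers:.5f}")
```

In this simulation the all-sample slope is clearly negative while the smokers-only slope is close to zero, so the apparent SPY effect is driven entirely by the spike.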

Results: We reanalysed five data sets, four of them freely available from ArrayExpress, that include methylation data and smoking habits reported as smoking pack years. We developed an algorithm that checks for suspicious interactions between the values associated with the spike position and those associated with the non-spike positions of the covariate. The algorithm helps to decide whether a suspicious interaction is present and whether further investigation should be carried out. This is especially important because the information on the differentially methylated sites is used for post-hoc analyses such as pathway analyses.
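The abstract does not specify how the algorithm works, so the following Python sketch shows only one plausible way such a check could look and should not be read as the authors' method. It assumes a hypothetical function `spike_gap_scores` that, for each site, fits a line to the non-spike samples, extrapolates it to the spike position, and compares the extrapolated value with the observed mean at the spike. The cut-off of 4.0 and the simulated data are likewise assumptions.

```python
import numpy as np


def spike_gap_scores(meth, covariate, spike_value=0.0):
    """Standardised gap between the spike mean and the non-spike regression line.

    meth:      array of shape (n_sites, n_samples) with methylation values
    covariate: array of shape (n_samples,) with a spike at `spike_value`
    """
    meth = np.asarray(meth, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    at_spike = np.isclose(covariate, spike_value)

    # Fit y ~ 1 + x on the non-spike samples, all sites at once.
    x = covariate[~at_spike]
    X = np.column_stack([np.ones_like(x), x])
    y = meth[:, ~at_spike].T
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Extrapolate each fitted line to the spike position.
    x0 = np.array([1.0, spike_value])
    predicted = x0 @ coef

    # Variance of the extrapolated value from the residual variance.
    resid = y - X @ coef
    sigma2 = (resid ** 2).sum(axis=0) / (X.shape[0] - 2)
    var_pred = sigma2 * (x0 @ np.linalg.inv(X.T @ X) @ x0)

    # Observed mean at the spike and its variance.
    spike_vals = meth[:, at_spike]
    observed = spike_vals.mean(axis=1)
    var_obs = spike_vals.var(axis=1, ddof=1) / at_spike.sum()

    return (observed - predicted) / np.sqrt(var_obs + var_pred)


# Simulated example: site 0 follows one line everywhere,
# site 1 has an offset of 0.1 at the spike.
rng = np.random.default_rng(1)
spy = np.concatenate([np.zeros(50), rng.uniform(1, 50, 50)])
consistent = 0.5 + 0.002 * spy
shifted = 0.5 + 0.002 * spy + 0.1 * (spy == 0)
meth = np.vstack([consistent, shifted]) + rng.normal(0, 0.02, (2, spy.size))

scores = spike_gap_scores(meth, spy)
flagged = np.abs(scores) > 4.0  # hypothetical cut-off
print(scores, flagged)
```

A large absolute score suggests that the spike samples do not lie on the line implied by the remaining samples, which is the kind of site that would merit closer inspection before a single regression slope is reported.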

Conclusions: Our algorithm helps to check whether the linear regression assumptions hold in a methylation analysis pipeline. These assumptions should also be considered for machine learning approaches. In addition, we are able to detect outliers in the continuous covariate. Using our algorithm as a preprocessing step should therefore produce more statistically robust results in methylation analysis.

Keywords: Epigenetic; High dimensional data; Methylation; Outlier detection; Spike at zero.

MeSH terms

Adult
Algorithms
Analysis of Variance
DNA Methylation*
Humans
Linear Models
Machine Learning
Middle Aged
Smoking / genetics*
Smoking / metabolism