Predicting preeclampsia and related risk factors using data mining approaches: A cross-sectional study

Zohreh Manoochehri; Sara Manoochehri; Farzaneh Soltani; Leili Tapak; Majid Sadeghifar

doi:10.18502/ijrm.v19i11.9911

Int J Reprod Biomed. 2021 Dec 13;19(11):959-968. doi: 10.18502/ijrm.v19i11.9911. eCollection 2021 Nov.

Authors

Zohreh Manoochehri¹, Sara Manoochehri¹, Farzaneh Soltani², Leili Tapak³, Majid Sadeghifar⁴

Affiliations

¹ Department of Biostatistics, Student Research Committee, Hamadan University of Medical Sciences, Hamadan, Iran.
² Department of Midwifery, School of Nursing and Midwifery, Hamadan University of Medical Sciences, Hamadan, Iran.
³ Modeling of Noncommunicable Disease Research Center, Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran.
⁴ Department of Statistics, Faculty of Basic Sciences, Bu-Ali Sina University, Hamadan, Iran.

Abstract

Background: Preeclampsia is a hypertensive disorder of pregnancy that has adverse effects on both the mother and the fetus. Despite recent advances in understanding the etiology of preeclampsia, no adequate clinical screening test has been identified for the disorder.

Objective: We aimed to provide a model based on data mining approaches that can be used as a screening tool to identify patients with this syndrome and also to identify the risk factors associated with it.

Materials and methods: The data for this cross-sectional study were extracted from the clinical records of 726 mothers with preeclampsia and 726 mothers without preeclampsia who were referred to Fatemieh Hospital in Hamadan City between April 2005 and March 2015. Six data mining methods were applied: logistic regression, k-nearest neighbors, C5.0 decision tree, discriminant analysis, random forest, and support vector machine. Their performance was compared using accuracy, sensitivity, and specificity.
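The three comparison criteria named above can be computed from the counts of a binary confusion matrix. The following is a minimal sketch, not the authors' code: the function name `screening_metrics`, the label coding (1 = preeclampsia, 0 = no preeclampsia), and the example labels are assumptions made for illustration only.

```python
import math
from typing import Dict, Sequence


def screening_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity for binary labels.

    Assumed coding (not taken from the study): 1 = preeclampsia, 0 = no preeclampsia.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one observation is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-cases correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-cases wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed

    return {
        # Share of all predictions that are correct.
        "accuracy": (tp + tn) / len(pairs),
        # Share of true cases that the model detects.
        "sensitivity": tp / (tp + fn) if (tp + fn) else math.nan,
        # Share of true non-cases that the model clears.
        "specificity": tn / (tn + fp) if (tn + fp) else math.nan,
    }


# Made-up labels for illustration only; these are not data from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = screening_metrics(y_true, y_pred)
print(metrics)
```

For a screening tool, sensitivity is usually the criterion of most interest, since a missed case (false negative) is typically costlier than a false alarm.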

Results: Underlying condition, age, pregnancy season, and the number of pregnancies were the most important risk factors for predicting preeclampsia. The accuracies of the models were as follows: logistic regression (0.713), k-nearest neighbors (0.742), C5.0 decision tree (0.788), discriminant analysis (0.687), random forest (0.758), and support vector machine (0.791).
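As a small illustration, the reported accuracies can be ranked directly. The values below are copied from the Results paragraph; the variable names are hypothetical and the snippet only sorts the published figures, it does not reproduce the analysis.

```python
# Accuracies as reported in the abstract (no other values are assumed).
reported_accuracy = {
    "logistic regression": 0.713,
    "k-nearest neighbors": 0.742,
    "C5.0 decision tree": 0.788,
    "discriminant analysis": 0.687,
    "random forest": 0.758,
    "support vector machine": 0.791,
}

# Sort models from most to least accurate.
ranked = sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True)

for name, accuracy in ranked:
    print(f"{name:<24} {accuracy:.3f}")

best_model = ranked[0][0]
# Gap between the best and second-best model, rounded to avoid float noise.
margin = round(ranked[0][1] - ranked[1][1], 3)
```

The ranking places support vector machine first, with the C5.0 decision tree 0.003 behind it.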

Conclusion: Among the data mining methods employed in this study, support vector machine was the most accurate in predicting preeclampsia. Therefore, this model can be considered as a screening tool for this disorder.

Keywords: C5.0 decision tree; Logistic regression; Random forest; Support vector machine; Preeclampsia.