Data quality improvement of a multicenter clinical trial dataset

Gian Maria Zaccaria; Samanta Rosati; Cristina Castagneri; Simone Ferrero; Marco Ladetto; Mario Boccadoro; Gabriella Balestra

doi:10.1109/EMBC.2017.8037043

Annu Int Conf IEEE Eng Med Biol Soc. 2017 Jul;2017:1190-1193. doi: 10.1109/EMBC.2017.8037043.

PMID: 29060088

Abstract

Medical datasets are usually affected by several problems, such as missing values, inconsistencies, and redundancies, which can influence the data mining process and the extraction of useful knowledge. For these reasons, a preprocessing phase should be performed to improve the overall quality of the data and, consequently, of the information that may be discovered from them. In this study we applied five data preprocessing steps to improve the quality of a large dataset derived from a multicenter clinical trial. Our dataset included 298 patients enrolled in a prospective, multicenter clinical trial and was characterized by 22 input variables and one class variable (MIPI value). First, data coming from different medical centers were integrated to obtain a homogeneous dataset. This dataset was then normalized to scale all variables into smaller and similar intervals. Next, all missing values were estimated by means of an imputation step. Finally, the complete dataset was discretized and reduced to remove redundant variables and decrease the amount of data to be managed. The improvement in data quality after each step was evaluated by means of the patients' classification accuracy using the k-nearest neighbors (KNN) classifier. Our results showed that the proposed pipeline increased classification performance by more than 20%. Moreover, the largest gain in accuracy was obtained after missing value imputation, whereas the discretization and feature selection steps allowed for a significant reduction in the number of variables to be managed, without any deterioration of the information contained in the data.
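The preprocessing sequence described in the abstract can be illustrated with a minimal sketch. The abstract does not name the specific normalization, imputation, discretization, or feature selection methods, so everything below is an assumption chosen for simplicity: min-max scaling, column-mean imputation, equal-width binning, removal of duplicated columns as a crude stand-in for feature selection, and leave-one-out evaluation of a KNN classifier with Euclidean distance. The data integration step is skipped, the function names are hypothetical, and the six-row toy dataset is synthetic, not taken from the trial.

```python
import math
from collections import Counter

def min_max_normalize(rows):
    """Scale each column to [0, 1], leaving missing values (None) in place.
    Assumes every column has at least one observed value."""
    out_cols = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        lo, hi = min(present), max(present)
        span = (hi - lo) or 1.0  # avoid division by zero for constant columns
        out_cols.append([None if v is None else (v - lo) / span for v in col])
    return [list(r) for r in zip(*out_cols)]

def impute_mean(rows):
    """Replace each missing value with the mean of its column (an assumed method)."""
    out_cols = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        mean = sum(present) / len(present)
        out_cols.append([mean if v is None else v for v in col])
    return [list(r) for r in zip(*out_cols)]

def discretize(rows, bins=3):
    """Equal-width binning of values that are already scaled to [0, 1]."""
    return [[min(int(v * bins), bins - 1) for v in row] for row in rows]

def drop_duplicate_columns(rows):
    """Keep only the first of any columns that are identical after binning."""
    seen, keep = set(), []
    for idx, col in enumerate(zip(*rows)):
        if col not in seen:
            seen.add(col)
            keep.append(idx)
    return [[row[i] for i in keep] for row in rows], keep

def knn_loo_accuracy(rows, labels, k=3):
    """Leave-one-out accuracy of a k-nearest-neighbor classifier."""
    correct = 0
    for i, x in enumerate(rows):
        neighbors = sorted(
            (math.dist(x, y), labels[j]) for j, y in enumerate(rows) if j != i
        )
        votes = Counter(label for _, label in neighbors[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(rows)

# Synthetic toy data: two variables, two classes, two missing values.
rows = [
    [1.0, 200.0], [2.0, None], [1.5, 220.0],
    [8.0, 900.0], [9.0, 950.0], [None, 880.0],
]
labels = ["low", "low", "low", "high", "high", "high"]

normalized = min_max_normalize(rows)
imputed = impute_mean(normalized)
binned = discretize(imputed, bins=3)
reduced, kept = drop_duplicate_columns(binned)

# Accuracy is recomputed after each step, mirroring the evaluation idea.
print("after imputation:    ", knn_loo_accuracy(imputed, labels))
print("after discretization:", knn_loo_accuracy(binned, labels))
print("after reduction:     ", knn_loo_accuracy(reduced, labels), "columns kept:", kept)
```

Recomputing accuracy after every step is what lets the contribution of each step be compared. Accuracy is not computed before imputation here because plain Euclidean distance is undefined for missing values; how the study handled that case is not stated in the abstract.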

MeSH terms

Clinical Trials as Topic*
Data Accuracy
Data Mining*
Humans
Multicenter Studies as Topic
Prospective Studies
Quality Improvement