Exploratory data analysis of a clinical study group: Development of a procedure for exploring multidimensional data

PLoS One. 2018 Aug 23;13(8):e0201950. doi: 10.1371/journal.pone.0201950. eCollection 2018.

Abstract

Thorough knowledge of the structure of the analyzed data makes it possible to form detailed scientific hypotheses and research questions. The structure of data can be revealed with methods for exploratory data analysis. Owing to the multitude of available methods, selecting those that work well together and facilitate data interpretation is not an easy task. In this work we present a well-matched set of tools for a complete exploratory analysis of a clinical dataset and perform a case study analysis on a set of 515 patients. The proposed procedure comprises several steps: 1) robust data normalization, 2) outlier detection with Mahalanobis (MD) and robust Mahalanobis distances (rMD), 3) hierarchical clustering with Ward's algorithm, 4) Principal Component Analysis with biplot vectors. The analyzed set comprised elderly patients who participated in the PolSenior project. Each patient was characterized by over 40 biochemical and socio-geographical attributes. Introductory analysis showed that the case-study dataset comprises two clusters separated along the axis of sex hormone attributes. Further analysis was carried out separately for male and female patients. The optimal partitioning of the male set resulted in five subgroups. Two of them were related to diseased patients: 1) diabetes and 2) hypogonadism patients. Analysis of the female set suggested that it was more homogeneous than the male set. No evidence of pathological patient subgroups was found. In the study we showed that outlier detection with MD and rMD not only identifies outliers but can also be used to assess the heterogeneity of a dataset. The case study demonstrated that our procedure is well suited for identification and visualization of biologically meaningful patient subgroups.
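To make the steps listed in the abstract more concrete, the following is a minimal sketch of steps 1, 2 and 4 using NumPy only. It is not the authors' implementation: the function names, the synthetic data, and the choice of median/MAD scaling (with the conventional 1.4826 consistency constant) are assumptions made for illustration. Only the classical Mahalanobis distance is shown. A robust variant would replace the sample mean and covariance with robust estimates (for example a minimum covariance determinant estimate), and the abstract does not state which estimator was used. Ward clustering (step 3) is omitted here.

```python
import numpy as np

def robust_normalize(X):
    """Center each column on its median and scale by its MAD.

    Assumption: median/MAD scaling is one common form of robust
    normalization; the abstract does not specify the exact variant.
    """
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    return (X - med) / (1.4826 * mad)

def mahalanobis_distances(X):
    """Classical Mahalanobis distance of every row from the sample mean."""
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # Row-wise quadratic form diff_i^T * cov_inv * diff_i
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

def pca_biplot(X, n_components=2):
    """Return PCA scores, loadings (biplot vectors) and explained variance ratios."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # one row per patient
    loadings = Vt[:n_components].T                    # one row per attribute
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]

# Synthetic stand-in for a patients-by-attributes matrix (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[0] = [8.0, -8.0, 8.0, -8.0]  # one planted outlier

Z = robust_normalize(X)
md = mahalanobis_distances(Z)
scores, loadings, explained = pca_biplot(Z)

print("row with largest Mahalanobis distance:", int(np.argmax(md)))
print("explained variance ratio:", np.round(explained, 3))
```

In a biplot, each row of `scores` would be drawn as a point and each row of `loadings` as an arrow from the origin, so attributes that separate subgroups (such as the sex hormone attributes mentioned above) point along the axis of separation.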

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Aged
  • Aged, 80 and over
  • Algorithms
  • Clinical Studies as Topic / statistics & numerical data*
  • Cluster Analysis
  • Data Analysis*
  • Female
  • Humans
  • Male
  • Middle Aged
  • Principal Component Analysis
  • Sex Factors

Grants and funding

The project was partly supported by Wroclaw Centre of Biotechnology through the programme The Leading National Research Centre (KNOW) for years 2014-2018. BMK would like to acknowledge the funding from the statutory fund of the Department of Biomedical Engineering, Wroclaw University of Science and Technology. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.