Machine Learning in Therapeutic Research: The Hard Work of Outlier Detection in Large Data

Am J Ther. 2016 May-Jun;23(3):e837-43. doi: 10.1097/MJT.0b013e31827ab4a0.

Abstract

With large data files, outlier recognition requires a more sophisticated approach than traditional data plots and regression lines. In addition, the number of outliers tends to rise linearly with the sample size. The objective of this study was to examine whether balanced iterative reducing and clustering using hierarchies (BIRCH) clustering is able to detect previously unrecognized outlier data. A simulated data file and a real data file were used as examples. SPSS statistical software was used for data analysis. In 50 mentally depressed persons, a regression analysis failed to detect any outliers. BIRCH analysis of these data showed, in addition to 2 clusters, a relevant outlier cluster consisting of 7 patients (14%) who did not fit in the formed clusters. In 576 iatrogenic admissions, the number of comedications was not a significant loglinear predictor of iatrogenic admission. In contrast, BIRCH analysis revealed an outlier cluster consisting of 174 patients (30%) with an extremely large number of comedications. The conclusions were as follows: (1) A systematic assessment for outliers is important in therapeutic research with large data, because the lack of it can lead to catastrophic consequences. (2) Traditional data analysis, such as regression analysis, was unable to demonstrate outliers in our examples. (3) BIRCH cluster analysis of the examples produced relevant outlier clusters of patients who did not otherwise fit in the data. (4) On theoretical grounds, BIRCH cluster analysis is particularly suitable for large datasets.
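
To illustrate why a BIRCH-style analysis can surface outliers that a regression line misses, the following is a minimal sketch in Python. It is not the SPSS procedure used in the study, and it does not use the study data. It is a simplified, single-level version of the clustering-feature idea behind BIRCH: each subcluster is summarized by its count, linear sum, and sum of squares, points are absorbed into the nearest subcluster in one pass as long as its radius stays under a threshold, and points that end up in very small subclusters are flagged as outliers. The names `ClusteringFeature`, `birch_like_outliers`, `threshold`, and `min_fraction`, as well as the toy data, are assumptions made for illustration only.

```python
import math


class ClusteringFeature:
    """BIRCH-style subcluster summary: count N, linear sum LS, sum of squares SS."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = 0.0

    def add(self, point):
        # Summaries are additive, so no raw points need to be stored.
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
        self.ss += sum(x * x for x in point)

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius_if_added(self, point):
        # radius^2 = SS/N - ||LS/N||^2, computed for the summary plus one point.
        n = self.n + 1
        ls = [s + x for s, x in zip(self.ls, point)]
        ss = self.ss + sum(x * x for x in point)
        r2 = ss / n - sum((s / n) ** 2 for s in ls)
        return math.sqrt(max(r2, 0.0))


def birch_like_outliers(points, threshold, min_fraction):
    """Single pass over the data; flag members of very small subclusters.

    threshold and min_fraction are hypothetical tuning parameters.
    """
    dim = len(points[0])
    features, labels = [], []
    for p in points:
        # Find the subcluster whose centroid is closest to the point.
        best, best_d = None, float("inf")
        for idx, cf in enumerate(features):
            d = math.dist(p, cf.centroid())
            if d < best_d:
                best, best_d = idx, d
        # Absorb the point if the subcluster stays compact, else start a new one.
        if best is not None and features[best].radius_if_added(p) <= threshold:
            features[best].add(p)
            labels.append(best)
        else:
            cf = ClusteringFeature(dim)
            cf.add(p)
            features.append(cf)
            labels.append(len(features) - 1)
    min_size = min_fraction * len(points)
    small = {i for i, cf in enumerate(features) if cf.n < min_size}
    outliers = [i for i, lab in enumerate(labels) if lab in small]
    return labels, outliers


# Toy data (invented): two compact groups of 20 points plus 3 isolated points.
group_a = [(0.5 * i, 0.5 * j) for i in range(5) for j in range(4)]
group_b = [(x + 10.0, y + 10.0) for x, y in group_a]
isolated = [(30.0, -20.0), (-25.0, 30.0), (50.0, 50.0)]
points = group_a + group_b + isolated

labels, outliers = birch_like_outliers(points, threshold=2.0, min_fraction=0.1)
print(outliers)  # -> [40, 41, 42], the indices of the three isolated points
```

Because only the three summary values per subcluster are kept, the data are read once and memory use depends on the number of subclusters rather than the number of records, which is the theoretical reason such methods suit large datasets. A full BIRCH implementation additionally organizes the summaries in a height-balanced tree and runs a global clustering step on the leaf entries, both of which this sketch omits.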

MeSH terms

  • Biomedical Research* / methods
  • Data Interpretation, Statistical*
  • Humans
  • Regression Analysis
  • Statistics as Topic*