The Univariate Flagging Algorithm (UFA): An interpretable approach for predictive modeling

Mallory Sheth; Albert Gerovitch; Roy Welsch; Natasha Markuzon

doi:10.1371/journal.pone.0223161

PLoS One. 2019 Oct 11;14(10):e0223161. doi: 10.1371/journal.pone.0223161. eCollection 2019.

Authors

Mallory Sheth¹,², Albert Gerovitch¹, Roy Welsch¹, Natasha Markuzon²

Affiliations

¹ Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America.
² The Charles Stark Draper Laboratory, Cambridge, Massachusetts, United States of America.

Abstract

In many data classification problems, a number of methods will give similar accuracy. However, when working with people who are not experts in data science, such as doctors, lawyers, and judges, finding interpretable algorithms can be a critical success factor. Practitioners have a deep understanding of the individual input variables but far less insight into how they interact with each other. For example, there may be ranges of an input variable for which the observed outcome is significantly more or less likely. This paper describes an algorithm for automatic detection of such thresholds, called the Univariate Flagging Algorithm (UFA). The algorithm searches for a separation that optimizes the difference between separated areas while obtaining a high level of support. We evaluate its performance using six sample datasets and demonstrate that thresholds identified by the algorithm align well with published results and known physiological boundaries. We also introduce two classification approaches that use UFA and show that the performance attained on unseen test data is comparable to or better than that of traditional classifiers when confidence intervals are considered. We identify conditions under which UFA performs well, including applications with large amounts of missing or noisy data, applications with a large number of inputs relative to observations, and applications where incidence of the target is low. We argue that ease of explanation of the results, robustness to missing data and noise, and detection of low-incidence adverse outcomes are desirable features for clinical applications that can be achieved with a relatively simple classifier such as UFA.
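The threshold search described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the optimization criterion, so the code below assumes one plausible reading, namely choosing the cut point that maximizes the difference in outcome rate between the flagged and unflagged groups, subject to a minimum number of flagged observations. The function name `find_upper_flag`, the parameter `min_support`, and the toy data are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def find_upper_flag(
    values: Sequence[Optional[float]],
    outcomes: Sequence[int],
    min_support: int = 5,
) -> Optional[Tuple[float, float, int]]:
    """Return (threshold, rate_difference, support) for the best upper flag.

    Assumed criterion (not taken from the paper): among thresholds t where
    at least `min_support` observations satisfy value >= t, pick the t that
    maximizes the outcome rate at or above t minus the outcome rate below t.
    Returns None if no threshold has enough support.
    """
    # Observations with a missing value are skipped: they are never flagged
    # and do not contribute to either group's outcome rate.
    pairs = [(v, y) for v, y in zip(values, outcomes) if v is not None]
    best = None
    for t in sorted({v for v, _ in pairs}):
        above = [y for v, y in pairs if v >= t]
        below = [y for v, y in pairs if v < t]
        if len(above) < min_support or not below:
            continue  # too little support, or nothing to compare against
        diff = sum(above) / len(above) - sum(below) / len(below)
        if best is None or diff > best[1]:
            best = (t, diff, len(above))
    return best


# Made-up toy data: body temperature (deg C) and a binary adverse outcome.
temps = [36.5, 36.8, 37.0, 37.1, None, 37.4, 38.6, 38.9, 39.2, 39.5]
died = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
result = find_upper_flag(temps, died, min_support=3)
print(result)  # (38.6, 0.75, 4)
```

On this toy input the sketch flags temperatures at or above 38.6, where 3 of 4 flagged observations have the outcome against 0 of 5 below the cut. A lower flag (values at or below a threshold) could be found the same way with the comparisons reversed. The support constraint is what keeps the search from picking an extreme cut point that isolates one or two observations.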

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Body Temperature
Breast Neoplasms / diagnosis*
Breast Neoplasms / mortality
Breast Neoplasms / pathology
Datasets as Topic
Diabetes Mellitus / diagnosis*
Diabetes Mellitus / mortality
Diabetes Mellitus / pathology
Female
Humans
Leukemia, Myeloid, Acute / diagnosis*
Leukemia, Myeloid, Acute / mortality
Leukemia, Myeloid, Acute / pathology
Male
Models, Statistical
Precursor Cell Lymphoblastic Leukemia-Lymphoma / diagnosis*
Precursor Cell Lymphoblastic Leukemia-Lymphoma / mortality
Precursor Cell Lymphoblastic Leukemia-Lymphoma / pathology
Sepsis / diagnosis*
Sepsis / mortality
Sepsis / pathology
Survival Analysis

Grants and funding

This research was partially supported by the internal Draper IR&D grant. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.