NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data

Justin Y Lee; Mark P Styczynski

doi:10.1007/s11306-018-1451-8


Metabolomics. 2018 Nov 23;14(12):153. doi: 10.1007/s11306-018-1451-8.

Authors

Justin Y Lee¹, Mark P Styczynski²

Affiliations

¹ School of Chemical & Biomolecular Engineering, Georgia Institute of Technology, 311 Ferst Drive, Atlanta, GA, 30332-0100, USA.
² School of Chemical & Biomolecular Engineering, Georgia Institute of Technology, 311 Ferst Drive, Atlanta, GA, 30332-0100, USA. mark.styczynski@chbe.gatech.edu.

Abstract

Introduction: A common problem in metabolomics data analysis is the existence of a substantial number of missing values, which can complicate, bias, or even prevent certain downstream analyses. One of the most widely used solutions to this problem is imputation with a k-nearest neighbors (kNN) algorithm, which estimates missing metabolite abundances from similar entries in the dataset. kNN implicitly assumes that missing values are uniformly distributed at random in the dataset, but this is typically not true in metabolomics, where many values are missing because they are below the limit of detection of the analytical instrumentation.
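As an illustration of the baseline kNN imputation described above, the following is a minimal sketch, not the authors' implementation. It assumes rows are the entities being compared (for example, metabolites) and columns are samples, uses a root-mean-square Euclidean distance over the columns two rows share, and averages the k closest rows without weighting. The function name `knn_impute` and all of these choices are assumptions made for illustration.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaN entries of X using the mean of the k nearest rows (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    n_rows = X.shape[0]
    for i, j in zip(*np.where(np.isnan(X))):
        candidates = []
        for r in range(n_rows):
            # Classic kNN skips any candidate row that lacks a value in column j.
            if r == i or np.isnan(X[r, j]):
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if not shared.any():
                continue  # no columns in common, so no distance can be computed
            dist = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
            candidates.append((dist, X[r, j]))
        if candidates:
            candidates.sort(key=lambda t: t[0])
            filled[i, j] = np.mean([value for _, value in candidates[:k]])
    return filled

X = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 2.0, np.nan],
    [10.0, 20.0, 30.0],
    [1.1, 2.1, 3.2],
])
filled = knn_impute(X, k=2)
```

Here the missing entry in the second row is filled from the first and fourth rows, which are its two closest neighbors. Because neighbors are chosen only from rows that have a measured value in the target column, values missing below a detection limit never contribute to the estimate.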

Objectives: Here, we explore the impact of nonuniformly distributed missing values (missing not at random, or MNAR) on imputation performance. We present a new model for generating synthetic missing data and a new algorithm, No-Skip kNN (NS-kNN), that accounts for MNAR values to provide more accurate imputations.
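To make the distinction between the two kinds of missingness concrete, the sketch below removes values from a complete matrix using a mixture of both. It is not the missing-data model presented in the paper, whose details are not given in this abstract. It assumes MNAR values are simply the smallest entries in the matrix (mimicking a single global limit of detection) and that the remaining missing values are drawn uniformly at random. The function name `add_missing` and its parameters are hypothetical.

```python
import numpy as np

def add_missing(X, total_frac=0.2, mnar_frac=0.5, seed=0):
    """Return a copy of a complete matrix X with a mix of MNAR and random missing values."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n_total = int(round(total_frac * X.size))
    n_mnar = int(round(mnar_frac * n_total))

    # MNAR (assumed form): drop the lowest values, as if below a detection limit.
    lowest = np.argsort(X, axis=None)[:n_mnar]
    out.flat[lowest] = np.nan

    # Random component: drop the rest uniformly among entries still observed.
    observed = np.flatnonzero(~np.isnan(out))
    random_idx = rng.choice(observed, size=n_total - n_mnar, replace=False)
    out.flat[random_idx] = np.nan
    return out

complete = np.arange(1, 101, dtype=float).reshape(10, 10)
masked = add_missing(complete, total_frac=0.2, mnar_frac=0.5)
```

Varying `mnar_frac` from 0 to 1 produces datasets that range from purely random missingness to purely detection-limit missingness, which is the axis along which the abstract reports its comparisons.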

Methods: We compare the imputation errors of the original kNN algorithm (using two distance metrics), NS-kNN, and the recently developed KNN-TN algorithm when applied to multiple experimental datasets with different types and levels of missing data.
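A comparison of this kind needs an error measure evaluated only on the entries that were artificially removed. The abstract does not state which metric was used, so the sketch below assumes a plain root-mean-square error as a stand-in; the function name `imputation_rmse` is hypothetical.

```python
import numpy as np

def imputation_rmse(truth, imputed, mask):
    """RMSE between true and imputed values, restricted to the masked (removed) entries."""
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    diff = imputed[mask] - truth[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

truth = np.array([[1.0, 2.0], [3.0, 4.0]])
imputed = np.array([[1.0, 2.0], [3.0, 6.0]])
mask = np.array([[False, False], [False, True]])
error = imputation_rmse(truth, imputed, mask)
```

Restricting the calculation to masked entries keeps untouched values, which every method reproduces exactly, from diluting the differences between methods.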

Results: Our results show that NS-kNN typically outperforms kNN when at least 20–30% of missing values in a dataset are MNAR. NS-kNN also has lower imputation errors than KNN-TN on realistic datasets when at least 50% of missing values are MNAR.

Conclusion: Accounting for the nonuniform distribution of missing values in metabolomics data can significantly improve the results of imputation algorithms. The NS-kNN method imputes missing metabolomics data more accurately than existing kNN-based approaches when used on realistic datasets.

Keywords: GC–MS; Imputation; Metabolomics; Missing data; kNN.

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Animals
Bacteria / metabolism
Computational Biology / methods*
Data Accuracy
Data Interpretation, Statistical
Datasets as Topic
Humans
Metabolomics / methods*
Mice
Models, Biological*
