k-Means NANI: an improved clustering algorithm for Molecular Dynamics simulations

Lexin Chen; Daniel R Roe; Matthew Kochert; Carlos Simmerling; Ramón Alain Miranda-Quintana

doi:10.1101/2024.03.07.583975

bioRxiv [Preprint]. 2024 Mar 8:2024.03.07.583975. doi: 10.1101/2024.03.07.583975.

Authors

Lexin Chen¹,², Daniel R Roe³, Matthew Kochert⁴,⁵, Carlos Simmerling⁴,⁵,⁶, Ramón Alain Miranda-Quintana¹,²

Affiliations

¹ Department of Chemistry, University of Florida, FL, USA.
² Quantum Theory Project, University of Florida, FL, USA.
³ Laboratory of Computational Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, USA.
⁴ Laufer Center for Physical & Quantitative Biology, Stony Brook University, Stony Brook, 11794, USA.
⁵ Department of Chemistry, Stony Brook University, Stony Brook 11794, USA.
⁶ Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook 11794, USA.

Abstract

A key challenge in k-means clustering is seed selection, that is, the estimation of the initial centroids, because the clustering result depends heavily on this choice. Alternatives such as k-means++ have mitigated this limitation by estimating the initial centroids using an empirical probability distribution. However, for high-dimensional and complex datasets such as those obtained from molecular simulation, k-means++ fails to partition the data optimally. Furthermore, the stochastic elements in all flavors of k-means++ lead to a lack of reproducibility. K-means N-Ary Natural Initiation (NANI) is presented as an alternative that tackles this challenge by using efficient n-ary comparisons both to identify high-density regions in the data and to select a diverse set of initial conformations. Centroids generated by NANI are not only representative of the data and different from one another, which helps k-means partition the data accurately, but also deterministic, providing consistent cluster populations across replicates. On peptide and protein folding molecular simulations, NANI produced compact and well-separated clusters and accurately identified metastable states that agree with the literature. NANI can cluster diverse datasets and can be used as a standalone tool or as part of our MDANCE clustering package.

Keywords: algorithms; clustering; conformational analysis; k-means; molecular dynamics; protein folding.
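
The idea in the abstract (seed k-means from dense regions, keep the seeds diverse, and avoid random draws) can be illustrated with a small sketch. This is not the NANI algorithm and does not use the MDANCE package: the n-ary comparisons are replaced here by two stand-ins chosen only for illustration, namely a k-nearest-neighbor distance as the density proxy and a greedy max-min rule for diversity. The function names (`pairwise_dist`, `deterministic_seeds`, `kmeans`) and the parameters `top_fraction` and `m` are hypothetical.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distances between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def deterministic_seeds(X, k, top_fraction=0.5, m=5):
    """Pick k initial centroids without any random draw.

    Assumption: density is approximated by the distance to the m-th
    nearest neighbor (smaller means denser). NANI itself uses n-ary
    comparisons instead; this is only an illustrative stand-in.
    """
    D = pairwise_dist(X, X)
    # Column 0 of each sorted row is the point itself (distance 0).
    knn = np.sort(D, axis=1)[:, m]
    n_keep = max(k, int(len(X) * top_fraction))
    dense = np.argsort(knn, kind="stable")[:n_keep]  # densest candidates

    # Greedy max-min selection: start at the densest point, then keep
    # adding the candidate farthest from all seeds chosen so far.
    seeds = [dense[0]]
    min_d = D[dense, dense[0]].copy()
    for _ in range(k - 1):
        nxt = dense[np.argmax(min_d)]
        seeds.append(nxt)
        min_d = np.minimum(min_d, D[dense, nxt])
    return X[np.array(seeds)]

def kmeans(X, centroids, n_iter=100):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        labels = np.argmin(pairwise_dist(X, centroids), axis=1)
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Toy data: three well-separated 2D blobs (synthetic, not MD frames).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Two independent runs on the same data give identical results,
# because the seeding step contains no random element.
labels_a, cent_a = kmeans(X, deterministic_seeds(X, 3))
labels_b, cent_b = kmeans(X, deterministic_seeds(X, 3))
print(np.array_equal(labels_a, labels_b))
print(np.bincount(labels_a, minlength=3))
```

Because the seeds depend only on the data, repeated runs return the same cluster populations, which is the reproducibility property the abstract contrasts with k-means++. The full pairwise distance matrix used here scales quadratically with the number of frames, so this sketch is only suitable for small examples.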

Publication types

Preprint
