UMAP-assisted K-means clustering of large-scale SARS-CoV-2 mutation datasets

Yuta Hozumi; Rui Wang; Changchuan Yin; Guo-Wei Wei

doi:10.1016/j.compbiomed.2021.104264

Comput Biol Med. 2021 Apr:131:104264. doi: 10.1016/j.compbiomed.2021.104264. Epub 2021 Feb 22.

Authors

Yuta Hozumi¹, Rui Wang¹, Changchuan Yin², Guo-Wei Wei³

Affiliations

¹ Department of Mathematics, Michigan State University, MI, 48824, USA.
² Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL, 60607, USA.
³ Department of Mathematics, Michigan State University, MI, 48824, USA; Department of Electrical and Computer Engineering, Michigan State University, MI, 48824, USA; Department of Biochemistry and Molecular Biology, Michigan State University, MI, 48824, USA. Electronic address: weig@msu.edu.

Abstract

Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has had a devastating worldwide effect. Understanding the evolution and transmission of SARS-CoV-2 is of paramount importance for controlling, combating, and preventing COVID-19. Due to the rapid growth in both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced K-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance; its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets; and its superior clustering visualization. UMAP-assisted K-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.

Keywords: COVID-19; PCA; SARS-CoV-2; UMAP; t-SNE.
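
The abstract refers to Jaccard distance-based datasets without spelling out the construction. As a minimal sketch, assuming each genome isolate is represented by the set of mutations it carries, the pairwise Jaccard distance can be computed as below. The function names and the toy mutation labels are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Return 1 - |A & B| / |A | B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(samples: List[Set[str]]) -> List[List[float]]:
    """Build the symmetric pairwise Jaccard distance matrix."""
    n = len(samples)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(samples[i], samples[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical toy isolates; the mutation labels are placeholders, not real data.
isolates = [
    {"mut_A", "mut_B", "mut_C"},
    {"mut_A", "mut_B", "mut_C", "mut_D"},
    {"mut_E", "mut_F"},
]
D = distance_matrix(isolates)
```

In this toy case the first two isolates share three of four mutations, so their distance is 0.25, while the third shares none with either and sits at distance 1.0. A matrix of this kind would then be passed to a dimension-reduction step and K-means; that stage is not shown here.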

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
COVID-19 / genetics*
Databases, Nucleic Acid*
Genome, Viral*
Humans
Mutation*
Phylogeny*
SARS-CoV-2 / genetics*

Grants and funding

R01 GM126189/GM/NIGMS NIH HHS/United States