Selecting Representative Samples From Complex Biological Datasets Using K-Medoids Clustering

Lei Li; Linda Yu-Ling Lan; Lei Huang; Congting Ye; Jorge Andrade; Patrick C Wilson

doi:10.3389/fgene.2022.954024

Front Genet. 2022 Jul 18;13:954024. doi: 10.3389/fgene.2022.954024. eCollection 2022.

Authors

Lei Li¹ ², Linda Yu-Ling Lan¹ ², Lei Huang³, Congting Ye⁴, Jorge Andrade³ ⁵, Patrick C Wilson¹ ²

Affiliations

¹ University of Chicago Department of Medicine, Section of Rheumatology, University of Chicago, Chicago, IL, United States.
² Knapp Center for Lupus and Immunology Research, University of Chicago, Chicago, IL, United States.
³ Center for Research Informatics, University of Chicago, Chicago, IL, United States.
⁴ Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, Xiamen, China.
⁵ Department of Pediatrics, University of Chicago, Chicago, IL, United States.

Abstract

Rapid growth of single-cell sequencing techniques enables researchers to investigate up to millions of cells with diverse properties in a single experiment. It also presents great challenges for selecting representative samples from massive single-cell populations for further experimental characterization, which requires robust and compact sampling that balances diverse properties of different priority levels. Conventional sampling methods fail to generate representative and generalizable subsets from a massive single-cell population or from more complicated ensembles. Here, we present a toolkit called Cookie, which can efficiently select the most representative samples from a massive single-cell population with diverse properties. The method vectorizes all given properties and quantifies the relationships/similarities among samples using their Manhattan distances. It then determines an appropriate sample size by evaluating the coverage of key properties across multiple candidate sizes, and applies k-medoids clustering to group the samples into clusters, selecting the center of each cluster as its most representative sample. Comparison of Cookie with conventional sampling methods using a single-cell atlas dataset, epidemiology surveillance data, and a simulated dataset shows the high efficacy, efficiency, and flexibility of Cookie. The Cookie toolkit is implemented in R and is freely available at https://wilsonimmunologylab.github.io/Cookie/.

Keywords: R; antibody candidate selection; k-medoids; sampling; single cell.
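
The core idea in the abstract (Manhattan distances between property vectors, followed by k-medoids clustering whose medoids serve as the representatives) can be illustrated with a minimal sketch. This is not Cookie's implementation, which is written in R. It is a small Python illustration that assumes the properties are already encoded as numeric vectors, uses a simple alternating (Voronoi-style) k-medoids update rather than any specific variant the toolkit may use, and omits the coverage-based choice of sample size. The names `manhattan`, `k_medoids`, and `points` are hypothetical.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_medoids(points, k, max_iter=100):
    """Return the sorted indices of k medoids (representative samples).

    Assumption: a simple alternating scheme (assign each sample to its
    nearest medoid, then re-pick each cluster's medoid), not necessarily
    the k-medoids variant used by the Cookie toolkit.
    """
    n = len(points)
    dist = [[manhattan(p, q) for q in points] for p in points]

    # Deterministic initialisation: start from the most central sample,
    # then repeatedly add the sample farthest from the chosen medoids.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(
            max(candidates, key=lambda i: min(dist[i][m] for m in medoids))
        )

    for _ in range(max_iter):
        # Assignment step: attach every sample to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)

        # Update step: within each cluster, pick the member with the
        # smallest total distance to the other members.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # guard against empty clusters (duplicate points)
                new_medoids.append(m)
                continue
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members))
            )

        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids

    return sorted(medoids)


# Toy data: two well-separated groups of five samples with two properties each.
points = [
    (0, 0), (0, 2), (2, 0), (2, 2), (1, 1),
    (10, 10), (10, 12), (12, 10), (12, 12), (11, 11),
]
representatives = k_medoids(points, k=2)
print([points[i] for i in representatives])
```

On this toy input the two selected representatives are the central sample of each group. Because medoids are always actual samples, unlike the centroids of k-means, the result maps directly onto cells or records that can be carried forward for further characterization.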