DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets

Elena Tea Russo; Federico Barone; Alex Bateman; Stefano Cozzini; Marco Punta; Alessandro Laio

doi:10.1371/journal.pcbi.1010610

PLoS Comput Biol. 2022 Oct 19;18(10):e1010610. doi: 10.1371/journal.pcbi.1010610. eCollection 2022 Oct.

Authors

Elena Tea Russo¹,², Federico Barone¹,²,³, Alex Bateman⁴, Stefano Cozzini², Marco Punta⁵,⁶, Alessandro Laio¹,⁷

Affiliations

¹ SISSA, Trieste, Italy.
² AREA SCIENCE PARK, Trieste, Italy.
³ Department of Mathematics and Geosciences, University of Trieste, Trieste, Italy.
⁴ European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom.
⁵ Center for Omics Sciences, IRCCS San Raffaele Institute, Milan, Italy.
⁶ Unit of Immunogenetics, Leukemia Genomics and Immunobiology, Division of Immunology, Transplantation and Infectious Disease, IRCCS San Raffaele Scientific Institute, Milan, Italy.
⁷ ICTP, Trieste, Italy.

Abstract

Proteins that are known only at the sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for as yet unannotated sequences. Existing domain family resources typically rely on at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. Here we describe the automatic clustering, by Density Peak Clustering, of UniRef50 v. 2017_07, a protein sequence database that includes approximately 23 million sequences. We radically re-implemented a pipeline we had previously developed so that it can handle millions of sequences and data volumes of the order of 3 terabytes. The modified pipeline, which we call DPCfam, finds ∼45,000 protein clusters in UniRef50. Our automatic classification corresponds closely to those of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters consisting of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
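
For readers unfamiliar with Density Peak Clustering, the following is a minimal sketch of the general algorithm (Rodriguez and Laio, 2014) on toy 2-D points. It is not the DPCfam pipeline, which operates on sequence alignments at a far larger scale; the function name, the cutoff d_c, the n_clusters parameter, and the rule of picking centers by the largest density-times-distance product are all illustrative assumptions.

```python
import math

def density_peak_clustering(points, d_c, n_clusters):
    """Toy Density Peak Clustering; all names here are hypothetical."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Rank points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # delta: distance to the nearest point that ranks higher in density.
    # The top-ranked point gets its largest distance to any point instead.
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j

    # Assumed center rule: the n_clusters points with the largest rho * delta.
    centers = sorted(range(n), key=lambda i: -(rho[i] * delta[i]))[:n_clusters]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c

    # Every other point inherits the label of its nearest denser neighbor.
    # Visiting points in density order guarantees the parent is labeled first.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels

# Two well-separated toy groups of five points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels = density_peak_clustering(points, d_c=1.0, n_clusters=2)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

In this toy example the two central points have the highest density and the largest distance from any denser point, so they are selected as cluster centers and the remaining points attach to them.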

MeSH terms

Amino Acid Sequence
Cluster Analysis
Databases, Protein
Protein Domains
Proteins* / genetics

Substances

Proteins

Grants and funding

The author(s) received no specific funding for this work.