Efficient inference of homologs in large eukaryotic pan-proteomes

Siavash Sheikhizadeh Anari; Dick de Ridder; M Eric Schranz; Sandra Smit

doi:10.1186/s12859-018-2362-4

BMC Bioinformatics. 2018 Sep 26;19(1):340. doi: 10.1186/s12859-018-2362-4.

Authors

Siavash Sheikhizadeh Anari¹, Dick de Ridder², M Eric Schranz³, Sandra Smit²

Affiliations

¹ Bioinformatics Group, Wageningen University, Wageningen, The Netherlands. siavash.sheikhizadehanari@wur.nl.
² Bioinformatics Group, Wageningen University, Wageningen, The Netherlands.
³ Biosystematics Group, Wageningen University, Wageningen, The Netherlands.

Abstract

Background: Identification of homologous genes is fundamental to comparative genomics, functional genomics and phylogenomics. Extensive public homology databases are of great value for investigating homology but need to be continually updated to incorporate new sequences. As new sequences are rapidly being generated, there is a need for efficient standalone tools to detect homologs in novel data.

Results: To address this, we present a fast method for detecting homology groups across a large number of individuals and/or species. We adopted a k-mer based approach which considerably reduces the number of pairwise protein alignments without sacrificing sensitivity. We demonstrate accuracy, scalability, efficiency and applicability of the presented method for detecting homology in large proteomes of bacteria, fungi, plants and Metazoa.
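The k-mer idea can be illustrated with a minimal sketch. It is not the PanTools implementation. It only shows the general principle of indexing proteins by their k-mers so that alignment is attempted only for pairs sharing enough k-mers. The function names, the value of k, the `min_shared` threshold and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(proteins, k=3, min_shared=2):
    """Return pairs of protein IDs that share at least `min_shared` distinct k-mers.

    Hypothetical prefilter: only these pairs would be passed on to a
    (much more expensive) pairwise alignment step.
    """
    # Inverted index: k-mer -> IDs of the proteins that contain it.
    index = defaultdict(set)
    for pid, seq in proteins.items():
        for kmer in kmers(seq, k):
            index[kmer].add(pid)

    # Count the distinct k-mers shared by each pair of proteins.
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1

    return {pair for pair, n in shared.items() if n >= min_shared}


# Toy sequences (made up): p1 and p2 differ in one residue, p3 is unrelated.
proteins = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQK",
    "p3": "GGSWLLPPHH",
}
pairs = candidate_pairs(proteins, k=3, min_shared=2)
print(pairs)
```

In this toy input only the pair (p1, p2) survives the filter, so one alignment would be computed instead of three. The saving comes from never considering pairs that share no k-mer at all.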

Conclusions: We observed a clear trade-off between recall and precision in our homology inference. Whether to favor recall or precision depends strongly on the application. The clustering behavior of our program can be optimized for particular applications by altering a few key parameters. The program is available for public use at https://github.com/sheikhizadeh/pantools as an extension to our pan-genomic analysis tool, PanTools.
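The recall/precision trade-off can be made concrete with a small sketch. The abstract does not say how the two measures were computed, so the pairwise definition used here (counting pairs of proteins placed in the same group) is an assumption, and the groupings are made-up toy data.

```python
from itertools import combinations


def co_clustered_pairs(groups):
    """Return all unordered pairs of members that share a group."""
    pairs = set()
    for group in groups:
        pairs.update(combinations(sorted(group), 2))
    return pairs


def pairwise_precision_recall(predicted, reference):
    """Pairwise precision and recall of predicted groups against reference groups."""
    pred = co_clustered_pairs(predicted)
    ref = co_clustered_pairs(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall


# Toy reference homology groups and two hypothetical clusterings.
reference = [{"a", "b", "c"}, {"d", "e"}]
strict = [{"a", "b"}, {"c"}, {"d", "e"}]   # splits a true group
loose = [{"a", "b", "c", "d", "e"}]        # merges unrelated groups

print(pairwise_precision_recall(strict, reference))
print(pairwise_precision_recall(loose, reference))
```

Under this definition the strict clustering keeps every predicted pair correct but misses half of the true pairs, while the loose clustering recovers every true pair at the cost of many false ones.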

Keywords: Homologous genes; Orthology; Pan-genome; Protein similarity; k-mer.

MeSH terms

Algorithms
Brassicaceae / genetics
Cluster Analysis
Databases, Protein
Eukaryota / metabolism*
Genes, Plant
Genome
Genomics
Humans
Proteome / metabolism*
Sequence Homology, Amino Acid
Software

Substances

Proteome

Grants and funding

3184519600/Experimental Plant Sciences