Improvement in Protein Domain Identification Is Reached by Breaking Consensus, with the Agreement of Many Profiles and Domain Co-occurrence

Juliana Bernardes; Gerson Zaverucha; Catherine Vaquero; Alessandra Carbone

doi:10.1371/journal.pcbi.1005038

PLoS Comput Biol. 2016 Jul 29;12(7):e1005038. doi: 10.1371/journal.pcbi.1005038. eCollection 2016 Jul.

Authors

Juliana Bernardes¹, Gerson Zaverucha², Catherine Vaquero³, Alessandra Carbone¹,⁴

Affiliations

¹ Sorbonne Universités, UPMC Univ-Paris 6, CNRS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative, Paris, France.
² COPPE, Programa de Engenharia de Sistemas e Computação, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil.
³ Sorbonne Universités, UPMC Univ-Paris 6, INSERM U1135, CNRS ERL 8255, Centre d'Immunologie et des Maladies Infectieuses (CIMI-Paris), Paris, France.
⁴ Institut Universitaire de France, Paris, France.

Abstract

Traditional protein annotation methods describe known domains with probabilistic models representing consensus among homologous domain sequences. However, when relevant signals become too weak to be identified by a global consensus, annotation attempts fail. Here we address the fundamental question of domain identification for highly divergent proteins. By using high performance computing, we demonstrate that the limits of state-of-the-art annotation methods can be bypassed. We design a new strategy based on the observation that many structural and functional protein constraints are not globally conserved through all species but might be locally conserved in separate clades. We propose a novel exploitation of the large amount of data available: (1) for each known protein domain, several probabilistic clade-centered models are constructed from a large and differentiated panel of homologous sequences; (2) a decision-making protocol combines outcomes obtained from multiple models; (3) a multi-criteria optimization algorithm finds the most likely protein architecture. The method is evaluated for domain and architecture prediction over several datasets and statistical testing hypotheses. Its performance is compared against HMMScan and HHblits, two widely used search methods based on sequence-profile and profile-profile comparison. Due to their closeness to actual protein sequences, clade-centered models are shown to be more specific and functionally predictive than the broadly used consensus models. Based on them, we improved annotation of Plasmodium falciparum protein sequences on a scale not previously possible. We successfully predict at least one domain for 72% of P. falciparum proteins against 63% achieved previously, corresponding to a 30% improvement in the total number of Pfam domain predictions on the whole genome.
The method is applicable to any genome and opens new avenues to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age. Website and software: http://www.lcqb.upmc.fr/CLADE.
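
The three steps listed in the abstract can be illustrated with a toy sketch. This is not the CLADE implementation: the `Hit` record, the domain labels, the scores, and both functions are hypothetical, and a plain weighted interval scheduling stands in for the multi-criteria optimization (it ignores domain co-occurrence entirely). It only shows the general shape of combining hits from several clade-centered models and then choosing a non-overlapping architecture.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    domain: str   # domain family label (hypothetical)
    model: str    # clade-centered model that produced the hit (hypothetical)
    start: int    # 1-based inclusive start on the protein
    end: int      # 1-based inclusive end on the protein
    score: float  # assumed: higher means a more confident hit


def combine_models(hits):
    """Stand-in for step 2: among overlapping hits for the same domain,
    keep only the highest-scoring one."""
    kept = []
    for h in sorted(hits, key=lambda x: -x.score):
        clash = any(
            k.domain == h.domain and h.start <= k.end and k.start <= h.end
            for k in kept
        )
        if not clash:
            kept.append(h)
    return kept


def best_architecture(hits):
    """Stand-in for step 3: weighted interval scheduling, i.e. the set of
    non-overlapping hits with maximal total score."""
    hits = sorted(hits, key=lambda x: x.end)
    ends = [h.end for h in hits]
    best = [(0.0, [])]  # best[i] = (total score, chosen hits) over the first i hits
    for i, h in enumerate(hits):
        # j = number of earlier hits that end strictly before h starts
        j = bisect_right(ends, h.start - 1, 0, i)
        take = (best[j][0] + h.score, best[j][1] + [h])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1][1]


# Toy input: two models hit the same domain region; two other domains overlap.
hits = [
    Hit("DOM_A", "clade1", 10, 100, 50.0),
    Hit("DOM_A", "clade2", 12, 105, 62.0),
    Hit("DOM_B", "clade1", 90, 180, 40.0),
    Hit("DOM_C", "clade3", 110, 200, 45.0),
]
arch = best_architecture(combine_models(hits))
print([(h.domain, h.model) for h in arch])
```

In this toy input, the `clade2` model wins the `DOM_A` region over `clade1`, and `DOM_C` is preferred to `DOM_B` because `DOM_B` overlaps the stronger `DOM_A` hit.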

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Amino Acid Sequence
Computational Biology
Consensus Sequence
Databases, Protein
Plasmodium falciparum / genetics
Plasmodium falciparum / metabolism
Protein Domains*
Proteins / chemistry*
Proteins / genetics
Proteins / metabolism
Protozoan Proteins / chemistry
Protozoan Proteins / metabolism
Sequence Alignment / methods*
Sequence Analysis, Protein / methods*

Substances

Proteins
Protozoan Proteins

Grants and funding

This work, undertaken (partially) in the framework of CALSIMLAB, is supported by the public grant ANR-11-LABX-0037-0 from the “Investissements d’Avenir” program (ANR-11-IDEX-0004-02). Experiments were carried out using Grid’5000 (https://www.grid5000.fr) and the UPMC MESU machine financed by the project Equip@Meso (ANR-10-EQPX-29-01) of the “Investissements d’Avenir” program. Funds were also provided by the Institut Universitaire de France. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.