Alignment-Free Methods for the Detection and Specificity Prediction of Adenylation Domains

Guillermin Agüero-Chapin; Gisselle Pérez-Machado; Aminael Sánchez-Rodríguez; Miguel Machado Santos; Agostinho Antunes

doi:10.1007/978-1-4939-3375-4_16

Methods Mol Biol. 2016;1401:253-72. doi: 10.1007/978-1-4939-3375-4_16.

Authors

Guillermin Agüero-Chapin¹,², Gisselle Pérez-Machado², Aminael Sánchez-Rodríguez³, Miguel Machado Santos¹,⁴, Agostinho Antunes⁵,⁶

Affiliations

¹ CIMAR/CIIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Rua dos Bragas, 177, Porto, 4050-123, Portugal.
² Centro de Bioactivos Químicos, Universidad Central "Marta Abreu" de Las Villas (UCLV), Santa Clara, 54830, Cuba.
³ Departamento de Ciencias Naturales, Universidad Técnica Particular de Loja, San Cayetano Alto, S/N, Loja, Ecuador.
⁴ Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, Porto, 4169-007, Portugal.
⁵ CIMAR/CIIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Rua dos Bragas, 177, Porto, 4050-123, Portugal. aantunes@ciimar.up.pt.
⁶ Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, Porto, 4169-007, Portugal. aantunes@ciimar.up.pt.

PMID: 26831713
DOI: 10.1007/978-1-4939-3375-4_16

Abstract

Identifying adenylation domains (A-domains) and their substrate specificity can aid the detection of nonribosomal peptide synthetases (NRPS) at the genome/proteome level and allow the structure of oligopeptides with relevant biological activities to be inferred. However, this is a challenging task because of the high sequence diversity of A-domains (~10-40% amino acid identity) and their selectivity for 50 different natural/unnatural amino acids. Together, these characteristics make A-domain detection and substrate specificity prediction difficult with traditional sequence alignment methods, e.g., BLAST searches. In this chapter we describe two workflows based on alignment-free methods, intended for the identification and substrate specificity prediction of A-domains. To identify A-domains we introduce a graphical-numerical method, implemented in TI2BioP version 2.0 (topological indices to biopolymers), which in a first step uses protein four-color maps to represent A-domains. In a second step, simple topological indices (TIs), called spectral moments, are derived from the graphical representations of known A-domains (positive dataset) and of unrelated but well-characterized sequences (negative dataset). Spectral moments are then used as input predictors for statistical classification techniques to build alignment-free models. Finally, the resulting alignment-free models can be used to explore entire proteomes for unannotated A-domains. In addition, this graphical-numerical methodology works as a sequence-search method that can be combined in an ensemble with homology-based tools to explore the A-domain signature in depth and cope with the diversity of this class (Aguero-Chapin et al., PLoS One 8(7):e65926, 2013). The second workflow, for the prediction of A-domain substrate specificity, is based on alignment-free models constructed by transductive support vector machines (TSVMs) that incorporate information from uncharacterized A-domains.
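As a rough illustration of the spectral-moment step of the first workflow, the sketch below computes moments of order k as the trace of the k-th power of a matrix derived from a sequence. It is a toy stand-in and not the TI2BioP implementation: the four residue classes, the class weights, and the use of a simple path graph over consecutive residues (instead of a real four-color map) are all assumptions made for illustration.

```python
# Toy illustration only: NOT TI2BioP's four-color map construction or weights.

# Hypothetical grouping of amino acids into four classes (the "four colors").
CLASSES = {
    "nonpolar": "AVLIPFMWG",
    "polar": "STCYNQ",
    "acidic": "DE",
    "basic": "KRH",
}
# Hypothetical per-class weights placed on the matrix diagonal.
WEIGHTS = {"nonpolar": 0.0, "polar": 1.0, "acidic": 2.0, "basic": 3.0}


def residue_class(aa):
    """Return the class name of a one-letter amino acid code."""
    for name, members in CLASSES.items():
        if aa in members:
            return name
    raise ValueError(f"unknown residue: {aa!r}")


def sequence_matrix(seq):
    """Path-graph adjacency matrix with class weights on the diagonal."""
    n = len(seq)
    m = [[0.0] * n for _ in range(n)]
    for i, aa in enumerate(seq):
        m[i][i] = WEIGHTS[residue_class(aa)]
        if i + 1 < n:
            # Consecutive residues are connected (assumed graph topology).
            m[i][i + 1] = m[i + 1][i] = 1.0
    return m


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moments(seq, max_order=3):
    """Return [mu_0, ..., mu_max_order], where mu_k = trace(M^k)."""
    m = sequence_matrix(seq)
    n = len(seq)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    moments = []
    for _ in range(max_order + 1):
        moments.append(sum(power[i][i] for i in range(n)))
        power = matmul(power, m)
    return moments
```

In a setting like the one described, vectors such as `spectral_moments(seq)` for positive and negative sequences would serve as the input predictors of a statistical classifier.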
The construction of the models was implemented in the NRPSpredictor. In a first step, it uses the physicochemical fingerprint of the 34 residues lining the active site of the phenylalanine-adenylation domain of gramicidin synthetase A [PDB ID 1AMU] to derive a feature vector. Homologous positions were extracted for A-domains with known and unknown substrate specificities and turned into feature vectors. At the same time, A-domains with known specificities towards similar substrates were clustered by the physicochemical properties of the amino acids (AA). In a second step, support vector machines (SVMs) were optimized from the feature vectors of characterized A-domains in each of the resulting clusters. Later, the SVMs were used in their TSVM variant, which integrates a fraction of the uncharacterized A-domains during training, to predict unknown specificities. Finally, uncharacterized A-domains were scored by each of the constructed alignment-free models (TSVMs), one per substrate specificity resulting from the clustering. The model producing the largest score for an uncharacterized A-domain assigns its substrate specificity (Rausch et al., Nucleic Acids Res 33:5799-5808, 2005).
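The final assignment rule (score the domain with every per-specificity model and keep the largest score) can be sketched as follows. This is a minimal sketch under strong assumptions: each model is reduced to a linear decision function with made-up weights, the cluster labels are hypothetical, and the feature vector has three entries instead of a physicochemical encoding of 34 positions. It does not reproduce NRPSpredictor or TSVM training.

```python
# Minimal sketch of "largest score wins"; models and labels are hypothetical.

def decision_score(weights, bias, features):
    """Linear decision value w . x + b for one model."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_specificity(models, features):
    """Score `features` with every model and return (best_label, all_scores)."""
    scores = {
        label: decision_score(weights, bias, features)
        for label, (weights, bias) in models.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical stand-ins for trained per-cluster models: label -> (weights, bias).
TOY_MODELS = {
    "hydrophobic-aliphatic": ([1.0, -0.5, 0.0], 0.1),
    "aromatic": ([-0.2, 1.0, 0.3], -0.1),
    "polar-charged": ([0.0, -0.3, 1.2], 0.0),
}

# Made-up feature vector for an uncharacterized A-domain.
label, scores = predict_specificity(TOY_MODELS, [0.2, 0.9, 0.1])
```

With these toy numbers the "aromatic" model gives the largest decision value, so that label is assigned; with real models the scores would come from the trained TSVMs.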

Keywords: Adenylation domains; Alignment-free models; NRPS; Topological Indices; Transductive support vector machines.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Bacteria / chemistry
Bacteria / enzymology*
Bacteria / metabolism
Catalytic Domain
Computer Graphics
Models, Biological
Peptide Synthases / chemistry
Peptide Synthases / metabolism*
Protein Structure, Tertiary
Proteomics / methods*
Software
Substrate Specificity
Support Vector Machine*
Workflow

Substances

Peptide Synthases
non-ribosomal peptide synthase