Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction

Sci Rep. 2020 Sep 3;10(1):14634. doi: 10.1038/s41598-020-71450-8.

Abstract

The use of raw amino acid sequences as input for deep learning models for protein functional prediction has gained popularity in recent years. This scheme requires handling proteins of different lengths, whereas deep learning models need same-shape inputs. To accomplish this, zeros are usually added to each sequence up to an established common length in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is yet unknown. We propose and implement four novel strategies for padding the amino acid sequences, and then analyse the impact of the different padding schemes in a hierarchical Enzyme Commission number prediction problem. Results show that padding affects model performance even when convolutional layers are involved. In contrast to most deep learning works, which focus mainly on architectures, this study highlights the relevance of the often-overlooked padding step and raises awareness of the need to refine it for better performance. The code of this analysis is publicly available at https://github.com/b2slab/padding_benchmark.
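To illustrate the zero-padding step described in the abstract, the following is a minimal Python sketch, not the authors' implementation: the common length (500), the integer encoding of the 20 standard amino acids, and the pre-/post-padding variants shown are assumptions made for demonstration only; the actual strategies benchmarked in the paper are in the linked repository.

    # Minimal zero-padding sketch (assumed parameters; see the paper's
    # repository for the actual padding strategies that were benchmarked).
    import numpy as np

    MAX_LEN = 500                        # assumed common sequence length
    AA = "ACDEFGHIKLMNPQRSTVWY"          # 20 standard amino acids
    AA_TO_INT = {a: i + 1 for i, a in enumerate(AA)}  # 0 reserved for padding

    def encode_and_pad(seq, max_len=MAX_LEN, pre=False):
        """Map residues to integer codes and zero-pad (or truncate) to max_len."""
        codes = [AA_TO_INT.get(a, 0) for a in seq[:max_len]]
        pad = [0] * (max_len - len(codes))
        return np.array(pad + codes if pre else codes + pad, dtype=np.int32)

    # Example: post-padding vs. pre-padding of a short sequence
    print(encode_and_pad("MKT")[:6])           # [11  9 17  0  0  0]
    print(encode_and_pad("MKT", pre=True)[-6:])  # [ 0  0  0 11  9 17]

The resulting fixed-length integer arrays can then be fed to an embedding or one-hot layer so that all proteins share the same input shape, which is the requirement the padding step addresses.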

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence
  • Archaea / metabolism*
  • Archaeal Proteins / metabolism*
  • Deep Learning*

Substances

  • Archaeal Proteins