ProFeatX: A parallelized protein feature extraction suite for machine learning

David Guevara-Barrientos; Rakesh Kaundal

doi:10.1016/j.csbj.2022.12.044

Comput Struct Biotechnol J. 2022 Dec 29:21:796-801. doi: 10.1016/j.csbj.2022.12.044. eCollection 2023.

Authors

David Guevara-Barrientos¹ ², Rakesh Kaundal¹ ² ³

Affiliations

¹ Department of Computer Science, College of Science, Utah State University, Logan, UT, USA.
² Bioinformatics Facility, Center for Integrated BioSystems, Utah State University, Logan, UT, USA.
³ Department of Plants, Soils, and Climate, College of Agriculture and Applied Sciences, Utah State University, Logan, UT, USA.

Abstract

Machine learning algorithms have been successfully applied in proteomics, genomics, and transcriptomics, and have helped the biological community answer complex questions. However, most machine learning methods require large amounts of data, with every data point represented by a vector of the same size. Biological sequence data such as proteins are amino acid sequences of variable length, which makes it essential to extract a fixed number of features from every protein before it can be used as input to machine learning models. There are numerous methods to achieve this, but only a few tools let researchers encode their proteins using multiple schemes without having to use different programs or, in many cases, code these algorithms themselves or even devise new ones. In this work, we created ProFeatX, a tool that contains 50 encodings to extract protein features quickly and efficiently, supporting desktop as well as high-performance computing environments. It can also encode concatenated features for protein-protein interactions. The tool has an easy-to-use web interface, allowing non-experts to use feature extraction techniques, as well as a stand-alone version for advanced users. ProFeatX is implemented in C++ and available on GitHub at https://github.com/usubioinfo/profeatx. The web server is available at http://bioinfo.usu.edu/profeatx/.
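
To illustrate the fixed-length requirement described above, the following minimal sketch encodes proteins of any length as 20-dimensional vectors using amino acid composition, a widely used descriptor, and concatenates two such vectors to represent a protein pair. It is not ProFeatX code and does not reflect its interface; the names `AMINO_ACIDS`, `aac`, and `encode_pair`, as well as the choice to skip non-standard residues, are assumptions made for this example only.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids in a fixed order, so that every
# output vector has the same layout (assumption for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence: str) -> List[float]:
    """Amino acid composition: fraction of each standard residue.

    Non-standard symbols (e.g. X, B, Z) are ignored, which is one
    possible convention among several.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        # Empty or fully non-standard sequence: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def encode_pair(seq_a: str, seq_b: str) -> List[float]:
    """Concatenate the encodings of two proteins into one 40-dim vector."""
    return aac(seq_a) + aac(seq_b)


if __name__ == "__main__":
    short = aac("MKV")
    longer = aac("MKVLAAGIVGLLLAQPSW")
    # Both vectors have 20 entries regardless of sequence length.
    print(len(short), len(longer))
    print(len(encode_pair("MKV", "ACDEFG")))
```

Because the output length depends only on the alphabet and not on the sequence, vectors produced this way can be stacked directly into a feature matrix for a machine learning model.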

Keywords: Amino-acid sequence; Descriptors; Feature extraction; Machine learning; Protein-protein interactions.