Rapid multiple protein sequence search by parallel and heterogeneous computation

Jiefu Li; Ziyuan Wang; Xuwei Fan; Ruijie Yao; Guoqing Zhang; Rui Fan; Zefeng Wang

doi:10.1093/bioinformatics/btae151

Bioinformatics. 2024 Mar 29;40(4):btae151. doi: 10.1093/bioinformatics/btae151.

Authors

Jiefu Li¹, Ziyuan Wang², Xuwei Fan², Ruijie Yao³, Guoqing Zhang¹,⁴, Rui Fan², Zefeng Wang¹,⁴,⁵

Affiliations

¹ CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China.
² School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China.
³ Institute of Intelligent Computing Technology, Chinese Academy of Sciences, 88 Jinjihu Avenue, Suzhou, Jiangsu 215000, China.
⁴ Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China.
⁵ School of Life Science, Southern University of Science and Technology, Shenzhen, Guangdong 518055, China.

Abstract

Motivation: Protein sequence database search and multiple sequence alignment generation are fundamental tasks in many bioinformatics analyses. As the volume of sequence data continues to grow rapidly, there is an increasing need for efficient and scalable multiple sequence query algorithms that can search super-large databases without prohibitive time and computational costs.

Results: We introduce Chorus, a novel protein sequence query system that leverages a parallel model and a heterogeneous computation architecture to enable users to query thousands of protein sequences concurrently against large protein databases on a desktop workstation. Chorus achieves over 100× speedup over BLASTP without sacrificing sensitivity. We demonstrate the utility of Chorus through a case study that analyzes ∼1.5 TB of large-scale metagenomic datasets for novel CRISPR-Cas protein discovery within 30 min.
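
The abstract does not describe how Chorus works internally, so the following is only a toy illustration of the general idea of querying many sequences at once against a shared database index. It is not the Chorus algorithm. The k-mer index, the shared-k-mer threshold, and every function name here (build_kmer_index, search_one, search_many) are assumptions made for the sketch, and a thread pool stands in for the parallel and heterogeneous hardware the paper targets.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_kmer_index(database, k=3):
    """Map each k-mer to the set of database sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def search_one(query, index, k=3, min_shared=2):
    """Return sorted ids of database sequences that share at least
    min_shared distinct k-mers with the query (a crude seeding filter)."""
    counts = defaultdict(int)
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    for kmer in query_kmers:
        for seq_id in index.get(kmer, ()):
            counts[seq_id] += 1
    return sorted(s for s, c in counts.items() if c >= min_shared)


def search_many(queries, database, k=3, min_shared=2, workers=4):
    """Build the index once, then run all queries through a worker pool."""
    index = build_kmer_index(database, k)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = pool.map(
            lambda q: search_one(q, index, k, min_shared), queries.values()
        )
        # pool.map preserves input order, so keys and results line up.
        return dict(zip(queries.keys(), hits))


if __name__ == "__main__":
    # Hypothetical toy sequences, not real database entries.
    database = {"db1": "MKTAYIAKQR", "db2": "GGSGGSGGSG", "db3": "MKTAYLLKQR"}
    queries = {"q1": "MKTAYIAK", "q2": "GGSGG"}
    print(search_many(queries, database))
```

The point of the sketch is the shape of the computation: the index is built once and shared, and each query is an independent task, which is what makes batching thousands of queries attractive. In CPython, threads will not give real CPU parallelism for this pure-Python work, so the pool only illustrates the task structure.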

Availability and implementation: Chorus is open-source and its code repository is available at https://github.com/Bio-Acc/Chorus.

MeSH terms

Algorithms*
Amino Acid Sequence
Databases, Protein
Proteins
Software*

Substances

Proteins

Grants and funding

31730110/National Natural Science Foundation of China