sOCP: a framework predicting smORF coding potential based on TIS and in-frame features and effectively applied in the human genome

Brief Bioinform. 2024 Mar 27;25(3):bbae147. doi: 10.1093/bib/bbae147.

Abstract

Small open reading frames (smORFs) are known to play various roles in essential biological pathways and to affect human health in conditions ranging from diabetes to tumorigenesis. Predicting smORFs in silico is a prerequisite for processing omics data. Here, we propose sOCP, a framework for predicting smORF coding potential, which provides functions for constructing a model to predict novel smORFs in a given species. The sOCP model constructed for human is based on in-frame features and the nucleotide bias around the start codon; this small feature subset proved sufficient and avoids the overfitting problems of more complicated models. The model achieved better prediction metrics than previous methods and correlated closely with experimental evidence in a heterogeneous dataset. It was also applied to Rattus norvegicus and exhibited satisfactory performance. We then scanned the human genome for smORFs with ATG and non-ATG start codons and generated a database containing about a million novel smORFs with coding potential. Around 72 000 smORFs are located in lncRNA regions of the genome. The smORF-encoded peptides may be involved in biological pathways that are rare for canonical proteins, including the glucocorticoid catabolic process and the prokaryotic defense system. Our work provides a model and database for human smORF investigation and a convenient tool for smORF prediction in other species.
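To make the scanning step concrete, the following is a minimal sketch of how candidate smORFs with ATG and non-ATG start codons might be enumerated on one strand, together with the sequence window around each start codon from which TIS features could be derived. It is not the sOCP implementation: the function name `scan_smorfs`, the set of non-ATG start codons, the 100-codon upper limit (a common convention for smORFs, not stated in the abstract), and the flank width are all assumptions chosen for illustration, and no coding-potential model is included.

```python
# Illustrative sketch only; names and thresholds are assumptions, not sOCP's.
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: ATG plus three example non-ATG (near-cognate) start codons.
START_CODONS = {"ATG", "CTG", "GTG", "TTG"}


def scan_smorfs(seq, starts=START_CODONS, min_codons=10, max_codons=100, flank=6):
    """Enumerate candidate smORFs on the forward strand of `seq`.

    Each candidate runs from a start codon to the first in-frame stop codon.
    `n_codons` counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] not in starts:
            continue
        # Walk codon by codon in the frame set by the start codon.
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOP_CODONS:
                n_codons = (j - i) // 3
                if min_codons <= n_codons <= max_codons:
                    hits.append({
                        "start": i,            # 0-based, inclusive
                        "end": j + 3,          # 0-based, exclusive (includes stop)
                        "start_codon": seq[i:i + 3],
                        "n_codons": n_codons,
                        # Window around the start codon, clipped at sequence ends.
                        "tis_context": seq[max(0, i - flank):i + 3 + flank],
                    })
                break  # stop at the first in-frame stop codon
    return hits


if __name__ == "__main__":
    demo = "CCCGCCATGGCTGCTGCTTAAGG"
    for hit in scan_smorfs(demo, min_codons=2):
        print(hit)
```

A genome-wide scan would also need to cover the reverse strand and, for transcript-based annotation such as lncRNA regions, spliced sequences; both are omitted here for brevity.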

Keywords: coding potential; genome annotation; machine learning; microprotein; small ORF.

MeSH terms

  • Animals
  • Genome, Human*
  • Humans
  • Open Reading Frames
  • Peptides* / genetics
  • Proteins / genetics
  • Rats

Substances

  • Peptides
  • Proteins