De novo approach to classify protein-coding and noncoding transcripts based on sequence composition

Methods Mol Biol. 2014:1182:203-7. doi: 10.1007/978-1-4939-1062-5_18.

Abstract

New transcripts are being discovered across the genome every day, especially in poorly annotated species, thanks to rapid progress in high-throughput technologies such as RNA sequencing. This raises the challenge of how to classify newly identified transcripts as protein coding or noncoding. Here, we describe a de novo approach named coding-noncoding index (CNCI), a powerful signature tool that profiles adjoining nucleotide triplets (ANT) to effectively distinguish protein-coding from noncoding sequences independently of known annotations. The main advantage of CNCI is its ability to accurately classify transcripts assembled from whole-transcriptome sequencing data in a cross-species manner, which allows it to be applied to all vertebrates and invertebrates using training data from well-annotated species (such as human and Arabidopsis). In this chapter, we illustrate the CNCI method in detail through an example of RNA-sequencing data generated from six biological replicates of six mouse tissues. CNCI software is available at http://www.bioinfo.org/software/cnci.
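
To make the idea of profiling adjoining nucleotide triplets concrete, the sketch below counts pairs of adjacent, non-overlapping triplets in one reading frame of a sequence and converts the counts to relative frequencies. It is an illustration only and is not the CNCI implementation: the function names `ant_counts` and `ant_frequencies`, the `frame` parameter, and the choice to skip pairs containing ambiguous bases are all assumptions made for this example.

```python
from collections import Counter

BASES = set("ACGT")


def ant_counts(seq, frame=0):
    """Count adjoining nucleotide triplet (ANT) pairs in one reading frame.

    Hypothetical helper for illustration; not part of the CNCI software.
    """
    # Normalise case and treat RNA input (U) as DNA (T).
    seq = seq.upper().replace("U", "T")
    # Split the sequence into consecutive, non-overlapping triplets.
    triplets = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    counts = Counter()
    # Pair each triplet with the one immediately downstream of it.
    for left, right in zip(triplets, triplets[1:]):
        # Assumption: skip pairs containing ambiguous bases such as N.
        if set(left + right) <= BASES:
            counts[(left, right)] += 1
    return counts


def ant_frequencies(seq, frame=0):
    """Convert ANT pair counts into relative frequencies (empty dict if none)."""
    counts = ant_counts(seq, frame)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    # Toy sequence chosen for this example, not real transcript data.
    for pair, freq in sorted(ant_frequencies("ATGGCCATGGCC").items()):
        print(pair, round(freq, 3))
```

A classifier built on such profiles would compare the frequencies observed in a transcript against profiles learned from known coding and noncoding sequences; how CNCI itself scores and combines them is described in the chapter and is not reproduced here.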

MeSH terms

  • Animals
  • Computational Biology / methods*
  • Humans
  • Molecular Sequence Annotation
  • Proteins / genetics*
  • RNA, Messenger / genetics*
  • RNA, Untranslated / genetics*
  • Sequence Analysis, RNA*

Substances

  • Proteins
  • RNA, Messenger
  • RNA, Untranslated