A new approach for the identification of processed pseudogenes

J Comput Biol. 2010 May;17(5):755-65. doi: 10.1089/cmb.2009.0027.

Abstract

Processed pseudogenes are DNA sequences generated through reverse transcription (RT) and retrotransposition of mature mRNAs. These sequences are usually considered junk DNA, since in most cases they lack a suitable promoter and are no longer transcribed. Nonetheless, because of their origin, they represent a valuable source of information on the transcriptome, which is particularly useful for organisms lacking large EST collections. Here, we describe REtrotransposed Gene EXPlorer (REGEXP), a new method for the systematic identification of retrotransposition events that, unlike existing approaches, does not rely on a priori knowledge of mRNA sequences. Using our pipeline, we identified 2288 processed pseudogenes in the human genome, which show good overlap with the ENSEMBL, VEGA, and pseudogene.org datasets. These pseudogenes could be traced back to 987 genes, most of which correspond to already known genes. In many cases, we recovered the signature of additional exons, likely due to alternative splicing. Interestingly, some of our predictions did not match previously known or predicted genes, and we were able to validate most of them by RT-polymerase chain reaction (PCR). Similar results were obtained with the mouse genome. Our data show that the REGEXP method is capable of identifying processed pseudogenes and of predicting most of the corresponding genes with high specificity. Therefore, it may represent a valuable addition to current genome annotation pipelines.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Animals
  • Base Sequence
  • Databases, Nucleic Acid
  • Genome, Human
  • Humans
  • Introns
  • Mice
  • Molecular Sequence Data
  • Pseudogenes*
  • Retroelements / genetics
  • Sequence Alignment
  • Sequence Analysis, DNA / methods*

Substances

  • Retroelements