A new approach for the identification of processed pseudogenes

Ivan Molineris; Gabriele Sales; Federico Bianchi; Ferdinando Di Cunto; Michele Caselle

doi:10.1089/cmb.2009.0027

J Comput Biol. 2010 May;17(5):755-65. doi: 10.1089/cmb.2009.0027.

Authors

Ivan Molineris¹, Gabriele Sales, Federico Bianchi, Ferdinando Di Cunto, Michele Caselle

Affiliation

¹ Theoretical Physics Department, Università di Torino, Torino, Italy.

PMID: 20500026
DOI: 10.1089/cmb.2009.0027

Abstract

Processed pseudogenes are DNA sequences generated through reverse transcription (RT) and retrotransposition of mature mRNAs. These sequences are usually considered junk DNA, since in most cases they lack a suitable promoter and are no longer transcribed. Nonetheless, due to their origin, they represent a valuable source of information on the transcriptome, which becomes particularly interesting for organisms lacking large EST collections. Here, we describe REtrotransposed Gene EXPlorer (REGEXP), a new method for the systematic identification of retrotransposition events that, unlike existing approaches, does not rely on a priori knowledge of mRNA sequences. Using our pipeline, we were able to identify 2288 processed pseudogenes in the human genome, showing a good overlap with the ENSEMBL, VEGA, and pseudogene.org datasets. These pseudogenes could be traced back to 987 genes, mostly corresponding to already known genes. In many cases, we recovered the signature of additional exons, likely due to alternative splicing. Interestingly, some of our predictions did not match previously known or predicted genes, and we were able to validate most of them by RT-polymerase chain reaction (PCR). Similar results were obtained with the mouse genome. Our data show that the REGEXP method is capable of identifying processed pseudogenes and of predicting most of the corresponding genes with high specificity. Therefore, it may represent a valuable addition to current genome annotation pipelines.
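
The abstract does not spell out the REGEXP algorithm, so the following is only an illustrative sketch of the general idea that makes processed pseudogenes informative about gene structure. Because they derive from mature mRNAs, they lack introns, so aligning one against its parent genomic locus should leave gaps on the parent side where introns were spliced out. The function name `infer_introns`, the block representation, and the thresholds `min_intron` and `max_query_gap` are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical alignment block: (pseudogene_start, parent_start, length).
# Coordinates are 0-based; each block is an ungapped aligned segment.
Block = Tuple[int, int, int]


def infer_introns(
    blocks: List[Block],
    min_intron: int = 30,      # assumed minimum gap size to call an intron
    max_query_gap: int = 5,    # assumed tolerance for small gaps in the pseudogene
) -> List[Tuple[int, int]]:
    """Return half-open parent-locus intervals that look like spliced-out introns.

    Two consecutive blocks suggest an intron when they are (nearly) contiguous
    in the pseudogene but separated by a long stretch in the parent locus.
    """
    introns = []
    ordered = sorted(blocks)  # sort by pseudogene start coordinate
    for (q1, p1, n1), (q2, p2, _n2) in zip(ordered, ordered[1:]):
        query_gap = q2 - (q1 + n1)    # unaligned bases in the pseudogene
        parent_gap = p2 - (p1 + n1)   # unaligned bases in the parent locus
        if 0 <= query_gap <= max_query_gap and parent_gap >= min_intron:
            introns.append((p1 + n1, p2))
    return introns


# Toy example with made-up coordinates: three exon-like blocks.
example_blocks = [(0, 1000, 120), (120, 1620, 80), (200, 2500, 150)]
example_introns = infer_introns(example_blocks)
```

In this toy input the three blocks are contiguous in the pseudogene but separated by 500 and 800 bases in the parent locus, so two candidate introns are reported. A real pipeline would also need to handle strand, alignment scoring, and poly(A) tails, none of which are modeled here.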

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Animals
Base Sequence
Databases, Nucleic Acid
Genome, Human
Humans
Introns
Mice
Molecular Sequence Data
Pseudogenes*
Retroelements / genetics
Sequence Alignment
Sequence Analysis, DNA / methods*

Substances

Retroelements