A complexity reduction algorithm for analysis and annotation of large genomic sequences

Trees-Juen Chuang; Wen-Chang Lin; Hurng-Chun Lee; Chi-Wei Wang; Keh-Lin Hsiao; Zi-Hao Wang; Danny Shieh; Simon C Lin; Lan-Yang Ch'ang

doi:10.1101/gr.313703

Genome Res. 2003 Feb;13(2):313-22. doi: 10.1101/gr.313703.

Authors

Trees-Juen Chuang¹, Wen-Chang Lin, Hurng-Chun Lee, Chi-Wei Wang, Keh-Lin Hsiao, Zi-Hao Wang, Danny Shieh, Simon C Lin, Lan-Yang Ch'ang

Affiliation

¹ Bioinformatics Research Center, Institute of Biomedical Sciences, Academia Sinica, Taipei 11529, Taiwan.

Abstract

DNA is a universal language encrypted with biological instruction for life. In higher organisms, the genetic information is preserved predominantly in an organized exon/intron structure. When a gene is expressed, the exons are spliced together to form the transcript for protein synthesis. We have developed a complexity reduction algorithm for sequence analysis (CRASA) that enables direct alignment of cDNA sequences to the genome. This method features a progressive data structure in hierarchical orders to facilitate a fast and efficient search mechanism. CRASA implementation was tested with already annotated genomic sequences in two benchmark data sets and compared with 15 annotation programs (10 ab initio and 5 homology-based approaches) against the EST database. By the use of layered noise filters, the complexity of CRASA-matched data was reduced exponentially. The results from the benchmark tests showed that CRASA annotation excelled in both the sensitivity and specificity categories. When CRASA was applied to the analysis of human Chromosomes 21 and 22, an additional 83 potential genes were identified. With its large-scale processing capability, CRASA can be used as a robust tool for genome annotation with high accuracy by matching the EST sequences precisely to the genomic sequences.
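
The abstract does not spell out CRASA's data structure or filters, so the following is only a generic illustration of the idea it describes: matching a spliced cDNA directly to genomic sequence and discarding noise in successive filtering steps. It is a minimal sketch using a plain k-mer index with exact seed matches, not the authors' implementation. All function names, the seed length `k`, the `min_len` threshold, and the toy sequences are assumptions made for this example.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def seed_hits(cdna, index, k):
    """Return (cdna_pos, genome_pos) for every exact k-mer match."""
    hits = []
    for j in range(len(cdna) - k + 1):
        for i in index.get(cdna[j:j + k], ()):
            hits.append((j, i))
    return hits

def exon_candidates(hits, k, min_len):
    """Filter 1 (hypothetical): merge consecutive seeds on the same diagonal
    (genome_pos - cdna_pos) into ungapped blocks and drop blocks shorter
    than min_len. Each block is (cdna_start, cdna_end, genome_start, genome_end)."""
    by_diag = defaultdict(list)
    for j, i in hits:
        by_diag[i - j].append(j)
    blocks = []
    for d, js in by_diag.items():
        js.sort()
        run_start = prev = js[0]
        for j in js[1:] + [None]:          # None flushes the final run
            if j is not None and j == prev + 1:
                prev = j
                continue
            if prev + k - run_start >= min_len:
                blocks.append((run_start, prev + k, run_start + d, prev + k + d))
            if j is not None:
                run_start = prev = j
    return sorted(blocks)

def colinear_chain(blocks):
    """Filter 2 (hypothetical): greedily keep blocks that advance in both
    the cDNA and the genome, as exons of one transcript would."""
    chain = []
    for block in blocks:
        if not chain or (block[0] >= chain[-1][1] and block[2] >= chain[-1][3]):
            chain.append(block)
    return chain

# Toy data (made up for illustration): two exons separated by one intron.
exon1 = "ACGTTGCAAGGCTA"
intron = "GTAAGTTTTTTTTTTCAG"
exon2 = "CCATGGATCCGTAA"
genome = "TTTT" + exon1 + intron + exon2 + "GGGG"
cdna = exon1 + exon2          # spliced transcript: exons joined, intron removed

k = 6
index = build_kmer_index(genome, k)
blocks = colinear_chain(exon_candidates(seed_hits(cdna, index, k), k, min_len=10))
for cdna_start, cdna_end, g_start, g_end in blocks:
    print(f"cDNA[{cdna_start}:{cdna_end}] -> genome[{g_start}:{g_end}]")
```

Each reported block pairs a stretch of the cDNA with the genomic interval it matches, so the gap between consecutive genomic intervals corresponds to a candidate intron. A practical tool would also need to tolerate sequencing errors and refine splice-site boundaries, which this exact-match sketch does not attempt.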

Publication types

Comparative Study
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Chromosomes, Human, Pair 21 / genetics
Chromosomes, Human, Pair 22 / genetics
DNA / analysis*
DNA / genetics
DNA, Complementary / analysis
DNA, Complementary / genetics
Exons / genetics
Expressed Sequence Tags
Genes / genetics
Genome, Human
Humans
Pseudogenes / genetics
Reproducibility of Results
Sensitivity and Specificity
Sequence Alignment / methods
Sequence Analysis, DNA / methods
Sequence Analysis, DNA / trends
Sequence Homology, Nucleic Acid

Substances

DNA, Complementary
DNA