Evaluation of genome sequencing quality in selected plant species using expressed sequence tags

PLoS One. 2013 Jul 29;8(7):e69890. doi: 10.1371/journal.pone.0069890. Print 2013.

Abstract

Background: With the completion of genome sequencing projects for more than 30 plant species, large volumes of genome sequences have been produced and stored in online databases. Advancements in sequencing technologies have reduced the cost and time of whole genome sequencing, enabling more and more plants to be sequenced. Despite this, the genome sequence quality of multiple plants has not been evaluated.

Methodology/principal finding: Integrity and accuracy were calculated to evaluate the genome sequence quality of 32 plants. The integrity of a genome sequence is represented by the ratio of chromosome size to genome size (or of scaffold size to genome size), which ranged from 55.31% to nearly 100%. The accuracy of a genome sequence is represented by the ratio of matched ESTs to selected ESTs: 52.93% to 98.28% and 89.02% to 98.85% of the randomly selected clean ESTs could be mapped to chromosome and scaffold sequences, respectively. According to the integrity, accuracy and other analyses of each plant species, thirteen plant species were divided into four levels. Arabidopsis thaliana, Oryza sativa and Zea mays had the highest quality, followed by Brachypodium distachyon, Populus trichocarpa, Vitis vinifera and Glycine max; Sorghum bicolor, Solanum lycopersicum and Fragaria vesca; and Lotus japonicus, Medicago truncatula and Malus × domestica, in that order. Assembling the scaffold sequences into chromosome sequences should be the primary task for the remaining nineteen species. Low GC content and repeat DNA influence genome sequence assembly.
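
As an illustration only, the two ratios described above can be sketched in a few lines of Python. The function names `integrity` and `accuracy` and all input numbers are hypothetical and are not taken from the study; the sketch assumes both metrics are simple percentages of the stated numerators and denominators.

```python
def integrity(assembled_bp: int, genome_bp: int) -> float:
    """Assembled (chromosome or scaffold) size as a percentage of genome size."""
    if genome_bp <= 0:
        raise ValueError("genome_bp must be positive")
    return 100.0 * assembled_bp / genome_bp


def accuracy(matched_ests: int, selected_ests: int) -> float:
    """Matched ESTs as a percentage of the randomly selected clean ESTs."""
    if selected_ests <= 0:
        raise ValueError("selected_ests must be positive")
    return 100.0 * matched_ests / selected_ests


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the arithmetic.
    print(f"integrity: {integrity(119_000_000, 125_000_000):.2f}%")
    print(f"accuracy:  {accuracy(9_500, 10_000):.2f}%")
```

How an EST is counted as "matched" (alignment tool, identity and coverage thresholds) is not specified in the abstract, so the sketch takes the matched count as a given input.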

Conclusion: The quality of plant genome sequences was found to be lower than envisaged. The rapid development of genome sequencing projects, together with research on bioinformatics tools and genome sequence assembly algorithms, should therefore provide increased processing and correction of genome sequences that have already been published.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Arabidopsis / genetics
  • Expressed Sequence Tags*
  • Genome, Plant / genetics*
  • Oryza / genetics
  • Zea mays / genetics

Grants and funding

This work was supported by a Special project of the Ministry of Science and Technology (2012FY110100-3), the National Natural Science Foundation of China (31171273), the Special Fund for Independent innovation of Agricultural Science and Technology in Jiangsu province (SCX(11)2044) and the Jiangsu Province Postgraduate Cultivation Innovation Project (No. CXZZ12_0284). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.