OMIGA: Optimized Maker-Based Insect Genome Annotation

Mol Genet Genomics. 2014 Aug;289(4):567-73. doi: 10.1007/s00438-014-0831-7. Epub 2014 Mar 9.

Abstract

Insects are one of the largest classes of animals on Earth and constitute more than half of all living species. The i5k initiative has begun sequencing more than 5,000 insect genomes, which should greatly aid the exploration of insect resources and pest control. Insect genome annotation remains challenging because many insects have high levels of heterozygosity. To improve the quality of insect genome annotation, we developed a pipeline, named Optimized Maker-Based Insect Genome Annotation (OMIGA), to predict protein-coding genes from insect genomes. We first mapped RNA-Seq reads to genomic scaffolds using Bowtie to determine transcribed regions, and the putative transcripts were assembled using Cufflinks. We then selected highly reliable transcripts with intact coding sequences to train de novo gene prediction software, including Augustus. The re-trained software was used to predict genes from insect genomes. Exonerate was used to refine gene structures and to determine near-exact exon/intron boundaries in the genome. Finally, we used the software Maker to integrate data from RNA-Seq, de novo gene prediction, and protein alignment to produce an official gene set. The OMIGA pipeline was used to annotate the draft genome of an important insect pest, Chilo suppressalis, yielding 12,548 genes. A comparison of different annotation strategies showed that OMIGA performed best. In summary, we present a comprehensive pipeline for identifying genes in insect genomes that can be widely used to improve the annotation quality in insects. OMIGA is provided at http://ento.njau.edu.cn/omiga.html .
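
The abstract does not state how transcripts "with intact coding sequences" were identified. As a rough illustration only, the sketch below assumes one common working definition: the sequence starts with ATG, ends with a stop codon, has a length divisible by three, and contains no in-frame internal stop. The function names `has_intact_cds` and `select_training_transcripts` are hypothetical and are not part of OMIGA.

```python
# Hypothetical filter for transcripts with an intact coding sequence (CDS).
# The criteria below are an assumption, not the published OMIGA rules.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_intact_cds(seq: str) -> bool:
    """Return True if seq looks like a complete CDS under the assumed criteria."""
    seq = seq.upper()
    # Need at least a start and a stop codon, and a whole number of codons.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Must begin with a start codon and end with a stop codon.
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Reject sequences with a premature in-frame stop codon.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def select_training_transcripts(transcripts: dict) -> list:
    """Return the names of transcripts that pass the intact-CDS check."""
    return [name for name, seq in transcripts.items() if has_intact_cds(seq)]
```

In a real workflow the selected transcripts would then be converted to whatever training format the gene predictor expects; that step is outside the scope of this sketch.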

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence
  • Animals
  • Base Sequence
  • Butterflies / genetics*
  • Exons
  • Genetic Markers / genetics
  • Genome, Insect / genetics*
  • Genomics*
  • High-Throughput Nucleotide Sequencing
  • Introns
  • Molecular Sequence Annotation*
  • Molecular Sequence Data
  • Moths / genetics*
  • Open Reading Frames
  • Oryza / parasitology
  • Plant Diseases / parasitology
  • Repetitive Sequences, Nucleic Acid / genetics
  • Sequence Alignment
  • Sequence Analysis, RNA
  • Software

Substances

  • Genetic Markers

Associated data

  • GENBANK/ANCD00000000
  • SRA/SRA060774