Fast and accurate genome comparison using genome images: The Extended Natural Vector Method

Mol Phylogenet Evol. 2019 Dec:141:106633. doi: 10.1016/j.ympev.2019.106633. Epub 2019 Sep 26.

Abstract

Numerical methods for genome comparison have long been important in bioinformatics. The Chaos Game Representation (CGR) is an effective genome sequence mapping technique that converts genome sequences into CGR images. To each CGR image, we associate a vector called an Extended Natural Vector (ENV), which is based on the distribution of intensity values in the image. This mapping produces a one-to-one correspondence between CGR images and their ENVs. We define the distance between two DNA sequences as the distance between their associated ENVs. We cluster and classify several datasets, including Influenza A viruses, Bacillus genomes, and Conoidea mitochondrial genomes, and build their phylogenetic trees. Results show that our method combining CGR with the ENV (CGR-ENV) compares favorably in classification accuracy and efficiency with the multiple sequence alignment (MSA) method and with other alignment-free methods. The research provides insights into the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes.
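The pipeline described in the abstract (sequence to CGR image, image to vector, distance between vectors) can be illustrated with a minimal Python sketch. The abstract does not give the ENV formula, so the descriptor below is not the authors' ENV. It is a hypothetical stand-in in the natural-vector style: for each intensity level it records the pixel count, the mean pixel position, and a normalized second central moment. The corner assignment (A, C, G, T), the resolution parameter `k`, the `levels` cap on intensity, the use of Euclidean distance, and all function names are assumptions made for illustration.

```python
import numpy as np

# Assumed corner layout of the unit square (a common CGR convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_image(seq, k=3):
    """Return a 2^k x 2^k count image of the CGR points of a DNA sequence."""
    size = 2 ** k
    img = np.zeros((size, size), dtype=int)
    x, y = 0.5, 0.5  # start at the center of the square
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        img[row, col] += 1
    return img


def intensity_descriptor(img, levels=4):
    """Hypothetical natural-vector-style descriptor (NOT the paper's ENV).

    For each intensity level v in 0..levels-1 (higher counts are clipped),
    store: number of pixels at v, their mean position in the flattened
    image, and a normalized second central moment of those positions.
    """
    flat = np.clip(img.ravel(), 0, levels - 1)
    pos = np.arange(1, flat.size + 1, dtype=float)
    feats = []
    for v in range(levels):
        idx = pos[flat == v]
        n = idx.size
        mean = idx.mean() if n else 0.0
        moment = ((idx - mean) ** 2).sum() / (n * flat.size) if n else 0.0
        feats.extend([float(n), mean, moment])
    return np.array(feats)


def descriptor_distance(seq_a, seq_b, k=3, levels=4):
    """Euclidean distance between the descriptors of two sequences."""
    va = intensity_descriptor(cgr_image(seq_a, k), levels)
    vb = intensity_descriptor(cgr_image(seq_b, k), levels)
    return float(np.linalg.norm(va - vb))
```

A pairwise matrix of such distances could then be passed to a standard clustering or tree-building routine. Whether this stand-in descriptor is one-to-one with the image, as the abstract states for the ENV, is not guaranteed.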

Keywords: Chaos game representation; Extended natural vector; Genome comparison.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Base Sequence
  • DNA / genetics
  • Genome*
  • Genome, Mitochondrial
  • Genomics*
  • Markov Chains
  • Phylogeny

Substances

  • DNA