EGM: encapsulated gene-by-gene matching to identify gene orthologs and homologous segments in genomes

Bioinformatics. 2010 Sep 1;26(17):2076-84. doi: 10.1093/bioinformatics/btq339. Epub 2010 Jun 27.

Abstract

Motivation: Identification of functionally equivalent genes in different species is essential to understand the evolution of biological pathways and processes. At the same time, identification of strings of conserved orthologous genes helps identify complex genomic rearrangements across different organisms. Such insight is particularly useful, for example, in the transfer of experimental results between different experimental systems such as Drosophila and mammals.

Results: Here, we describe the Encapsulated Gene-by-gene Matching (EGM) approach, a method that employs a graph matching strategy to identify gene orthologs and conserved gene segments. Given a pair of genomes, EGM constructs a global gene match for all genes, taking into account gene context and family information. The Hungarian method for identifying the maximum weight matching in bipartite graphs is employed; the resulting matching reveals one-to-one correspondences between nodes (genes) in a manner that maximizes the gene similarity and context.
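The matching step can be illustrated with a minimal sketch. It is not the EGM implementation: the gene labels, the score matrix and the function names below are hypothetical, and each score is assumed to already combine sequence similarity with a context term. The sketch solves the maximum weight bipartite matching with a standard O(n^3) formulation of the Hungarian method, applied to negated scores.

```python
from typing import List, Tuple


def hungarian_min_cost(cost: List[List[float]]) -> List[int]:
    """Minimum-cost assignment; returns the column chosen for each row.

    Requires len(cost) <= len(cost[0]). Uses 1-based internal indices,
    with column 0 acting as a virtual start node for each augmentation.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (0 = free)
    way = [0] * (m + 1)   # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so at least one new tight edge appears.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:      # reached a free column: path is complete
                break
        while j0:               # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    return assignment


def match_genes(genes_a: List[str], genes_b: List[str],
                score: List[List[int]]) -> List[Tuple[str, str, int]]:
    """One-to-one gene pairs that maximize the total score."""
    cost = [[-s for s in row] for row in score]  # maximize by negating
    cols = hungarian_min_cost(cost)
    return [(genes_a[i], genes_b[j], score[i][j]) for i, j in enumerate(cols)]


# Hypothetical genes and scores (0-100), for illustration only.
genes_a = ["hs_g1", "hs_g2", "hs_g3"]
genes_b = ["mm_g1", "mm_g2", "mm_g3"]
score = [
    [90, 80, 10],
    [85, 20, 10],
    [10, 30, 70],
]
pairs = match_genes(genes_a, genes_b, score)
total = sum(s for _, _, s in pairs)
```

In this toy matrix, pairing hs_g1 with its single best hit (mm_g1, score 90) would leave hs_g2 with a poor partner. The global matching instead pairs hs_g1 with mm_g2 and hs_g2 with mm_g1, which gives a higher total score. This is the sense in which a maximum weight matching differs from choosing best hits gene by gene.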

Conclusion: We tested our approach by performing several comparisons, including a detailed Human versus Mouse genome mapping. We find that the algorithm is robust and sensitive in detecting orthologs and conserved gene segments. EGM can sensitively detect rearrangements within large and small chromosomal segments. The EGM tool is fully automated and easier to use than more complex methods that require extensive manual intervention and input.

Availability: The EGM software, Supplementary information and other tools are available online from http://vbc.med.monash.edu.au/~kmahmood/EGM.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Animals
  • Chromosome Mapping / methods*
  • Comparative Genomic Hybridization
  • Conserved Sequence
  • Humans
  • Mice
  • Software
  • Synteny