Indexing molecules with chemical graph identifiers

J Comput Chem. 2011 Sep;32(12):2638-46. doi: 10.1002/jcc.21843. Epub 2011 Jun 6.

Abstract

Fast and robust algorithms for indexing molecules have long been considered strategic tools for managing and storing large chemical libraries. This work introduces a modified and extended version of the molecular equivalence number naming adaptation of the Morgan algorithm (J Chem Inf Comput Sci 2001, 41, 181-185) for generating a chemical graph identifier (CGI). The new version corrects the collisions identified in the original adaptation and handles graph canonicalization, ensembles (salts), and isomerism (tautomerism, regioisomerism, optical isomerism, and geometrical isomerism) in a flexible manner. The current CGI implementation was validated on the open NCI database and the drug-like subset of the ZINC database, which contain 260,071 and 5,348,089 structures, respectively. The results were compared with those obtained with some of the most widely used indexing codes, such as the CACTVS hash code and the new InChIKey. The analyses show that compound management activities, such as duplicate analysis of chemical libraries, are sensitive to the exact definition of compound uniqueness and thus still depend, to a minor extent, on the type and flexibility of the molecular index being used.
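To make the idea of a Morgan-style graph identifier more concrete, the following is a minimal sketch in Python. It is not the CGI algorithm described in this work, whose details are not given in the abstract. It only illustrates the general technique of iteratively refining atom labels from their neighbours and hashing the result into a numbering-independent key, followed by a simple duplicate analysis. The function names (`morgan_style_identifier`, `find_duplicates`), the input format (element symbols plus `(i, j, bond_order)` tuples on a hydrogen-suppressed graph), and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def _digest(text):
    """Short, stable hash of a string (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def morgan_style_identifier(atoms, bonds):
    """Return a numbering-independent key for a molecular graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j, bond_order) tuples indexing into atoms
    """
    neighbours = defaultdict(list)
    for i, j, order in bonds:
        neighbours[i].append((j, order))
        neighbours[j].append((i, order))

    # Initial atom invariant: element symbol and number of neighbours.
    labels = [_digest(f"{sym}|{len(neighbours[i])}") for i, sym in enumerate(atoms)]

    # Refine each label with the sorted labels of its neighbours.
    # One round per atom is a simple upper bound for this sketch.
    for _ in range(len(atoms)):
        refined = []
        for i in range(len(atoms)):
            env = sorted(f"{order}:{labels[j]}" for j, order in neighbours[i])
            refined.append(_digest(labels[i] + "(" + ",".join(env) + ")"))
        labels = refined

    # Sorting removes any dependence on the input atom numbering.
    return _digest(".".join(sorted(labels)))


def find_duplicates(library):
    """Group names in {name: (atoms, bonds)} that share an identifier."""
    groups = defaultdict(list)
    for name, (atoms, bonds) in library.items():
        groups[morgan_style_identifier(atoms, bonds)].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


if __name__ == "__main__":
    library = {
        "ethanol_a": (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)]),
        "ethanol_b": (["O", "C", "C"], [(0, 1, 1), (1, 2, 1)]),
        "dimethyl_ether": (["C", "O", "C"], [(0, 1, 1), (1, 2, 1)]),
    }
    print(find_duplicates(library))
```

In this sketch the two differently numbered ethanol graphs share a key, while dimethyl ether receives a different one. Neighbourhood refinement of this kind is known to assign the same key to some non-identical graphs (for example, certain highly regular ones), which is one kind of collision an indexing scheme has to guard against. The sketch also ignores salts, tautomers, and stereochemistry, so what counts as a "duplicate" here is only one possible definition of compound uniqueness.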

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Abstracting and Indexing / methods*
  • Algorithms*
  • Databases, Factual
  • Molecular Conformation
  • Organic Chemicals
  • Pharmaceutical Preparations / chemistry
  • Small Molecule Libraries / chemistry*

Substances

  • Organic Chemicals
  • Pharmaceutical Preparations
  • Small Molecule Libraries