Validation and extraction of molecular-geometry information from small-molecule databases

Acta Crystallogr D Struct Biol. 2017 Feb 1;73(Pt 2):103-111. doi: 10.1107/S2059798317000079. Epub 2017 Feb 1.

Abstract

A freely available small-molecule structure database, the Crystallography Open Database (COD), is used for the extraction of molecular-geometry information on small-molecule compounds. The results are used for the generation of new ligand descriptions, which are subsequently used by macromolecular model-building and structure-refinement software. To increase the reliability of the derived data, and therefore the new ligand descriptions, the entries from this database were subjected to very strict validation. The selection criteria made sure that the crystal structures used to derive atom types, bond and angle classes are of sufficiently high quality. Any suspicious entries at a crystal or molecular level were removed from further consideration. The selection criteria included (i) the resolution of the data used for refinement (entries solved at 0.84 Å resolution or higher) and (ii) the structure-solution method (structures must be from a single-crystal experiment and all atoms of generated molecules must have full occupancies), as well as basic sanity checks such as (iii) consistency between the valences and the number of connections between atoms, (iv) acceptable bond-length deviations from the expected values and (v) detection of atomic collisions. The derived atom types and bond classes were then validated using high-order moment-based statistical techniques. The results of the statistical analyses were fed back to fine-tune the atom typing. The developed procedure was repeated four times, resulting in fine-grained atom typing, bond and angle classes. The procedure will be repeated in the future as and when new entries are deposited in the COD. The whole procedure can also be applied to any source of small-molecule structures, including the Cambridge Structural Database and the ZINC database.
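
The five selection criteria listed above can be read as a chain of independent filters applied to each database entry. The following is only a minimal sketch of that idea, not the authors' implementation. The `Entry` and `Atom` records, the `failed_criteria` function, the `MAX_CONNECTIONS` table and the two numeric thresholds `max_bond_dev` and `min_contact` are all hypothetical names and placeholder values introduced for illustration. Only the 0.84 Å resolution cutoff, the single-crystal requirement and the full-occupancy requirement come from the abstract.

```python
from dataclasses import dataclass
from typing import List

# From the abstract: entries must be solved at 0.84 Å resolution or higher,
# i.e. the d-spacing must be numerically no larger than 0.84 Å.
MAX_D_SPACING = 0.84

# Illustrative placeholder table, NOT taken from the paper: the largest number
# of connections tolerated for a few elements.
MAX_CONNECTIONS = {"H": 1, "C": 4, "N": 4, "O": 2}


@dataclass
class Atom:
    element: str
    occupancy: float
    n_connections: int


@dataclass
class Entry:
    resolution: float              # Å
    single_crystal: bool           # True if from a single-crystal experiment
    atoms: List[Atom]
    max_bond_deviation: float      # Å, largest |observed - expected| bond length
    min_nonbonded_distance: float  # Å, shortest contact between non-bonded atoms


def failed_criteria(entry: Entry,
                    max_bond_dev: float = 0.1,
                    min_contact: float = 1.0) -> List[str]:
    """Return the names of the criteria (i)-(v) that the entry fails.

    max_bond_dev and min_contact are assumed thresholds; the abstract
    does not state the values actually used.
    """
    failed = []
    # (i) resolution of the data used for refinement
    if entry.resolution > MAX_D_SPACING:
        failed.append("resolution")
    # (ii) single-crystal experiment and full occupancy for every atom
    if not entry.single_crystal or any(a.occupancy < 1.0 for a in entry.atoms):
        failed.append("method/occupancy")
    # (iii) valence versus number of connections (simplified to an upper bound)
    if any(a.n_connections > MAX_CONNECTIONS.get(a.element, float("inf"))
           for a in entry.atoms):
        failed.append("valence")
    # (iv) bond-length deviation from the expected value
    if entry.max_bond_deviation > max_bond_dev:
        failed.append("bond length")
    # (v) atomic collisions
    if entry.min_nonbonded_distance < min_contact:
        failed.append("collision")
    return failed


# Usage: keep only entries that pass every check.
entries = [
    Entry(0.80, True, [Atom("C", 1.0, 4), Atom("H", 1.0, 1)], 0.02, 2.1),
    Entry(0.95, True, [Atom("C", 0.5, 5)], 0.30, 0.6),
]
accepted = [e for e in entries if not failed_criteria(e)]
```

Returning the list of failed criteria, rather than a single boolean, makes it easy to tally why entries were rejected, which is useful when a procedure like the one described is re-run as new entries are deposited.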

Keywords: Crystallography Open Database; high-order statistics; validation.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Crystallography, X-Ray*
  • Databases, Factual
  • Ligands
  • Models, Molecular
  • Molecular Conformation*
  • Small Molecule Libraries / chemistry*
  • Software

Substances

  • Ligands
  • Small Molecule Libraries