Accelerating chemical database searching using graphics processing units

J Chem Inf Model. 2011 Aug 22;51(8):1807-16. doi: 10.1021/ci200164g. Epub 2011 Jul 13.

Abstract

The utility of chemoinformatics systems depends on the accurate computer representation and efficient manipulation of chemical compounds. In such systems, a small molecule is often digitized as a large fingerprint vector, where each element indicates the presence/absence or the number of occurrences of a particular structural feature. Since in theory the number of unique features can be exceedingly large, these fingerprint vectors are usually folded into much shorter ones using hashing and modulo operations, allowing fast "in-memory" manipulation and comparison of molecules. There is increasing evidence that lossless fingerprints can substantially improve retrieval performance in chemical database searching (substructure or similarity), which has led to the development of several lossless fingerprint compression algorithms. However, any gains in storage and retrieval afforded by compression need to be weighed against the extra computational burden required for decompression before these fingerprints can be compared. Here we demonstrate that graphics processing units (GPUs) can greatly alleviate this problem, enabling the practical application of lossless fingerprints on large databases. More specifically, we show that, with the help of an ordinary ~$500 video card, the entire PubChem database of ~32 million compounds can be searched in ~0.2-2 s on average, which is two orders of magnitude faster than on a conventional CPU. If multiple query patterns are processed in batch, the speedup is even more dramatic (less than 0.02-0.2 s/query for 1000 queries). In the present study, we use the Elias gamma compression algorithm, which results in a compression ratio as high as 0.097.
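The ideas named in the abstract (modulo folding, Elias gamma coding, and decompression before comparison) can be illustrated with a minimal CPU-side sketch in Python. It is not the authors' GPU implementation. The abstract does not say what quantity is Elias gamma coded, so coding the gaps between sorted on-bit indices is an assumption made here, and the function names `fold`, `compress`, `decompress`, and `tanimoto` are hypothetical. Tanimoto is used only as a common example of a similarity measure.

```python
from typing import Iterable, List, Set


def fold(feature_ids: Iterable[int], n_bits: int = 1024) -> Set[int]:
    """Lossy folding: map arbitrary feature IDs onto n_bits positions by modulo.

    Distinct features can collide on the same position, which is the
    information loss that lossless fingerprints avoid.
    """
    return {fid % n_bits for fid in feature_ids}


def elias_gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, returned as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined only for integers >= 1")
    binary = bin(n)[2:]                      # binary digits, no leading zeros
    return "0" * (len(binary) - 1) + binary  # unary length prefix + value


def compress(on_bits: Iterable[int]) -> str:
    """Encode a sparse fingerprint as gamma-coded gaps between on-bit indices.

    Assumption: gap coding is one plausible scheme, not necessarily the paper's.
    """
    out = []
    prev = -1
    for idx in sorted(set(on_bits)):
        out.append(elias_gamma_encode(idx - prev))  # every gap is >= 1
        prev = idx
    return "".join(out)


def decompress(bits: str) -> List[int]:
    """Invert compress(): recover the sorted on-bit indices."""
    indices = []
    prev = -1
    pos = 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":              # count the unary length prefix
            zeros += 1
            pos += 1
        gap = int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        prev += gap
        indices.append(prev)
    return indices


def tanimoto(a: Iterable[int], b: Iterable[int]) -> float:
    """Tanimoto similarity between two sets of on-bit indices."""
    sa, sb = set(a), set(b)
    union = len(sa | sb)
    return len(sa & sb) / union if union else 1.0


if __name__ == "__main__":
    query = [0, 3, 4, 12]
    packed = compress(query)
    print(packed, decompress(packed))
    print(tanimoto(decompress(packed), [3, 4, 7]))
    print(fold([5, 1029, 2053]))  # three features collapse onto one position
```

The decode loop is the per-fingerprint cost that the abstract describes as the extra computational burden. Each fingerprint can be decoded independently of the others, which is the kind of work that suits a GPU.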

MeSH terms

  • Algorithms
  • Chemistry, Pharmaceutical / methods*
  • Chemistry, Pharmaceutical / statistics & numerical data
  • Computer Graphics
  • Data Compression
  • Data Mining / methods*
  • Databases, Factual
  • Models, Chemical
  • Molecular Structure
  • Organic Chemicals / analysis*
  • Software

Substances

  • Organic Chemicals