Blocked inverted indices for exact clustering of large chemical spaces

Philipp Thiel; Lisa Sach-Peltason; Christian Ottmann; Oliver Kohlbacher

doi:10.1021/ci500150t

J Chem Inf Model. 2014 Sep 22;54(9):2395-401. doi: 10.1021/ci500150t. Epub 2014 Sep 2.

Authors

Philipp Thiel¹, Lisa Sach-Peltason, Christian Ottmann, Oliver Kohlbacher

Affiliation

¹ Applied Bioinformatics, Center for Bioinformatics, Quantitative Biology Center and Dept. of Computer Science, University of Tübingen, Sand 14, 72076 Tübingen, Germany.

PMID: 25136755
DOI: 10.1021/ci500150t

Abstract

The calculation of pairwise compound similarities based on fingerprints is one of the fundamental tasks in chemoinformatics. Methods for efficient calculation of compound similarities are of the utmost importance for various applications like similarity searching or library clustering. With the increasing size of public compound databases, exact clustering of these databases is desirable, but often computationally prohibitively expensive. We present an optimized inverted index algorithm for the calculation of all pairwise similarities on 2D fingerprints of a given data set. In contrast to other algorithms, it neither requires GPU computing nor yields a stochastic approximation of the clustering. The algorithm has been designed to work well with multicore architectures and shows excellent parallel speedup. As an application example of this algorithm, we implemented a deterministic clustering application, which has been designed to decompose virtual libraries comprising tens of millions of compounds in a short time on current hardware. Our results show that our implementation achieves more than 400 million Tanimoto similarity calculations per second on a common desktop CPU. Deterministic clustering of the available chemical space thus can be done on modern multicore machines within a few days.
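
To make the core idea concrete, the following is a minimal sketch of how an inverted index can yield all pairwise Tanimoto similarities. It is not the authors' implementation: the blocking and multicore parallelization described in the abstract are omitted, and the function name `tanimoto_all_pairs`, the `threshold` parameter, and the representation of each fingerprint as a Python set of on-bit positions are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def tanimoto_all_pairs(fingerprints, threshold=0.0):
    """Return {(i, j): similarity} for all pairs i < j sharing at least one bit.

    `fingerprints` is a list of sets of on-bit positions (an assumed,
    illustrative representation of 2D fingerprints).
    """
    # Inverted index: bit position -> ids of compounds with that bit set.
    # Ids are appended in increasing order, so each posting list is sorted.
    index = defaultdict(list)
    for cid, fp in enumerate(fingerprints):
        for bit in fp:
            index[bit].append(cid)

    # Each posting list contributes one shared bit to every pair it contains.
    common = defaultdict(int)
    for posting in index.values():
        for i, j in combinations(posting, 2):
            common[(i, j)] += 1

    # Tanimoto = |A & B| / |A | B|, with |A | B| = |A| + |B| - |A & B|.
    sims = {}
    for (i, j), c in common.items():
        union = len(fingerprints[i]) + len(fingerprints[j]) - c
        s = c / union
        if s >= threshold:
            sims[(i, j)] = s
    return sims


if __name__ == "__main__":
    fps = [{1, 2, 3}, {2, 3, 4}, {5}]
    print(tanimoto_all_pairs(fps))  # {(0, 1): 0.5}
```

Pairs that share no bits never appear in any posting list, so their similarity of zero is implied rather than computed. This is the property that lets an inverted index skip work on sparse fingerprints.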

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Cluster Analysis*
Models, Chemical
Stochastic Processes