Using the Gini coefficient to measure the chemical diversity of small-molecule libraries

J Comput Chem. 2016 Aug 15;37(22):2091-7. doi: 10.1002/jcc.24423. Epub 2016 Jun 29.

Abstract

Modern databases of small organic molecules contain tens of millions of structures. The size of theoretically available chemistry is even larger. However, despite the large amount of chemical information, the "big data" moment for chemistry has not yet provided the corresponding payoff of cheaper computer-predicted medicine or robust machine-learning models for the determination of efficacy and toxicity. Here, we present a study of the diversity of chemical datasets using a measure that is commonly used in socioeconomic studies. We demonstrate the use of this diversity measure on several datasets that were constructed to contain various congeneric subsets of molecules as well as randomly selected molecules. We also apply our method to a number of well-known databases that are frequently used for structure-activity relationship modeling. Our results show the poor diversity of the common sources of potential lead compounds compared to actual known drugs. © 2016 Wiley Periodicals, Inc.
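The socioeconomic measure referred to above is the Gini coefficient named in the title. The abstract does not state which quantity the coefficient is computed over, so the following is only a minimal sketch of the standard Gini formula, not the authors' implementation. The function name `gini` and the idea of feeding it per-cluster molecule counts are assumptions made for illustration.

```python
def gini(values):
    """Gini coefficient of a list of non-negative numbers.

    Returns 0.0 when all values are equal and approaches 1.0 when
    a single value holds almost the entire total.
    """
    xs = sorted(values)  # the formula requires ascending order
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a non-zero total")
    if xs[0] < 0:
        raise ValueError("values must be non-negative")
    # Standard closed form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # with ranks i running from 1 to n over the sorted values.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical input: number of molecules in each of four structural clusters.
even_library = [5, 5, 5, 5]      # molecules spread evenly over clusters
skewed_library = [0, 0, 0, 10]   # all molecules in one congeneric cluster
```

Under this assumed setup, `gini(even_library)` evaluates to 0.0 and `gini(skewed_library)` to 0.75, so a higher coefficient would indicate a library concentrated in fewer clusters, that is, a less diverse one.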

Keywords: Diversity Genie; chemical databases; cheminformatics; molecular diversity.