MASSA Algorithm: an automated rational sampling of training and test subsets for QSAR modeling

J Comput Aided Mol Des. 2023 Dec;37(12):735-754. doi: 10.1007/s10822-023-00536-y. Epub 2023 Oct 7.

Abstract

QSAR models capable of predicting biological, toxicity, and pharmacokinetic properties are widely used to search for lead bioactive molecules in chemical databases. Dataset preparation has a strong influence on the quality of the resulting models, and it requires dividing the original dataset into a training set (for model training) and a test set (for statistical evaluation). This sampling can be done randomly or rationally, but rational division is superior. In this paper, we present MASSA, a Python tool that automatically samples datasets by exploring the biological, physicochemical, and structural spaces of molecules using principal component analysis (PCA), hierarchical clustering analysis (HCA), and K-modes. The proposed algorithm is particularly useful when the variables used for QSAR are not available, or when multiple QSAR models must be built with the same training and test sets; it produces models with lower variability and better validation metrics. These results were obtained even when the descriptors used in the QSAR/QSPR modeling differed from those used to separate the training and test sets, indicating that the tool can be used to build models for more than one QSAR/QSPR technique. Finally, the tool also generates graphical representations that can provide insights into the data.
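The idea of rational sampling can be illustrated with a minimal sketch. It is not the MASSA implementation or its API: it assumes that each molecule has already been assigned a cluster label by some clustering step (for example HCA on PCA scores), and it only shows how a split can draw from every cluster so that both subsets cover the same regions of chemical space. The function name `cluster_stratified_split`, its parameters, and the example labels are hypothetical.

```python
import random
from collections import defaultdict


def cluster_stratified_split(cluster_labels, test_fraction=0.2, seed=0):
    """Split sample indices so that each cluster contributes to both subsets.

    cluster_labels: one label per molecule, assumed to come from a prior
    clustering step (hypothetical input, not produced here).
    Returns (train_indices, test_indices), both sorted.
    """
    rng = random.Random(seed)

    # Group sample indices by their cluster label.
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)

    train, test = [], []
    for label in sorted(members):
        indices = members[label]
        rng.shuffle(indices)
        if len(indices) >= 2:
            # Take roughly test_fraction of the cluster, but keep at least
            # one member on each side of the split.
            n_test = round(len(indices) * test_fraction)
            n_test = min(max(n_test, 1), len(indices) - 1)
        else:
            # A singleton cluster stays in the training set.
            n_test = 0
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical cluster labels for ten molecules.
labels = ["a"] * 5 + ["b"] * 4 + ["c"]
train_idx, test_idx = cluster_stratified_split(labels, test_fraction=0.2, seed=42)
```

In this sketch a purely random split could leave a small cluster entirely out of the test set, whereas the per-cluster draw guarantees that every cluster with at least two members appears in both subsets.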

Keywords: Clustering; Computer-aided drug design; Hierarchical clustering analysis (HCA); K-modes; Python; QSAR; Training and test sampling.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Benchmarking
  • Databases, Chemical
  • Quantitative Structure-Activity Relationship*