Mapping chemical structure-activity information of HAART-drug cocktails over complex networks of AIDS epidemiology and socioeconomic data of U.S. counties

Biosystems. 2015 Jun:132-133:20-34. doi: 10.1016/j.biosystems.2015.04.007. Epub 2015 Apr 24.

Abstract

Using computational algorithms to design tailored drug cocktails for highly active antiretroviral therapy (HAART) for specific populations is a goal of major importance for both the pharmaceutical industry and public health policy institutions. New combinations of compounds need to be predicted in order to design HAART cocktails. On the one hand, there are the biomolecular factors related to the drugs in the cocktail (experimental measure, chemical structure, drug target, assay organisms, etc.); on the other hand, there are the socioeconomic factors of the specific population (income inequalities, employment levels, fiscal pressure, education, migration, population structure, etc.), which are needed to study the relationship between socioeconomic status and the disease. In this context, machine learning algorithms able to seek models for problems with multi-source data have to be used. In this work, the first artificial neural network (ANN) model is proposed for the prediction of HAART cocktails to halt AIDS on epidemic networks of U.S. counties, using information indices that codify both biomolecular and several socioeconomic factors. The data were obtained from at least three major sources. The first dataset included assays of anti-HIV chemical compounds released to ChEMBL. The second dataset is the AIDSVu database of Emory University, which compiled AIDS prevalence for >2300 U.S. counties. The third dataset included socioeconomic data from the U.S. Census Bureau. Three scales or levels were employed to group the counties according to location or population structure codes: state, rural-urban continuum code (RUCC), and urban influence code (UIC). An analysis of >130,000 pairs (network links) was performed, corresponding to AIDS prevalence in 2310 U.S. counties vs. drug cocktails made up of combinations of ChEMBL results for 21,582 unique drugs, 9 viral or human protein targets, 4856 protocols, and 10 possible experimental measures.
The best model found with the original data was a linear neural network (LNN) with AUROC > 0.80 and accuracy, specificity, and sensitivity ≈ 77% in training and external validation series. Changing the spatial and population structure scale (State, UIC, or RUCC codes) does not affect the quality of the model. Imbalance was detected in all the models found when comparing positive/negative cases and linear/non-linear model accuracy ratios. Using the synthetic minority over-sampling technique (SMOTE), data pre-processing, and machine-learning algorithms implemented in the WEKA software, more balanced models were found. In particular, a multilayer perceptron (MLP) with AUROC = 97.4% and precision, recall, and F-measure > 90% was found.
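
The class-rebalancing step mentioned above can be illustrated with a minimal sketch of the general SMOTE idea: each synthetic minority case is placed at a random point on the segment between a minority sample and one of its nearest minority neighbors. This is not the WEKA implementation used in the study. The function name `smote`, its parameters, and the toy data are hypothetical and serve only as an illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Hypothetical helper: create synthetic minority samples by interpolation.

    minority    -- list of feature vectors (lists of floats), minority class only
    n_synthetic -- number of synthetic samples to generate
    k           -- number of nearest minority neighbors to choose from
    seed        -- seed for the random generator, for reproducibility
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Choose one of the k nearest neighbors at random.
        neighbor = minority[rng.choice(others[:k])]
        # Interpolate between base and neighbor with a random gap in [0, 1).
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class (assumed data): the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_synthetic=6, k=2, seed=42)
```

With `k=2` on this toy data, each corner's nearest neighbors are its two adjacent corners, so every synthetic point lies on an edge of the square. In practice the synthetic samples are appended to the training set only, so that validation cases remain real observations.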

Keywords: AIDS epidemiology; Box–Jenkins operators; Information theory; Shannon entropy; Urban influence code.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Acquired Immunodeficiency Syndrome / drug therapy*
  • Acquired Immunodeficiency Syndrome / economics
  • Acquired Immunodeficiency Syndrome / epidemiology*
  • Algorithms
  • Anti-HIV Agents / chemistry*
  • Anti-HIV Agents / therapeutic use*
  • Antiretroviral Therapy, Highly Active / economics
  • Antiretroviral Therapy, Highly Active / statistics & numerical data*
  • Computer Simulation
  • Data Mining / methods
  • Databases, Factual
  • Educational Status
  • Employment
  • Humans
  • Machine Learning
  • Models, Statistical*
  • Prevalence
  • Social Media / statistics & numerical data
  • Socioeconomic Factors
  • Structure-Activity Relationship
  • Treatment Outcome
  • United States / epidemiology

Substances

  • Anti-HIV Agents