Mapping chemical structure-activity information of HAART-drug cocktails over complex networks of AIDS epidemiology and socioeconomic data of U.S. counties

Diana María Herrera-Ibatá; Alejandro Pazos; Ricardo Alfredo Orbegozo-Medina; Francisco Javier Romero-Durán; Humberto González-Díaz

doi:10.1016/j.biosystems.2015.04.007

Biosystems. 2015 Jun:132-133:20-34. doi: 10.1016/j.biosystems.2015.04.007. Epub 2015 Apr 24.

Authors

Diana María Herrera-Ibatá¹, Alejandro Pazos², Ricardo Alfredo Orbegozo-Medina³, Francisco Javier Romero-Durán⁴, Humberto González-Díaz⁵

Affiliations

¹ Department of Information and Communication Technologies, University of A Coruña (UDC), 15071 A Coruña, Spain. Electronic address: diana.herrera@udc.es.
² Department of Information and Communication Technologies, University of A Coruña (UDC), 15071 A Coruña, Spain.
³ Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela (USC), 15782 Santiago de Compostela, Spain.
⁴ Department of Organic Chemistry (USC), 15782 Santiago de Compostela, Spain.
⁵ Department of Organic Chemistry II, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), 48940 Leioa, Spain; IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain. Electronic address: humberto.gonzalezdiaz@ehu.es.

PMID: 25916548
DOI: 10.1016/j.biosystems.2015.04.007

Abstract

Using computational algorithms to design tailored drug cocktails for highly active antiretroviral therapy (HAART) in specific populations is a goal of major importance for both the pharmaceutical industry and public health policy institutions. New combinations of compounds need to be predicted in order to design HAART cocktails. On the one hand, there are the biomolecular factors related to the drugs in the cocktail (experimental measure, chemical structure, drug target, assay organisms, etc.); on the other hand, there are the socioeconomic factors of the specific population (income inequalities, employment levels, fiscal pressure, education, migration, population structure, etc.), which are needed to study the relationship between socioeconomic status and the disease. In this context, machine learning algorithms able to seek models for problems with multi-source data have to be used. In this work, the first artificial neural network (ANN) model is proposed for the prediction of HAART cocktails to halt AIDS on epidemic networks of U.S. counties, using information indices that codify both biomolecular and several socioeconomic factors. The data were obtained from at least three major sources. The first dataset included assays of anti-HIV chemical compounds released to ChEMBL. The second dataset is the AIDSVu database of Emory University, which compiled AIDS prevalence for >2300 U.S. counties. The third dataset included socioeconomic data from the U.S. Census Bureau. Three scales or levels were employed to group the counties according to location or population structure codes: state, rural-urban continuum code (RUCC), and urban influence code (UIC). An analysis of >130,000 pairs (network links) was performed, corresponding to AIDS prevalence in 2310 U.S. counties vs. drug cocktails made up of combinations of ChEMBL results for 21,582 unique drugs, 9 viral or human protein targets, 4856 protocols, and 10 possible experimental measures.
The best model found with the original data was a linear neural network (LNN) with AUROC>0.80 and accuracy, specificity, and sensitivity≈77% in training and external validation series. Changing the spatial and population structure scale (state, UIC, or RUCC codes) did not affect the quality of the model. Imbalance was detected in all the models found when comparing positive/negative cases and linear/non-linear model accuracy ratios. Using the synthetic minority over-sampling technique (SMOTE), data pre-processing, and machine-learning algorithms implemented in the WEKA software, more balanced models were found. In particular, a multilayer perceptron (MLP) with AUROC=97.4% and precision, recall, and F-measure >90% was found.
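
For readers unfamiliar with SMOTE, the following is a minimal Python sketch of its core idea: each synthetic minority-class sample is placed at a random position on the line segment between an existing minority sample and one of its nearest minority-class neighbours. It is an illustration only, not the WEKA filter used in the study; the function name `smote`, its parameters, and the toy data are hypothetical.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Illustrative sketch only (hypothetical helper, not the WEKA filter).
    `minority` is a list of equal-length numeric tuples with at least two rows.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the remaining minority samples by squared Euclidean distance.
        others = minority[:i] + minority[i + 1:]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Interpolate at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=6, k=2)
```

With `k=2`, each corner's nearest neighbours are its two adjacent corners, so every synthetic point falls on an edge of the square. Adding such points to the training set raises the share of the minority class without duplicating existing rows.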

Keywords: AIDS epidemiology; Box–Jenkins operators; Information theory; Shannon entropy; Urban influence code.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Acquired Immunodeficiency Syndrome / drug therapy*
Acquired Immunodeficiency Syndrome / economics
Acquired Immunodeficiency Syndrome / epidemiology*
Algorithms
Anti-HIV Agents / chemistry*
Anti-HIV Agents / therapeutic use*
Antiretroviral Therapy, Highly Active / economics
Antiretroviral Therapy, Highly Active / statistics & numerical data*
Computer Simulation
Data Mining / methods
Databases, Factual
Educational Status
Employment
Humans
Machine Learning
Models, Statistical*
Prevalence
Social Media / statistics & numerical data
Socioeconomic Factors
Structure-Activity Relationship
Treatment Outcome
United States / epidemiology

Substances

Anti-HIV Agents