Application of data augmentation techniques towards metabolomics

Comput Biol Med. 2022 Sep:148:105916. doi: 10.1016/j.compbiomed.2022.105916. Epub 2022 Jul 27.

Abstract

Niemann-Pick type C1 (NPC1) disease is a rare and debilitating neurodegenerative lysosomal storage disease (LSD). Metabolomics datasets from NPC1 patients that are available for predictive analysis are often limited in the number of samples and severely unbalanced. In order to improve the predictive capability and identify new biomarkers in an NPC1 disease urinary dataset, data augmentation (DA) techniques based on computational intelligence have been employed to create synthetic samples, namely the addition of noise, oversampling techniques and conditional generative adversarial networks. The predictive capacities of these techniques have been evaluated on a set of urine samples donated by 13 untreated NPC1 disease patients and 47 heterozygous (parental) carrier control participants. Prediction results have also been obtained using different machine learning classification models and partial least squares techniques. These results provide strong evidence for the ability of DA techniques to generate good quality synthetic data. The results show increases of 20%-50% in sensitivity, 6%-30% in F1 score, and 0.3 (out of 1) in predictive capacity. Additionally, more conventional forms of multivariate data analysis have been employed. These have allowed the detection of unusual urinary metabolite profiles, and the identification of biomarkers through the use of synthetically augmented datasets. Results indicate that urinary branched-chain amino acids such as valine, together with 3-aminoisobutyrate and quinolinate, may be employable as valuable biomarkers for the diagnosis and prognostic monitoring of NPC1 disease.
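
To make two of the augmentation ideas named in the abstract concrete, the sketch below balances a small minority class by adding noise and by interpolation-based oversampling. It is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the noise scale of 0.05, the five features and the randomly generated intensities are all hypothetical, and the interpolation step is a simplified SMOTE-like scheme that pairs random minority samples rather than nearest neighbours. Only the class sizes (13 and 47) come from the abstract. Conditional generative adversarial networks are not shown.

```python
import numpy as np


def add_gaussian_noise(X, n_new, scale=0.05, rng=None):
    """Create n_new synthetic rows by jittering randomly chosen rows of X.

    The noise for each feature is scaled by that feature's standard
    deviation, so `scale` is a relative (hypothetical) setting.
    """
    rng = np.random.default_rng(rng)
    idx = rng.integers(0, len(X), size=n_new)
    noise = rng.normal(0.0, 1.0, size=(n_new, X.shape[1]))
    return X[idx] + noise * scale * X.std(axis=0)


def interpolate_oversample(X, n_new, rng=None):
    """Create n_new synthetic rows between pairs of distinct rows of X.

    Simplified SMOTE-like scheme: partners are drawn at random rather
    than from the k nearest neighbours used by standard SMOTE.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    i = rng.integers(0, n, size=n_new)
    # An offset in 1..n-1 guarantees the partner differs from the sample.
    j = (i + rng.integers(1, n, size=n_new)) % n
    lam = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X[i] + lam * (X[j] - X[i])


# Placeholder data: random values standing in for metabolite intensities.
rng = np.random.default_rng(0)
X_npc1 = rng.lognormal(size=(13, 5))  # minority class (13 samples)
X_ctrl = rng.lognormal(size=(47, 5))  # majority class (47 samples)

# Generate enough synthetic minority samples to match the majority class,
# split between the two augmentation methods.
n_missing = len(X_ctrl) - len(X_npc1)
n_noise = n_missing // 2
X_noise = add_gaussian_noise(X_npc1, n_noise, rng=rng)
X_interp = interpolate_oversample(X_npc1, n_missing - n_noise, rng=rng)

X_npc1_augmented = np.vstack([X_npc1, X_noise, X_interp])
```

In practice, augmentation of this kind would be applied to the training split only, so that sensitivity and F1 score are measured on real, unseen samples.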

Keywords: Data augmentation; Machine learning; Metabolomics; Niemann–Pick type C disease; Rare diseases.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Biomarkers
  • Humans
  • Metabolomics
  • Niemann-Pick Disease, Type C*

Substances

  • Biomarkers