Insights into the Effects of Violating Statistical Assumptions for Dimensionality Reduction for Chemical "-omics" Data with Multiple Explanatory Variables

ACS Omega. 2023 Jun 9;8(24):22042-22054. doi: 10.1021/acsomega.3c01613. eCollection 2023 Jun 20.

Abstract

Biological volatilome analysis is inherently complex because of the considerable number of compounds (i.e., dimensions) and because peak areas differ by orders of magnitude both between and within compounds in a dataset. Traditional volatilome analysis relies on dimensionality reduction techniques, which aid in selecting the compounds considered relevant to a research question prior to further analysis. Currently, compounds of interest are identified using either supervised or unsupervised statistical methods, which assume that the data residuals are normally distributed and exhibit linearity. However, biological data often violate the statistical assumptions of these models related to normality and the presence of multiple explanatory variables, which are innate to biological samples. In an attempt to address deviations from normality, volatilome data can be log transformed. However, whether the effects of each assessed variable are additive or multiplicative should be considered prior to transformation, as this will impact the effect of each variable on the data. If assumptions of normality and variable effects are not investigated prior to dimensionality reduction, ineffective or erroneous compound dimensionality reduction can impact downstream analyses. The aim of this manuscript is to assess the impact of single and multivariable statistical models, with and without the log transformation, on volatilome dimensionality reduction prior to any supervised or unsupervised classification analysis. As a proof of concept, Shingleback lizard (Tiliqua rugosa) volatilomes collected across the species' distribution and from captivity were assessed. Shingleback volatilomes are suspected to be influenced by multiple explanatory variables related to habitat (Bioregion), sex, parasite presence, total body volume, and captive status.
This work determined that excluding relevant explanatory variables from the analysis overestimates both the effect of Bioregion and the number of compounds identified as significant. The log transformation increased the number of compounds that were identified as significant, as did analyses that assumed that residuals were normally distributed. Among the methods considered in this work, the most conservative form of dimensionality reduction was achieved by analyzing untransformed data using Monte Carlo tests with multiple explanatory variables.
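
As an illustration of the comparison described above, the following is a minimal sketch of a Monte Carlo test for one explanatory variable (Bioregion) while adjusting for others, applied to a single compound on both the untransformed and the log scale. The abstract does not specify the permutation scheme, so the sketch assumes one common choice: permuting the residuals of the reduced model (Freedman-Lane). The data are synthetic, and the function names (`rss`, `f_stat`, `monte_carlo_p`), variable names, and effect sizes are all hypothetical rather than taken from the study.

```python
import numpy as np

def rss(y, X):
    """Residual sum of squares of an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def f_stat(y, X_full, X_reduced):
    """Partial F statistic for the columns in X_full that are not in X_reduced."""
    rss_full = rss(y, X_full)
    rss_red = rss(y, X_reduced)
    df_num = X_full.shape[1] - X_reduced.shape[1]
    df_den = len(y) - X_full.shape[1]
    return ((rss_red - rss_full) / df_num) / (rss_full / df_den)

def monte_carlo_p(y, X_full, X_reduced, n_perm=199, seed=0):
    """Monte Carlo p-value for the extra terms in X_full.

    Assumption: residuals of the reduced model are permuted and added back
    to its fitted values, so the covariate effects are preserved under the null.
    """
    rng = np.random.default_rng(seed)
    beta, *_ = np.linalg.lstsq(X_reduced, y, rcond=None)
    fitted = X_reduced @ beta
    resid = y - fitted
    observed = f_stat(y, X_full, X_reduced)
    exceed = 0
    for _ in range(n_perm):
        y_perm = fitted + rng.permutation(resid)
        if f_stat(y_perm, X_full, X_reduced) >= observed:
            exceed += 1
    # The observed statistic counts as one of the permutations.
    return (exceed + 1) / (n_perm + 1)

# Synthetic example (hypothetical values): one compound, 60 samples.
rng = np.random.default_rng(42)
n = 60
region = np.repeat([0, 1, 2], n // 3)      # three hypothetical bioregions
sex = rng.integers(0, 2, n).astype(float)  # binary covariate
volume = rng.normal(0.0, 1.0, n)           # standardized body volume

# Peak areas generated with multiplicative effects (additive on the log scale).
log_area = (1.0 + 0.8 * (region == 2) + 0.5 * sex + 0.3 * volume
            + rng.normal(0.0, 0.5, n))
area = np.exp(log_area)

# Reduced model: intercept plus covariates. Full model adds Bioregion dummies.
X_reduced = np.column_stack([np.ones(n), sex, volume])
X_full = np.column_stack([X_reduced, region == 1, region == 2]).astype(float)

p_raw = monte_carlo_p(area, X_full, X_reduced)
p_log = monte_carlo_p(np.log(area), X_full, X_reduced)
print("Bioregion p-value, untransformed:", p_raw)
print("Bioregion p-value, log-transformed:", p_log)
```

Repeating such a test for every compound, and retaining only compounds below a chosen significance threshold, would give one form of the dimensionality reduction discussed here. Dropping `sex` and `volume` from both design matrices gives the single-variable analysis for comparison.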