Three approaches to supervised learning for compositional data with pairwise logratios

Germà Coenders; Michael Greenacre

doi:10.1080/02664763.2022.2108007

J Appl Stat. 2022 Aug 6;50(16):3272-3293. doi: 10.1080/02664763.2022.2108007. eCollection 2023.

Authors

Germà Coenders¹, Michael Greenacre²

Affiliations

¹ Department of Economics, Universitat de Girona, Girona, Spain.
² Department of Economics and Business and Barcelona School of Management, Universitat Pompeu Fabra, Barcelona, Spain.

Abstract

Logratios between pairs of compositional parts (pairwise logratios) are the easiest to interpret in compositional data analysis, and include the well-known additive logratios as particular cases. When the number of parts is large (sometimes even larger than the number of cases), some form of logratio selection is needed. In this article, we present three alternative stepwise supervised learning methods to select the pairwise logratios that best explain a dependent variable in a generalized linear model, each geared for a specific problem. The first method features unrestricted search, where any pairwise logratio can be selected. This method has a complex interpretation if some pairs of parts in the logratios overlap, but it leads to the most accurate predictions. The second method restricts parts to occur only once, which makes the corresponding logratios intuitively interpretable. The third method uses additive logratios, so that K-1 selected logratios involve a K-part subcomposition. Our approach allows logratios or non-compositional covariates to be forced into the models based on theoretical knowledge, and various stopping criteria are available based on information measures or statistical significance with the Bonferroni correction. We present an application on a dataset from a study predicting Crohn's disease.
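The three search strategies described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a Gaussian response fitted by ordinary least squares (the simplest generalized linear model) and uses BIC as the stopping criterion, which is one example of the "information measures" the abstract mentions. All function names, the mode labels, and the simulated data are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

MODES = ("unrestricted", "non_overlapping", "alr")


def pairwise_logratios(X):
    """Map each pair (j, k), j < k, to the column log(X[:, j] / X[:, k])."""
    log_X = np.log(X)
    pairs = itertools.combinations(range(X.shape[1]), 2)
    return {(j, k): log_X[:, j] - log_X[:, k] for j, k in pairs}


def bic_gaussian(y, columns):
    """BIC (up to an additive constant) of a least-squares fit with intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + list(columns))
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + design.shape[1] * np.log(n)


def is_allowed(pair, selected, mode):
    """Apply the restriction of each search strategy to a candidate pair."""
    if mode == "unrestricted" or not selected:
        return True
    if mode == "non_overlapping":
        # Each part may appear in at most one selected logratio.
        used = {part for s in selected for part in s}
        return used.isdisjoint(pair)
    # mode == "alr": every selected logratio must share one reference part.
    shared = set.intersection(*map(set, selected))
    return len(shared & set(pair)) == 1


def stepwise_select(X, y, mode="unrestricted", max_steps=5):
    """Greedy forward selection of pairwise logratios, stopping when BIC stalls."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    candidates = pairwise_logratios(X)
    selected, best_bic = [], bic_gaussian(y, [])
    for _ in range(max_steps):
        scores = {
            pair: bic_gaussian(y, [candidates[s] for s in selected] + [column])
            for pair, column in candidates.items()
            if pair not in selected and is_allowed(pair, selected, mode)
        }
        if not scores:
            break
        pair = min(scores, key=scores.get)
        if scores[pair] >= best_bic:  # no improvement: stop
            break
        selected.append(pair)
        best_bic = scores[pair]
    return selected


# Simulated 6-part compositions (assumption: illustrative data only).
rng = np.random.default_rng(0)
raw = rng.lognormal(size=(200, 6))
X = raw / raw.sum(axis=1, keepdims=True)  # closure to unit sum
y = (3 * np.log(X[:, 0] / X[:, 1]) - np.log(X[:, 2] / X[:, 3])
     + rng.normal(scale=0.1, size=200))

results = {mode: stepwise_select(X, y, mode) for mode in MODES}
for mode, pairs in results.items():
    print(mode, pairs)
```

In this sketch the `"alr"` mode fixes the reference part only once a second logratio has been chosen: the first selected pair offers two possible references, and the second pair must contain exactly one of them. A pair `(j, k)` is always stored with `j < k`; the sign of the corresponding coefficient absorbs the orientation of the ratio.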

Keywords: Compositional data; generalized linear modelling; logratios; stepwise regression; variable selection.

Grants and funding

This work was supported by the Spanish Ministry of Science and Innovation/AEI/10.13039/501100011033 and by ERDF "A way of making Europe" [grant number PID2021-123833OB-I00]; the Spanish Ministry of Health (Ministerio de Sanidad, Consumo y Bienestar Social) [grant number CIBERCB06/02/1002]; and the Government of Catalonia (Generalitat de Catalunya) [grant number 2017SGR656].