Variable Selection in the Regularized Simultaneous Component Analysis Method for Multi-Source Data Integration

Zhengguo Gu; Niek C de Schipper; Katrijn Van Deun

doi:10.1038/s41598-019-54673-2

Sci Rep. 2019 Dec 9;9(1):18608. doi: 10.1038/s41598-019-54673-2.

Authors

Zhengguo Gu¹, Niek C de Schipper², Katrijn Van Deun²

Affiliations

¹ Department of Methodology and Statistics, Tilburg University, Tilburg, 5000 LE, The Netherlands. z.gu@tilburguniversity.edu.
² Department of Methodology and Statistics, Tilburg University, Tilburg, 5000 LE, The Netherlands.

Abstract

Interdisciplinary research often involves analyzing data obtained from different data sources with respect to the same subjects, objects, or experimental units. For example, global positioning systems (GPS) data have been coupled with travel diary data, resulting in a better understanding of traveling behavior. The GPS data and the travel diary data are very different in nature, and, to analyze the two types of data jointly, one often uses data integration techniques, such as the regularized simultaneous component analysis (regularized SCA) method. Regularized SCA extends the (sparse) principal component analysis model to cases where at least two data blocks are analyzed jointly. To reveal the joint and unique sources of variation, it relies heavily on proper selection of the set of variables (i.e., component loadings) in the components. It therefore requires a proper variable selection method, either to identify the optimal values for the tuning parameters or to select variables stably. In two simulation studies with various noise and sparseness levels in the simulated data, we compare six variable selection methods: cross-validation (CV) with the "one-standard-error" rule, repeated double CV (rdCV), BIC, Bolasso with CV, stability selection, and the index of sparseness (IS), a method that is less well known than the other five but computationally efficient. Results show that IS is the best-performing variable selection method.
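
As an illustration of the first method named in the abstract, the following is a minimal sketch of the generic "one-standard-error" rule for choosing a tuning parameter from cross-validation results. It is not the authors' implementation. The function name, the input layout, the example error values, and the assumption that a larger tuning value gives a sparser model are all hypothetical choices made for this sketch.

```python
from math import sqrt
from statistics import mean, stdev


def one_standard_error_choice(lambdas, fold_errors):
    """Return the largest tuning value whose mean CV error lies within
    one standard error of the smallest mean CV error.

    lambdas:     candidate tuning values (assumed: larger = sparser model).
    fold_errors: fold_errors[i] lists the per-fold errors for lambdas[i].
    """
    means = [mean(errs) for errs in fold_errors]
    # Standard error of the mean CV error across folds.
    ses = [stdev(errs) / sqrt(len(errs)) for errs in fold_errors]

    # Index of the candidate with the smallest mean CV error.
    best = min(range(len(lambdas)), key=lambda i: means[i])
    threshold = means[best] + ses[best]

    # Among candidates no worse than the threshold, prefer the sparsest.
    eligible = [i for i in range(len(lambdas)) if means[i] <= threshold]
    chosen = max(eligible, key=lambda i: lambdas[i])
    return lambdas[chosen]


# Hypothetical 5-fold CV errors for four candidate tuning values.
lambdas = [0.01, 0.1, 1.0, 10.0]
fold_errors = [
    [1.00, 1.10, 0.90, 1.05, 0.95],
    [0.95, 1.05, 0.90, 1.00, 0.95],
    [0.97, 1.06, 0.92, 1.01, 0.97],
    [1.60, 1.70, 1.50, 1.65, 1.55],
]
chosen = one_standard_error_choice(lambdas, fold_errors)
print(chosen)
```

In this made-up example the smallest mean error occurs at 0.1, but the candidate 1.0 is within one standard error of that minimum, so the rule selects the sparser option.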

Publication types

Research Support, Non-U.S. Gov't