A Fitted Sparse-Group Lasso for Genome-Based Evaluations

IEEE/ACM Trans Comput Biol Bioinform. 2023 Jan-Feb;20(1):30-38. doi: 10.1109/TCBB.2022.3156805. Epub 2023 Feb 3.

Abstract

In the life sciences, high-throughput techniques typically produce high-dimensional data in which the number of covariates is often much larger than the number of observations. This inherently entails multicollinearity, which challenges statistical analysis in a linear regression framework. Penalization methods such as the lasso, ridge regression, the group lasso, and convex combinations thereof, which introduce additional conditions on regression variables, have proven effective. In this study, we introduce a novel approach that combines the lasso and the standardized group lasso, leading to a meaningful weighting of the predicted ("fitted") outcome, which is of primary importance, e.g., in breeding populations. This "fitted" sparse-group lasso was implemented as a proximal-averaged gradient descent method and is part of the R package "seagull", available on CRAN. To evaluate the novel method, we conducted an extensive simulation study. We simulated genotypes and phenotypes that resemble data from a dairy cattle population. Genotypes at thousands of genomic markers were used as covariates to fit a quantitative response. Markers were grouped according to their proximity on a chromosome. In the majority of simulated scenarios, the new method showed improved prediction ability compared with other penalization approaches and was able to localize the signals of the simulated features.
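To illustrate the idea of a proximal-averaged gradient step, the following is a minimal sketch in plain Python. It is not the implementation in "seagull" and it does not reproduce the "fitted" penalty of the paper: as a simplifying assumption, it uses the ordinary group-lasso norm (weighted by the square root of the group size) in place of the standardized group lasso, because that norm has a closed-form proximal operator. All function names, the mixing parameter `alpha`, the penalty level `lam`, and the step size `step` are hypothetical and chosen for illustration only.

```python
import math


def soft_threshold(z, thresh):
    """Proximal operator of thresh * ||.||_1 (element-wise soft-thresholding)."""
    return [math.copysign(max(abs(v) - thresh, 0.0), v) for v in z]


def group_shrink(z, groups, thresh):
    """Proximal operator of thresh * sum_g sqrt(|g|) * ||z_g||_2.

    `groups` is a list of index lists that partition the coefficients.
    Assumption: plain (not standardized) group-lasso norm.
    """
    out = [0.0] * len(z)
    for idx in groups:
        norm = math.sqrt(sum(z[j] ** 2 for j in idx))
        weight = math.sqrt(len(idx))
        # Shrink the whole group toward zero; drop it if its norm is too small.
        scale = max(1.0 - thresh * weight / norm, 0.0) if norm > 0 else 0.0
        for j in idx:
            out[j] = scale * z[j]
    return out


def gradient(X, y, beta):
    """Gradient of the squared-error loss (1 / (2n)) * ||y - X beta||^2."""
    n = len(y)
    resid = [sum(x * b for x, b in zip(row, beta)) - yi for row, yi in zip(X, y)]
    return [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(len(beta))]


def prox_averaged_step(X, y, beta, groups, lam, alpha, step):
    """One proximal-averaged gradient step.

    Takes a gradient step on the loss, applies each penalty's proximal
    operator separately, and averages the results with weights
    alpha (lasso) and 1 - alpha (group lasso).
    """
    grad = gradient(X, y, beta)
    z = [b - step * g for b, g in zip(beta, grad)]
    prox_l1 = soft_threshold(z, step * lam)
    prox_group = group_shrink(z, groups, step * lam)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(prox_l1, prox_group)]


if __name__ == "__main__":
    # Toy data: 4 observations, 4 markers in two groups of neighbouring markers.
    X = [[1.0, 0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0, 1.0],
         [1.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 1.0]]
    y = [1.0, 0.5, 1.5, 0.0]
    groups = [[0, 1], [2, 3]]
    beta = [0.0] * 4
    for _ in range(200):
        beta = prox_averaged_step(X, y, beta, groups, lam=0.05, alpha=0.5, step=0.5)
    print(beta)
```

Averaging the two proximal maps avoids having to evaluate the proximal operator of the combined penalty, which is the reason such a scheme is attractive when one of the penalties has no simple joint closed form with the other.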

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Cattle
  • Computer Simulation
  • Genome* / genetics
  • Genotype
  • Linear Models
  • Phenotype