A screening method for ultra-high dimensional features with overlapped partition structures

Stat Methods Med Res. 2023 Jan;32(1):22-40. doi: 10.1177/09622802221129043. Epub 2022 Sep 29.

Abstract

Ultra-high dimensional data, such as gene and neuroimaging data, are becoming increasingly important in biomedical science. Identifying important biomarkers among a huge number of features can provide better insight for further research. Variable screening is an efficient tool for this purpose in large-scale settings: it reduces the feature dimension to a moderate size by removing most of the inactive features. Developing novel variable screening methods for high-dimensional features with group structures is challenging, especially when the groups overlap. For example, a very large set of genes can usually be partitioned into hundreds of pathways according to background knowledge. A primary characteristic of this type of data is that many genes appear in more than one pathway, so different pathways overlap. However, existing variable screening methods can handle only disjoint group structures. To fill this gap, we propose a novel variable screening method for the generalized linear model that incorporates overlapped partition structures and comes with a theoretical guarantee. Besides establishing the sure screening property, we evaluate the performance of the proposed method through a series of numerical studies and apply it to the statistical analysis of a breast cancer data set.
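
To make the idea of screening with overlapping groups concrete, the following is a minimal illustrative sketch, not the procedure proposed in the paper. It assumes a simple linear-model setting and uses absolute marginal correlation as the feature utility, in the spirit of classical sure independence screening; the group score (mean squared utility over a group's members) and the names `marginal_scores`, `screen_with_overlapping_groups`, `n_groups_keep`, and `n_features_keep` are hypothetical choices made for illustration only.

```python
import numpy as np


def marginal_scores(X, y):
    """Absolute marginal correlation of each column of X with y.

    Assumption: a linear-model utility stands in for the GLM-based
    utility of the paper. Columns are assumed to be non-constant.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    return np.abs(Xc.T @ yc) / denom


def screen_with_overlapping_groups(X, y, groups, n_groups_keep, n_features_keep):
    """Two-stage screening sketch: rank groups, then rank features.

    groups is a list of integer index arrays; the same feature index
    may appear in several groups (overlap).
    """
    scores = marginal_scores(X, y)
    # Hypothetical group utility: mean squared marginal score of members.
    group_scores = np.array([np.mean(scores[g] ** 2) for g in groups])
    top_groups = np.argsort(group_scores)[::-1][:n_groups_keep]
    # Taking the union handles overlap: a feature that belongs to
    # several retained groups is considered only once.
    candidates = np.unique(np.concatenate([groups[k] for k in top_groups]))
    ranked = candidates[np.argsort(scores[candidates])[::-1]]
    return ranked[:n_features_keep]


# Simulated example: features 5 and 10 are active; the first two
# groups overlap on features 5..9.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = 3.0 * X[:, 5] + 2.0 * X[:, 10] + rng.standard_normal(200)
groups = [np.arange(0, 10), np.arange(5, 15), np.arange(15, 30), np.arange(30, 50)]
selected = screen_with_overlapping_groups(X, y, groups, n_groups_keep=2, n_features_keep=5)
print(sorted(selected.tolist()))
```

In this simulated setting the two overlapping groups both contain an active feature, so both are retained and the shared features are deduplicated before the feature-level ranking. Any real analysis would replace the correlation utility with one suited to the chosen generalized linear model.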

Keywords: Variable screening; generalized linear model; overlapped partition structures; sure screening.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Biomarkers
  • Linear Models*

Substances

  • Biomarkers