Practical approach to determine sample size for building logistic prediction models using high-throughput data

Dae-Soon Son; DongHyuk Lee; Kyusang Lee; Sin-Ho Jung; Taejin Ahn; Eunjin Lee; Insuk Sohn; Jongsuk Chung; Woongyang Park; Nam Huh; Jae Won Lee

doi:10.1016/j.jbi.2014.12.010

J Biomed Inform. 2015 Feb:53:355-62. doi: 10.1016/j.jbi.2014.12.010. Epub 2014 Dec 30.

Authors

Dae-Soon Son¹, DongHyuk Lee², Kyusang Lee³, Sin-Ho Jung⁴, Taejin Ahn⁵, Eunjin Lee⁶, Insuk Sohn⁷, Jongsuk Chung⁸, Woongyang Park⁹, Nam Huh¹⁰, Jae Won Lee¹¹

Affiliations

¹ Samsung Genome Institute, Samsung Medical Center, Seoul, Republic of Korea; In vitro Diagnostics Research Lab, Bio Research Center, Samsung Advanced Institute of Technology, Gyeonggi-do, Republic of Korea; Department of Statistics, Korea University, Seoul, Republic of Korea. Electronic address: ds3.son@samsung.com.
² Department of Statistics, Texas A&M University, College Station, TX 77843, USA. Electronic address: dhyuklee@tamu.edu.
³ Clinomics, Inc., A-616 Gardenfive Works, Seoul, Republic of Korea. Electronic address: klee@clinomics.co.kr.
⁴ Department of Biostatistics and Bioinformatics, Duke University, NC 27710, USA. Electronic address: sinho.jung@duke.edu.
⁵ Samsung Genome Institute, Samsung Medical Center, Seoul, Republic of Korea; In vitro Diagnostics Research Lab, Bio Research Center, Samsung Advanced Institute of Technology, Gyeonggi-do, Republic of Korea. Electronic address: taejin.ahn@samsung.com.
⁶ Samsung Genome Institute, Samsung Medical Center, Seoul, Republic of Korea; In vitro Diagnostics Research Lab, Bio Research Center, Samsung Advanced Institute of Technology, Gyeonggi-do, Republic of Korea. Electronic address: eunjin.lee@samsung.com.
⁷ Samsung Cancer Research Institute, Samsung Medical Center, Seoul, Republic of Korea. Electronic address: insuk.sohn@samsung.com.
⁸ Samsung Genome Institute, Samsung Medical Center, Seoul, Republic of Korea; In vitro Diagnostics Research Lab, Bio Research Center, Samsung Advanced Institute of Technology, Gyeonggi-do, Republic of Korea. Electronic address: doogie.chung@samsung.com.
⁹ Samsung Genome Institute, Samsung Medical Center, Seoul, Republic of Korea. Electronic address: woongyang.park@samsung.com.
¹⁰ In vitro Diagnostics Research Lab, Bio Research Center, Samsung Advanced Institute of Technology, Gyeonggi-do, Republic of Korea. Electronic address: bio.stat@daum.net.
¹¹ Department of Statistics, Korea University, Seoul, Republic of Korea. Electronic address: jael@korea.ac.kr.

PMID: 25555898

Abstract

An empirical method of sample size determination for building prediction models was recently proposed. The permutation method used in that procedure is commonly applied to address overfitting during cross-validation when evaluating the performance of prediction models constructed from microarray data. However, a major drawback of such methods, which include bootstrapping and full permutation, is the prohibitively high computational cost of calculating the sample size. In this paper, we propose that a single representative null distribution can be used instead of a full permutation, and we support this proposal using both simulated and real data sets. In the simulation, we used a data set with zero effect size and confirmed that the empirical type I error approaches 0.05. Hence, this method can be confidently applied to reduce the overfitting problem during cross-validation. We observed that a pilot data set generated by random sampling from real data could be used successfully for sample size determination. We present results from an experiment that was repeated 300 times and produced results comparable to those of the full permutation method. Because we eliminate full permutation, sample size estimation time is not a function of pilot data size; in our experiment, the process took around 30 min. With the increasing number of clinical studies, developing efficient sample size determination methods for building prediction models is critical, but empirical methods using bootstrap and permutation usually involve high computing costs. In this study, we propose a method that drastically reduces the required computing time by using a representative null distribution of permutations. The method uses data from pilot experiments to design clinical studies efficiently for high-throughput data.

Keywords: Null distribution; Permutation; Prediction and validation; Sample size; Statistical power.
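
The idea of reusing one null distribution can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not spell out how the representative null is constructed, so the following are all assumptions made for illustration: the null is built once by permuting the labels of a single pilot data set; a one-feature nearest-centroid classifier with leave-one-out cross-validation stands in for the logistic prediction model on high-throughput data; and every function name and parameter value is hypothetical. The sketch then checks the empirical type I error on data with zero effect size, and the rejection rate on data with a clear effect, against that single shared null.

```python
import random


def loo_accuracy(x, y):
    """Leave-one-out accuracy of a one-feature nearest-centroid classifier.

    Stand-in for a cross-validated logistic model (an assumption).
    Requires at least two samples per class.
    """
    total = [0.0, 0.0]
    count = [0, 0]
    for xi, yi in zip(x, y):
        total[yi] += xi
        count[yi] += 1
    correct = 0
    for xi, yi in zip(x, y):
        # Class centroids recomputed with the held-out sample removed.
        means = []
        for c in (0, 1):
            s = total[c] - (xi if c == yi else 0.0)
            k = count[c] - (1 if c == yi else 0)
            means.append(s / k)
        pred = 0 if abs(xi - means[0]) <= abs(xi - means[1]) else 1
        correct += pred == yi
    return correct / len(x)


def simulate(n_per_class, effect, rng):
    """Two balanced classes whose means differ by `effect` (unit variance)."""
    y = [0] * n_per_class + [1] * n_per_class
    x = [rng.gauss(effect * label, 1.0) for label in y]
    return x, y


def representative_null(x, y, n_perm, rng):
    """Build the null once by permuting the labels of one pilot data set."""
    labels = list(y)
    stats = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        stats.append(loo_accuracy(x, labels))
    return stats


def empirical_p(observed, null):
    """One-sided empirical p-value of `observed` against the shared null."""
    exceed = sum(1 for s in null if s >= observed)
    return (1 + exceed) / (len(null) + 1)


def rejection_rate(effect, null, n_per_class, n_rep, alpha, rng):
    """Fraction of simulated data sets whose p-value is at most alpha."""
    rejected = 0
    for _ in range(n_rep):
        x, y = simulate(n_per_class, effect, rng)
        if empirical_p(loo_accuracy(x, y), null) <= alpha:
            rejected += 1
    return rejected / n_rep


rng = random.Random(0)
pilot_x, pilot_y = simulate(15, 0.0, rng)
null = representative_null(pilot_x, pilot_y, 500, rng)  # computed only once

type1 = rejection_rate(0.0, null, 15, 200, 0.05, rng)  # zero effect size
power = rejection_rate(2.0, null, 15, 200, 0.05, rng)  # clear effect
print("empirical type I error:", type1)
print("empirical power:", power)
```

Because accuracy is a discrete statistic at this sample size, the empirical type I error in such a toy setting tends to sit at or somewhat below the nominal level rather than exactly on it. The saving comes from computing `null` once and reusing it for every simulated data set, instead of re-permuting each one.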

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Computational Biology / methods*
Computer Simulation
Gene Expression Profiling / methods*
Humans
Logistic Models
Pilot Projects
Reproducibility of Results
Research Design*
Sample Size
Software