Learning curves for drug response prediction in cancer cell lines

Alexander Partin; Thomas Brettin; Yvonne A Evrard; Yitan Zhu; Hyunseung Yoo; Fangfang Xia; Songhao Jiang; Austin Clyde; Maulik Shukla; Michael Fonstein; James H Doroshow; Rick L Stevens

doi:10.1186/s12859-021-04163-y

BMC Bioinformatics. 2021 May 17;22(1):252. doi: 10.1186/s12859-021-04163-y.

Authors

Alexander Partin^{1,2}, Thomas Brettin^{3,4}, Yvonne A Evrard⁵, Yitan Zhu^{6,3}, Hyunseung Yoo^{6,3}, Fangfang Xia^{6,3}, Songhao Jiang⁷, Austin Clyde^{6,7}, Maulik Shukla^{6,3}, Michael Fonstein⁸, James H Doroshow⁹, Rick L Stevens^{4,7}

Affiliations

¹ Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, USA. apartin@anl.gov.
² University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA. apartin@anl.gov.
³ University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA.
⁴ Computing, Environment and Life Sciences, Argonne National Laboratory, Lemont, IL, USA.
⁵ Frederick National Laboratory for Cancer Research, Leidos Biomedical Research Inc., Frederick, MD, USA.
⁶ Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, USA.
⁷ Department of Computer Science, University of Chicago, Chicago, IL, USA.
⁸ Biosciences Division, Argonne National Laboratory, Lemont, IL, USA.
⁹ Division of Cancer Therapeutics and Diagnosis, National Cancer Institute, Bethesda, MD, USA.

Abstract

Background: Motivated by the size and availability of cell line drug sensitivity data, researchers have been developing machine learning (ML) models for predicting drug response to advance cancer treatment. As drug sensitivity studies continue generating drug response data, a common question is whether the generalization performance of existing prediction models can be further improved with more training data.

Methods: We utilize empirical learning curves for evaluating and comparing the data scaling properties of two neural networks (NNs) and two gradient boosting decision tree (GBDT) models trained on four cell line drug screening datasets. The learning curves are accurately fitted to a power law model, providing a framework for assessing the data scaling behavior of these models.
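As a rough illustration of the power law fitting described in the Methods, the sketch below fits a simplified two-parameter form, error(m) ≈ a · m^(−b), by ordinary least squares in log-log space and then extrapolates to a larger training size. The exact parametrization used in the study is not stated in this abstract, so the functional form, the function names (`fit_power_law`, `predict_error`), and the synthetic data points are all assumptions made for the example.

```python
import math


def fit_power_law(sizes, errors):
    """Fit errors ~= a * sizes**(-b) by least squares on log-transformed data.

    Assumed two-parameter form (no irreducible-error term); the paper's
    actual parametrization may differ.
    """
    xs = [math.log(m) for m in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope in log-log space equals -b
    intercept = y_mean - slope * x_mean    # intercept equals log(a)
    return math.exp(intercept), -slope


def predict_error(a, b, m):
    """Evaluate the fitted curve at training set size m."""
    return a * m ** (-b)


# Synthetic (hypothetical) learning curve: error measured at growing train sizes.
train_sizes = [1000, 2000, 4000, 8000, 16000]
val_errors = [2.0 * m ** -0.3 for m in train_sizes]

a, b = fit_power_law(train_sizes, val_errors)
extrapolated = predict_error(a, b, 64000)  # projected error at a larger size
```

With real, noisy measurements, a form that includes a constant offset for irreducible error is often preferred; it requires nonlinear least squares rather than the log-log regression used here.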

Results: The curves demonstrate that no single model dominates in terms of prediction performance across all datasets and training sizes, which suggests that the shape of each curve depends on the specific pairing of ML model and dataset. The multi-input NN (mNN), in which gene expressions of cancer cells and molecular drug descriptors are input into separate subnetworks, outperforms a single-input NN (sNN), in which the cell and drug features are concatenated for the input layer. In contrast, a GBDT with hyperparameter tuning outperforms both NNs at the lower range of training set sizes for two of the tested datasets, whereas the mNN consistently performs better at the higher range of training sizes. Moreover, the trajectory of the curves suggests that increasing the sample size is expected to further improve the prediction scores of both NNs. These observations demonstrate the benefit of using learning curves to evaluate prediction models, providing a broader perspective on the overall data scaling characteristics.
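The difference between the two input arrangements described in the Results can be sketched with a toy forward pass. This is only an illustration of where the cell and drug features are joined: the layer sizes, the ReLU activation, the random weights, and all function names are hypothetical, and none of them are taken from the study.

```python
import random


def linear(x, weights, biases):
    """Fully connected layer without activation; x is a list of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu_layer(x, weights, biases):
    """Fully connected layer followed by ReLU."""
    return [max(0.0, v) for v in linear(x, weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (placeholder initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def snn_forward(cell, drug, hidden, head):
    """Single-input arrangement: features are concatenated at the input."""
    joint = relu_layer(cell + drug, *hidden)
    return linear(joint, *head)[0]


def mnn_forward(cell, drug, cell_net, drug_net, head):
    """Multi-input arrangement: each feature type has its own subnetwork,
    and the two learned representations are concatenated afterwards."""
    cell_repr = relu_layer(cell, *cell_net)
    drug_repr = relu_layer(drug, *drug_net)
    return linear(cell_repr + drug_repr, *head)[0]


rng = random.Random(0)
n_cell, n_drug, n_hidden = 6, 4, 4          # hypothetical toy dimensions
cell = [rng.random() for _ in range(n_cell)]  # stand-in for gene expressions
drug = [rng.random() for _ in range(n_drug)]  # stand-in for drug descriptors

snn_out = snn_forward(cell, drug,
                      init_layer(n_cell + n_drug, n_hidden, rng),
                      init_layer(n_hidden, 1, rng))
mnn_out = mnn_forward(cell, drug,
                      init_layer(n_cell, n_hidden, rng),
                      init_layer(n_drug, n_hidden, rng),
                      init_layer(2 * n_hidden, 1, rng))
```

Both functions return a single predicted response value; the only structural difference is whether the two feature types are merged before or after the first learned transformation.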

Conclusions: A fitted power law learning curve provides a forward-looking metric for analyzing prediction performance and can serve as a co-design tool to guide experimental biologists and computational scientists in the design of future experiments in prospective research studies.

Keywords: Cell line; Deep learning; Drug response prediction; Learning curve; Machine learning; Power law.

MeSH terms

Cell Line
Learning Curve
Machine Learning
Neoplasms* / drug therapy
Neoplasms* / genetics
Pharmaceutical Preparations*
Prospective Studies

Substances

Pharmaceutical Preparations