ClinicalRisk: A New Therapy-related Clinical Trial Dataset for Predicting Trial Status and Failure Reasons

Junyu Luo; Zhi Qiao; Lucas Glass; Cao Xiao; Fenglong Ma

doi:10.1145/3583780.3615113

Proc ACM Int Conf Inf Knowl Manag. 2023 Oct:2023:5356-5360. doi: 10.1145/3583780.3615113. Epub 2023 Oct 21.

Authors

Junyu Luo¹, Zhi Qiao², Lucas Glass³, Cao Xiao⁴, Fenglong Ma¹

Affiliations

¹ The Pennsylvania State University, University Park, USA.
² United Imaging Healthcare, Beijing, China.
³ IQVIA, Chicago, USA.
⁴ GE HealthCare, Chicago, USA.

Abstract

Clinical trials study new tests and evaluate their effects on human health outcomes, and they represent a huge market. However, carrying out a clinical trial is expensive and time-consuming, and trials often end without results. An effective model that automatically estimates the status of a clinical trial and identifies possible failure reasons would revolutionize clinical practice. Developing such a model is challenging because a benchmark dataset is lacking. To address this challenge, we first build a new dataset by extracting publicly available clinical trial reports from ClinicalTrials.gov. The status associated with each report is treated as its status label. To analyze failure reasons, domain experts manually annotate each failed report based on the description associated with it. More importantly, we examine several state-of-the-art text classification baselines on this task and find that the unique format of clinical trial protocols plays an essential role in prediction accuracy, demonstrating the need for specially designed clinical trial classification models.
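
As a rough illustration of the labeling step described above, the following minimal sketch shows how a registry status string could be turned into a coarse label and how a structured protocol could be flattened into tagged text for a classifier. It is not the authors' pipeline: the function names `status_label` and `flatten_protocol`, the grouping of statuses into "completed" and "failed", and the field names in the usage example are all assumptions made for illustration.

```python
# Hypothetical grouping of registry statuses. The status names are ones used
# by ClinicalTrials.gov; treating these three as "failed" is an assumption.
FAILED_STATUSES = {"terminated", "withdrawn", "suspended"}


def status_label(overall_status):
    """Map a status string to 'completed', 'failed', or None (no label)."""
    status = overall_status.strip().lower()
    if status == "completed":
        return "completed"
    if status in FAILED_STATUSES:
        return "failed"
    return None  # e.g. still recruiting, or an unrecognized status


def flatten_protocol(record):
    """Join non-empty protocol fields into one string, tagging each field.

    Keeping a tag per field is one simple way to preserve some of the
    protocol's structure when feeding it to a plain text classifier.
    """
    parts = []
    for field, text in record.items():
        if text and text.strip():
            parts.append(f"[{field.upper()}] {text.strip()}")
    return " ".join(parts)
```

For example, with a toy record (invented here, not taken from the dataset), `flatten_protocol({"title": "Drug X in adults", "criteria": "Age 18 or older"})` produces a single tagged string, and `status_label("Terminated")` yields the "failed" label under the assumed grouping.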

Keywords: benchmark; clinical trial; text classification.

Grants and funding

R01 AG077016/AG/NIA NIH HHS/United States