Case-based retrieval framework for gene expression data

Ali Anaissi; Madhu Goyal; Daniel R Catchpoole; Ali Braytee; Paul J Kennedy

doi:10.4137/CIN.S22371

Cancer Inform. 2015 Mar 19:14:21-31. doi: 10.4137/CIN.S22371. eCollection 2015.

Authors

Ali Anaissi¹, Madhu Goyal¹, Daniel R Catchpoole², Ali Braytee¹, Paul J Kennedy¹

Affiliations

¹ Center for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology Sydney, Broadway, New South Wales, Australia.
² The Tumour Bank, Children's Cancer Research Unit, The Children's Hospital at Westmead, Westmead, New South Wales, Australia.

Abstract

Background: Retrieving similar cases in a case-based reasoning system is a major challenge for gene expression data sets. Microarray technology generates a huge number of gene expression values, which leads to complex, high-dimensional data sets for which similarity measures are problematic. Hence, gene expression similarity measurement requires machine-learning and data-mining techniques, such as feature selection and dimensionality reduction, to be incorporated into the retrieval process.

Methods: This article proposes a case-based retrieval framework that uses a k-nearest-neighbor classifier with a weighted-feature-based similarity to retrieve previously treated patients based on their gene expression profiles.
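
The retrieval idea can be illustrated with a minimal sketch. It is not the authors' implementation: the weighted Euclidean distance, the majority vote, the function names, and the toy profiles, weights, and labels below are all assumptions chosen for illustration, since the abstract does not specify how the feature weights or the similarity are computed.

```python
import math
from collections import Counter


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two expression profiles (assumed form)."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def retrieve_similar_cases(query, case_base, weights, k=3):
    """Return the k stored cases closest to the query profile.

    case_base is a list of (case_id, profile, label) tuples.
    """
    ranked = sorted(
        case_base,
        key=lambda case: weighted_distance(query, case[1], weights),
    )
    return ranked[:k]


def predict_label(neighbors):
    """Majority vote over the labels of the retrieved cases."""
    return Counter(label for _, _, label in neighbors).most_common(1)[0][0]


# Hypothetical toy case base: three "genes" per patient, made-up values and labels.
case_base = [
    ("p1", [2.0, 0.1, 5.0], "ALL"),
    ("p2", [2.1, 0.9, 5.2], "ALL"),
    ("p3", [7.5, 0.2, 1.0], "AML"),
    ("p4", [7.0, 0.8, 1.3], "AML"),
]

# Hypothetical feature weights; a weight of 0.0 removes a gene from the comparison.
weights = [0.6, 0.0, 0.4]

query = [2.2, 0.5, 4.8]
neighbors = retrieve_similar_cases(query, case_base, weights, k=3)
print([case_id for case_id, _, _ in neighbors])
print(predict_label(neighbors))
```

In this sketch, a gene with weight zero contributes nothing to the distance, so feature weighting and feature selection act through the same mechanism; how the framework actually derives its weights is described in the full article, not here.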

Results: The proposed methodology is validated on four data sets: a childhood leukemia data set collected from The Children's Hospital at Westmead, and the Colon cancer, National Cancer Institute (NCI), and Prostate cancer data sets. In retrieving stored patients who are similar to new patients, the framework achieves 96% accuracy on the childhood leukemia data set, 95% on the NCI data set, 93% on the Colon cancer data set, and 98% on the Prostate cancer data set.

Conclusion: The designed case-based retrieval framework is an appropriate choice for retrieving previous patients who are similar to a new patient, on the basis of their gene expression data, for better diagnosis and treatment of childhood leukemia. Moreover, this framework can be applied to other gene expression data sets using some or all of its steps.

Keywords: case-based reasoning; data mining; dimensionality reduction; feature weighting; gene expression; machine learning.