Prediction of human breast and colon cancers from imbalanced data using nearest neighbor and support vector machines

Comput Methods Programs Biomed. 2014 Mar;113(3):792-808. doi: 10.1016/j.cmpb.2014.01.001. Epub 2014 Jan 10.

Abstract

This study proposes a novel prediction approach for human breast and colon cancers using different feature spaces. The proposed scheme consists of two stages: the preprocessor and the predictor. In the preprocessor stage, the mega-trend diffusion (MTD) technique is employed to increase the samples of the minority class, thereby balancing the dataset. In the predictor stage, machine-learning approaches of K-nearest neighbor (KNN) and support vector machines (SVM) are used to develop hybrid MTD-SVM and MTD-KNN prediction models. MTD-SVM model has provided the best values of accuracy, G-mean and Matthew's correlation coefficient of 96.71%, 96.70% and 71.98% for cancer/non-cancer dataset, breast/non-breast cancer dataset and colon/non-colon cancer dataset, respectively. We found that hybrid MTD-SVM is the best with respect to prediction performance and computational cost. MTD-KNN model has achieved moderately better prediction as compared to hybrid MTD-NB (Naïve Bayes) but at the expense of higher computing cost. MTD-KNN model is faster than MTD-RF (random forest) but its prediction is not better than MTD-RF. To the best of our knowledge, the reported results are the best results, so far, for these datasets. The proposed scheme indicates that the developed models can be used as a tool for the prediction of cancer. This scheme may be useful for study of any sequential information such as protein sequence or any nucleic acid sequence.

Keywords: Breast/colon cancer; K-nearest neighbor; Mega-trend diffusion; Naïve Bayes; Random forest; Support vector machines.

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Artificial Intelligence
  • Breast Neoplasms / diagnosis*
  • Colonic Neoplasms / diagnosis*
  • Computational Biology
  • Databases, Protein
  • Diagnosis, Computer-Assisted / statistics & numerical data*
  • False Positive Reactions
  • Female
  • Humans
  • Linear Models
  • Male
  • Support Vector Machine*