Taxonomy dimension reduction for colorectal cancer prediction

Comput Biol Chem. 2019 Dec:83:107160. doi: 10.1016/j.compbiolchem.2019.107160. Epub 2019 Nov 9.

Abstract

Colorectal cancer is one of the most common cancers, and the number of people affected is growing. Early diagnosis and treatment are essential. Because the disease may alter the microbial communities of the gut, gut microorganisms could offer an efficient means of predicting colorectal cancer. In this study, we used operational taxonomic units, which cover several kinds of microorganisms, as features for prediction. To identify the most important microorganisms and obtain the best prediction performance, we explored effective feature selection methods in three main steps. First, we applied a single method to reduce the features. Next, to reduce the number of features further, we combined the dimension reduction methods correlation-based feature selection and maximum relevance-maximum distance (MRMD 1.0 and MRMD 2.0). Finally, we selected the important features according to the taxonomy files. We created separate training and test sets to obtain a more objective evaluation, and evaluated random forest, naïve Bayes, and decision tree classifiers. The results show that the proposed methods outperform hierarchical feature engineering. The combination of correlation-based feature selection with MRMD 2.0 performed best on the CRC2 dataset. The dataset and methods are available at http://lab.malab.cn/data/microdata/data.html.
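
To make the idea of a maximum relevance-maximum distance ranking more concrete, the following is a minimal sketch in plain Python. It is not the authors' MRMD 1.0 or 2.0 implementation. It assumes that relevance is the absolute Pearson correlation between a feature and the class label, that distance is the mean Euclidean distance from a feature to all other features (scaled to [0, 1]), and that the two are simply summed. The function name `mrmd_style_rank`, the OTU names, and the abundance values are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mrmd_style_rank(columns, labels):
    """Rank feature names by relevance plus scaled mean distance (assumed scoring).

    columns: dict mapping feature name -> list of values, one per sample
    labels:  list of 0/1 class labels, one per sample
    """
    names = list(columns)
    # Relevance: how strongly each feature tracks the class label.
    relevance = {n: abs(pearson(columns[n], labels)) for n in names}
    # Distance: how far each feature is, on average, from every other feature.
    mean_dist = {}
    for n in names:
        others = [euclidean(columns[n], columns[o]) for o in names if o != n]
        mean_dist[n] = sum(others) / len(others) if others else 0.0
    # Scale distances to [0, 1] so they are comparable with the correlations.
    top = max(mean_dist.values()) or 1.0
    scores = {n: relevance[n] + mean_dist[n] / top for n in names}
    return sorted(names, key=lambda n: scores[n], reverse=True)


# Hypothetical relative abundances for four OTUs across six samples.
labels = [1, 1, 1, 0, 0, 0]  # 1 = colorectal cancer, 0 = control (toy labels)
columns = {
    "otu_a": [0.90, 0.80, 0.85, 0.10, 0.20, 0.15],
    "otu_b": [0.88, 0.82, 0.84, 0.12, 0.18, 0.16],
    "otu_c": [0.50, 0.50, 0.50, 0.50, 0.50, 0.50],
    "otu_d": [0.30, 0.10, 0.40, 0.20, 0.35, 0.15],
}

ranking = mrmd_style_rank(columns, labels)
print(ranking)
```

In a workflow like the one described above, the top-ranked features would then be passed to a classifier and scored on a held-out test set. In this toy data the constant feature `otu_c` carries no class information and ranks last.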

Keywords: Colorectal cancer; Correlation-based feature selection; Machine learning; Maximum relevant Maximum distance; Microbial.

MeSH terms

  • Bacteria / classification*
  • Bacteria / isolation & purification*
  • Bayes Theorem
  • Colorectal Neoplasms / diagnosis*
  • Colorectal Neoplasms / microbiology*
  • Decision Trees
  • Gastrointestinal Microbiome*
  • Humans