Machine learning and deep learning methods that use omics data for metastasis prediction

Comput Struct Biotechnol J. 2021 Sep 4:19:5008-5018. doi: 10.1016/j.csbj.2021.09.001. eCollection 2021.

Abstract

The knowledge that metastasis is the primary cause of cancer-related deaths has incentivized research directed towards unraveling the complex cellular processes that drive metastasis. Advances in technology, specifically the advent of high-throughput sequencing, provide knowledge of such processes. This knowledge has led to the development of therapeutic and clinical applications and is now being used to predict the onset of metastasis to improve diagnostics and disease therapies. In this regard, predicting metastasis onset has also been explored using artificial intelligence approaches based on machine learning and, more recently, deep learning. This review summarizes the machine learning and deep learning-based metastasis prediction methods developed to date. We also detail the types of molecular data used to build the models and the critical signatures derived from the different methods. We further highlight the challenges associated with using machine learning and deep learning methods and provide suggestions to improve the predictive performance of such methods.

Keywords: AE, autoencoder; ANN, Artificial Neural Network; AUC, area under the curve; Acc, Accuracy; Artificial intelligence; BC, Betweenness centrality; BH, Benjamini-Hochberg; BioGRID, Biological General Repository for Interaction Datasets; CCP, compound covariate predictor; CEA, Carcinoembryonic antigen; CNN, convolutional neural networks; CV, cross-validation; Cancer; DBN, deep belief network; DDBN, discriminative deep belief network; DEGs, differentially expressed genes; DIP, Database of Interacting Proteins; DNN, Deep neural network; DT, Decision Tree; Deep learning; EMT, epithelial-mesenchymal transition; FC, fully connected; GA, Genetic Algorithm; GANs, generative adversarial networks; GEO, Gene Expression Omnibus; HCC, hepatocellular carcinoma; HPRD, Human Protein Reference Database; KNN, K-nearest neighbor; L-SVM, linear SVM; LIMMA, linear models for microarray data; LOOCV, Leave-one-out cross-validation; LR, Logistic Regression; MCCV, Monte Carlo cross-validation; MLP, multilayer perceptron; Machine learning; Metastasis; NPV, negative predictive value; PCA, Principal component analysis; PPI, protein-protein interaction; PPV, positive predictive value; RC, ridge classifier; RF, Random Forest; RFE, recursive feature elimination; RMA, robust multi‐array average; RNN, recurrent neural networks; SGD, stochastic gradient descent; SMOTE, synthetic minority over-sampling technique; SVM, Support Vector Machine; Se, sensitivity; Sp, specificity; TCGA, The Cancer Genome Atlas; k-CV, k-fold cross-validation; mRMR, minimum redundancy maximum relevance.

Publication types

  • Review