CNN-MGP: Convolutional Neural Networks for Metagenomics Gene Prediction

Interdiscip Sci. 2019 Dec;11(4):628-635. doi: 10.1007/s12539-018-0313-4. Epub 2018 Dec 27.

Abstract

Accurate gene prediction in metagenomics fragments is computationally challenging because the reads are short and the data are incomplete and fragmented. Most gene-prediction programs extract a large number of features and then apply statistical or supervised classification approaches to predict genes. In our study, we introduce a convolutional neural network for metagenomics gene prediction (CNN-MGP) program that predicts genes in metagenomics fragments directly from raw DNA sequences, without manual feature extraction or feature selection stages. CNN-MGP is able to learn the characteristics of coding and non-coding regions and to distinguish coding from non-coding open reading frames (ORFs). We train 10 CNN models on 10 mutually exclusive datasets defined by pre-defined GC content ranges. We extract ORFs from each fragment; the ORFs are then encoded numerically and fed into the CNN model that matches the fragment's GC content. The output of the CNN is the probability that an ORF encodes a gene. Finally, a greedy algorithm is used to select the final gene list. Overall, CNN-MGP is effective and achieves 91% accuracy on the testing dataset. CNN-MGP demonstrates the ability of deep learning to predict genes in metagenomics fragments, and it achieves an accuracy higher than or comparable to that of state-of-the-art gene-prediction programs that use pre-defined features.
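The pipeline described in the abstract (ORF extraction, numerical encoding, GC-based model selection, greedy selection) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and most specifics are assumptions, since the abstract does not give them: the start codon set, one-hot encoding, ten equal-width GC bins, a 0.5 probability threshold, and a greedy rule that keeps the highest-scoring non-overlapping ORFs. For brevity it scans only the forward strand for complete ORFs, and each "model" is a hypothetical callable standing in for a trained CNN.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common prokaryotic starts


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_bin(gc, n_bins=10):
    """Map a GC fraction to a bin index (assumption: equal-width bins)."""
    return min(int(gc * n_bins), n_bins - 1)


def one_hot(seq):
    """Encode bases as 4-element vectors (assumption: one-hot encoding)."""
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
             "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [table.get(base, [0, 0, 0, 0]) for base in seq.upper()]


def find_orfs(seq, min_len=60):
    """Return (start, end) of complete forward-strand ORFs in all 3 frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def greedy_select(scored_orfs, threshold=0.5):
    """Keep ORFs by descending probability, skipping any that overlap
    an already chosen ORF (assumption: no overlap is tolerated)."""
    chosen = []
    ranked = sorted(scored_orfs, key=lambda o: o[2], reverse=True)
    for start, end, prob in ranked:
        if prob < threshold:
            break
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, prob))
    return sorted(chosen)


def predict_genes(fragment, models, min_len=60):
    """Score each ORF with the model for the fragment's GC bin, then select.
    `models` is a list of hypothetical callables: encoded ORF -> probability."""
    model = models[gc_bin(gc_content(fragment), len(models))]
    scored = [(s, e, model(one_hot(fragment[s:e])))
              for s, e in find_orfs(fragment, min_len)]
    return greedy_select(scored)
```

As a usage example, with ten placeholder models that always return 0.9, `predict_genes("ATGAAATAA", [lambda x: 0.9] * 10, min_len=9)` scores the single ORF spanning positions 0 to 9 and keeps it. A realistic version would also scan the reverse strand and handle ORFs truncated at fragment ends, which are common in short metagenomic reads.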

Keywords: Convolutional neural network; Deep learning; Gene prediction; Metagenomics; ORF.

MeSH terms

  • Algorithms
  • Base Composition
  • Binding Sites
  • Computational Biology / methods*
  • Gene Regulatory Networks
  • Genes, Archaeal*
  • Genes, Bacterial*
  • Machine Learning
  • Metagenomics*
  • Neural Networks, Computer
  • Open Reading Frames*
  • Probability
  • Promoter Regions, Genetic
  • Reproducibility of Results
  • Sensitivity and Specificity
  • Sequence Analysis, DNA