Suite of tools for statistical N-gram language modeling for pattern mining in whole genome sequences

J Bioinform Comput Biol. 2012 Dec;10(6):1250016. doi: 10.1142/S0219720012500163. Epub 2012 Jul 22.

Abstract

Genome sequences contain many patterns of biomedical significance, and repetitive sequences of various kinds are a primary component of most of them. We extended the suffix-array-based Biological Language Modeling Toolkit to compute n-gram frequencies and n-gram language-model-based perplexity in windows over a whole genome sequence, in order to find biologically relevant patterns. We present the suite of tools and its application to the analysis of the whole human genome sequence.
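
As an illustration of the windowed analysis described above, the following is a minimal sketch of counting n-grams and scoring sliding windows by perplexity. It is not the toolkit's implementation: it uses plain dictionary counts rather than a suffix array, and the function names (`ngram_counts`, `window_perplexities`), the add-one smoothing, the base-2 logarithm, and the default window and step sizes are all assumptions made for this example.

```python
import math
from collections import Counter


def ngram_counts(seq, n):
    """Count every overlapping n-gram in seq (hypothetical helper)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def window_perplexities(seq, n=3, window=8, step=4, alphabet="ACGT"):
    """Return (start, perplexity) for each sliding window over seq.

    Assumption: the n-gram model is estimated from the whole sequence
    with add-one smoothing; the toolkit itself may estimate it differently.
    """
    if n < 1 or window < n:
        raise ValueError("need 1 <= n <= window")
    grams = ngram_counts(seq, n)
    # Context counts: how often each (n-1)-gram is followed by a symbol.
    contexts = Counter()
    for gram, count in grams.items():
        contexts[gram[:-1]] += count
    vocab = len(alphabet)
    results = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        log_prob = 0.0
        predicted = 0
        # Predict each symbol that has a full (n-1)-symbol context in the window.
        for i in range(n - 1, len(w)):
            gram = w[i - n + 1:i + 1]
            p = (grams[gram] + 1) / (contexts[gram[:-1]] + vocab)
            log_prob += math.log2(p)
            predicted += 1
        # Perplexity = 2 ** (negative mean log2-probability per symbol).
        results.append((start, 2 ** (-log_prob / predicted)))
    return results


if __name__ == "__main__":
    toy = "ACGT" * 8 + "AAAAAAAAAAAA" + "GATTACAGATTACA"
    for start, ppl in window_perplexities(toy):
        print(start, round(ppl, 3))
```

Under this sketch, windows with unusually low or high perplexity relative to the rest of the sequence would be the candidates to inspect for repeats or other patterns. A whole-genome run would need a more memory-efficient index, which is the role the suffix array plays in the toolkit.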

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Data Interpretation, Statistical
  • Genome*
  • Genomics / methods*
  • Humans
  • Models, Statistical*
  • Sequence Alignment / methods
  • Sequence Analysis, DNA / methods