Computational approaches to mine publicly available databases

Methods Mol Biol. 2014:1126:325-40. doi: 10.1007/978-1-62703-980-2_24.

Abstract

Publicly available sequence annotation data is a vital resource for researchers. Many types of information are available, including structural annotations (i.e., the locations and identities of genomic features) and functional annotations (e.g., gene expression and protein interactions). Annotation data is especially useful for interrogating next-generation sequencing data (e.g., identifying genomic features that are associated with mapped reads). Additionally, the vast amount of available data offers researchers the opportunity to mine existing data sets and make new discoveries. The ability to efficiently obtain, manipulate, and interrogate this data is a valuable and empowering skill. In this chapter, we introduce several primary data repositories and describe the most commonly encountered file formats. To highlight some of the key concepts, operations, and utilities involved in working with annotation data, we provide a fully worked example of using annotations to answer some basic questions about a particular ChIP-seq data set.
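
As an illustration of the kind of operation mentioned above (identifying genomic features associated with mapped reads), the following is a minimal sketch and not the chapter's own worked example. It assumes both reads and features are given as tab-separated BED-like lines (chromosome, start, end, name) with 0-based, half-open coordinates. The function names `parse_bed` and `features_overlapping` and all sample records are hypothetical.

```python
from collections import defaultdict


def parse_bed(lines):
    """Parse BED-like lines into (chrom, start, end, name) tuples.

    Assumes at least four tab-separated columns and 0-based,
    half-open coordinates; header and comment lines are skipped.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.split("\t")[:4]
        records.append((chrom, int(start), int(end), name))
    return records


def features_overlapping(reads, features):
    """Map each read name to the names of the features it overlaps.

    Uses a simple per-chromosome scan, which is fine for small inputs;
    large data sets would normally call for an interval index or a
    dedicated tool.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, name in features:
        by_chrom[chrom].append((start, end, name))

    hits = {}
    for chrom, r_start, r_end, r_name in reads:
        hits[r_name] = [
            f_name
            for f_start, f_end, f_name in by_chrom.get(chrom, [])
            # Half-open intervals overlap when each starts before the other ends.
            if f_start < r_end and r_start < f_end
        ]
    return hits


# Hypothetical sample data, for illustration only.
feature_lines = [
    "chr1\t100\t200\tgeneA",
    "chr1\t500\t900\tgeneB",
    "chr2\t100\t200\tgeneC",
]
read_lines = [
    "chr1\t150\t186\tread1",
    "chr1\t200\t236\tread2",
    "chr2\t90\t126\tread3",
]

overlaps = features_overlapping(parse_bed(read_lines), parse_bed(feature_lines))
for read_name, feature_names in overlaps.items():
    print(read_name, feature_names)
```

Because the coordinates are half-open, a read that starts exactly where a feature ends (such as `read2` against `geneA` here) is not counted as overlapping.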

MeSH terms

  • Computational Biology / methods*
  • Data Mining / methods*
  • Databases, Nucleic Acid*
  • Molecular Sequence Annotation / methods