Analysis and simulation of gene expression profiles in pure and mixed cell populations

Phys Biol. 2011 Jun;8(3):035013. doi: 10.1088/1478-3975/8/3/035013. Epub 2011 May 13.

Abstract

Log transformation is routinely applied when analyzing and interpreting data from experimental readouts of gene expression, such as microarrays and RNA-sequencing, because expression data, like many biological parameters, are strongly skewed. We show here that gene expression levels in multicellular organisms often deviate from simple (log)normal distributions and instead exhibit shouldered or bimodal distributions. Based on a mathematical model and numerical simulations, we demonstrate that many observed distributions can be explained as mixtures of bimodal two-component lognormal models. This is because, after log transformation, the resulting distributions display reductions in the first peak rather than increasing overlaps over a wide range of parameter values. Comparison of the theoretical results with biological datasets suggests that the distributions are generally bimodal for single cell types and are obscured by the different cell types present in tissue samples. Our analysis thus provides an initial explanation for the various types of expression level distributions found in different datasets. This will be important for the interpretation of next-generation sequencing data, such as transcriptomics by mRNA-sequencing and ChIP-sequencing of epigenetic marks.
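
The mixing idea described in the abstract can be illustrated with a minimal simulation sketch. It is not the authors' model: it assumes that each gene is independently in a "low" or "high" state in each cell type, that a tissue measurement is the weighted average of the cell-type profiles on the linear scale, and that the peak positions, widths, and threshold are arbitrary illustrative values. All function names are hypothetical. Under these assumptions, the fraction of genes in the low peak after log transformation drops when several cell types are mixed.

```python
import numpy as np

# All parameter values are illustrative assumptions, not values from the paper.

def simulate_cell_type(rng, n_genes=5000, frac_low=0.5,
                       mu_low=2.0, mu_high=10.0, sigma=1.0):
    """Return linear-scale expression for one hypothetical pure cell type.

    log2 expression is drawn from a two-component normal mixture, so the
    linear-scale values follow a two-component lognormal mixture.
    """
    is_low = rng.random(n_genes) < frac_low       # per-gene low/high state
    means = np.where(is_low, mu_low, mu_high)     # log2 mean for each gene
    log2_expr = rng.normal(means, sigma)          # one draw per gene
    return 2.0 ** log2_expr


def mix_cell_types(profiles, weights):
    """Weighted per-gene average of cell-type profiles on the linear scale."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                               # normalize to proportions
    return w @ np.stack(profiles)


def low_peak_fraction(expr, threshold_log2=6.0):
    """Fraction of genes whose log2 expression lies below the threshold."""
    return float(np.mean(np.log2(expr) < threshold_log2))


rng = np.random.default_rng(0)
profiles = [simulate_cell_type(rng) for _ in range(4)]   # four cell types
tissue = mix_cell_types(profiles, [1, 1, 1, 1])          # equal proportions

pure_low = low_peak_fraction(profiles[0])
tissue_low = low_peak_fraction(tissue)
print(f"low-peak fraction, pure cell type: {pure_low:.3f}")
print(f"low-peak fraction, mixed tissue:   {tissue_low:.3f}")
```

In this sketch a gene stays in the low peak of the mixed sample only if it is low in every contributing cell type, which is why the first peak shrinks instead of the two peaks merging. Other assumptions about how gene states correlate across cell types would change the numbers.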

Publication types

  • Research Support, Non-U.S. Gov't
  • Review

MeSH terms

  • Animals
  • Cells / cytology
  • Cells / metabolism*
  • Computational Biology*
  • Computer Simulation*
  • Gene Expression Profiling / methods*
  • Humans
  • RNA, Messenger / analysis
  • RNA, Messenger / genetics
  • Sequence Analysis, RNA

Substances

  • RNA, Messenger