Comparative study of probability distribution distances to define a metric for the stability of multi-source biomedical research data

Annu Int Conf IEEE Eng Med Biol Soc. 2013:2013:3226-9. doi: 10.1109/EMBC.2013.6610228.

Abstract

Research biobanks are often composed of data from multiple sources. In some cases, these subsets of data may present dissimilarities among their probability density functions (PDFs) due to spatial shifts. This may lead to wrong hypotheses when the data are treated as a whole, and it diminishes the overall quality of the data. With the purpose of developing a generic and comparable metric to assess the stability of multi-source datasets, we have studied the applicability and behaviour of several PDF distances under shifts in different conditions that may appear in real biomedical data (such as uni- and multivariate data, different types of variable, and multi-modality). Among the studied distances, we found information-theoretic distances and the Earth Mover's Distance to be the most practical for most conditions. We discuss the properties and usefulness of each distance according to the possible requirements of a general stability metric.
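As an illustration of the kind of comparison described above, the following minimal sketch computes one information-theoretic distance and the Earth Mover's Distance between binned data from two sources. It is not the authors' implementation: the choice of the Jensen-Shannon divergence as the information-theoretic example, the function names, and the histogram counts are all assumptions made for illustration.

```python
import math


def normalize(counts):
    """Turn raw histogram counts into a discrete probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits (bounded between 0 and 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute nothing.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def earth_movers_distance_1d(p, q, bin_width=1.0):
    """1-D Earth Mover's Distance for histograms on the same equal-width bins.

    In one dimension it equals the summed absolute difference of the
    cumulative distributions, scaled by the bin width.
    """
    cumulative_diff = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cumulative_diff += pi - qi
        total += abs(cumulative_diff)
    return total * bin_width


# Hypothetical counts for one variable from three sources (same 10 bins).
source_a = normalize([0, 10, 30, 40, 15, 5, 0, 0, 0, 0])
source_b = normalize([0, 0, 10, 30, 40, 15, 5, 0, 0, 0])  # shifted by 1 bin
source_c = normalize([0, 0, 0, 0, 10, 30, 40, 15, 5, 0])  # shifted by 3 bins

jsd_ab = jensen_shannon_divergence(source_a, source_b)
jsd_ac = jensen_shannon_divergence(source_a, source_c)
emd_ab = earth_movers_distance_1d(source_a, source_b)
emd_ac = earth_movers_distance_1d(source_a, source_c)

print(f"JSD a-b: {jsd_ab:.3f}  JSD a-c: {jsd_ac:.3f}")
print(f"EMD a-b: {emd_ab:.3f}  EMD a-c: {emd_ac:.3f}")
```

In this sketch both distances increase with the size of the shift, but they behave differently: the Jensen-Shannon divergence is bounded and saturates once the two distributions no longer overlap, whereas the Earth Mover's Distance keeps growing in proportion to the shift.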

MeSH terms

  • Biomedical Research*
  • Databases, Factual
  • Models, Statistical*
  • Probability
  • Research Design
  • Statistics, Nonparametric