Characterizing the genetic structure of a forensic DNA database using a latent variable approach

Maarten Kruijver

doi:10.1016/j.fsigen.2016.03.007

Forensic Sci Int Genet. 2016 Jul:23:130-149. doi: 10.1016/j.fsigen.2016.03.007. Epub 2016 Apr 1.

Author

Maarten Kruijver¹

Affiliation

¹ Department of Mathematics, VU University, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands. Electronic address: m.v.kruijver@vu.nl.

PMID: 27128695

Abstract

Several problems in forensic genetics require a representative model of a forensic DNA database. Obtaining an accurate representation of the offender database can be difficult, since databases typically contain groups of persons with unregistered ethnic origins in unknown proportions. We propose to estimate the allele frequencies of the subpopulations comprising the offender database, together with their proportions, from the database itself using a latent variable approach. We present a model whose parameters can be estimated using the expectation-maximization (EM) algorithm. This approach does not rely on relatively small and possibly unrepresentative population surveys, but is driven only by the actual genetic composition of the database. We fit the model to a snapshot of the Dutch offender database (2014), which contains close to 180,000 profiles, and find that three subpopulations suffice to describe a large fraction of the heterogeneity in the database. We demonstrate the utility and reliability of the approach with three applications. First, we use the model to predict the number of false leads obtained in database searches. We assess how well the model predicts the number of false leads obtained in mock searches in the Dutch offender database, both for familial searching for first-degree relatives of a donor and for searching for contributors to three-person mixtures. Second, we study the degree of partial matching between all pairs of profiles in the Dutch database and compare this to what is predicted using the latent variable approach. Third, we use the model to provide evidence that the Dutch practice of estimating match probabilities using the Balding-Nichols formula with a native Dutch reference database and θ=0.03 is conservative.
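To make the latent variable idea concrete, the following is a minimal sketch of EM for a generic mixture of subpopulations. It is not the model fitted in the paper, whose details the abstract does not give. The sketch assumes Hardy-Weinberg proportions within each subpopulation and independence between loci, and it runs on simulated data only. All names (`simulate`, `fit_em`, `pi`, `p`) and all numeric settings are hypothetical choices made for illustration.

```python
import numpy as np


def simulate(n, pi, p, rng):
    """Simulate n profiles as allele counts of shape (n, loci, alleles).

    Each profile gets a hidden subpopulation label z, then two alleles per
    locus drawn from that subpopulation's frequencies (assumed model).
    """
    K, L, A = p.shape
    z = rng.choice(K, size=n, p=pi)
    X = np.zeros((n, L, A))
    for i in range(n):
        for l in range(L):
            X[i, l] = rng.multinomial(2, p[z[i], l])
    return X, z


def fit_em(X, K, n_iter=200, tol=1e-8, seed=0):
    """Estimate subpopulation proportions and allele frequencies with EM."""
    rng = np.random.default_rng(seed)
    n, L, A = X.shape
    pi = np.full(K, 1.0 / K)                    # mixture proportions
    p = rng.dirichlet(np.ones(A), size=(K, L))  # frequencies, shape (K, L, A)
    history = []
    for _ in range(n_iter):
        # E-step: posterior probability that profile i belongs to subpopulation k.
        # The factor 2 for heterozygotes is the same for every k, so it is dropped.
        log_p = np.log(np.clip(p, 1e-300, None))
        log_joint = np.log(np.clip(pi, 1e-300, None)) + np.einsum(
            "ila,kla->ik", X, log_p
        )
        m = log_joint.max(axis=1, keepdims=True)
        log_norm = m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))
        r = np.exp(log_joint - log_norm)
        # Log-likelihood (up to an additive constant) at the current parameters.
        history.append(float(log_norm.sum()))

        # M-step: responsibility-weighted proportions and allele frequencies.
        pi = r.mean(axis=0)
        counts = np.einsum("ik,ila->kla", r, X)
        p = counts / counts.sum(axis=2, keepdims=True)

        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
    # r holds the responsibilities from the last E-step that was run.
    return pi, p, r, history


# Illustrative run on synthetic data (two subpopulations, 15 loci, 6 alleles).
rng = np.random.default_rng(1)
true_pi = np.array([0.7, 0.3])
true_p = rng.dirichlet(np.full(6, 0.5), size=(2, 15))
X, z = simulate(1000, true_pi, true_p, rng)
pi_hat, p_hat, r, history = fit_em(X, K=2)
print("estimated proportions:", np.round(pi_hat, 3))
print("iterations:", len(history))
```

The order of the estimated subpopulations is arbitrary, so estimates should be compared to the simulated values only up to relabeling. In practice the number of subpopulations would have to be chosen by some model comparison, which this sketch does not attempt.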

Keywords: DNA database; DNA mixtures; Familial searching; Subpopulations.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
DNA / genetics
DNA Fingerprinting
Databases, Nucleic Acid*
Gene Frequency
Humans
Likelihood Functions
Models, Genetic*
Netherlands
Reproducibility of Results

Substances

DNA