Mining whole genome sequence data to efficiently attribute individuals to source populations

Francisco J Pérez-Reche; Ovidiu Rotariu; Bruno S Lopes; Ken J Forbes; Norval J C Strachan

doi:10.1038/s41598-020-68740-6

Sci Rep. 2020 Jul 22;10(1):12124. doi: 10.1038/s41598-020-68740-6.

Authors

Francisco J Pérez-Reche¹, Ovidiu Rotariu², Bruno S Lopes³, Ken J Forbes³, Norval J C Strachan²

Affiliations

¹ Institute of Complex Systems and Mathematical Biology, SUPA, School of Natural and Computing Sciences, University of Aberdeen, Aberdeen, AB24 3UE, Scotland, UK. fperez-reche@abdn.ac.uk.
² School of Biological Sciences, University of Aberdeen, Aberdeen, AB24 3UU, Scotland, UK.
³ School of Medicine, Medical Sciences and Dentistry, University of Aberdeen, Foresterhill, Aberdeen, AB25 2ZD, Scotland, UK.

Abstract

Whole genome sequence (WGS) data could transform our ability to attribute individuals to source populations. However, methods that efficiently mine these data are yet to be developed. We present a minimal multilocus distance (MMD) method that handles these large data sets rapidly, together with methods for optimally selecting loci. We applied the method to WGS data to determine the source of human campylobacteriosis and the geographical origin of diverse biological species, including humans, and to proteomic data to classify breast cancer tumours. The MMD method provides highly accurate attribution and remains computationally efficient for extended genotypes. These methods are generic, easy to implement for WGS and proteomic data, and have wide application.
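
The abstract does not state the distance measure or the decision rule behind MMD, so the following is only a minimal sketch of one plausible reading of the name, not the authors' implementation. It assumes that genotypes are equal-length sequences of allele labels, that the distance is the Hamming distance (the number of loci at which two genotypes differ), and that an individual is attributed to the source containing its closest genotype. The function names `hamming` and `attribute_by_minimal_distance`, and the toy source data, are hypothetical.

```python
def hamming(a, b):
    """Count the loci at which two genotypes carry different alleles."""
    if len(a) != len(b):
        raise ValueError("genotypes must cover the same loci")
    return sum(x != y for x, y in zip(a, b))


def attribute_by_minimal_distance(query, sources):
    """Attribute `query` to the source holding its nearest genotype.

    `sources` maps a source name to a list of genotypes.
    Returns the chosen source and the minimal distance to every source.
    Assumption: nearest-genotype rule; ties go to the first source listed.
    """
    min_dist = {
        name: min(hamming(query, genotype) for genotype in genotypes)
        for name, genotypes in sources.items()
    }
    best = min(min_dist, key=min_dist.get)
    return best, min_dist


# Toy data (invented for illustration): four loci, two sources.
sources = {
    "chicken": [[1, 2, 3, 4], [1, 2, 3, 5]],
    "cattle": [[7, 8, 9, 4], [7, 2, 9, 6]],
}
best, dists = attribute_by_minimal_distance([1, 2, 9, 5], sources)
print(best, dists)
```

Each comparison is a single pass over the loci, so the cost grows linearly with genotype length and with the number of reference genotypes, which is consistent with the computational efficiency the abstract claims for extended genotypes.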

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Animals
Campylobacter / genetics
Campylobacter / isolation & purification
Campylobacter Infections / genetics
Campylobacter Infections / pathology
Databases, Genetic*
Disease Reservoirs / microbiology
Genome, Bacterial
Genotype
Humans
Multilocus Sequence Typing / methods*
Polymorphism, Single Nucleotide
Whole Genome Sequencing*