Improving the regional Y-STR haplotype resolution utilizing haplogroup-determining Y-SNPs and the application of machine learning in Y-SNP haplogroup prediction in a forensic Y-STR database: A pilot study on male Chinese Yunnan Zhaoyang Han population

Forensic Sci Int Genet. 2022 Mar:57:102659. doi: 10.1016/j.fsigen.2021.102659. Epub 2021 Dec 29.

Abstract

Improving the resolution of widely used Y-chromosomal short tandem repeat (Y-STR) datasets is of great importance to forensic investigators, yet current approaches offer little beyond adding more Y-STR loci. In this research, a regional Y-DNA database was investigated to improve Y-STR haplotype resolution using a Y-SNP Pedigree Tagging System that includes 24 Y-chromosomal single nucleotide polymorphism (Y-SNP) loci. This pilot study was conducted in the Chinese Yunnan Zhaoyang Han population, and 3473 unrelated male individuals were enrolled. Based on the haplogroup data obtained under different panels, matched or near-matching (NM) Y-STR haplotype pairs that belonged to different haplogroups indicated the critical role of haplogroups in improving regional Y-STR haplotype resolution. A classic median-joining network analysis was performed using Y-STR or combined Y-STR/Y-SNP data to reconstruct population substructures, which revealed the ability of Y-SNPs to correct misclassifications arising from Y-STRs. Additionally, population substructures were reconstructed using multiple unsupervised or supervised dimensionality reduction methods, which indicated the potential of Y-STR haplotypes for predicting Y-SNP haplogroups. Haplogroup prediction models were built using nine publicly accessible machine-learning (ML) approaches. The best prediction accuracy reached 99.71% for major haplogroups and 98.54% for detailed haplogroups. Potential influences on prediction accuracy were assessed by adjusting the number of Y-STR loci, selecting Y-STR loci with various mutabilities, and performing data processing. ML-based predictors generally achieved better prediction accuracy than two available predictors (Nevgen and EA-YPredictor).
Three tree models were developed based on the Yfiler Plus panel with unprocessed input data, and they showed strong generalization ability in classifying various Chinese Han subgroups (validation dataset). In conclusion, this study revealed the significance and application prospects of Y-SNP haplogroups in improving regional Y-STR databases. Y-SNP haplogroups can be used to discriminate NM Y-STR haplotype pairs, and developing haplogroup prediction tools for forensic Y-STR databases is important for improving the accuracy of biogeographic ancestry inferences.
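
The idea of using Y-SNP haplogroups to discriminate matched or near-matching Y-STR haplotype pairs can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that a pair counts as "near-matching" when the two haplotypes differ at no more than `max_mismatch` loci, which may be looser or stricter than the definition used in the study, and the sample identifiers, allele values, and haplogroup labels are invented toy data. The function names `mismatched_loci` and `near_matching_pairs` are hypothetical.

```python
from itertools import combinations


def mismatched_loci(hap_a, hap_b):
    """Count the loci at which two equal-length Y-STR haplotypes differ."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same loci")
    return sum(1 for a, b in zip(hap_a, hap_b) if a != b)


def near_matching_pairs(samples, max_mismatch=1):
    """List pairs whose haplotypes differ at no more than max_mismatch loci.

    samples maps a sample id to (haplotype tuple, haplogroup label).
    Each result is (id_a, id_b, n_mismatch, same_haplogroup).
    The max_mismatch threshold is an assumption of this sketch.
    """
    pairs = []
    for (id_a, (hap_a, grp_a)), (id_b, (hap_b, grp_b)) in combinations(
        samples.items(), 2
    ):
        n_mismatch = mismatched_loci(hap_a, hap_b)
        if n_mismatch <= max_mismatch:
            pairs.append((id_a, id_b, n_mismatch, grp_a == grp_b))
    return pairs


# Invented toy data: four loci per haplotype, illustrative haplogroup labels.
samples = {
    "S1": ((15, 13, 30, 24), "O2a"),
    "S2": ((15, 13, 30, 24), "C2"),
    "S3": ((15, 13, 31, 24), "O2a"),
    "S4": ((17, 12, 28, 22), "N1"),
}

pairs = near_matching_pairs(samples)

# Pairs that Y-STRs alone cannot separate well but haplogroups can.
resolved_by_haplogroup = [p for p in pairs if not p[3]]

for id_a, id_b, n_mismatch, same_group in pairs:
    status = "same haplogroup" if same_group else "different haplogroups"
    print(f"{id_a}-{id_b}: {n_mismatch} mismatched loci, {status}")
```

In this toy set, S1 and S2 share an identical Y-STR haplotype but carry different haplogroup labels, so the haplogroup is what separates them. A real database would need the study's own NM definition and typed Y-SNP calls rather than these placeholder values.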

Keywords: Database development; Machine learning; Y-SNP haplogroup; Y-STR haplotype resolution.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • China
  • Chromosomes, Human, Y*
  • Genetics, Population
  • Haplotypes
  • Humans
  • Machine Learning
  • Male
  • Microsatellite Repeats
  • Pilot Projects
  • Polymorphism, Single Nucleotide*