A high-performance SNP panel developed by machine-learning approaches for characterizing genetic differences of Southern and Northern Han Chinese, Korean, and Japanese individuals

Electrophoresis. 2022 Jun;43(11):1183-1192. doi: 10.1002/elps.202100184. Epub 2022 May 6.

Abstract

Population stratification analyses of genetically closely related East Asians have revealed distinguishable differentiation between Han Chinese, Korean, and Japanese individuals, as well as between southern (S-) and northern (N-) Han Chinese. Previous studies offer a number of ancestry informative single nucleotide polymorphisms (AISNPs) for discriminating East-Asian populations. In this study, we collected 1185 AISNPs and examined their efficiency using frequency and genotype data from various publicly available databases. To perform fine-scale classification of S-Han, N-Han, Korean, and Japanese subjects, machine-learning methods (Softmax and Random Forest) were used to screen a panel of highly informative AISNPs and to develop a superior classification model. Stepwise classification was implemented during AISNP selection to increase and balance discrimination, first discriminating Han, Korean, and Japanese individuals, and then characterizing stratification between S-Han and N-Han. The final 272-AISNP panel, an alternative optimization of various previous works, promises reliable classification of the four East-Asian groups with >90% accuracy. This AISNP panel and the machine-learning model could be a useful and superior choice in medical genome-wide association studies and in forensic investigations to infer the ancestry of unknown suspects.

Keywords: East-Asian populations; Random Forest; Softmax; ancestry informative marker.
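
The stepwise classification described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes scikit-learn, uses simulated genotypes with made-up allele frequencies in place of real AISNP data, assigns Softmax (multinomial logistic regression) to the first step and Random Forest to the second purely for illustration, and all names (`simulate_genotypes`, `predict_stepwise`, `N_SNPS`) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N_SNPS = 100  # assumption: toy panel size, not the 272-AISNP panel
LABELS = ["S-Han", "N-Han", "Korean", "Japanese"]


def shifted(freqs, scale):
    """Perturb allele frequencies to mimic drift between populations."""
    return np.clip(freqs + rng.normal(0, scale, freqs.size), 0.05, 0.95)


def simulate_genotypes(freqs, n):
    """Draw n diploid genotypes coded as 0/1/2 copies of one allele per SNP."""
    return rng.binomial(2, freqs, size=(n, freqs.size))


# Made-up frequencies: S-Han and N-Han share a common Han background.
base = rng.uniform(0.3, 0.7, N_SNPS)
han = shifted(base, 0.2)
freqs = {
    "S-Han": shifted(han, 0.15),
    "N-Han": shifted(han, 0.15),
    "Korean": shifted(base, 0.2),
    "Japanese": shifted(base, 0.2),
}


def make_split(n_per_group):
    X = np.vstack([simulate_genotypes(freqs[p], n_per_group) for p in LABELS])
    y = np.repeat(LABELS, n_per_group)
    return X, y


X_train, y_train = make_split(150)
X_test, y_test = make_split(50)

# Step 1: softmax classifier separating Han (S- and N- merged), Korean, Japanese.
is_han_train = np.isin(y_train, ["S-Han", "N-Han"])
coarse_train = np.where(is_han_train, "Han", y_train)
stage1 = LogisticRegression(max_iter=2000).fit(X_train, coarse_train)

# Step 2: Random Forest trained on Han samples only, separating S-Han from N-Han.
stage2 = RandomForestClassifier(n_estimators=200, random_state=0)
stage2.fit(X_train[is_han_train], y_train[is_han_train])


def predict_stepwise(X):
    """Assign a coarse group first, then refine samples called 'Han'."""
    coarse = stage1.predict(X)
    fine = coarse.astype(object)
    han_mask = coarse == "Han"
    if han_mask.any():
        fine[han_mask] = stage2.predict(X[han_mask])
    return fine.astype(str)


pred = predict_stepwise(X_test)
accuracy = float(np.mean(pred == y_test))
print(f"four-group accuracy on simulated data: {accuracy:.2f}")
```

The accuracy printed here reflects only the simulated frequencies and says nothing about the published panel. In a sketch like this, the fitted forest's `feature_importances_` attribute would be one possible way to rank SNPs when screening a smaller panel.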

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Asian People / genetics
  • China
  • Gene Frequency
  • Genetics, Population*
  • Genome-Wide Association Study
  • Humans
  • Japan
  • Machine Learning
  • Polymorphism, Single Nucleotide* / genetics
  • Republic of Korea