Screening Discriminating SNPs for Chinese Indigenous Pig Breeds Identification Using a Random Forests Algorithm

Genes (Basel). 2022 Nov 25;13(12):2207. doi: 10.3390/genes13122207.

Abstract

Chinese indigenous pig breeds have unique genetic characteristics and a rich diversity; however, effective breed identification methods have not yet been well established. In this study, a genotype file of 62,822 single-nucleotide polymorphisms (SNPs), obtained from 1059 individuals of 18 Chinese indigenous pig breeds and 5 cosmopolitan breeds, was used to screen discriminating SNPs for pig breed identification. After linkage disequilibrium (LD) pruning, this study excluded 396 SNPs on non-constant chromosomes and retained 20.92–27.84% of SNPs for each of the 18 autosomes, leaving a total of 14,823 SNPs. The principal component analysis (PCA) showed the largest differences between cosmopolitan and Chinese pig breeds (PC1 = 10.452%), while relatively small differences were found among the 18 indigenous pig breeds from the Yangtze River Delta region of China. Next, a random forest (RF) algorithm was used to filter these SNPs, and the optimal number of decision trees (ntree = 1000) was determined from the corresponding out-of-bag (OOB) error rates. Two SNP ranking methods in the RF analysis were compared, the mean decrease in accuracy (MDA) and the mean decrease in Gini index (MDG), together with the effect of panel size on assignment accuracy and the distribution of panel SNPs across chromosomes. Based on these comparisons, a panel of the 1000 most breed-discriminative tag SNPs was finally selected using the MDA screening method. Breed prediction for the 318 samples in the RF test set achieved a high accuracy (>99.3%); thus, a machine learning classification method was established for the multi-breed identification of Chinese indigenous pigs based on a low-density panel of SNPs.
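The LD pruning step mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not state the software, window size, or r² threshold used, so the function names (`r_squared`, `ld_prune`), the parameters `window=50` and `r2_threshold=0.2`, and the toy genotypes are all assumptions chosen for illustration. The sketch uses a simplified greedy rule on genotypes coded as 0/1/2 allele counts: a SNP is kept unless its squared correlation with an already kept nearby SNP exceeds the threshold.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Monomorphic SNP: correlation is undefined, treat it as unlinked.
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes, window=50, r2_threshold=0.2):
    """Greedy LD pruning (illustrative, hypothetical parameters).

    `genotypes` is a list of SNPs in chromosomal order; each SNP is a list
    of allele counts, one per individual. Returns the indices of kept SNPs.
    """
    kept = []
    for j, snp in enumerate(genotypes):
        # Only compare against kept SNPs within `window` positions upstream.
        recent = [k for k in kept if j - k <= window]
        if all(r_squared(snp, genotypes[k]) <= r2_threshold for k in recent):
            kept.append(j)
    return kept


# Toy data: 4 SNPs (rows) genotyped in 6 individuals (columns).
toy = [
    [0, 1, 2, 0, 1, 2],  # SNP 0
    [0, 1, 2, 0, 1, 2],  # SNP 1: identical to SNP 0, so r^2 = 1
    [2, 1, 0, 2, 1, 0],  # SNP 2: mirror image of SNP 0, so r^2 = 1
    [1, 0, 1, 1, 2, 1],  # SNP 3: uncorrelated with SNP 0, so r^2 = 0
]
kept = ld_prune(toy, window=50, r2_threshold=0.2)
print(kept)
```

In this toy case SNPs 1 and 2 carry the same information as SNP 0 and are dropped, while SNP 3 is retained. Dedicated tools use sliding windows and step sizes rather than this simple single pass, so the sketch conveys only the idea of removing redundant, correlated markers before classification.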

Keywords: breed identification; random forests; single-nucleotide polymorphisms (SNP).

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Genotype
  • Linkage Disequilibrium
  • Polymorphism, Single Nucleotide* / genetics
  • Random Forest*
  • Swine / genetics

Grants and funding

This research was funded by the National Key Research and Development Plan, grant numbers 2021YFD1200303 and 2021YFD1200305; the Chongqing Technology Innovation and Application Development Project, grant number CSTC2021-JSCX-DXWTBX0004; and the Project of Developing Agriculture by Science and Technology in Shanghai (2022, No. 1–1).