Application of t-SNE to human genetic data

J Bioinform Comput Biol. 2017 Aug;15(4):1750017. doi: 10.1142/S0219720017500172. Epub 2017 Jun 23.

Abstract

The t-distributed stochastic neighbor embedding (t-SNE) is a new dimension reduction and visualization technique for high-dimensional data. t-SNE is rarely applied to human genetic data, even though it is commonly used in other data-intensive biological fields, such as single-cell genomics. We explore the applicability of t-SNE to human genetic data and make three observations: (i) like previously used dimension reduction techniques such as principal component analysis (PCA), t-SNE is able to separate samples from different continents; (ii) t-SNE is more robust than PCA to the presence of outliers; (iii) t-SNE is able to display both continental and sub-continental patterns in a single plot. We conclude that the ability of t-SNE to reveal population stratification at different scales could be useful for human genetic association studies.
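
The kind of comparison described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes scikit-learn and NumPy are available, and it uses a small synthetic genotype matrix (0/1/2 minor-allele counts) in place of real SNP data. The population count, sample sizes, allele-frequency model, and the perplexity value are all hypothetical choices made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 simulated populations, 30 samples each, 500 SNPs.
n_pops, n_per_pop, n_snps = 3, 30, 500

# Assumed toy model: each population's allele frequencies are a noisy
# copy of a shared baseline, clipped to stay inside (0, 1).
base_freq = rng.uniform(0.2, 0.8, size=n_snps)
pop_freqs = np.clip(
    base_freq + rng.normal(0.0, 0.1, size=(n_pops, n_snps)), 0.05, 0.95
)

# Genotypes are minor-allele counts (0, 1, or 2) drawn per sample and SNP.
genotypes = np.vstack(
    [rng.binomial(2, freqs, size=(n_per_pop, n_snps)) for freqs in pop_freqs]
).astype(float)
labels = np.repeat(np.arange(n_pops), n_per_pop)  # population of each sample

# Center each SNP column before embedding.
centered = genotypes - genotypes.mean(axis=0)

# Two-dimensional embeddings of the same samples with PCA and with t-SNE.
pca_coords = PCA(n_components=2).fit_transform(centered)
tsne_coords = TSNE(
    n_components=2, perplexity=20, init="pca", random_state=0
).fit_transform(centered)
```

Plotting `pca_coords` and `tsne_coords` colored by `labels` gives a side-by-side view of how each method groups the simulated samples. Results on real data will depend on SNP filtering and on the t-SNE perplexity setting.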

Keywords: PCA; SNP; dimension reduction; t-SNE.

MeSH terms

  • Computational Biology / methods*
  • Computer Graphics*
  • Genetics, Population*
  • Genome-Wide Association Study*
  • Human Genetics*
  • Humans
  • Polymorphism, Single Nucleotide*
  • Principal Component Analysis