Evaluation of vicinity-based hidden Markov models for genotype imputation

Su Wang; Miran Kim; Xiaoqian Jiang; Arif Ozgun Harmanci

doi:10.1186/s12859-022-04896-4

BMC Bioinformatics. 2022 Aug 29;23(1):356. doi: 10.1186/s12859-022-04896-4.

Authors

Su Wang¹, Miran Kim², Xiaoqian Jiang³, Arif Ozgun Harmanci⁴

Affiliations

¹ Center for Precision Health, School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA.
² Department of Mathematics, Hanyang University, Seoul, 04763, Republic of Korea.
³ Center for Secure Artificial Intelligence For hEalthcare (SAFE), School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA.
⁴ Center for Precision Health, School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA. arif.o.harmanci@uth.tmc.edu.

Abstract

Background: The decreasing cost of DNA sequencing has greatly increased our knowledge of genetic variation. While population-scale projects bring important insight into genotype-phenotype relationships, the cost of performing whole-genome sequencing on large samples is still prohibitive. In-silico genotype imputation coupled with genotyping by arrays is a cost-effective and accurate alternative for genotyping common and uncommon variants. Imputation methods compare the genotypes of the typed variants with large population-specific reference panels and estimate the genotypes of untyped variants by making use of linkage disequilibrium patterns. The most accurate imputation methods are based on the Li-Stephens hidden Markov model (HMM), which treats the sequence of each chromosome as a mosaic of the haplotypes from the reference panel.

Results: Here we assess the accuracy of vicinity-based HMMs, where each untyped variant is imputed using the typed variants in a small window around it (as small as 1 centimorgan). Locality-based imputation has recently been used by machine learning-based genotype imputation approaches. We assess how the parameters of the vicinity-based HMMs impact the imputation accuracy in a comprehensive set of benchmarks and show that vicinity-based HMMs can accurately impute common and uncommon variants.
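
To make the vicinity-based idea concrete, the following is a minimal sketch of a Li-Stephens-style forward-backward computation restricted to the typed variants inside a window around one untyped variant. It is an illustration under stated assumptions, not the LoHaMMer implementation: the function name `impute_in_window`, the haploid target, the half-window width `half_window_cm`, the recombination scaling `rho`, and the allelic error `err` are all hypothetical choices made for this example, and variant positions are assumed to be sorted by genetic distance.

```python
import math

def impute_in_window(ref_haps, target, pos_cm, u,
                     half_window_cm=0.5, rho=1.0, err=0.01):
    """Posterior probability of allele 1 at untyped site u (haploid target).

    ref_haps: list of reference haplotypes (lists of 0/1 over all sites).
    target:   alleles at typed sites, None at untyped sites.
    pos_cm:   genetic positions in centimorgans, sorted ascending.
    All default parameter values are illustrative assumptions.
    """
    K = len(ref_haps)
    # Keep the untyped site plus the typed sites that fall inside the window.
    sites = [i for i in range(len(pos_cm))
             if i == u or (target[i] is not None
                           and abs(pos_cm[i] - pos_cm[u]) <= half_window_cm)]
    t_u = sites.index(u)

    def emit(k, i):
        # Untyped sites carry no observation, so they do not reweight states.
        if target[i] is None:
            return 1.0
        return 1.0 - err if ref_haps[k][i] == target[i] else err

    def transition(vec, d_cm):
        # Li-Stephens step: stay on the same haplotype, or recombine and
        # pick any reference haplotype uniformly.
        tau = 1.0 - math.exp(-rho * d_cm)
        total = sum(vec)
        return [(1.0 - tau) * v + tau * total / K for v in vec]

    def normalize(vec):
        s = sum(vec)
        return [v / s for v in vec]

    # Forward pass from the left edge of the window up to site u.
    fwd = normalize([emit(k, sites[0]) for k in range(K)])
    for t in range(1, t_u + 1):
        moved = transition(fwd, pos_cm[sites[t]] - pos_cm[sites[t - 1]])
        fwd = normalize([moved[k] * emit(k, sites[t]) for k in range(K)])

    # Backward pass from the right edge of the window down to site u.
    bwd = [1.0] * K
    for t in range(len(sites) - 1, t_u, -1):
        weighted = [bwd[k] * emit(k, sites[t]) for k in range(K)]
        bwd = normalize(
            transition(weighted, pos_cm[sites[t]] - pos_cm[sites[t - 1]]))

    # Combine both passes into a posterior over reference haplotypes at u.
    post = normalize([fwd[k] * bwd[k] for k in range(K)])
    return sum(post[k] * ref_haps[k][u] for k in range(K))


if __name__ == "__main__":
    # Toy data: two reference haplotypes, six sites, site 2 is untyped.
    ref = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
    pos = [0.0, 0.2, 0.4, 0.6, 0.8, 5.0]
    print(impute_in_window(ref, [1, 1, None, 1, 1, 0], pos, 2))
```

In this sketch the typed variant at 5.0 cM lies outside the window and therefore has no influence on the result, which is the property that lets each untyped variant be imputed independently of distant parts of the chromosome.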

Conclusions: Our results indicate that locality-based imputation models can be effectively used for genotype imputation. The parameter settings that we identified can be used in future methods, and vicinity-based HMMs can be used for re-structuring and parallelizing new imputation methods. The source code for the vicinity-based HMM implementations is publicly available at https://github.com/harmancilab/LoHaMMer.

Keywords: Forward–Backward algorithm; Genotype imputation; Hidden Markov models; Viterbi algorithm.

MeSH terms

Genome-Wide Association Study / methods
Genotype
Haplotypes
Linkage Disequilibrium
Polymorphism, Single Nucleotide*
Sequence Analysis, DNA / methods
Software*