vi-HMM: a novel HMM-based method for sequence variant identification in short-read data

Man Tang; Mohammad Shabbir Hasan; Hongxiao Zhu; Liqing Zhang; Xiaowei Wu

doi:10.1186/s40246-019-0194-6

Hum Genomics. 2019 Feb 13;13(1):9. doi: 10.1186/s40246-019-0194-6.

Authors

Man Tang¹, Mohammad Shabbir Hasan², Hongxiao Zhu¹, Liqing Zhang², Xiaowei Wu³

Affiliations

¹ Department of Statistics, Virginia Tech, 250 Drillfield Drive, Blacksburg, 24061, VA, USA.
² Department of Computer Science, Virginia Tech, 225 Stanger Street, Blacksburg, 24060, VA, USA.
³ Department of Statistics, Virginia Tech, 250 Drillfield Drive, Blacksburg, 24061, VA, USA. xwwu@vt.edu.

Abstract

Background: Accurate and reliable identification of sequence variants, including single nucleotide polymorphisms (SNPs) and insertion-deletion polymorphisms (INDELs), plays a fundamental role in next-generation sequencing (NGS) applications. Existing methods for calling these variants often make the simplifying assumption of positional independence and fail to leverage the dependence between genotypes at nearby loci that is caused by linkage disequilibrium (LD).

Results and conclusion: We propose vi-HMM, a hidden Markov model (HMM)-based method for calling SNPs and INDELs in mapped short-read data. This method allows transitions between the hidden states (defined as "SNP," "Ins," "Del," and "Match") of adjacent genomic bases and determines an optimal hidden state path using the Viterbi algorithm. The inferred hidden state path directly identifies SNPs and INDELs. Simulation studies show that, under various sequencing depths, vi-HMM outperforms commonly used variant calling methods in terms of sensitivity and F₁ score. When applied to real data, vi-HMM demonstrates higher accuracy in calling SNPs and INDELs.
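To illustrate the idea of decoding a hidden state path over adjacent bases, the following is a minimal Python sketch of the Viterbi algorithm over the four hidden states named in the abstract. It is not the vi-HMM implementation: the observation symbols ("same", "diff", "extra", "gap"), which stand in for a per-position summary of the aligned reads, and all start, transition, and emission probabilities are hypothetical values chosen only for the example.

```python
import math

STATES = ["Match", "SNP", "Ins", "Del"]

# Hypothetical parameters for illustration only (not taken from the paper).
START = {"Match": 0.97, "SNP": 0.01, "Ins": 0.01, "Del": 0.01}
TRANS = {
    "Match": {"Match": 0.97, "SNP": 0.01, "Ins": 0.01, "Del": 0.01},
    "SNP":   {"Match": 0.94, "SNP": 0.04, "Ins": 0.01, "Del": 0.01},
    "Ins":   {"Match": 0.68, "SNP": 0.01, "Ins": 0.30, "Del": 0.01},
    "Del":   {"Match": 0.68, "SNP": 0.01, "Ins": 0.01, "Del": 0.30},
}
# Assumed observation alphabet: how the reads compare with the reference
# at one position ("same", "diff", "extra" base, or "gap").
EMIT = {
    "Match": {"same": 0.997, "diff": 0.001, "extra": 0.001, "gap": 0.001},
    "SNP":   {"same": 0.05, "diff": 0.93, "extra": 0.01, "gap": 0.01},
    "Ins":   {"same": 0.05, "diff": 0.01, "extra": 0.93, "gap": 0.01},
    "Del":   {"same": 0.05, "diff": 0.01, "extra": 0.01, "gap": 0.93},
}


def viterbi(observations):
    """Return the most probable hidden state path (log-space Viterbi)."""
    if not observations:
        return []
    # Initialise with start and first-emission log-probabilities.
    score = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            )
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    path = [max(STATES, key=score.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With these toy parameters, `viterbi(["same", "same", "diff", "same", "gap", "gap", "same"])` returns `["Match", "Match", "SNP", "Match", "Del", "Del", "Match"]`, so the variant calls can be read directly off the decoded path. Because transitions link adjacent positions, the two consecutive "gap" observations are decoded as a single two-base deletion rather than as two independent events.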

Keywords: HMM; INDEL; SNP; Variant calling; Viterbi algorithm.

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Databases, Genetic
Genetic Variation*
Haplotypes
High-Throughput Nucleotide Sequencing / methods*
High-Throughput Nucleotide Sequencing / statistics & numerical data
Humans
INDEL Mutation
Linkage Disequilibrium
Markov Chains*
Polymorphism, Single Nucleotide