Inverted genomic regions between reference genome builds in humans impact imputation accuracy and decrease the power of association testing

Xin Sheng; Lucy Xia; Jordan L Cahoon; David V Conti; Christopher A Haiman; Linda Kachuri; Charleston W K Chiang

doi:10.1016/j.xhgg.2022.100159

HGG Adv. 2022 Nov 11;4(1):100159. doi: 10.1016/j.xhgg.2022.100159. eCollection 2023 Jan 12.

Authors

Xin Sheng¹, Lucy Xia¹, Jordan L Cahoon²,³, David V Conti¹,⁴, Christopher A Haiman¹,⁴, Linda Kachuri⁵, Charleston W K Chiang¹,²,⁴

Affiliations

¹ Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA.
² Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA.
³ Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA.
⁴ Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA.
⁵ Department of Epidemiology and Population Health, Stanford University, Stanford, CA 94305, USA.

Abstract

Over the last two decades, the human reference genome has undergone multiple updates as we complete a linear representation of our genome. Two versions of the human reference are currently used in the biomedical literature: GRCh37/hg19 and GRCh38. Conversions between these versions are critical for quality control, imputation, and association analysis. In the present study, we show that single-nucleotide variants (SNVs) in regions inverted between different builds of the reference genome are often mishandled bioinformatically. Depending on the array type, SNVs are found in approximately 2-5 Mb of the genome that are inverted between reference builds. Coordinate conversions of these variants are mishandled by both the TOPMed imputation server and routine in-house quality control pipelines, leading to underrecognized downstream analytical consequences. Specifically, we observe that undetected allelic conversion errors for palindromic (i.e., A/T or C/G) variants in these inverted regions would destabilize the local haplotype structure, leading to loss of imputation accuracy and power in association analyses. Though only a small proportion of the genome is affected, these regions include important disease susceptibility variants. For example, the p value of a known locus associated with prostate cancer on chromosome 10 (chr10) would drop from 2.86 × 10^-7 to 0.0011 in a case-control analysis of 20,286 Africans and African Americans (10,643 cases and 9,643 controls). We devise a straightforward heuristic, based on the popular tool liftOver, that can easily detect and correct these variants in the inverted regions between genome builds to locally improve imputation accuracy.

Keywords: bioinformatics; genetic associations; genome build; imputation; reference genome.
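
The allele-handling problem described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes, hypothetically, that the coordinates of a variant and of its two flanking bases have already been lifted to the target build by a separate liftOver run, and it treats a reversed order of those three lifted coordinates as a sign that the variant lies in an inverted region. The function names (`is_palindromic`, `is_inverted`, `correct_alleles`) are illustrative only.

```python
# Complement table for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(ref, alt):
    """True for A/T or C/G variants, whose strand cannot be inferred from alleles alone."""
    return COMPLEMENT.get(ref) == alt


def is_inverted(lifted_prev, lifted_pos, lifted_next):
    """Assumed heuristic: the bases at pos-1, pos, pos+1 in the source build
    map to strictly decreasing coordinates in the target build.
    Any unmapped coordinate (None) is treated as 'not detected'."""
    if None in (lifted_prev, lifted_pos, lifted_next):
        return False
    return lifted_prev > lifted_pos > lifted_next


def correct_alleles(ref, alt, inverted):
    """Complement both alleles when the variant sits in an inverted region."""
    if inverted:
        return COMPLEMENT[ref], COMPLEMENT[alt]
    return ref, alt
```

For a non-palindromic variant such as A/G, a missed strand flip is easy to spot because the alleles become T/C and no longer match the reference panel. For a palindromic variant such as A/T, complementing yields T/A, which still looks like a valid A/T site, so the error passes unnoticed unless the inversion itself is detected.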

Publication types

Research Support, Non-U.S. Gov't
Research Support, N.I.H., Extramural

MeSH terms

Black or African American
Genome, Human / genetics
Genome-Wide Association Study*
Genomics*
Haplotypes / genetics
Humans
Male
