A Method for Localizing Non-Reference Sequences to the Human Genome

Brianna Sierra Chrisman; Kelley M Paskov; Chloe He; Jae-Yoon Jung; Nate Stockham; Peter Yigitcan Washington; Dennis Paul Wall

Pac Symp Biocomput. 2022:27:313-324.

Authors

Brianna Sierra Chrisman¹, Kelley M Paskov, Chloe He, Jae-Yoon Jung, Nate Stockham, Peter Yigitcan Washington, Dennis Paul Wall

Affiliation

¹ Departments of Bioengineering, Stanford University, Stanford, CA 94305, USA, briannac@stanford.edu.

PMID: 34890159
PMCID: PMC8730539

Abstract

As the last decade of human genomics research begins to bear the fruit of advancements in precision medicine, it is important to ensure that genomics' improvements in human health are distributed globally and equitably. An important step toward health equity is to improve the human reference genome so that it captures global diversity by including a wide variety of alternative haplotypes, sequences that are not currently captured on the reference genome. We present a method that localizes 100 base pair (bp) sequences extracted from short-read sequencing and that can ultimately be used to identify which regions of the human genome non-reference sequences belong to. We extract reads that do not align to the reference genome and compute the population's distribution of 100-mers found within the unmapped reads. We use genetic data from families to identify shared genetic material between siblings, and we match the distribution of unmapped k-mers to these inheritance patterns to determine the most likely genomic region of a k-mer. We perform this localization with two highly interpretable methods of artificial intelligence: a computationally tractable Hidden Markov Model coupled to a Maximum Likelihood Estimator. Using a set of alternative haplotypes with known locations on the genome, we show that our algorithm is able to localize 96% of k-mers with over 90% accuracy and less than 1 Mb median resolution. As the collection of sequenced human genomes grows larger and more diverse, we hope that this method can be used to improve the human reference genome, a critical step in addressing precision medicine's diversity crisis.
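The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is a toy illustration under stated assumptions, not the authors' implementation: the function names, the concordance probabilities in `P_CONCORDANT`, and the region labels are all hypothetical, and the sibling identity-by-descent (IBD) states are taken as given rather than inferred with a Hidden Markov Model. The idea shown is only that a k-mer's presence should be shared between siblings more often where they inherited the same parental haplotypes, so the region whose sharing pattern best explains the observed presence pattern has the highest likelihood.

```python
import math
from collections import Counter


def count_kmers(reads, k=100):
    """Count every length-k substring across a collection of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Hypothetical probability that two siblings agree on k-mer presence or
# absence, given how many haplotypes (0, 1, or 2) they share identical by
# descent in the region. These values are assumptions for illustration only.
P_CONCORDANT = {0: 0.5, 1: 0.75, 2: 0.99}


def region_log_likelihood(concordant, ibd_states):
    """Log-likelihood of the observed sibling concordance under one region.

    concordant: {family_id: True if both siblings agree on k-mer presence}
    ibd_states: {family_id: number of haplotypes shared IBD in this region}
    """
    total = 0.0
    for family, agrees in concordant.items():
        p = P_CONCORDANT[ibd_states[family]]
        total += math.log(p if agrees else 1.0 - p)
    return total


def localize(concordant, ibd_by_region):
    """Return the candidate region with the highest log-likelihood."""
    return max(
        ibd_by_region,
        key=lambda region: region_log_likelihood(concordant, ibd_by_region[region]),
    )


# Toy data: three sibling pairs and two candidate regions.
concordant = {"fam1": True, "fam2": False, "fam3": True}
ibd_by_region = {
    "chr1:0-1Mb": {"fam1": 0, "fam2": 2, "fam3": 0},
    "chr2:0-1Mb": {"fam1": 2, "fam2": 0, "fam3": 2},
}
best_region = localize(concordant, ibd_by_region)  # "chr2:0-1Mb"
```

In the toy data, the second region wins because the siblings who agree on the k-mer share both haplotypes there, while the discordant pair shares none. A realistic model would also need to account for sequencing error, k-mer allele frequency, and uncertainty in the inferred sharing states.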

MeSH terms

Artificial Intelligence*
Computational Biology
Genome, Human*
Genomics
High-Throughput Nucleotide Sequencing
Humans
Sequence Analysis, DNA
