Benchmarking challenging small variants with linked and long reads

Justin Wagner; Nathan D Olson; Lindsay Harris; Ziad Khan; Jesse Farek; Medhat Mahmoud; Ana Stankovic; Vladimir Kovacevic; Byunggil Yoo; Neil Miller; Jeffrey A Rosenfeld; Bohan Ni; Samantha Zarate; Melanie Kirsche; Sergey Aganezov; Michael C Schatz; Giuseppe Narzisi; Marta Byrska-Bishop; Wayne Clarke; Uday S Evani; Charles Markello; Kishwar Shafin; Xin Zhou; Arend Sidow; Vikas Bansal; Peter Ebert; Tobias Marschall; Peter Lansdorp; Vincent Hanlon; Carl-Adam Mattsson; Alvaro Martinez Barrio; Ian T Fiddes; Chunlin Xiao; Arkarachai Fungtammasan; Chen-Shan Chin; Aaron M Wenger; William J Rowell; Fritz J Sedlazeck; Andrew Carroll; Marc Salit; Justin M Zook

doi:10.1016/j.xgen.2022.100128

Cell Genom. 2022 May;2(5):100128. doi: 10.1016/j.xgen.2022.100128.

Authors

Justin Wagner¹, Nathan D Olson¹, Lindsay Harris¹, Ziad Khan², Jesse Farek², Medhat Mahmoud², Ana Stankovic³, Vladimir Kovacevic³, Byunggil Yoo⁴, Neil Miller⁴, Jeffrey A Rosenfeld⁵, Bohan Ni⁶, Samantha Zarate⁶, Melanie Kirsche⁶, Sergey Aganezov⁶, Michael C Schatz⁶, Giuseppe Narzisi⁷, Marta Byrska-Bishop⁷, Wayne Clarke⁷, Uday S Evani⁷, Charles Markello⁸, Kishwar Shafin⁸, Xin Zhou⁹, Arend Sidow¹⁰ ¹¹, Vikas Bansal¹², Peter Ebert¹³, Tobias Marschall¹³, Peter Lansdorp¹³, Vincent Hanlon¹⁴, Carl-Adam Mattsson¹⁴, Alvaro Martinez Barrio¹⁵, Ian T Fiddes¹⁵, Chunlin Xiao¹⁶, Arkarachai Fungtammasan¹⁷, Chen-Shan Chin¹⁷, Aaron M Wenger¹⁸, William J Rowell¹⁸, Fritz J Sedlazeck², Andrew Carroll¹⁹, Marc Salit²⁰ ²¹, Justin M Zook¹ ²¹ ²²

Affiliations

¹ Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, MS8312, Gaithersburg, MD 20899, USA.
² Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA.
³ Seven Bridges, Omladinskih brigada 90g, 11070 Belgrade, Republic of Serbia.
⁴ Children's Mercy Kansas City, Kansas City, MO, USA.
⁵ Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, USA.
⁶ Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
⁷ New York Genome Center, 101 Avenue of the Americas, New York, NY, USA.
⁸ University of California at Santa Cruz Genomics Institute, 1156 High Street, Santa Cruz, CA, USA.
⁹ Department of Computer Science, Stanford University, Stanford, CA 94305, USA.
¹⁰ Department of Pathology, Stanford University, Stanford, CA 94305, USA.
¹¹ Department of Genetics, Stanford University, Stanford, CA 94305, USA.
¹² Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA.
¹³ Institute of Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany.
¹⁴ Terry Fox Laboratory, BC Cancer Research Institute and Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada.
¹⁵ 10X Genomics, Pleasanton, CA 94588, USA.
¹⁶ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.
¹⁷ DNAnexus, Inc., Mountain View, CA 94040, USA.
¹⁸ Pacific Biosciences, Menlo Park, CA 94025, USA.
¹⁹ Google Inc., 1600 Amphitheatre Pkwy., Mountain View, CA 94040, USA.
²⁰ Joint Initiative for Metrology in Biology, SLAC National Laboratory, Stanford, CA, USA.
²¹ Senior author.
²² Lead contact.

Abstract

Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and develop variant calling and sequencing methods. Here we use accurate linked and long reads to expand benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels) and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as PMS2. For HG002, the new benchmark includes 92% of the autosomal GRCh38 assembly, compared with 85% in the previous version, while excluding regions that are problematic for benchmarking small variants, such as copy number variants, and that should not have been included in the previous version. The new benchmark identifies eight times more false negatives in a short-read variant call set than our previous benchmark. We demonstrate that this benchmark reliably identifies false positives and false negatives across technologies, enabling ongoing methods development.

Grants and funding