MONI: A Pangenomic Index for Finding Maximal Exact Matches

Massimiliano Rossi; Marco Oliva; Ben Langmead; Travis Gagie; Christina Boucher

doi:10.1089/cmb.2021.0290

J Comput Biol. 2022 Feb;29(2):169-187. doi: 10.1089/cmb.2021.0290. Epub 2022 Jan 17.

Authors

Massimiliano Rossi¹, Marco Oliva¹, Ben Langmead², Travis Gagie³, Christina Boucher¹

Affiliations

¹ Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA.
² Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA.
³ Faculty of Computer Science, Dalhousie University, Halifax, Canada.

Abstract

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Kuhnle et al. then showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in linear time and space with respect to the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared with other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2-11 times less memory and was 2-32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references.
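
To make the MEM query concrete, the following is a minimal brute-force sketch of what a maximal exact match is. It is not MONI's algorithm, which relies on the r-index and thresholds. The function names `matching_statistics` and `find_mems` and the `min_len` parameter are hypothetical, chosen only for illustration. The sketch assumes a MEM is a substring of the pattern that occurs in the text and cannot be extended to the left or right within the pattern while still occurring in the text.

```python
def matching_statistics(pattern, text):
    """For each position i, length of the longest prefix of pattern[i:] found in text."""
    lengths = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match one character at a time (naive substring search).
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        lengths.append(length)
    return lengths


def find_mems(pattern, text, min_len=1):
    """Return (start, length) pairs of maximal exact matches of pattern in text."""
    ms = matching_statistics(pattern, text)
    mems = []
    for i, length in enumerate(ms):
        # Right-maximality holds by construction of ms[i].
        # Left-maximality: the match starting at i - 1 must not cover this one.
        left_maximal = i == 0 or ms[i - 1] <= length
        if length >= min_len and left_maximal:
            mems.append((i, length))
    return mems


if __name__ == "__main__":
    # Toy reference and read; real inputs would be genome collections and sequencing reads.
    print(find_mems("CGTTACG", "ACGTACGT"))
```

This naive version rescans the text for every extension, so it is only practical for toy inputs; a compressed index is what makes the same query feasible on large collections.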

Keywords: MEM-finding; r-index; run-length-encoded Burrows-Wheeler transform; thresholds.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Computational Biology
Databases, Genetic / statistics & numerical data
Genome, Bacterial
Genome, Human
Genomics / statistics & numerical data*
High-Throughput Nucleotide Sequencing / statistics & numerical data
Humans
Salmonella / genetics
Sequence Alignment / statistics & numerical data*
Sequence Analysis, DNA / statistics & numerical data
Software*
Wavelet Analysis
