CRISPRclassify: Repeat-Based Classification of CRISPR Loci

doi:10.1089/crispr.2021.0021

CRISPR J. 2021 Aug;4(4):558-574. doi: 10.1089/crispr.2021.0021.

Authors

Matthew A Nethery¹, Michael Korvink², Kira S Makarova³, Yuri I Wolf³, Eugene V Koonin³, Rodolphe Barrangou¹

Affiliations

¹ Genomic Sciences Graduate Program, North Carolina State University, Raleigh, North Carolina, USA; National Library of Medicine, Bethesda, Maryland, USA.
² ITS Data Science, Premier Inc., Charlotte, North Carolina, USA; and National Library of Medicine, Bethesda, Maryland, USA.
³ National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland, USA.

Abstract

Detection and classification of CRISPR-Cas systems in metagenomic data have become increasingly common in recent years because of the potential of these systems for diverse applications in genome editing. Traditionally, CRISPR-Cas systems are classified through reference-based identification of proximate cas genes. Here, we present a machine learning approach for the detection and classification of CRISPR loci that uses repeat sequences in a cas-independent context, enabling identification of unclassified loci missed by traditional cas-based approaches. Using biological attributes of the CRISPR repeat, the core element of CRISPR arrays, and leveraging methods from natural language processing, we developed a machine learning model that accurately classifies CRISPR loci in an extensive set of metagenomes, with an F1 measure of 0.82 across all predictions and 0.97 when limited to classifications with probabilities >0.85. Furthermore, performance on novel repeats yielded an F1 measure of 0.96. Although cas-based identification will outperform a repeat-based approach in many cases, CRISPRclassify provides an efficient way to classify CRISPR loci when cas gene information is unavailable, as in metagenomes and fragmented genome assemblies.
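The abstract names two ideas that a short sketch can make concrete: treating a repeat as text to be tokenized, and reporting F1 both over all predictions and over only those above a probability cutoff. The Python below is a minimal illustration under stated assumptions, not the CRISPRclassify implementation. The abstract does not say which features the model uses or how F1 is averaged across subtypes, so overlapping k-mers and macro-averaging are assumptions here. The function names, the toy repeat string, and the toy labels and probabilities are all hypothetical.

```python
from collections import Counter


def kmer_counts(repeat: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in a repeat, treating each k-mer as a 'word'.

    Assumption: k-mer tokenization stands in for the NLP-style features
    mentioned in the abstract; the actual feature set is not given there.
    """
    repeat = repeat.upper()
    return Counter(repeat[i:i + k] for i in range(len(repeat) - k + 1))


def f1_for_label(y_true, y_pred, label) -> float:
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of per-class F1 (averaging scheme is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


def filter_confident(y_true, y_pred, probs, threshold: float = 0.85):
    """Keep only predictions whose probability exceeds the threshold."""
    kept = [(t, p) for t, p, pr in zip(y_true, y_pred, probs) if pr > threshold]
    return [t for t, _ in kept], [p for _, p in kept]


# Toy data, invented for illustration only.
toy_true = ["I-E", "I-E", "II-A", "II-A"]
toy_pred = ["I-E", "II-A", "II-A", "II-A"]
toy_probs = [0.95, 0.60, 0.90, 0.99]

features = kmer_counts("GTTTTAGA", k=3)          # hypothetical repeat fragment
f1_all = macro_f1(toy_true, toy_pred)            # every prediction counted
conf_true, conf_pred = filter_confident(toy_true, toy_pred, toy_probs)
f1_confident = macro_f1(conf_true, conf_pred)    # only probabilities > 0.85
```

In this toy data the one misclassified locus also has the lowest probability, so dropping predictions at or below the cutoff raises F1. That is the same kind of trade-off the abstract reports: higher measured accuracy in exchange for leaving low-confidence loci unclassified.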

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Area Under Curve
Base Sequence
CRISPR-Cas Systems*
Clustered Regularly Interspaced Short Palindromic Repeats*
Computational Biology / methods
Databases, Genetic
Gene Editing*
Genetic Loci*
Genome, Bacterial
Genomics / methods
Reproducibility of Results