TRiCoLOR: tandem repeat profiling using whole-genome long-read sequencing data

GigaScience. 2020 Oct 7;9(10):giaa101. doi: 10.1093/gigascience/giaa101.

Abstract

Background: Tandem repeat sequences are widespread in the human genome, and their expansions cause multiple repeat-mediated disorders. Genome-wide discovery approaches are needed to fully elucidate their roles in health and disease, but resolving tandem repeat variation accurately remains a challenging task. While traditional mapping-based approaches using short-read data have severe limitations in the size and type of tandem repeats they can resolve, recent third-generation sequencing technologies exhibit substantially higher sequencing error rates, which complicates repeat resolution.

Results: We developed TRiCoLOR, a freely available tool for tandem repeat profiling using error-prone long reads from third-generation sequencing technologies. The method can identify repetitive regions in sequencing data without prior knowledge of their motifs or locations and resolve repeat multiplicity and period size in a haplotype-specific manner. The tool also includes methods to interactively visualize the identified repeats and to trace their Mendelian consistency in pedigrees.
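Illustration: The terms "period size", "multiplicity", and "Mendelian consistency" can be made concrete with a small Python sketch. This is not TRiCoLOR's algorithm or interface. It is a toy example that assumes an error-free sequence and perfect repeats, whereas the tool itself works on error-prone long reads. The function names `find_tandem_repeats` and `is_mendelian_consistent`, the parameters `max_period` and `min_copies`, and the example sequence are all hypothetical.

```python
import re


def find_tandem_repeats(seq, max_period=6, min_copies=3):
    """Return sorted (start, motif, copies) tuples for perfect tandem repeats.

    Toy assumption: the sequence is error-free, so an exact regex
    back-reference is enough. No motif or location is supplied in advance.
    """
    hits = []
    for period in range(1, max_period + 1):
        # One motif of length `period`, then at least min_copies - 1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (period, min_copies - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit, e.g. "CAGCAG".
            if any(
                period % p == 0 and motif == motif[:p] * (period // p)
                for p in range(1, period)
            ):
                continue
            copies = len(m.group(0)) // period  # repeat multiplicity
            hits.append((m.start(), motif, copies))
    return sorted(hits)


def is_mendelian_consistent(child, mother, father):
    """Check that one child allele can come from each parent.

    Each argument is a pair of per-haplotype copy numbers, e.g. (4, 7).
    Toy assumption: copy numbers are exact and no de novo change occurs.
    """
    c1, c2 = child
    return (c1 in mother and c2 in father) or (c1 in father and c2 in mother)


# Hypothetical sequence: flank, four copies of CAG, flank, five copies of A.
example = "TTGA" + "CAG" * 4 + "TT" + "A" * 5 + "T"
repeats = find_tandem_repeats(example)
```

Here `repeats` holds one entry for the CAG repeat (period size 3, multiplicity 4) and one for the A homopolymer (period size 1, multiplicity 5). A child genotype of (4, 7) copies is consistent with parents carrying (4, 5) and (7, 7) copies, while (4, 9) is not.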

Conclusions: TRiCoLOR demonstrates excellent performance and improved sensitivity and specificity compared with alternative tools on synthetic data. For real human whole-genome sequencing data, TRiCoLOR achieves high validation rates, suggesting its suitability to identify tandem repeat variation in personal genomes.

Keywords: bioinformatics software; long-read sequencing; tandem repeat variation.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Genome, Human*
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Sensitivity and Specificity
  • Sequence Analysis, DNA
  • Tandem Repeat Sequences*
  • Whole Genome Sequencing