DiTaxa: nucleotide-pair encoding of 16S rRNA for host phenotype and biomarker detection

Ehsaneddin Asgari; Philipp C Münch; Till R Lesker; Alice C McHardy; Mohammad R K Mofrad

doi:10.1093/bioinformatics/bty954

Bioinformatics. 2019 Jul 15;35(14):2498-2500. doi: 10.1093/bioinformatics/bty954.

Authors

Ehsaneddin Asgari¹ ², Philipp C Münch² ³, Till R Lesker², Alice C McHardy², Mohammad R K Mofrad¹ ⁴

Affiliations

¹ Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, USA.
² Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Brunswick, Germany.
³ Faculty of Medicine, LMU Munich, Max von Pettenkofer-Institute of Hygiene and Medical Microbiology, Munich, Germany.
⁴ Molecular Biophysics and Integrated Bioimaging, Lawrence Berkeley National Lab, Berkeley, CA, USA.

PMID: 30500871
DOI: 10.1093/bioinformatics/bty954

Abstract

Summary: Identifying distinctive taxa for microbiome-related diseases is considered key to establishing diagnosis and therapy options in precision medicine, and it imposes high demands on the accuracy of microbiome analysis techniques. We propose an alignment- and reference-free, subsequence-based analysis of 16S rRNA data as a new paradigm for microbiome phenotype and biomarker detection. Our method, called DiTaxa, replaces standard operational taxonomic unit (OTU) clustering with a segmentation of 16S rRNA reads into the most frequent variable-length subsequences. We compared DiTaxa with state-of-the-art methods for phenotype and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa performed competitively with the k-mer-based state-of-the-art approach in phenotype prediction, and it outperformed the OTU-based state-of-the-art approach in finding biomarkers, in both resolution and coverage, as evaluated against known links from the literature and on synthetic benchmark datasets.
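To illustrate what segmenting reads into the most frequent variable-length subsequences can look like, here is a minimal sketch in the style of byte-pair encoding applied to nucleotides. It is not the DiTaxa implementation: the function names (`learn_merges`, `apply_merge`, `segment`), the toy reads, the tie-breaking rule, and the stopping rule are all assumptions made for illustration.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each left-to-right occurrence of `pair` with one joined token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(reads, num_merges):
    """Learn up to `num_merges` merges, always taking the most frequent adjacent pair.

    Hypothetical helper: ties are broken by comparing the pairs themselves,
    and learning stops once no pair occurs at least twice.
    """
    seqs = [list(read) for read in reads]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in seqs:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best, count = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if count < 2:
            break
        merges.append(best)
        seqs = [apply_merge(tokens, best) for tokens in seqs]
    return merges


def segment(read, merges):
    """Segment a read by replaying the learned merges in order."""
    tokens = list(read)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    toy_reads = ["ACGTACGT", "ACGTTT", "GGACGT"]  # made-up reads, not real 16S data
    learned = learn_merges(toy_reads, num_merges=3)
    for read in toy_reads:
        print(read, "->", segment(read, learned))
```

Concatenating the tokens returned by `segment` always reproduces the original read, so the segmentation loses no sequence information; only the unit of analysis changes from fixed-length k-mers or OTUs to data-driven subsequences.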

Availability and implementation: DiTaxa is available under the Apache 2 license at http://llp.berkeley.edu/ditaxa.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Biomarkers
Humans
Nucleotides
Phenotype
RNA, Ribosomal, 16S / genetics*
Sequence Analysis, DNA
Software

Substances

Biomarkers
Nucleotides
RNA, Ribosomal, 16S