Deconvolute individual genomes from metagenome sequences through short read clustering

Kexue Li; Yakang Lu; Li Deng; Lili Wang; Lizhen Shi; Zhong Wang

doi:10.7717/peerj.8966

PeerJ. 2020 Apr 8;8:e8966. doi: 10.7717/peerj.8966. eCollection 2020.

Authors

Kexue Li^#^{1,2}, Yakang Lu^#^{1,2}, Li Deng^{1,2,3}, Lili Wang^{1,2}, Lizhen Shi⁴, Zhong Wang^{3,5,6}

Affiliations

¹ School of Mechanics Engineering and Automation, Shanghai University, Shanghai, China.
² Shanghai Key Laboratory of Power Station Automation Technology, Shanghai, China.
³ Department of Energy, Joint Genome Institute, Walnut Creek, CA, USA.
⁴ Department of Computer Science, Florida State University, Tallahassee, FL, USA.
⁵ Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
⁶ School of Natural Sciences, University of California at Merced, Merced, CA, USA.

^# Contributed equally.

Abstract

Metagenome assembly from short next-generation sequencing data is a challenging process due to its large scale and computational complexity. Clustering short reads by species before assembly offers a unique opportunity for parallel downstream assembly of genomes with individualized optimization. However, current read clustering methods suffer from either false-negative (under-clustering) or false-positive (over-clustering) problems. Here we extended our previous read clustering software, SpaRC, by exploiting statistics derived from multiple samples in a dataset to reduce the under-clustering problem. Using synthetic and real-world datasets, we demonstrated that this method has the potential to cluster almost all of the short reads from genomes with sufficient sequencing coverage. The improved read clustering in turn leads to improved downstream genome assembly quality.

Keywords: Apache Spark; Metagenome clustering; Short-read clustering.

Grants and funding

The work was supported by the National Natural Science Foundation of China (No. 61802246) and the 111 Project (No. D18003). Zhong Wang’s work was supported by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under Contract No. DE-AC02-05CH11231. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.