TraRECo: a greedy approach based de novo transcriptome assembler with read error correction using consensus matrix

BMC Genomics. 2018 Sep 4;19(1):653. doi: 10.1186/s12864-018-5034-x.

Abstract

Background: The challenges in developing a good de novo transcriptome assembler include how to deal with read errors and sequence repeats. Almost all de novo assemblers use a de Bruijn graph, whose complexity grows linearly with data size but which suffers from errors and repeats. Although one can correct the errors by inspecting the topological structure of the graph, this is not an easy task when there are too many branches. Two research directions are to improve either the graph reliability or the path search precision; in this study, we focused on the former.

Results: We present TraRECo, a greedy approach to de novo assembly employing error-aware graph construction. In the proposed approach, we built contigs by direct read alignment within a distance margin and performed a junction search to construct splicing graphs. While doing so, a contig of length l was represented by a 4 × l matrix (called a consensus matrix), in which each element was the count of a given base at a given position among the reads aligned so far. A representative sequence, used for further read alignment, was obtained by taking the majority base in each column of the consensus matrix. Once the splicing graphs had been obtained, we used IsoLasso to find paths with a noticeable read depth. Experiments using real and simulated reads show that the method provided considerable improvement in sensitivity and moderately better performance when sensitivity and precision were considered together. This was achieved by the error-aware graph construction using the consensus matrix, which made reads containing errors usable for graph construction (otherwise, they might eventually have been discarded). This improved the quality of the coverage depth information used in the subsequent path search step and, ultimately, the reliability of the graph.
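
The consensus-matrix idea described above can be illustrated with a minimal Python sketch. It is not the TraRECo implementation: the function names (new_consensus, add_read, representative, try_align) are hypothetical, the contig length is fixed, the alignment offset of each read is assumed to be known already, and the "distance margin" is assumed here to be a simple mismatch count against the current representative sequence. Columns that no read covers yet are reported as N.

```python
BASES = "ACGT"  # row order of the 4 x l consensus matrix


def new_consensus(length):
    """Return a 4 x length matrix of zero base counts (rows: A, C, G, T)."""
    return [[0] * length for _ in BASES]


def add_read(matrix, read, offset):
    """Add the base counts of one aligned read, starting at column `offset`."""
    for i, base in enumerate(read):
        matrix[BASES.index(base)][offset + i] += 1


def representative(matrix):
    """Majority base per column; 'N' where no read has been aligned yet."""
    seq = []
    for col in range(len(matrix[0])):
        counts = [matrix[row][col] for row in range(4)]
        if sum(counts) == 0:
            seq.append("N")
        else:
            # Ties go to the base that comes first in BASES (an assumption).
            seq.append(BASES[counts.index(max(counts))])
    return "".join(seq)


def try_align(matrix, read, offset, margin):
    """Add `read` only if it differs from the current representative
    sequence in at most `margin` covered positions (assumed distance)."""
    rep = representative(matrix)
    distance = sum(
        1 for i, base in enumerate(read) if rep[offset + i] not in ("N", base)
    )
    if distance > margin:
        return False
    add_read(matrix, read, offset)
    return True


# Toy contig of length 8; the third read carries one substitution error.
contig = new_consensus(8)
for read, offset in [("ACGTAC", 0), ("ACGTAC", 0), ("ACCTAC", 0), ("GTACGG", 2)]:
    try_align(contig, read, offset, margin=1)

print(representative(contig))  # ACGTACGG
```

In this toy run the erroneous read ACCTAC is still accepted and contributes to the coverage counts, while the majority vote keeps G at the affected column, which mirrors the point that reads with errors remain usable for graph construction.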

Conclusions: De novo assembly is mainly used to explore undiscovered isoforms and must be able to represent as many reads as possible in an efficient way. In this sense, TraRECo offers a potential alternative for improving graph reliability, even though its computational burden is much higher than that of the single k-mer de Bruijn graph approach.

Keywords: RNA-Seq; consensus matrix; de novo transcriptome assembly; greedy approach; read error correction.

MeSH terms

  • Animals
  • Computational Biology*
  • Embryonic Stem Cells / cytology
  • Embryonic Stem Cells / metabolism*
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Mice
  • Sequence Analysis, DNA / methods*
  • Software*
  • Transcriptome*