Virus expression detection reveals RNA-sequencing contamination in TCGA

BMC Genomics. 2020 Jan 28;21(1):79. doi: 10.1186/s12864-020-6483-6.

Abstract

Background: Contamination of reagents and cross-contamination across samples are long-recognized issues in molecular biology laboratories. While often innocuous, contamination can lead to inaccurate results. Cantalupo et al., for example, found HeLa-derived human papillomavirus 18 (H-HPV18) in several of The Cancer Genome Atlas (TCGA) RNA-sequencing samples. This work motivated us to assess a greater number of samples and to determine the origin of possible contamination using viral sequences. To detect viruses with high specificity, we developed VirDetect, a publicly available workflow that detects virus and laboratory vector sequences in RNA-seq samples. We applied VirDetect to 9143 RNA-seq samples sequenced at one TCGA sequencing center (28 of 33 cancer types) over 5 years.

Results: We confirmed that H-HPV18 was present in many samples and determined that viral transcripts from H-HPV18 significantly co-occurred with those from xenotropic murine leukemia virus-related virus (XMRV). Using laboratory metadata and viral transcription, we determined that the likely contaminant was a pool of cell lines known as the "common reference", which was sequenced alongside TCGA RNA-seq samples as a control to monitor quality across technology transitions (i.e., microarray to GAII to HiSeq) and to link RNA-seq to previous-generation microarrays, which routinely used the "common reference". One of the cell lines in the pool was a laboratory isolate of MCF-7, which we discovered was infected with XMRV; another constituent of the pool was likely HeLa cells.
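
As an illustration of what "significantly co-occurred" can mean in practice, the following is a minimal sketch of a one-sided Fisher's exact test on a 2x2 table of per-sample virus calls. The abstract does not state which statistical test was used, so the choice of test is an assumption, and the function name and the counts in the usage lines are hypothetical, not values from the study.

```python
from math import comb


def fisher_exact_greater(both, only_a, only_b, neither):
    """One-sided Fisher's exact test for co-occurrence of two viruses.

    Arguments are sample counts: positive for both viruses, positive for
    virus A only, positive for virus B only, and positive for neither.
    Returns the probability of seeing at least `both` double-positive
    samples if the two viruses occurred independently.
    """
    a_total = both + only_a            # samples positive for virus A
    b_total = both + only_b            # samples positive for virus B
    n = both + only_a + only_b + neither
    max_overlap = min(a_total, b_total)

    # Hypergeometric upper tail, computed with exact integer arithmetic.
    tail = sum(
        comb(a_total, k) * comb(n - a_total, b_total - k)
        for k in range(both, max_overlap + 1)
    )
    return tail / comb(n, b_total)


# Hypothetical counts for illustration only (not data from the paper).
p_value = fisher_exact_greater(both=20, only_a=5, only_b=10, neither=965)
print(f"one-sided p-value: {p_value:.3g}")
```

A small p-value here would indicate that the two viruses appear together in the same samples more often than expected by chance, which is the pattern expected when both come from a single contaminating source.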

Conclusions: Altogether, these findings indicate a multi-step contamination process. First, MCF-7 was infected with XMRV. Second, this infected cell line was added to a pool of cell lines that contained HeLa. Finally, RNA from this pool contaminated several TCGA tumor samples, most likely during library construction. Thus, the human tumors with H-HPV18 or XMRV reads were likely not infected with H-HPV18 or XMRV.

Keywords: Bioinformatics; Contamination; Human papilloma virus; Virus detection; Xenotropic murine leukemia virus-related.

MeSH terms

  • Animals
  • Cell Line, Tumor
  • Computational Biology / methods
  • DNA Contamination*
  • HeLa Cells
  • High-Throughput Nucleotide Sequencing / standards*
  • Humans
  • Mice
  • Molecular Diagnostic Techniques / standards*
  • Neoplasms / diagnosis
  • Neoplasms / genetics*
  • Neoplasms / virology
  • Phylogeny
  • RNA*
  • Software
  • Workflow

Substances

  • RNA