An analysis of proteogenomics and how and when transcriptome-informed reduction of protein databases can enhance eukaryotic proteomics

Laura Fancello; Thomas Burger

doi:10.1186/s13059-022-02701-2

Genome Biol. 2022 Jun 20;23(1):132. doi: 10.1186/s13059-022-02701-2.

Authors

Laura Fancello¹, Thomas Burger²

Affiliations

¹ CNRS, CEA, Inserm, BioSanté U1292, Profi FR2048, Université Grenoble Alpes, Grenoble, France.
² CNRS, CEA, Inserm, BioSanté U1292, Profi FR2048, Université Grenoble Alpes, Grenoble, France. thomas.burger@cea.fr.

Abstract

Background: Proteogenomics aims to identify variant or unknown proteins in bottom-up proteomics, by searching transcriptome- or genome-derived custom protein databases. However, empirical observations reveal that these large proteogenomic databases produce lower-sensitivity peptide identifications. Various strategies have been proposed to avoid this, including the generation of reduced transcriptome-informed protein databases, which only contain proteins whose transcripts are detected in the sample-matched transcriptome. These were found to increase peptide identification sensitivity. Here, we present a detailed evaluation of this approach.

Results: We establish that the increased sensitivity in peptide identification is in fact a statistical artifact, directly resulting from the limited capability of target-decoy competition to accurately model incorrect target matches when using excessively small databases. As anti-conservative false discovery rates (FDRs) are likely to hamper the robustness of the resulting biological conclusions, we advocate for alternative FDR control methods that are less sensitive to database size. Nevertheless, reduced transcriptome-informed databases are useful, as they reduce the ambiguity of protein identifications, yielding fewer shared peptides. Furthermore, searching the reference database and subsequently filtering out proteins whose transcripts are not expressed reduces protein identification ambiguity to a similar extent, but is more transparent and reproducible.
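To make the dependence on decoy counts concrete, here is a minimal sketch of a target-decoy competition FDR estimate at a given score threshold. It is an illustration only and not necessarily the procedure evaluated in the article: the function name `tdc_fdr_estimate`, the toy scores, and the "+1" correction in the numerator (one common formulation of the estimator) are all assumptions. With only a handful of decoys above the threshold, a single decoy match changes the estimate substantially, which hints at why such estimates may become unreliable for very small databases.

```python
def tdc_fdr_estimate(target_scores, decoy_scores, threshold):
    """Estimate the FDR among target matches scoring at or above `threshold`.

    Assumption: uses the (decoys + 1) / targets form of the estimator;
    other formulations exist.
    """
    n_targets = sum(score >= threshold for score in target_scores)
    n_decoys = sum(score >= threshold for score in decoy_scores)
    # Guard against division by zero when no target passes the threshold.
    return (n_decoys + 1) / max(n_targets, 1)


# Hypothetical toy scores, for illustration only.
targets = [10, 9, 8, 7, 3]
decoys = [6, 2]

fdr_low_cut = tdc_fdr_estimate(targets, decoys, threshold=5)     # 1 decoy passes
fdr_high_cut = tdc_fdr_estimate(targets, decoys, threshold=6.5)  # 0 decoys pass
print(fdr_low_cut, fdr_high_cut)
```

In this toy case, one decoy crossing the threshold doubles the estimate, so the estimate is very coarse when few decoys are available.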

Conclusions: In summary, using transcriptome information is an interesting strategy that has not been promoted for the right reasons. While the increase in peptide identifications from searching reduced transcriptome-informed databases is an artifact caused by the use of an FDR control method unsuitable for excessively small databases, transcriptome information can reduce the ambiguity of protein identifications.
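The ambiguity reduction described above can be sketched as a post-hoc filter: search the full reference database first, then discard proteins without transcript evidence and count how many peptides remain shared between proteins. This is a hypothetical illustration, not the authors' pipeline; the function names, the peptide-to-protein mapping, and the choice to drop peptides whose proteins are all unexpressed are assumptions.

```python
def count_shared_peptides(peptide_to_proteins):
    """Count peptides that map to more than one protein."""
    return sum(1 for proteins in peptide_to_proteins.values() if len(proteins) > 1)


def filter_by_expression(peptide_to_proteins, expressed_proteins):
    """Keep only proteins with transcript evidence for each peptide.

    Assumption: peptides left with no candidate protein are dropped;
    a real workflow might handle them differently.
    """
    filtered = {}
    for peptide, proteins in peptide_to_proteins.items():
        kept = proteins & expressed_proteins
        if kept:
            filtered[peptide] = kept
    return filtered


# Hypothetical identifications and expression calls, for illustration only.
mapping = {
    "PEPA": {"P1", "P2"},
    "PEPB": {"P2"},
    "PEPC": {"P1", "P3"},
}
expressed = {"P1", "P2"}

before = count_shared_peptides(mapping)
after = count_shared_peptides(filter_by_expression(mapping, expressed))
print(before, after)
```

Here the peptide `PEPC` becomes unambiguous once the unexpressed protein `P3` is removed, while the search itself and its FDR estimate are left unchanged.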

Keywords: FDR control; Peptide identification sensitivity; Protein identification ambiguity; Proteogenomics; Proteomics; Target-decoy competition; Transcriptome-informed protein databases.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Databases, Protein
Eukaryota
Peptides
Proteins
Proteogenomics* / methods
Proteomics* / methods
Transcriptome

Substances

Peptides
Proteins