How to check the accuracy of novel peptides in proteogenomics

Proteogenomics is an emerging field at the interface of proteomics and genomics. It uses genomic and transcriptomic information to identify novel peptides with mass spectrometry based techniques. The proteomic data can then be used to find evidence of genic regions in the genome under study, which may lead to revised gene models and improved gene annotation. As a result, proteogenomics has become well accepted as a tool for discovering novel proteins and genes.

However, the discovery of novel genes carries a high risk of false positives, i.e., the search algorithm may report peptides that are not actually present in the sample.

Therefore, to minimize the chance of false positives (they cannot be avoided entirely), a false discovery rate (FDR) is used. The FDR is estimated as the ratio of the number of decoy hits to the number of target hits.

FDR = (number of decoy hits) / (number of target hits)
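As a rough illustration of how this ratio is typically applied, the sketch below computes the estimate and then walks down a score-ranked list of matches to find the lowest score cutoff that keeps the estimated FDR under a chosen limit. The function names, the `(score, is_decoy)` input layout, and the assumption that a higher score means a better match are all hypothetical choices made for this example; real search engines differ in how they score and how they handle ties.

```python
def estimate_fdr(num_decoy_hits: int, num_target_hits: int) -> float:
    """Estimate the FDR as decoy hits divided by target hits."""
    if num_target_hits == 0:
        return 0.0
    return num_decoy_hits / num_target_hits


def score_threshold_for_fdr(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms: iterable of (score, is_decoy) pairs (hypothetical layout),
    where a higher score is assumed to mean a better match.
    Returns None if no cutoff satisfies the limit.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    decoys = targets = 0
    best_cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Accept everything scored at or above this match if the
        # running decoy/target ratio is still within the limit.
        if targets > 0 and estimate_fdr(decoys, targets) <= max_fdr:
            best_cutoff = score
    return best_cutoff
```

For instance, `estimate_fdr(5, 500)` corresponds to an estimated FDR of 1%, the kind of stringent limit discussed later in this post.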

Most conventional proteogenomic studies use a global false discovery rate, in which identifications of annotated peptides and novel peptides are pooled for FDR estimation, to filter out false positives and retain credible novel peptides. However, the actual level of false positives among the novel peptides is often not controlled by this procedure, and it behaves differently for different genomes. It has been observed previously that, under a fixed FDR, the inflated database generated by, e.g., six-open-reading-frame (6-ORF) translation of a whole genome significantly decreases the sensitivity of peptide identification. Recently, Krug implied that the identification accuracy of novel peptides is greatly affected by the completeness of the genome annotation, i.e., the more completely a genome is annotated, the higher the chance that novel peptides are identified accurately.

In the recent paper referenced below, the authors followed the same framework as in Fu’s work to quantitatively investigate the subgroup FDRs of annotated and novel peptides identified by a 6-ORF translation search.
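To make the difference between a global FDR and subgroup FDRs concrete, here is a minimal sketch that computes both from the same list of hits. It is not the procedure from the paper. The `(group, is_decoy)` input layout, the group labels, and the idea of assigning each decoy hit to the subgroup of the database region it was generated from are assumptions made for this example, and the counts are invented purely for illustration.

```python
from collections import Counter


def global_and_subgroup_fdr(hits):
    """Compute the pooled FDR and one FDR per subgroup.

    hits: iterable of (group, is_decoy) pairs (hypothetical layout),
    e.g. group is 'annotated' or 'novel'.
    """
    targets = Counter()
    decoys = Counter()
    for group, is_decoy in hits:
        (decoys if is_decoy else targets)[group] += 1

    def ratio(num_decoys, num_targets):
        return num_decoys / num_targets if num_targets else float("nan")

    result = {"global": ratio(sum(decoys.values()), sum(targets.values()))}
    for group in sorted(set(targets) | set(decoys)):
        result[group] = ratio(decoys[group], targets[group])
    return result


# Invented counts, for illustration only: most hits are annotated peptides.
hits = (
    [("annotated", False)] * 980 + [("annotated", True)] * 2
    + [("novel", False)] * 20 + [("novel", True)] * 8
)
fdrs = global_and_subgroup_fdr(hits)
```

With these made-up counts the pooled ratio is 10/1000, while the novel subgroup alone has a ratio of 8/20. A global filter can therefore look well controlled even when the novel identifications it lets through are much less reliable.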

The paper shows that the genome annotation completeness ratio is the dominant factor influencing the identification accuracy of novel peptides found by 6-ORF translation search when a global FDR is used for quality assessment. However, with stringent FDR control (e.g., 1%), many low-scoring but true peptide identifications may be excluded along with the false positives. To increase the sensitivity and specificity of novel gene discovery, one should reduce the size of the searched database as much as possible. For example, when transcriptome information is available (especially from strand-specific cDNAseq data), searching against the transcriptome as well is clearly more favorable than searching against the genome alone. If transcriptome information is unavailable, it also helps to reduce the 6-ORF translation database by removing sequences predicted to be unlikely to be real proteins.
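The following sketch shows one simple way a 6-ORF translation database can be built and trimmed. It translates both strands in all three reading frames using the standard genetic code, splits each translation at stop codons, and drops short fragments. The minimum-length filter is only a crude stand-in, chosen for this example, for "sequences unlikely to be real proteins"; it is not the paper's method, and the function names and the default `min_length` are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered by base in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate from position 0; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna: str, min_length: int = 30) -> list:
    """Translate all six frames and keep stop-free segments >= min_length.

    min_length is a hypothetical cutoff used to shrink the database.
    """
    dna = dna.upper()
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            for segment in translate(strand[offset:]).split("*"):
                if len(segment) >= min_length:
                    orfs.append(segment)
    return orfs
```

Raising `min_length` shrinks the database, but it also discards genuine short proteins, so the cutoff is a trade-off rather than a safe default.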

Reference:

A note on the false discovery rate of novel peptides in proteogenomics
Kun Zhang, Yan Fu, Wen-Feng Zeng, Kun He, Hao Chi, Chao Liu, Yan-Chang Li, Yuan Gao, Ping Xu and Si-Min He
