640ebced50314d2662db498562167e0d21980f78
max
  Tue Jun 2 16:36:43 2026 -0700
gencNcOrfs: add peptide evidence fields from TransCODE/PeptideAtlas (PMID:42092140)

Extends the GENCODE Phase I ncORF bigGenePred with 14 peptide evidence fields
joined from five supplementary tables in Deutsch et al. 2026, producing
Ribo-seq_ORFs.kozak.peptides.bb (44 fields). Introduces the "peptidein" concept:
ncORF translation products confidently detected by mass spectrometry but not yet
annotatable as proteins.
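
The join can be pictured as a left join keyed on the ORF identifier. The
following is a minimal sketch only, not the code in addPeptideEvidence.py: the
function name, the sample ORF identifiers and the field values are hypothetical,
and it assumes the evidence has already been collapsed to one record per ORF.

```python
# Illustrative left join; NOT the actual addPeptideEvidence.py logic.
# ORF identifiers and field values below are hypothetical.

def join_evidence(orf_ids, evidence, defaults):
    """Return one row per ORF: evidence values where present, defaults otherwise."""
    rows = []
    for orf_id in orf_ids:
        row = dict(defaults)                   # start from the empty defaults
        row.update(evidence.get(orf_id, {}))   # overlay evidence if this ORF has any
        row["name"] = orf_id
        rows.append(row)
    return rows


defaults = {"isPeptidein": "", "hlaClass": ""}   # assumed default empty values
evidence = {"orfA": {"isPeptidein": "yes", "hlaClass": "I"}}
rows = join_evidence(["orfA", "orfB"], evidence, defaults)

# Count ORFs that carry at least one non-default evidence value.
with_evidence = sum(1 for r in rows if any(r[k] != "" for k in defaults))
print(with_evidence, len(rows) - with_evidence)
```

Because every input ORF yields exactly one output row, a join of this shape
leaves the item count unchanged, consistent with 2,150 ORFs with evidence plus
5,114 with defaults giving 7,264 in total.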

Data covers 2,150 of 7,264 Phase I ORFs; the remaining 5,114 carry default empty
values. Three color schemes are available via the colorFields dropdown:
- Kozak strength (default, pre-existing field 9 colors)
- Evidence type: gold=peptidein, blue=HLA only, green=non-HLA only, purple=both
- HLA class: blue=class I, crimson=class II, purple=both
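
As an illustration of the "Evidence type" scheme listed above, the category to
color mapping could look like the sketch below. This is not taken from the build
script: the function name and its boolean arguments are hypothetical, and it
assumes that peptidein status takes precedence over the HLA / non-HLA categories,
which the commit does not state.

```python
# Illustrative mapping for the "Evidence type" scheme; NOT from the build script.
# Assumption: peptidein status takes precedence over the other categories.

def evidence_color(is_peptidein, has_hla, has_non_hla):
    """Return the color name for an ORF, or None when it has no peptide evidence."""
    if is_peptidein:
        return "gold"
    if has_hla and has_non_hla:
        return "purple"
    if has_hla:
        return "blue"
    if has_non_hla:
        return "green"
    return None  # the source does not state a color for ORFs without evidence


print(evidence_color(False, True, False))
```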

New filters: isPeptidein, hlaClass, hlaFinalTier, hlaHppCategory, riboseqQuality.
Phase II shortLabels: "GENCODE Phase II ncORFs Prim/Compr".
Phase I kozakTE filterLimits set to -1:1.5 so ORFs with unavailable context are accessible.

Build: peptideSupport/addPeptideEvidence.py
AutoSql: Ribo-seq_ORFs.kozak.peptides.as
BigBed: /gbdb/hg38/ncOrfs/gencNcOrf/Ribo-seq_ORFs.kozak.peptides.bb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

diff --git src/hg/makeDb/doc/hg38/ncOrfs.txt src/hg/makeDb/doc/hg38/ncOrfs.txt
index b0c6f85f412..c9fc8cb66e2 100644
--- src/hg/makeDb/doc/hg38/ncOrfs.txt
+++ src/hg/makeDb/doc/hg38/ncOrfs.txt
@@ -331,15 +331,74 @@
 #   bigDataUrl ...kozak.bb   type bigGenePred   itemRgb on
 #   baseColorUseCds given   baseColorDefault genomicCodons
 #   mouseOver  <HTML using $startCodon, $kozakStrength, $kozakTE, ...>
 #   filterValues.startCodon ATG,CTG,GTG,TTG,ACG,other,none
 #   filterValues.kozakStrength Strong,Moderate,Weak,non-ATG,None
 #   filterByRange.kozakTE on   filter.kozakTE 0:1.5   filterLimits.kozakTE 0:1.5
 # OpenProt's pre-existing binary "kozak" field is renamed to "kozakMotif" in
 # openprot.as so that $kozakStrength substitution in the mouseOver is not
 # greedily eaten by $kozak.
 
 # Reference:
 # Noderer WL, Flockhart RJ, Bhaduri A, Diaz de Arce AJ, Zhang J, Khavari PA,
 # Wang CL. Quantitative analysis of mammalian translation initiation sites by
 # FACS-seq. Mol Syst Biol. 2014 Aug 28;10(8):748. DOI: 10.15252/msb.20145136
 # PMID: 25170020
+
+##############################################################################
+# GENCODE Phase I ncORFs: peptide evidence from TransCODE/PeptideAtlas
+# 2026-06-02 Claude max
+#
+# Deutsch et al. (Nature 2026, PMID 42092140) queried the 7,264 Phase I ORFs
+# against two PeptideAtlas builds: HLA immunopeptidomics (>100 HLA-typed donors,
+# HLA-I and HLA-II peptidomes) and whole-cell tryptic proteomics. The study
+# introduced the "peptidein" concept for translation products detectable by MS
+# but not yet annotatable as proteins.
+#
+# Five supplementary tables were downloaded from the Nature paper at
+# https://doi.org/10.1038/s41586-026-10459-x and saved to:
+#   /hive/data/genomes/hg38/bed/ncorfs/gencNcOrf/peptideSupport/
+#   41586_2026_10459_MOESM4_ESM_Table_S2.tsv  (non-HLA peptides, per peptide)
+#   41586_2026_10459_MOESM5_ESM_Table_S3.tsv  (non-HLA ORF-level summary)
+#   41586_2026_10459_MOESM8_ESM_Table_S6.tsv  (HLA peptides, per peptide)
+#   41586_2026_10459_MOESM9_ESM_Table_S7.tsv  (HLA ORF-level summary)
+#   41586_2026_10459_MOESM12_ESM_Sheet1.tsv   (peptidein set, Table S12)
+
+cd /hive/data/genomes/hg38/bed/ncorfs/gencNcOrf
+
+# Join all five tables onto the Phase I kozak bigBed by the short ORF identifier.
+# Adds 14 extra fields; ORFs without evidence get default empty values.
+# python3 must resolve to ~/miniconda3/bin/python (needs py2bit). The script reads
+# field 9 (Kozak colors) from the source bigBed, so colors are not recomputed.
+python3 peptideSupport/addPeptideEvidence.py \
+    2>peptideSupport/addPeptides.log > Ribo-seq_ORFs.kozak.peptides.bed
+
+# Counts: 2,150 ORFs with evidence, 5,114 without (7,264 total)
+# Skipped: 331 non-HLA peptides with a non-empty 'exclude' value (Maps2Core / TooShort)
+
+bedSort Ribo-seq_ORFs.kozak.peptides.bed Ribo-seq_ORFs.kozak.peptides.sorted.bed
+bedToBigBed -as=Ribo-seq_ORFs.kozak.peptides.as -type=bed12+ -tab \
+    Ribo-seq_ORFs.kozak.peptides.sorted.bed \
+    /hive/data/genomes/hg38/chrom.sizes \
+    Ribo-seq_ORFs.kozak.peptides.bb
+rm Ribo-seq_ORFs.kozak.peptides.sorted.bed
+
+# Verify: itemCount=7264, fieldCount=44
+bigBedInfo Ribo-seq_ORFs.kozak.peptides.bb
+
+# AutoSql and join script are in:
+#   ~/kent/src/hg/makeDb/scripts/ncOrfs/addPeptideEvidence.py
+#   ~/kent/src/hg/makeDb/scripts/ncOrfs/gencNcOrf.as (shared with kozak step)
+#   Ribo-seq_ORFs.kozak.peptides.as (in data dir, matches 44-field schema)
+
+# To recolor Kozak strength, rerun run_kozak.sh on all three kozak bigBeds, then
+# rebuild the peptides bigBed. Requires ~/miniconda3/bin/python for py2bit:
+export PATH=~/miniconda3/bin:$PATH
+scriptDir=~/kent/src/hg/makeDb/scripts/ncOrfs
+bash $scriptDir/run_kozak.sh bigGenePred       Ribo-seq_ORFs.bb               $scriptDir/gencNcOrf.as       Ribo-seq_ORFs.kozak.bb
+bash $scriptDir/run_kozak.sh gencNcOrfBed12    Ribo-seq_ORFs.primary.bb       $scriptDir/gencNcOrfPhase2.as Ribo-seq_ORFs.primary.kozak.bb
+bash $scriptDir/run_kozak.sh gencNcOrfBed12    Ribo-seq_ORFs.comprehensive.bb $scriptDir/gencNcOrfPhase2.as Ribo-seq_ORFs.comprehensive.kozak.bb
+# Then rebuild the peptides bigBed as shown above.
+
+# Reference:
+# Deutsch EW et al. Expanding the human proteome with microproteins and peptideins.
+# Nature. 2026 May 6. DOI: 10.1038/s41586-026-10459-x. PMID: 42092140