34d2eee845f5f45e571d1e153c632683b8a93f75
lrnassar
  Tue Apr 21 16:17:53 2026 -0700
Refine NMD Escape Rule 2 gate to "single coding exon and no 3'UTR intron". refs #33737

Previously Rule 2 required exonCount==1 (truly intronless). That was an
overcorrection for single-CDS-exon transcripts whose only introns are in
the 5'UTR. These transcripts have no EJC downstream of the stop codon
(5'UTR EJCs are cleared by the scanning 40S or sit upstream of the
terminating ribosome), so they are NMD-immune, yet the code pushed them
into Rules 1/3 under the less accurate "last coding exon" label.

New gate: len(cdsExons) == 1 AND no exon-exon junction strictly
downstream of the stop codon (strand-aware). Transcripts with a single
coding exon but a 3'UTR intron correctly stay in Rules 1/3 because that
intron deposits an EJC that can trigger NMD.
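
A minimal sketch of the gate described above, for illustration only; it
is not the code in genePredNmdEsc. The function name is_rule2, the
argument layout, and the use of half-open (start, end) genomic intervals
are assumptions. A junction is treated as downstream of the stop codon
when a whole exon lies past the CDS end in transcript orientation.

```python
def is_rule2(exons, cds_start, cds_end, strand):
    """Return True for a single coding exon with no 3'UTR intron.

    exons: list of (start, end) half-open genomic intervals (assumed layout).
    cds_start/cds_end: genomic CDS bounds, half-open, as in genePred.
    """
    if cds_start >= cds_end:
        return False  # non-coding transcript, no stop codon to test

    # Exons that overlap the CDS at all.
    cds_exons = [(s, e) for s, e in exons if s < cds_end and e > cds_start]
    if len(cds_exons) != 1:
        return False

    # Strand-aware: an exon lying entirely past the stop codon implies an
    # exon-exon junction (and hence an EJC) downstream of the stop.
    if strand == "+":
        has_3utr_junction = any(s >= cds_end for s, _ in exons)
    else:
        has_3utr_junction = any(e <= cds_start for _, e in exons)
    return not has_3utr_junction
```

On the minus strand the stop codon sits at cds_start, so "downstream"
means lower genomic coordinates; that is why the test is mirrored rather
than reusing the plus-strand comparison.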

3,113 RefSeq Curated and 10,790 Gencode V49 transcripts move into Rule
2. 140 RefSeq and 1,135 Gencode single-CDS-exon transcripts with 3'UTR
introns correctly remain in Rules 1/3. Description page and makedoc
updated.

diff --git src/hg/makeDb/doc/hg38/nmd.txt src/hg/makeDb/doc/hg38/nmd.txt
index cf65f672985..b1a45d82daa 100644
--- src/hg/makeDb/doc/hg38/nmd.txt
+++ src/hg/makeDb/doc/hg38/nmd.txt
@@ -1,29 +1,43 @@
 #######################################################################
 # NMD escape regions from Gencode (2025-03-24 max/Claude)
 # Two outputs: decorator bigBed (per-transcript) and collapsed bigBed (merged by coordinates)
 # Collapsed version uses gene symbols from input, colors by rule, transcript lists
 # Script accepts -f bigGenePred (gencode .bb) or -f genePredExt (ncbiRefSeq .txt.gz)
 #
 # 2026-04-20 lrnassar: Added Rule 4 (long-exon rule, Lindeboom 2016) - coding
 # exons >400 nt excluding the last coding exon. Rebuilt Gencode + RefSeq.
 #
 # 2026-04-21 lrnassar: Fixed Rule 2 to test rec["exonCount"]==1 instead of
 # len(cdsExons)==1. The old test misclassified multi-exon transcripts with a
 # single CDS exon (UTR introns) as "intronless", AND silently suppressed their
 # Rule 1/3/4 assignments via the if/else short-circuit. ~3,253 RefSeq curated
 # transcripts and ~2,000 Gencode transcripts reassigned. Rebuilt both tracks.
+#
+# 2026-04-21 lrnassar: Refined Rule 2 gate to reflect the real NMD biology:
+# "single coding exon AND no 3'UTR intron" instead of "exonCount==1".
+# 5'UTR introns do not deposit EJCs downstream of the stop codon (their EJCs
+# are cleared by the scanning 40S or sit upstream of the stop codon and are
+# never encountered by the terminating ribosome), so transcripts with a single
+# coding exon and only 5'UTR introns are NMD-immune and belong in Rule 2.
+# Transcripts with a single coding exon but a 3'UTR intron remain in Rules 1/3
+# because that intron deposits a downstream EJC. Reclassified 3,113 RefSeq
+# curated transcripts (95.7% of the single-CDS-exon set with UTR introns) and
+# 10,790 Gencode V49 transcripts into Rule 2.
+# Post-fix rule counts (collapsed regions):
+#   RefSeq Curated:  R1=54,015  R2=2,942  R3=49,443  R4=6,503  (total 112,903)
+#   Gencode V49:     R1=134,464 R2=6,599  R3=85,319  R4=7,547  (total 233,929)
 
 cd /hive/data/genomes/hg38/bed/nmd/gencode/
 
 # run the script on gencode bigGenePred - produces decorator + collapsed BED files
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred \
     /hive/data/genomes/hg38/bed/gencodeV49/build/hg38.gencodeV49.bb \
     knownGeneNmdDeco.bed nmdEscRegions.bed
 
 # build decorator bigBed
 bedSort knownGeneNmdDeco.bed knownGeneNmdDeco.bed
 bedToBigBed knownGeneNmdDeco.bed ../../../chrom.sizes knownGeneNmdDeco.bb \
     -tab -type=bed12+5 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscDecoration.as
 
 # build collapsed bigBed
 bedSort nmdEscRegions.bed nmdEscRegions.bed