34d2eee845f5f45e571d1e153c632683b8a93f75
lrnassar
  Tue Apr 21 16:17:53 2026 -0700
Refine NMD Escape Rule 2 gate to "single coding exon and no 3'UTR intron". refs #33737

Previously Rule 2 required exonCount==1 (truly intronless). This
overcorrected for single-CDS-exon transcripts whose only introns are in
the 5'UTR: biologically these have no EJC downstream of the stop codon
(5'UTR EJCs are cleared by the scanning 40S or sit upstream of the
terminating ribosome) and are NMD-immune, but the code pushed them to
Rules 1/3 under a less accurate "last coding exon" label.

New gate: len(cdsExons) == 1 AND no exon-exon junction strictly
downstream of the stop codon (strand-aware). Transcripts with a single
coding exon but a 3'UTR intron correctly stay in Rules 1/3 because that
intron deposits an EJC that can trigger NMD.

3,113 RefSeq Curated and 10,790 Gencode V49 transcripts move into Rule
2. 140 RefSeq and 1,135 Gencode single-CDS-exon transcripts with 3'UTR
introns correctly remain in Rules 1/3. Description page and makedoc
updated.

diff --git src/hg/makeDb/doc/hg38/nmd.txt src/hg/makeDb/doc/hg38/nmd.txt
index cf65f672985..b1a45d82daa 100644
--- src/hg/makeDb/doc/hg38/nmd.txt
+++ src/hg/makeDb/doc/hg38/nmd.txt
@@ -1,78 +1,92 @@
 #######################################################################
 # NMD escape regions from Gencode (2025-03-24 max/Claude)
 # Two outputs: decorator bigBed (per-transcript) and collapsed bigBed (merged by coordinates)
 # Collapsed version uses gene symbols from input, colors by rule, transcript lists
 # Script accepts -f bigGenePred (gencode .bb) or -f genePredExt (ncbiRefSeq .txt.gz)
 #
 # 2026-04-20 lrnassar: Added Rule 4 (long-exon rule, Lindeboom 2016) - coding
 # exons >400 nt excluding the last coding exon. Rebuilt Gencode + RefSeq.
 #
 # 2026-04-21 lrnassar: Fixed Rule 2 to test rec["exonCount"]==1 instead of
 # len(cdsExons)==1. The old test misclassified multi-exon transcripts with a
 # single CDS exon (UTR introns) as "intronless", AND silently suppressed their
 # Rule 1/3/4 assignments via the if/else short-circuit. ~3,253 RefSeq curated
 # transcripts and ~2,000 Gencode transcripts reassigned. Rebuilt both tracks.
+#
+# 2026-04-21 lrnassar: Refined Rule 2 gate to reflect the real NMD biology:
+# "single coding exon AND no 3'UTR intron" instead of "exonCount==1".
+# 5'UTR introns do not deposit EJCs downstream of the stop codon (their EJCs
+# are cleared by the scanning 40S or sit upstream of the stop codon and are
+# never encountered by the terminating ribosome), so transcripts with a single
+# coding exon and only 5'UTR introns are NMD-immune and belong in Rule 2.
+# Transcripts with a single coding exon but a 3'UTR intron remain in Rules 1/3
+# because that intron deposits a downstream EJC. Reclassified 3,113 RefSeq
+# curated transcripts (95.7% of the single-CDS-exon set with UTR introns) and
+# 10,790 Gencode V49 transcripts into Rule 2.
+# Post-fix rule counts (collapsed regions):
+#   RefSeq Curated:  R1=54,015  R2=2,942  R3=49,443  R4=6,503  (total 112,903)
+#   Gencode V49:     R1=134,464 R2=6,599  R3=85,319  R4=7,547  (total 233,929)
 
 cd /hive/data/genomes/hg38/bed/nmd/gencode/
 
 # run the script on gencode bigGenePred - produces decorator + collapsed BED files
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred \
     /hive/data/genomes/hg38/bed/gencodeV49/build/hg38.gencodeV49.bb \
     knownGeneNmdDeco.bed nmdEscRegions.bed
 
 # build decorator bigBed
 bedSort knownGeneNmdDeco.bed knownGeneNmdDeco.bed
 bedToBigBed knownGeneNmdDeco.bed ../../../chrom.sizes knownGeneNmdDeco.bb \
     -tab -type=bed12+5 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscDecoration.as
 
 # build collapsed bigBed
 bedSort nmdEscRegions.bed nmdEscRegions.bed
 bedToBigBed nmdEscRegions.bed ../../../chrom.sizes nmdEscRegions.bb \
     -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
 
 # symlinks to /gbdb
 ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/nmdEscRegions.bb /gbdb/hg38/nmd/nmdEscRegions.bb
 ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/knownGeneNmdDeco.bb /gbdb/hg38/nmd/knownGeneNmdDeco.bb
 
 
 #######################################################################
 # NMD escape regions from NCBI RefSeq (2025-03-24 max)
 #
 # 2026-04-21 lrnassar: Switched from RefSeq all to RefSeq curated (NM_/NR_ only,
 # no XM_/XR_ predicted models) per Max's request on RM #33737. Prior RefSeq-all
 # outputs moved to refseqAll.bak/ within the same build dir.
 
 cd /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/
 
 # run the script on ncbiRefSeq curated genePredExt
 # Note: the script writes nmdNcbiRefSeqDeco.bed (per-transcript decorator format)
 # alongside the collapsed output, but we intentionally do not convert it to bigBed
 # for RefSeq. The decorator workflow currently only ships for Gencode/knownGene
 # (via knownGeneNmdDeco.bb).
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f genePredExt \
     /hive/data/genomes/hg38/bed/ncbiRefSeq.p14.2025-08-13/archive/hg38.ncbiRefSeqCurated.txt.gz \
     nmdNcbiRefSeqDeco.bed nmdEscNcbiRefSeq.bed
 
 # build collapsed bigBed
 bedSort nmdEscNcbiRefSeq.bed nmdEscNcbiRefSeq.bed
 bedToBigBed nmdEscNcbiRefSeq.bed ../../../chrom.sizes nmdEscNcbiRefSeq.bb \
     -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
 
 # symlink to gbdb
 ln -sf /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/nmdEscNcbiRefSeq.bb /gbdb/hg38/nmd/nmdEscNcbiRefSeq.bb
 
 #######################################################################
 # Lindeboom et al. NMDetective scores (2025-03-23 max/Claude)
 # NMD efficiency predictions from Lindeboom et al. 2016, Nat Genet.
 # Four bedGraph custom track files downloaded to:
 #   /hive/data/genomes/hg38/bed/nmd/lindeboom/
 # Data downloaded from https://figshare.com/articles/dataset/NMDetective/7803398
 # Custom track data in the session links from that page
 # - NMDetectiveA.ct  - Random forest prediction of NMD efficiency
 # - NMDetectiveB.ct  - Decision tree prediction of NMD efficiency
 # - nmdDectA-ptc.ct  - Random forest, first out-of-frame PTC
 # - nmdDectB-ptc.ct  - Decision tree, first out-of-frame PTC
 
 # Convert bedGraph custom tracks to bigWig and symlink from /gbdb:
 cd /hive/data/genomes/hg38/bed/nmd/lindeboom/
 bash ~/kent/src/hg/makeDb/scripts/nmd/lindeboomToBigWig.sh