src/hg/makeDb/doc/hg38/nmd.txt 0151d00a4a1d73a78c35f6158c6c936ff338faeb

0151d00a4a1d73a78c35f6158c6c936ff338faeb
max
  Fri Apr 24 10:37:34 2026 -0700
NMD Escape: MANE subtrack, Rule 1 bug fix, transcript filter. refs #33737

- Add nmdEscMane subtrack (MANE Select Plus Clinical 1.5), built from
/gbdb/hg38/mane/mane.bb. Reuses nmdEscTranscripts.html.
- Fix Rule 1: measure 50 bp upstream of the transcript's last splice
junction (including 3'UTR introns) rather than stripping 3'UTR from
the exon list first. The old logic painted the entire last CDS exon
as NMD-escape whenever the transcript had only one CDS exon, even
when a 3'UTR intron sat far past the stop codon (e.g. NBDY: 207 bp
of CDS over-painted for a junction 2.6 kb past the stop).
- Add --rule1-mode {cds,mrna} (default cds): cds counts only CDS bp
on the walk-back (paints up to 50 bp of CDS matching the rule label
literally); mrna counts mRNA bp and clips to CDS (tracks the 55 bp
rule literature). Documented in makeDoc.
- Rule 4: when a 3'UTR intron exists, the last CDS-containing exon
has a downstream EJC and is now eligible for the long-exon rule.
- Mouseover lists contributing transcript accessions when 1-3 items
collapse into a region; falls back to a count above that.
- Add filterText/filterType/filterLabel on all three escape subtracks
so a user can narrow the display to one transcript.
- genePredNmdEsc: --gene-sym-field (default 17 for Gencode; pass 18
for MANE, whose HGNC symbol lives in bigGenePred geneName2).
- Add findShortTxLongUtrIntron.py helper for finding MANE transcripts
with long UTR introns (used to pick NMD edge-case test cases).

Post-fix collapsed-region counts (--rule1-mode=cds):
MANE 1.5:        67,752
Gencode V49:    233,375
RefSeq Curated: 112,356

diff --git src/hg/makeDb/doc/hg38/nmd.txt src/hg/makeDb/doc/hg38/nmd.txt
index b1a45d82daa..fafb069b827 100644
--- src/hg/makeDb/doc/hg38/nmd.txt
+++ src/hg/makeDb/doc/hg38/nmd.txt
@@ -1,92 +1,143 @@
 #######################################################################
 # NMD escape regions from Gencode (2025-03-24 max/Claude)
 # Two outputs: decorator bigBed (per-transcript) and collapsed bigBed (merged by coordinates)
 # Collapsed version uses gene symbols from input, colors by rule, transcript lists
 # Script accepts -f bigGenePred (gencode .bb) or -f genePredExt (ncbiRefSeq .txt.gz)
 #
 # 2026-04-20 lrnassar: Added Rule 4 (long-exon rule, Lindeboom 2016) - coding
 # exons >400 nt excluding the last coding exon. Rebuilt Gencode + RefSeq.
 #
 # 2026-04-21 lrnassar: Fixed Rule 2 to test rec["exonCount"]==1 instead of
 # len(cdsExons)==1. The old test misclassified multi-exon transcripts with a
 # single CDS exon (UTR introns) as "intronless", AND silently suppressed their
 # Rule 1/3/4 assignments via the if/else short-circuit. ~3,253 RefSeq curated
 # transcripts and ~2,000 Gencode transcripts reassigned. Rebuilt both tracks.
 #
 # 2026-04-21 lrnassar: Refined Rule 2 gate to reflect the real NMD biology:
 # "single coding exon AND no 3'UTR intron" instead of "exonCount==1".
 # 5'UTR introns do not deposit EJCs downstream of the stop codon (their EJCs
 # are cleared by the scanning 40S or sit upstream of the stop codon and are
 # never encountered by the terminating ribosome), so transcripts with a single
 # coding exon and only 5'UTR introns are NMD-immune and belong in Rule 2.
 # Transcripts with a single coding exon but a 3'UTR intron remain in Rules 1/3
 # because that intron deposits a downstream EJC. Reclassified 3,113 RefSeq
 # curated transcripts (95.7% of the single-CDS-exon set with UTR introns) and
 # 10,790 Gencode V49 transcripts into Rule 2.
 # Post-fix rule counts (collapsed regions):
 #   RefSeq Curated:  R1=54,015  R2=2,942  R3=49,443  R4=6,503  (total 112,903)
 #   Gencode V49:     R1=134,464 R2=6,599  R3=85,319  R4=7,547  (total 233,929)
 
 cd /hive/data/genomes/hg38/bed/nmd/gencode/
 
 # run the script on gencode bigGenePred - produces decorator + collapsed BED files
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred \
     /hive/data/genomes/hg38/bed/gencodeV49/build/hg38.gencodeV49.bb \
     knownGeneNmdDeco.bed nmdEscRegions.bed
 
 # build decorator bigBed
 bedSort knownGeneNmdDeco.bed knownGeneNmdDeco.bed
 bedToBigBed knownGeneNmdDeco.bed ../../../chrom.sizes knownGeneNmdDeco.bb \
     -tab -type=bed12+5 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscDecoration.as
 
 # build collapsed bigBed
 bedSort nmdEscRegions.bed nmdEscRegions.bed
 bedToBigBed nmdEscRegions.bed ../../../chrom.sizes nmdEscRegions.bb \
     -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
 
 # symlinks to /gbdb
 ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/nmdEscRegions.bb /gbdb/hg38/nmd/nmdEscRegions.bb
 ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/knownGeneNmdDeco.bb /gbdb/hg38/nmd/knownGeneNmdDeco.bb
 
 
 #######################################################################
 # NMD escape regions from NCBI RefSeq (2025-03-24 max)
 #
 # 2026-04-21 lrnassar: Switched from RefSeq all to RefSeq curated (NM_/NR_ only,
 # no XM_/XR_ predicted models) per Max's request on RM #33737. Prior RefSeq-all
 # outputs moved to refseqAll.bak/ within the same build dir.
 
 cd /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/
 
 # run the script on ncbiRefSeq curated genePredExt
 # Note: the script writes nmdNcbiRefSeqDeco.bed (per-transcript decorator format)
 # alongside the collapsed output, but we intentionally do not convert it to bigBed
 # for RefSeq. The decorator workflow currently only ships for Gencode/knownGene
 # (via knownGeneNmdDeco.bb).
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f genePredExt \
     /hive/data/genomes/hg38/bed/ncbiRefSeq.p14.2025-08-13/archive/hg38.ncbiRefSeqCurated.txt.gz \
     nmdNcbiRefSeqDeco.bed nmdEscNcbiRefSeq.bed
 
 # build collapsed bigBed
 bedSort nmdEscNcbiRefSeq.bed nmdEscNcbiRefSeq.bed
 bedToBigBed nmdEscNcbiRefSeq.bed ../../../chrom.sizes nmdEscNcbiRefSeq.bb \
     -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
 
 # symlink to gbdb
 ln -sf /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/nmdEscNcbiRefSeq.bb /gbdb/hg38/nmd/nmdEscNcbiRefSeq.bb
 
+#######################################################################
+# NMD escape regions from MANE Select Plus Clinical (2026-04-24 max)
+# Same script, run on the MANE bigGenePred. MANE puts the HGNC symbol in
+# bigGenePred field 18 (Gencode uses field 17), so pass --gene-sym-field 18.
+
+mkdir -p /hive/data/genomes/hg38/bed/nmd/mane && cd /hive/data/genomes/hg38/bed/nmd/mane
+
+~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred --gene-sym-field 18 \
+    /gbdb/hg38/mane/mane.bb \
+    nmdManeDeco.bed nmdEscMane.bed
+
+bedSort nmdEscMane.bed nmdEscMane.bed
+bedToBigBed nmdEscMane.bed ../../../chrom.sizes nmdEscMane.bb \
+    -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
+
+ln -sf /hive/data/genomes/hg38/bed/nmd/mane/nmdEscMane.bb /gbdb/hg38/nmd/nmdEscMane.bb
+
+# Collapsed-region counts (current script, no Rule 1/4 algorithmic fix):
+#   MANE 1.5:        68,345
+#   Gencode V49:    233,929
+#   RefSeq Curated: 112,903
+
+# 2026-04-24 max: Fixed Rule 1 to measure 50 bp upstream of the last splice
+# junction of the transcript (including 3'UTR introns), not the last CDS-CDS
+# junction; output regions are clipped to CDS. The old logic stripped 3'UTR
+# from the exon list before computing the "last coding junction", which
+# over-painted the last CDS exon as NMD-escape whenever there was only one
+# CDS exon, even when a 3'UTR intron sat far downstream (e.g. NBDY: the
+# entire 207 bp CDS was painted Rule 1 despite the last junction being
+# 2.6 kb past the stop). Rule 4 updated in the same pass: when a 3'UTR
+# intron exists, the last CDS-containing exon has a downstream EJC and is
+# now eligible for Rule 4.
+#
+# The script supports two ways of counting the 50 bp walk-back from the
+# last junction (--rule1-mode):
+#   cds  (default) - count only CDS nucleotides, skipping 3'UTR. A
+#                    transcript like NBDY (last junction 2.6 kb past the
+#                    stop, in 3'UTR) gets 50 bp of CDS painted, matching
+#                    the literal "last 50 bp" reading of the rule label.
+#   mrna           - count mRNA nucleotides including 3'UTR, then clip
+#                    output to CDS. NBDY-like transcripts get nothing
+#                    painted because the 50 mRNA-bp window stays inside
+#                    3'UTR. Tracks the 55 bp-rule literature, where the
+#                    ribosome-EJC distance is measured in mRNA bp.
+# We ship the 'cds' mode; the 'mrna' mode is retained for comparison.
+#
+# Post-fix collapsed-region counts (--rule1-mode=cds):
+#   MANE 1.5:        67,752
+#   Gencode V49:    233,375
+#   RefSeq Curated: 112,356
+
 #######################################################################
 # Lindeboom et al. NMDetective scores (2025-03-23 max/Claude)
 # NMD efficiency predictions from Lindeboom et al. 2016, Nat Genet.
 # Four bedGraph custom track files downloaded to:
 #   /hive/data/genomes/hg38/bed/nmd/lindeboom/
 # Data downloaded from https://figshare.com/articles/dataset/NMDetective/7803398
 # Custom track data in the session links from that page
 # - NMDetectiveA.ct  - Random forest prediction of NMD efficiency
 # - NMDetectiveB.ct  - Decision tree prediction of NMD efficiency
 # - nmdDectA-ptc.ct  - Random forest, first out-of-frame PTC
 # - nmdDectB-ptc.ct  - Decision tree, first out-of-frame PTC
 
 # Convert bedGraph custom tracks to bigWig and symlink from /gbdb:
 cd /hive/data/genomes/hg38/bed/nmd/lindeboom/
 bash ~/kent/src/hg/makeDb/scripts/nmd/lindeboomToBigWig.sh