src/hg/makeDb/doc/hg38/nmd.txt 3a62ea7e9a8cb3503586a0a78570331308c9bc58

3a62ea7e9a8cb3503586a0a78570331308c9bc58
max
  Mon Apr 27 02:23:00 2026 -0700
NMD Escape MANE: expose NM_ accession via labelFields. refs #33737

Per QA, the MANE subtrack now shows the NCBI RefSeq accession by default
instead of the HGNC gene symbol, with the ENST and gene symbol still
selectable via labelFields.

- genePredNmdEsc: new --ncbi-id-field N option (default -1 = unused).
When set, the named bigGenePred column is captured per-transcript and
written into a new ncbiIds output column. For MANE pass 21.
- genePredNmdEsc: new --no-collapse option. By default, regions with
identical (chrom, start, end, rule) from multiple transcripts collapse
into one row with comma-separated lists. With --no-collapse the script
emits one row per (transcript, region). Used for MANE so each
label-field column holds a single value: the 74 MANE Plus Clinical
genes (e.g. LMNA) get two rows per region instead of one row with a
two-element list.
- nmdEscCollapsed.as: add lstring ncbiIds column. Schema is now bed9+3.
- nmd.ra (nmdEscMane only): labelFields ncbiIds,name,transcripts;
defaultLabelFields ncbiIds; labelSeparator " / ". Gencode and RefSeq
subtracks unchanged - they default to the gene symbol (name column)
and have an empty ncbiIds column.
- doc/hg38/nmd.txt: bump all three bedToBigBed invocations to bed9+3
and document the --ncbi-id-field 21 + --no-collapse invocation for
MANE.

Counts: MANE 68,028 (--no-collapse); Gencode 233,375; RefSeq 112,356.

diff --git src/hg/makeDb/doc/hg38/nmd.txt src/hg/makeDb/doc/hg38/nmd.txt
index fafb069b827..691e42ca0f7 100644
--- src/hg/makeDb/doc/hg38/nmd.txt
+++ src/hg/makeDb/doc/hg38/nmd.txt
@@ -1,143 +1,151 @@
 #######################################################################
 # NMD escape regions from Gencode (2025-03-24 max/Claude)
 # Two outputs: decorator bigBed (per-transcript) and collapsed bigBed (merged by coordinates)
 # Collapsed version uses gene symbols from input, colors by rule, transcript lists
 # Script accepts -f bigGenePred (gencode .bb) or -f genePredExt (ncbiRefSeq .txt.gz)
 #
 # 2026-04-20 lrnassar: Added Rule 4 (long-exon rule, Lindeboom 2016) - coding
 # exons >400 nt excluding the last coding exon. Rebuilt Gencode + RefSeq.
 #
 # 2026-04-21 lrnassar: Fixed Rule 2 to test rec["exonCount"]==1 instead of
 # len(cdsExons)==1. The old test misclassified multi-exon transcripts with a
 # single CDS exon (UTR introns) as "intronless", AND silently suppressed their
 # Rule 1/3/4 assignments via the if/else short-circuit. ~3,253 RefSeq curated
 # transcripts and ~2,000 Gencode transcripts reassigned. Rebuilt both tracks.
 #
 # 2026-04-21 lrnassar: Refined Rule 2 gate to reflect the real NMD biology:
 # "single coding exon AND no 3'UTR intron" instead of "exonCount==1".
 # 5'UTR introns do not deposit EJCs downstream of the stop codon (their EJCs
 # are cleared by the scanning 40S or sit upstream of the stop codon and are
 # never encountered by the terminating ribosome), so transcripts with a single
 # coding exon and only 5'UTR introns are NMD-immune and belong in Rule 2.
 # Transcripts with a single coding exon but a 3'UTR intron remain in Rules 1/3
 # because that intron deposits a downstream EJC. Reclassified 3,113 RefSeq
 # curated transcripts (95.7% of the single-CDS-exon set with UTR introns) and
 # 10,790 Gencode V49 transcripts into Rule 2.
 # Post-fix rule counts (collapsed regions):
 #   RefSeq Curated:  R1=54,015  R2=2,942  R3=49,443  R4=6,503  (total 112,903)
 #   Gencode V49:     R1=134,464 R2=6,599  R3=85,319  R4=7,547  (total 233,929)
 
 cd /hive/data/genomes/hg38/bed/nmd/gencode/
 
 # run the script on gencode bigGenePred - produces decorator + collapsed BED files
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred \
     /hive/data/genomes/hg38/bed/gencodeV49/build/hg38.gencodeV49.bb \
     knownGeneNmdDeco.bed nmdEscRegions.bed
 
 # build decorator bigBed
 bedSort knownGeneNmdDeco.bed knownGeneNmdDeco.bed
 bedToBigBed knownGeneNmdDeco.bed ../../../chrom.sizes knownGeneNmdDeco.bb \
     -tab -type=bed12+5 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscDecoration.as
 
 # build collapsed bigBed
 bedSort nmdEscRegions.bed nmdEscRegions.bed
 bedToBigBed nmdEscRegions.bed ../../../chrom.sizes nmdEscRegions.bb \
-    -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
+    -tab -type=bed9+3 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
 
 # symlinks to /gbdb
 ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/nmdEscRegions.bb /gbdb/hg38/nmd/nmdEscRegions.bb
 ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/knownGeneNmdDeco.bb /gbdb/hg38/nmd/knownGeneNmdDeco.bb
 
 
 #######################################################################
 # NMD escape regions from NCBI RefSeq (2025-03-24 max)
 #
 # 2026-04-21 lrnassar: Switched from RefSeq all to RefSeq curated (NM_/NR_ only,
 # no XM_/XR_ predicted models) per Max's request on RM #33737. Prior RefSeq-all
 # outputs moved to refseqAll.bak/ within the same build dir.
 
 cd /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/
 
 # run the script on ncbiRefSeq curated genePredExt
 # Note: the script writes nmdNcbiRefSeqDeco.bed (per-transcript decorator format)
 # alongside the collapsed output, but we intentionally do not convert it to bigBed
 # for RefSeq. The decorator workflow currently only ships for Gencode/knownGene
 # (via knownGeneNmdDeco.bb).
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f genePredExt \
     /hive/data/genomes/hg38/bed/ncbiRefSeq.p14.2025-08-13/archive/hg38.ncbiRefSeqCurated.txt.gz \
     nmdNcbiRefSeqDeco.bed nmdEscNcbiRefSeq.bed
 
 # build collapsed bigBed
 bedSort nmdEscNcbiRefSeq.bed nmdEscNcbiRefSeq.bed
 bedToBigBed nmdEscNcbiRefSeq.bed ../../../chrom.sizes nmdEscNcbiRefSeq.bb \
-    -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
+    -tab -type=bed9+3 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
 
 # symlink to gbdb
 ln -sf /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/nmdEscNcbiRefSeq.bb /gbdb/hg38/nmd/nmdEscNcbiRefSeq.bb
 
 #######################################################################
 # NMD escape regions from MANE Select Plus Clinical (2026-04-24 max)
 # Same script, run on the MANE bigGenePred. MANE puts the HGNC symbol in
 # bigGenePred field 18 (Gencode uses field 17), so pass --gene-sym-field 18.
 
 mkdir -p /hive/data/genomes/hg38/bed/nmd/mane && cd /hive/data/genomes/hg38/bed/nmd/mane
 
+# --ncbi-id-field 21 puts the NCBI RefSeq accession (NM_/NR_) into the
+# collapsed bigBed's ncbiIds column so the trackDb stanza can offer NM_ as
+# the default label via labelFields/defaultLabelFields.
+# --no-collapse emits one row per (transcript, region). MANE Select gives
+# one transcript per gene; MANE Plus Clinical adds a second transcript for
+# 74 genes (e.g. LMNA). Keeping rows per-transcript means each label-field
+# column holds a single value, which renders cleaner than a comma list.
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred --gene-sym-field 18 \
+    --ncbi-id-field 21 --no-collapse \
     /gbdb/hg38/mane/mane.bb \
     nmdManeDeco.bed nmdEscMane.bed
 
 bedSort nmdEscMane.bed nmdEscMane.bed
 bedToBigBed nmdEscMane.bed ../../../chrom.sizes nmdEscMane.bb \
-    -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
+    -tab -type=bed9+3 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
 
 ln -sf /hive/data/genomes/hg38/bed/nmd/mane/nmdEscMane.bb /gbdb/hg38/nmd/nmdEscMane.bb
 
 # Collapsed-region counts (current script, no Rule 1/4 algorithmic fix):
 #   MANE 1.5:        68,345
 #   Gencode V49:    233,929
 #   RefSeq Curated: 112,903
 
 # 2026-04-24 max: Fixed Rule 1 to measure 50 bp upstream of the last splice
 # junction of the transcript (including 3'UTR introns), not the last CDS-CDS
 # junction; output regions are clipped to CDS. The old logic stripped 3'UTR
 # from the exon list before computing the "last coding junction", which
 # over-painted the last CDS exon as NMD-escape whenever there was only one
 # CDS exon, even when a 3'UTR intron sat far downstream (e.g. NBDY: the
 # entire 207 bp CDS was painted Rule 1 despite the last junction being
 # 2.6 kb past the stop). Rule 4 updated in the same pass: when a 3'UTR
 # intron exists, the last CDS-containing exon has a downstream EJC and is
 # now eligible for Rule 4.
 #
 # The script supports two ways of counting the 50 bp walk-back from the
 # last junction (--rule1-mode):
 #   cds  (default) - count only CDS nucleotides, skipping 3'UTR. A
 #                    transcript like NBDY (last junction 2.6 kb past the
 #                    stop, in 3'UTR) gets 50 bp of CDS painted, matching
 #                    the literal "last 50 bp" reading of the rule label.
 #   mrna           - count mRNA nucleotides including 3'UTR, then clip
 #                    output to CDS. NBDY-like transcripts get nothing
 #                    painted because the 50 mRNA-bp window stays inside
 #                    3'UTR. Tracks the 55 bp-rule literature, where the
 #                    ribosome-EJC distance is measured in mRNA bp.
 # We ship the 'cds' mode; the 'mrna' mode is retained for comparison.
 #
 # Post-fix collapsed-region counts (--rule1-mode=cds):
-#   MANE 1.5:        67,752
+#   MANE 1.5:        68,028  (--no-collapse: one row per transcript)
 #   Gencode V49:    233,375
 #   RefSeq Curated: 112,356
 
 #######################################################################
 # Lindeboom et al. NMDetective scores (2025-03-23 max/Claude)
 # NMD efficiency predictions from Lindeboom et al. 2016, Nat Genet.
 # Four bedGraph custom track files downloaded to:
 #   /hive/data/genomes/hg38/bed/nmd/lindeboom/
 # Data downloaded from https://figshare.com/articles/dataset/NMDetective/7803398
 # Custom track data in the session links from that page
 # - NMDetectiveA.ct  - Random forest prediction of NMD efficiency
 # - NMDetectiveB.ct  - Decision tree prediction of NMD efficiency
 # - nmdDectA-ptc.ct  - Random forest, first out-of-frame PTC
 # - nmdDectB-ptc.ct  - Decision tree, first out-of-frame PTC
 
 # Convert bedGraph custom tracks to bigWig and symlink from /gbdb:
 cd /hive/data/genomes/hg38/bed/nmd/lindeboom/
 bash ~/kent/src/hg/makeDb/scripts/nmd/lindeboomToBigWig.sh