4bd316f5f1ca47328bd3f9a181214b788055f0bc
lrnassar
  Tue Apr 21 13:29:26 2026 -0700
NMD Escape QA round 3: switch RefSeq to curated, fix Rule 2 misclassification. refs #33737

Switched the NMD Escape RefSeq subtrack input from hg38.ncbiRefSeq.txt.gz (all)
to hg38.ncbiRefSeqCurated.txt.gz (NM_/NR_ only, no XM_/XR_ predicted models)
per Max's feedback. longLabel updated to "NCBI RefSeq Curated transcripts".

Fixed Rule 2 in genePredNmdEsc to test rec["exonCount"]==1 instead of
len(cdsExons)==1. The old test misclassified multi-exon transcripts with a
single CDS exon (UTR introns) as "intronless" and silently suppressed their
Rule 1/3/4 assignments via the if/else short-circuit. 3,253 RefSeq curated
and ~2,000 Gencode transcripts were reassigned from Rule 2 to Rules 1/3.
Rebuilt both tracks.
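
A minimal sketch of the difference between the two tests, for illustration
only. The record layout (a dict with genePred-style fields) and the helper
names cds_exons, is_intronless_old and is_intronless_new are assumptions,
not the actual genePredNmdEsc code:

```python
# Illustrative only: the dict layout and helper names are assumptions,
# not the real genePredNmdEsc implementation.

def cds_exons(rec):
    """Return exon intervals clipped to the CDS, dropping pure-UTR exons."""
    clipped = []
    for start, end in zip(rec["exonStarts"], rec["exonEnds"]):
        s = max(start, rec["cdsStart"])
        e = min(end, rec["cdsEnd"])
        if s < e:
            clipped.append((s, e))
    return clipped

def is_intronless_old(rec):
    # Old Rule 2 test: true whenever the CDS sits in one exon,
    # even if the transcript has introns in its UTRs.
    return len(cds_exons(rec)) == 1

def is_intronless_new(rec):
    # New Rule 2 test: true only for genuinely single-exon transcripts.
    return rec["exonCount"] == 1

# Three-exon transcript whose CDS lies entirely within the middle exon.
utr_intron_tx = {
    "exonCount": 3,
    "exonStarts": [100, 500, 900],
    "exonEnds": [200, 800, 1000],
    "cdsStart": 550,
    "cdsEnd": 750,
}

# Genuinely intronless transcript.
single_exon_tx = {
    "exonCount": 1,
    "exonStarts": [100],
    "exonEnds": [800],
    "cdsStart": 150,
    "cdsEnd": 750,
}

print(is_intronless_old(utr_intron_tx), is_intronless_new(utr_intron_tx))
print(is_intronless_old(single_exon_tx), is_intronless_new(single_exon_tx))
```

Under the old test the first transcript would take the Rule 2 branch and,
because of the if/else, never reach the Rule 1/3/4 checks; under the new test
it falls through to them. The second transcript is Rule 2 under both tests.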

Added Rule 1 caveat to nmdEscTranscripts.html for transcripts with a
penultimate coding exon shorter than 50 bp.

Added reciprocal relatedTracks.ra entries for nmd <-> mane and nmd <-> ncbiRefSeq.

QA cleanups: non-ASCII prime char replaced with &#8242;, mailing list links
given target="_blank" across all three HTML pages, dead commented-out
nmdGencode block removed from nmd.ra, AutoSQL field comments updated to cover
the Rule 4 color and the gene-symbol-to-transcript-ID fallback.

Makedoc updated with the full Gencode + RefSeq pipeline and /gbdb symlinks.

diff --git src/hg/makeDb/doc/hg38/nmd.txt src/hg/makeDb/doc/hg38/nmd.txt
index 5a013417d73..cf65f672985 100644
--- src/hg/makeDb/doc/hg38/nmd.txt
+++ src/hg/makeDb/doc/hg38/nmd.txt
@@ -1,56 +1,70 @@
 #######################################################################
 # NMD escape regions from Gencode (2025-03-24 max/Claude)
 # Two outputs: decorator bigBed (per-transcript) and collapsed bigBed (merged by coordinates)
 # Collapsed version uses gene symbols from input, colors by rule, transcript lists
 # Script accepts -f bigGenePred (gencode .bb) or -f genePredExt (ncbiRefSeq .txt.gz)
 #
 # 2026-04-20 lrnassar: Added Rule 4 (long-exon rule, Lindeboom 2016) - coding
 # exons >400 nt excluding the last coding exon. Rebuilt Gencode + RefSeq.
+#
+# 2026-04-21 lrnassar: Fixed Rule 2 to test rec["exonCount"]==1 instead of
+# len(cdsExons)==1. The old test misclassified multi-exon transcripts with a
+# single CDS exon (UTR introns) as "intronless", AND silently suppressed their
+# Rule 1/3/4 assignments via the if/else short-circuit. 3,253 RefSeq curated
+# transcripts and ~2,000 Gencode transcripts reassigned. Rebuilt both tracks.
 
 cd /hive/data/genomes/hg38/bed/nmd/gencode/
 
 # run the script on gencode bigGenePred - produces decorator + collapsed BED files
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred \
     /hive/data/genomes/hg38/bed/gencodeV49/build/hg38.gencodeV49.bb \
     knownGeneNmdDeco.bed nmdEscRegions.bed
 
 # build decorator bigBed
 bedSort knownGeneNmdDeco.bed knownGeneNmdDeco.bed
 bedToBigBed knownGeneNmdDeco.bed ../../../chrom.sizes knownGeneNmdDeco.bb \
     -tab -type=bed12+5 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscDecoration.as
 
 # build collapsed bigBed
 bedSort nmdEscRegions.bed nmdEscRegions.bed
 bedToBigBed nmdEscRegions.bed ../../../chrom.sizes nmdEscRegions.bb \
     -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
 
+# symlinks to /gbdb
+ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/nmdEscRegions.bb /gbdb/hg38/nmd/nmdEscRegions.bb
+ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/knownGeneNmdDeco.bb /gbdb/hg38/nmd/knownGeneNmdDeco.bb
+
 
 #######################################################################
 # NMD escape regions from NCBI RefSeq (2025-03-24 max)
+#
+# 2026-04-21 lrnassar: Switched from RefSeq all to RefSeq curated (NM_/NR_ only,
+# no XM_/XR_ predicted models) per Max's request on RM #33737. Prior RefSeq-all
+# outputs moved to refseqAll.bak/ within the same build dir.
 
 cd /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/
 
-# run the script on ncbiRefSeq genePredExt
-# Using all of RefSeq, not just refseq curated - good idea?
-# This is the file for RefSeq curated: /hive/data/genomes/hg38/bed/ncbiRefSeq.p14.2025-08-13/archive/hg38.ncbiRefSeqCurated.txt.gz 
+# run the script on ncbiRefSeq curated genePredExt
+# Note: the script writes nmdNcbiRefSeqDeco.bed (per-transcript decorator format)
+# alongside the collapsed output, but we intentionally do not convert it to bigBed
+# for RefSeq. The decorator workflow currently only ships for Gencode/knownGene
+# (via knownGeneNmdDeco.bb).
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f genePredExt \
-    /hive/data/genomes/hg38/bed/ncbiRefSeq.p14.2025-08-13/archive/hg38.ncbiRefSeq.txt.gz \
+    /hive/data/genomes/hg38/bed/ncbiRefSeq.p14.2025-08-13/archive/hg38.ncbiRefSeqCurated.txt.gz \
     nmdNcbiRefSeqDeco.bed nmdEscNcbiRefSeq.bed
 
-# not building decorator file - needed? Useful?
-
 # build collapsed bigBed
 bedSort nmdEscNcbiRefSeq.bed nmdEscNcbiRefSeq.bed
 bedToBigBed nmdEscNcbiRefSeq.bed ../../../chrom.sizes nmdEscNcbiRefSeq.bb \
     -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
 
 # symlink to gbdb
 ln -sf /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/nmdEscNcbiRefSeq.bb /gbdb/hg38/nmd/nmdEscNcbiRefSeq.bb
 
 #######################################################################
 # Lindeboom et al. NMDetective scores (2025-03-23 max/Claude)
 # NMD efficiency predictions from Lindeboom et al. 2016, Nat Genet.
 # Four bedGraph custom track files downloaded to:
 #   /hive/data/genomes/hg38/bed/nmd/lindeboom/
 # Data downloaded from https://figshare.com/articles/dataset/NMDetective/7803398
 # Custom track data in the session links from that page