34d2eee845f5f45e571d1e153c632683b8a93f75 lrnassar Tue Apr 21 16:17:53 2026 -0700 Refine NMD Escape Rule 2 gate to "single coding exon and no 3'UTR intron". refs #33737 Previously Rule 2 required exonCount==1 (truly intronless). This overcorrected for single-CDS-exon transcripts whose only introns are in the 5'UTR: biologically these have no EJC downstream of the stop codon (5'UTR EJCs are cleared by the scanning 40S or sit upstream of the terminating ribosome) and are NMD-immune, but the code pushed them to Rules 1/3 under a less accurate "last coding exon" label. New gate: len(cdsExons) == 1 AND no exon-exon junction strictly downstream of the stop codon (strand-aware). Transcripts with a single coding exon but a 3'UTR intron correctly stay in Rules 1/3 because that intron deposits an EJC that can trigger NMD. 3,113 RefSeq Curated and 10,790 Gencode V49 transcripts move into Rule 2. 140 RefSeq and 1,135 Gencode single-CDS-exon transcripts with 3'UTR introns correctly remain in Rules 1/3. Description page and makedoc updated. diff --git src/hg/makeDb/doc/hg38/nmd.txt src/hg/makeDb/doc/hg38/nmd.txt index cf65f672985..b1a45d82daa 100644 --- src/hg/makeDb/doc/hg38/nmd.txt +++ src/hg/makeDb/doc/hg38/nmd.txt @@ -1,78 +1,92 @@ ####################################################################### # NMD escape regions from Gencode (2025-03-24 max/Claude) # Two outputs: decorator bigBed (per-transcript) and collapsed bigBed (merged by coordinates) # Collapsed version uses gene symbols from input, colors by rule, transcript lists # Script accepts -f bigGenePred (gencode .bb) or -f genePredExt (ncbiRefSeq .txt.gz) # # 2026-04-20 lrnassar: Added Rule 4 (long-exon rule, Lindeboom 2016) - coding # exons >400 nt excluding the last coding exon. Rebuilt Gencode + RefSeq. # # 2026-04-21 lrnassar: Fixed Rule 2 to test rec["exonCount"]==1 instead of # len(cdsExons)==1. The old test misclassified multi-exon transcripts with a # single CDS exon (UTR introns) as "intronless", AND silently suppressed their # Rule 1/3/4 assignments via the if/else short-circuit. ~3,253 RefSeq curated # transcripts and ~2,000 Gencode transcripts reassigned. Rebuilt both tracks. +# +# 2026-04-21 lrnassar: Refined Rule 2 gate to reflect the real NMD biology: +# "single coding exon AND no 3'UTR intron" instead of "exonCount==1". +# 5'UTR introns do not deposit EJCs downstream of the stop codon (their EJCs +# are cleared by the scanning 40S or sit upstream of the stop codon and are +# never encountered by the terminating ribosome), so transcripts with a single +# coding exon and only 5'UTR introns are NMD-immune and belong in Rule 2. +# Transcripts with a single coding exon but a 3'UTR intron remain in Rules 1/3 +# because that intron deposits a downstream EJC. Reclassified 3,113 RefSeq +# curated transcripts (95.7% of the single-CDS-exon set with UTR introns) and +# 10,790 Gencode V49 transcripts into Rule 2. +# Post-fix rule counts (collapsed regions): +# RefSeq Curated: R1=54,015 R2=2,942 R3=49,443 R4=6,503 (total 112,903) +# Gencode V49: R1=134,464 R2=6,599 R3=85,319 R4=7,547 (total 233,929) cd /hive/data/genomes/hg38/bed/nmd/gencode/ # run the script on gencode bigGenePred - produces decorator + collapsed BED files ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred \ /hive/data/genomes/hg38/bed/gencodeV49/build/hg38.gencodeV49.bb \ knownGeneNmdDeco.bed nmdEscRegions.bed # build decorator bigBed bedSort knownGeneNmdDeco.bed knownGeneNmdDeco.bed bedToBigBed knownGeneNmdDeco.bed ../../../chrom.sizes knownGeneNmdDeco.bb \ -tab -type=bed12+5 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscDecoration.as # build collapsed bigBed bedSort nmdEscRegions.bed nmdEscRegions.bed bedToBigBed nmdEscRegions.bed ../../../chrom.sizes nmdEscRegions.bb \ -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as # symlinks to /gbdb ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/nmdEscRegions.bb /gbdb/hg38/nmd/nmdEscRegions.bb ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/knownGeneNmdDeco.bb /gbdb/hg38/nmd/knownGeneNmdDeco.bb ####################################################################### # NMD escape regions from NCBI RefSeq (2025-03-24 max) # # 2026-04-21 lrnassar: Switched from RefSeq all to RefSeq curated (NM_/NR_ only, # no XM_/XR_ predicted models) per Max's request on RM #33737. Prior RefSeq-all # outputs moved to refseqAll.bak/ within the same build dir. cd /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/ # run the script on ncbiRefSeq curated genePredExt # Note: the script writes nmdNcbiRefSeqDeco.bed (per-transcript decorator format) # alongside the collapsed output, but we intentionally do not convert it to bigBed # for RefSeq. The decorator workflow currently only ships for Gencode/knownGene # (via knownGeneNmdDeco.bb). ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f genePredExt \ /hive/data/genomes/hg38/bed/ncbiRefSeq.p14.2025-08-13/archive/hg38.ncbiRefSeqCurated.txt.gz \ nmdNcbiRefSeqDeco.bed nmdEscNcbiRefSeq.bed # build collapsed bigBed bedSort nmdEscNcbiRefSeq.bed nmdEscNcbiRefSeq.bed bedToBigBed nmdEscNcbiRefSeq.bed ../../../chrom.sizes nmdEscNcbiRefSeq.bb \ -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as # symlink to gbdb ln -sf /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/nmdEscNcbiRefSeq.bb /gbdb/hg38/nmd/nmdEscNcbiRefSeq.bb ####################################################################### # Lindeboom et al. NMDetective scores (2025-03-23 max/Claude) # NMD efficiency predictions from Lindeboom et al. 2016, Nat Genet. # Four bedGraph custom track files downloaded to: # /hive/data/genomes/hg38/bed/nmd/lindeboom/ # Data downloaded from https://figshare.com/articles/dataset/NMDetective/7803398 # Custom track data in the session links from that page # - NMDetectiveA.ct - Random forest prediction of NMD efficiency # - NMDetectiveB.ct - Decision tree prediction of NMD efficiency # - nmdDectA-ptc.ct - Random forest, first out-of-frame PTC # - nmdDectB-ptc.ct - Decision tree, first out-of-frame PTC # Convert bedGraph custom tracks to bigWig and symlink from /gbdb: cd /hive/data/genomes/hg38/bed/nmd/lindeboom/ bash ~/kent/src/hg/makeDb/scripts/nmd/lindeboomToBigWig.sh