0151d00a4a1d73a78c35f6158c6c936ff338faeb max Fri Apr 24 10:37:34 2026 -0700
NMD Escape: MANE subtrack, Rule 1 bug fix, transcript filter. refs #33737

- Add nmdEscMane subtrack (MANE Select Plus Clinical 1.5), built from
  /gbdb/hg38/mane/mane.bb. Reuses nmdEscTranscripts.html.
- Fix Rule 1: measure 50 bp upstream of the transcript's last splice
  junction (including 3'UTR introns) rather than stripping 3'UTR from the
  exon list first. The old logic painted the entire last CDS exon as
  NMD-escape whenever the transcript had only one CDS exon, even when a
  3'UTR intron sat far past the stop codon (e.g. NBDY: 207 bp of CDS
  over-painted for a junction 2.6 kb past the stop).
- Add --rule1-mode {cds,mrna} (default cds): cds counts only CDS bp on the
  walk-back (paints up to 50 bp of CDS matching the rule label literally);
  mrna counts mRNA bp and clips to CDS (tracks the 55 bp rule literature).
  Documented in makeDoc.
- Rule 4: when a 3'UTR intron exists, the last CDS-containing exon has a
  downstream EJC and is now eligible for the long-exon rule.
- Mouseover lists contributing transcript accessions when 1-3 items
  collapse into a region; falls back to a count above that.
- Add filterText/filterType/filterLabel on all three escape subtracks so a
  user can narrow the display to one transcript.
- genePredNmdEsc: --gene-sym-field (default 17 for Gencode; pass 18 for
  MANE, whose HGNC symbol lives in bigGenePred geneName2).
- Add findShortTxLongUtrIntron.py helper for finding MANE transcripts with
  long UTR introns (used to pick NMD edge-case test cases).

Post-fix collapsed-region counts (--rule1-mode=cds):
  MANE 1.5:       67,752
  Gencode V49:    233,375
  RefSeq Curated: 112,356

diff --git src/hg/makeDb/doc/hg38/nmd.txt src/hg/makeDb/doc/hg38/nmd.txt
index b1a45d82daa..fafb069b827 100644
--- src/hg/makeDb/doc/hg38/nmd.txt
+++ src/hg/makeDb/doc/hg38/nmd.txt
@@ -63,30 +63,81 @@
 # alongside the collapsed output, but we intentionally do not convert it to bigBed
 # for RefSeq. The decorator workflow currently only ships for Gencode/knownGene
 # (via knownGeneNmdDeco.bb).
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f genePredExt \
     /hive/data/genomes/hg38/bed/ncbiRefSeq.p14.2025-08-13/archive/hg38.ncbiRefSeqCurated.txt.gz \
     nmdNcbiRefSeqDeco.bed nmdEscNcbiRefSeq.bed
 # build collapsed bigBed
 bedSort nmdEscNcbiRefSeq.bed nmdEscNcbiRefSeq.bed
 bedToBigBed nmdEscNcbiRefSeq.bed ../../../chrom.sizes nmdEscNcbiRefSeq.bb \
     -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
 # symlink to gbdb
 ln -sf /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/nmdEscNcbiRefSeq.bb /gbdb/hg38/nmd/nmdEscNcbiRefSeq.bb
+#######################################################################
+# NMD escape regions from MANE Select Plus Clinical (2026-04-24 max)
+# Same script, run on the MANE bigGenePred. MANE puts the HGNC symbol in
+# bigGenePred field 18 (Gencode uses field 17), so pass --gene-sym-field 18.
+
+mkdir -p /hive/data/genomes/hg38/bed/nmd/mane && cd /hive/data/genomes/hg38/bed/nmd/mane
+
+~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred --gene-sym-field 18 \
+    /gbdb/hg38/mane/mane.bb \
+    nmdManeDeco.bed nmdEscMane.bed
+
+bedSort nmdEscMane.bed nmdEscMane.bed
+bedToBigBed nmdEscMane.bed ../../../chrom.sizes nmdEscMane.bb \
+    -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
+
+ln -sf /hive/data/genomes/hg38/bed/nmd/mane/nmdEscMane.bb /gbdb/hg38/nmd/nmdEscMane.bb
+
+# Collapsed-region counts (current script, no Rule 1/4 algorithmic fix):
+#   MANE 1.5:       68,345
+#   Gencode V49:    233,929
+#   RefSeq Curated: 112,903
+
+# 2026-04-24 max: Fixed Rule 1 to measure 50 bp upstream of the last splice
+# junction of the transcript (including 3'UTR introns), not the last CDS-CDS
+# junction; output regions are clipped to CDS. The old logic stripped 3'UTR
+# from the exon list before computing the "last coding junction", which
+# over-painted the last CDS exon as NMD-escape whenever there was only one
+# CDS exon, even when a 3'UTR intron sat far downstream (e.g. NBDY: the
+# entire 207 bp CDS was painted Rule 1 despite the last junction being
+# 2.6 kb past the stop). Rule 4 updated in the same pass: when a 3'UTR
+# intron exists, the last CDS-containing exon has a downstream EJC and is
+# now eligible for Rule 4.
+#
+# The script supports two ways of counting the 50 bp walk-back from the
+# last junction (--rule1-mode):
+#   cds (default) - count only CDS nucleotides, skipping 3'UTR. A
+#                   transcript like NBDY (last junction 2.6 kb past the
+#                   stop, in 3'UTR) gets 50 bp of CDS painted, matching
+#                   the literal "last 50 bp" reading of the rule label.
+#   mrna          - count mRNA nucleotides including 3'UTR, then clip
+#                   output to CDS. NBDY-like transcripts get nothing
+#                   painted because the 50 mRNA-bp window stays inside
+#                   3'UTR. Tracks the 55 bp-rule literature, where the
+#                   ribosome-EJC distance is measured in mRNA bp.
+# We ship the 'cds' mode; the 'mrna' mode is retained for comparison.
+#
+# Post-fix collapsed-region counts (--rule1-mode=cds):
+#   MANE 1.5:       67,752
+#   Gencode V49:    233,375
+#   RefSeq Curated: 112,356
+
 #######################################################################
 # Lindeboom et al. NMDetective scores (2025-03-23 max/Claude)
 # NMD efficiency predictions from Lindeboom et al. 2016, Nat Genet.
 # Four bedGraph custom track files downloaded to:
 #   /hive/data/genomes/hg38/bed/nmd/lindeboom/
 # Data downloaded from https://figshare.com/articles/dataset/NMDetective/7803398
 # Custom track data in the session links from that page
 #   - NMDetectiveA.ct - Random forest prediction of NMD efficiency
 #   - NMDetectiveB.ct - Decision tree prediction of NMD efficiency
 #   - nmdDectA-ptc.ct - Random forest, first out-of-frame PTC
 #   - nmdDectB-ptc.ct - Decision tree, first out-of-frame PTC
 # Convert bedGraph custom tracks to bigWig and symlink from /gbdb:
 cd /hive/data/genomes/hg38/bed/nmd/lindeboom/
 bash ~/kent/src/hg/makeDb/scripts/nmd/lindeboomToBigWig.sh
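Reviewer note on the --rule1-mode description in the nmd.txt hunk: the difference between the cds and mrna walk-back can be illustrated with a small sketch. This is not the genePredNmdEsc code. The function name rule1_region, the use of mRNA (transcript) coordinates, and the 100 bp 5'UTR and 500 bp last exon in the NBDY-like example are all assumptions made for illustration. Only the 50 bp window, the 207 bp CDS and the junction roughly 2.6 kb past the stop come from the notes above.

```python
from typing import List, Optional, Tuple

WALK_BACK = 50  # bp measured upstream of the last splice junction


def rule1_region(exon_lens: List[int], cds_start: int, cds_end: int,
                 mode: str = "cds") -> Optional[Tuple[int, int]]:
    """Return the Rule 1 interval as half-open (start, end) mRNA offsets, or None.

    exon_lens: exon lengths in transcript order (hypothetical input format).
    cds_start, cds_end: half-open CDS interval in mRNA coordinates.
    """
    if len(exon_lens) < 2:
        return None  # no splice junction, so there is nothing to walk back from
    junction = sum(exon_lens[:-1])  # mRNA offset of the last exon-exon junction

    if mode == "cds":
        # Count only CDS bases: any 3'UTR between the stop and the junction
        # is skipped, so the walk-back starts at the end of the CDS.
        end = min(junction, cds_end)
        start = max(cds_start, end - WALK_BACK)
    elif mode == "mrna":
        # Count mRNA bases (3'UTR included), then clip the window to the CDS.
        start = max(cds_start, junction - WALK_BACK)
        end = min(cds_end, junction)
    else:
        raise ValueError(f"unknown mode: {mode}")

    return (start, end) if start < end else None


# NBDY-like layout: 100 bp 5'UTR (assumed), 207 bp CDS, junction 2600 bp past the stop.
nbdy_like = [100 + 207 + 2600, 500]
print(rule1_region(nbdy_like, 100, 307, "cds"))   # (257, 307)
print(rule1_region(nbdy_like, 100, 307, "mrna"))  # None
```

When the last junction lies inside the CDS, both modes return the same interval. They diverge only when a 3'UTR intron places the junction downstream of the stop codon, which is the case the commit message describes.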