34d2eee845f5f45e571d1e153c632683b8a93f75 lrnassar Tue Apr 21 16:17:53 2026 -0700

Refine NMD Escape Rule 2 gate to "single coding exon and no 3'UTR intron". refs #33737

Previously Rule 2 required exonCount==1 (truly intronless). This overcorrected
for single-CDS-exon transcripts whose only introns are in the 5'UTR:
biologically these have no EJC downstream of the stop codon (5'UTR EJCs are
cleared by the scanning 40S or sit upstream of the terminating ribosome) and
are NMD-immune, but the code pushed them to Rules 1/3 under a less accurate
"last coding exon" label.

New gate: len(cdsExons) == 1 AND no exon-exon junction strictly downstream of
the stop codon (strand-aware). Transcripts with a single coding exon but a
3'UTR intron correctly stay in Rules 1/3 because that intron deposits an EJC
that can trigger NMD.

3,113 RefSeq Curated and 10,790 Gencode V49 transcripts move into Rule 2.
140 RefSeq and 1,135 Gencode single-CDS-exon transcripts with 3'UTR introns
correctly remain in Rules 1/3.

Description page and makedoc updated.

diff --git src/hg/makeDb/doc/hg38/nmd.txt src/hg/makeDb/doc/hg38/nmd.txt
index cf65f672985..b1a45d82daa 100644
--- src/hg/makeDb/doc/hg38/nmd.txt
+++ src/hg/makeDb/doc/hg38/nmd.txt
@@ -1,29 +1,43 @@
 #######################################################################
 # NMD escape regions from Gencode (2025-03-24 max/Claude)
 # Two outputs: decorator bigBed (per-transcript) and collapsed bigBed (merged by coordinates)
 # Collapsed version uses gene symbols from input, colors by rule, transcript lists
 # Script accepts -f bigGenePred (gencode .bb) or -f genePredExt (ncbiRefSeq .txt.gz)
 #
 # 2026-04-20 lrnassar: Added Rule 4 (long-exon rule, Lindeboom 2016) - coding
 # exons >400 nt excluding the last coding exon. Rebuilt Gencode + RefSeq.
 #
 # 2026-04-21 lrnassar: Fixed Rule 2 to test rec["exonCount"]==1 instead of
 # len(cdsExons)==1. The old test misclassified multi-exon transcripts with a
 # single CDS exon (UTR introns) as "intronless", AND silently suppressed their
 # Rule 1/3/4 assignments via the if/else short-circuit. ~3,253 RefSeq curated
 # transcripts and ~2,000 Gencode transcripts reassigned. Rebuilt both tracks.
+#
+# 2026-04-21 lrnassar: Refined Rule 2 gate to reflect the real NMD biology:
+# "single coding exon AND no 3'UTR intron" instead of "exonCount==1".
+# 5'UTR introns do not deposit EJCs downstream of the stop codon (their EJCs
+# are cleared by the scanning 40S or sit upstream of the stop codon and are
+# never encountered by the terminating ribosome), so transcripts with a single
+# coding exon and only 5'UTR introns are NMD-immune and belong in Rule 2.
+# Transcripts with a single coding exon but a 3'UTR intron remain in Rules 1/3
+# because that intron deposits a downstream EJC. Reclassified 3,113 RefSeq
+# curated transcripts (95.7% of the single-CDS-exon set with UTR introns) and
+# 10,790 Gencode V49 transcripts into Rule 2.
+# Post-fix rule counts (collapsed regions):
+# RefSeq Curated: R1=54,015 R2=2,942 R3=49,443 R4=6,503 (total 112,903)
+# Gencode V49: R1=134,464 R2=6,599 R3=85,319 R4=7,547 (total 233,929)
 
 cd /hive/data/genomes/hg38/bed/nmd/gencode/
 
 # run the script on gencode bigGenePred - produces decorator + collapsed BED files
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred \
 /hive/data/genomes/hg38/bed/gencodeV49/build/hg38.gencodeV49.bb \
 knownGeneNmdDeco.bed nmdEscRegions.bed
 
 # build decorator bigBed
 bedSort knownGeneNmdDeco.bed knownGeneNmdDeco.bed
 bedToBigBed knownGeneNmdDeco.bed ../../../chrom.sizes knownGeneNmdDeco.bb \
 -tab -type=bed12+5 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscDecoration.as
 
 # build collapsed bigBed
 bedSort nmdEscRegions.bed nmdEscRegions.bed
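
Illustration of the Rule 2 gate described in the commit message above. This is a minimal sketch, not the code in genePredNmdEsc: the function names (cds_exons, has_3utr_junction, is_rule2) are hypothetical, and it assumes genePred-style 0-based half-open coordinates in which the CDS interval includes the stop codon. Treating a junction that sits immediately after the last stop-codon base as "downstream" is also an assumption.

```python
def cds_exons(exons, cds_start, cds_end):
    """Return the exons that overlap the CDS, clipped to [cds_start, cds_end)."""
    return [(max(s, cds_start), min(e, cds_end))
            for s, e in exons
            if s < cds_end and e > cds_start]


def has_3utr_junction(exons, cds_start, cds_end, strand):
    """True if any exon-exon junction lies downstream of the stop codon.

    On '+' the stop codon ends at cds_end, so an intron is downstream when
    the exon before it ends at or after cds_end. On '-' the stop codon sits
    at the low-coordinate end of the CDS, so an intron is downstream when
    the exon after it (in genomic order) starts at or before cds_start.
    """
    ordered = sorted(exons)
    for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
        if strand == "+" and left_end >= cds_end:
            return True
        if strand == "-" and right_start <= cds_start:
            return True
    return False


def is_rule2(exons, cds_start, cds_end, strand):
    """Gate: exactly one coding exon AND no 3'UTR intron (strand-aware)."""
    if len(cds_exons(exons, cds_start, cds_end)) != 1:
        return False
    return not has_3utr_junction(exons, cds_start, cds_end, strand)


# Hypothetical transcripts, coordinates chosen only for illustration.
intronless = [(100, 500)]
utr5_intron_plus = [(100, 150), (300, 900)]   # intron upstream of the CDS on '+'
utr3_intron_plus = [(100, 500), (700, 900)]   # intron downstream of the CDS on '+'

print(is_rule2(intronless, 150, 450, "+"))        # single exon
print(is_rule2(utr5_intron_plus, 350, 800, "+"))  # 5'UTR intron only
print(is_rule2(utr3_intron_plus, 150, 450, "+"))  # 3'UTR intron present
print(is_rule2(utr5_intron_plus, 350, 800, "-"))  # same exons on '-': intron is now 3'UTR
```

Under the old exonCount==1 test only the first transcript would qualify; under the gate sketched here the '+' strand transcript with a 5'UTR intron also qualifies, while the two transcripts with an intron downstream of the stop codon do not.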