4bd316f5f1ca47328bd3f9a181214b788055f0bc lrnassar Tue Apr 21 13:29:26 2026 -0700
NMD Escape QA round 3: switch RefSeq to curated, fix Rule 2 misclassification. refs #33737

Switched the NMD Escape RefSeq subtrack input from hg38.ncbiRefSeq.txt.gz (all) to hg38.ncbiRefSeqCurated.txt.gz (NM_/NR_ only, no XM_/XR_ predicted models) per Max's feedback. longLabel updated to "NCBI RefSeq Curated transcripts".

Fixed Rule 2 in genePredNmdEsc to test rec["exonCount"]==1 instead of len(cdsExons)==1. The old test misclassified multi-exon transcripts with a single CDS exon (UTR introns) as "intronless" and silently suppressed their Rule 1/3/4 assignments via the if/else short-circuit. 3,253 RefSeq curated and ~2,000 Gencode transcripts reassigned from Rule 2 to Rules 1/3. Rebuilt both tracks.

Added Rule 1 caveat to nmdEscTranscripts.html for transcripts with a penultimate coding exon shorter than 50 bp. Added reciprocal relatedTracks.ra entries for nmd <-> mane and nmd <-> ncbiRefSeq.

QA cleanups: non-ASCII prime char replaced with &prime;, mailing list links given target="_blank" across all three HTML pages, dead commented nmdGencode block removed from nmd.ra, AutoSQL field comments updated to cover Rule 4 color and the gene-symbol-to-transcript-ID fallback. Makedoc updated with the full Gencode + RefSeq pipeline and /gbdb symlinks.
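The Rule 2 fix described above can be illustrated with a minimal sketch. This is not the code from genePredNmdEsc: the helper names (`coding_exons`, `intronless_old`, `intronless_fixed`) and the example records are hypothetical, and only the two tested expressions, `len(cdsExons)==1` and `rec["exonCount"]==1`, come from the commit message. The record fields are assumed to follow genePred naming (exonStarts, exonEnds, cdsStart, cdsEnd).

```python
# Hypothetical sketch of the two "intronless" tests; not the actual script code.

def coding_exons(rec):
    """Return the (start, end) part of each exon that overlaps the CDS."""
    pieces = []
    for start, end in zip(rec["exonStarts"], rec["exonEnds"]):
        s = max(start, rec["cdsStart"])
        e = min(end, rec["cdsEnd"])
        if s < e:  # keep only exons that actually contain coding sequence
            pieces.append((s, e))
    return pieces

def intronless_old(rec):
    # Old test: a single CDS exon was treated as "intronless".
    return len(coding_exons(rec)) == 1

def intronless_fixed(rec):
    # Fixed test: the transcript must have exactly one exon in total.
    return rec["exonCount"] == 1

# Made-up transcript: three exons, CDS entirely inside the middle exon,
# so both introns lie in the UTRs.
utr_intron_tx = {
    "exonCount": 3,
    "exonStarts": [100, 500, 900],
    "exonEnds": [200, 700, 1000],
    "cdsStart": 550,
    "cdsEnd": 650,
}

# Made-up transcript that really is a single exon.
single_exon_tx = {
    "exonCount": 1,
    "exonStarts": [100],
    "exonEnds": [400],
    "cdsStart": 150,
    "cdsEnd": 350,
}

for name, tx in [("utr_intron_tx", utr_intron_tx), ("single_exon_tx", single_exon_tx)]:
    print(name, "old:", intronless_old(tx), "fixed:", intronless_fixed(tx))
```

Under these assumptions the two tests agree on the single-exon transcript and differ only on the transcript whose introns fall in the UTRs, which is the case the commit reassigns away from Rule 2.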
diff --git src/hg/makeDb/doc/hg38/nmd.txt src/hg/makeDb/doc/hg38/nmd.txt
index 5a013417d73..cf65f672985 100644
--- src/hg/makeDb/doc/hg38/nmd.txt
+++ src/hg/makeDb/doc/hg38/nmd.txt
@@ -1,64 +1,78 @@
 #######################################################################
 # NMD escape regions from Gencode (2025-03-24 max/Claude)
 # Two outputs: decorator bigBed (per-transcript) and collapsed bigBed (merged by coordinates)
 # Collapsed version uses gene symbols from input, colors by rule, transcript lists
 # Script accepts -f bigGenePred (gencode .bb) or -f genePredExt (ncbiRefSeq .txt.gz)
 #
 # 2026-04-20 lrnassar: Added Rule 4 (long-exon rule, Lindeboom 2016) - coding
 # exons >400 nt excluding the last coding exon. Rebuilt Gencode + RefSeq.
+#
+# 2026-04-21 lrnassar: Fixed Rule 2 to test rec["exonCount"]==1 instead of
+# len(cdsExons)==1. The old test misclassified multi-exon transcripts with a
+# single CDS exon (UTR introns) as "intronless", AND silently suppressed their
+# Rule 1/3/4 assignments via the if/else short-circuit. ~3,253 RefSeq curated
+# transcripts and ~2,000 Gencode transcripts reassigned. Rebuilt both tracks.
 cd /hive/data/genomes/hg38/bed/nmd/gencode/
 # run the script on gencode bigGenePred - produces decorator + collapsed BED files
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred \
     /hive/data/genomes/hg38/bed/gencodeV49/build/hg38.gencodeV49.bb \
     knownGeneNmdDeco.bed nmdEscRegions.bed
 # build decorator bigBed
 bedSort knownGeneNmdDeco.bed knownGeneNmdDeco.bed
 bedToBigBed knownGeneNmdDeco.bed ../../../chrom.sizes knownGeneNmdDeco.bb \
     -tab -type=bed12+5 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscDecoration.as
 # build collapsed bigBed
 bedSort nmdEscRegions.bed nmdEscRegions.bed
 bedToBigBed nmdEscRegions.bed ../../../chrom.sizes nmdEscRegions.bb \
     -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
+# symlinks to /gbdb
+ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/nmdEscRegions.bb /gbdb/hg38/nmd/nmdEscRegions.bb
+ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/knownGeneNmdDeco.bb /gbdb/hg38/nmd/knownGeneNmdDeco.bb
+
 #######################################################################
 # NMD escape regions from NCBI RefSeq (2025-03-24 max)
+#
+# 2026-04-21 lrnassar: Switched from RefSeq all to RefSeq curated (NM_/NR_ only,
+# no XM_/XR_ predicted models) per Max's request on RM #33737. Prior RefSeq-all
+# outputs moved to refseqAll.bak/ within the same build dir.
 cd /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/
-# run the script on ncbiRefSeq genePredExt
-# Using all of RefSeq, not just refseq curated - good idea?
-# This is the file for RefSeq curated: /hive/data/genomes/hg38/bed/ncbiRefSeq.p14.2025-08-13/archive/hg38.ncbiRefSeqCurated.txt.gz
+# run the script on ncbiRefSeq curated genePredExt
+# Note: the script writes nmdNcbiRefSeqDeco.bed (per-transcript decorator format)
+# alongside the collapsed output, but we intentionally do not convert it to bigBed
+# for RefSeq. The decorator workflow currently only ships for Gencode/knownGene
+# (via knownGeneNmdDeco.bb).
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f genePredExt \
-    /hive/data/genomes/hg38/bed/ncbiRefSeq.p14.2025-08-13/archive/hg38.ncbiRefSeq.txt.gz \
+    /hive/data/genomes/hg38/bed/ncbiRefSeq.p14.2025-08-13/archive/hg38.ncbiRefSeqCurated.txt.gz \
     nmdNcbiRefSeqDeco.bed nmdEscNcbiRefSeq.bed
-# not building decorator file - needed? Useful?
-
 # build collapsed bigBed
 bedSort nmdEscNcbiRefSeq.bed nmdEscNcbiRefSeq.bed
 bedToBigBed nmdEscNcbiRefSeq.bed ../../../chrom.sizes nmdEscNcbiRefSeq.bb \
     -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
 # symlink to gbdb
 ln -sf /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/nmdEscNcbiRefSeq.bb /gbdb/hg38/nmd/nmdEscNcbiRefSeq.bb
 #######################################################################
 # Lindeboom et al. NMDetective scores (2025-03-23 max/Claude)
 # NMD efficiency predictions from Lindeboom et al. 2016, Nat Genet.
 # Four bedGraph custom track files downloaded to:
 #   /hive/data/genomes/hg38/bed/nmd/lindeboom/
 # Data downloaded from https://figshare.com/articles/dataset/NMDetective/7803398
 # Custom track data in the session links from that page
 #   - NMDetectiveA.ct - Random forest prediction of NMD efficiency
 #   - NMDetectiveB.ct - Decision tree prediction of NMD efficiency
 #   - nmdDectA-ptc.ct - Random forest, first out-of-frame PTC
 #   - nmdDectB-ptc.ct - Decision tree, first out-of-frame PTC
 # Convert bedGraph custom tracks to bigWig and symlink from /gbdb:
 cd /hive/data/genomes/hg38/bed/nmd/lindeboom/
 bash ~/kent/src/hg/makeDb/scripts/nmd/lindeboomToBigWig.sh