3a62ea7e9a8cb3503586a0a78570331308c9bc58
max
Mon Apr 27 02:23:00 2026 -0700
NMD Escape MANE: expose NM_ accession via labelFields. refs #33737

Per QA, the MANE subtrack now shows the NCBI RefSeq accession by default
instead of the HGNC gene symbol, with the ENST and gene symbol still
selectable via labelFields.

- genePredNmdEsc: new --ncbi-id-field N option (default -1 = unused).
  When set, the named bigGenePred column is captured per-transcript and
  written into a new ncbiIds output column. For MANE pass 21.
- genePredNmdEsc: new --no-collapse option. By default, regions with
  identical (chrom, start, end, rule) from multiple transcripts collapse
  into one row with comma-separated lists. With --no-collapse the script
  emits one row per (transcript, region). Used for MANE so each
  label-field column holds a single value: the 74 MANE Plus Clinical
  genes (e.g. LMNA) get two rows per region instead of one row with a
  two-element list.
- nmdEscCollapsed.as: add lstring ncbiIds column. Schema is now bed9+3.
- nmd.ra (nmdEscMane only): labelFields ncbiIds,name,transcripts;
  defaultLabelFields ncbiIds; labelSeparator " / ". Gencode and RefSeq
  subtracks unchanged - they default to the gene symbol (name column)
  and have an empty ncbiIds column.
- doc/hg38/nmd.txt: bump all three bedToBigBed invocations to bed9+3 and
  document the --ncbi-id-field 21 + --no-collapse invocation for MANE.

Counts: MANE 68,028 (--no-collapse); Gencode 233,375; RefSeq 112,356.
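Reviewer's sketch (not the genePredNmdEsc source): the difference between the default collapse and --no-collapse, as described in the message above, could look roughly like this. The function name emit_rows, the tuple layout, and the placeholder coordinates and IDs (ENST_A, NM_A, ...) are hypothetical and chosen only for illustration.

```python
def emit_rows(regions, collapse=True):
    """Turn per-transcript regions into output rows.

    Each input region is a tuple (hypothetical layout):
        (chrom, start, end, rule, transcript, ncbi_id)
    """
    if not collapse:
        # --no-collapse: one row per (transcript, region), so every
        # label column holds exactly one value.
        return list(regions)

    # Default: merge regions sharing (chrom, start, end, rule) and join
    # the per-transcript values into comma-separated lists.
    merged = {}
    for chrom, start, end, rule, transcript, ncbi_id in regions:
        tx_list, id_list = merged.setdefault((chrom, start, end, rule), ([], []))
        tx_list.append(transcript)
        id_list.append(ncbi_id)
    return [
        key + (",".join(tx_list), ",".join(id_list))
        for key, (tx_list, id_list) in merged.items()
    ]


# Placeholder data: one region shared by two transcripts of the same gene,
# as happens for a MANE Plus Clinical gene.
example = [
    ("chr1", 100, 150, "rule1", "ENST_A", "NM_A"),
    ("chr1", 100, 150, "rule1", "ENST_B", "NM_B"),
]
collapsed_rows = emit_rows(example)
per_transcript_rows = emit_rows(example, collapse=False)
```

With the placeholder input, the collapsed form yields a single row whose last two columns are two-element lists, while the per-transcript form keeps two rows with one accession each, which is the shape labelFields needs for a clean single-value label.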
diff --git src/hg/makeDb/doc/hg38/nmd.txt src/hg/makeDb/doc/hg38/nmd.txt
index fafb069b827..691e42ca0f7 100644
--- src/hg/makeDb/doc/hg38/nmd.txt
+++ src/hg/makeDb/doc/hg38/nmd.txt
@@ -30,77 +30,85 @@ cd /hive/data/genomes/hg38/bed/nmd/gencode/
 # run the script on gencode bigGenePred - produces decorator + collapsed BED files
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred \
     /hive/data/genomes/hg38/bed/gencodeV49/build/hg38.gencodeV49.bb \
     knownGeneNmdDeco.bed nmdEscRegions.bed
 # build decorator bigBed
 bedSort knownGeneNmdDeco.bed knownGeneNmdDeco.bed
 bedToBigBed knownGeneNmdDeco.bed ../../../chrom.sizes knownGeneNmdDeco.bb \
     -tab -type=bed12+5 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscDecoration.as
 # build collapsed bigBed
 bedSort nmdEscRegions.bed nmdEscRegions.bed
 bedToBigBed nmdEscRegions.bed ../../../chrom.sizes nmdEscRegions.bb \
-    -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
+    -tab -type=bed9+3 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
 # symlinks to /gbdb
 ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/nmdEscRegions.bb /gbdb/hg38/nmd/nmdEscRegions.bb
 ln -sf /hive/data/genomes/hg38/bed/nmd/gencode/knownGeneNmdDeco.bb /gbdb/hg38/nmd/knownGeneNmdDeco.bb
 #######################################################################
 # NMD escape regions from NCBI RefSeq (2025-03-24 max)
 #
 # 2026-04-21 lrnassar: Switched from RefSeq all to RefSeq curated (NM_/NR_ only,
 # no XM_/XR_ predicted models) per Max's request on RM #33737. Prior RefSeq-all
 # outputs moved to refseqAll.bak/ within the same build dir.
 cd /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/
 # run the script on ncbiRefSeq curated genePredExt
 # Note: the script writes nmdNcbiRefSeqDeco.bed (per-transcript decorator format)
 # alongside the collapsed output, but we intentionally do not convert it to bigBed
 # for RefSeq. The decorator workflow currently only ships for Gencode/knownGene
 # (via knownGeneNmdDeco.bb).
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f genePredExt \
     /hive/data/genomes/hg38/bed/ncbiRefSeq.p14.2025-08-13/archive/hg38.ncbiRefSeqCurated.txt.gz \
     nmdNcbiRefSeqDeco.bed nmdEscNcbiRefSeq.bed
 # build collapsed bigBed
 bedSort nmdEscNcbiRefSeq.bed nmdEscNcbiRefSeq.bed
 bedToBigBed nmdEscNcbiRefSeq.bed ../../../chrom.sizes nmdEscNcbiRefSeq.bb \
-    -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
+    -tab -type=bed9+3 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
 # symlink to gbdb
 ln -sf /hive/data/genomes/hg38/bed/nmd/ncbiRefSeq/nmdEscNcbiRefSeq.bb /gbdb/hg38/nmd/nmdEscNcbiRefSeq.bb
 #######################################################################
 # NMD escape regions from MANE Select Plus Clinical (2026-04-24 max)
 # Same script, run on the MANE bigGenePred. MANE puts the HGNC symbol in
 # bigGenePred field 18 (Gencode uses field 17), so pass --gene-sym-field 18.
 mkdir -p /hive/data/genomes/hg38/bed/nmd/mane && cd /hive/data/genomes/hg38/bed/nmd/mane
+# --ncbi-id-field 21 puts the NCBI RefSeq accession (NM_/NR_) into the
+# collapsed bigBed's ncbiIds column so the trackDb stanza can offer NM_ as
+# the default label via labelFields/defaultLabelFields.
+# --no-collapse emits one row per (transcript, region). MANE Select gives
+# one transcript per gene; MANE Plus Clinical adds a second transcript for
+# 74 genes (e.g. LMNA). Keeping rows per-transcript means each label-field
+# column holds a single value, which renders cleaner than a comma list.
 ~/kent/src/hg/makeDb/scripts/nmd/genePredNmdEsc -f bigGenePred --gene-sym-field 18 \
+    --ncbi-id-field 21 --no-collapse \
     /gbdb/hg38/mane/mane.bb \
     nmdManeDeco.bed nmdEscMane.bed
 bedSort nmdEscMane.bed nmdEscMane.bed
 bedToBigBed nmdEscMane.bed ../../../chrom.sizes nmdEscMane.bb \
-    -tab -type=bed9+2 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
+    -tab -type=bed9+3 -as=${HOME}/kent/src/hg/makeDb/scripts/nmd/nmdEscCollapsed.as
 ln -sf /hive/data/genomes/hg38/bed/nmd/mane/nmdEscMane.bb /gbdb/hg38/nmd/nmdEscMane.bb
 # Collapsed-region counts (current script, no Rule 1/4 algorithmic fix):
 #   MANE 1.5:       68,345
 #   Gencode V49:    233,929
 #   RefSeq Curated: 112,903
 # 2026-04-24 max: Fixed Rule 1 to measure 50 bp upstream of the last splice
 # junction of the transcript (including 3'UTR introns), not the last CDS-CDS
 # junction; output regions are clipped to CDS. The old logic stripped 3'UTR
 # from the exon list before computing the "last coding junction", which
 # over-painted the last CDS exon as NMD-escape whenever there was only one
 # CDS exon, even when a 3'UTR intron sat far downstream (e.g. NBDY: the
 # entire 207 bp CDS was painted Rule 1 despite the last junction being
@@ -110,31 +118,31 @@
 #
 # The script supports two ways of counting the 50 bp walk-back from the
 # last junction (--rule1-mode):
 #   cds (default) - count only CDS nucleotides, skipping 3'UTR. A
 #                   transcript like NBDY (last junction 2.6 kb past the
 #                   stop, in 3'UTR) gets 50 bp of CDS painted, matching
 #                   the literal "last 50 bp" reading of the rule label.
 #   mrna          - count mRNA nucleotides including 3'UTR, then clip
 #                   output to CDS. NBDY-like transcripts get nothing
 #                   painted because the 50 mRNA-bp window stays inside
 #                   3'UTR. Tracks the 55 bp-rule literature, where the
 #                   ribosome-EJC distance is measured in mRNA bp.
 # We ship the 'cds' mode; the 'mrna' mode is retained for comparison.
 #
 # Post-fix collapsed-region counts (--rule1-mode=cds):
-#   MANE 1.5:       67,752
+#   MANE 1.5:       68,028 (--no-collapse: one row per transcript)
 #   Gencode V49:    233,375
 #   RefSeq Curated: 112,356
 #######################################################################
 # Lindeboom et al. NMDetective scores (2025-03-23 max/Claude)
 # NMD efficiency predictions from Lindeboom et al. 2016, Nat Genet.
 # Four bedGraph custom track files downloaded to:
 #   /hive/data/genomes/hg38/bed/nmd/lindeboom/
 # Data downloaded from https://figshare.com/articles/dataset/NMDetective/7803398
 # Custom track data in the session links from that page
 #   - NMDetectiveA.ct - Random forest prediction of NMD efficiency
 #   - NMDetectiveB.ct - Decision tree prediction of NMD efficiency
 #   - nmdDectA-ptc.ct - Random forest, first out-of-frame PTC
 #   - nmdDectB-ptc.ct - Decision tree, first out-of-frame PTC