f9a89b0e1ce3c937b4fbb879736c1619c35c271f lrnassar Tue Apr 21 12:11:02 2026 -0700

QA fixes for PromoterAI track. refs #37278

Description page: replaced the wrong reference (Gao et al. 2023, the PrimateAI-3D paper) with the actual PromoterAI citation (Jaganathan et al. Science 2025, PMID 40440429), corrected the score-direction wording (negative = under-expression, positive = over-expression, not "tolerated vs disruptive"), fixed the Data Access source link (Illumina BaseSpace, not the GitHub repo), and corrected the mouseover blurb to match mouseOverFunction noAverage behavior.

Converter and AS: the overlap bigBed now carries the real per-transcript strand from the source TSV (was hardcoded '+'), with a new strands column in the AS, and the name field concatenates unique gene symbols so bidirectional-promoter items read as "HES4,ISG15" etc. BED score is now |PromoterAI|*1000 so scoreFilter is meaningful. Rewrote the converter to stream (sorted input), which drops peak memory from ~40 GB to a few MB.

trackDb: added filterLabel/filterLimits on scoreDiff (the filter was unusable without labels), scoreFilter + scoreLabel, alwaysZero and autoScale off on the bigWig subtracks, color 200,0,0 / altColor 0,0,200 so signed bigWig bars draw red (over-expression) above zero and blue (under-expression) below, matching the overlap track itemRgb. Added maxWindowToDraw and maxItems on the overlap subtrack.

Makedoc updated to describe the streaming pipeline, the new strands column, and the rebuild workflow.
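The overlap-item rules in the commit message (unique gene symbols joined into the name, a consensus strand, BED score = |PromoterAI|*1000, red/blue itemRgb by sign) can be illustrated with a minimal sketch. This is not the code in promoterAiToBigWig.py: the function name `summarize_transcripts`, the first-seen ordering of gene symbols, the clamp to 1000, and the treatment of a zero score as blue are all assumptions made for illustration.

```python
def summarize_transcripts(rows):
    """Collapse per-transcript rows for one variant into overlap-item fields.

    rows: list of (gene, strand, score) tuples, where strand is 1 or -1
    as in the source TSV. Returns (name, strand, best, bed_score, item_rgb).
    """
    if not rows:
        raise ValueError("need at least one transcript row")

    # Unique gene symbols; first-seen order is an assumption of this sketch.
    genes = list(dict.fromkeys(gene for gene, _, _ in rows))

    # Consensus strand, or '.' when transcripts disagree (bidirectional promoter).
    strands = {"+" if strand == 1 else "-" for _, strand, _ in rows}
    consensus = strands.pop() if len(strands) == 1 else "."

    # Score with the largest magnitude; ties keep the first row seen.
    best = max((score for _, _, score in rows), key=abs)

    # BED score is |score| * 1000; the clamp to the BED maximum is an assumption.
    bed_score = min(1000, round(abs(best) * 1000))

    # Red for over-expression, blue for under-expression (zero is arbitrary here).
    item_rgb = "200,0,0" if best > 0 else "0,0,200"

    return ",".join(genes), consensus, best, bed_score, item_rgb
```

With rows for HES4 (strand -1) and ISG15 (strand 1) at the same variant, this returns the name "HES4,ISG15" and strand '.', which is the bidirectional-promoter case the commit message describes.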
diff --git src/hg/makeDb/doc/hg38/promoterAi.txt src/hg/makeDb/doc/hg38/promoterAi.txt
index 33de0152e1f..2f608686fb0 100644
--- src/hg/makeDb/doc/hg38/promoterAi.txt
+++ src/hg/makeDb/doc/hg38/promoterAi.txt
@@ -1,32 +1,47 @@
 # PromoterAI, Claude max, Mar 20 2026
+# Updated Apr 21 2026 (RM #37278 QA): streaming converter, transcript strand
+# carried through, per-transcript gene aggregation, AS gains a strands field,
+# bigBed score field now stores |PromoterAI|*1000 (impact magnitude).
 # Source: promoterAI_tss500.tsv.gz from https://primateai3d.basespace.illumina.com/
+# (license-gated download; linked from https://github.com/Illumina/PromoterAI)
 # 262M rows, 118.6M unique variants, 39.5M unique positions, scores within 500bp of TSS
+# Input fields (1-based): chrom, pos, ref, alt, gene, gene_id, transcript_id,
+# strand (1 or -1), tss_pos, promoterAI
 
 cd /hive/data/genomes/hg38/bed/promoterai
 
 # download promoterAI_tss500.tsv.gz from Illumina BaseSpace (requires registration)
 
-# convert to 4 bedGraph files (one per alt allele) + overlap BED
-# picks max absolute score when transcripts overlap; overlap BED has all per-transcript scores
+# convert to 4 bedGraph files (one per alt allele) + overlap BED.
+# Streaming: reads input row-by-row assuming input is sorted by (chrom, pos),
+# so memory use is proportional to the number of transcripts at a single
+# position, not the whole file. Safe on a 4 GB node.
+# Picks max absolute score when transcripts overlap; overlap BED has all
+# per-transcript scores + strands, tagged with the consensus strand (or '.'
+# when transcripts disagree on strand, i.e. bidirectional promoters).
 python3 ~/kent/src/hg/makeDb/scripts/promoterAiToBigWig.py
 
 # sort bedGraphs and convert to bigWig
 for alt in A C G T; do
     sort -k1,1 -k2,2n promoterAi_${alt}.bedGraph > promoterAi_${alt}.sorted.bedGraph
     bedGraphToBigWig promoterAi_${alt}.sorted.bedGraph /hive/data/genomes/hg38/chrom.sizes promoterAi_${alt}.bw
     rm promoterAi_${alt}.bedGraph promoterAi_${alt}.sorted.bedGraph
 done
 
-# sort overlap BED and convert to bigBed
-sort -k1,1 -k2,2n promoterAi_overlaps.bed > promoterAi_overlaps.sorted.bed
+# sort overlap BED and convert to bigBed (bed9+6 -- see promoterAiOverlaps.as)
+sort -S 2G -k1,1 -k2,2n promoterAi_overlaps.bed > promoterAi_overlaps.sorted.bed
 bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/promoterAiOverlaps.as -tab \
     promoterAi_overlaps.sorted.bed /hive/data/genomes/hg38/chrom.sizes promoterAi_overlaps.bb
 rm promoterAi_overlaps.bed promoterAi_overlaps.sorted.bed
 
 # symlinks
 mkdir -p /gbdb/hg38/promoterAi
 ln -s /hive/data/genomes/hg38/bed/promoterai/promoterAi_A.bw /gbdb/hg38/promoterAi/a.bw
 ln -s /hive/data/genomes/hg38/bed/promoterai/promoterAi_C.bw /gbdb/hg38/promoterAi/c.bw
 ln -s /hive/data/genomes/hg38/bed/promoterai/promoterAi_G.bw /gbdb/hg38/promoterAi/g.bw
 ln -s /hive/data/genomes/hg38/bed/promoterai/promoterAi_T.bw /gbdb/hg38/promoterAi/t.bw
 ln -s /hive/data/genomes/hg38/bed/promoterai/promoterAi_overlaps.bb /gbdb/hg38/promoterAi/overlaps.bb
+
+# Rebuild notes (Apr 21 2026): only the overlap bigBed needed regenerating
+# because the bigWig best-score logic is unchanged. The existing bigWigs were
+# left in place; only promoterAi_overlaps.bb was swapped (old kept as .bak).
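
The streaming approach described in the makedoc comments (input sorted by chrom and pos, so only one position's transcripts are held in memory at a time) might look roughly like the sketch below. It is not the actual promoterAiToBigWig.py: the function name `stream_best_scores` is hypothetical, `pos` is assumed to be a 1-based coordinate, and any row whose second field is not an integer (for example a header line) is simply skipped. The column order follows the input fields listed in the makedoc.

```python
from itertools import groupby


def stream_best_scores(lines):
    """Yield (chrom, start, end, alt, best_score) per position and alt allele.

    lines: iterable of tab-separated rows sorted by (chrom, pos) with columns
    chrom, pos, ref, alt, gene, gene_id, transcript_id, strand, tss_pos, score.
    Only the rows of one (chrom, pos) group are held in memory at a time.
    """
    fields = (line.rstrip("\n").split("\t") for line in lines)
    # Skip short rows and anything without an integer position (e.g. a header).
    rows = (f for f in fields if len(f) >= 10 and f[1].isdigit())

    # groupby only merges adjacent rows, which is why sorted input is required.
    for (chrom, pos), group in groupby(rows, key=lambda f: (f[0], f[1])):
        best = {}
        for f in group:
            alt, score = f[3], float(f[9])
            # Keep the score with the largest magnitude across transcripts.
            if alt not in best or abs(score) > abs(best[alt]):
                best[alt] = score
        start = int(pos) - 1  # bedGraph is 0-based, half-open (pos assumed 1-based)
        for alt in sorted(best):
            yield chrom, start, start + 1, alt, best[alt]
```

For the real file this would be driven by something like `gzip.open("promoterAI_tss500.tsv.gz", "rt")` passed in as `lines`, with each yielded tuple written to the bedGraph for its alt allele. If the input were not sorted, the same position would show up in more than one group and produce duplicate bedGraph rows, so a production converter would probably want an explicit sortedness check.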