af9a5b388259e680dd34bc47b2cad4ff6e3d162f lrnassar Sat Jun 13 03:00:51 2026 -0700

varFreqs: pre-release polish from comprehensive sanity check.

* Sync the new combined-track shortLabels into the four description pages:
  "Affected/Case Individuals" -> "Disease cohorts" and
  "Population + Unaffected" -> "Population reference" (matches the trackdb
  shortLabels users now see).
* Add a paragraph in the supertrack Methods section describing the pooled
  affectedAF / backgroundAF formulation (sum AC / sum AN) and the default_an
  configuration that handles AF-only cohorts.
* Update the in-track Methods paragraphs on varFreqsAffected.html and
  varFreqsBackground.html: replace "summed/maximized" with "pooled".
* Fix supertrack table downloadability column to match the underscore-prefix
  convention: allofus "Yes" -> "No" (description page already says license
  restricted); gregor "No" -> "Yes" (description page says VCF is on our
  download server, and the gbdb path is not underscore-prefixed).
* Add a 2026-06-12 makedoc section documenting the pooled-AF rebuild, the
  default_an mechanism, the new affectedAN/backgroundAN columns, the
  before/after spot-check at APOE rs429358, and the build commands.

refs #36642

diff --git src/hg/makeDb/doc/hg38/varFreqs.txt src/hg/makeDb/doc/hg38/varFreqs.txt
index bd9095d7f48..52c925634a8 100644
--- src/hg/makeDb/doc/hg38/varFreqs.txt
+++ src/hg/makeDb/doc/hg38/varFreqs.txt
@@ -1030,15 +1030,60 @@
 # varFreqsAffected and varFreqsBackground filter blocks in
 # trackDb/human/varFreqs.ra come from the paste-and-go auto fragment
 # varFreqs.trackDb.ra produced by this run (both stanzas use the same fragment;
 # only the mouseOver differs).
 
 # varFreqsArray (TPMI + MexBB + UKBB; ~14.7 M variants, 750 MB):
 bash $SCR/mergeAndAnnotate.sh --databases $SCR/databases_array.tsv --tag .array
 python3 $SCR/vcfToBigBed.py \
     --databases-file $SCR/databases_array.tsv \
     --annotated-vcf /hive/data/genomes/hg38/bed/varFreqs/all/merged.array.annotated.vcf.gz \
     --output-prefix varFreqsArray \
     --threads 6 \
     --work-dir /hive/data/genomes/hg38/bed/varFreqs/array
 ln -sfn /hive/data/genomes/hg38/bed/varFreqs/array/varFreqsArray.bb \
     /gbdb/hg38/varFreqs/_array/varFreqsArray.bb
+
+##########
+# 2026-06-12 Lou (Claude)
+# Switched the varFreqsAffected and varFreqsBackground combined tracks to a
+# pooled-AF formulation: affectedAF = sum(AC)/sum(AN) across the contributing
+# affected arms (and backgroundAF the same for the background arms), replacing
+# the prior max-across-cohorts AF. The change fixes a known artifact where a
+# small per-population subgroup with AC=AN=1 (e.g. AllOfUs Oceanian at the
+# APOE ε4 site) drove backgroundAF to 1.0 even though the genuine pooled rate
+# is ~0.12.
+#
+# vcfToBigBed.py now infers per-arm AN as round(AC/AF) when both are reported.
+# databases.tsv grew an 8th column "default_an": for cohorts that publish only
+# AF (ABraOM, ALFA), set this to the diploid cohort size so AC can be
+# synthesized (AC = round(AF * default_an)) and default_an feeds the pool
+# denominator. ABraOM=2342, ALFA=816000. Cohorts that publish only AC with no
+# default_an (MGRB, GREGoR per-arm AC_AFFECTED/UNAFFECTED/UNKNOWN, AllOfUs
+# per-population) appear in affectedCohorts/backgroundSources but contribute
+# 0 to the pool, preserving the invariant that pooled AF cannot exceed 1.
+#
+# Two new bigBed columns: affectedAN and backgroundAN (pool denominators).
+# mouseOver was updated to show "Affected AC/AN: 33238 / 213150" so the ratio
+# is visible. The combined-track filter UI exposes affectedAN and backgroundAN
+# as numeric range filters. AS schema field count: 161 -> 163.
+#
+# Rebuild used the existing merged.annotated.vcf.gz (no re-merge needed);
+# Phase 1 re-extracted per-cohort AC/AF, Phase 2 wrote the new BEDs, then
+# concat + bedToBigBed. Total wall-clock ~8 hours.
+cd /hive/data/genomes/hg38/bed/varFreqs/all
+# Back up first (the script overwrites):
+cp varFreqsAffected.bb varFreqsAffected.bb.preAfPool.bak
+cp varFreqsBackground.bb varFreqsBackground.bb.preAfPool.bak
+# Build:
+python3 ~/kent/src/hg/makeDb/scripts/varFreqs/vcfToBigBed.py \
+    --annotated-vcf merged.annotated.vcf.gz \
+    --output-prefix varFreqs \
+    --split-affected \
+    --threads 8 \
+    --work-dir /hive/data/genomes/hg38/bed/varFreqs/all
+# Spot-check (APOE rs429358, chr19:44908683-44908684 T>C):
+#   pre  : affectedAC=33238 affectedAF=0.181730 (max, GA4K-dominated)
+#          backgroundAF=1.000000 (max, AllOfUs_OCE artifact)
+#   post : affectedAC=32782 affectedAN=213150 affectedAF=0.153798
+#          backgroundAC=341065 backgroundAN=2751112
+#          backgroundAF=0.123974
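
The pooling rule described in the makedoc section can be illustrated with a minimal sketch. This is not the code in vcfToBigBed.py: the `Arm` class and the `arm_counts` and `pooled_af` functions are hypothetical names invented here, and only the three rules stated in the commit are modeled (AN inferred as round(AC/AF), AC synthesized from `default_an` for AF-only cohorts, and a zero contribution from AC-only cohorts). Treating an AC-only arm that does have a `default_an` as a zero contribution is an assumption of this sketch, since the commit does not cover that case.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Arm:
    """One cohort arm at one variant site (hypothetical structure)."""
    name: str
    ac: Optional[int] = None          # allele count, if the cohort publishes it
    af: Optional[float] = None        # allele frequency, if published
    default_an: Optional[int] = None  # diploid cohort size for AF-only cohorts


def arm_counts(arm: Arm) -> Tuple[int, int]:
    """Return the (AC, AN) pair this arm contributes to the pool."""
    if arm.ac is not None and arm.af:
        # AC and a nonzero AF are both reported: infer AN as round(AC / AF).
        return arm.ac, round(arm.ac / arm.af)
    if arm.af is not None and arm.default_an:
        # AF-only cohort: synthesize AC from the configured cohort size.
        return round(arm.af * arm.default_an), arm.default_an
    # AC-only arm (assumed to contribute nothing even if default_an is set),
    # or nothing usable: the arm adds 0 to numerator and denominator.
    return 0, 0


def pooled_af(arms: Sequence[Arm]) -> Tuple[int, int, float]:
    """Pool arms as sum(AC) / sum(AN); returns (AC, AN, AF)."""
    counts = [arm_counts(a) for a in arms]
    ac = sum(c[0] for c in counts)
    an = sum(c[1] for c in counts)
    return ac, an, (ac / an if an else 0.0)


if __name__ == "__main__":
    # Made-up numbers: a large cohort plus a tiny subgroup with AC=AN=1.
    arms = [
        Arm("largeCohort", ac=120, af=0.12),
        Arm("tinySubgroup", ac=1, af=1.0),
        Arm("afOnlyCohort", af=0.10, default_an=2342),
        Arm("acOnlyCohort", ac=5),
    ]
    print("max AF   :", max(a.af for a in arms if a.af is not None))
    print("pooled AF:", pooled_af(arms))
```

With these made-up inputs the max-across-cohorts value is driven to 1.0 by the tiny subgroup, while the pooled value stays close to the large cohorts' rate. This is the artifact the commit describes at the APOE site.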