64a3f9e7813e823cf724ea188c3928a911578286 max Thu Jun 4 00:32:22 2026 -0700
varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease
track) with two matched tracks for visual case-vs-background comparison:

  varFreqsAffected   - variants seen in the affected/case arms of disease
                       cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA
                       cases, GREGoR affected, GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control
                       arms of disease cohorts ("all other variants");
                       ~1.5 million individuals

A variant seen in both groups appears in both tracks. Genotyping-array
cohorts stay out of both (varFreqsArray unchanged).

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it
reads phenotype tags (affected/unaffected/unknown) from populations.tsv and
is_disease/disease_role from databases.tsv, and derives the length-filter
ranges from the observed data. TOPMed reclassified as a population cohort.
SPARK WGS display name changed to SFARI SPARK WGS for consistency with the
standalone subtracks. Fixed the trackDb mouseOver $-substitution prefix
collision by wrapping fields in ${}. New description pages for both tracks.
refs #36642

diff --git src/hg/makeDb/doc/hg38/varFreqs.txt src/hg/makeDb/doc/hg38/varFreqs.txt
index 02f048e3d70..bd9095d7f48 100644
--- src/hg/makeDb/doc/hg38/varFreqs.txt
+++ src/hg/makeDb/doc/hg38/varFreqs.txt
@@ -958,25 +958,87 @@
 ln -sfn /hive/data/genomes/hg38/bed/varFreqs/all/varFreqsAll.bb \
     /gbdb/hg38/varFreqs/_all/varFreqsAll.bb
 
 # varFreqsDisease (6 cohorts; ~932 M variants, 23 GB):
 SCR=~/kent/src/hg/makeDb/scripts/varFreqs
 bash $SCR/mergeAndAnnotate.sh --databases $SCR/databases_disease.tsv --tag .disease
 python3 $SCR/vcfToBigBed.py \
     --databases-file $SCR/databases_disease.tsv \
     --annotated-vcf /hive/data/genomes/hg38/bed/varFreqs/all/merged.disease.annotated.vcf.gz \
     --output-prefix varFreqsDisease \
     --threads 6 \
     --work-dir /hive/data/genomes/hg38/bed/varFreqs/disease
 ln -sfn /hive/data/genomes/hg38/bed/varFreqs/disease/varFreqsDisease.bb \
     /gbdb/hg38/varFreqs/_disease/varFreqsDisease.bb
 
+##########
+# 2026-06-02 -> 2026-06-03 Claude max
+# Replaced the single varFreqsAll combined track (and the older varFreqsDisease
+# track) with TWO matched comparison tracks so the affected-vs-background
+# difference can be seen visually (e.g. autism-gene pLoF in cases vs the general
+# population). refs #36642
+#   varFreqsAffected   - variants seen in the affected/case arm of the disease
+#                        cohorts
+#   varFreqsBackground - "all other variants": population reference cohorts +
+#                        the unaffected/control/unknown arms of disease cohorts
+# A variant present in both groups appears in both tracks (overlap is intended).
+# Genotyping-array cohorts are in neither (they stay in varFreqsArray).
+#
+# Cohort classification (databases.tsv columns is_disease, disease_role):
+#   - TOPMed is a population cohort (is_disease=0): an NHLBI population/biobank
+#     reference, not an affected-disease case cohort, no affected/unaffected
+#     label. It feeds the background.
+#   - Disease-study cohorts (is_disease=1): SPARK, SFARI_WGS, SCHEMA, GREGoR
+#     (per-arm phenotype split) and GA4K (disease_role=affected: rare-disease
+#     cohort with no split, counted as affected; caveat - GA4K enrolls trios so
+#     a minority of carriers are unaffected parents).
+#   - populations.tsv has a 6th column "phenotype" (affected|unaffected|unknown)
+#     tagging the AUT/NON_AUT, CASE/CTRL, AFF/UNA/UNK arms. affected arms feed
+#     the affected track; unaffected/unknown arms feed the background track.
+#
+# Both tracks share one autoSql schema. Summary fields after dnaChange:
+#   affectedAF/AC      (string) max AF / summed AC in affected/case arms
+#   affectedCohorts    (string) disease cohorts contributing affected carriers
+#   backgroundAF/AC    (string) max AF / summed AC in population + unaffected
+#   backgroundSources  (string) cohorts contributing to the background
+#   inAffected         (uint)   1 if seen in an affected/case arm, else 0
+# followed by the per-database and per-population AC/AF columns. Each track
+# carries BOTH summaries (so the Affected track can be filtered to variants rare
+# in the background). Row inclusion: affected track = affectedAF/AC present;
+# background track = backgroundAF/AC present. score = that track's AF * 1000.
+# vcfToBigBed.py derives the length-filter ranges (filter.refLen/altLen/varLen)
+# from the observed data and emits them in the auto trackDb fragment.
+#
+# vcfToBigBed.py --split-affected emits the two tracks in one pass (shared
+# Phase 1 extract). Without the flag it writes a single track (used by the
+# array build).
+#
+# Naming: SPARK WGS display name -> "SFARI SPARK WGS" and SPARK WES -> "SFARI
+# SPARK WES" to match the standalone sfariSparkWgs/sfariSparkExomes subtracks
+# (internal key SFARI_WGS kept). mouseOver fields are wrapped in ${} to avoid
+# the trackDb subMultiField prefix-substitution bug (hui.old.c).
+#
+# Rebuild uses existing merged.annotated.vcf.gz (no re-merge needed).
+cd /hive/data/genomes/hg38/bed/varFreqs/all
+python3 ~/kent/src/hg/makeDb/scripts/varFreqs/vcfToBigBed.py \
+    --annotated-vcf merged.annotated.vcf.gz \
+    --output-prefix varFreqs \
+    --split-affected \
+    --threads 8 \
+    --work-dir /hive/data/genomes/hg38/bed/varFreqs/all
+# Produces varFreqsAffected.bb and varFreqsBackground.bb. Symlinked under
+# /gbdb/hg38/varFreqs/_affected/ and /gbdb/hg38/varFreqs/_background/. The
+# varFreqsAffected and varFreqsBackground filter blocks in
+# trackDb/human/varFreqs.ra come from the paste-and-go auto fragment
+# varFreqs.trackDb.ra produced by this run (both stanzas use the same fragment;
+# only the mouseOver differs).
+
 # varFreqsArray (TPMI + MexBB + UKBB; ~14.7 M variants, 750 MB):
 bash $SCR/mergeAndAnnotate.sh --databases $SCR/databases_array.tsv --tag .array
 python3 $SCR/vcfToBigBed.py \
     --databases-file $SCR/databases_array.tsv \
     --annotated-vcf /hive/data/genomes/hg38/bed/varFreqs/all/merged.array.annotated.vcf.gz \
     --output-prefix varFreqsArray \
     --threads 6 \
     --work-dir /hive/data/genomes/hg38/bed/varFreqs/array
 ln -sfn /hive/data/genomes/hg38/bed/varFreqs/array/varFreqsArray.bb \
     /gbdb/hg38/varFreqs/_array/varFreqsArray.bb
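
Reviewer note: the split rule described in the makeDoc comment (affected arms feed the affected track, everything else feeds the background, max AF and summed AC per group, score = AF * 1000) can be sketched as below. This is a minimal illustration, not the code in vcfToBigBed.py. The names `ArmCount` and `summarize_variant`, the "population" tag for is_disease=0 cohorts, the rounding of the score, and the numbers in the usage note are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ArmCount:
    """One cohort arm's counts for a single variant (hypothetical structure)."""
    cohort: str      # e.g. "SPARK", "TOPMed"
    phenotype: str   # "affected", "unaffected", "unknown", or "population" (assumed tag)
    ac: int          # allele count in this arm
    af: float        # allele frequency in this arm


def summarize_variant(arms: List[ArmCount]) -> Dict[str, Optional[dict]]:
    """Group arms into affected vs background and summarize each group.

    Only arms tagged "affected" go to the affected group; unaffected,
    unknown and population arms all go to the background group.
    A group with no carriers is reported as None, which corresponds to
    the variant having no row in that track.
    """
    groups: Dict[str, List[ArmCount]] = {"affected": [], "background": []}
    for arm in arms:
        if arm.ac <= 0:
            continue  # arm has no carriers for this variant
        key = "affected" if arm.phenotype == "affected" else "background"
        groups[key].append(arm)

    out: Dict[str, Optional[dict]] = {}
    for key, members in groups.items():
        if not members:
            out[key] = None
            continue
        max_af = max(a.af for a in members)
        out[key] = {
            "AF": max_af,                               # max AF across arms
            "AC": sum(a.ac for a in members),           # summed AC across arms
            "cohorts": sorted({a.cohort for a in members}),
            # BED score is limited to 0..1000; rounding is an assumption.
            "score": min(1000, int(round(max_af * 1000))),
        }
    out["inAffected"] = 1 if out["affected"] is not None else 0
    return out
```

With made-up counts, a variant seen in a SPARK affected arm (AC 3, AF 0.012), a SPARK unaffected arm (AC 1, AF 0.0001) and TOPMed (AC 10, AF 0.002) would get a row in both tracks: the affected summary has AC 3 and score 12, and the background summary has AC 11 drawn from SPARK and TOPMed. A variant seen only in TOPMed gets `inAffected` 0 and no affected row.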