64a3f9e7813e823cf724ea188c3928a911578286 max Thu Jun 4 00:32:22 2026 -0700

varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease track) with two matched tracks for visual case-vs-background comparison:

  varFreqsAffected - variants seen in the affected/case arms of disease cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA cases, GREGoR affected, GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control arms of disease cohorts ("all other variants"); ~1.5 million individuals

A variant seen in both groups appears in both tracks. Genotyping-array cohorts stay out of both (varFreqsArray unchanged).

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it reads phenotype tags (affected/unaffected/unknown) from populations.tsv and is_disease/disease_role from databases.tsv, and derives the length-filter ranges from the observed data.

TOPMed reclassified as a population cohort. SPARK WGS display name changed to SFARI SPARK WGS for consistency with the standalone subtracks. Fixed the trackDb mouseOver $-substitution prefix collision by wrapping fields in ${}. New description pages for both tracks.

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsBackground.html src/hg/makeDb/trackDb/human/varFreqsBackground.html
new file mode 100644
index 00000000000..e835ddfe2a4
--- /dev/null
+++ src/hg/makeDb/trackDb/human/varFreqsBackground.html
@@ -0,0 +1,121 @@
+

Description

+This track shows small variants (SNVs and short indels) seen in population reference
+cohorts and in unaffected or control individuals of disease-study cohorts, annotated
+with their predicted protein consequence and colored by severity. It is the background half
+of a matched pair: the companion
+Affected/Case Individuals track shows the same
+kind of variants seen in affected or case individuals. Displaying the two together lets you
+see how common a variant is in the general/unaffected population compared with affected
+individuals. For the full list of contributing projects, see the
+SNV Frequencies collection page.
+

+The background combines two kinds of data: the population/biobank reference cohorts (such as
+gnomAD HGDP+1kG, TOPMed, ALFA, HRC and the many national WGS projects), and the
+unaffected/control or unknown-phenotype arms of the disease-study cohorts (non-ASD family
+members in SFARI SPARK WES/WGS, SCHEMA controls, and GREGoR unaffected/unknown
+participants). Genotyping-array cohorts are not included. A variant that also appears in
+affected individuals is shown in both this track and the
+Affected/Case Individuals track.
+

+ +

Display Conventions

Color by Consequence

Variants are colored by their most severe predicted consequence:

+ + + + + + + + + + + + + + +

Color	Consequence class	Examples
	Protein-truncating / loss-of-function	stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost
	Missense / in-frame	missense, inframe_insertion, inframe_deletion, protein_altering
	Synonymous	synonymous, stop_retained
	Non-coding / intergenic	intron, non_coding, intergenic, UTR

+The score (used for shading) is the background allele frequency (the maximum across the
+population cohorts and unaffected/control arms) times 1000.
+
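
The score rule above can be sketched as follows. This is a minimal illustration, not the code in vcfToBigBed.py: the function name `bed_score` is hypothetical, and the rounding and the clamp to the 0-1000 range are assumptions that the description does not state.

```python
def bed_score(background_afs):
    """Return the shading score: maximum background AF times 1000.

    background_afs: allele frequencies (0.0-1.0) from the population
    cohorts and unaffected/control arms that carry the variant.
    Rounding to an integer and clamping to 0..1000 are assumptions.
    """
    max_af = max(background_afs, default=0.0)  # empty input -> score 0
    return min(1000, max(0, round(max_af * 1000)))


# Illustrative values only: the highest AF (0.012) drives the score.
print(bed_score([0.0004, 0.012, 0.0031]))
```

Under these assumptions, a variant whose highest background frequency is 1.2% receives a score of 12, so rare variants sit at the light end of the shading range.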

+ +

Filters

+Variant Type and Consequence: restrict to SNV/insertion/deletion/MNV
+    and to predicted consequence classes (the Consequence filter uses OR logic over the
+    comma-separated tokens on each variant).
+Background AF and Background AC: the maximum allele frequency and summed
+    allele count across the population cohorts and unaffected/control arms.
+Affected/case AF and Affected/case AC: the same variant's frequency in
+    affected individuals, for context.
+Background source: restrict to variants seen in specific cohorts.
+Per-database AF/AC and ancestry-specific allele frequencies (AllOfUs, GenomeAsia,
+    gnomAD HGDP+1kG, NPM Singapore, WBBC) let you filter to a single cohort or ancestry
+    group.
+Reference/Alternate Length and Length Change: filter by allele length.
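
The OR logic of the Consequence filter can be illustrated with a short sketch. The function name is hypothetical, and treating an empty selection as "no filtering" is an assumption; the Genome Browser applies the filter internally and its implementation is not shown here.

```python
def passes_consequence_filter(variant_consequences, selected):
    """Return True if the variant carries at least one selected consequence.

    variant_consequences: comma-separated tokens stored on the variant,
        for example "missense,splice_donor".
    selected: set of consequence tokens chosen in the filter.
    An empty selection is treated as "no filter" (an assumption).
    """
    tokens = {t.strip() for t in variant_consequences.split(",") if t.strip()}
    if not selected:
        return True
    # OR logic: one shared token is enough to keep the variant.
    return bool(tokens & selected)


print(passes_consequence_filter("missense,splice_donor", {"splice_donor", "stop_gained"}))
print(passes_consequence_filter("synonymous", {"splice_donor", "stop_gained"}))
```

With OR logic, selecting more consequence classes can only widen the set of displayed variants, never narrow it.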

+ +

Methods

+Variant-frequency VCFs from the contributing cohorts were stripped of unneeded INFO fields,
+normalized with bcftools norm (splitting multi-allelic sites), and merged with
+bcftools merge. The merged callset was annotated with predicted protein
+consequences using bcftools csq against the
+Ensembl
+GRCh38 release 115 gene models.
+
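
The processing order described above could be scripted roughly as below. This sketch only builds the command lines and does not run them. All file names are hypothetical, and the specific options (`annotate -x` with a keep-list, `norm -m -any`, `merge -m none`) are assumptions about how each step might be invoked; the actual commands are in the makeDoc.

```python
REF_FASTA = "hg38.fa"                        # hypothetical reference FASTA
GENE_MODELS = "ensembl.release115.gff3.gz"   # hypothetical GFF3 file name


def strip_info_cmd(src, dst, keep=("AC", "AN", "AF")):
    # "^INFO/A,INFO/B" tells bcftools annotate to drop every INFO tag except those listed.
    keep_expr = "^" + ",".join(f"INFO/{tag}" for tag in keep)
    return ["bcftools", "annotate", "-x", keep_expr, "-Oz", "-o", dst, src]


def norm_cmd(src, dst):
    # "-m -any" splits multi-allelic sites into one record per alternate allele.
    return ["bcftools", "norm", "-m", "-any", "-f", REF_FASTA, "-Oz", "-o", dst, src]


def merge_cmd(srcs, dst):
    # "-m none" keeps records separate instead of re-creating multi-allelic sites (assumption).
    # bcftools merge expects bgzipped, indexed inputs (for example via "bcftools index").
    return ["bcftools", "merge", "-m", "none", "-Oz", "-o", dst, *srcs]


def csq_cmd(src, dst):
    # bcftools csq needs the reference FASTA (-f) and GFF3 gene models (-g).
    return ["bcftools", "csq", "-f", REF_FASTA, "-g", GENE_MODELS, "-Oz", "-o", dst, src]


def build_pipeline(cohort_vcfs):
    """Return the ordered command lists: strip and normalize each cohort, merge, annotate."""
    commands, normalized = [], []
    for vcf in cohort_vcfs:
        stem = vcf.removesuffix(".vcf.gz")
        stripped, normed = f"{stem}.stripped.vcf.gz", f"{stem}.norm.vcf.gz"
        commands.append(strip_info_cmd(vcf, stripped))
        commands.append(norm_cmd(stripped, normed))
        normalized.append(normed)
    commands.append(merge_cmd(normalized, "merged.vcf.gz"))
    commands.append(csq_cmd("merged.vcf.gz", "merged.csq.vcf.gz"))
    return commands


for cmd in build_pipeline(["cohortA.vcf.gz", "cohortB.vcf.gz"]):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.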

+A custom Python script (vcfToBigBed.py) then read the per-cohort allele
+frequencies and counts. For each variant, it took the maximum allele frequency and summed
+the allele counts across the population cohorts and unaffected/control subgroups to produce
+this track, and did the same across the affected arms to produce the companion
+Affected/Case Individuals track. A variant seen
+in both groups appears in both tracks. The build is documented in the
+makeDoc, and the scripts are on
+GitHub.
+
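
A minimal sketch of this per-variant aggregation follows. It is not the vcfToBigBed.py code: the `CohortCount` record, the group labels, and the example numbers are all hypothetical, and skipping cohorts with a zero allele number is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CohortCount:
    cohort: str   # cohort or subgroup name
    group: str    # "background" or "affected" (hypothetical labels)
    ac: int       # alternate allele count
    an: int       # total allele number


def summarize(counts):
    """Return {group: {"max_af": float, "sum_ac": int}} for one variant."""
    summary = {}
    for c in counts:
        if c.an == 0:          # no called alleles: skip (assumption)
            continue
        s = summary.setdefault(c.group, {"max_af": 0.0, "sum_ac": 0})
        s["max_af"] = max(s["max_af"], c.ac / c.an)  # maximum AF across cohorts
        s["sum_ac"] += c.ac                          # summed AC across cohorts
    return summary


def tracks_for(summary):
    """A variant goes into every track whose group observed it at least once."""
    return sorted(g for g, s in summary.items() if s["sum_ac"] > 0)


# Made-up counts for a single variant, for illustration only.
variant = [
    CohortCount("population cohort 1", "background", ac=40, an=100_000),
    CohortCount("disease cohort controls", "background", ac=3, an=1_000),
    CohortCount("disease cohort cases", "affected", ac=5, an=2_000),
]
result = summarize(variant)
print(result)
print(tracks_for(result))
```

Because this example variant has a non-zero count in both groups, it would be written to both tracks, matching the rule stated above.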

+ +

Data Access

+Because the merged callset combines cohorts whose redistribution licenses differ, this
+track is not available for download and is not in the Table Browser. It can be
+reconstructed from the individual source VCFs using the
+conversion scripts and the
+build documentation. The per-project subtracks on the
+SNV Frequencies collection page document how to obtain
+each source dataset.
+

+ +

Credits

+This track is only possible thanks to the data from millions of volunteers around the world
+who contributed to the population reference projects and to the unaffected/control arms of
+the disease cohorts. Click the individual project subtracks on the
+SNV Frequencies collection page for the specific credits
+and citations of each cohort. Thanks to Alex Ioannidis, UCSC, for the inspiration for this
+track and to Andreas Lahner, MGZ, for feedback.
+

+ +

References

+For the primary citation of each source cohort, see the References section on the
+SNV Frequencies collection page. The merged-track build
+uses the following tools:
+

+Danecek P, McCarthy SA.
+BCFtools/csq: haplotype-aware variant consequences.
+Bioinformatics. 2017 Jul 1;33(13):2037-2039.
+PMID: 28205675;
+PMC: PMC5870570
+