6b285a53b036b309e3c7a9b61d3741731088a172 lrnassar Fri Jun 12 02:35:01 2026 -0700

varFreqs: switch affectedAF/backgroundAF from max-across-cohorts to pooled sum(AC)/sum(AN) so the rate matches the carrier count scale.

Per-arm AN is derived as round(AC/AF) when both are reported. An optional "default_an" column was added to databases.tsv so AF-only cohorts (ABraOM, ALFA) can synthesize a denominator from their cohort size; without it those cohorts had been silently dropped from the pooled rate.

New affectedAN and backgroundAN columns expose the pool denominator. The mouseOver now reads "Affected AC/AN: 33238 / 213153" so the ratio is visible.

Per-arm cohorts that ship only AC and no default_an (MGRB, GREGoR AC_AFFECTED/UNAFFECTED/UNKNOWN, AllOfUs per-population) are still listed in affectedCohorts/backgroundSources but contribute 0 to the pool, preserving the invariant pool_AF <= 1.

The build pipeline is unchanged: re-run vcfToBigBed.py --split-affected against the existing merged.annotated.vcf.gz. refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsBackground.html src/hg/makeDb/trackDb/human/varFreqsBackground.html
index e835ddfe2a4..e63ca161efb 100644
--- src/hg/makeDb/trackDb/human/varFreqsBackground.html
+++ src/hg/makeDb/trackDb/human/varFreqsBackground.html
@@ -1,121 +1,139 @@

Description

This track shows small variants (SNVs and short indels) seen in population reference cohorts and in unaffected or control individuals of disease-study cohorts, annotated with their predicted protein consequence and colored by severity. It is the background half of a matched pair: the companion Affected/Case Individuals track shows the same kind of variants seen in affected or case individuals. Displaying the two together lets you see how common a variant is in the general/unaffected population compared with affected individuals. For the full list of contributing projects, see the SNV Frequencies collection page.

The background combines two kinds of data: the population/biobank reference cohorts (such as gnomAD HGDP+1kG, TOPMed, ALFA, HRC and the many national WGS projects), and the unaffected/control or unknown-phenotype arms of the disease-study cohorts (non-ASD family members in SFARI SPARK WES/WGS, SCHEMA controls, and GREGoR unaffected/unknown participants). Genotyping-array cohorts are not included. A variant that also appears in affected individuals is shown in both this track and the Affected/Case Individuals track.

Display Conventions

Color by Consequence

Variants are colored by their most severe predicted consequence:

Color	Consequence class	Examples
	Protein-truncating / loss-of-function	stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost
	Missense / in-frame	missense, inframe_insertion, inframe_deletion, protein_altering
	Synonymous	synonymous, stop_retained
	Non-coding / intergenic	intron, non_coding, intergenic, UTR

-The score (used for shading) is the background allele frequency (the maximum across the
-population cohorts and unaffected/control arms) times 1000.
+The score (used for shading) is the pooled background allele frequency times 1000.

Pooled allele frequency

+Background AF is the pooled rate across contributing population cohorts and
+unaffected/control arms: backgroundAF = sum(AC) / sum(AN), where
+backgroundAC sums the allele counts and backgroundAN sums the allele
+numbers across each cohort/arm that ships both AC and AF (the per-arm AN is derived as
+round(AC / AF)). Two cohorts that publish only AF (ABraOM, ALFA) contribute
+via a configured default_an in the build configuration. Cohorts that publish
+only AC with no default_an set (currently MGRB and the GREGoR unaffected and
+unknown arms), and cohorts that contribute only through per-population AC/AF (currently
+AllOfUs), are listed in backgroundSources but do not contribute to the pool
+numerator or denominator; their data remain visible in the per-database and per-population
+AC/AF columns. The pooled rate is preferred over a max-across-cohorts statistic so a small
+cohort with a high local AF (for example AllOfUs Oceanian) cannot dominate the displayed
+frequency.
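The pooling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in vcfToBigBed.py: the function names, the tuple layout, and the numbers are hypothetical. Two details are assumptions on top of the text: that an AF-only cohort's allele count is synthesized as round(AF * default_an), and that the track score is the pooled AF times 1000, rounded and capped at 1000.

```python
from typing import Iterable, Optional, Tuple

# One cohort/arm: (AC, AF, default_an). Any field may be None when the
# cohort does not publish it or no default_an is configured.
Arm = Tuple[Optional[int], Optional[float], Optional[int]]


def arm_ac_an(ac: Optional[int], af: Optional[float],
              default_an: Optional[int] = None) -> Tuple[int, int]:
    """Return the (AC, AN) pair one arm contributes to the pool."""
    if ac is not None and af is not None and af > 0:
        # AC and AF both reported: derive the denominator as round(AC / AF).
        return ac, round(ac / af)
    if ac is None and af is not None and default_an:
        # AF-only cohort: use the configured cohort size as AN.
        # ASSUMPTION: AC is synthesized as round(AF * default_an).
        return round(af * default_an), default_an
    # AC-only with no default_an (or nothing usable): contributes 0 to the
    # pool, which keeps the pooled AF <= 1.
    return 0, 0


def pooled(arms: Iterable[Arm]) -> Tuple[int, int, float]:
    """Return (sum AC, sum AN, pooled AF) over all arms."""
    total_ac = total_an = 0
    for ac, af, default_an in arms:
        a, n = arm_ac_an(ac, af, default_an)
        total_ac += a
        total_an += n
    af = total_ac / total_an if total_an else 0.0
    return total_ac, total_an, af


# Hypothetical example: one AC+AF cohort, one AF-only cohort with a
# default_an of 2000, and one AC-only arm without default_an.
arms = [(30, 0.003, None), (None, 0.01, 2000), (5, None, None)]
background_ac, background_an, background_af = pooled(arms)

# ASSUMPTION: score is AF * 1000, rounded and capped at 1000.
score = min(1000, round(background_af * 1000))
```

In this made-up example the pool is 50 / 12000, about 0.0042, whereas a max-across-cohorts statistic would report 0.01 from the smaller AF-only cohort. The AC-only arm adds nothing to either sum.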

Filters

Variant Type and Consequence: restrict to SNV/insertion/deletion/MNV and to predicted consequence classes (the Consequence filter uses OR logic over the comma-separated tokens on each variant).
-Background AF and Background AC: the maximum allele frequency and summed
-allele count across the population cohorts and unaffected/control arms.
-Affected/case AF and Affected/case AC: the same variant's frequency in
+Background AF, AC, AN: the pooled allele frequency
+(sum AC / sum AN), summed allele count, and summed allele number across the
+contributing population cohorts and unaffected/control arms. See "Pooled
+allele frequency" above.
+Affected/case AF, AC, AN: the same triple computed across
 affected individuals, for context.
Background source: restrict to variants seen in specific cohorts.
Per-database AF/AC and ancestry-specific allele frequencies (AllOfUs, GenomeAsia, gnomAD HGDP+1kG, NPM Singapore, WBBC) let you filter to a single cohort or ancestry group.
Reference/Alternate Length and Length Change: filter by allele length.
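The OR logic of the Consequence filter can be illustrated with a short Python sketch. The function name and the example tokens are hypothetical, and this is not the Genome Browser's filter code. It shows only the stated rule: a variant passes if any of its comma-separated consequence tokens is among the selected classes.

```python
from typing import Set


def matches_consequence(variant_consequences: str, selected: Set[str]) -> bool:
    """True if at least one token on the variant is in the selected set."""
    # Split the comma-separated field and drop empty or padded tokens.
    tokens = {t.strip() for t in variant_consequences.split(",") if t.strip()}
    # OR logic: any overlap between the two sets is enough.
    return bool(tokens & selected)


selected = {"stop_gained", "splice_donor"}
keep = matches_consequence("missense,splice_donor", selected)
drop = matches_consequence("synonymous", selected)
```

Here the first variant is kept because one of its two tokens is selected, and the second is dropped because none are.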

Methods

Variant-frequency VCFs from the contributing cohorts were stripped of unneeded INFO fields, normalized with bcftools norm (splitting multi-allelic sites), and merged with bcftools merge. The merged callset was annotated with predicted protein consequences using bcftools csq against the Ensembl GRCh38 release 115 gene models.
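The normalize, merge, and annotate steps might be assembled as follows. This is a sketch only: it builds the bcftools command lines without running them, the file names are hypothetical (apart from merged.annotated.vcf.gz, which the commit message mentions), and the exact options used for the real build are not stated here, so the ones below are assumptions. The INFO-stripping step is omitted because the text does not say how it was done.

```python
from typing import List


def build_commands(cohort_vcfs: List[str], ref_fasta: str, gff3: str,
                   out_vcf: str = "merged.annotated.vcf.gz") -> List[List[str]]:
    """Return bcftools commands for norm -> index -> merge -> csq."""
    cmds: List[List[str]] = []
    normed: List[str] = []
    for vcf in cohort_vcfs:
        out = vcf.replace(".vcf.gz", ".norm.vcf.gz")
        # Split multi-allelic sites into one record per alternate allele.
        cmds.append(["bcftools", "norm", "-m", "-any", "-f", ref_fasta,
                     "-Oz", "-o", out, vcf])
        # bcftools merge needs indexed inputs.
        cmds.append(["bcftools", "index", "-t", out])
        normed.append(out)
    # Merge the normalized per-cohort files into one callset.
    cmds.append(["bcftools", "merge", "-Oz", "-o", "merged.vcf.gz", *normed])
    # Annotate predicted consequences against a GFF3 gene model file.
    cmds.append(["bcftools", "csq", "-f", ref_fasta, "-g", gff3,
                 "-Oz", "-o", out_vcf, "merged.vcf.gz"])
    return cmds


cmds = build_commands(["cohortA.vcf.gz", "cohortB.vcf.gz"],
                      "hg38.fa", "ensembl115.gff3")
for cmd in cmds:
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once the paths point at real files. Printing first makes it easy to review the commands before running them.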

A custom Python script (vcfToBigBed.py) then read the per-cohort allele frequencies and, for each variant, summed/maximized the counts across the population cohorts and unaffected/control subgroups to produce this track, and across the affected arms to produce the companion Affected/Case Individuals track. A variant seen in both groups appears in both tracks. The build is documented in the makeDoc, and the scripts are on GitHub.

Data Access

Because the merged callset combines cohorts whose redistribution licenses differ, this track is not available for download and is not in the Table Browser. It can be reconstructed from the individual source VCFs using the conversion scripts and the build documentation. The per-project subtracks on the SNV Frequencies collection page document how to obtain each source dataset.

Credits

This track is only possible thanks to the data from millions of volunteers around the world who contributed to the population reference projects and to the unaffected/control arms of the disease cohorts. Click the individual project subtracks on the SNV Frequencies collection page for the specific credits and citations of each cohort. Thanks to Alex Ioannidis, UCSC, for the inspiration for this track and to Andreas Lahner, MGZ, for feedback.

References

For the primary citation of each source cohort, see the References section on the SNV Frequencies collection page. The merged-track build uses the following tools:

Danecek P, McCarthy SA. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics. 2017 Jul 1;33(13):2037-2039. PMID: 28205675; PMC: PMC5870570