1259dcfba3a263d92d2602665fd866dc44b47996 lrnassar Sun Jun 21 11:17:10 2026 -0700

Clarify varFreqs description page wording per code review feedback. refs #37733

Reword the default_an sentence in the Pooled allele frequency sections of varFreqsAffected.html and varFreqsBackground.html to explain that cohorts publishing only AF are pooled via an assigned default_an, with per-arm AC derived as round(AF * default_an). Change "tokens" to "terms" in the Consequence filter section of varFreqs.html.

diff --git src/hg/makeDb/trackDb/human/varFreqsBackground.html src/hg/makeDb/trackDb/human/varFreqsBackground.html
index 2b7d13f8ecb..5d2f82eb6ef 100644
--- src/hg/makeDb/trackDb/human/varFreqsBackground.html
+++ src/hg/makeDb/trackDb/human/varFreqsBackground.html
@@ -1,139 +1,141 @@

Description

This track shows small variants (SNVs and short indels) seen in population reference cohorts and in unaffected or control individuals of disease-study cohorts, annotated with their predicted protein consequence and colored by severity. It is the background half of a matched pair: the companion Disease cohorts track shows the same kind of variants seen in affected or case individuals. Displaying the two together lets you see how common a variant is in the general/unaffected population compared with affected individuals. For the full list of contributing projects, see the SNV Frequencies collection page.

The background combines two kinds of data: the population/biobank reference cohorts (such as gnomAD HGDP+1kG, TOPMed, ALFA, HRC and the many national WGS projects), and the unaffected/control or unknown-phenotype arms of the disease-study cohorts (non-ASD family members in SFARI SPARK WES/WGS, SCHEMA controls, and GREGoR unaffected/unknown participants). Genotyping-array cohorts are not included. A variant that also appears in affected individuals is shown in both this track and the Disease cohorts track.

Display Conventions

Color by Consequence

Variants are colored by their most severe predicted consequence:

Color	Consequence class	Examples
	Protein-truncating / loss-of-function	stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost
	Missense / in-frame	missense, inframe_insertion, inframe_deletion, protein_altering
	Synonymous	synonymous, stop_retained
	Non-coding / intergenic	intron, non_coding, intergenic, UTR

The score (used for shading) is the pooled background allele frequency times 1000.
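
As a worked example of the score rule, the Python sketch below converts a pooled AF into a shading score. The function name shading_score is hypothetical, and both the rounding to the nearest integer and the clamping to the 0..1000 range are assumptions made for illustration; the build script may truncate or treat edge cases differently.

```python
def shading_score(pooled_af: float) -> int:
    """Return pooled background AF times 1000 as an integer score.

    Assumption: the value is rounded to the nearest integer and kept
    within 0..1000. The source only states "AF times 1000".
    """
    return max(0, min(1000, round(pooled_af * 1000)))


# Hypothetical pooled AF of 1.24%
example_score = shading_score(0.0124)
```

Under these assumptions a pooled AF of 0.0124 maps to a score of 12, and any variant with a pooled AF well below 0.0005 would receive a score of 0.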

Pooled allele frequency

Background AF is the pooled rate across contributing population cohorts and unaffected/control arms: backgroundAF = sum(AC) / sum(AN), where backgroundAC sums the allele counts and backgroundAN sums the allele numbers across each cohort/arm that provides both AC and AF (the per-arm AN is derived as round(AC / AF)). Two cohorts that publish only AF (ABraOM, ALFA) are still pooled by assigning them an assumed allele number, set as a default_an in the build configuration; their per-arm AC is then derived as round(AF × default_an). Cohorts that publish only AC with no default_an set (currently MGRB and the GREGoR unaffected and unknown arms), and cohorts that contribute only through per-population AC/AF (currently AllOfUs), are listed in backgroundSources but do not contribute to the pool numerator or denominator; their data remain visible in the per-database and per-population AC/AF columns. The pooled rate is preferred over a max-across-cohorts statistic so a small cohort with a high local AF (for example AllOfUs Oceanian) cannot dominate the displayed frequency.
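
The pooling rules in this section can be sketched in Python as follows. This is a minimal illustration, not the code in vcfToBigBed.py: the Arm class, the function names, the cohort labels, and the numbers are all hypothetical, and skipping arms with AF equal to zero (where AN cannot be derived from AC / AF) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Arm:
    """One cohort or arm's published values for a single variant (hypothetical)."""
    name: str
    ac: Optional[int] = None          # allele count, if published
    af: Optional[float] = None        # allele frequency, if published
    default_an: Optional[int] = None  # assumed allele number from the build config


def arm_counts(arm: Arm) -> Optional[Tuple[int, int]]:
    """Return (AC, AN) if the arm can be pooled, otherwise None."""
    if arm.ac is not None and arm.af:
        # Both AC and AF published: derive AN as round(AC / AF).
        # Assumption: AF == 0 is skipped because AN cannot be derived.
        return arm.ac, round(arm.ac / arm.af)
    if arm.ac is None and arm.af is not None and arm.default_an:
        # AF only: use the assumed AN and derive AC as round(AF * default_an).
        return round(arm.af * arm.default_an), arm.default_an
    # AC only with no default_an, or nothing usable: listed but not pooled.
    return None


def pool_background(arms: List[Arm]) -> Tuple[int, int, float]:
    """Return (backgroundAC, backgroundAN, backgroundAF) = sums and sum(AC)/sum(AN)."""
    ac_sum = an_sum = 0
    for arm in arms:
        counts = arm_counts(arm)
        if counts is None:
            continue
        ac_sum += counts[0]
        an_sum += counts[1]
    af = ac_sum / an_sum if an_sum else 0.0
    return ac_sum, an_sum, af


arms = [
    Arm("cohortWithAcAndAf", ac=10, af=0.001),       # AN derived as 10000
    Arm("cohortAfOnly", af=0.002, default_an=5000),  # AC derived as 10
    Arm("cohortAcOnly", ac=7),                       # excluded from the pool
]
background_ac, background_an, background_af = pool_background(arms)
```

In this made-up example the pool is 20 alleles out of 15000, so the AC-only cohort leaves the pooled frequency unchanged while still being a candidate for the source list.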

Filters

Variant Type and Consequence: restrict to SNV/insertion/deletion/MNV and to predicted consequence classes (the Consequence filter uses OR logic over the comma-separated tokens on each variant).
Background AF, AC, AN: the pooled allele frequency (sum AC / sum AN), summed allele count, and summed allele number across the contributing population cohorts and unaffected/control arms. See "Pooled allele frequency" above.
Affected/case AF, AC, AN: the same triple computed across affected individuals, for context.
Background source: restrict to variants seen in specific cohorts.
Per-database AF/AC and ancestry-specific allele frequencies (AllOfUs, GenomeAsia, gnomAD HGDP+1kG, NPM Singapore, WBBC) let you filter to a single cohort or ancestry group.
Reference/Alternate Length and Length Change: filter by allele length.
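
The OR logic of the Consequence filter can be illustrated with a short Python sketch. The function name is hypothetical, and treating an empty selection as "no filter" is an assumption; the Genome Browser applies the filter internally and its implementation may differ.

```python
from typing import Set


def consequence_matches(variant_consequences: str, selected: Set[str]) -> bool:
    """True if any comma-separated token on the variant is among the selected classes."""
    tokens = {t.strip() for t in variant_consequences.split(",") if t.strip()}
    # Assumption: an empty selection means the filter is inactive.
    if not selected:
        return True
    # OR logic: one shared token is enough to keep the variant.
    return bool(tokens & selected)


keeps = consequence_matches("missense,splice_donor", {"splice_donor", "stop_gained"})
drops = consequence_matches("synonymous", {"missense", "stop_gained"})
```

Here the first variant passes because one of its two tokens is selected, and the second is filtered out because none of its tokens match.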

Methods

Variant-frequency VCFs from the contributing cohorts were stripped of unneeded INFO fields, normalized with bcftools norm (splitting multi-allelic sites), and merged with bcftools merge. The merged callset was annotated with predicted protein consequences using bcftools csq against the Ensembl GRCh38 release 115 gene models.
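
The order of the bcftools steps above can be sketched as a Python function that only builds the command lines and does not run them. The file names, the INFO tags that are kept, and the specific options are assumptions chosen for illustration; the real commands are recorded in the makeDoc and may differ, for example in how per-cohort INFO tags are renamed before merging.

```python
from typing import List


def build_commands(cohort_vcfs: List[str], ref_fasta: str, gff3: str,
                   keep_info: str = "INFO/AC,INFO/AN,INFO/AF") -> List[List[str]]:
    """Build (but do not execute) bcftools commands for strip, norm, merge, csq."""
    cmds: List[List[str]] = []
    normalized: List[str] = []
    for vcf in cohort_vcfs:
        stripped = vcf.replace(".vcf.gz", ".stripped.vcf.gz")
        norm = vcf.replace(".vcf.gz", ".norm.vcf.gz")
        # Remove every INFO tag except the ones listed after "^".
        cmds.append(["bcftools", "annotate", "-x", "^" + keep_info,
                     "-Oz", "-o", stripped, vcf])
        # Split multi-allelic sites into one record per alternate allele.
        cmds.append(["bcftools", "norm", "-m", "-any", "-f", ref_fasta,
                     "-Oz", "-o", norm, stripped])
        # bcftools merge needs indexed inputs.
        cmds.append(["bcftools", "index", "-t", norm])
        normalized.append(norm)
    cmds.append(["bcftools", "merge", "-Oz", "-o", "merged.vcf.gz", *normalized])
    # Annotate predicted consequences against the gene models in the GFF3 file.
    cmds.append(["bcftools", "csq", "-f", ref_fasta, "-g", gff3,
                 "-Oz", "-o", "merged.csq.vcf.gz", "merged.vcf.gz"])
    return cmds


commands = build_commands(["cohortA.vcf.gz", "cohortB.vcf.gz"],
                          "hg38.fa", "ensembl.gff3.gz")
```

Each inner list could be passed to subprocess.run(cmd, check=True) once the paths point at real files.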

A custom Python script (vcfToBigBed.py) then read the per-cohort allele counts and frequencies and, for each variant, pooled the allele counts and allele numbers across the population cohorts and unaffected/control subgroups to produce this track, and across the affected arms to produce the companion Disease cohorts track. A variant seen in both groups appears in both tracks. The build is documented in the makeDoc, and the scripts are on GitHub.

Data Access

Because the merged callset combines cohorts whose redistribution licenses differ, this track is not available for download and is not in the Table Browser. It can be reconstructed from the individual source VCFs using the conversion scripts and the build documentation. The per-project subtracks on the SNV Frequencies collection page document how to obtain each source dataset.

Credits

This track is only possible thanks to the data from millions of volunteers around the world who contributed to the population reference projects and to the unaffected/control arms of the disease cohorts. Click the individual project subtracks on the SNV Frequencies collection page for the specific credits and citations of each cohort. Thanks to Alex Ioannidis, UCSC, for the inspiration for this track and to Andreas Lahner, MGZ, for feedback.

References

For the primary citation of each source cohort, see the References section on the SNV Frequencies collection page. The merged-track build uses the following tools:

Danecek P, McCarthy SA. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics. 2017 Jul 1;33(13):2037-2039. PMID: 28205675; PMC: PMC5870570