6b285a53b036b309e3c7a9b61d3741731088a172 lrnassar Fri Jun 12 02:35:01 2026 -0700

varFreqs: switch affectedAF/backgroundAF from max-across-cohorts to pooled sum(AC)/sum(AN) so the rate matches the carrier count scale.

Per-arm AN is derived as round(AC/AF) when both are reported. An optional "default_an" column was added to databases.tsv so AF-only cohorts (ABraOM, ALFA) can synthesize a denominator from their cohort size; without it those cohorts had been silently dropped from the pooled rate. New affectedAN and backgroundAN columns expose the pool denominator. The mouseOver now reads "Affected AC/AN: 33238 / 213153" so the ratio is visible. Per-arm cohorts that ship only AC and no default_an (MGRB, GREGoR AC_AFFECTED/UNAFFECTED/UNKNOWN, AllOfUs per-population) are still listed in affectedCohorts/backgroundSources but contribute 0 to the pool, preserving the invariant pool_AF <= 1. The build pipeline is unchanged: re-run vcfToBigBed.py --split-affected against the existing merged.annotated.vcf.gz.

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsBackground.html src/hg/makeDb/trackDb/human/varFreqsBackground.html
index e835ddfe2a4..e63ca161efb 100644
--- src/hg/makeDb/trackDb/human/varFreqsBackground.html
+++ src/hg/makeDb/trackDb/human/varFreqsBackground.html
@@ -1,121 +1,139 @@
This track shows small variants (SNVs and short indels) seen in population reference cohorts and in unaffected or control individuals of disease-study cohorts, annotated with their predicted protein consequence and colored by severity. It is the background half of a matched pair: the companion Affected/Case Individuals track shows the same kind of variants seen in affected or case individuals. Displaying the two together lets you see how common a variant is in the general/unaffected population compared with affected individuals. For the full list of contributing projects, see the SNV Frequencies collection page.
The background combines two kinds of data: the population/biobank reference cohorts (such as gnomAD HGDP+1kG, TOPMed, ALFA, HRC and the many national WGS projects), and the unaffected/control or unknown-phenotype arms of the disease-study cohorts (non-ASD family members in SFARI SPARK WES/WGS, SCHEMA controls, and GREGoR unaffected/unknown participants). Genotyping-array cohorts are not included. A variant that also appears in affected individuals is shown in both this track and the Affected/Case Individuals track.
Variants are colored by their most severe predicted consequence:
| Color | Consequence class | Examples |
|---|---|---|
| | Protein-truncating / loss-of-function | stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost |
| | Missense / in-frame | missense, inframe_insertion, inframe_deletion, protein_altering |
| | Synonymous | synonymous, stop_retained |
| | Non-coding / intergenic | intron, non_coding, intergenic, UTR |
-The score (used for shading) is the background allele frequency (the maximum across the
-population cohorts and unaffected/control arms) times 1000.
+The score (used for shading) is the pooled background allele frequency times 1000.
+
+
+
+Background AF is the pooled rate across contributing population cohorts and
+unaffected/control arms: backgroundAF = sum(AC) / sum(AN), where
+backgroundAC sums the allele counts and backgroundAN sums the allele
+numbers across each cohort/arm that ships both AC and AF (the per-arm AN is derived as
+round(AC / AF)). Two cohorts that publish only AF (ABraOM, ALFA) contribute
+via a default_an set for them in the build configuration (databases.tsv). Cohorts that publish
+only AC with no default_an set (currently MGRB and the GREGoR unaffected and
+unknown arms), and cohorts that contribute only through per-population AC/AF (currently
+AllOfUs), are listed in backgroundSources but do not contribute to the pool
+numerator or denominator; their data remain visible in the per-database and per-population
+AC/AF columns. The pooled rate is preferred over a max-across-cohorts statistic so a small
+cohort with a high local AF (for example AllOfUs Oceanian) cannot dominate the displayed
+frequency.
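The pooling rule described above can be illustrated with a minimal Python sketch. This is not the code in vcfToBigBed.py: the `Arm` record, the function names, and all numbers are hypothetical. Synthesizing AC as `round(AF * default_an)` for AF-only cohorts, accepting AC-only cohorts that do have a `default_an`, and rounding and clamping the score to 0-1000 are assumptions made for illustration.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class Arm(NamedTuple):
    """One cohort/arm record for a single variant (hypothetical structure)."""
    name: str
    ac: Optional[int] = None          # allele count, if the cohort publishes it
    af: Optional[float] = None        # allele frequency, if the cohort publishes it
    default_an: Optional[int] = None  # configured fallback denominator, if any


def arm_contribution(arm: Arm) -> Tuple[int, int]:
    """Return the (AC, AN) this arm adds to the pool, or (0, 0) if it cannot."""
    if arm.ac is not None and arm.af:
        # Both reported (and AF non-zero): derive the per-arm AN as round(AC / AF).
        return arm.ac, round(arm.ac / arm.af)
    if arm.default_an:
        if arm.af is not None:
            # AF-only cohort: assumed to synthesize AC from AF * default_an.
            return round(arm.af * arm.default_an), arm.default_an
        if arm.ac is not None:
            # AC-only cohort with a configured denominator (assumed behavior).
            return arm.ac, arm.default_an
    # No usable denominator: listed as a source, but adds nothing to the pool.
    return 0, 0


def pooled_af(arms: Iterable[Arm]) -> Tuple[int, int, float]:
    """Pool all arms: returns (sum(AC), sum(AN), sum(AC) / sum(AN))."""
    pairs = [arm_contribution(a) for a in arms]
    ac = sum(p[0] for p in pairs)
    an = sum(p[1] for p in pairs)
    return ac, an, (ac / an if an else 0.0)


def bed_score(af: float) -> int:
    """Score is AF times 1000; rounding and clamping are assumptions."""
    return min(1000, round(af * 1000))


# Hypothetical arms for one variant.
arms = [
    Arm("large_cohort", ac=30, af=0.003),          # AN derived as round(30 / 0.003)
    Arm("af_only", af=0.01, default_an=2000),      # uses the configured default_an
    Arm("ac_only", ac=5),                          # no AF, no default_an: not pooled
    Arm("small_high_af", ac=4, af=0.2),            # AN derived as round(4 / 0.2)
]

ac, an, af = pooled_af(arms)
max_af = max(a.af for a in arms if a.af is not None)
print(f"pooled AC/AN: {ac} / {an}  AF={af:.5f}  score={bed_score(af)}  max AF={max_af}")
```

With these made-up inputs, a max-across-cohorts statistic would report the small arm's 0.2, while the pooled rate is 54 / 12020, roughly 0.0045. This is the effect the paragraph above describes: a small cohort with a high local AF no longer dominates the displayed frequency.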
Variant-frequency VCFs from the contributing cohorts were stripped of unneeded INFO fields,
normalized with bcftools norm (splitting multi-allelic sites), and merged with
bcftools merge. The merged callset was annotated with predicted protein
consequences using bcftools csq against the
Ensembl
GRCh38 release 115 gene models.
A custom Python script (vcfToBigBed.py) then read the per-cohort allele
frequencies and, for each variant, summed/maximized the counts across the population cohorts
and unaffected/control subgroups to produce this track, and across the affected arms to
produce the companion
Affected/Case Individuals track. A variant seen
in both groups appears in both tracks. The build is documented in the
makeDoc, and the scripts are on
GitHub.
Because the merged callset combines cohorts whose redistribution licenses differ, this track is not available for download and is not in the Table Browser. It can be reconstructed from the individual source VCFs using the conversion scripts and the build documentation. The per-project subtracks on the SNV Frequencies collection page document how to obtain each source dataset.
This track is only possible thanks to the data from millions of volunteers around the world who contributed to the population reference projects and to the unaffected/control arms of the disease cohorts. Click the individual project subtracks on the SNV Frequencies collection page for the specific credits and citations of each cohort. Thanks to Alex Ioannidis, UCSC, for the inspiration for this track and to Andreas Lahner, MGZ, for feedback.
For the primary citation of each source cohort, see the References section on the SNV Frequencies collection page. The merged-track build uses the following tools:
Danecek P, McCarthy SA. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics. 2017 Jul 1;33(13):2037-2039. PMID: 28205675; PMC: PMC5870570