6b285a53b036b309e3c7a9b61d3741731088a172 lrnassar Fri Jun 12 02:35:01 2026 -0700

varFreqs: switch affectedAF/backgroundAF from max-across-cohorts to pooled sum(AC)/sum(AN) so the rate matches the carrier count scale. Per-arm AN is derived as round(AC/AF) when both are reported. An optional "default_an" column was added to databases.tsv so AF-only cohorts (ABraOM, ALFA) can synthesize a denominator from their cohort size; without it those cohorts had been silently dropped from the pooled rate. New affectedAN and backgroundAN columns expose the pool denominator. The mouseOver now reads "Affected AC/AN: 33238 / 213153" so the ratio is visible. Per-arm cohorts that ship only AC and no default_an (MGRB, GREGoR AC_AFFECTED/UNAFFECTED/UNKNOWN, AllOfUs per-population) are still listed in affectedCohorts/backgroundSources but contribute 0 to the pool, preserving the invariant pool_AF <= 1. The build pipeline is unchanged: re-run vcfToBigBed.py --split-affected against the existing merged.annotated.vcf.gz. refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsAffected.html src/hg/makeDb/trackDb/human/varFreqsAffected.html
index 0099d2fda9d..36af10b8e43 100644
--- src/hg/makeDb/trackDb/human/varFreqsAffected.html
+++ src/hg/makeDb/trackDb/human/varFreqsAffected.html
@@ -1,134 +1,147 @@

Description

This track shows small variants (SNVs and short indels) that were observed in affected or case individuals of disease-study cohorts, annotated with their predicted protein consequence and colored by severity. It is one half of a matched pair: the companion Population + Unaffected track shows the same kind of variants seen in population reference cohorts and in unaffected relatives or controls. Displaying the two together lets you compare, for example, how often a loss-of-function variant in a gene of interest is seen in affected individuals versus the general/unaffected background. For the full list of contributing projects, see the SNV Frequencies collection page.

The affected counts are drawn from the affected or case arm of five disease-study cohorts: SFARI SPARK WES and SFARI SPARK WGS (autism spectrum disorder probands), SCHEMA (schizophrenia cases), GREGoR (affected rare-disease participants), and GA4K (a pediatric rare-disease cohort). For SPARK, SFARI WGS, SCHEMA, and GREGoR the source data carries an explicit affected/unaffected (or case/control) label and only the affected arm feeds this track. GA4K reports a single cohort-wide frequency with no per-individual label; because it is a rare-disease cohort it is counted as affected here, with the caveat that it enrolls parent-child trios, so a minority of its carriers are unaffected parents. Genotyping-array cohorts are not included in either track.

Display Conventions

Color by Consequence

Variants are colored by their most severe predicted consequence:

Color	Consequence class	Examples
	Protein-truncating / loss-of-function	stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost
	Missense / in-frame	missense, inframe_insertion, inframe_deletion, protein_altering
	Synonymous	synonymous, stop_retained
	Non-coding / intergenic	intron, non_coding, intergenic, UTR

-The score (used for shading) is the affected/case allele frequency times 1000. Variants
-contributed only by a cohort that reports allele counts but no allele frequency (GREGoR)
-have a score of 0 but are still drawn in their consequence color.
+The score (used for shading) is the pooled affected/case allele frequency times 1000.
+

+ +

Pooled allele frequency

+Affected AF is the pooled rate across contributing affected arms:
+affectedAF = sum(AC) / sum(AN), where affectedAC sums the allele counts
+and affectedAN sums the allele numbers across each cohort/arm that ships both AC and
+AF (the per-arm AN is derived as round(AC / AF)). Cohorts that publish only AF
+contribute via a configured default_an in the build configuration. Cohorts
+that publish only AC and have no default_an set (currently GREGoR's per-arm
+AC_AFFECTED/UNAFFECTED/UNKNOWN) are listed in affectedCohorts but do not contribute
+to the pool numerator or denominator; their carriers are visible in the per-database AC
+column instead. The pooled rate is preferred over a max-across-cohorts statistic so a
+small cohort with a high local AF cannot dominate the displayed frequency.
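The pooling rule described above can be sketched in a few lines of Python. This is an illustration only, not the code in vcfToBigBed.py: the function names (`arm_counts`, `pooled_af`, `af_to_score`) are hypothetical, and three behaviors are assumptions rather than facts from the source: that an AF-only arm contributes AC = round(AF * default_an), that an AC-only arm with a default_an uses that value as its AN, and that the score is rounded and capped at 1000.

```python
from typing import Iterable, Optional, Tuple

# One cohort arm: (AC, AF, default_an); any element may be None if not reported.
Arm = Tuple[Optional[int], Optional[float], Optional[int]]


def arm_counts(ac: Optional[int], af: Optional[float],
               default_an: Optional[int] = None) -> Optional[Tuple[int, int]]:
    """Return the (AC, AN) pair one arm adds to the pool, or None if it adds nothing."""
    if ac is not None and af is not None and af > 0:
        # Arm ships both AC and AF: derive the denominator as round(AC / AF).
        return ac, round(ac / af)
    if default_an:
        if ac is not None:
            # Assumption: an AC-only arm with a configured default_an uses it as AN.
            return ac, default_an
        if af is not None:
            # Assumption: an AF-only arm synthesizes AC from AF and default_an.
            return round(af * default_an), default_an
    # AC-only arm without default_an (or nothing usable): excluded from the pool.
    return None


def pooled_af(arms: Iterable[Arm]) -> Tuple[int, int, float]:
    """Return (sum AC, sum AN, sum AC / sum AN) over the arms that can contribute."""
    total_ac = 0
    total_an = 0
    for ac, af, default_an in arms:
        pair = arm_counts(ac, af, default_an)
        if pair is None:
            continue
        total_ac += pair[0]
        total_an += pair[1]
    rate = total_ac / total_an if total_an else 0.0
    return total_ac, total_an, rate


def af_to_score(af: float) -> int:
    """Score used for shading: AF times 1000 (rounding and the 1000 cap are assumptions)."""
    return min(1000, round(af * 1000))


# Made-up arms: one with AC and AF, one AC-only (skipped), one AF-only with default_an.
example_arms = [(10, 0.01, None), (5, None, None), (None, 0.002, 2000)]
ac_sum, an_sum, rate = pooled_af(example_arms)
print(ac_sum, an_sum)  # 14 3000
```

In this made-up example the AC-only arm adds nothing to either sum, so the pooled rate is 14 / 3000 rather than a maximum over the per-arm frequencies.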

Finding case-enriched loss-of-function variants

To look for protein-truncating variants that are common in affected individuals but rare in the background, set the Consequence filter to Stop Gained, Frameshift, Splice Donor and Splice Acceptor (these appear red), then add an upper limit on the Background AF filter. Each variant here carries both its affected frequency and its background frequency, so this isolates variants seen in cases with little or no presence in the population/unaffected set. Comparing visually against the Population + Unaffected track shows the same contrast across a whole gene.
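The same selection can be expressed outside the browser. The sketch below assumes each variant has already been loaded into a Python dict; the key `consequence` and the variant records are hypothetical, while `affectedAF` and `backgroundAF` follow the column names mentioned in the commit message. The consequence match uses OR logic over the comma-separated tokens, as the Consequence filter does.

```python
# Loss-of-function tokens named in the filter recipe above.
LOF_TOKENS = {"stop_gained", "frameshift", "splice_donor", "splice_acceptor"}


def is_case_enriched_lof(variant: dict, max_background_af: float) -> bool:
    """True if any consequence token is LoF and background AF is at or below the limit."""
    tokens = {t.strip() for t in variant["consequence"].split(",")}
    has_lof = bool(tokens & LOF_TOKENS)  # OR logic: one matching token is enough
    return has_lof and variant["backgroundAF"] <= max_background_af


# Made-up records for illustration only.
variants = [
    {"name": "v1", "consequence": "stop_gained", "affectedAF": 0.002, "backgroundAF": 0.0},
    {"name": "v2", "consequence": "missense,splice_donor", "affectedAF": 0.001, "backgroundAF": 0.00005},
    {"name": "v3", "consequence": "frameshift", "affectedAF": 0.01, "backgroundAF": 0.02},
    {"name": "v4", "consequence": "synonymous", "affectedAF": 0.003, "backgroundAF": 0.0},
]

hits = [v["name"] for v in variants if is_case_enriched_lof(v, max_background_af=1e-4)]
print(hits)  # ['v1', 'v2']
```

Here v3 is dropped because its background frequency exceeds the limit, and v4 because it carries no loss-of-function token.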

Filters

Variant Type and Consequence: restrict to SNV/insertion/deletion/MNV and to predicted consequence classes (the Consequence filter uses OR logic over the comma-separated tokens on each variant).
-Affected/case AF and Affected/case AC: the maximum allele frequency and
-summed allele count across the affected arms.
-Background AF and Background AC: the same variant's frequency in the
+Affected/case AF, AC, AN: the pooled allele frequency
+(sum AC / sum AN), summed allele count, and summed allele number across the
+contributing affected arms. See "Pooled allele frequency" above.
+Background AF, AC, AN: the same triple computed across the
population + unaffected background, for filtering case-enriched variants.
Affected/case cohort: restrict to variants seen in specific disease cohorts (for example, only the two autism cohorts).
Reference/Alternate Length and Length Change: filter by allele length.

Methods

Variant-frequency VCFs from the contributing cohorts were stripped of unneeded INFO fields, normalized with bcftools norm (splitting multi-allelic sites), and merged with bcftools merge. The merged callset was annotated with predicted protein consequences using bcftools csq against the Ensembl GRCh38 release 115 gene models.

A custom Python script (vcfToBigBed.py) then read the per-cohort allele frequencies and, for each variant, summed/maximized the counts across the affected arms (case/proband subgroups, plus GA4K whole-cohort) to produce this track, and across the population cohorts and unaffected/control subgroups to produce the companion Population + Unaffected track. A variant seen in both groups appears in both tracks. The build is documented in the makeDoc, and the scripts are on GitHub.

Data Access

Because the merged callset combines cohorts whose redistribution licenses differ, this track is not available for download and is not in the Table Browser. It can be reconstructed from the individual source VCFs using the conversion scripts and the build documentation. The per-project subtracks on the SNV Frequencies collection page document how to obtain each source dataset.

Credits

This track is only possible thanks to the data from the participants and families of the SFARI SPARK, SCHEMA, GREGoR and GA4K studies. Click the individual project subtracks on the SNV Frequencies collection page for the specific credits and citations of each cohort. Thanks to Alex Ioannidis, UCSC, for the inspiration for this track and to Andreas Lahner, MGZ, for feedback.

References

For the primary citation of each source cohort, see the References section on the SNV Frequencies collection page. The merged-track build uses the following tools:

Danecek P, McCarthy SA. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics. 2017 Jul 1;33(13):2037-2039. PMID: 28205675; PMC: PMC5870570