64a3f9e7813e823cf724ea188c3928a911578286
max
Thu Jun 4 00:32:22 2026 -0700

varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease track) with two matched tracks for visual case-vs-background comparison:

  varFreqsAffected - variants seen in the affected/case arms of disease cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA cases, GREGoR affected, GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control arms of disease cohorts ("all other variants"); ~1.5 million individuals

A variant seen in both groups appears in both tracks. Genotyping-array cohorts stay out of both (varFreqsArray unchanged).

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it reads phenotype tags (affected/unaffected/unknown) from populations.tsv and is_disease/disease_role from databases.tsv, and derives the length-filter ranges from the observed data. TOPMed reclassified as a population cohort. SPARK WGS display name changed to SFARI SPARK WGS for consistency with the standalone subtracks. Fixed the trackDb mouseOver $-substitution prefix collision by wrapping fields in ${}. New description pages for both tracks.

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsAffected.html src/hg/makeDb/trackDb/human/varFreqsAffected.html
new file mode 100644
index 00000000000..0099d2fda9d
--- /dev/null
+++ src/hg/makeDb/trackDb/human/varFreqsAffected.html
@@ -0,0 +1,134 @@
+

Description

+This track shows small variants (SNVs and short indels) that were observed in +affected or case individuals of disease-study cohorts, annotated with their +predicted protein consequence and colored by severity. It is one half of a matched pair: +the companion +Population + Unaffected track shows the same +kind of variants seen in population reference cohorts and in unaffected relatives or +controls. Displaying the two together lets you compare, for example, how often a +loss-of-function variant in a gene of interest is seen in affected individuals versus the +general/unaffected background. For the full list of contributing projects, see the +SNV Frequencies collection page. +

+The affected counts are drawn from the affected or case arm of five disease-study cohorts: +SFARI SPARK WES and SFARI SPARK WGS (autism spectrum disorder probands), SCHEMA +(schizophrenia cases), GREGoR (affected rare-disease participants), and GA4K (a pediatric +rare-disease cohort). For SFARI SPARK WES, SFARI SPARK WGS, SCHEMA, and GREGoR, the source +data carries an explicit affected/unaffected (or case/control) label, and only the affected +arm feeds this track. GA4K reports a single cohort-wide frequency with no per-individual +label; because it is a rare-disease cohort, it is counted as affected here, with the caveat +that it enrolls parent-child trios, so a minority of its carriers are unaffected parents. +Genotyping-array cohorts are not included in either track. +

+ +

Display Conventions

Color by Consequence

Variants are colored by their most severe predicted consequence:


Color	Consequence class	Examples
	Protein-truncating / loss-of-function	stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost
	Missense / in-frame	missense, inframe_insertion, inframe_deletion, protein_altering
	Synonymous	synonymous, stop_retained
	Non-coding / intergenic	intron, non_coding, intergenic, UTR
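As a rough illustration of "most severe predicted consequence", the sketch below ranks the four classes from the table and returns the first class that matches any token on a variant. The class keys (`lof`, `missense`, `synonymous`, `noncoding`), the function name, and the fallback for unlisted tokens are assumptions made for this example. They are not taken from vcfToBigBed.py, and the token names are only the examples listed in the table.

```python
# Consequence classes from the table, ordered from most to least severe.
# The short class keys are hypothetical labels chosen for this sketch.
SEVERITY_CLASSES = [
    ("lof", {"stop_gained", "frameshift", "splice_donor",
             "splice_acceptor", "stop_lost", "start_lost"}),
    ("missense", {"missense", "inframe_insertion",
                  "inframe_deletion", "protein_altering"}),
    ("synonymous", {"synonymous", "stop_retained"}),
    ("noncoding", {"intron", "non_coding", "intergenic", "UTR"}),
]


def most_severe_class(consequences: str) -> str:
    """Return the most severe class among comma-separated consequence tokens."""
    tokens = {t.strip() for t in consequences.split(",") if t.strip()}
    for name, members in SEVERITY_CLASSES:
        if tokens & members:
            return name
    # Assumption: tokens not listed in the table fall into the least severe class.
    return "noncoding"
```

For example, a variant annotated `missense,stop_gained` would land in the protein-truncating class, because that class is checked first.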

+The score (used for shading) is the affected/case allele frequency times 1000. Variants +contributed only by a cohort that reports allele counts but no allele frequency (GREGoR) +have a score of 0 but are still drawn in their consequence color. +
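The score rule can be sketched as follows. The function name is hypothetical, and the rounding and the clamp to the BED score range of 0 to 1000 are assumptions; the page only states that the score is the allele frequency times 1000 and that cohorts reporting counts without a frequency yield 0.

```python
from typing import Optional


def bed_score(affected_af: Optional[float]) -> int:
    """Map an affected/case allele frequency to a BED score in 0..1000."""
    if affected_af is None:
        # No allele frequency reported (allele counts only): score is 0.
        return 0
    # Assumption: round to the nearest integer and clamp to the BED range.
    return max(0, min(1000, round(affected_af * 1000)))
```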

+ +

Finding case-enriched loss-of-function variants

+To look for protein-truncating variants that are common in affected individuals but rare +in the background, set the Consequence filter to Stop Gained, Frameshift, Splice Donor and +Splice Acceptor (these appear red), then add an upper limit on the +Background AF filter. Each variant here carries both its affected frequency and its +background frequency, so this isolates variants seen in cases with little or no presence in +the population/unaffected set. Comparing visually against the +Population + Unaffected track shows the same +contrast across a whole gene. +
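The same selection can be sketched offline on records that carry both frequencies. This is a minimal sketch: the `Variant` fields, the function name, and the inclusive upper limit are assumptions for illustration and do not reflect the track's actual bigBed schema.

```python
from dataclasses import dataclass

# The four consequence tokens named in the filter recipe above.
LOF_TOKENS = {"stop_gained", "frameshift", "splice_donor", "splice_acceptor"}


@dataclass
class Variant:
    name: str
    consequences: str      # comma-separated consequence tokens
    affected_af: float     # frequency in the affected/case arms
    background_af: float   # frequency in the population + unaffected set


def is_case_enriched_lof(v: Variant, max_background_af: float) -> bool:
    """True if any token is loss-of-function (OR logic) and background AF is low."""
    tokens = set(v.consequences.split(","))
    return bool(tokens & LOF_TOKENS) and v.background_af <= max_background_af


variants = [
    Variant("v1", "stop_gained", 0.002, 0.0),
    Variant("v2", "frameshift,intron", 0.001, 0.01),
    Variant("v3", "missense", 0.003, 0.0),
    Variant("v4", "intron,splice_donor", 0.0005, 0.00005),
]
hits = [v.name for v in variants if is_case_enriched_lof(v, max_background_af=0.0001)]
```

Here `v2` is dropped because its background frequency exceeds the limit, and `v3` is dropped because it carries no loss-of-function token.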

+ +

Filters

Variant Type and Consequence: restrict to SNV/insertion/deletion/MNV + and to predicted consequence classes (the Consequence filter uses OR logic over the + comma-separated tokens on each variant).
Affected/case AF and Affected/case AC: the maximum allele frequency and + summed allele count across the affected arms.
Background AF and Background AC: the same variant's frequency in the + population + unaffected background, for filtering case-enriched variants.
Affected/case cohort: restrict to variants seen in specific disease cohorts + (for example, only the two autism cohorts).
Reference/Alternate Length and Length Change: filter by allele length.
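The values behind the Variant Type and Length Change filters can be derived from the reference and alternate alleles alone. The sketch below uses a common convention (equal lengths above one base count as MNV); whether vcfToBigBed.py applies exactly these rules is an assumption, and both function names are hypothetical.

```python
def variant_type(ref: str, alt: str) -> str:
    """Classify a normalized, biallelic variant by allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"  # assumption: same-length multi-base substitution
    return "insertion" if len(alt) > len(ref) else "deletion"


def length_change(ref: str, alt: str) -> int:
    """Alternate length minus reference length (negative for deletions)."""
    return len(alt) - len(ref)
```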

+ +

Methods

+Variant-frequency VCFs from the contributing cohorts were stripped of unneeded INFO fields, +normalized with bcftools norm (splitting multi-allelic sites), and merged with +bcftools merge. The merged callset was annotated with predicted protein +consequences using bcftools csq against the +Ensembl +GRCh38 release 115 gene models. +
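The sequence of bcftools steps can be sketched as a list of command lines. This helper only builds the argument lists and does not run them. All file names, the set of INFO tags to keep, the intermediate naming scheme, and the choice of `-m none` for the merge are assumptions for illustration; the exact invocations used for the track are in the makeDoc, not here.

```python
from typing import List


def build_pipeline(cohort_vcfs: List[str], keep_info: List[str],
                   ref_fasta: str, gff3: str, out_vcf: str) -> List[List[str]]:
    """Return bcftools command lines: strip INFO, normalize, index, merge, csq."""
    cmds: List[List[str]] = []
    normalized: List[str] = []
    # "^INFO/A,INFO/B" tells `bcftools annotate -x` to remove all INFO tags except A and B.
    keep = "^" + ",".join(f"INFO/{tag}" for tag in keep_info)
    for vcf in cohort_vcfs:
        stripped = vcf.replace(".vcf.gz", ".stripped.vcf.gz")
        normed = vcf.replace(".vcf.gz", ".norm.vcf.gz")
        cmds.append(["bcftools", "annotate", "-x", keep, "-O", "z", "-o", stripped, vcf])
        # `-m -any` splits multi-allelic sites into one record per alternate allele.
        cmds.append(["bcftools", "norm", "-m", "-any", "-O", "z", "-o", normed, stripped])
        # bcftools merge needs indexed inputs.
        cmds.append(["bcftools", "index", "-t", normed])
        normalized.append(normed)
    merged = "merged.vcf.gz"  # hypothetical intermediate file name
    # Assumption: `-m none` keeps the split records from being re-joined on merge.
    cmds.append(["bcftools", "merge", "-m", "none", "-O", "z", "-o", merged] + normalized)
    cmds.append(["bcftools", "csq", "-f", ref_fasta, "-g", gff3,
                 "-O", "z", "-o", out_vcf, merged])
    return cmds
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` in order, once the placeholder paths point at real files.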

+A custom Python script (vcfToBigBed.py) then read the per-cohort allele +frequencies and, for each variant, took the maximum allele frequency and the summed allele +count across the affected arms (case/proband subgroups, plus GA4K whole-cohort) to produce +this track, and across the population cohorts and unaffected/control subgroups to produce +the companion Population + Unaffected track. A variant seen in +both groups appears in both tracks. The build is documented in the +makeDoc, and the scripts are on +GitHub. +
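The per-variant aggregation can be sketched as below, using the maximum allele frequency and summed allele count described under Filters. This is a sketch, not the logic of vcfToBigBed.py: the record layout, the group labels, and the function name are hypothetical, and the step that assigns each cohort arm to a group from populations.tsv and databases.tsv is assumed to have happened already.

```python
from typing import Dict, Iterable, Optional, Tuple

# One record per variant per cohort arm: (variant key, group, allele count, allele frequency).
Record = Tuple[str, str, int, Optional[float]]


def split_counts(records: Iterable[Record]) -> Dict[str, Dict[str, dict]]:
    """Aggregate summed AC and maximum AF per variant for each of the two tracks."""
    tracks: Dict[str, Dict[str, dict]] = {"affected": {}, "background": {}}
    for variant, group, ac, af in records:
        if group not in tracks:
            continue  # e.g. genotyping-array cohorts stay out of both tracks
        entry = tracks[group].setdefault(variant, {"ac": 0, "max_af": None})
        entry["ac"] += ac
        if af is not None:  # some cohorts report counts but no frequency
            prev = entry["max_af"]
            entry["max_af"] = af if prev is None else max(prev, af)
    return tracks


records = [
    ("chr1:100:A:T", "affected", 3, 0.0002),
    ("chr1:100:A:T", "affected", 2, None),
    ("chr1:100:A:T", "background", 40, 0.0001),
    ("chr1:100:A:T", "background", 10, 0.0003),
    ("chr2:50:G:C", "array", 7, 0.01),
]
tracks = split_counts(records)
```

In this toy input, the first variant appears in both tracks with its own counts in each, and the array-only variant appears in neither.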

+ +

Data Access

+Because the merged callset combines cohorts whose redistribution licenses differ, this +track is not available for download and is not in the Table Browser. It can be +reconstructed from the individual source VCFs using the +conversion scripts and the +build documentation. The per-project subtracks on the +SNV Frequencies collection page document how to obtain +each source dataset. +

+ +

Credits

+This track is only possible thanks to the data from the participants and families of the +SFARI SPARK, SCHEMA, GREGoR and GA4K studies. Click the individual project subtracks on the +SNV Frequencies collection page for the specific credits +and citations of each cohort. Thanks to Alex Ioannidis, UCSC, for the inspiration for this +track and to Andreas Lahner, MGZ, for feedback. +

+ +

References

+For the primary citation of each source cohort, see the References section on the +SNV Frequencies collection page. The merged-track build +uses the following tools: +

+Danecek P, McCarthy SA. + +BCFtools/csq: haplotype-aware variant consequences. +Bioinformatics. 2017 Jul 1;33(13):2037-2039. +PMID: 28205675; +PMC: PMC5870570 +