64a3f9e7813e823cf724ea188c3928a911578286 max Thu Jun 4 00:32:22 2026 -0700

varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease track) with two matched tracks for visual case-vs-background comparison:

  varFreqsAffected - variants seen in the affected/case arms of disease cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA cases, GREGoR affected, GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control arms of disease cohorts ("all other variants"); ~1.5 million individuals

A variant seen in both groups appears in both tracks. Genotyping-array cohorts stay out of both (varFreqsArray unchanged).

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it reads phenotype tags (affected/unaffected/unknown) from populations.tsv and is_disease/disease_role from databases.tsv, and derives the length-filter ranges from the observed data.

TOPMed reclassified as a population cohort. SPARK WGS display name changed to SFARI SPARK WGS for consistency with the standalone subtracks. Fixed the trackDb mouseOver $-substitution prefix collision by wrapping fields in ${}. New description pages for both tracks.

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsBackground.html src/hg/makeDb/trackDb/human/varFreqsBackground.html
new file mode 100644
index 00000000000..e835ddfe2a4
--- /dev/null
+++ src/hg/makeDb/trackDb/human/varFreqsBackground.html
@@ -0,0 +1,121 @@
+

Description

+

+This track shows small variants (SNVs and short indels) seen in population reference
+cohorts and in unaffected or control individuals of disease-study cohorts, annotated
+with their predicted protein consequence and colored by severity. It is the background half
+of a matched pair: the companion
+Affected/Case Individuals track shows the same
+kind of variants seen in affected or case individuals. Displaying the two together lets you
+see how common a variant is in the general/unaffected population compared with affected
+individuals. For the full list of contributing projects, see the
+SNV Frequencies collection page.
+

+

+The background combines two kinds of data: the population/biobank reference cohorts (such as
+gnomAD HGDP+1kG, TOPMed, ALFA, HRC and the many national WGS projects), and the
+unaffected/control or unknown-phenotype arms of the disease-study cohorts (non-ASD family
+members in SFARI SPARK WES/WGS, SCHEMA controls, and GREGoR unaffected/unknown
+participants). Genotyping-array cohorts are not included. A variant that also appears in
+affected individuals is shown in both this track and the
+Affected/Case Individuals track.
+

+ +

Display Conventions

+

Color by Consequence

+

Variants are colored by their most severe predicted consequence:

+ + + + + + + + + + + + + + +
Color  Consequence class                      Examples
       Protein-truncating / loss-of-function  stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost
       Missense / in-frame                    missense, inframe_insertion, inframe_deletion, protein_altering
       Synonymous                             synonymous, stop_retained
       Non-coding / intergenic                intron, non_coding, intergenic, UTR
+
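The coloring rule can be illustrated with a minimal Python sketch. It assumes that the severity order follows the row order of the table (protein-truncating first, non-coding last) and that unrecognized terms fall into the non-coding class. The names `CLASS_ORDER`, `CONSEQUENCE_CLASS` and `most_severe_class` are hypothetical and are not taken from vcfToBigBed.py.

```python
# Hypothetical sketch: pick the display class for a variant from its
# predicted consequence terms. Class order (most to least severe) is an
# assumption based on the row order of the table.
CLASS_ORDER = ["lof", "missense", "synonymous", "noncoding"]

CONSEQUENCE_CLASS = {
    "stop_gained": "lof", "frameshift": "lof", "splice_donor": "lof",
    "splice_acceptor": "lof", "stop_lost": "lof", "start_lost": "lof",
    "missense": "missense", "inframe_insertion": "missense",
    "inframe_deletion": "missense", "protein_altering": "missense",
    "synonymous": "synonymous", "stop_retained": "synonymous",
    "intron": "noncoding", "non_coding": "noncoding",
    "intergenic": "noncoding", "UTR": "noncoding",
}


def most_severe_class(consequences):
    """Return the most severe display class among the given consequence terms."""
    # Assumption: terms not listed in the table are treated as non-coding.
    classes = [CONSEQUENCE_CLASS.get(term, "noncoding") for term in consequences]
    if not classes:
        return "noncoding"
    # The class with the smallest index in CLASS_ORDER is the most severe.
    return min(classes, key=CLASS_ORDER.index)
```

For example, a variant annotated as both `missense` and `stop_gained` on different transcripts would be placed in the protein-truncating class under this sketch.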

+The score (used for shading) is the background allele frequency (the maximum across the
+population cohorts and unaffected/control arms) times 1000.
+
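As a rough illustration of the score definition, the following Python sketch takes per-cohort allele frequencies and returns the maximum times 1000. Rounding to an integer and clamping to the BED score range of 0 to 1000 are assumptions, and the function name `background_score` is hypothetical rather than the one used in vcfToBigBed.py.

```python
def background_score(cohort_afs):
    """Return max allele frequency across cohorts, times 1000, as an integer.

    cohort_afs maps a cohort name to its allele frequency (0.0 to 1.0),
    or to None when the cohort has no data for this variant.
    """
    observed = [af for af in cohort_afs.values() if af is not None]
    if not observed:
        return 0
    # Assumption: the score is rounded and clamped to the BED range 0..1000.
    return min(1000, round(max(observed) * 1000))
```

With illustrative frequencies of 0.012 in one cohort and 0.03 in another, this sketch gives a score of 30.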

+ +

Filters

+ + +

Methods

+

+Variant-frequency VCFs from the contributing cohorts were stripped of unneeded INFO fields,
+normalized with bcftools norm (splitting multi-allelic sites), and merged with
+bcftools merge. The merged callset was annotated with predicted protein
+consequences using bcftools csq against the
+Ensembl
+GRCh38 release 115 gene models.
+
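The sequence of bcftools steps described above might look roughly like the following Python sketch, which only builds the command lines and does not run them. The file names, the list of INFO fields to keep, and the exact options are assumptions chosen for illustration; the actual commands are in the makeDoc. The function name `build_pipeline` is hypothetical.

```python
def build_pipeline(cohort_vcfs, ref_fasta, gff3,
                   keep_info="^INFO/AC,INFO/AN,INFO/AF"):
    """Return a list of bcftools command lines (argument lists), in run order.

    keep_info uses the bcftools annotate -x syntax; a leading ^ means
    "remove every INFO tag except these". The kept tags are an assumption.
    """
    commands = []
    normalized = []
    for vcf in cohort_vcfs:
        stripped = vcf.replace(".vcf.gz", ".stripped.vcf.gz")
        norm = vcf.replace(".vcf.gz", ".norm.vcf.gz")
        # 1. Strip unneeded INFO fields.
        commands.append(["bcftools", "annotate", "-x", keep_info,
                         "-Oz", "-o", stripped, vcf])
        # 2. Split multi-allelic sites into one record per alternate allele.
        commands.append(["bcftools", "norm", "-m", "-any",
                         "-Oz", "-o", norm, stripped])
        # bcftools merge needs indexed inputs.
        commands.append(["bcftools", "index", "-t", norm])
        normalized.append(norm)
    # 3. Merge all cohorts into one callset.
    commands.append(["bcftools", "merge", "-Oz", "-o", "merged.vcf.gz",
                     *normalized])
    # 4. Annotate predicted consequences against the gene models.
    commands.append(["bcftools", "csq", "-f", ref_fasta, "-g", gff3,
                     "-Oz", "-o", "merged.csq.vcf.gz", "merged.vcf.gz"])
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)`; keeping command construction separate from execution makes the sequence easy to inspect before running it.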

+

+A custom Python script (vcfToBigBed.py) then read the per-cohort allele
+frequencies and, for each variant, aggregated them (summing counts and taking maximum
+frequencies) across the population cohorts
+and unaffected/control subgroups to produce this track, and across the affected arms to
+produce the companion
+Affected/Case Individuals track. A variant seen
+in both groups appears in both tracks. The build is documented in the
+makeDoc, and the scripts are on
+GitHub.
+
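The split into the two tracks can be sketched in Python as follows. This is not the code of vcfToBigBed.py: the function name, the input layout (population ID, alternate allele count, allele number) and the population IDs in the usage example are hypothetical. The sketch assumes that any population not tagged "affected" counts as background, and that genotyping-array populations are skipped.

```python
from collections import defaultdict


def split_variant(cohort_rows, phenotype_of, array_pops=frozenset()):
    """Aggregate one variant's per-population counts into the two groups.

    cohort_rows:  iterable of (population_id, alt_count, allele_number)
    phenotype_of: maps population_id to "affected", "unaffected" or "unknown";
                  populations without a tag are treated as background.
    array_pops:   population IDs from genotyping arrays, excluded from both.
    """
    groups = defaultdict(lambda: {"ac": 0, "an": 0, "max_af": 0.0})
    for pop, ac, an in cohort_rows:
        if pop in array_pops or ac == 0 or an == 0:
            continue  # not observed in this population, or excluded
        is_affected = phenotype_of.get(pop) == "affected"
        group = groups["affected" if is_affected else "background"]
        group["ac"] += ac                                   # summed counts
        group["an"] += an
        group["max_af"] = max(group["max_af"], ac / an)     # maximum frequency
    # A variant gets one entry per group in which it was observed, so a
    # variant seen in both groups appears in both tracks.
    return dict(groups)
```

A usage example with made-up counts:

```python
rows = [("spark_probands", 3, 1000), ("spark_family", 1, 2000), ("gnomad", 10, 10000)]
tags = {"spark_probands": "affected", "spark_family": "unaffected"}
result = split_variant(rows, tags)
# result has both an "affected" and a "background" entry for this variant.
```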

+ +

Data Access

+

+Because the merged callset combines cohorts whose redistribution licenses differ, this
+track is not available for download and is not in the Table Browser. It can be
+reconstructed from the individual source VCFs using the
+conversion scripts and the
+build documentation. The per-project subtracks on the
+SNV Frequencies collection page document how to obtain
+each source dataset.
+

+ +

Credits

+

+This track is only possible thanks to the data from millions of volunteers around the world
+who contributed to the population reference projects and to the unaffected/control arms of
+the disease cohorts. Click the individual project subtracks on the
+SNV Frequencies collection page for the specific credits
+and citations of each cohort. Thanks to Alex Ioannidis, UCSC, for the inspiration for this
+track and to Andreas Lahner, MGZ, for feedback.
+

+ +

References

+

+For the primary citation of each source cohort, see the References section on the
+SNV Frequencies collection page. The merged-track build
+uses the following tools:
+

+

+Danecek P, McCarthy SA.
+
+BCFtools/csq: haplotype-aware variant consequences.
+Bioinformatics. 2017 Jul 1;33(13):2037-2039.
+PMID: 28205675;
+PMC: PMC5870570
+