64a3f9e7813e823cf724ea188c3928a911578286 max Thu Jun 4 00:32:22 2026 -0700

varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease
track) with two matched tracks for visual case-vs-background comparison:

  varFreqsAffected   - variants seen in the affected/case arms of disease
                       cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA cases,
                       GREGoR affected, GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control
                       arms of disease cohorts ("all other variants");
                       ~1.5 million individuals

A variant seen in both groups appears in both tracks. Genotyping-array
cohorts stay out of both (varFreqsArray unchanged).

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it
reads phenotype tags (affected/unaffected/unknown) from populations.tsv and
is_disease/disease_role from databases.tsv, and derives the length-filter
ranges from the observed data.

TOPMed reclassified as a population cohort. SPARK WGS display name changed to
SFARI SPARK WGS for consistency with the standalone subtracks. Fixed the
trackDb mouseOver $-substitution prefix collision by wrapping fields in ${}.
New description pages for both tracks.

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsAffected.html src/hg/makeDb/trackDb/human/varFreqsAffected.html
new file mode 100644
index 00000000000..0099d2fda9d
--- /dev/null
+++ src/hg/makeDb/trackDb/human/varFreqsAffected.html
@@ -0,0 +1,134 @@
+
+This track shows small variants (SNVs and short indels) that were observed in
+affected or case individuals of disease-study cohorts, annotated with their
+predicted protein consequence and colored by severity. It is one half of a matched pair:
+the companion
+Population + Unaffected track shows the same
+kind of variants seen in population reference cohorts and in unaffected relatives or
+controls. Displaying the two together lets you compare, for example, how often a
+loss-of-function variant in a gene of interest is seen in affected individuals versus the
+general/unaffected background. For the full list of contributing projects, see the
+SNV Frequencies collection page.
+
+
+The affected counts are drawn from the affected or case arm of five disease-study cohorts:
+SFARI SPARK WES and SFARI SPARK WGS (autism spectrum disorder probands), SCHEMA
+(schizophrenia cases), GREGoR (affected rare-disease participants), and GA4K (a pediatric
+rare-disease cohort). For SFARI SPARK WES, SFARI SPARK WGS, SCHEMA, and GREGoR the source data carry an
+explicit affected/unaffected (or case/control) label and only the affected arm feeds this
+track. GA4K reports a single cohort-wide frequency with no per-individual label; because it
+is a rare-disease cohort it is counted as affected here, with the caveat that it enrolls
+parent-child trios, so a minority of its carriers are unaffected parents. Genotyping-array
+cohorts are not included in either track.
+
+ +Variants are colored by their most severe predicted consequence:
+| Color | Consequence class | Examples |
|---|---|---|
| | Protein-truncating / loss-of-function | stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost |
| | Missense / in-frame | missense, inframe_insertion, inframe_deletion, protein_altering |
| | Synonymous | synonymous, stop_retained |
| | Non-coding / intergenic | intron, non_coding, intergenic, UTR |
+The score (used for shading) is the affected/case allele frequency times 1000. Variants
+contributed only by a cohort that reports allele counts but no allele frequency (GREGoR)
+have a score of 0 but are still drawn in their consequence color.
+
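The score rule above can be sketched in a few lines of Python. This is an illustration only, not code from vcfToBigBed.py: the function name `bed_score`, the use of `None` for a missing frequency, and the rounding and clamping behavior are all assumptions.

```python
from typing import Optional


def bed_score(affected_af: Optional[float]) -> int:
    """Scale an allele frequency in [0, 1] to an integer score (hypothetical helper)."""
    if affected_af is None:
        # Cohorts that report only allele counts (no frequency) get score 0.
        return 0
    # Rounding to the nearest integer and clamping to 0..1000 are assumptions;
    # the source only states "allele frequency times 1000".
    return max(0, min(1000, round(affected_af * 1000)))
```

Under these assumptions, an affected allele frequency of 0.25 maps to a score of 250, and a variant with no reported frequency maps to 0 while keeping its consequence color.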
+
+
+To look for protein-truncating variants that are common in affected individuals but rare
+in the background, set the Consequence filter to Stop Gained, Frameshift, Splice Donor and
+Splice Acceptor (these appear red), then add an upper limit on the
+Background AF filter. Each variant here carries both its affected frequency and its
+background frequency, so this isolates variants seen in cases with little or no presence in
+the population/unaffected set. Comparing visually against the
+Population + Unaffected track shows the same
+contrast across a whole gene.
+
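The same selection can be expressed outside the browser as a simple predicate. The sketch below is hypothetical: the `Variant` record, its field names, and the example values are invented for illustration and do not reflect the track's actual field layout.

```python
from dataclasses import dataclass
from typing import List

# Consequence classes named in the filter recipe above.
PTV_CONSEQUENCES = {"stop_gained", "frameshift", "splice_donor", "splice_acceptor"}


@dataclass
class Variant:
    # Hypothetical record; the real track's field names may differ.
    name: str
    consequence: str
    affected_af: float
    background_af: float


def ptv_rare_in_background(variants: List[Variant], max_background_af: float) -> List[Variant]:
    """Keep protein-truncating variants whose background AF is at or below the limit."""
    return [
        v for v in variants
        if v.consequence in PTV_CONSEQUENCES and v.background_af <= max_background_af
    ]


# Made-up example values, for illustration only.
example = [
    Variant("v1", "stop_gained", 0.002, 0.0),
    Variant("v2", "frameshift", 0.001, 0.05),
    Variant("v3", "missense", 0.010, 0.0),
]
selected = ptv_rare_in_background(example, max_background_af=0.0001)
```

Here only `v1` passes: `v2` is too common in the background and `v3` is not protein-truncating.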
+ +
+Variant-frequency VCFs from the contributing cohorts were stripped of unneeded INFO fields,
+normalized with bcftools norm (splitting multi-allelic sites), and merged with
+bcftools merge. The merged callset was annotated with predicted protein
+consequences using bcftools csq against the
+Ensembl
+GRCh38 release 115 gene models.
+
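As a rough sketch of the steps just described, the Python below only builds the bcftools command lines; it does not run them. The file names, the choice of which INFO tags to keep, and the output naming are assumptions, and the actual build (documented in the makeDoc) may use different options.

```python
from typing import List


def build_pipeline(cohort_vcfs: List[str], ref_fasta: str, gff3: str) -> List[List[str]]:
    """Return bcftools command lines for strip -> norm -> merge -> csq (illustrative only)."""
    cmds: List[List[str]] = []
    normalized: List[str] = []
    for vcf in cohort_vcfs:
        stripped = vcf.replace(".vcf.gz", ".stripped.vcf.gz")
        normed = vcf.replace(".vcf.gz", ".norm.vcf.gz")
        # Keep only AC/AN/AF in INFO; which tags the real build keeps is an assumption.
        cmds.append(["bcftools", "annotate", "-x", "^INFO/AC,INFO/AN,INFO/AF",
                     "-Oz", "-o", stripped, vcf])
        # "-m -any" splits multi-allelic sites into one record per alternate allele.
        cmds.append(["bcftools", "norm", "-m", "-any", "-f", ref_fasta,
                     "-Oz", "-o", normed, stripped])
        normalized.append(normed)
    # bcftools merge expects bgzipped, indexed inputs; indexing steps are omitted here.
    cmds.append(["bcftools", "merge", "-Oz", "-o", "merged.vcf.gz", *normalized])
    # Consequence prediction against a GFF3 gene model file (hypothetical file name).
    cmds.append(["bcftools", "csq", "-f", ref_fasta, "-g", gff3,
                 "-Oz", "-o", "merged.csq.vcf.gz", "merged.vcf.gz"])
    return cmds
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once the inputs exist. Building the commands as data first makes the order of the four stages easy to inspect.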
+A custom Python script (vcfToBigBed.py) then read the per-cohort allele
+frequencies and, for each variant, summed/maximized the counts across the affected arms
+(case/proband subgroups, plus GA4K whole-cohort) to produce this track, and across the
+population cohorts and unaffected/control subgroups to produce the companion
+Population + Unaffected track. A variant seen in
+both groups appears in both tracks. The build is documented in the
+makeDoc, and the scripts are on
+GitHub.
+
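The two-track split can be illustrated with a minimal sketch. It assumes each observation has already been assigned to the "affected" or "background" group and shows only the summing of allele counts; the text above says counts are "summed/maximized", so the real vcfToBigBed.py logic is likely richer. The function name and the tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def split_tracks(observations: Iterable[Tuple[str, str, int]]) -> Dict[str, Dict[str, int]]:
    """Sum allele counts per variant within each group (illustrative only).

    observations: (variant_id, group, allele_count), group is "affected" or "background".
    """
    totals: Dict[str, Dict[str, int]] = {
        "affected": defaultdict(int),
        "background": defaultdict(int),
    }
    for variant_id, group, allele_count in observations:
        if allele_count > 0:
            # A variant enters a track only if it was actually seen in that group.
            totals[group][variant_id] += allele_count
    return {group: dict(per_variant) for group, per_variant in totals.items()}


# Made-up observations, for illustration only.
tracks = split_tracks([
    ("chr1:100:A>T", "affected", 3),
    ("chr1:100:A>T", "background", 10),
    ("chr1:200:G>C", "background", 1),
    ("chr1:300:C>G", "affected", 0),
])
```

In this example `chr1:100:A>T` lands in both tracks, `chr1:200:G>C` only in the background track, and `chr1:300:C>G` in neither because its count is zero.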
+Because the merged callset combines cohorts whose redistribution licenses differ, this
+track is not available for download and is not in the Table Browser. It can be
+reconstructed from the individual source VCFs using the
+conversion scripts and the
+build documentation. The per-project subtracks on the
+SNV Frequencies collection page document how to obtain
+each source dataset.
+
+
+
+This track is only possible thanks to the data from the participants and families of the
+SFARI SPARK, SCHEMA, GREGoR and GA4K studies. Click the individual project subtracks on the
+SNV Frequencies collection page for the specific credits
+and citations of each cohort. Thanks to Alex Ioannidis, UCSC, for the inspiration for this
+track and to Andreas Lahner, MGZ, for feedback.
+
+
+
+For the primary citation of each source cohort, see the References section on the
+SNV Frequencies collection page. The merged-track build
+uses the following tools:
+
+
+Danecek P, McCarthy SA.
+
+BCFtools/csq: haplotype-aware variant consequences.
+Bioinformatics. 2017 Jul 1;33(13):2037-2039.
+PMID: 28205675;
+PMC: PMC5870570
+