64a3f9e7813e823cf724ea188c3928a911578286
max
Thu Jun 4 00:32:22 2026 -0700
varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease track) with two matched tracks for visual case-vs-background comparison:

  varFreqsAffected - variants seen in the affected/case arms of disease cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA cases, GREGoR affected, GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control arms of disease cohorts ("all other variants"); ~1.5 million individuals

A variant seen in both groups appears in both tracks. Genotyping-array cohorts stay out of both (varFreqsArray unchanged).

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it reads phenotype tags (affected/unaffected/unknown) from populations.tsv and is_disease/disease_role from databases.tsv, and derives the length-filter ranges from the observed data.

TOPMed reclassified as a population cohort. SPARK WGS display name changed to SFARI SPARK WGS for consistency with the standalone subtracks. Fixed the trackDb mouseOver $-substitution prefix collision by wrapping fields in ${}. New description pages for both tracks.

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsDisease.html src/hg/makeDb/trackDb/human/varFreqsDisease.html
deleted file mode 100644
index 010d65019b7..00000000000
--- src/hg/makeDb/trackDb/human/varFreqsDisease.html
+++ /dev/null
@@ -1,198 +0,0 @@
-

Description

-This track merges variants from six disease-focused or clinically-recruited cohorts into a -single bigBed file with predicted protein consequences and cross-database filtering. It -contains 932 million variants from SPARK WES (140k autism families), SFARI WGS (12.5k -autism families), TOPMed (NHLBI heart, lung and blood disease cohorts), SCHEMA -(schizophrenia case/control), GREGoR (rare-disease families), and GA4K (PacBio long-read -pediatric rare disease). Where the source dataset provides per-phenotype counts, those are -exposed as separate AC/AF columns and as filter widgets. -

- -

-For a summary of all available variant frequency databases, including the population-scale -control track and the genotyping-array track, see the -SNV Frequencies supertrack page. -

- -

-Each variant is annotated with its predicted consequence on protein-coding genes -(using bcftools csq with -Ensembl -gene models), and colored by severity. Allele counts and frequencies are shown for each -source database and, where available, broken down by phenotype. -

- -

Display Conventions

- -

Color by Consequence

Variants are colored by their most severe predicted consequence:

- - - - - - - - - - - - - - - - - - - - - - -

Color	Consequence class	Examples
Red	Protein-truncating / Loss-of-function	stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost
Blue	Missense / In-frame	missense, inframe_insertion, inframe_deletion, protein_altering
Green	Synonymous	synonymous, stop_retained
Grey	Non-coding / Intergenic	intron, non_coding, intergenic, UTR

- -

Amino Acid Change Notation

-The "AA change" field uses bcftools csq notation: 23I>23V means position -23 changed from Isoleucine (I) to Valine (V) (missense). 23I alone (no arrow) -means position 23 is Isoleucine and unchanged (synonymous). A "*" indicates a -stop codon (e.g. 45R>45* is a stop_gained). -
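The notation described above can be unpacked mechanically. The following is a minimal Python sketch, assuming the field always has the form position plus one-letter residue(s), optionally followed by ">" and a second position plus residue(s). The function name parse_aa_change and the "kind" labels are hypothetical illustrations and are not part of bcftools or the track pipeline.

```python
import re

# Assumed shape of the "AA change" field: <pos><residues>[><pos><residues>],
# where residues are one-letter codes and "*" marks a stop codon.
_AA_CHANGE = re.compile(r"^(\d+)([A-Z*]+)(?:>(\d+)([A-Z*]+))?$")


def parse_aa_change(field):
    """Split an AA-change string into position, reference and alternate residues."""
    match = _AA_CHANGE.match(field.strip())
    if match is None:
        raise ValueError(f"unrecognized AA change: {field!r}")
    pos, ref, _alt_pos, alt = match.groups()
    if alt is None:
        # No arrow: the residue is unchanged (synonymous).
        return {"pos": int(pos), "ref": ref, "alt": ref, "kind": "unchanged"}
    # Illustrative label only: a "*" present only on the alternate side is a gained stop.
    kind = "stop_gained" if "*" in alt and "*" not in ref else "changed"
    return {"pos": int(pos), "ref": ref, "alt": alt, "kind": kind}
```

For example, parse_aa_change("23I>23V") reports reference I and alternate V at position 23, while parse_aa_change("23I") reports an unchanged residue.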

- -

Filters

-This track supports filtering via the track settings page. Click the track title or use the -"Configure" button to access filters. -

- -

Variant Type and Consequence

Variant Type: SNV, Insertion, Deletion, or MNV.
Consequence: Missense, Synonymous, Stop Gained, Frameshift, Splice Donor, - Splice Acceptor, Intron, 3' UTR, 5' UTR, Non-coding, Intergenic, or Other. The filter - uses OR logic across the comma-separated consequence tokens on each variant. See the - All Databases Combined description page for a - complete description of the "Other" bucket.
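The OR logic of the Consequence filter can be illustrated with a short Python sketch. It assumes each variant carries its consequences as one comma-separated string; the function name consequence_matches and the token spellings used below are hypothetical, and this is not the browser's actual filter code.

```python
def consequence_matches(variant_consequences, selected):
    """Return True if any comma-separated token on the variant is in the selection."""
    # Split the variant's field into individual tokens, ignoring empty pieces.
    tokens = {t.strip() for t in variant_consequences.split(",") if t.strip()}
    # OR logic: a single shared token is enough for the variant to be shown.
    return bool(tokens & set(selected))
```

A variant annotated "missense,intron" therefore passes a filter that selects only "intron", because one matching token is enough.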

- -

Frequency and Count Filters

Max Allele Frequency: Filter by the maximum allele frequency observed across - the six disease cohorts. Useful for finding rare variants enriched in cases.
Total Allele Count: Filter by the sum of allele counts across all six - databases.
Per-database AF and AC: Filter by allele frequency or count in any specific - source. For example, restrict to variants with SCHEMA case AF > 0.001.

- -

Phenotype-stratified Filters

-Four of the six sources publish counts split by phenotype, which lets you compare allele -frequencies between affected and unaffected groups within the same cohort: -

SPARK WES and SFARI WGS: ASD proband counts versus non-ASD family - members (mostly parents and unaffected siblings). The split is from the SPARK - individuals_registration asd column.
SCHEMA: Schizophrenia case counts versus controls, summed across the 39 - analysis cohorts in the original meta-analysis.
GREGoR: Affected, Unaffected, and Unknown disease-status counts.

- -

Source Database

-The Source Database filter restricts the display to variants present in specific -databases. It uses OR logic: selecting multiple databases shows variants found in any of -the selected sources. -

- -

Length Filters

Reference/Alternate Length: Filter by the length of the reference or alternate allele.
Length Change: Filter by the size difference between alternate and reference - (positive = insertion, negative = deletion, zero = SNV or MNV).
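The sign convention for Length Change can be written out as a small Python sketch. It assumes REF and ALT are plain nucleotide strings as found in a normalized VCF; the function names are hypothetical, and how the pipeline itself assigns the Variant Type is not stated here.

```python
def length_change(ref, alt):
    """Size difference between alternate and reference allele."""
    return len(alt) - len(ref)


def variant_type(ref, alt):
    """Classify a variant from the sign of its length change."""
    delta = length_change(ref, alt)
    if delta > 0:
        return "Insertion"
    if delta < 0:
        return "Deletion"
    # Equal lengths: one base is an SNV, several bases an MNV.
    return "SNV" if len(ref) == 1 else "MNV"
```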

- -

Methods

-The same merge-and-annotate pipeline used for the -All Databases Combined track was run on the -disease-cohort subset of source VCFs. Each VCF was stripped of its INFO fields, -normalized with bcftools norm (splitting multi-allelic sites), and merged with -bcftools merge. The merged VCF was then annotated with predicted protein -consequences using bcftools csq with the -Ensembl -GRCh38 release 115 gene annotation (GFF3). -

- -

-The SPARK WES and WGS sites VCFs were rebuilt for this track so each variant carries -phenotype-stratified counts in addition to overall AC/AN/AF. The split uses the -asd column of the SPARK individuals_registration TSV via -bcftools +fill-tags -S, producing AC_AUT / AN_AUT / AF_AUT and -AC_NON_AUT / AN_NON_AUT / AF_NON_AUT. SCHEMA was processed the same way, summing -AC_CASE/AN_CASE/AF_CASE and AC_CTRL/AN_CTRL/AF_CTRL across its 39 analysis cohorts. -GREGoR ships AC/AN/AF triples for affected, unaffected and unknown disease status -directly in its release. -
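As an illustration of summing per-cohort counts, the Python sketch below pools allele counts (AC) and allele numbers (AN) over several cohorts. Recomputing the frequency as pooled AC divided by pooled AN is an assumption made for this example; the text above does not spell out how the summed AF was derived, and pool_counts is a hypothetical helper, not part of the conversion scripts.

```python
def pool_counts(cohorts):
    """Pool (AC, AN) pairs across cohorts and return (AC, AN, AF)."""
    cohorts = list(cohorts)  # allow generators, since the data is read twice
    ac_total = sum(ac for ac, _ in cohorts)
    an_total = sum(an for _, an in cohorts)
    # Assumption: AF is recomputed from pooled counts, not summed or averaged.
    af = ac_total / an_total if an_total else 0.0
    return ac_total, an_total, af
```

Pooling counts first weights each cohort by its size, which a plain average of per-cohort frequencies would not do.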

- -

-The track's -makeDoc file documents how each source VCF was converted. Scripts are -available from -GitHub. -

- -

Data Access

-The data can be explored interactively with the -Table Browser or the -Data Integrator. For programmatic access, our -REST API can be used; the track -name is varFreqsDisease. -
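A query against the REST API can be assembled as in the Python sketch below, which only builds the request URL and does not contact the server. The /getData/track endpoint and its genome, track, chrom, start and end parameters belong to the UCSC REST API; the hg38 default is an assumption based on the GRCh38 annotation mentioned in Methods, and track_query_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def track_query_url(chrom, start, end, genome="hg38", track="varFreqsDisease"):
    """Build a getData/track URL for one region (no request is sent)."""
    params = {
        "genome": genome,  # assumed assembly; not stated in the Data Access text
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return f"{API_BASE}?{urlencode(params)}"
```

The resulting URL can be fetched with any HTTP client, and the API returns JSON.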

-Because the merged callset includes data from multiple sources whose redistribution -licenses differ, the combined bigBed is not available for download from our download -server. The combined track can be reconstructed from the individual source VCFs using the -conversion scripts on GitHub together with the -build documentation. -

- -

Credits

-This track is only possible thanks to the data from millions of volunteers around the -world, who donated blood, signed consent forms and provided health information about -themselves and sometimes their families. Click on any of the individual tracks in the -SNV Frequencies supertrack to see the specific credits -for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and -to Andreas Lahner, MGZ, for feedback. -

- -

References

-For primary citations of each source dataset, see the References section on the -SNV Frequencies supertrack page. The merged-track -build itself uses the following tools: -

-Danecek P, McCarthy SA. - -BCFtools/csq: haplotype-aware variant consequences. -Bioinformatics. 2017 Jul 1;33(13):2037-2039. -PMID: 28205675; PMC: PMC5870570 -

-McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. - -The Ensembl Variant Effect Predictor. -Genome Biol. 2016 Jun 6;17(1):122. -PMID: 27268795; PMC: PMC4893825 -