753e4fdfc8b960c8a8775e2282b0f87c73a95449 lrnassar Tue Jun 2 07:49:03 2026 -0700
varFreqsDisease.html: list six disease cohorts separately to match the "six cohorts" count in the opening sentence and the six per-source AC/AF columns in the bigBed schema. SPARK WES and SFARI WGS are two distinct sample sets, not one combined cohort. Per QA feedback. refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsDisease.html src/hg/makeDb/trackDb/human/varFreqsDisease.html
index 613013eac96..010d65019b7 100644
--- src/hg/makeDb/trackDb/human/varFreqsDisease.html
+++ src/hg/makeDb/trackDb/human/varFreqsDisease.html
@@ -1,198 +1,198 @@

Description

This track merges variants from six disease-focused or clinically-recruited cohorts into a single bigBed file with predicted protein consequences and cross-database filtering. It
-contains 932 million variants from SFARI SPARK (WES + WGS, autism families), TOPMed
-(NHLBI heart, lung and blood disease cohorts), SCHEMA (schizophrenia case/control),
-GREGoR (rare-disease families), and GA4K (PacBio long-read pediatric rare disease). Where
-the source dataset provides per-phenotype counts, those are exposed as separate AC/AF
-columns and as filter widgets.
+contains 932 million variants from SPARK WES (140k autism families), SFARI WGS (12.5k
+autism families), TOPMed (NHLBI heart, lung and blood disease cohorts), SCHEMA
+(schizophrenia case/control), GREGoR (rare-disease families), and GA4K (PacBio long-read
+pediatric rare disease). Where the source dataset provides per-phenotype counts, those are
+exposed as separate AC/AF columns and as filter widgets.

For a summary of all available variant frequency databases, including the population-scale control track and the genotyping-array track, see the SNV Frequencies supertrack page.

Each variant is annotated with its predicted consequence on protein-coding genes (using bcftools csq with Ensembl gene models), and colored by severity. Allele counts and frequencies are shown for each source database and, where available, broken down by phenotype.

Display Conventions

Color by Consequence

Variants are colored by their most severe predicted consequence:

Color	Consequence class	Examples
Red	Protein-truncating / Loss-of-function	stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost
Blue	Missense / In-frame	missense, inframe_insertion, inframe_deletion, protein_altering
Green	Synonymous	synonymous, stop_retained
Grey	Non-coding / Intergenic	intron, non_coding, intergenic, UTR
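
As a rough illustration of the color table above, the following Python sketch maps a variant's consequence tokens to a display color. The token sets are copied from the Examples column and are probably not exhaustive. The function name color_for, the comma splitting, and the most-severe-first lookup order are assumptions made for this example, not the track's actual implementation.

```python
# Consequence classes copied from the "Examples" column of the color table.
# The sets are illustrative; the real track may recognize more tokens.
COLOR_CLASSES = [
    ("red", {"stop_gained", "frameshift", "splice_donor",
             "splice_acceptor", "stop_lost", "start_lost"}),
    ("blue", {"missense", "inframe_insertion", "inframe_deletion",
              "protein_altering"}),
    ("green", {"synonymous", "stop_retained"}),
]


def color_for(consequences: str) -> str:
    """Return the color of the most severe class present (assumed rule)."""
    tokens = {t.strip() for t in consequences.split(",") if t.strip()}
    # Classes are listed from most to least severe, so the first hit wins.
    for color, members in COLOR_CLASSES:
        if tokens & members:
            return color
    # Anything else (intron, UTR, intergenic, unknown tokens) falls back to grey.
    return "grey"
```

For example, a variant annotated as both intron and missense would be drawn blue under this rule, because missense is the more severe of the two classes.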

Amino Acid Change Notation

The "AA change" field uses bcftools csq notation: 23I>23V means position 23 changed from Isoleucine (I) to Valine (V) (missense). 23I alone (no arrow) means position 23 is Isoleucine and unchanged (synonymous). A "*" indicates a stop codon (e.g. 45R>45* is a stop_gained).
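
The notation described above can be parsed with a small regular expression. The sketch below is only an illustration: the names AAChange and parse_aa_change are hypothetical, and it handles just the three forms mentioned in the text (a change, an unchanged residue, and a gained stop).

```python
import re
from typing import NamedTuple, Optional


class AAChange(NamedTuple):
    ref_pos: int
    ref_aa: str
    alt_pos: Optional[int]
    alt_aa: Optional[str]
    kind: str  # "unchanged", "stop_gained" or "changed"


# <pos><residues>, optionally followed by ><pos><residues>; "*" is a stop codon.
_AA_RE = re.compile(r"^(\d+)([A-Z*]+)(?:>(\d+)([A-Z*]+))?$")


def parse_aa_change(value: str) -> AAChange:
    m = _AA_RE.match(value.strip())
    if m is None:
        raise ValueError(f"unrecognized AA change: {value!r}")
    ref_pos, ref_aa, alt_pos, alt_aa = m.groups()
    if alt_aa is None:
        # No arrow: the residue is unchanged (synonymous).
        return AAChange(int(ref_pos), ref_aa, None, None, "unchanged")
    # A stop appearing only on the alternate side is a gained stop.
    kind = "stop_gained" if "*" in alt_aa and "*" not in ref_aa else "changed"
    return AAChange(int(ref_pos), ref_aa, int(alt_pos), alt_aa, kind)
```

With this sketch, parse_aa_change("23I>23V") is reported as changed, parse_aa_change("23I") as unchanged, and parse_aa_change("45R>45*") as stop_gained.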

Filters

This track supports filtering via the track settings page. Click the track title or use the "Configure" button to access filters.

Variant Type and Consequence

Variant Type: SNV, Insertion, Deletion, or MNV.
Consequence: Missense, Synonymous, Stop Gained, Frameshift, Splice Donor, Splice Acceptor, Intron, 3' UTR, 5' UTR, Non-coding, Intergenic, or Other. The filter uses OR logic across the comma-separated consequence tokens on each variant. See the All Databases Combined description page for a complete description of the "Other" bucket.
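
The OR logic of the Consequence filter can be sketched as a set intersection. This is a minimal illustration, not the browser's code: the function name, the token spellings, and the choice to let an empty selection pass everything are assumptions.

```python
from typing import Iterable


def passes_consequence_filter(variant_consequences: str,
                              selected: Iterable[str]) -> bool:
    """True if any comma-separated token of the variant is selected."""
    wanted = set(selected)
    if not wanted:
        # Assumption: with nothing selected, the filter is inactive.
        return True
    tokens = {t.strip() for t in variant_consequences.split(",") if t.strip()}
    # OR logic: one shared token is enough.
    return not tokens.isdisjoint(wanted)
```

Under this rule a variant annotated "intron,missense" passes a filter for missense alone, even though its other token is not selected.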

Frequency and Count Filters

Max Allele Frequency: Filter by the maximum allele frequency observed across the six disease cohorts. Useful for finding rare variants enriched in cases.
Total Allele Count: Filter by the sum of allele counts across all six databases.
Per-database AF and AC: Filter by allele frequency or count in any specific source. For example, restrict to variants with SCHEMA case AF > 0.001.
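
To make the two cross-cohort summaries concrete, the sketch below derives a total allele count and a maximum allele frequency from per-source values. The dictionary keys and the function name are hypothetical, and skipping sources in which the variant is absent is an assumption about how missing values are handled.

```python
from typing import Mapping, Optional, Tuple

# Per-source value: (allele count, allele frequency); None when absent.
SourceStats = Tuple[Optional[int], Optional[float]]


def summarize_sources(per_source: Mapping[str, SourceStats]) -> Tuple[int, float]:
    """Return (total AC, max AF) over the sources that report the variant."""
    acs = [ac for ac, _ in per_source.values() if ac is not None]
    afs = [af for _, af in per_source.values() if af is not None]
    # max(..., default=0.0) covers the case where no source has an AF.
    return sum(acs), max(afs, default=0.0)
```

A variant seen 3 times in SCHEMA at AF 0.001 and 10 times in TOPMed at AF 0.00004 would have a total allele count of 13 and a maximum allele frequency of 0.001.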

Phenotype-stratified Filters

Four of the six sources publish counts split by phenotype, which lets you compare allele frequencies between affected and unaffected groups within the same cohort:

SPARK WES and SFARI WGS: ASD proband counts versus non-ASD family members (mostly parents and unaffected siblings). The split is taken from the asd column of the SPARK individuals_registration file.
SCHEMA: Schizophrenia case counts versus controls, summed across the 39 analysis cohorts in the original meta-analysis.
GREGoR: Affected, Unaffected, and Unknown disease-status counts.

Source Database

The Source Database filter restricts the display to variants present in specific databases. It uses OR logic: selecting multiple databases shows variants found in any of the selected sources.

Length Filters

Reference/Alternate Length: Filter by the length of the reference or alternate allele.
Length Change: Filter by the size difference between alternate and reference (positive = insertion, negative = deletion, zero = SNV or MNV).
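
The sign convention for Length Change can be illustrated in a few lines. The function names are hypothetical, and telling SNV from MNV by reference allele length is an assumption of this sketch.

```python
def length_change(ref: str, alt: str) -> int:
    """Alternate length minus reference length."""
    return len(alt) - len(ref)


def variant_type(ref: str, alt: str) -> str:
    delta = length_change(ref, alt)
    if delta > 0:
        return "Insertion"
    if delta < 0:
        return "Deletion"
    # Equal lengths: one base is an SNV, several bases an MNV.
    return "SNV" if len(ref) == 1 else "MNV"
```

For instance, ref "A" and alt "ATG" give a length change of +2 (insertion), while ref "ATG" and alt "A" give -2 (deletion).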

Methods

The same merge-and-annotate pipeline used for the All Databases Combined track was run on the disease-cohort subset of source VCFs. Each VCF was stripped of its INFO fields, normalized with bcftools norm (splitting multi-allelic sites), and merged with bcftools merge. The merged VCF was then annotated with predicted protein consequences using bcftools csq with the Ensembl GRCh38 release 115 gene annotation (GFF3).
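
The steps in this paragraph could be expressed as a sequence of bcftools invocations. The sketch below only builds the argument lists and does not run them. All file names are hypothetical, and the specific options (annotate -x INFO, norm -m -any, -Oz output) are one plausible way to do each step, not necessarily the options used in the actual build, which is recorded in the makeDoc.

```python
from typing import List


def build_pipeline(source_vcfs: List[str], ref_fasta: str,
                   gff3: str, out_vcf: str) -> List[List[str]]:
    """Return argv lists for strip, normalize, merge and csq (not executed)."""
    cmds: List[List[str]] = []
    normalized: List[str] = []
    for i, vcf in enumerate(source_vcfs):
        stripped = f"stripped.{i}.vcf.gz"   # hypothetical intermediate file
        normed = f"norm.{i}.vcf.gz"         # hypothetical intermediate file
        # Remove all INFO fields from the source VCF.
        cmds.append(["bcftools", "annotate", "-x", "INFO",
                     "-Oz", "-o", stripped, vcf])
        # Split multi-allelic sites into one record per alternate allele.
        cmds.append(["bcftools", "norm", "-m", "-any", "-f", ref_fasta,
                     "-Oz", "-o", normed, stripped])
        normalized.append(normed)
    merged = "merged.vcf.gz"                # hypothetical intermediate file
    cmds.append(["bcftools", "merge", "-Oz", "-o", merged] + normalized)
    # Annotate predicted protein consequences from the gene models.
    cmds.append(["bcftools", "csq", "-f", ref_fasta, "-g", gff3,
                 "-Oz", "-o", out_vcf, merged])
    return cmds
```

Each list could be passed to subprocess.run(cmd, check=True). Note that bcftools merge expects its inputs to be compressed and indexed, a step this sketch leaves out.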

The SPARK WES and WGS sites VCFs were rebuilt for this track so each variant carries phenotype-stratified counts in addition to overall AC/AN/AF. The split uses the asd column of the SPARK individuals_registration TSV via bcftools +fill-tags -S, producing AC_AUT / AN_AUT / AF_AUT and AC_NON_AUT / AN_NON_AUT / AF_NON_AUT. SCHEMA was processed the same way, summing AC_CASE/AN_CASE/AF_CASE and AC_CTRL/AN_CTRL/AF_CTRL across its 39 analysis cohorts. GREGoR ships AC/AN/AF triples for affected, unaffected and unknown disease status directly in its release.
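
When counts are combined across cohorts, a pooled frequency is normally derived from the summed counts rather than by adding per-cohort frequencies. The sketch below shows that arithmetic. Whether the SCHEMA build computes AF exactly this way is an assumption, and the function name is hypothetical.

```python
from typing import Iterable, Optional, Tuple


def pooled_counts(cohorts: Iterable[Tuple[int, int]]
                  ) -> Tuple[int, int, Optional[float]]:
    """Sum (AC, AN) pairs and return (AC, AN, AF = AC / AN)."""
    total_ac = 0
    total_an = 0
    for ac, an in cohorts:
        total_ac += ac
        total_an += an
    # AF is undefined when no alleles were genotyped.
    af = total_ac / total_an if total_an else None
    return total_ac, total_an, af
```

Two cohorts with (AC, AN) of (1, 1000) and (3, 3000) pool to AC 4, AN 4000 and AF 0.001.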

The track's makeDoc file documents how each source VCF was converted. Scripts are available from GitHub.

Data Access

The data can be explored interactively with the Table Browser or the Data Integrator. For programmatic access, our REST API can be used; the track name is varFreqsDisease.
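
A request URL for the REST API can be assembled as sketched below. The endpoint form follows the public api.genome.ucsc.edu getData/track convention; the genome name hg38 is an assumption based on the GRCh38 annotation mentioned in Methods, and the region in the usage note is arbitrary. The function only builds the URL and does not contact the server.

```python
from urllib.parse import quote

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def build_track_url(chrom: str, start: int, end: int,
                    genome: str = "hg38",            # assumed assembly name
                    track: str = "varFreqsDisease") -> str:
    """Build a getData/track URL for one region (0-based start)."""
    params = [("genome", genome), ("track", track),
              ("chrom", chrom), ("start", start), ("end", end)]
    # The API documentation separates parameters with semicolons.
    query = ";".join(f"{key}={quote(str(value))}" for key, value in params)
    return f"{API_BASE}?{query}"
```

The resulting string, for example from build_track_url("chr1", 1000, 2000), can be fetched with any HTTP client and the JSON response inspected for the per-source AC/AF fields.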

Because the merged callset includes data from multiple sources whose redistribution licenses differ, the combined bigBed is not available for download from our download server. The combined track can be reconstructed from the individual source VCFs using the conversion scripts on GitHub together with the build documentation.

Credits

This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click on any of the individual tracks in the SNV Frequencies supertrack to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and to Andreas Lahner, MGZ, for feedback.

References

For primary citations of each source dataset, see the References section on the SNV Frequencies supertrack page. The merged-track build itself uses the following tools:

Danecek P, McCarthy SA. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics. 2017 Jul 1;33(13):2037-2039. PMID: 28205675; PMC: PMC5870570

McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The Ensembl Variant Effect Predictor. Genome Biol. 2016 Jun 6;17(1):122. PMID: 27268795; PMC: PMC4893825