753e4fdfc8b960c8a8775e2282b0f87c73a95449 lrnassar Tue Jun 2 07:49:03 2026 -0700
varFreqsDisease.html: list six disease cohorts separately to match the "six cohorts" count in the opening sentence and the six per-source AC/AF columns in the bigBed schema. SPARK WES and SFARI WGS are two distinct sample sets, not one combined cohort. Per QA feedback. refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsDisease.html src/hg/makeDb/trackDb/human/varFreqsDisease.html
index 613013eac96..010d65019b7 100644
--- src/hg/makeDb/trackDb/human/varFreqsDisease.html
+++ src/hg/makeDb/trackDb/human/varFreqsDisease.html
@@ -1,198 +1,198 @@
This track merges variants from six disease-focused or clinically-recruited cohorts into a single bigBed file with predicted protein consequences and cross-database filtering. It
-contains 932 million variants from SFARI SPARK (WES + WGS, autism families), TOPMed
-(NHLBI heart, lung and blood disease cohorts), SCHEMA (schizophrenia case/control),
-GREGoR (rare-disease families), and GA4K (PacBio long-read pediatric rare disease). Where
-the source dataset provides per-phenotype counts, those are exposed as separate AC/AF
-columns and as filter widgets.
+contains 932 million variants from SPARK WES (140k autism families), SFARI WGS (12.5k
+autism families), TOPMed (NHLBI heart, lung and blood disease cohorts), SCHEMA
+(schizophrenia case/control), GREGoR (rare-disease families), and GA4K (PacBio long-read
+pediatric rare disease). Where the source dataset provides per-phenotype counts, those are
+exposed as separate AC/AF columns and as filter widgets.
For a summary of all available variant frequency databases, including the population-scale control track and the genotyping-array track, see the SNV Frequencies supertrack page.
Each variant is annotated with its predicted consequence on protein-coding genes (using bcftools csq with Ensembl gene models), and colored by severity. Allele counts and frequencies are shown for each source database and, where available, broken down by phenotype.
Variants are colored by their most severe predicted consequence:
| Color | Consequence class | Examples |
|---|---|---|
| Red | Protein-truncating / Loss-of-function | stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost |
| Blue | Missense / In-frame | missense, inframe_insertion, inframe_deletion, protein_altering |
| Green | Synonymous | synonymous, stop_retained |
| Grey | Non-coding / Intergenic | intron, non_coding, intergenic, UTR |
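The coloring rule in the table can be sketched as a small lookup. This is an illustration only, not the track's build code: the function and dictionary names are hypothetical, the mapping covers only the example terms listed in the table, and it assumes that the severity order is red, blue, green, grey and that any unlisted term falls into the grey class.

```python
# Severity classes from the table, most severe first (assumed ordering).
SEVERITY_ORDER = ["red", "blue", "green", "grey"]

# Only the example consequence terms from the table are mapped here.
CLASS_BY_CONSEQUENCE = {
    "stop_gained": "red", "frameshift": "red", "splice_donor": "red",
    "splice_acceptor": "red", "stop_lost": "red", "start_lost": "red",
    "missense": "blue", "inframe_insertion": "blue",
    "inframe_deletion": "blue", "protein_altering": "blue",
    "synonymous": "green", "stop_retained": "green",
}


def color_for(consequences):
    """Return the color class of the most severe consequence in the list."""
    # Assumption: terms not in the mapping are treated as non-coding (grey).
    classes = {CLASS_BY_CONSEQUENCE.get(c, "grey") for c in consequences}
    for color in SEVERITY_ORDER:
        if color in classes:
            return color
    return "grey"  # empty input: nothing more severe than grey


print(color_for(["intron", "missense"]))
print(color_for(["synonymous", "stop_gained"]))
```

A variant that overlaps several transcripts can carry several consequences, which is why the sketch takes a list and picks the most severe class.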
The "AA change" field uses bcftools csq notation: 23I>23V means position 23 changed from Isoleucine (I) to Valine (V) (missense). 23I alone (no arrow) means position 23 is Isoleucine and unchanged (synonymous). A "*" indicates a stop codon (e.g. 45R>45* is a stop_gained).
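For scripted use, the notation described above could be parsed roughly as follows. This is a minimal sketch that handles only the three forms shown in the paragraph (changed, unchanged, and stop codon); the function name is hypothetical, and other forms that bcftools csq may emit are not covered.

```python
import re

# <position><residues>, optionally followed by ><position><residues>.
# "*" stands for a stop codon.
AA_CHANGE = re.compile(r"^(\d+)([A-Z*]+)(?:>(\d+)([A-Z*]+))?$")


def parse_aa_change(text):
    """Return (position, ref_residues, alt_residues) for an AA change string.

    When there is no arrow, the residue is unchanged and alt equals ref.
    """
    match = AA_CHANGE.match(text)
    if match is None:
        raise ValueError(f"unrecognized AA change: {text!r}")
    ref_pos, ref_aa, _alt_pos, alt_aa = match.groups()
    if alt_aa is None:
        alt_aa = ref_aa  # synonymous: same residue before and after
    return int(ref_pos), ref_aa, alt_aa


print(parse_aa_change("23I>23V"))  # missense
print(parse_aa_change("23I"))      # synonymous
print(parse_aa_change("45R>45*"))  # stop_gained
```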
This track supports filtering via the track settings page. Click the track title or use the "Configure" button to access filters.
Four of the six sources publish counts split by phenotype, which lets you compare allele frequencies between affected and unaffected groups within the same cohort: SPARK WES and SFARI WGS (autism vs. non-autism, split on the asd column of the SPARK registration data), SCHEMA (case vs. control) and GREGoR (affected, unaffected and unknown disease status).
The Source Database filter restricts the display to variants present in specific databases. It uses OR logic: selecting multiple databases shows variants found in any of the selected sources.
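The OR logic of the Source Database filter amounts to a set-overlap test. The sketch below is illustrative only: the function name and the source labels are hypothetical, and it assumes that an empty selection means no filtering.

```python
def passes_source_filter(variant_sources, selected_sources):
    """True if the variant occurs in at least one selected source (OR logic)."""
    if not selected_sources:
        return True  # assumption: nothing selected means the filter is off
    return not set(variant_sources).isdisjoint(selected_sources)


# Hypothetical source labels, for illustration only.
print(passes_source_filter({"TOPMed", "SCHEMA"}, {"SCHEMA", "GREGoR"}))
print(passes_source_filter({"TOPMed"}, {"SCHEMA", "GREGoR"}))
```

With AND logic the test would instead require every selected source to be present, which is not how this filter is described.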
The same merge-and-annotate pipeline used for the
All Databases Combined track was run on the
disease-cohort subset of source VCFs. Each VCF was stripped of its INFO fields,
normalized with bcftools norm (splitting multi-allelic sites), and merged with
bcftools merge. The merged VCF was then annotated with predicted protein
consequences using bcftools csq with the
Ensembl
GRCh38 release 115 gene annotation (GFF3).
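The sequence of steps described above can be sketched as a list of bcftools command lines. This is not the actual build script (the makeDoc is the authoritative record): the function name, file names and output options are assumptions, and indexing of intermediate files is omitted. The sketch only assembles and prints the commands; it does not run them.

```python
import shlex


def build_pipeline(source_vcfs, gff3, fasta,
                   merged="merged.vcf.gz", annotated="annotated.vcf.gz"):
    """Assemble bcftools commands for strip, normalize, merge, annotate."""
    commands = []
    normalized = []
    for vcf in source_vcfs:
        stripped = vcf.replace(".vcf.gz", ".noinfo.vcf.gz")
        norm = vcf.replace(".vcf.gz", ".norm.vcf.gz")
        # Remove all INFO fields from the source VCF.
        commands.append(["bcftools", "annotate", "-x", "INFO",
                         "-Oz", "-o", stripped, vcf])
        # Split multi-allelic sites into one record per alternate allele.
        commands.append(["bcftools", "norm", "-m", "-any", "-f", fasta,
                         "-Oz", "-o", norm, stripped])
        normalized.append(norm)
    # Merge all normalized VCFs (inputs must be indexed; omitted here).
    commands.append(["bcftools", "merge", "-Oz", "-o", merged, *normalized])
    # Predict protein consequences from the Ensembl GFF3 gene models.
    commands.append(["bcftools", "csq", "-f", fasta, "-g", gff3,
                     "-Oz", "-o", annotated, merged])
    return commands


for cmd in build_pipeline(["a.vcf.gz", "b.vcf.gz"], "genes.gff3", "ref.fa"):
    print(shlex.join(cmd))
```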
The SPARK WES and WGS sites VCFs were rebuilt for this track so each variant carries
phenotype-stratified counts in addition to overall AC/AN/AF. The split uses the
asd column of the SPARK individuals_registration TSV via
bcftools +fill-tags -S, producing AC_AUT / AN_AUT / AF_AUT and
AC_NON_AUT / AN_NON_AUT / AF_NON_AUT. SCHEMA was processed the same way, summing
AC_CASE/AN_CASE/AF_CASE and AC_CTRL/AN_CTRL/AF_CTRL across its 39 analysis cohorts.
GREGoR ships AC/AN/AF triples for affected, unaffected and unknown disease status
directly in its release.
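Pooling counts across analysis cohorts, as described for SCHEMA, can be illustrated with a few lines of arithmetic. This sketch assumes the pooled frequency is recomputed as summed AC divided by summed AN, which is the usual approach; the function name is hypothetical and the actual conversion script may differ.

```python
def pooled_counts(per_cohort):
    """Pool (AC, AN) pairs from several cohorts into one AC/AN/AF triple."""
    total_ac = 0
    total_an = 0
    for ac, an in per_cohort:
        total_ac += ac
        total_an += an
    # Assumption: AF is recomputed from the pooled counts, not averaged.
    total_af = total_ac / total_an if total_an else None
    return total_ac, total_an, total_af


# Two made-up cohorts: 1 of 100 alleles and 3 of 300 alleles.
print(pooled_counts([(1, 100), (3, 300)]))
```

Recomputing AF from pooled counts weights each cohort by its size, whereas averaging per-cohort AF values would give small cohorts the same weight as large ones.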
The track's makeDoc file documents how each source VCF was converted. Scripts are available from GitHub.
The data can be explored interactively with the Table Browser or the Data Integrator. For programmatic access, our REST API can be used; the track name is varFreqsDisease.
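A request to the REST API for this track might look like the sketch below. The track name comes from the text above; the hg38 genome name, the example region and the helper function names are assumptions. The example only builds and prints the URL, and the fetch helper is defined but not called.

```python
import json
from urllib.request import urlopen

API_BASE = "https://api.genome.ucsc.edu"


def track_url(chrom, start, end, genome="hg38", track="varFreqsDisease"):
    """Build a getData/track URL for one region (genome name is assumed)."""
    params = {"genome": genome, "track": track,
              "chrom": chrom, "start": start, "end": end}
    # The API documentation separates parameters with semicolons.
    query = ";".join(f"{key}={value}" for key, value in params.items())
    return f"{API_BASE}/getData/track?{query}"


def fetch_variants(chrom, start, end):
    """Download and decode the JSON response for one region."""
    with urlopen(track_url(chrom, start, end)) as response:
        return json.load(response)


print(track_url("chr1", 1000, 2000))
```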
Because the merged callset includes data from multiple sources whose redistribution licenses differ, the combined bigBed is not available for download from our download server. The combined track can be reconstructed from the individual source VCFs using the conversion scripts on GitHub together with the build documentation.
This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click on any of the individual tracks in the SNV Frequencies supertrack to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and to Andreas Lahner, MGZ, for feedback.
For primary citations of each source dataset, see the References section on the SNV Frequencies supertrack page. The merged-track build itself uses the following tools:
Danecek P, McCarthy SA. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics. 2017 Jul 1;33(13):2037-2039. PMID: 28205675; PMC: PMC5870570
McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The Ensembl Variant Effect Predictor. Genome Biol. 2016 Jun 6;17(1):122. PMID: 27268795; PMC: PMC4893825