64a3f9e7813e823cf724ea188c3928a911578286 max Thu Jun 4 00:32:22 2026 -0700

varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease track) with two matched tracks for visual case-vs-background comparison:

  varFreqsAffected - variants seen in the affected/case arms of disease cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA cases, GREGoR affected, GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control arms of disease cohorts ("all other variants"); ~1.5 million individuals

A variant seen in both groups appears in both tracks. Genotyping-array cohorts stay out of both (varFreqsArray unchanged).

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it reads phenotype tags (affected/unaffected/unknown) from populations.tsv and is_disease/disease_role from databases.tsv, and derives the length-filter ranges from the observed data. TOPMed reclassified as a population cohort. SPARK WGS display name changed to SFARI SPARK WGS for consistency with the standalone subtracks. Fixed the trackDb mouseOver $-substitution prefix collision by wrapping fields in ${}. New description pages for both tracks.

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsAll.html src/hg/makeDb/trackDb/human/varFreqsAll.html
deleted file mode 100644
index d6ee5fdd42d..00000000000
--- src/hg/makeDb/trackDb/human/varFreqsAll.html
+++ /dev/null
@@ -1,244 +0,0 @@
-

Description

-

-This track merges variants from 28 sequencing-based variant frequency databases into a -single bigBed file with predicted protein consequences and cross-database filtering. It -contains 1.34 billion variants from WGS, WES, and long-read sequencing cohorts worldwide. -For a summary of all available databases, see the -SNV Frequencies supertrack page. -

- -

-Two companion combined tracks split out the cohorts that don't belong in a general -sequencing-based summary: -

- - -

-Each variant is annotated with its predicted consequence on protein-coding genes -(using bcftools csq with -Ensembl -gene models), and colored by severity. Allele counts and frequencies are shown for each -source database and, where available, broken down by ancestry, population, or phenotype. -

- -

Display Conventions

- -

Color by Consequence

-

Variants are colored by their most severe predicted consequence:

- - - - - - - - - - - - - - - - - - - - - - -
Color | Consequence class | Examples
Red | Protein-truncating / Loss-of-function | stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost
Blue | Missense / In-frame | missense, inframe_insertion, inframe_deletion, protein_altering
Green | Synonymous | synonymous, stop_retained
Grey | Non-coding / Intergenic | intron, non_coding, intergenic, UTR
- -

Amino Acid Change Notation

-

-The "AA change" field uses bcftools csq notation: 23I>23V means position -23 changed from Isoleucine (I) to Valine (V) (missense). 23I alone (no arrow) -means position 23 is Isoleucine and unchanged (synonymous). A "*" indicates a -stop codon (e.g. 45R>45* is a stop_gained). -

- -

Filters

-

-This track supports extensive filtering via the track settings page. Click on the track -title or use the "Configure" button to access filters: -

- -

Variant Type and Consequence

- - -

How to find protein-truncating variants: Set the Consequence filter to include -only "Stop Gained", "Frameshift", "Splice Donor", and -"Splice Acceptor". These will appear as red items in the track display.

- -

Frequency and Count Filters

- - -

Source Database

-

-The Source Database filter lets you restrict to variants present in specific databases. -For example, select only "GREGoR" to see variants found in the rare disease cohort. -This filter uses OR logic: selecting multiple databases shows variants found in -any of the selected databases. -

- -

Population- and Phenotype-Specific Filters

-

-Several databases provide ancestry-specific allele frequencies: -

- -

-Three sources also expose phenotype-stratified counts: -

- -

-The disease-related Disease-related Databases Combined -track exposes additional phenotype splits for SCHEMA (Schizophrenia case vs control). -

- -

Length Filters

- - -

Methods

-

-Variant frequency VCF files from 28 sequencing-based databases were stripped of their INFO -fields (to reduce size), normalized with bcftools norm (splitting multi-allelic -sites), and merged with bcftools merge. The merged VCF was then annotated with -predicted protein consequences using bcftools csq with the -Ensembl -GRCh38 release 115 gene annotation (GFF3). The same pipeline, run on different subsets of -source VCFs, produces the -Disease-related Databases Combined and -Genotyping Array Databases Combined tracks. -

- -

-The annotated VCF was converted to bigBed format using a custom Python script -(vcfToBigBed.py) that reads frequency data from each source VCF in parallel, -matches variants by position/ref/alt, and writes a BED file with consequence coloring, -per-database allele counts and frequencies, and population breakdowns. -The database configuration (which VCFs to include, field mappings, and population definitions) -is stored in two TSV files -(databases.tsv and -populations.tsv) -so that future updates only require editing these files. -

- -

-The track's -makeDoc file documents how each source VCF was converted. -Scripts are available from -Github. -

- -

Data Access

-

-The data can be explored interactively with the -Table Browser or the -Data Integrator. -For programmatic access, our REST API -can be used; the track name is varFreqsAll. -

-

-Because the merged callset includes data from multiple sources whose redistribution -licenses differ, the combined bigBed is not available for download from our -download server. The combined track can be reconstructed from the individual source VCFs -using the -conversion scripts on GitHub together with the -build documentation. Where individual source data is downloadable from UCSC, -the per-subtrack description page indicates the path on our download server. -

- -

Credits

-

-This track is only possible thanks to the data from millions of volunteers around the world, -who donated blood, signed consent forms and provided health information about themselves and -sometimes their families. Click on any of the individual tracks in the -SNV Frequencies supertrack to see the specific -credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track -and to Andreas Lahner, MGZ, for feedback. -

- -

References

-

-For primary citations of each source dataset, see the References section on the -SNV Frequencies supertrack page. The merged-track -build itself uses the following tools: -

-

-Danecek P, McCarthy SA. - -BCFtools/csq: haplotype-aware variant consequences. -Bioinformatics. 2017 Jul 1;33(13):2037-2039. -PMID: 28205675; PMC: PMC5870570 -

-

-McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. - -The Ensembl Variant Effect Predictor. -Genome Biol. 2016 Jun 6;17(1):122. -PMID: 27268795; PMC: PMC4893825 -