64a3f9e7813e823cf724ea188c3928a911578286 max Thu Jun 4 00:32:22 2026 -0700

varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease track) with two matched tracks for visual case-vs-background comparison:

  varFreqsAffected - variants seen in the affected/case arms of disease cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA cases, GREGoR affected, GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control arms of disease cohorts ("all other variants"); ~1.5 million individuals

A variant seen in both groups appears in both tracks. Genotyping-array cohorts stay out of both (varFreqsArray unchanged).

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it reads phenotype tags (affected/unaffected/unknown) from populations.tsv and is_disease/disease_role from databases.tsv, and derives the length-filter ranges from the observed data.

TOPMed reclassified as a population cohort. SPARK WGS display name changed to SFARI SPARK WGS for consistency with the standalone subtracks. Fixed the trackDb mouseOver $-substitution prefix collision by wrapping fields in ${}. New description pages for both tracks.

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsAll.html src/hg/makeDb/trackDb/human/varFreqsAll.html
deleted file mode 100644
index d6ee5fdd42d..00000000000
--- src/hg/makeDb/trackDb/human/varFreqsAll.html
+++ /dev/null
@@ -1,244 +0,0 @@
-

Description

-This track merges variants from 28 sequencing-based variant frequency databases into a -single bigBed file with predicted protein consequences and cross-database filtering. It -contains 1.34 billion variants from WGS, WES, and long-read sequencing cohorts worldwide. -For a summary of all available databases, see the -SNV Frequencies supertrack page. -

- -

-Two companion combined tracks split out the cohorts that don't belong in a general -sequencing-based summary: -

Disease-related Databases Combined — - 932 M variants from six disease-focused cohorts (SPARK, SFARI WGS, TOPMed, SCHEMA, - GREGoR, GA4K), with phenotype-stratified AC/AF where the source provides it.
Genotyping Array Databases Combined — - 14.7 M variants from three array cohorts (TPMI Taiwan, Mexico Biobank, UK Biobank - imputed). Kept separate because chip data has different per-variant confidence - than sequencing.

- -

-Each variant is annotated with its predicted consequence on protein-coding genes -(using bcftools csq with -Ensembl -gene models), and colored by severity. Allele counts and frequencies are shown for each -source database and, where available, broken down by ancestry, population, or phenotype. -

- -

Display Conventions

- -

Color by Consequence

Variants are colored by their most severe predicted consequence:

- - - - - - - - - - - - - - - - - - - - - - -

Color	Consequence class	Examples
Red	Protein-truncating / Loss-of-function	stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost
Blue	Missense / In-frame	missense, inframe_insertion, inframe_deletion, protein_altering
Green	Synonymous	synonymous, stop_retained
Grey	Non-coding / Intergenic	intron, non_coding, intergenic, UTR
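
The color table above can be read as a "most severe class wins" lookup. A minimal Python sketch follows, assuming exact token matches against the example terms in the table; `SEVERITY_CLASSES` and `color_for` are hypothetical names, and the real vcfToBigBed.py logic may differ.

```python
# Severity classes taken from the table above, most severe first.
# Exact token matching and the fallback to grey are assumptions for illustration.
SEVERITY_CLASSES = [
    ("red", {"stop_gained", "frameshift", "splice_donor",
             "splice_acceptor", "stop_lost", "start_lost"}),
    ("blue", {"missense", "inframe_insertion", "inframe_deletion",
              "protein_altering"}),
    ("green", {"synonymous", "stop_retained"}),
]


def color_for(consequences: str) -> str:
    """Return the color of the most severe class among comma-separated tokens."""
    tokens = {t.strip() for t in consequences.split(",") if t.strip()}
    for color, terms in SEVERITY_CLASSES:
        if tokens & terms:
            return color
    # Anything not listed above (intron, non_coding, intergenic, UTR, ...) is grey.
    return "grey"
```

With this ordering, a variant tagged `missense,stop_gained` is colored red because the protein-truncating class is checked first.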

- -

Amino Acid Change Notation

-The "AA change" field uses bcftools csq notation: 23I>23V means position -23 changed from Isoleucine (I) to Valine (V) (missense). 23I alone (no arrow) -means position 23 is Isoleucine and unchanged (synonymous). A "*" indicates a -stop codon (e.g. 45R>45* is a stop_gained). -
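
As an illustration of the notation described above, the following Python sketch parses the two forms mentioned in the text (`23I>23V` and `23I`). It assumes single-letter amino acid codes with `*` for stop; `AAChange` and `parse_aa_change` are hypothetical names, and other forms that bcftools csq can emit are not handled.

```python
import re
from typing import NamedTuple, Optional


class AAChange(NamedTuple):
    ref_pos: int
    ref_aa: str
    alt_pos: Optional[int]  # None when there is no arrow (unchanged)
    alt_aa: Optional[str]


# <pos><residues>, optionally followed by ><pos><residues>; '*' denotes a stop codon.
_AA_PATTERN = re.compile(r"^(\d+)([A-Z*]+)(?:>(\d+)([A-Z*]+))?$")


def parse_aa_change(text: str) -> AAChange:
    """Parse strings such as '23I>23V', '23I' or '45R>45*'."""
    match = _AA_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized AA change: {text!r}")
    ref_pos, ref_aa, alt_pos, alt_aa = match.groups()
    return AAChange(int(ref_pos), ref_aa,
                    int(alt_pos) if alt_pos else None, alt_aa)


def introduces_stop(change: AAChange) -> bool:
    """True when the alternate residue is a stop and the reference is not."""
    return change.alt_aa == "*" and change.ref_aa != "*"
```

A record without an arrow yields `alt_aa is None`, which corresponds to the synonymous case in the description.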

- -

Filters

-This track supports extensive filtering via the track settings page. Click on the track -title or use the "Configure" button to access filters: -

- -

Variant Type and Consequence

Variant Type: Filter by SNV, Insertion, Deletion, or MNV.
Consequence: Filter by predicted consequence (Missense, Synonymous, Stop Gained, - Frameshift, Splice Donor, Splice Acceptor, Intron, 3' UTR, 5' UTR, Non-coding, - Intergenic, Other). The filter uses OR logic across the comma-separated consequence - tokens on each variant: a variant tagged - stop_gained,frameshift is selected by either the "Stop Gained" - or the "Frameshift" filter. The "Other" bucket catches the less - common Sequence Ontology - consequence terms emitted by bcftools csq that don't fit the named - buckets above. Examples include - splice_region (variant near a splice site but outside the canonical - donor/acceptor), - start_lost / stop_lost (variant disrupts the start codon - or replaces the stop codon with a coding amino acid), - stop_retained (variant changes the stop codon but keeps it a stop), - inframe_insertion / inframe_deletion (in-frame indel - that adds or removes whole codons), and - coding_sequence (CDS variant where the precise impact is undetermined). - If you include "Other" in the filter selection, no records will be - hidden by the consequence filter.

- -

How to find protein-truncating variants: Set the Consequence filter to include -only "Stop Gained", "Frameshift", "Splice Donor", and -"Splice Acceptor". These will appear as red items in the track display.
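
The OR logic of the Consequence filter, and the protein-truncating selection suggested above, can be sketched in Python as follows. This is an illustration only: it works on the raw consequence tokens rather than the filter labels shown in the UI, and `passes_consequence_filter` and `PROTEIN_TRUNCATING` are hypothetical names.

```python
# Tokens corresponding to the "Stop Gained", "Frameshift", "Splice Donor" and
# "Splice Acceptor" filter entries (the label-to-token mapping is an assumption).
PROTEIN_TRUNCATING = {"stop_gained", "frameshift", "splice_donor", "splice_acceptor"}


def passes_consequence_filter(consequences: str, selected: set) -> bool:
    """OR logic: keep the variant if any of its tokens is among the selected ones."""
    tokens = {t.strip() for t in consequences.split(",") if t.strip()}
    return not tokens.isdisjoint(selected)
```

A variant tagged `stop_gained,frameshift` passes when either token is selected, which matches the behavior described for the filter.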

- -

Frequency and Count Filters

Max Allele Frequency: Filter by the maximum allele frequency observed across - all databases. Useful for finding rare variants (e.g., set max to 0.01 for variants - with AF < 1% in all databases).
Total Allele Count: Filter by the sum of allele counts across all databases. - Useful for excluding singletons (e.g., set minimum to 2 to remove AC=1 variants - that may be sequencing errors).
Per-database AF and AC: Filter by allele frequency or allele count in any - specific database. For example, filter to variants with TOPMed AF > 0.001.
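
The two aggregate filters above reduce to a maximum and a sum over the per-database values. The Python sketch below assumes each variant carries a dictionary of AF and AC values keyed by database name, with `None` for databases that lack the variant; the function names, the data layout, and the choice of inclusive comparisons are assumptions, not the browser's implementation.

```python
def max_allele_frequency(af_by_db: dict) -> float:
    """Highest AF observed in any database (0.0 if no database reports one)."""
    return max((af for af in af_by_db.values() if af is not None), default=0.0)


def total_allele_count(ac_by_db: dict) -> int:
    """Sum of allele counts across all databases."""
    return sum(ac for ac in ac_by_db.values() if ac is not None)


def keep_rare_non_singleton(af_by_db: dict, ac_by_db: dict,
                            max_af: float = 0.01, min_ac: int = 2) -> bool:
    """Keep variants that are rare in every database but seen at least min_ac times."""
    return (max_allele_frequency(af_by_db) <= max_af
            and total_allele_count(ac_by_db) >= min_ac)
```

With the defaults, a variant is kept only if no database reports an AF above 1% and its combined allele count is at least 2.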

- -

Source Database

-The Source Database filter lets you restrict to variants present in specific databases. -For example, select only "GREGoR" to see variants found in the rare disease cohort. -This filter uses OR logic: selecting multiple databases shows variants found in -any of the selected databases. -

- -

Population- and Phenotype-Specific Filters

-Several databases provide ancestry-specific allele frequencies: -

AllOfUs: African, Indigenous American, East Asian, European, Oceanian, South - Asian (from local ancestry inference)
GenomeAsia: Northeast Asian, Southeast Asian, South Asian, Oceanian, American, - African, Western European Reference
gnomAD HGDP+1kG: African, Amish, Latino, Ashkenazi Jewish, East Asian, Finnish, - Middle Eastern, Non-Finnish European, Other, South Asian
NPM Singapore: Chinese, Malay, Indian
WBBC: North Han, Central Han, South Han, Lingnan Han

-Three sources also expose phenotype-stratified counts: -

SPARK WES and SFARI WGS: ASD proband AC/AF versus non-ASD family - member AC/AF.
GREGoR: Affected, Unaffected, and Unknown disease-status AC/AF.

-The Disease-related Databases Combined -track exposes additional phenotype splits for SCHEMA (schizophrenia case vs control). -

- -

Length Filters

Reference/Alternate Length: Filter by the length of the reference or alternate allele.
Length Change: Filter by the size difference between alternate and reference - (positive = insertion, negative = deletion, zero = SNV or MNV).
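
The length change defined above is the alternate allele length minus the reference allele length. A short Python sketch follows, assuming alleles are plain nucleotide strings; the way SNV and MNV are separated here (by reference length when the change is zero) is an assumption for illustration.

```python
def length_change(ref: str, alt: str) -> int:
    """Alternate length minus reference length (positive = insertion)."""
    return len(alt) - len(ref)


def variant_type(ref: str, alt: str) -> str:
    """Classify a variant from its alleles using the sign of the length change."""
    delta = length_change(ref, alt)
    if delta > 0:
        return "Insertion"
    if delta < 0:
        return "Deletion"
    # Equal lengths: a single base is an SNV, several bases an MNV.
    return "SNV" if len(ref) == 1 else "MNV"
```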

- -

Methods

-Variant frequency VCF files from 28 sequencing-based databases were stripped of their INFO -fields (to reduce size), normalized with bcftools norm (splitting multi-allelic -sites), and merged with bcftools merge. The merged VCF was then annotated with -predicted protein consequences using bcftools csq with the -Ensembl -GRCh38 release 115 gene annotation (GFF3). The same pipeline, run on different subsets of -source VCFs, produces the -Disease-related Databases Combined and -Genotyping Array Databases Combined tracks. -
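
The strip, normalize, merge, and annotate steps above could be assembled roughly as in the following Python sketch, which only builds the command lines and does not run them. The file names, the `build_pipeline` helper, and the specific options are assumptions: the options shown exist in bcftools, but the makeDoc is the authority on what was actually run, and steps such as indexing are omitted.

```python
def build_pipeline(source_vcfs, ref_fasta, gff3,
                   merged="merged.vcf.gz", annotated="annotated.vcf.gz"):
    """Return bcftools command lines for the steps described in the text."""
    commands = []
    normalized = []
    for vcf in source_vcfs:
        stripped = vcf.replace(".vcf.gz", ".noinfo.vcf.gz")
        normed = vcf.replace(".vcf.gz", ".norm.vcf.gz")
        # Drop all INFO fields to reduce file size.
        commands.append(["bcftools", "annotate", "-x", "INFO",
                         "-Oz", "-o", stripped, vcf])
        # Split multi-allelic sites into one record per alternate allele.
        commands.append(["bcftools", "norm", "-m", "-any",
                         "-Oz", "-o", normed, stripped])
        normalized.append(normed)
    # Merge the per-database files into a single VCF.
    commands.append(["bcftools", "merge", "-Oz", "-o", merged, *normalized])
    # Annotate predicted protein consequences using the gene models in the GFF3.
    commands.append(["bcftools", "csq", "-f", ref_fasta, "-g", gff3,
                     "-Oz", "-o", annotated, merged])
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once the intermediate files have been indexed.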

- -

-The annotated VCF was converted to bigBed format using a custom Python script -(vcfToBigBed.py) that reads frequency data from each source VCF in parallel, -matches variants by position/ref/alt, and writes a BED file with consequence coloring, -per-database allele counts and frequencies, and population breakdowns. -The database configuration (which VCFs to include, field mappings, and population definitions) -is stored in two TSV files -(databases.tsv and -populations.tsv) -so that future updates only require editing these files. -
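
Matching variants across sources by position/ref/alt amounts to grouping records under a shared key. The sketch below is a simplified, in-memory Python illustration, not the vcfToBigBed.py implementation: the record layout, the `merge_frequencies` name, and the inclusion of the chromosome in the key are assumptions.

```python
from collections import defaultdict


def merge_frequencies(sources: dict) -> dict:
    """Group per-database (AC, AF) values under a (chrom, pos, ref, alt) key.

    `sources` maps a database name to an iterable of
    (chrom, pos, ref, alt, ac, af) tuples.
    """
    merged = defaultdict(dict)
    for db_name, records in sources.items():
        for chrom, pos, ref, alt, ac, af in records:
            merged[(chrom, pos, ref, alt)][db_name] = (ac, af)
    return dict(merged)
```

A variant present in two databases ends up with two entries under the same key, while a variant that differs only in its alternate allele gets its own key.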

- -

-The track's -makeDoc file documents how each source VCF was converted. -Scripts are available from -GitHub. -

- -

Data Access

-The data can be explored interactively with the -Table Browser or the -Data Integrator. -For programmatic access, our REST API -can be used; the track name is varFreqsAll. -
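
For the REST API route, a request URL for a region of this track might be built as in the Python sketch below. It assumes the public `api.genome.ucsc.edu` endpoint with its `/getData/track` function and the hg38 assembly; the `track_data_url` helper is hypothetical, and whether the endpoint serves this particular track is not verified here.

```python
from urllib.parse import urlencode

API_BASE = "https://api.genome.ucsc.edu"  # assumed public endpoint


def track_data_url(chrom: str, start: int, end: int,
                   genome: str = "hg38", track: str = "varFreqsAll") -> str:
    """Build a /getData/track URL for one genomic region (no request is sent)."""
    query = urlencode({
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    })
    return f"{API_BASE}/getData/track?{query}"
```

The resulting URL can be fetched with any HTTP client; the response is JSON.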

-Because the merged callset includes data from multiple sources whose redistribution -licenses differ, the combined bigBed is not available for download from our -download server. The combined track can be reconstructed from the individual source VCFs -using the -conversion scripts on GitHub together with the -build documentation. Where individual source data is downloadable from UCSC, -the per-subtrack description page indicates the path on our download server. -

- -

Credits

-This track is only possible thanks to the data from millions of volunteers around the world, -who donated blood, signed consent forms and provided health information about themselves and -sometimes their families. Click on any of the individual tracks in the -SNV Frequencies supertrack to see the specific -credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track -and to Andreas Lahner, MGZ, for feedback. -

- -

References

-For primary citations of each source dataset, see the References section on the -SNV Frequencies supertrack page. The merged-track -build itself uses the following tools: -

-Danecek P, McCarthy SA. - -BCFtools/csq: haplotype-aware variant consequences. -Bioinformatics. 2017 Jul 1;33(13):2037-2039. -PMID: 28205675; PMC: PMC5870570 -

-McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. - -The Ensembl Variant Effect Predictor. -Genome Biol. 2016 Jun 6;17(1):122. -PMID: 27268795; PMC: PMC4893825 -