64a3f9e7813e823cf724ea188c3928a911578286 max Thu Jun 4 00:32:22 2026 -0700

varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease
track) with two matched tracks for visual case-vs-background comparison:

  varFreqsAffected - variants seen in the affected/case arms of disease
    cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA cases, GREGoR affected,
    GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control
    arms of disease cohorts ("all other variants"); ~1.5 million individuals

A variant seen in both groups appears in both tracks. Genotyping-array cohorts
stay out of both (varFreqsArray unchanged).

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it reads
phenotype tags (affected/unaffected/unknown) from populations.tsv and
is_disease/disease_role from databases.tsv, and derives the length-filter
ranges from the observed data.

TOPMed reclassified as a population cohort. SPARK WGS display name changed to
SFARI SPARK WGS for consistency with the standalone subtracks. Fixed the
trackDb mouseOver $-substitution prefix collision by wrapping fields in ${}.
New description pages for both tracks.

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsAll.html src/hg/makeDb/trackDb/human/varFreqsAll.html
deleted file mode 100644
index d6ee5fdd42d..00000000000
--- src/hg/makeDb/trackDb/human/varFreqsAll.html
+++ /dev/null
@@ -1,244 +0,0 @@
-
-This track merges variants from 28 sequencing-based variant frequency databases into a -single bigBed file with predicted protein consequences and cross-database filtering. It -contains 1.34 billion variants from WGS, WES, and long-read sequencing cohorts worldwide. -For a summary of all available databases, see the -SNV Frequencies supertrack page. -
- --Two companion combined tracks split out the cohorts that don't belong in a general -sequencing-based summary: -
--Each variant is annotated with its predicted consequence on protein-coding genes -(using bcftools csq with -Ensembl -gene models), and colored by severity. Allele counts and frequencies are shown for each -source database and, where available, broken down by ancestry, population, or phenotype. -
- -Variants are colored by their most severe predicted consequence:
-| Color | Consequence class | Examples |
|---|---|---|
| Red | Protein-truncating / Loss-of-function | stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost |
| Blue | Missense / In-frame | missense, inframe_insertion, inframe_deletion, protein_altering |
| Green | Synonymous | synonymous, stop_retained |
| Grey | Non-coding / Intergenic | intron, non_coding, intergenic, UTR |
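As a rough illustration of the coloring rule, the sketch below picks a color for a comma-separated consequence string by checking the classes from most to least severe. The class membership is copied from the table above. The function name `color_for` and the assumed ranking (red over blue over green over grey) are ours for illustration and are not taken from vcfToBigBed.py.

```python
# Consequence classes from the table above, listed from most to least severe.
# The ranking itself is an assumption of this sketch.
SEVERITY_CLASSES = [
    ("red", {"stop_gained", "frameshift", "splice_donor",
             "splice_acceptor", "stop_lost", "start_lost"}),
    ("blue", {"missense", "inframe_insertion", "inframe_deletion",
              "protein_altering"}),
    ("green", {"synonymous", "stop_retained"}),
]


def color_for(consequence_field):
    """Return the color of the most severe term in a comma-separated string."""
    terms = {t.strip() for t in consequence_field.split(",") if t.strip()}
    for color, members in SEVERITY_CLASSES:
        if terms & members:
            return color
    # Anything not listed above (intron, intergenic, UTR, ...) falls back to grey.
    return "grey"


print(color_for("missense,stop_gained"))
print(color_for("intron"))
```

With these definitions, a variant carrying both a missense and a stop_gained annotation is drawn in red, because the red class is checked first.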
-The "AA change" field uses bcftools csq notation: 23I>23V means position -23 changed from Isoleucine (I) to Valine (V) (missense). 23I alone (no arrow) -means position 23 is Isoleucine and unchanged (synonymous). A "*" indicates a -stop codon (e.g. 45R>45* is a stop_gained). -
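The notation can be parsed with a small regular expression. The following is a minimal sketch that covers only the two forms described above (with and without an arrow); the function name `parse_aa_change` and the returned field names are hypothetical, and other output shapes of bcftools csq are not handled.

```python
import re

# Matches "23I>23V" (changed) and "23I" (unchanged). "*" denotes a stop codon.
_AA_CHANGE = re.compile(r"^(\d+)([A-Z*]+)(?:>(\d+)([A-Z*]+))?$")


def parse_aa_change(text):
    """Split an "AA change" string into position, reference and altered residues."""
    match = _AA_CHANGE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized AA change: {text!r}")
    pos, ref, _alt_pos, alt = match.groups()
    return {
        "position": int(pos),
        "ref": ref,
        "alt": alt,  # None when there is no arrow (residue unchanged)
        "changed": alt is not None and alt != ref,
        "gains_stop": alt is not None and "*" in alt and "*" not in ref,
    }


for example in ("23I>23V", "23I", "45R>45*"):
    print(example, parse_aa_change(example))
```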
- --This track supports extensive filtering via the track settings page. Click on the track -title or use the "Configure" button to access filters: -
- -stop_gained,frameshift is selected by either the "Stop Gained"
- or the "Frameshift" filter. The "Other" bucket catches the less
- common Sequence Ontology
- consequence terms emitted by bcftools csq that don't fit the named
- buckets above. Examples include
- splice_region (variant near a splice site but outside the canonical
- donor/acceptor),
- start_lost / stop_lost (variant disrupts the start codon
- or replaces the stop codon with a coding amino acid),
- stop_retained (variant changes the stop codon but keeps it a stop),
- inframe_insertion / inframe_deletion (in-frame indel
- that adds or removes whole codons), and
- coding_sequence (CDS variant where the precise impact is undetermined).
- If you include "Other" in the filter selection, no records will be
- hidden by the consequence filter.How to find protein-truncating variants: Set the Consequence filter to include -only "Stop Gained", "Frameshift", "Splice Donor", and -"Splice Acceptor". These will appear as red items in the track display.
- --The Source Database filter lets you restrict to variants present in specific databases. -For example, select only "GREGoR" to see variants found in the rare disease cohort. -This filter uses OR logic: selecting multiple databases shows variants found in -any of the selected databases. -
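The OR logic used by the multi-select filters can be sketched as a set-overlap test. This is an illustration only, not the Genome Browser's implementation: the function name `passes_or_filter` is hypothetical, the example values are made up, and treating an empty selection as "filter inactive" is an assumption.

```python
def passes_or_filter(record_values, selected):
    """Return True if the record carries at least one of the selected values."""
    selected = set(selected)
    if not selected:
        # Assumption: an empty selection means the filter is inactive.
        return True
    return not selected.isdisjoint(record_values)


# A variant reported by two databases (made-up example data).
variant_sources = ["GREGoR", "TOPMed"]
print(passes_or_filter(variant_sources, ["GREGoR"]))            # shown
print(passes_or_filter(variant_sources, ["SCHEMA"]))            # hidden
print(passes_or_filter(variant_sources, ["SCHEMA", "TOPMed"]))  # shown

# The same test applies to a comma-separated consequence field.
print(passes_or_filter("stop_gained,frameshift".split(","), ["frameshift"]))
```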
- --Several databases provide ancestry-specific allele frequencies: -
--Three sources also expose phenotype-stratified counts: -
-The Disease-related Databases Combined -track exposes additional phenotype splits for SCHEMA (schizophrenia case vs. control). -
- -
-Variant frequency VCF files from 28 sequencing-based databases were stripped of their INFO
-fields (to reduce size), normalized with bcftools norm (splitting multi-allelic
-sites), and merged with bcftools merge. The merged VCF was then annotated with
-predicted protein consequences using bcftools csq with the
-Ensembl
-GRCh38 release 115 gene annotation (GFF3). The same pipeline, run on different subsets of
-source VCFs, produces the
-Disease-related Databases Combined and
-Genotyping Array Databases Combined tracks.
-
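For orientation, the steps in the paragraph above could be strung together roughly as follows. This sketch only builds the command lines and does not run them. The file names, the helper `build_pipeline_commands`, and the choice of options (`annotate -x INFO`, `norm -m -any`, compressed output, tabix indexing before the merge) are assumptions; the options actually used in the build are recorded in the makeDoc.

```python
import shlex


def build_pipeline_commands(source_vcfs, fasta, gff3, out_prefix):
    """Return bcftools command lines for strip -> norm -> merge -> csq."""
    commands = []
    normalized = []
    for i, vcf in enumerate(source_vcfs):
        stripped = f"{out_prefix}.{i}.noinfo.vcf.gz"
        norm = f"{out_prefix}.{i}.norm.vcf.gz"
        # Drop all INFO fields to reduce file size.
        commands.append(["bcftools", "annotate", "-x", "INFO",
                         "-Oz", "-o", stripped, vcf])
        # Split multi-allelic sites into one record per alternate allele.
        commands.append(["bcftools", "norm", "-m", "-any",
                         "-Oz", "-o", norm, stripped])
        # bcftools merge needs indexed inputs.
        commands.append(["bcftools", "index", "-t", norm])
        normalized.append(norm)
    merged = f"{out_prefix}.merged.vcf.gz"
    annotated = f"{out_prefix}.csq.vcf.gz"
    commands.append(["bcftools", "merge", "-Oz", "-o", merged, *normalized])
    # Annotate predicted protein consequences from the Ensembl GFF3.
    commands.append(["bcftools", "csq", "-f", fasta, "-g", gff3,
                     "-Oz", "-o", annotated, merged])
    return commands


# Hypothetical input names, for illustration only.
for cmd in build_pipeline_commands(["a.vcf.gz", "b.vcf.gz"],
                                   "hg38.fa", "ensembl.gff3", "varFreqs"):
    print(shlex.join(cmd))
```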
-The annotated VCF was converted to bigBed format using a custom Python script
-(vcfToBigBed.py) that reads frequency data from each source VCF in parallel,
-matches variants by position/ref/alt, and writes a BED file with consequence coloring,
-per-database allele counts and frequencies, and population breakdowns.
-The database configuration (which VCFs to include, field mappings, and population definitions)
-is stored in two TSV files
-(databases.tsv and
-populations.tsv)
-so that future updates only require editing these files.
-
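The matching step can be pictured as a lookup keyed on position, reference allele, and alternate allele. The sketch below is not the code of vcfToBigBed.py. The names `variant_key` and `collect_frequencies`, the in-memory dictionaries, and the example counts are all hypothetical, and the allele frequency is computed as allele count divided by allele number.

```python
def variant_key(chrom, pos, ref, alt):
    """Identity of a variant: chromosome, position, ref allele, alt allele."""
    return (chrom, int(pos), ref, alt)


def collect_frequencies(annotated_variants, sources):
    """Attach per-database counts to each annotated variant.

    sources maps a database name to a dict of variant key -> (AC, AN).
    """
    rows = []
    for key in annotated_variants:
        per_db = {}
        for name, table in sources.items():
            if key in table:
                ac, an = table[key]
                per_db[name] = {"ac": ac, "an": an,
                                "af": ac / an if an else 0.0}
        rows.append((key, per_db))
    return rows


# Made-up example data.
key = variant_key("chr1", "12345", "A", "G")
sources = {
    "dbA": {key: (5, 1000)},
    "dbB": {variant_key("chr1", 12345, "A", "T"): (1, 200)},
}
print(collect_frequencies([key], sources))
```

In this toy input only dbA reports the A>G change, so dbB is absent from that variant's row even though it has a different alternate allele at the same position.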
-The track's -makeDoc file documents how each source VCF was converted. -Scripts are available from -GitHub. -
- --The data can be explored interactively with the -Table Browser or the -Data Integrator. -For programmatic access, our REST API -can be used; the track name is varFreqsAll. -
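A request URL for the REST API can be assembled as in the sketch below. The `/getData/track` endpoint and its `genome`, `track`, `chrom`, `start`, and `end` parameters are part of the public UCSC API. The helper name `track_data_url`, the genome name hg38 (inferred from the GRCh38 annotation mentioned in Methods), and the example coordinates are assumptions. Given the redistribution limits described below, the API may not return data for this track.

```python
from urllib.parse import quote

API_BASE = "https://api.genome.ucsc.edu"


def track_data_url(chrom, start, end, genome="hg38", track="varFreqsAll"):
    """Build a getData/track URL for one genomic region."""
    params = [("genome", genome), ("track", track),
              ("chrom", chrom), ("start", start), ("end", end)]
    # The API accepts parameters separated by semicolons.
    query = ";".join(f"{name}={quote(str(value))}" for name, value in params)
    return f"{API_BASE}/getData/track?{query}"


print(track_data_url("chr1", 1000, 2000))
```

The resulting URL can be fetched with any HTTP client; the response is JSON.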
--Because the merged callset includes data from multiple sources whose redistribution -licenses differ, the combined bigBed is not available for download from our -download server. The combined track can be reconstructed from the individual source VCFs -using the -conversion scripts on GitHub together with the -build documentation. Where individual source data is downloadable from UCSC, -the per-subtrack description page indicates the path on our download server. -
- --This track is only possible thanks to the data from millions of volunteers around the world, -who donated blood, signed consent forms and provided health information about themselves and -sometimes their families. Click on any of the individual tracks in the -SNV Frequencies supertrack to see the specific -credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track -and to Andreas Lahner, MGZ, for feedback. -
- --For primary citations of each source dataset, see the References section on the -SNV Frequencies supertrack page. The merged-track -build itself uses the following tools: -
--Danecek P, McCarthy SA. - -BCFtools/csq: haplotype-aware variant consequences. -Bioinformatics. 2017 Jul 1;33(13):2037-2039. -PMID: 28205675; PMC: PMC5870570 -
--McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. - -The Ensembl Variant Effect Predictor. -Genome Biol. 2016 Jun 6;17(1):122. -PMID: 27268795; PMC: PMC4893825 -