64a3f9e7813e823cf724ea188c3928a911578286 max Thu Jun 4 00:32:22 2026 -0700
varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease track) with two matched tracks for visual case-vs-background comparison:

  varFreqsAffected - variants seen in the affected/case arms of disease cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA cases, GREGoR affected, GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control arms of disease cohorts ("all other variants"); ~1.5 million individuals

A variant seen in both groups appears in both tracks. Genotyping-array cohorts stay out of both (varFreqsArray unchanged).

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it reads phenotype tags (affected/unaffected/unknown) from populations.tsv and is_disease/disease_role from databases.tsv, and derives the length-filter ranges from the observed data.

TOPMed reclassified as a population cohort. SPARK WGS display name changed to SFARI SPARK WGS for consistency with the standalone subtracks. Fixed the trackDb mouseOver $-substitution prefix collision by wrapping fields in ${}. New description pages for both tracks.

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index bb8288f2744..8a6261da7ed 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -1,90 +1,94 @@
-This supertrack collects variant allele frequencies from population-scale sequencing and
-genotyping projects worldwide, from a total of ~1.7 million genomes/exomes/arrays.
+This track collection gathers variant allele frequencies from population-scale sequencing
+and genotyping projects worldwide, from a total of ~1.7 million genomes/exomes/arrays.
The data was not reprocessed in a harmonized way; the variant VCFs were collected from the projects as-is. The goal is a single place to compare how common a variant is across different populations, ancestries, and cohorts, for projects that cannot be recomputed by
-gnomAD soon. Three combined tracks aggregate the source data along different lines, and
+gnomAD soon. Two combined tracks aggregate the source data along different lines, and
there is also one subtrack per project with the original VCF data and all the annotations that the project provides. The different projects use different pipelines and sequencing technologies. Click any of the projects above or below for a summary of their sample selection, sequencing assay and software pipeline. Many projects do not allow us to distribute the data, but we document how to request it and provide all converters.
Data from projects that provide haplotype-phased genotypes can also be found elsewhere: 1000 Genomes is also a separate track, and the phased genotypes HGDP, SGDP, HGDP+1000 Genomes and Mexico Biobank can also be found in the "Phased Variants" track. Their VCF versions below show only the isolate frequency per variant.
Please contact us (genome@soe.ucsc.edu) if you know of a project that we should add. So far, Regeneron's Million Exomes and Mexico City Studies (request rejected) and Taiwan Biobank (pending).
Three combined tracks merge variants from the individual subtracks into single bigBed files with predicted protein consequences and cross-database filtering. All three use the same filter conventions (variant type, consequence, source database, allele frequency, allele count, and per-database AF/AC).
| Database | Region | N | Data Type | Cohort | Sub-populations | Downloadable from UCSC | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| All Databases Combined | -Sequencing-based, all below | -~1.7mil | +Affected/Case Individuals | +Sequencing-based disease cohorts | +— | WGS/WES/long-read | -1.34B variants | -Phenotype splits for SPARK, SFARI WGS, GREGoR | +Affected/case arms of SFARI SPARK WES/WGS, SCHEMA, GREGoR, GA4K | +Affected/case AF and AC; background AF for contrast | No |
| Disease-related Databases Combined | -SPARK, SFARI WGS, TOPMed, SCHEMA, GREGoR, GA4K | -~300k | +Population + Unaffected | +Sequencing-based, population + unaffected | +~1.7mil | WGS/WES/long-read | -932M variants | -SPARK ASD/Non-ASD, SFARI WGS ASD/Non-ASD, SCHEMA case/control, GREGoR aff/unaff/unknown | +Population cohorts + unaffected/control arms | +Background AF and AC; per-cohort and ancestry breakdowns | No |
| Genotyping Array Databases Combined | TPMI, MexBB, UKBB | ~530k | Array / imputed | 14.7M variants | — | No | |||||
| AllOfUs v7 | USA | 245k | @@ -435,31 +439,31 @@ multi-sample callset, consequence annotations are recomputed against Ensembl with