64a3f9e7813e823cf724ea188c3928a911578286 max Thu Jun 4 00:32:22 2026 -0700

varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease track) with two matched tracks for visual case-vs-background comparison:

  varFreqsAffected - variants seen in the affected/case arms of disease cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA cases, GREGoR affected, GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control arms of disease cohorts ("all other variants"); ~1.5 million individuals

A variant seen in both groups appears in both tracks. Genotyping-array cohorts stay out of both (varFreqsArray unchanged).

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it reads phenotype tags (affected/unaffected/unknown) from populations.tsv and is_disease/disease_role from databases.tsv, and derives the length-filter ranges from the observed data.

TOPMed reclassified as a population cohort. SPARK WGS display name changed to SFARI SPARK WGS for consistency with the standalone subtracks. Fixed the trackDb mouseOver $-substitution prefix collision by wrapping fields in ${}. New description pages for both tracks.

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index bb8288f2744..8a6261da7ed 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -1,90 +1,94 @@

Description

-This supertrack collects variant allele frequencies from population-scale sequencing and
-genotyping projects worldwide, from a total of ~1.7 million genomes/exomes/arrays.
+This track collection gathers variant allele frequencies from population-scale sequencing
+and genotyping projects worldwide, from a total of ~1.7 million genomes/exomes/arrays.
The data was not reprocessed in a harmonized way; the variant VCFs were collected from the projects as-is. The goal is a single place to compare how common a variant is across different populations, ancestries, and cohorts, for projects that cannot be recomputed by
-gnomAD soon. Three combined tracks aggregate the source data along different lines, and
+gnomAD soon. Two combined tracks aggregate the source data along different lines, and
there is also one subtrack per project with the original VCF data and all the annotations that the project provides. The different projects use different pipelines and sequencing technologies. Click any of the projects above or below for a summary of their sample selection, sequencing assay and software pipeline. Many projects do not allow us to distribute the data, but we document how to request it and provide all converters.

Data from projects that provide haplotype-phased genotypes can also be found elsewhere: 1000 Genomes is also a separate track, and the phased genotypes HGDP, SGDP, HGDP+1000 Genomes and Mexico Biobank can also be found in the "Phased Variants" track. Their VCF versions below show only the isolate frequency per variant.

Please contact us (genome@soe.ucsc.edu) if you know of a project that we should add. So far, Regeneron's Million Exomes and Mexico City Studies (request rejected) and Taiwan Biobank (pending).

Combined Tracks

Three combined tracks merge variants from the individual subtracks into single bigBed files with predicted protein consequences and cross-database filtering. All three use the same filter conventions (variant type, consequence, source database, allele frequency, allele count, and per-database AF/AC).

-All Databases Combined — 1.34
-  billion variants from 28 sequencing-based cohorts (WGS, WES, long-read). The default
-  summary view of the supertrack. Excludes the genotyping-array cohorts.
-Disease-related Databases Combined
-  — 932 million variants from six disease-focused cohorts (SPARK, SFARI WGS,
-  TOPMed, SCHEMA, GREGoR, GA4K), with phenotype-stratified AC/AF where the source
-  provides it.
+Affected/Case Individuals —
+  variants seen in the affected or case arm of five disease-study cohorts (SFARI SPARK
+  WES and WGS autism probands, SCHEMA schizophrenia cases, GREGoR affected, GA4K
+  rare-disease). Each variant also carries its frequency in the background, so
+  case-enriched variants can be isolated.
+Population + Unaffected —
+  the matched background: variants seen in the population reference cohorts (gnomAD
+  HGDP+1kG, TOPMed, ALFA, HRC and the national WGS projects) and in the
+  unaffected/control arms of the disease cohorts. Showing this together with the
+  Affected track lets you compare case versus background frequency across a gene. Both
+  tracks exclude the genotyping-array cohorts.
Genotyping Array Databases Combined — 14.7 million variants from three array cohorts (TPMI Taiwan, Mexico Biobank, UK Biobank imputed). Kept separate because chip data has different per-variant confidence than sequencing.
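
As an illustration of how case-enriched variants might be isolated from the two frequencies that an Affected/Case record carries, here is a minimal sketch. The `VariantFreq` class, its field names, the `min_ratio` threshold and the pseudocount are all hypothetical choices made for this example; they are not taken from the track's bigBed schema or from vcfToBigBed.py.

```python
from dataclasses import dataclass


@dataclass
class VariantFreq:
    """Hypothetical record: one variant with its case and background AF."""
    name: str
    affected_af: float    # allele frequency in the affected/case arms
    background_af: float  # allele frequency in population + unaffected


def case_enriched(variants, min_ratio=5.0, pseudo=1e-6):
    """Return (name, ratio) pairs where case AF exceeds background AF.

    A small pseudocount keeps the ratio finite when a variant is absent
    from the background. Both defaults are illustrative assumptions.
    """
    hits = []
    for v in variants:
        ratio = (v.affected_af + pseudo) / (v.background_af + pseudo)
        if ratio >= min_ratio:
            hits.append((v.name, ratio))
    # Most strongly enriched variants first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    demo = [
        VariantFreq("varA", 0.01, 0.0001),
        VariantFreq("varB", 0.01, 0.01),
        VariantFreq("varC", 0.001, 0.0),
    ]
    for name, ratio in case_enriched(demo):
        print(name, round(ratio, 1))
```

A plain frequency ratio like this ignores sample size, so a rare variant seen once in a small case cohort can look strongly enriched; a real analysis would probably also want the allele counts.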

Available Datasets

@@ -435,31 +439,31 @@
multi-sample callset, consequence annotations are recomputed against Ensembl with bcftools csq, and the result is converted to bigBed via vcfToBigBed.py + bedToBigBed. The mapping from upstream INFO fields to bigBed columns is driven by two configuration files in the scripts directory: databases.tsv (one row per source dataset) and populations.tsv (per-population AC/AF columns within each source). Editing those two files and rerunning mergeAndAnnotate.sh followed by vcfToBigBed.py rebuilds the combined track.
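
To make the configuration-driven split more concrete, the following sketch shows one way two TSV files of this kind could route each population into an affected or a background track. It is not the code in vcfToBigBed.py: the column names (`db`, `population`, `phenotype`), the sample rows and the rule that everything not tagged `affected` goes to the background are assumptions made for illustration. Only the file names, the `is_disease` field and the phenotype tag values come from the commit message.

```python
import csv
import io

# Hypothetical stand-ins for databases.tsv and populations.tsv.
DATABASES_TSV = "db\tis_disease\nSCHEMA\t1\nTOPMed\t0\n"
POPULATIONS_TSV = (
    "db\tpopulation\tphenotype\n"
    "SCHEMA\tcases\taffected\n"
    "SCHEMA\tcontrols\tunaffected\n"
    "TOPMed\tall\tunknown\n"
)


def load_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def assign_tracks(databases, populations):
    """Route each (db, population) pair to one of the two tracks.

    Assumed rule: the affected arm of a disease cohort goes to
    varFreqsAffected, everything else goes to varFreqsBackground.
    """
    is_disease = {row["db"]: row["is_disease"] == "1" for row in databases}
    tracks = {"varFreqsAffected": [], "varFreqsBackground": []}
    for pop in populations:
        key = (pop["db"], pop["population"])
        if is_disease.get(pop["db"], False) and pop["phenotype"] == "affected":
            tracks["varFreqsAffected"].append(key)
        else:
            tracks["varFreqsBackground"].append(key)
    return tracks


if __name__ == "__main__":
    result = assign_tracks(load_tsv(DATABASES_TSV), load_tsv(POPULATIONS_TSV))
    for track, members in result.items():
        print(track, members)
```

Keeping the routing rule in data rather than in code is what would let a cohort such as TOPMed be reclassified by editing one row and rerunning the conversion.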

Data Access

All the data is publicly available. The table above indicates if we are allowed to distribute it in VCF format. Most of the databases do not allow us to redistribute the data files directly from our website, but it can always be downloaded from the original websites in some form. Click the database link in the table above and see the "Data Access" section of the respective track for a description of where to download the data. When the data is freely available from our website, the Data Access section will also indicate the VCF file location on our download server. Because it contains some licensed data, the combined track is not available for download, but can be recreated using the conversion scripts in our GitHub repository and the accompanying documentation file.

Credits

-This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click any of the tracks in the list above to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and to Andreas Lahner, MGZ, for feedback.

+This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click any of the tracks in the list above to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the inspiration for this track and to Andreas Lahner, MGZ, for feedback.

References

All of Us Research Program Genomics Investigators. Genomic data in the All of Us Research Program. Nature. 2024 Mar;627(8003):340-346. PMID: 38374255; PMC: PMC10937371

Bhattacharyya C, Subramanian K, Uppili B, Biswas NK, Ramdas S, Tallapaka KB, Arvind P, Rupanagudi KV, Maitra A, Nagabandi T et al.

Database	Region	N	Data Type	Cohort	Sub-populations	Downloadable from UCSC
-All Databases Combined	Sequencing-based, all below	~1.7mil	WGS/WES/long-read	1.34B variants	Phenotype splits for SPARK, SFARI WGS, GREGoR	No
+Affected/Case Individuals	Sequencing-based disease cohorts	—	WGS/WES/long-read	Affected/case arms of SFARI SPARK WES/WGS, SCHEMA, GREGoR, GA4K	Affected/case AF and AC; background AF for contrast	No
-Disease-related Databases Combined	SPARK, SFARI WGS, TOPMed, SCHEMA, GREGoR, GA4K	~300k	WGS/WES/long-read	932M variants	SPARK ASD/Non-ASD, SFARI WGS ASD/Non-ASD, SCHEMA case/control, GREGoR aff/unaff/unknown	No
+Population + Unaffected	Sequencing-based, population + unaffected	~1.7mil	WGS/WES/long-read	Population cohorts + unaffected/control arms	Background AF and AC; per-cohort and ancestry breakdowns	No
Genotyping Array Databases Combined	TPMI, MexBB, UKBB	~530k	Array / imputed	14.7M variants	—	No
AllOfUs v7	USA	245k