64a3f9e7813e823cf724ea188c3928a911578286 max Thu Jun 4 00:32:22 2026 -0700

varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease track) with two matched tracks for visual case-vs-background comparison:

  varFreqsAffected - variants seen in the affected/case arms of disease cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA cases, GREGoR affected, GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control arms of disease cohorts ("all other variants"); ~1.5 million individuals

A variant seen in both groups appears in both tracks. Genotyping-array cohorts stay out of both (varFreqsArray unchanged).

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it reads phenotype tags (affected/unaffected/unknown) from populations.tsv and is_disease/disease_role from databases.tsv, and derives the length-filter ranges from the observed data.

TOPMed reclassified as a population cohort. SPARK WGS display name changed to SFARI SPARK WGS for consistency with the standalone subtracks. Fixed the trackDb mouseOver $-substitution prefix collision by wrapping fields in ${}. New description pages for both tracks.

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index bb8288f2744..8a6261da7ed 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -1,90 +1,94 @@

Description

-This supertrack collects variant allele frequencies from population-scale sequencing and
-genotyping projects worldwide, from a total of ~1.7 million genomes/exomes/arrays.
+This track collection gathers variant allele frequencies from population-scale sequencing
+and genotyping projects worldwide, from a total of ~1.7 million genomes/exomes/arrays.
 The data was not reprocessed in a harmonized way; the variant VCFs were collected from the projects as-is. The goal is a single place to compare how common a variant is across different populations, ancestries, and cohorts, for projects that cannot be recomputed by
-gnomAD soon. Three combined tracks aggregate the source data along different lines, and
+gnomAD soon. Two combined tracks aggregate the source data along different lines, and
 there is also one subtrack per project with the original VCF data and all the annotations that the project provides. The different projects use different pipelines and sequencing technologies. Click any of the projects above or below for a summary of their sample selection, sequencing assay and software pipeline. Many projects do not allow us to distribute the data, but we document how to request it and provide all converters.

Data from projects that provide haplotype-phased genotypes can also be found elsewhere: 1000 Genomes is also a separate track, and the phased genotypes HGDP, SGDP, HGDP+1000 Genomes and Mexico Biobank can also be found in the "Phased Variants" track. Their VCF versions below show only the isolate frequency per variant.

Please contact us (genome@soe.ucsc.edu) if you know of a project that we should add. So far, we have asked Regeneron's Million Exomes and Mexico City Studies (request rejected) and Taiwan Biobank (request pending).

Combined Tracks

Three combined tracks merge variants from the individual subtracks into single bigBed files with predicted protein consequences and cross-database filtering. All three use the same filter conventions (variant type, consequence, source database, allele frequency, allele count, and per-database AF/AC).

Available Datasets

@@ -435,31 +439,31 @@
 multi-sample callset, consequence annotations are recomputed against Ensembl with bcftools csq, and the result is converted to bigBed via vcfToBigBed.py + bedToBigBed. The mapping from upstream INFO fields to bigBed columns is driven by two configuration files in the scripts directory: databases.tsv (one row per source dataset) and populations.tsv (per-population AC/AF columns within each source). Editing those two files and rerunning mergeAndAnnotate.sh followed by vcfToBigBed.py rebuilds the combined track.

Data Access

All the data is publicly available. The table above indicates if we are allowed to distribute it in VCF format. Most of the databases do not allow us to redistribute the data files directly from our website, but it can always be downloaded from the original websites in some form. Click the database link in the table above and see the "Data Access" section of the respective track for a description of where to download the data. When the data is freely available from our website, the Data Access section will also indicate the VCF file location on our download server. Because it contains some licensed data, the combined track is not available for download, but can be recreated using the conversion scripts in our GitHub repository and the accompanying documentation file.

Credits

-This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click any of the tracks in the list above to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and to Andreas Lahner, MGZ, for feedback.
+This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click any of the tracks in the list above to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the inspiration for this track and to Andreas Lahner, MGZ, for feedback.

References

All of Us Research Program Genomics Investigators. Genomic data in the All of Us Research Program. Nature. 2024 Mar;627(8003):340-346. PMID: 38374255; PMC: PMC10937371

Bhattacharyya C, Subramanian K, Uppili B, Biswas NK, Ramdas S, Tallapaka KB, Arvind P, Rupanagudi KV, Maitra A, Nagabandi T et al.

 Database | Region | N | Data Type | Cohort | Sub-populations | Downloadable from UCSC
-All Databases Combined | Sequencing-based, all below | ~1.7mil | WGS/WES/long-read | 1.34B variants | Phenotype splits for SPARK, SFARI WGS, GREGoR | No
+Affected/Case Individuals | Sequencing-based disease cohorts | | WGS/WES/long-read | Affected/case arms of SFARI SPARK WES/WGS, SCHEMA, GREGoR, GA4K | Affected/case AF and AC; background AF for contrast | No
-Disease-related Databases Combined | SPARK, SFARI WGS, TOPMed, SCHEMA, GREGoR, GA4K | ~300k | WGS/WES/long-read | 932M variants | SPARK ASD/Non-ASD, SFARI WGS ASD/Non-ASD, SCHEMA case/control, GREGoR aff/unaff/unknown | No
+Population + Unaffected | Sequencing-based, population + unaffected | ~1.7mil | WGS/WES/long-read | Population cohorts + unaffected/control arms | Background AF and AC; per-cohort and ancestry breakdowns | No
 Genotyping Array Databases Combined | TPMI, MexBB, UKBB | ~530k | Array / imputed | 14.7M variants | | No
AllOfUs v7 USA 245k