165a15d6a94d53f8162a01e69f3912a7a23a3b50 max Mon Mar 23 06:47:55 2026 -0700 mostly done with the variant frequencies track, refs#36642 diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html index e49971dccb3..c3268848514 100644 --- src/hg/makeDb/trackDb/human/varFreqs.html +++ src/hg/makeDb/trackDb/human/varFreqs.html @@ -1,249 +1,269 @@

Description

This supertrack collects variant allele frequencies from population-scale sequencing and -genotyping projects worldwide. The goal is to provide a single place to compare how common -a variant is across different populations, ancestries, and cohorts. Each subtrack contains -normalized VCF data from one project; an additional -combined track merges all databases for cross-project filtering. -

+genotyping projects worldwide, from a total of ~1.7 million genomes/exomes/arrays. +The data were not reprocessed in a harmonized way; instead, the variant VCFs were collected directly from the projects. +The goal is to provide a single place to compare how common +a variant is across different populations, ancestries, and cohorts, for +projects that cannot be recomputed by gnomAD soon. The main +combined track merges all databases into a single summary track, +with filters, summed population frequencies and recalculated protein-effect annotations. +In addition, there is one subtrack per project with the original VCF data and all the annotations that the project provides. +The projects use different pipelines and sequencing technologies; click any of the projects +above or below for a summary of their sample selection, sequencing assay and software pipeline. +Many projects do not allow us to distribute the data, but we document how the +data can be requested and provide all converters.

-More detailed data for projects that provide haplotype-phased genotypes can also be found -in other tracks: 1000 Genomes is a separate track, and the projects HGDP, SGDP, -HGDP+1000 Genomes and Mexico Biobank can be found in the "Phased Variants" track. +Data from projects that provide haplotype-phased genotypes can also be found +elsewhere: 1000 Genomes is available as a separate track, and the phased genotypes of HGDP, SGDP, +HGDP+1000 Genomes and Mexico Biobank are in the "Phased Variants" track. +Their VCF versions below show only the isolate frequency per variant. +

+ +

Please contact us (genome@soe.ucsc.edu) if you know of a project that we should add. So far, +we have requested these: UK Biobank (pending for a year), +Regeneron's Million Exomes and Mexico City Studies (request rejected), Taiwan Biobank (pending).

If you want us to add other projects, please contact us. We were -unable to obtain variant frequencies from the following projects: UK Biobank (request pending), -Regeneron's Million Exomes and Mexico City Studies (request rejected). +

Combined Track (All Databases)

+The "All Databases Combined" track merges variants from all individual databases into a single +bigBed file with consequence annotations, a total of more than 1.2 billion variants from 1.7 million individuals. +The track supports filtering by variant type +(SNV, insertion, deletion, MNV), predicted consequence (missense, synonymous, stop gained, +frameshift, splice, intron, intergenic), source database, allele frequency (overall maximum +and per-database), and allele count (total or per-database). This track is useful either in dense mode +for getting a quick overview of variant density across all projects, or with filters to find +variants present in specific databases or within certain frequency ranges. Note that with the "clone track" +feature you can clone this track and have multiple versions, each with different filters activated. +You can also use our "Density mode" checkbox on the track configuration page to show a plot with the +density of variants passing a filter, one per track clone.
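
As a rough illustration of the filter logic described above, the following Python sketch derives an overall maximum allele frequency and a total allele count from per-database values and then applies a few filters. This is only a sketch under assumptions: the class `CombinedVariant`, its field names, the function `passes` and all numbers are hypothetical and invented for illustration. They do not reflect the actual bigBed schema of the track or the Genome Browser's filter implementation.

```python
from dataclasses import dataclass, field


@dataclass
class CombinedVariant:
    # Hypothetical record layout; the real bigBed fields may differ.
    name: str
    var_type: str        # e.g. "SNV", "insertion", "deletion", "MNV"
    consequence: str     # e.g. "missense", "synonymous", "frameshift"
    # database name -> (allele count, allele frequency)
    per_db: dict = field(default_factory=dict)

    @property
    def max_af(self):
        # Overall maximum allele frequency across all source databases.
        return max((af for _, af in self.per_db.values()), default=0.0)

    @property
    def total_ac(self):
        # Total allele count summed over all source databases.
        return sum(ac for ac, _ in self.per_db.values())


def passes(v, var_types=None, consequences=None, database=None,
           max_af_at_most=1.0, min_total_ac=0):
    """Return True if variant v passes all filters that are set."""
    if var_types is not None and v.var_type not in var_types:
        return False
    if consequences is not None and v.consequence not in consequences:
        return False
    if database is not None and database not in v.per_db:
        return False
    return v.max_af <= max_af_at_most and v.total_ac >= min_total_ac


# Invented example values, not real data from any project.
variants = [
    CombinedVariant("chr1:1000A>G", "SNV", "missense",
                    {"AllOfUs": (12, 0.00002), "FinnGen": (900, 0.0009)}),
    CombinedVariant("chr1:2000C>T", "SNV", "synonymous",
                    {"AllOfUs": (50000, 0.1)}),
    CombinedVariant("chr2:500delA", "deletion", "frameshift",
                    {"SweGen": (1, 0.0005)}),
]

rare_missense = [v for v in variants
                 if passes(v, consequences={"missense"}, max_af_at_most=0.001)]
```

Each track clone mentioned above would correspond to calling `passes` with a different set of arguments on the same list of variants.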

Available Datasets

+ + + + + + + + +

Database	Region	N	Data Type	Cohort	Sub-populations	Access
All Databases combined	All below	1.7mil	WGS/WES/imputed			Not downloadable from UCSC
AllOfUs v7	USA	245k	WGS	General population, diverse	European, East Asian, African, Indigenous American, Oceanian, South Asian	Downloadable
TOPMED Freeze 10	USA	151k	WGS	Heart, lung, blood, sleep disorder cohorts	—	Requires login
SFARI SPARK WES	USA	140k	WES	Autism families (parents + affected children)	—	Access request
SFARI SPARK WGS	USA	12.5k	WGS	Autism families (parents + affected children)	—	Access request
NCBI ALFA R4	USA	408k	WGS/WES/array mix	Aggregated dbGaP studies, mixed phenotypes	—	Available
FinnGen R12	Finland	500k	Imputed (8.5k WGS ref panel)	National biobank, ~10% of population	—	Downloadable
SweGen	Sweden	1k	WGS	Cross-section of Swedish population	—	Access request
SCHEMA	Multi-national	121k	WES	Schizophrenia: 24k cases, 97k controls	—	Available
Japan ToMMO 61k	Japan	61k	WGS	General population	—	Downloadable
Australia MGRB	Australia	4k	WGS	Healthy elderly (age ≥70)	—	Access request
GenomeAsia Pilot	Asia (219 groups)	1.7k	WGS	Diverse populations across Asia	Northeast Asian, Southeast Asian, South Asian	Downloadable
ABraOM Brazil	Brazil	1.2k	WGS	Elderly admixed individuals (São Paulo)	—	Downloadable
IndiGenomes	India	1k	WGS	Healthy individuals	—	Downloadable
KOVA Korea	Korea	5.3k	1.9k WGS + 3.4k WES	Normal tissue from cancer patients, healthy parents, volunteers	—	Access request
NPM Singapore	Singapore	9.8k	WGS	Chinese, Indian, Malay ancestry	—	Access request
Saudi Genome	Saudi Arabia	302	WGS (30x)	Saudi population	—	Downloadable
HRC	Multi-national	~30k	Low-coverage WGS (7x)	Imputation reference panel (excl. 1000 Genomes)	—	Downloadable
MXB Mexico Biobank	Mexico	6k	Genotyping array	Diverse Mexican ancestries, 898 recruitment sites	By state, by ancestry	Access request
SGDP	Global	279	WGS	142 diverse populations worldwide	By population	Downloadable
GREGoR R4	USA	3.6k	WGS	Rare disease families (10.7k participants, 4.4k families)	—	Controlled (dbGaP/AnVIL)
gnomAD HGDP+1kG	Global	4k	WGS	80 populations (HGDP + 1000 Genomes reprocessed)	80 populations, continental groups	Downloadable

Display Conventions

Most tracks only show the variant and allele frequencies on mouseover or clicks. When zoomed in, tracks display alleles with base-specific coloring. Homozygote data are shown as one letter, while heterozygotes will be displayed with both letters. All VCF files are normalized, with one single allele per annotation (no multi-allele lines).
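
To make the "one single allele per annotation" convention concrete, here is a minimal Python sketch that splits a multi-allelic, sites-only VCF line into one line per alternate allele. It is an illustration only and not the converter used for these tracks. The function name `split_multiallelic` and the choice of `AC` and `AF` as per-allele INFO keys are assumptions, and the sketch performs no left-alignment or allele trimming, which full normalization tools typically also do.

```python
def split_multiallelic(line, per_alt_keys=("AC", "AF")):
    """Split one sites-only VCF data line into one line per ALT allele.

    INFO keys listed in per_alt_keys are assumed to hold one value per
    ALT allele; all other INFO entries are copied unchanged.
    """
    fields = line.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    if len(alts) == 1:
        # Already bi-allelic: nothing to split.
        return ["\t".join(fields)]

    info_items = [] if fields[7] == "." else fields[7].split(";")
    out = []
    for i, alt in enumerate(alts):
        new_info = []
        for item in info_items:
            key, sep, value = item.partition("=")
            if sep and key in per_alt_keys:
                values = value.split(",")
                # Keep only the value belonging to the i-th ALT allele.
                if len(values) == len(alts):
                    value = values[i]
                new_info.append(f"{key}={value}")
            else:
                new_info.append(item)
        info = ";".join(new_info) or "."
        out.append("\t".join(fields[:4] + [alt] + fields[5:7] + [info] + fields[8:]))
    return out


# Invented example line, not taken from any of the projects.
example = "chr1\t1000\t.\tA\tC,T\t.\tPASS\tAC=3,5;AN=100;AF=0.03,0.05"
rows = split_multiallelic(example)
```

After splitting, each row carries exactly one alternate allele together with its own count and frequency, so it can be displayed as a single annotation.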

- -

Combined Track (All Databases)

-The "All Databases Combined" track merges variants from all individual databases into a single -bigBed file with consequence annotations (via VEP). It supports filtering by variant type -(SNV, insertion, deletion, MNV), predicted consequence (missense, synonymous, stop gained, -frameshift, splice, intron, intergenic), source database, allele frequency (overall maximum -and per-database), and allele count (per-database). This track is most useful in dense mode -for getting a quick overview of variant density across all projects, or with filters to find -variants present in specific databases or within certain frequency ranges. -

Data Access

All the data is publicly available. The table above indicates whether we are allowed to distribute it in VCF format. Most of the databases do not allow us to redistribute the data files directly from our website, but the data can always be downloaded from the original websites in some form. Click the database link in the table above and see the "Data Access" section of the respective track for a description of where to download the data. When the data is freely available from our website, the "Data Access" section also indicates the VCF file location on our download server. Because it contains some licensed data, the combined track is not available for download, but it can be recreated using the conversion scripts in our GitHub repository and the accompanying documentation file.

Credits

This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click on any of the tracks in the list above to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and to Andreas Lahner, MGZ, for feedback.