68c5b3b5dfc4053ff78a6b1d236bd1ac90251cfa
lrnassar
Mon Jun 1 14:40:45 2026 -0700

varFreqs: description pages for the three combined tracks and "SNV" rename sweep.

Add varFreqsDisease.html and varFreqsArray.html so the two new combined tracks have full Description/Display/Methods/Data Access/References. Add a Caveats section on varFreqsArray about chip-data quality vs sequencing. Update varFreqsAll.html and the supertrack varFreqs.html to reflect the three-combined-track family (cross-links between siblings, new "Combined Tracks" section, new table rows, and updated source/variant counts). Add a GoNL row to the supertrack table. Sweep 37 subtrack longLabels and four cross-referencing description pages (colorsDbSnv.html, mei.html, meiSwegen.html, phasedVars.html) from "Variant Frequencies:" to "SNV Frequencies:" to match the supertrack shortLabel.

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index fa9d6dbb231..bb8288f2744 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -1,648 +1,680 @@

Description

This supertrack collects variant allele frequencies from population-scale sequencing and genotyping projects worldwide, from a total of ~1.7 million genomes/exomes/arrays.
-The data was not reprocessed in a harmonized way; the variant VCFs were collected from the projects as-is.
-The goal is a single place to compare how common
-a variant is across different populations, ancestries, and cohorts, for
-projects that cannot be recomputed by gnomAD soon. The main
-combined track merges all databases into one summary track,
-with filters, summed population frequencies and recalculated protein-effect annotations.
-There is also one subtrack per project with the original VCF data and all the annotations that the project provides.
-The different projects use different pipelines and sequencing technologies. Click any of the projects
-above or below for a summary of their sample selection, sequencing assay and software pipeline.
-Many projects do not allow us to distribute the data, but we document how to request it
-and provide all converters.

+The data was not reprocessed in a harmonized way; the variant VCFs were collected from the
+projects as-is. The goal is a single place to compare how common a variant is across
+different populations, ancestries, and cohorts, for projects that cannot be recomputed by
+gnomAD soon. Three combined tracks aggregate the source data along different lines, and
+there is also one subtrack per project with the original VCF data and all the annotations
+that the project provides. The different projects use different pipelines and sequencing
+technologies. Click any of the projects above or below for a summary of their sample
+selection, sequencing assay and software pipeline. Many projects do not allow us to
+distribute the data, but we document how to request it and provide all converters.
+

Data from projects that provide haplotype-phased genotypes can also be found elsewhere: 1000 Genomes is available as a separate track, and the phased genotypes for HGDP, SGDP, HGDP+1000 Genomes and Mexico Biobank can be found in the "Phased Variants" track. Their VCF versions below show only the isolate frequency per variant.

Please contact us (genome@soe.ucsc.edu) if you know of a project that we should add. So far, we have requested data from Regeneron's Million Exomes and Mexico City Studies (request rejected) and from Taiwan Biobank (pending).

Combined Track (All Databases)

Combined Tracks

-The "All Databases Combined" track merges variants from all individual databases into a single
-bigBed file with consequence annotations, totaling 1.17 billion variants from ~1.7 million individuals.
-The track supports filtering by variant type
-(SNV, insertion, deletion, MNV), predicted consequence (missense, synonymous, stop gained,
-frameshift, splice, intron, intergenic), source database, allele frequency (overall maximum
-and per-database), and allele count (total or per-database). The track is useful in dense mode
-to get a quick overview of variant density across all projects, or with filters to find
-variants present in specific databases or within certain frequency ranges. With the "clone track"
-feature you can clone this track and keep multiple versions, each with different filters activated.
-The "Density mode" checkbox on the track configuration page shows a plot of the
-density of variants passing a filter, one per track clone.
+Three combined tracks merge variants from the individual subtracks into single bigBed files
+with predicted protein consequences and cross-database filtering. All three use the same
+filter conventions (variant type, consequence, source database, allele frequency, allele
+count, and per-database AF/AC).

All Databases Combined: 1.34 billion variants from 28 sequencing-based cohorts (WGS, WES, long-read). The default summary view of the supertrack. Excludes the genotyping-array cohorts.
Disease-related Databases Combined: 932 million variants from six disease-focused cohorts (SPARK, SFARI WGS, TOPMed, SCHEMA, GREGoR, GA4K), with phenotype-stratified AC/AF where the source provides it.
Genotyping Array Databases Combined: 14.7 million variants from three array cohorts (TPMI Taiwan, Mexico Biobank, UK Biobank imputed). Kept separate because chip data has different per-variant confidence than sequencing.
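
The shared filter conventions (overall maximum AF versus per-database AF, total versus per-database AC) can be illustrated with a small sketch. The record layout and the function names below are hypothetical and are not the actual bigBed schema of the combined tracks; they only show how the two kinds of threshold differ.

```python
from typing import Dict, Optional, Tuple

# Hypothetical per-variant record: database name -> (allele count, allele frequency).
# The real combined-track bigBed has its own column layout.
Record = Dict[str, Tuple[int, float]]


def max_af(record: Record) -> float:
    """Highest allele frequency reported by any source database."""
    return max(af for _, af in record.values())


def total_ac(record: Record) -> int:
    """Allele count summed over all source databases."""
    return sum(ac for ac, _ in record.values())


def passes(record: Record,
           min_max_af: float = 0.0,
           min_total_ac: int = 0,
           database: Optional[str] = None,
           min_db_af: float = 0.0) -> bool:
    """Apply an overall filter and, optionally, a per-database filter."""
    if max_af(record) < min_max_af:
        return False
    if total_ac(record) < min_total_ac:
        return False
    if database is not None:
        # A per-database filter requires the variant to be present in that database.
        if database not in record:
            return False
        if record[database][1] < min_db_af:
            return False
    return True
```

A variant can pass an overall maximum-AF threshold because of one database while still failing a per-database threshold for another, which is why the two filter types are kept separate.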

Available Datasets


Database	Region	N	Data Type	Cohort	Sub-populations	Downloadable from UCSC
All Databases combined	All below	1.7mil	WGS/WES/imputed			
All Databases Combined	Sequencing-based, all below	~1.7mil	WGS/WES/long-read	1.34B variants	Phenotype splits for SPARK, SFARI WGS, GREGoR	No
Disease-related Databases Combined	SPARK, SFARI WGS, TOPMed, SCHEMA, GREGoR, GA4K	~300k	WGS/WES/long-read	932M variants	SPARK ASD/Non-ASD, SFARI WGS ASD/Non-ASD, SCHEMA case/control, GREGoR aff/unaff/unknown	No
Genotyping Array Databases Combined	TPMI, MexBB, UKBB	~530k	Array / imputed	14.7M variants	—	No
AllOfUs v7	USA	245k	WGS	General population, diverse	African, Indigenous American, East Asian, European, Oceanian, South Asian (local ancestry; see Notes below)	Yes
TOPMED Freeze 10	USA	151k	WGS	Heart, lung, blood, sleep disorder cohorts	—	Yes
SFARI SPARK WES	USA	140k	WES	Autism families (parents + affected children)	—	No
SFARI SPARK WGS	USA	12.5k	WGS	Autism families (parents + affected children)	—	No
NCBI ALFA R4	USA	408k	WGS/WES/array mix	Aggregated dbGaP studies, mixed phenotypes	—	Yes
FinnGen R12	Finland	500k	Imputed (8.5k WGS ref panel)	National biobank, ~10% of population	—	Yes
UK Biobank (Neale Lab v3)	UK	361k	Imputed array (HRC+UK10K+1KGp3 ref panel)	White British subset of UK Biobank, Neale Lab Round 2 GWAS	—	Yes
SweGen	Sweden	1k	WGS	Cross-section of Swedish population	—	No
GoNL	Netherlands	498	WGS (~13x)	250 unrelated Dutch trios (parents only)	—	Yes
SCHEMA	Multi-national	121k	WES	Schizophrenia: 24k cases, 97k controls	—	Yes
Japan ToMMO 61k	Japan	61k	WGS	General population	—	Yes
WBBC China	China	4.5k	WGS	Westlake BioBank for Chinese pilot (now part of China Precision BioBank), autosomes only	North Han, Central Han, South Han, Lingnan Han (by recruitment region)	Yes
ChinaMAP phase 1	China	10.5k	WGS	China Metabolic Analytics Project, ~40x depth, 27 provinces and 8 ethnic groups, autosomes only	—	No
Taiwan TPMI	Taiwan	165k	Axiom SNP array (TPM1)	Taiwan Precision Medicine Initiative, Han Chinese	—	No
Australia MGRB	Australia	4k	WGS	Healthy elderly (age ≥70)	—	No
GenomeAsia Pilot	Asia (219 groups)	1.7k	WGS	Diverse populations across Asia	Northeast Asian, Southeast Asian, South Asian, Oceanian, American, African, Western European Reference	Yes
ABraOM Brazil	Brazil	1.2k	WGS	Elderly admixed individuals (São Paulo)	—	Yes
IndiGenomes	India	1k	WGS	Healthy individuals	—	Yes
GenomeIndia 9.7k	India	9.8k	WGS (≥23x)	83 anthropologically defined endogamous populations across India	—	Yes
KOVA Korea	Korea	5.3k	1.9k WGS + 3.4k WES	Normal tissue from cancer patients, healthy parents, volunteers	—	No
NPM Singapore	Singapore	9.8k	WGS	Chinese, Indian, Malay ancestry	—	No
Saudi Genome	Saudi Arabia	302	WGS (30x)	Saudi population	—	Yes
HRC	Multi-national	~30k	Low-coverage WGS (7x)	Imputation reference panel (excl. 1000 Genomes)	—	Yes
MXB Mexico Biobank	Mexico	6k	Genotyping array	Diverse Mexican ancestries, 898 recruitment sites	By state, by ancestry	No
SGDP	Global	279	WGS	142 diverse populations worldwide	By population	Yes
GREGoR R4	USA	3.6k	WGS	Rare disease families (10.7k participants, 4.4k families)	—	No
gnomAD HGDP+1kG	Global	4k	WGS	80 populations (HGDP + 1000 Genomes reprocessed)	4k-cohort total AF only; per-population AF columns are full gnomAD v3.1.2 release values (~76k genomes), see Notes below	Yes
GA4K	USA	552	PacBio HiFi long-read WGS	Genomic Answers for Kids: pediatric rare-disease probands and families (Children's Mercy)	—	Yes
CoLoRSdb v1.2.0	Multi-national	1,027	PacBio HiFi long-read WGS	Consortium of Long Read Sequencing: aggregated population-consented samples across multiple research cohorts	—	Yes
SVatalog 101	Canada (SickKids)	101	10X Genomics linked short-read WGS	GWAS SVatalog cohort: 101 samples with matched long-read SVs (see chirmade101Sv)	—	Yes
Indigenous Africans 180	Africa (Ethiopia, Tanzania, Cameroon, Botswana)	180	WGS (>30x)	12 indigenous populations across all four African language phyla (Khoesan, Niger-Congo, Nilo-Saharan, Afroasiatic)	—	No

Notes on Specific Sub-tracks

AllOfUs — local-ancestry-stratified frequencies

The AllOfUs subtrack provides local-ancestry-stratified allele frequencies, not the global ancestry categories used in the All of Us Research Program 2024 Nature paper (see References). Each variant's per-ancestry AF/AC counts only the haplotypes whose inferred local ancestry at that exact genomic position belongs to the named group (strict-both-haps mode). The six ancestry classes (African, Indigenous American, East Asian, European, Oceanian, South Asian) match HGDP-derived local-ancestry reference panels and so include Oceanian, which is not one of the paper's six global Rye categories (those are AFR, AMR, EAS, EUR, Middle Eastern, SAS). For an admixed individual, the local-ancestry AF at a position can therefore differ substantially from the AF among self-reported members of the same ancestry group. The Ioannidis lab (Phoenix, UCSC) developed the pipeline that produced this VCF and applied it to the AllOfUs v7 release; only variants with cohort allele count ≥ 20 were retained.
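
To make the difference from global-ancestry frequencies concrete, here is a minimal sketch of a local-ancestry-stratified count. It is not the Ioannidis lab pipeline. The input layout is hypothetical, and the `strict_both` flag encodes only one possible reading of "strict-both-haps" (an individual contributes at a site only when both haplotypes carry the same local ancestry), which is an assumption.

```python
from collections import defaultdict


def local_ancestry_af(individuals, strict_both=False):
    """Per-ancestry (AC, AN, AF) at one site.

    individuals: list of haplotype pairs, each haplotype a tuple
    (local_ancestry_label, allele) with allele 0 (ref) or 1 (alt).
    strict_both: assumed interpretation of "strict-both-haps"; skips
    individuals whose two haplotypes have different local ancestry here.
    """
    ac = defaultdict(int)
    an = defaultdict(int)
    for hap_a, hap_b in individuals:
        if strict_both and hap_a[0] != hap_b[0]:
            continue
        for ancestry, allele in (hap_a, hap_b):
            an[ancestry] += 1   # every counted haplotype adds to the allele number
            ac[ancestry] += allele
    return {anc: (ac[anc], an[anc], ac[anc] / an[anc]) for anc in an}


def keep_variant(cohort_ac, min_ac=20):
    """Cohort-level allele-count cutoff (AC >= 20) mentioned in the text."""
    return cohort_ac >= min_ac
```

In this toy model an admixed individual contributes each haplotype to a different ancestry bucket (or to none under the stricter reading), which is why the resulting AF can differ from one computed over self-reported groups.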

gnomAD HGDP+1kG — cohort vs full-release frequencies

This subtrack derives from the gnomAD v3.1.2 release, which embeds the 4,094-genome jointly-called HGDP+1kG cohort (Koenig et al. 2024) inside the larger gnomAD aggregation. To save space, we kept only INFO fields useful for clinical and population-genetic interpretation. Two allele-frequency sets are exposed:

The cohort-level AC/AF/AN fields (no prefix) are computed across the ~3,400 unrelated HGDP+1kG individuals (allele number ≈ 6,800).
The per-population filter fields (gnomAD v3.1.2 African AF, gnomAD v3.1.2 Latino AF, etc.) are values from the full gnomAD v3.1.2 release (~76,000 genomes), not just the 4,094-genome HGDP+1kG cohort. The corresponding allele numbers are typically tens of thousands per population.

The filter labels on the track configuration page, and the field descriptions in the combined-track bigBed, reflect this distinction. Per-population HGDP+1kG-cohort frequencies are not exposed because the cohort is too small for stable per-population estimates in many populations.
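
As a quick check of the numbers above: at a diploid autosomal site the allele number is twice the number of genotyped individuals, and AF = AC / AN. The helper names are illustrative only.

```python
def allele_number(n_individuals, ploidy=2):
    """Allele number (AN) when every individual is genotyped at the site."""
    return n_individuals * ploidy


def allele_frequency(ac, an):
    """Allele frequency as allele count divided by allele number."""
    if an <= 0:
        raise ValueError("allele number must be positive")
    return ac / an


# ~3,400 unrelated individuals give an allele number of about 6,800.
cohort_an = allele_number(3400)
```

With a small allele number, each additional observed allele moves the AF by a comparatively large step, which is the instability the text refers to for per-population estimates in a small cohort.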

Display Conventions

Most tracks only show the variant and allele frequencies on mouseover or clicks. When zoomed in, tracks display alleles with base-specific coloring. Homozygote data are shown as one letter; heterozygotes are shown with both letters. All VCF files are normalized, with one allele per annotation (no multi-allele lines).
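
The "one allele per annotation" convention can be sketched as follows. This is a simplified illustration, not the code used to build the tracks: it splits only the ALT column and the INFO keys named in `per_allele_keys` (assumed to hold one value per ALT allele), and it does not left-align or trim alleles the way a full normalizer does.

```python
def split_multiallelic(line, per_allele_keys=("AC", "AF")):
    """Split one sites-only VCF data line into one line per ALT allele."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    alts = alt.split(",")
    info_items = [item.split("=", 1) for item in info.split(";") if item]

    out = []
    for i, alt_allele in enumerate(alts):
        new_info = []
        for item in info_items:
            if len(item) == 2 and item[0] in per_allele_keys:
                # Keep only the value that belongs to the i-th ALT allele.
                new_info.append(f"{item[0]}={item[1].split(',')[i]}")
            else:
                # Site-level values and flags are copied unchanged.
                new_info.append("=".join(item))
        out.append("\t".join(
            [chrom, pos, vid, ref, alt_allele, qual, flt, ";".join(new_info)]))
    return out
```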

Methods

Each subtrack includes the upstream project's VCF largely as-released; per-subtrack pipelines (coordinate liftover, format conversion, header normalization) are documented on each subtrack's own description page and recorded in the build documentation. The conversion scripts (e.g. finngen_to_vcf.py, kovaToVcf.py, schema_addAcAnAf.py, svatalogFreqToVcf.py) live alongside the makedoc in the scripts directory.

The combined "All Databases" subtrack is built by a separate pipeline: each per-subtrack VCF is normalized (bcftools norm), all sites are merged into a single multi-sample callset, consequence annotations are recomputed against Ensembl with bcftools csq, and the result is converted to bigBed via vcfToBigBed.py + bedToBigBed. The mapping from upstream INFO fields to bigBed columns is driven by two configuration files in the scripts directory: databases.tsv (one row per source dataset) and populations.tsv (per-population AC/AF columns within each source). Editing those two files and rerunning mergeAndAnnotate.sh followed by vcfToBigBed.py rebuilds the combined track.
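
The normalize, merge and annotate steps described above can be sketched as a list of command lines. This is not the content of mergeAndAnnotate.sh, whose exact options are not given here: the file names and the choice of flags are assumptions, and the final vcfToBigBed.py + bedToBigBed conversion is left out because its interface is not documented in this page.

```python
def build_commands(vcfs, ref_fasta, gff3,
                   merged="merged.vcf.gz", annotated="annotated.vcf.gz"):
    """Return bcftools command lines (as argument lists) for the three steps.

    All paths are placeholders. bcftools merge expects bgzipped, indexed
    inputs; indexing is not shown.
    """
    cmds = []
    normalized = []
    for vcf in vcfs:
        out = vcf.replace(".vcf.gz", ".norm.vcf.gz")
        # Split multi-allelic records and left-align against the reference.
        cmds.append(["bcftools", "norm", "-m", "-any", "-f", ref_fasta,
                     "-Oz", "-o", out, vcf])
        normalized.append(out)
    # Merge all normalized per-database files into one callset.
    cmds.append(["bcftools", "merge", "-Oz", "-o", merged, *normalized])
    # Recompute consequence annotations from a GFF3 gene model.
    cmds.append(["bcftools", "csq", "-f", ref_fasta, "-g", gff3,
                 "-Oz", "-o", annotated, merged])
    return cmds
```

Each returned list could be passed to `subprocess.run(cmd, check=True)`; building the commands separately from running them makes it easy to inspect what would be executed first.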

Data Access

All the data is publicly available. The table above indicates if we are allowed to distribute it in VCF format. Most of the databases do not allow us to redistribute the data files directly from our website, but it can always be downloaded from the original websites in some form. Click the database link in the table above and see the "Data Access" section of the respective track for a description of where to download the data. When the data is freely available from our website, the Data Access section will also indicate the VCF file location on our download server. Because it contains some licensed data, the combined track is not available for download, but can be recreated using the conversion scripts in our GitHub repository and the accompanying documentation file.

Credits

This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click any of the tracks in the list above to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and to Andreas Lahner, MGZ, for feedback.

References

All of Us Research Program Genomics Investigators. Genomic data in the All of Us Research Program. Nature. 2024 Mar;627(8003):340-346. PMID: 38374255; PMC: PMC10937371

Bhattacharyya C, Subramanian K, Uppili B, Biswas NK, Ramdas S, Tallapaka KB, Arvind P, Rupanagudi KV, Maitra A, Nagabandi T et al. Mapping genetic diversity with the GenomeIndia project. Nat Genet. 2025 Apr;57(4):767-773. PMID: 40200122

Ameur A, Dahlberg J, Olason P, Vezzi F, Karlsson R, Martin M, Viklund J, Kahari AK, Lundin P, Che H et al. SweGen: a whole-genome data resource of genetic variability in a cross-section of the Swedish population. Eur J Hum Genet. 2017 Nov;25(11):1253-1260. PMID: 28832569; PMC: PMC5765326

Chirmade S, Wang Z, Mastromatteo S, Sanders E, Thiruvahindrapuram B, Nalpathamkalam T, Pellecchia G, Lin F, Keenan K, Patel RV et al. GWAS SVatalog: a visualization tool to aid fine-mapping of GWAS loci with structural variations. Heredity (Edinb). 2025 Sep;135(3):199-210. PMID: 41203876; PMC: PMC13031531

Cohen ASA, Farrow EG, Abdelmoity AT, Alaimo JT, Amudhavalli SM, Anderson JT, Bansal L, Bartik L, Baybayan P, Belden B et al. Genomic answers for children: Dynamic analyses of >1000 pediatric rare disease genomes. Genet Med. 2022 Jun;24(6):1336-1348. PMID: 35305867

Fan S, Spence JP, Feng Y, Hansen MEB, Terhorst J, Beltrame MH, Ranciaro A, Hirbo J, Beggs W, Thomas N et al. Whole-genome sequencing reveals a complex African population demographic history and signatures of local adaptation. Cell. 2023 Mar 2;186(5):923-939.e14. PMID: 36868214; PMC: PMC10568978

Feliciano P, Daniels AM, Snyder LG, Beaumont A, Camba A, Esler A, Gulsrud AG, Mason A, Nicholson A, Paolicelli AM et al; The SPARK Consortium. SPARK: A US Cohort of 50,000 Families to Accelerate Autism Research. Neuron. 2018 Feb 7;97(3):488-493. PMID: 29420931; PMC: PMC7444276

GenomeAsia100K Consortium. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature. 2019 Dec;576(7785):106-111. PMID: 31802016; PMC: PMC7054211

Jain A, Bhoyar RC, Pandhare K, Mishra A, Sharma D, Imran M, Senthivel V, Divakar MK, Rophina M, Jolly B et al. IndiGenomes: a comprehensive resource of genetic variants from over 1000 Indian genomes. Nucleic Acids Res. 2021 Jan 8;49(D1):D1225-D1232. PMID: 33095885; PMC: PMC7778947

Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alfoldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020 May;581(7809):434-443. PMID: 32461654; PMC: PMC7334197

Koenig Z, Yohannes MT, Nkambule LL, Zhao X, Goodrich JK, Kim HA, Wilson MW, Tiao G, Hao SP, Sahakian N et al. A harmonized public resource of deeply sequenced diverse human genomes. Genome Res. 2024 Jun 25;34(5):796-809. PMID: 38749656; PMC: PMC11216312

Kurki MI, Karjalainen J, Palta P, Sipila TP, Kristiansson K, Donner KM, Reeve MP, Laivuori H, Aavikko M, Kaunisto MA et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature. 2023 Jan;613(7944):508-518. PMID: 36653562; PMC: PMC9849126

Lacaze P, Pinese M, Kaplan W, Stone A, Brion MJ, Woods RL, McNamara M, McNeil JJ, Dinger ME, Thomas DM. The Medical Genome Reference Bank: a whole-genome data resource of 4000 healthy elderly individuals. Rationale and cohort design. Eur J Hum Genet. 2019 Feb;27(2):308-316. PMID: 30353151; PMC: PMC6336775

Lee S, Seo J, Park J, Nam JY, Choi A, Ignatius JS, Bjornson RD, Chae JH, Jang IJ, Lee S et al. Korean Variant Archive (KOVA): a reference database of genetic variations in the Korean population. Sci Rep. 2017 Jun 27;7(1):4287. PMID: 28655895; PMC: PMC5487339

Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S, Tandon A et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 2016 Oct 13;538(7624):201-206. PMID: 27654912; PMC: PMC5161557

McCarthy S, Das S, Kretzschmar W, Delaneau O, Wood AR, Teumer A, Kang HM, Fuchsberger C, Danecek P, Sharp K et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet. 2016 Oct;48(10):1279-83. PMID: 27548312; PMC: PMC5388176

Naslavsky MS, Scliar MO, Yamamoto GL, Wang JYT, Zverinova S, Karp T, Nunes K, Ceroni JRM, de Carvalho DL, da Silva Simões CE et al. Whole-genome sequencing of 1,171 elderly admixed individuals from São Paulo, Brazil. Nat Commun. 2022 Mar 4;13(1):1004. PMID: 35246524; PMC: PMC8897431

Sohail M, Palma-Martínez MJ, Chong AY, Quinto-Cortés CD, Barberena-Jonas C, Medina-Muñoz SG, Ragsdale A, Delgado-Sánchez G, Cruz-Hervert LP, Ferreyra-Reyes L et al. Mexican Biobank advances population and medical genomics of diverse ancestries. Nature. 2023 Oct;622(7984):775-783. PMID: 37821706; PMC: PMC10600006

Singh T, Poterba T, Curtis D, Akil H, Al Eissa M, Barchas JD, Bass N, Bigdeli TB, Breen G, Bromet EJ et al. Rare coding variants in ten genes confer substantial risk for schizophrenia. Nature. 2022 Apr;604(7906):509-516. PMID: 35396579; PMC: PMC9805802

Tadaka S, Kawashima J, Hishinuma E, Saito S, Okamura Y, Otsuki A, Kojima K, Komaki S, Aoki Y, Kanno T et al. jMorp: Japanese Multi-Omics Reference Panel update report 2023. Nucleic Acids Res. 2024 Jan 5;52(D1):D622-D632. PMID: 37930845; PMC: PMC10767895

Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, Taliun SAG, Corvelo A, Gogarten SM, Kang HM et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021 Feb;590(7845):290-299. PMID: 33568819; PMC: PMC7875770

Wong E, Bertin N, Hebrard M, Tirado-Magallanes R, Bellis C, Lim WK, Chua CY, Tong PML, Chua R, Mak K et al. The Singapore National Precision Medicine Strategy. Nat Genet. 2023 Feb;55(2):178-186. PMID: 36658435

Wu D, Dou J, Chai X, Bellis C, Wilm A, Shih CC, Soon WWJ, Bertin N, Lin CB, Khor CC et al. Large-scale whole-genome sequencing of three diverse Asian populations in Singapore. Cell. 2019 Oct 17;179(3):736-749.e15. PMID: 31626772