aa61ebc800429515f9ced7e28f669c6042219f43 max Wed Mar 18 09:09:13 2026 -0700

varFreqs supertrack: add GREGoR track, update all HTML docs, move scripts to varFreqs/, refs #36642

Add GREGoR R04 WGS track to varFreqs superTrack. Update Data Access and Methods sections for all 20+ subtrack HTML files with consistent formatting, sequencing methods from source papers, and links to makeDoc and Github scripts. Move all varFreqs conversion scripts into scripts/varFreqs/ subdirectory and update makeDoc paths accordingly.

Co-Authored-By: Claude Opus 4.6

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index 7ade4a73a09..e49971dccb3 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -1,720 +1,249 @@

Description

-This container shows results from projects where the variant frequencies, aka allele frequencies, -are publicly available. The tracks were collected from the -projects listed below. More detailed data for projects that provide haplotype-phased -genotypes/variants can also be found +This supertrack collects variant allele frequencies from population-scale sequencing and +genotyping projects worldwide. The goal is to provide a single place to compare how common +a variant is across different populations, ancestries, and cohorts. Each subtrack contains +normalized VCF data from one project; an additional +combined track merges all databases for cross-project filtering. +

+ +

+More detailed data for projects that provide haplotype-phased genotypes can also be found in other tracks: 1000 Genomes is a separate track, and the projects HGDP, SGDP, -HGDP+1000 Genomes and Mexico Biobank can be found in the "Phased Variants" track, showing -the linkage between variants. +HGDP+1000 Genomes and Mexico Biobank can be found in the "Phased Variants" track.

If you want us to add other projects, please contact us. We were unable to obtain variant frequencies from the following projects: UK Biobank (request pending), -Regeneron's Million Exomes and Mexico City Studies (request rejected). -

- -

-The following projects were added: -

- Mexico City Prospective Study (MCPS): - 9,950 whole genome sequenced individuals and 141,046 exome sequenced and genotyped - individuals from the Mexico City Prospective Study (MCPS), a collaboration between the - Regeneron Genetics Center, University of Oxford, Universidad Nacional Autónoma de - México (UNAM), National Institute of Genomic Medicine in Mexico, Abbvie Inc. and - AstraZeneca UK. For details, see (Ziyatdinov A, Nature 2023) in the reference section. -
- Regeneron Million Exomes Project (ME): - Whole-exomes of 983,578 individuals sequenced by the Regeneron Genetics Center (RGC). - These data span dozens of collaborations including large biobanks and - health systems. All data were generated by the RGC on a single, harmonized - sequencing and informatics protocol. The dataset includes individuals across - diverse ancestral populations, encompassing outbred and founder populations and - cohorts with high rates of consanguinity. See (Sun et al, Nature 2024) for details. -
- NHLBI TOPMED Freeze 10: - NHLBI TOPMed (Trans-Omics for Precision - Medicine) program, launched by the U.S. National Heart, Lung, and Blood - Institute, integrates whole-genome sequencing with molecular, clinical, - and environmental data from large, well-phenotyped cohorts. Its goal is to - uncover the biological mechanisms underlying heart, lung, blood, and sleep - disorders to advance precision medicine and improve population health. Freeze - 10 contains 868,581,653 variants from 150,899 whole genomes. VCFs were - downloaded from BRAVO. -
- SFARI SPARK: - The Simons Foundation Autism Research Initiative (SFARI) recruited - a large cohort of families with autistic children who provided DNA - samples and phenotypes. 54,558 families, parents and their children - were sequenced, a total of 142,357 individuals with whole-exome (WES) - and 12,519 with whole-genome sequencing (WGS). The data contains - 32,559 trios and 8,895 quads (one sibling without autism), and 824 - twins. The same frequencies shown here - are also available publicly on the - SFARI Genome Browser. - See (SPARK et al, Neuron 2018) for details or the methods below on this page. -
- GenomeAsia Pilot (GAsP): - Whole-genome sequencing data of 1,739 individuals from 219 population groups across Asia. - See (GenomeAsia Consortium, Nature 2019) for details. -
- Australia MGRB: - The Australian Medical Genome Reference Bank collected - whole-genome sequencing data of 4,011 healthy elderly individuals who - lived >=70 years, to make sure that the dataset is depleted of damaging - genetic variants. Age and sex summary graphs are available from - the MGRB website. - See (Lacaze Eur J Hum Genet 2019) for details. -
- ALFA: - The NCBI ALlele Frequency Aggregator pipeline computes allele frequencies from - approved, unrestricted dbGaP studies and makes them publicly available through - dbSNP. Its goal is to release frequency data from over one million dbGaP - subjects to aid discoveries involving common and rare variants with biological - or disease relevance. The R4 release includes 408,709 subjects and allele - frequencies for 15.5 million rs sites, including nearly one million ClinVar - variants. We converted the NCBI track hub to VCF format, the data is freely available. - Genotype and associated individual-level data are accessible through the dbGaP - authorized access request system. -
- FinnGen: - Imputed variants from 500,348 Biobank samples obtained using genotyping arrays - in Finland, 10% of the population. The imputation used phased variants obtained from 8,554 - high-quality whole genome sequences, also from Finland. For details, see (Kurki et al, - Nature 2023). - Phenotype links can be shown at FinnGen PheWeb. -
- SweGen: - Whole-genome sequencing variant frequencies for 1000 Swedish individuals generated within - the SweGen project. - The 1000 individuals included in the SweGen project represent a - cross-section of the Swedish population and that no disease information - has been used for the selection. The frequency data may therefore - include genetic variants that are associated with, or causative of, - disease. SweGen also provides SV calls, TEs, MELT results for TEs, HLAs and new sequence. - For details, see (Ameur et al, Eur J Hum Genet 2017). - The dataset can be browsed at the - SweGen Browser. -
- JPN To61k Japan - Tohoku University Tohoku Medical Megabank Organization 61k Allele frequency panel - (JPN 61k): - An allele frequency panel based on short-read WGS analysis of 61,000 Japanese individuals. - The project includes other datatypes, such as STRs, long-read SVs and short-read CNVs. - Data can be downloaded from the jMorp Website, specifically the - Downloads - section. For details, see (Tadaka et al, NAR 2023). -
- Brazil Arquivo Brasileiro Online de Mutaçõ (ABraOM): - Genomic variants obtained with whole-genome sequencing from SABE, a - census-based sample of elderly individuals from São Paulo, Brazil's - largest city. The Brazilian population is constituted by ~500 years of - admixture between Africans, Europeans, and Native Americans. - Additionally, the cohort presents ~3% of individuals with non-admixed - Japanese ancestry (early 20th century migration). Coverage 38.6x. Data - can be downloaded from the AbraOM Website. TEs, HLAs and new sequence are also available. - For details, see (Naslavsky et al, Nat Comm 2022). -
- IndiGenomes: - Whole genome sequencing of 1,029 healthy Indian individuals under the pilot phase of the - "IndiGen" program. - Data can be downloaded from the IndiGen Website. For details, see (Jain et al, NAR 2020). Only - the allele frequency is available from this project. The website also provides SV call - and Alu insertion VCFs. -
- Korean Variant Archive (KOVA): - 1,896 whole genome sequencing and 3,409 whole exome sequencing data from healthy individuals - of Korean ethnicity. - Most of the samples originated from normal tissue of cancer - patients (40.16 %), healthy parents of rare disease patients (28.4 %), - or healthy volunteers (31.44 %). Japanese ancestry is broken down - in the INFO field. Coverage 100x for WES, 30x for WGS. SVs called with Manta - are also available. For details, see (Lee et al, Exp Mol Med 2022).
- NPM Singapore: - 9,770 whole genomes, mostly of Chinese, Indian and Malay ancestry. - A minimum allele count cutoff of > 5 was applied. - Data is available for download from the CHORUS browser, see "Data access" below. - For details, see (Wong et al, Nat Genetics 2023). CNV data is also available there. -
- Saudi Genome Program: - Variant frequencies from 302 whole genomes at 30x coverage, on Saudi Genome Program Samples. - The genotyping data and imputations from 3,352 individuals do not seem to be available - publicly. For details, see (Malomane et al 2025). -

+Regeneron's Million Exomes and Mexico City Studies (request rejected). +

+ +

Available Datasets


Database	Region	N	Data Type	Cohort	Sub-populations	Access
AllOfUs v7	USA	245k	WGS	General population, diverse	European, East Asian, African, Indigenous American, Oceanian, South Asian	Downloadable
TOPMED Freeze 10	USA	151k	WGS	Heart, lung, blood, sleep disorder cohorts	—	Requires login
SFARI SPARK WES	USA	140k	WES	Autism families (parents + affected children)	—	Access request
SFARI SPARK WGS	USA	12.5k	WGS	Autism families (parents + affected children)	—	Access request
NCBI ALFA R4	USA	408k	WGS/WES/array mix	Aggregated dbGaP studies, mixed phenotypes	—	Available
FinnGen R12	Finland	500k	Imputed (8.5k WGS ref panel)	National biobank, ~10% of population	—	Downloadable
SweGen	Sweden	1k	WGS	Cross-section of Swedish population	—	Access request
SCHEMA	Multi-national	121k	WES	Schizophrenia: 24k cases, 97k controls	—	Available
Japan ToMMO 61k	Japan	61k	WGS	General population	—	Downloadable
Australia MGRB	Australia	4k	WGS	Healthy elderly (age ≥70)	—	Access request
GenomeAsia Pilot	Asia (219 groups)	1.7k	WGS	Diverse populations across Asia	Northeast Asian, Southeast Asian, South Asian	Downloadable
ABraOM Brazil	Brazil	1.2k	WGS	Elderly admixed individuals (São Paulo)	—	Downloadable
IndiGenomes	India	1k	WGS	Healthy individuals	—	Downloadable
KOVA Korea	Korea	5.3k	1.9k WGS + 3.4k WES	Normal tissue from cancer patients, healthy parents, volunteers	—	Access request
NPM Singapore	Singapore	9.8k	WGS	Chinese, Indian, Malay ancestry	—	Access request
Saudi Genome	Saudi Arabia	302	WGS (30x)	Saudi population	—	Downloadable
HRC	Multi-national	~30k	Low-coverage WGS (7x)	Imputation reference panel (excl. 1000 Genomes)	—	Downloadable
MXB Mexico Biobank	Mexico	6k	Genotyping array	Diverse Mexican ancestries, 898 recruitment sites	By state, by ancestry	Access request
SGDP	Global	279	WGS	142 diverse populations worldwide	By population	Downloadable
GREGoR R4	USA	3.6k	WGS	Rare disease families (10.7k participants, 4.4k families)	—	Controlled (dbGaP/AnVIL)
gnomAD HGDP+1kG	Global	4k	WGS	80 populations (HGDP + 1000 Genomes reprocessed)	80 populations, continental groups	Downloadable

Display Conventions

Most tracks show the variant and its allele frequencies only on mouseover or click. When zoomed in, tracks display alleles with base-specific coloring. Homozygous genotypes are shown as one letter, while heterozygous genotypes are displayed with both letters. All VCF files are normalized so that each line carries a single alternate allele (no multi-allelic lines).
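As an illustration of the "one allele per line" convention, the following is a minimal Python sketch that splits a multi-allelic, sites-only VCF line into one line per alternate allele. It is not the conversion code used for these tracks: the function name `split_multiallelic` and the set `PER_ALT_KEYS` are hypothetical, and the sketch assumes that only the INFO keys AC and AF hold one value per alternate allele. Full normalization usually also left-aligns and trims alleles, which this sketch does not attempt.

```python
# Hypothetical sketch: split a multi-allelic, sites-only VCF line into
# one line per ALT allele. INFO keys listed in PER_ALT_KEYS are assumed
# to carry one value per ALT allele (VCF "Number=A"), e.g. AC and AF.
PER_ALT_KEYS = {"AC", "AF"}


def split_multiallelic(line):
    """Return a list of VCF lines, one for each ALT allele of `line`."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, flt, info = fields[:8]
    alts = alt.split(",")
    info_items = [item for item in info.split(";") if item]

    out = []
    for i, allele in enumerate(alts):
        new_info = []
        for item in info_items:
            key, sep, value = item.partition("=")
            values = value.split(",")
            if sep and key in PER_ALT_KEYS and len(values) == len(alts):
                # Keep only the value that belongs to this ALT allele.
                new_info.append(f"{key}={values[i]}")
            else:
                # Flags and per-site values (e.g. AN) are copied unchanged.
                new_info.append(item)
        out.append("\t".join(
            [chrom, pos, var_id, ref, allele, qual, flt, ";".join(new_info)]
        ))
    return out
```

Calling `split_multiallelic` on a line with ALT `C,T` and INFO `AC=3,1;AN=10;AF=0.3,0.1` yields two lines, each with a single ALT allele and its own AC and AF, while AN is copied to both.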

- -

Data Access

-Most of the data in these tracks are not available for download from UCSC and the data can only be -browsed on our website. But all variant data can be downloaded for free from the original project -websites. Accessing it usually requires a click-through license or filling out an access request -form on the respective websites, by following these instructions: -

- -

-MXB: Allele frequencies by geographical state and ancestry are available via -the MexVar platform. -Raw genotype data are available under controlled access at the -EGA (Study: EGAS00001005797; Dataset: EGAD00010002361). For the VCFs, email -andres.moreno@cinvestav.mx. -

- - - -

-TOPMED: VCFs with summarized allele frequencies are available from -the TOPMED BRAVO website. They require a -login. -

-SFARI SPARK: Allele frequencies can be displayed on the - SFARI Genome Browser. - Full CRAMs and VCFs with genotypes are available from SFARI Base. -They require a data access request, which is usually reviewed quickly. More information is available -in the SPARK Welcome Packet. -

- -

-Australia MGRB: VCF access can be requested via a form from -Sydney Genomics. -

- -

-GenomeAsia Pilot: VCFs are available from UCSC and also from -the GenomeAsia 100K website. -No license nor login. -

- -

KOVA: -TSV data can be requested on the KOVA Downloads website. Our Github repo contains a script that -converts this format to VCF. -

- -

Finngen: TSV data can be requested via the form at -Finngen, -which triggers an automated email containing the download -link. A script in our Github repo converts this file to VCF (see methods below).

- -

SweGen: VCF files can be requested at -SweGen via a form, the request -needs manual approval, which usually is quick. If there is no reply, email SweGen directly. -

- -

NPM: -VCF download can be requested on the Chorus Browser website, which requires an account and data access request. -

- -

Methods

The following are quotes from the respective papers and/or websites of the datasets:

- -

-MXB: Genotyping was performed with the Illumina Multi-Ethnic Global Array -(MEGA, ~1.8M SNPs), optimized for admixed populations and enriched for -ancestry-informative and medically relevant variants. Only autosomal, biallelic -SNPs passing quality control are included. Samples were selected from 898 -recruitment sites, with prioritization of indigenous language speakers. Data -processing included GenomeStudio → PLINK conversion, strand alignment, removal -of duplicates, update of map positions using dbSNP Build 151 and low-quality -variants/individuals, and relatedness filtering. -

-SGDP: The version used was -https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/, -merged with bcftools and lifted to hg38 with CrossMap. -

-KOVA: Raw reads were aligned to the GRCh38+decoy reference using BWA-MEM v0.7.17 with default -parameters, followed by duplicate marking and coordinate sorting with MarkDuplicatesSpark, and base -quality score recalibration using BQSRPipelineSpark in GATK v4.1.3.0; mapping quality control -metrics were generated with Qualimap v2.2.1. Single-nucleotide variants and small -insertions/deletions were called per sample using GATK HaplotypeCaller in GVCF mode (-ERC GVCF), and -joint genotyping was performed by creating a GenomicsDB with GenomicsDBImport and following GATK -Best Practices, including variant quality score recalibration (VQSR) retaining 99.7% of true SNVs -and 99.0% of true indels based on training sets (workflow detailed in Supplementary Fig. 1). -Downstream analyses followed a modified version of the gnomAD quality-control framework and were -primarily conducted using Hail, an open-source Python library for large-scale genome analysis; after -merging WES and WGS data in Hail, multiallelic variants and variants with genotype quality <20, -read depth <10, allelic balance <0.2, or overlapping low-complexity regions were excluded -(Supplementary Fig. 2). -
-At UCSC, V7 of the TSV.gz was obtained from the KOVA staff by email and converted to VCF. It is not -available for download from our site but can be requested from the KOVA website. -

- + +

Combined Track (All Databases)

-ABraOM: For Academic use only. Licensing for commercial use might be available under request and agreement. -By using this resource you agree to cite the flagship paper (Naslavsky et al. Nat Comm 2022). -Whole-genome sequencing was performed at Human Longevity Inc. using TruSeq Nano DNA HT libraries -sequenced on Illumina HiSeqX instruments with 150 bp paired-end reads targeting 30x coverage, and -reads were mapped to GRCh38 using ISIS software. Sample sex was validated by comparing CPMs of X -chromosome and male-specific Y (MSY) reads relative to autosomes, yielding the expected female -(~55,000 X CPM, <200 MSY CPM) and male (~27,500 X CPM, >550 MSY CPM) patterns. Germline SNVs -and indels were called following GATK Best Practices (GATK v3.7) via per-sample GVCFs -(HaplotypeCaller), joint genotyping (CombineGVCFs, GenotypeGVCFs), and Variant Quality Score -Recalibration (VQSR-AS); multiallelic variants were split with an in-house script, left-aligned with -BCFtools, and annotated using Annovar and custom scripts against dbSNP, 1000 Genomes, and gnomAD, -with putative loss-of-function variants identified using LOFTEE v0.3-beta irrespective of confidence -labels. Variant and genotype quality was further assessed using the in-house CEGH-Filter two-step -algorithm based on depth and allele balance, and analyses retained only GATK VQSR-AS PASS variants -and higher-confidence CEGH-Filter calls. Relatedness was assessed using KING and PC-Relate -(GENESIS), retaining a single proband per related pair and excluding one contaminated sample -(>3% by verifyBAMID), resulting in a final dataset of 1,171 unrelated individuals. Final samples -achieved mean coverages ranging from 31.3x to 64.8x, with an average of 38.65x and a median of -36.6x. -

- -

SFARI SPARK: The genome browser track project was approved by the Simons -Foundation under request number 14584.1. WES and WGS Data were downloaded from -SFARI Base. -pVCFs were downloaded, anonymized with a script using bcftools and its "fill-tags" plugin and -normalized. There was no minimum allele frequency cutoff.
-The methods are documented as follows by SFARI:

- WES: - This release consists of sequence and variant call data for 12,519 - unique individuals, of which 12,517 (99.98%) have available genome-wide - SNP genotype data. Sequencing and genotyping of all samples in this - release was performed at New York Genome Center (NYGC). DNA from saliva - samples were extracted and prepared with PCR-free methods and sequenced - with paired-end sequencing of 150 bases on the Illumina NovaSeq 6000 - system. Alignment of reads to the human reference genome version - GRCh38, duplicate read marking, and Base Quality Score Recalibration - (BQSR) were performed by New York Genome Center (NYCG). Whole-genome - sequencing data were processed using a standardized, functionally - equivalent CCDG pipeline with alignment to the GRCh38DH (1000 Genomes) - reference using BWA-MEM v0.7.15 (deterministic settings, no -M, use of - .alt contigs), Picard-equivalent duplicate marking (Picard ≥2.4.1 or - equivalent), no indel realignment, and base quality score recalibration - with GATK (dbSNP138, Mills and 1000G gold-standard indels, known - indels). Final outputs were stored as lossless CRAM files with - complete SAM-compliant read-group annotations and mandatory 4-bin - base-quality compression (Q2—6, 10, 20, 30), and all implementations - were validated for functional equivalence across centers before use. - Variant Calling was performed using DeepVariant. See - CCDG pipeline details.
-
- WGS: This release contains - sequence data for 142,357 individuals and genotyping data for - 141,368 individuals. DNA was sequenced from saliva for all - samples and all participants consented to having their genetic - data shared by Regeneron. Exomes for all samples were sequenced with - short-read, paired-end sequencing of 150 bases on Illumina - NovaSeq 6000 machines using S2/S4 flow cells. Sequencing and - genotyping was performed across nine batches (WES1 through - WES9) at the Regeneron Genetics Center (RGC) and integrated - together for this data release. All sequencing batches were - processed using the same DNA extraction methods and sequencing - machines, however two different exome capture panels were used, - as described below. Genotyping was performed using a SNP - genotyping array for WES1 through WES4 and using - "genotyping-by-sequencing" (GxS) for WES5 through WES9. The - first four sequencing batches were sequenced at Regeneron using - custom NEB/Kapa reagents with the IDT (Integrated DNA - Technologies) xGen capture platform, including custom exome - capture regions. Samples starting with batch WES5 were - sequenced using the Twist Bioscience Human - Comprehensive Exome panel, combined with spike-ins for - sequencing genotyping sites (see Genotyping Methods), the full - mitochondrial genome, and coverage boosted at selected sites - for assaying clonal hematopoiesis of indeterminate potential - (CHIP). 
SFARI performed NV/indel calling via DeepVariant and - GATK to generate gVCFs, pairwise relatedness inferred using - PLINK v1.9 IBD estimates from common SNPs (AF ≥ 0.01, dbSNP - v151) with ≥15% relatedness flagged, and comprehensive - individual- and family-level quality control executed using the - internal GenomeCheckMate pipeline to exclude samples based on - contamination (≥5%), insufficient coverage (<20x in <80% of - targets), sex discordance, pedigree/IBD inconsistencies, - unregistered relationships, unexpected duplicates, or excess - relatedness, after which QC-passing individuals (selecting the - most recent passing sample per person) were retained for - variant calling and joint genotyping. -

- -

Finngen: R12 annotated variants were downloaded from the Google Cloud -bucket link received though an email and -converted to VCF -with a custom Python script.

- -

SweGen: Fragment size 350bp on a Covaris E220. Paired-end sequencing with 150bp read -length was performed on Illumina HiSeq X (HiSeq Control Software 3.3.39/RTA 2.7.1) with v2.5 -sequencing chemistry. Raw whole-genome reads were aligned to the GRCh37 reference using BWA-MEM -v0.7.12, then sorted and indexed with samtools v0.1.19 and assessed with qualimap v2.2.20; -per-sample alignments from multiple lanes and flow cells were merged using Picard MergeSamFiles -v1.120. Processing followed GATK best practices with GATK v3.3, including indel realignment -(RealignerTargetCreator, IndelRealigner), duplicate marking (Picard MarkDuplicates v1.120), and -base quality score recalibration (BaseRecalibrator), producing one finalized BAM per sample. -Per-sample gVCFs were generated with GATK HaplotypeCaller v3.3 using reference files from the GATK -v2.8 resource bundle, with all steps coordinated via Piper v1.4.0. Joint genotyping of 1,000 samples -was performed by merging gVCFs in five batches of 200 using GATK CombineGVCFs, followed by cohort -genotyping with GATK GenotypeGVCFs and variant quality score recalibration for SNVs and indels using -VariantRecalibrator and ApplyRecalibration. -
At UCSC, the hg38 VCF was downloaded -from SweFreq. -

- -

Australia MGRB: The 4,011 MGRB samples underwent whole-genome sequencing on -Illumina HiSeq X instruments at KCCG under ISO 15189 accreditation, using -paired-end TruSeq DNA Nano libraries sequenced one lane per sample. Alignment -of sequence reads to the hg38 reference genome assembly was with bwa -0.7.15-r1140. Variants were called following the Genome Analysis Toolkit -(GATK) best practices procedure using GATK 4.1.4.0. A sites-only VCF with only -passing variants (FILTER=PASS) was made with bcftools 1.20.

- -

NPM Singapore: Whole Genome Sequencing (WGS) data processing followed -GATK4 best practices. GATK4 germline variant analysis workflow written in WDL -was adapted to use Nextflow and deployed at the National Supercomputing Centre, -Singapore (NSCC). In short, WGS reads were aligned against GRCh38 using the -BWA-MEM algorithm and used as input to GATK HaplotypeCaller to produce single -sample gVCFs. The gVCF files were joint-called then loaded in Hail, an -open-source python-based data analysis library suited to work with -population-scale with genomic data collections. Low-quality WGS libraries and -low-quality variants were removed. QC-ed variants were functionally annotated -using Ensembl Variant Effect Predictor (VEP) (version 95). Functional -annotations for variant impacting protein-coding were also complemented with -information on the potential alteration to their cognate protein's 3D structure -and drug binding ability. -

- -

Saudi Genome Program: Data were downloaded -from Figshare, -and converted to VCF. +The "All Databases Combined" track merges variants from all individual databases into a single +bigBed file with consequence annotations (via VEP). It supports filtering by variant type +(SNV, insertion, deletion, MNV), predicted consequence (missense, synonymous, stop gained, +frameshift, splice, intron, intergenic), source database, allele frequency (overall maximum +and per-database), and allele count (per-database). This track is most useful in dense mode +for getting a quick overview of variant density across all projects, or with filters to find +variants present in specific databases or within certain frequency ranges.
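The filter logic described for the combined track can be sketched in Python as follows. This only illustrates the idea and is not how the Genome Browser implements its filters: the record layout (a dict with `ref`, `alt`, `consequence` and a per-database `af` mapping) and the function names are assumptions made for this example, and the real bigBed fields may differ.

```python
def variant_type(ref, alt):
    """Classify a normalized, single-ALT variant by its allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"
    return "insertion" if len(ref) < len(alt) else "deletion"


def max_af(record):
    """Overall maximum allele frequency across all source databases."""
    return max(record["af"].values())


def filter_variants(records, types=None, consequences=None,
                    databases=None, af_range=(0.0, 1.0)):
    """Keep records that pass every filter; a filter set to None is skipped."""
    low, high = af_range
    kept = []
    for rec in records:
        if types and variant_type(rec["ref"], rec["alt"]) not in types:
            continue
        if consequences and rec["consequence"] not in consequences:
            continue
        # Require the variant to be present in at least one chosen database.
        if databases and not set(databases) & set(rec["af"]):
            continue
        if not low <= max_af(rec) <= high:
            continue
        kept.append(rec)
    return kept
```

For example, `filter_variants(records, types={"SNV"}, af_range=(0.0, 0.01))` would keep only single-nucleotide variants whose highest frequency in any database is at most 1%.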

Data Access

All the data is publicly available. The table above indicates whether we are allowed to distribute it in VCF format. Most of the databases do not allow us to redistribute the data files directly from our website, but the data can always be downloaded from the original websites in some form. Click the database link in the table above and see the "Data Access" section of the respective track for a description of where to download the data. When the data is freely available from our website, the "Data Access" section also gives the VCF file location on our download server. Because it contains some licensed data, the combined track is not available for download, but it can be recreated using the conversion scripts in our Github repository and the accompanying documentation file.

Credits

-MXB: We thank the Center for Research and Advanced Studies (Cinvestav) of Mexico for -generating and providing the frequency data, the National Institute of Medical -Sciences and Nutrition (INCMNSZ) for DNA extraction, and the Ministry of Health -together with the National Institute of Public Health (INSP) for the design and -implementation of the National Health Survey 2000 (ENSA 2000). We also thank -the ENSA-Genomics Consortium for their contributions to sample collection and -data processing that made possible the construction of the MXB genomic -resource. -

-MCPS: Data produced by Regeneron RGC and collaborators, which are the -University of Oxford, Universidad Nacional Autónoma de México (UNAM) and -National Institute of Genomic Medicine in Mexico. -The Regeneron Genetics Center, University of Oxford, Universidad Nacional -Autónoma de México (UNAM), National Institute of Genomic Medicine in Mexico, -Abbvie Inc. and AstraZeneca UK Limited (collectively, the "Collaborators") bear -no responsibility for the analyses or interpretations of the data presented -here. Any opinions, insights, or conclusions presented herein are those of the -authors and not of the Collaborators.

-Regeneron Million Exomes: The Regeneron Genetics Center, and its collaborators -(collectively, the "Collaborators") bear no responsibility for the analyses or -interpretations of the data presented here. Any opinions, insights, or -conclusions presented herein are those of the authors and not of the -Collaborators. This research has been conducted using the UK Biobank Resource -under application number 26041. -

-SGDP: This project was funded by the Simons Foundation. Thanks to David Reich and Swapan -Mallick for help with importing the data. -

-KOVA: Thanks to Insu Jang and the KOVA director for providing variant frequencies in TSV -format. -

-Finngen: We want to acknowledge the participants and investigators of the FinnGen study. -

- -

-SweGen: The SweGen allele frequency data was generated by Science for -Life Laboratory. The data may be redistributed in original or modified form, -but must always be distributed together with the file "terms_of_use.txt" that -is stored together with the data on our download server, and any redistributed -data derived from the SweGen data set must follow those terms and conditions. -The data may not be used to attempt to identify any individual in this or other studies. -

- -

-NPM Singapore: Thanks to the NPM Data Access Committee and Eleanor for granting our data -request. -By browsing the data, you agree to use the data only for academic, non-commercial -research to improve human health (biology/disease). We request all data users -agree to protect the -confidentiality of the data subjects in any research papers or publications -that they may prepare, by taking all reasonable care to limit the possibility -of identification. In particular, the data users shall not to use, or attempt -to use, the data to deliberately compromise or otherwise infringe the -confidentiality of information on data subjects and their right to privacy. -If you use any of the data obtained from the CHORUS variant browser, we request -that you cite the NPM flagship paper (Wong et al, 2023). All data users of the -data must take note that the data provider and relevant SG10K_Health cohort -owners bear no responsibility for the further analysis or interpretation of the -data.

- -

Thanks to Alex Ioannidis, UCSC, for the idea and motivation for this track. -Thanks to Andreas Lahner, MGZ, for feedback and suggestions.

- -

References

-Barberena-Jonas, C. et al. (2025). MexVar database: Clinical genetic variation beyond the -Hispanic label in the Mexican Biobank. Nature Medicine (in press). -

- -

-Sohail M, Moreno-Estrada A. - -The Mexican Biobank Project promotes genetic discovery, inclusive science and local capacity -building. -Dis Model Mech. 2024 Jan 1;17(1). -PMID: 38299665; PMC: PMC10855211 -

- -

-Sohail M, Palma-Martínez MJ, Chong AY, Quinto-Corés CD, Barberena-Jonas C, Medina-Muñoz SG, -Ragsdale A, Delgado-Sánchez G, Cruz-Hervert LP, Ferreyra-Reyes L et al. - -Mexican Biobank advances population and medical genomics of diverse ancestries. -Nature. 2023 Oct;622(7984):775-783. -PMID: 37821706; PMC: PMC10600006 -

- -

-Ziyatdinov A, Torres J, Alegre-Díaz J, Backman J, Mbatchou J, Turner M, Gaynor SM, Joseph T, Zou Y, -Liu D et al. - -Genotyping, sequencing and analysis of 140,000 adults from Mexico City. -Nature. 2023 Oct;622(7984):784-793. -PMID: 37821707; PMC: PMC10600010 -

- -

-GenomeAsia100K Consortium. - -The GenomeAsia 100K Project enables genetic discoveries across Asia. -Nature. 2019 Dec;576(7785):106-111. -PMID: 31802016; PMC: PMC7054211 -

- -

-Sun KY, Bai X, Chen S, Bao S, Zhang C, Kapoor M, Backman J, Joseph T, Maxwell E, Mitra G et -al. - -A deep catalogue of protein-coding variation in 983,578 individuals. -Nature. 2024 Jul;631(8021):583-592. -PMID: 38768635; PMC: PMC11254753 -

- -

-Tadaka S, Kawashima J, Hishinuma E, Saito S, Okamura Y, Otsuki A, Kojima K, Komaki S, Aoki Y, Kanno -T et al. - -jMorp: Japanese Multi-Omics Reference Panel update report 2023. -Nucleic Acids Res. 2024 Jan 5;52(D1):D622-D632. -PMID: 37930845; PMC: PMC10767895 -

- - - -

-Naslavsky MS, Scliar MO, Yamamoto GL, Wang JYT, Zverinova S, Karp T, Nunes K, Ceroni JRM, de -Carvalho DL, da Silva Simões CE et al. - -Whole-genome sequencing of 1,171 elderly admixed individuals from São Paulo, Brazil. -Nat Commun. 2022 Mar 4;13(1):1004. -PMID: 35246524; PMC: PMC8897431 -

- - - -

-Jain A, Bhoyar RC, Pandhare K, Mishra A, Sharma D, Imran M, Senthivel V, Divakar MK, Rophina M, -Jolly B et al. - -IndiGenomes: a comprehensive resource of genetic variants from over 1000 Indian genomes. -Nucleic Acids Res. 2021 Jan 8;49(D1):D1225-D1232. -PMID: 33095885; PMC: PMC7778947 -

- - - -

-Bergström A, McCarthy SA, Hui R, Almarri MA, Ayub Q, Danecek P, Chen Y, Felkel S, Hallast P, Kamm J -et al. - -Insights into human genetic variation and population history from 929 diverse genomes. -Science. 2020 Mar 20;367(6484). -PMID: 32193295; PMC: PMC7115999 -

- -

-Koenig Z, Yohannes MT, Nkambule LL, Zhao X, Goodrich JK, Kim HA, Wilson MW, Tiao G, Hao SP, Sahakian -N et al. - -A harmonized public resource of deeply sequenced diverse human genomes. -Genome Res. 2024 Jun 25;34(5):796-809. -PMID: 38749656; PMC: PMC11216312 -

- -

-Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S, -Tandon A et al. - -The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. -Nature. 2016 Oct 13;538(7624):201-206. -PMID: 27654912; PMC: PMC5161557 -

- -

-Lee J, Lee J, Jeon S, Lee J, Jang I, Yang JO, Park S, Lee B, Choi J, Choi BO et al. - -A database of 5305 healthy Korean individuals reveals genetic and clinical implications for an East -Asian population. -Exp Mol Med. 2022 Nov;54(11):1862-1871. -PMID: 36323850; PMC: PMC9628380 -

- -

-Kurki MI, Karjalainen J, Palta P, Sipilä TP, Kristiansson K, Donner KM, Reeve MP, Laivuori H, -Aavikko M, Kaunisto MA et al. - -FinnGen provides genetic insights from a well-phenotyped isolated population. -Nature. 2023 Jan;613(7944):508-518. -PMID: 36653562; PMC: PMC9849126 -

- -

-Wong E, Bertin N, Hebrard M, Tirado-Magallanes R, Bellis C, Lim WK, Chua CY, Tong PML, Chua R, Mak K -et al. - -The Singapore National Precision Medicine Strategy. -Nat Genet. 2023 Feb;55(2):178-186. -PMID: 36658435 -

- -

-Malomane DK, Williams MP, Huber CD, Mangul S, Abedalthagafi M, Chiang CWK. - -Patterns of population structure and genetic variation within the Saudi Arabian population. -bioRxiv. 2025 Jan 13;. -PMID: 39868174; PMC: PMC11761371 -

- -

-Ameur A, Dahlberg J, Olason P, Vezzi F, Karlsson R, Martin M, Viklund J, Kähäri AK, -Lundin P, Che H -et al. - -SweGen: a whole-genome data resource of genetic variability in a cross-section of the Swedish -population. -Eur J Hum Genet. 2017 Nov;25(11):1253-1260. -PMID: 28832569; PMC: PMC5765326 -

- -

-SPARK Consortium. Electronic address: pfeliciano@simonsfoundation.org, SPARK Consortium. - -SPARK: A US Cohort of 50,000 Families to Accelerate Autism Research. -Neuron. 2018 Feb 7;97(3):488-493. -PMID: 29420931; PMC: PMC7444276 -

- - - -

-Lacaze P, Pinese M, Kaplan W, Stone A, Brion MJ, Woods RL, McNamara M, McNeil JJ, Dinger ME, Thomas -DM. - -The Medical Genome Reference Bank: a whole-genome data resource of 4000 healthy elderly individuals. -Rationale and cohort design. -Eur J Hum Genet. 2019 Feb;27(2):308-316. -PMID: 30353151; PMC: PMC6336775 -

This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click on any of the tracks in the list above to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and to Andreas Lahner, MGZ, for feedback.