d4e7e1a69b17ccebfc8b558f220dd0e1cc3ba1d0
max
Mon Feb 2 06:14:27 2026 -0800
removing regeneron data after rejection of request, code review feedback, refs #36978

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index 1a73429546f..625e2c0bf0e 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -1,26 +1,27 @@

Description

This container shows results from projects where the variant frequencies, aka allele frequencies, are publicly available. The tracks were collected from the
-projects listed below. Projects that provide haplotype-phased genotypes/variants can be found
-elsewhere: 1000 Genomes is a separate track, and the projects HGDP, SGDP,
-HGDP+1000 Genomes and Mexico Biobank can be found in the "Phased Variants" track.
+projects listed below. More detailed data for projects that provide haplotype-phased genotypes/variants can also be found
+in other tracks: 1000 Genomes is a separate track, and the projects HGDP, SGDP,
+HGDP+1000 Genomes and Mexico Biobank can be found in the "Phased Variants" track, showing the linkage between variants.

If you want us to add other projects, please contact us. We asked and were -unable to obtain variant frequencies from the following projects: UK Biobank (request pending), All -of Us (granted, ongoing). + +

If you want us to add other projects, please contact us. We were +unable to obtain variant frequencies from the following projects: UK Biobank (request pending), +Regeneron's Million Exomes and Mexico City Studies (request rejected).

The following projects were added:

Mexico City Prospective Study (MCPS): 9,950 whole genome sequenced individuals and 141,046 exome sequenced and genotyped individuals from the Mexico City Prospective Study (MCPS), a collaboration between the Regeneron Genetics Center, University of Oxford, Universidad Nacional Autónoma de México (UNAM), National Institute of Genomic Medicine in Mexico, Abbvie Inc. and AstraZeneca UK. For details, see Ziyatdinov A, Nature 2023, in the References section.

Display Conventions

Most tracks only show the variant and allele frequencies on mouseover or clicks. When zoomed in, tracks display alleles with base-specific coloring. Homozygote data are shown as one letter, while heterozygotes will be displayed with both letters. All VCF files are normalized, with one single allele per annotation (no multi-allele lines).
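As an illustration of the "one allele per annotation" convention, here is a minimal Python sketch of splitting a multi-allelic VCF line into one line per ALT allele. It is not the pipeline used for these tracks (tools such as bcftools norm handle this in practice). It assumes a sites-only line with the eight fixed VCF columns, and it assumes that AC and AF are the only per-allele INFO fields. The function name is hypothetical.

```python
def split_multiallelic(line, per_allele_keys=("AC", "AF")):
    """Split one sites-only VCF data line into one line per ALT allele.

    Assumption: only the 8 fixed VCF columns are used; any genotype
    columns after INFO are ignored and dropped by this sketch.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    alts = alt.split(",")
    if len(alts) == 1:
        # Already biallelic: return the line unchanged.
        return ["\t".join(fields[:8])]

    out = []
    for i, allele in enumerate(alts):
        new_info = []
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            values = value.split(",")
            # Per-allele fields carry one value per ALT allele; keep the i-th.
            if sep and key in per_allele_keys and len(values) == len(alts):
                item = f"{key}={values[i]}"
            new_info.append(item)
        out.append("\t".join([chrom, pos, vid, ref, allele, qual, flt,
                              ";".join(new_info)]))
    return out
```

This sketch does not left-align or trim alleles, which a full normalization would also do.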

Data Access

Most of the data in these tracks are not available for download from UCSC. -Data can be browsed on our website. -But the data can be downloaded for free from the original projects. Accessing the -data usually requires a click-through license or filling out an access request form on the respective websites, links are either provided above in the project description or with more details here: +

Most of the data in these tracks are not available for download from UCSC and the data can only be browsed on our website. +But all variant data can be downloaded for free from the original project websites. Accessing it usually requires a click-through license or filling out an access request form on the respective websites, by following these instructions:

MXB: Allele frequencies by geographical state and ancestry are available via the MexVar platform. Raw genotype data are available under controlled access at the EGA (Study: EGAS00001005797; Dataset: EGAD00010002361). For the VCFs, email andres.moreno@cinvestav.mx.

+ + +

TOPMED: VCFs with summarized allele frequencies are available from the TOPMED BRAVO website. They require a login.

SFARI SPARK: Allele frequencies can be displayed on the SFARI Genome Browser. Full CRAMs and VCFs with genotypes are available from SFARI Base. They require a data access request, which is usually reviewed quickly. More information is available in the SPARK Welcome Packet.

Australia MGRB: VCF access can be requested via a form from Sydney Genomics.

GenomeAsia Pilot: VCFs are available from UCSC and also from the GenomeAsia 100K website. No license or login is required.

KOVA: TSV data can be requested on the KOVA Downloads website.
+KOVA Downloads website. Our Github repo contains a script that
+converts this format to VCF.

Finngen: TSV data can be requested via the form at -https://finngen.gitbook.io/documentation/data-download which triggers an email with the download -link.

+Finngen +which triggers an automated email containing the download +link. A script in our Github repo converts this file to VCF (see methods below).

SweGen: We are not allowed to redistribute the VCF, you can request it at -SweGen, alongside the VCF file. +

SweGen: VCF files can be requested at
+SweGen via a form. The request needs manual approval, which is usually quick. If there is no reply, email SweGen directly.

NPM: - VCF access can be requested on the - Chorus Browser website, which - requires an account and data access - request. +VCF download can be requested on the Chorus Browser website, which requires an account and data access request.

Methods

The following are quotes from the respective papers and/or websites of the datasets:

MXB: Genotyping was performed with the Illumina Multi-Ethnic Global Array (MEGA, ~1.8M SNPs), optimized for admixed populations and enriched for ancestry-informative and medically relevant variants. Only autosomal, biallelic SNPs passing quality control are included. Samples were selected from 898 recruitment sites, with prioritization of indigenous language speakers. Data processing included GenomeStudio → PLINK conversion, strand alignment, removal of duplicates and low-quality variants/individuals, update of map positions using dbSNP Build 151, and relatedness filtering.

SGDP: The version used was https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/, merged with bcftools and lifted to hg38 with CrossMap.
@@ -305,34 +309,34 @@
and indels were called following GATK Best Practices (GATK v3.7) via per-sample GVCFs (HaplotypeCaller), joint genotyping (CombineGVCFs, GenotypeGVCFs), and Variant Quality Score Recalibration (VQSR-AS); multiallelic variants were split with an in-house script, left-aligned with BCFtools, and annotated using Annovar and custom scripts against dbSNP, 1000 Genomes, and gnomAD, with putative loss-of-function variants identified using LOFTEE v0.3-beta irrespective of confidence labels. Variant and genotype quality was further assessed using the in-house CEGH-Filter two-step algorithm based on depth and allele balance, and analyses retained only GATK VQSR-AS PASS variants and higher-confidence CEGH-Filter calls. Relatedness was assessed using KING and PC-Relate (GENESIS), retaining a single proband per related pair and excluding one contaminated sample (>3% by verifyBAMID), resulting in a final dataset of 1,171 unrelated individuals. Final samples achieved mean coverages ranging from 31.3x to 64.8x, with an average of 38.65x and a median of 36.6x.

SFARI SPARK: The genome browser track project was approved by the Simons
-Foundation as 14584.1. WES and WGS Data were downloaded from
+Foundation under request number 14584.1. WES and WGS Data were downloaded from
SFARI Base.
-pVCFs were downloaded, anonymized with a script using bcftools and the fill-tags plugin and
-normalized, without a minimum allele frequency cutoff.
+pVCFs were downloaded, anonymized with a script using bcftools and its "fill-tags" plugin and
+normalized. There was no minimum allele frequency cutoff.
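The anonymization step can be pictured as replacing per-sample genotypes with summary counts. The following Python sketch is not the script used for this track. It only illustrates, assuming a biallelic site and standard VCF GT strings, the kind of AC/AN/AF values that the bcftools fill-tags plugin computes. The function name is hypothetical.

```python
def allele_stats(genotypes):
    """Return (AC, AN, AF) for a biallelic site from GT strings like '0/1' or '1|1'."""
    ac = 0  # number of ALT alleles observed
    an = 0  # number of called alleles (missing '.' alleles are skipped)
    for gt in genotypes:
        # Treat phased ('|') and unphased ('/') genotypes the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            an += 1
            if allele != "0":
                ac += 1
    # AF is undefined when no allele was called at this site.
    af = ac / an if an else None
    return ac, an, af
```

Once such summary values are stored per variant, the genotype columns can be dropped, so that only frequencies and no individual-level data remain.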
The methods are documented as follows by SFARI:

WES: This release consists of sequence and variant call data for 12,519 unique individuals, of which 12,517 (99.98%) have available genome-wide SNP genotype data. Sequencing and genotyping of all samples in this release was performed at New York Genome Center (NYGC). DNA from saliva samples were extracted and prepared with PCR-free methods and sequenced with paired-end sequencing of 150 bases on the Illumina NovaSeq 6000 system. Alignment of reads to the human reference genome version GRCh38, duplicate read marking, and Base Quality Score Recalibration (BQSR) were performed by New York Genome Center (NYGC). Whole-genome sequencing data were processed using a standardized, functionally equivalent CCDG pipeline with alignment to the GRCh38DH (1000 Genomes)
@@ -377,32 +381,32 @@
GATK to generate gVCFs, pairwise relatedness inferred using PLINK v1.9 IBD estimates from common SNPs (AF ≥ 0.01, dbSNP v151) with ≥15% relatedness flagged, and comprehensive individual- and family-level quality control executed using the internal GenomeCheckMate pipeline to exclude samples based on contamination (≥5%), insufficient coverage (<20x in <80% of targets), sex discordance, pedigree/IBD inconsistencies, unregistered relationships, unexpected duplicates, or excess relatedness, after which QC-passing individuals (selecting the most recent passing sample per person) were retained for variant calling and joint genotyping.

Finngen: R12 annotated variants were downloaded from the Google Cloud
-bucket link received through an email after filling out the form linked from
-https://finngen.gitbook.io/documentation/data-download and converted to VCF
+bucket link received through an email and
+converted to VCF
with a custom Python script.
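A TSV-to-VCF conversion of this kind can be sketched in a few lines of Python. This is not the script from the repository, and the column names used here (chrom, pos, ref, alt, af) are hypothetical placeholders, not the actual Finngen column headers. The sketch only shows the general shape of writing a sites-only VCF with an AF INFO field.

```python
import csv
import io


def tsv_to_vcf(tsv_text):
    """Convert tab-separated variant rows to a minimal sites-only VCF string.

    Assumption: the input has a header row with the hypothetical columns
    chrom, pos, ref, alt and af.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = [
        "##fileformat=VCFv4.2",
        '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for row in reader:
        # ID, QUAL and FILTER are unknown here, so they are set to '.'.
        lines.append("\t".join([
            row["chrom"], row["pos"], ".", row["ref"], row["alt"],
            ".", ".", f"AF={row['af']}",
        ]))
    return "\n".join(lines) + "\n"
```

A real converter would also need to sort the records by position and add contig header lines before the file can be compressed and indexed.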

SweGen: Fragment size 350bp on a Covaris E220. Paired-end sequencing with 150bp read length was performed on Illumina HiSeq X (HiSeq Control Software 3.3.39/RTA 2.7.1) with v2.5 sequencing chemistry. Raw whole-genome reads were aligned to the GRCh37 reference using BWA-MEM v0.7.12, then sorted and indexed with samtools v0.1.19 and assessed with qualimap v2.2.20; per-sample alignments from multiple lanes and flow cells were merged using Picard MergeSamFiles v1.120. Processing followed GATK best practices with GATK v3.3, including indel realignment (RealignerTargetCreator, IndelRealigner), duplicate marking (Picard MarkDuplicates v1.120), and base quality score recalibration (BaseRecalibrator), producing one finalized BAM per sample. Per-sample gVCFs were generated with GATK HaplotypeCaller v3.3 using reference files from the GATK v2.8 resource bundle, with all steps coordinated via Piper v1.4.0. Joint genotyping of 1,000 samples was performed by merging gVCFs in five batches of 200 using GATK CombineGVCFs, followed by cohort
@@ -498,31 +502,32 @@
request. By browsing the data, you agree to use the data only for academic, non-commercial research to improve human health (biology/disease). We request all data users agree to protect the confidentiality of the data subjects in any research papers or publications that they may prepare, by taking all reasonable care to limit the possibility of identification. In particular, the data users shall not to use, or attempt to use, the data to deliberately compromise or otherwise infringe the confidentiality of information on data subjects and their right to privacy. If you use any of the data obtained from the CHORUS variant browser, we request that you cite the NPM flagship paper (Wong et al, 2023). All data users of the data must take note that the data provider and relevant SG10K_Health cohort owners bear no responsibility for the further analysis or interpretation of the data.

Thanks to Alex Ioannidis, UCSC, and Andreas Lahner, MGZ, for feedback on this track.

Thanks to Alex Ioannidis, UCSC, for the idea and motivation for this track. +Thanks to Andreas Lahner, MGZ, for feedback and suggestions.

References

Barberena-Jonas, C. et al. (2025). MexVar database: Clinical genetic variation beyond the Hispanic label in the Mexican Biobank. Nature Medicine (in press).

Sohail M, Moreno-Estrada A. The Mexican Biobank Project promotes genetic discovery, inclusive science and local capacity building. Dis Model Mech. 2024 Jan 1;17(1). PMID: 38299665; PMC: PMC10855211