c7b6da0a433f53fdf3c4137a67598d099d9ce1a5 max Wed Jan 14 10:52:53 2026 -0800
adding SFARI SPARK to varFreqs supertrack, refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index 177933905e0..e2b62660951 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -1,26 +1,25 @@
This container shows results from projects where the variant frequencies, aka allele frequencies, are publicly available. The tracks were collected from the projects listed below. Projects that provide haplotype-phased genotypes/variants can be found elsewhere: 1000 Genomes is a separate track, and the projects HGDP, SGDP, HGDP+1000 Genomes and Mexico Biobank can be found in the "Phased Variants" track.
If you want us to add other projects, please contact us. We asked and were
-unable to obtain variant frequencies from the following projects: UK Biobank (request pending), All of us (granted),
-SFARI SPARK (in process).
+unable to obtain variant frequencies from the following projects: UK Biobank (request pending), All of Us (granted, ongoing).
The following projects were added:
Most tracks only show the variant and allele frequencies on mouseover or clicks. When zoomed in, tracks display alleles with base-specific coloring. Homozygote data are shown as one letter, while heterozygotes will be displayed with both letters.
For NCBI ALFA: This track has no single VCF with INFO fields, but uses multiple subtracks instead, one per ancestry.
Most of the data in these tracks are not available for download from UCSC. Data can be browsed on our website.
-But the data can be downloaded
-for free from the original projects. Accessing the
-data usually requires a click-through license on the respectice websites, links are either
+But the data can be downloaded for free from the original projects. Accessing the
+data usually requires a click-through license or access request on the respective websites; links are either
provided above in the project description or with more details here:
MXB: Allele frequencies by geographical state and ancestry are available via the MexVar platform. Raw genotype data are available under controlled access at the EGA (Study: EGAS00001005797; Dataset: EGAD00010002361). For the VCFs, email andres.moreno@cinvestav.mx.
MCPS: VCFs with summarized allele frequencies are available from the MCPS website.
Regeneron one million exomes: VCFs with summarized allele frequencies are available from the RGC ME website.
TOPMED: VCFs with summarized allele frequencies are available from the TOPMED BRAVO website. They require a login.
+SFARI SPARK: Allele frequencies can be displayed on the
+ SFARI Genome Browser.
+ Full CRAMs and VCFs with genotypes are available from SFARI Base.
+They require a data access request, which is usually reviewed quickly. More information is available in the
+SPARK Welcome Packet.
+
+GenomeAsia Pilot: VCFs are available from UCSC and also from the GenomeAsia 100K website. No license nor login.
KOVA: TSV data can be requested on the KOVA Downloads website.
Finngen: TSV data can be requested via the form at https://finngen.gitbook.io/documentation/data-download which triggers an email with the download link.
-SweGen: We are allowed to redistribute the VCF, but under the condition that the file terms_of_use.txt is distributed with the file. You can find it on our download server, alongside the VCF file.
+SweGen: We are not allowed to redistribute the VCF; you can request it at SweGen.
NPM: VCF access can be requested on the Chorus Browser website, which requires an account and data access request.
MXB: Genotyping was performed with the Illumina Multi-Ethnic Global Array (MEGA, ~1.8M SNPs), optimized for admixed populations and enriched for ancestry-informative and medically relevant variants. Only autosomal, biallelic SNPs passing quality control are included. Samples were selected from 898 recruitment sites, with prioritization of indigenous language speakers. Data processing included GenomeStudio → PLINK conversion, strand alignment, removal of duplicates and low-quality variants/individuals, update of map positions using dbSNP Build 151, and relatedness filtering.
SGDP: The version used was https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/, merged with bcftools and lifted to hg38 with CrossMap.
-KOVA: V7 of the TSV.gz was obtained from the KOVA staff and converted to VCF. It is not
+KOVA: Raw reads were aligned to the GRCh38+decoy reference using BWA-MEM v0.7.17 with default parameters, followed by duplicate marking and coordinate sorting with MarkDuplicatesSpark, and base quality score recalibration using BQSRPipelineSpark in GATK v4.1.3.0; mapping quality control metrics were generated with Qualimap v2.2.1. Single-nucleotide variants and small insertions/deletions were called per sample using GATK HaplotypeCaller in GVCF mode (-ERC GVCF), and joint genotyping was performed by creating a GenomicsDB with GenomicsDBImport and following GATK Best Practices, including variant quality score recalibration (VQSR) retaining 99.7% of true SNVs and 99.0% of true indels based on training sets (workflow detailed in Supplementary Fig. 1). Downstream analyses followed a modified version of the gnomAD quality-control framework and were primarily conducted using Hail, an open-source Python library for large-scale genome analysis; after merging WES and WGS data in Hail, multiallelic variants and variants with genotype quality <20, read depth <10, allelic balance <0.2, or overlapping low-complexity regions were excluded (Supplementary Fig. 2).
+
+At UCSC, V7 of the TSV.gz was obtained from the KOVA staff by email and converted to VCF. It is not
available for download from our site but can be requested from the KOVA website.
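The exclusion criteria listed in the KOVA methods above (multiallelic sites, genotype quality <20, read depth <10, allelic balance <0.2, low-complexity overlap) can be sketched as a simple predicate. This is a minimal illustration only: KOVA ran its filtering in Hail, and the `Call` record and its field names below are hypothetical, not part of the KOVA pipeline or of the UCSC conversion.

```python
from dataclasses import dataclass

# Thresholds as stated in the KOVA methods text.
MIN_GQ = 20    # genotype quality
MIN_DP = 10    # read depth
MIN_AB = 0.2   # allelic balance


@dataclass
class Call:
    """Hypothetical per-variant record; field names are illustrative only."""
    gq: int
    dp: int
    allele_balance: float
    n_alt_alleles: int
    in_low_complexity: bool


def passes_kova_style_qc(call: Call) -> bool:
    """Return True if the call survives the exclusion criteria quoted above."""
    if call.n_alt_alleles > 1:  # multiallelic variants were excluded
        return False
    if call.gq < MIN_GQ or call.dp < MIN_DP or call.allele_balance < MIN_AB:
        return False
    return not call.in_low_complexity
```

A call with `gq=30, dp=15, allele_balance=0.45` at a biallelic site outside low-complexity regions passes; lowering any one value below its threshold rejects it.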
+ABraOM: For Academic use only. Licensing for commercial use might be available under request and agreement. +By using this resource you agree to cite the flagship paper (Naslavsky et al. Nat Comm 2022). +Whole-genome sequencing was performed at Human Longevity Inc. using TruSeq Nano DNA HT libraries sequenced on Illumina HiSeqX instruments with 150 bp paired-end reads targeting 30× coverage, and reads were mapped to GRCh38 using ISIS software. Sample sex was validated by comparing CPMs of X chromosome and male-specific Y (MSY) reads relative to autosomes, yielding the expected female (~55,000 X CPM, <200 MSY CPM) and male (~27,500 X CPM, >550 MSY CPM) patterns. Germline SNVs and indels were called following GATK Best Practices (GATK v3.7) via per-sample GVCFs (HaplotypeCaller), joint genotyping (CombineGVCFs, GenotypeGVCFs), and Variant Quality Score Recalibration (VQSR-AS); multiallelic variants were split with an in-house script, left-aligned with BCFtools, and annotated using Annovar and custom scripts against dbSNP, 1000 Genomes, and gnomAD, with putative loss-of-function variants identified using LOFTEE v0.3-beta irrespective of confidence labels. Variant and genotype quality was further assessed using the in-house CEGH-Filter two-step algorithm based on depth and allele balance, and analyses retained only GATK VQSR-AS PASS variants and higher-confidence CEGH-Filter calls. Relatedness was assessed using KING and PC-Relate (GENESIS), retaining a single proband per related pair and excluding one contaminated sample (>3% by verifyBAMID), resulting in a final dataset of 1,171 unrelated individuals. Final samples achieved mean coverages ranging from 31.3× to 64.8×, with an average of 38.65× and a median of 36.6×. +
+
+SFARI SPARK: The project was approved by the Simons Foundation as 14584.1. WES and WGS data were downloaded from
+ SFARI Base.
+ pVCFs were downloaded, anonymized with a script using bcftools and the fill-tags plugin and normalized,
+ without a minimum allele frequency cutoff.
+ The methods are documented as follows by SFARI:
+ WGS:
+ This release consists of sequence and variant call data for 12,519
+ unique individuals, of which 12,517 (99.98%) have available genome-wide
+ SNP genotype data. Sequencing and genotyping of all samples in this
+ release was performed at New York Genome Center (NYGC). DNA from saliva
+ samples were extracted and prepared with PCR-free methods and sequenced
+ with paired-end sequencing of 150 bases on the Illumina NovaSeq 6000
+ system. Alignment of reads to the human reference genome version
+ GRCh38, duplicate read marking, and Base Quality Score Recalibration
+ (BQSR) were performed by New York Genome Center (NYGC). Whole-genome
+ sequencing data were processed using a standardized, functionally
+ equivalent CCDG pipeline with alignment to the GRCh38DH (1000 Genomes)
+ reference using BWA-MEM v0.7.15 (deterministic settings, no -M, use of
+ .alt contigs), Picard-equivalent duplicate marking (Picard ≥2.4.1 or
+ equivalent), no indel realignment, and base quality score recalibration
+ with GATK (dbSNP138, Mills and 1000G gold-standard indels, known
+ indels). Final outputs were stored as lossless CRAM files with
+ complete SAM-compliant read-group annotations and mandatory 4-bin
+ base-quality compression (Q2–6, 10, 20, 30), and all implementations
+ were validated for functional equivalence across centers before use.
+ Variant Calling was performed using DeepVariant. See CCDG pipeline details.
+
+ WES: This release contains
+ sequence data for 142,357 individuals and genotyping data for
+ 141,368 individuals. DNA was sequenced from saliva for all
+ samples and all participants consented to having their genetic
+ data shared by Regeneron. Exomes for all samples were sequenced with
+ short-read, paired-end sequencing of 150 bases on Illumina
+ NovaSeq 6000 machines using S2/S4 flow cells. Sequencing and
+ genotyping was performed across nine batches (WES1 through
+ WES9) at the Regeneron Genetics Center (RGC) and integrated
+ together for this data release. All sequencing batches were
+ processed using the same DNA extraction methods and sequencing
+ machines, however two different exome capture panels were used,
+ as described below. Genotyping was performed using a SNP
+ genotyping array for WES1 through WES4 and using
+ “genotyping-by-sequencing” (GxS) for WES5 through WES9. The
+ first four sequencing batches were sequenced at Regeneron using
+ custom NEB/Kapa reagents with the IDT (Integrated DNA
+ Technologies) xGen capture platform, including custom exome
+ capture regions. Samples starting with batch WES5 were
+ sequenced using the Twist Bioscience Human
+ Comprehensive Exome panel, combined with spike-ins for
+ sequencing genotyping sites (see Genotyping Methods), the full
+ mitochondrial genome, and coverage boosted at selected sites
+ for assaying clonal hematopoiesis of indeterminate potential
+ (CHIP). SFARI performed SNV/indel calling via DeepVariant and
+ GATK to generate gVCFs, pairwise relatedness inferred using
+ PLINK v1.9 IBD estimates from common SNPs (AF ≥ 0.01, dbSNP
+ v151) with ≥15% relatedness flagged, and comprehensive
+ individual- and family-level quality control executed using the
+ internal GenomeCheckMate pipeline to exclude samples based on
+ contamination (≥5%), insufficient coverage (<20× in <80% of
+ targets), sex discordance, pedigree/IBD inconsistencies,
+ unregistered relationships, unexpected duplicates, or excess
+ relatedness, after which QC-passing individuals (selecting the
+ most recent passing sample per person) were retained for
+ variant calling and joint genotyping.
+
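The UCSC processing step described above for SFARI SPARK (anonymizing the pVCFs with bcftools and the fill-tags plugin, then normalizing) could look roughly like the sketch below. The actual UCSC script is not shown in this change, so everything here is an assumption: that "anonymized" means computing allele counts and frequencies and then dropping per-sample genotypes, that "normalized" means `bcftools norm` with multiallelic splitting, and all file names. The code only builds the command line and does not run bcftools.

```python
import shlex


def build_anonymize_pipeline(in_vcf, out_vcf, ref_fasta):
    """Build a three-stage bcftools pipeline as argument lists.

    Assumed stages (not confirmed by the source):
      1. +fill-tags: compute AN, AC and AF from the genotypes
      2. view --drop-genotypes: remove all per-sample columns
      3. norm: split multiallelic sites and left-align against the reference
    No allele frequency cutoff is applied, matching the text above.
    """
    fill = ["bcftools", "+fill-tags", in_vcf, "--", "-t", "AN,AC,AF"]
    drop = ["bcftools", "view", "--drop-genotypes", "-"]
    norm = ["bcftools", "norm", "-m", "-any", "-f", ref_fasta,
            "-Oz", "-o", out_vcf, "-"]
    return [fill, drop, norm]


def as_shell(commands):
    """Join the stages into one shell pipeline string, quoting each argument."""
    return " | ".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(as_shell(build_anonymize_pipeline("spark.pvcf.vcf.gz",
                                            "spark.sites.vcf.gz",
                                            "hg38.fa")))
```

Frequencies are filled in before the genotypes are dropped because fill-tags derives AN, AC and AF from the genotype columns.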
Finngen: R12 annotated variants were downloaded from the Google Cloud bucket link received through an email after filling out the form linked from https://finngen.gitbook.io/documentation/data-download and converted to VCF with a custom Python script.
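Several tracks here (Finngen, KOVA) were built by converting a frequency TSV to a sites-only VCF. The custom script itself is not part of this change, so the following is only a minimal sketch of that kind of conversion. The column names `chrom`, `pos`, `ref`, `alt` and `af` are hypothetical and do not reflect the actual Finngen or KOVA file layouts.

```python
import csv
import io

# Minimal sites-only VCF header with a single AF INFO field.
VCF_HEADER = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]


def tsv_to_vcf_lines(tsv_text):
    """Convert TSV text with (hypothetical) columns chrom, pos, ref, alt, af
    into a list of VCF lines. Rows are emitted in input order, so the input
    must already be sorted by position if the output is to be indexed."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = list(VCF_HEADER)
    for row in reader:
        fields = [row["chrom"], row["pos"], ".", row["ref"], row["alt"],
                  ".", ".", "AF=" + row["af"]]
        lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    example = "chrom\tpos\tref\talt\taf\nchr1\t12345\tA\tG\t0.0123\n"
    print("\n".join(tsv_to_vcf_lines(example)))
```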
SweGen: Fragment size 350bp on a Covaris E220. Paired-end sequencing with 150 bp read length was performed on Illumina HiSeq X (HiSeq Control Software 3.3.39/RTA 2.7.1) with v2.5 sequencing chemistry. Raw whole-genome reads were aligned to the GRCh37 reference using BWA-MEM v0.7.12, then sorted and indexed with samtools v0.1.19 and assessed with qualimap v2.2.20; per-sample alignments from multiple lanes and flow cells were merged using Picard MergeSamFiles v1.120. Processing followed GATK best practices with GATK v3.3, including indel realignment (RealignerTargetCreator, IndelRealigner), duplicate marking (Picard MarkDuplicates v1.120), and base quality score recalibration (BaseRecalibrator), producing one finalized BAM per sample. Per-sample gVCFs were generated with GATK HaplotypeCaller v3.3 using reference files from the GATK v2.8 resource bundle, with all steps coordinated via Piper v1.4.0. Joint genotyping of 1,000 samples was performed by merging gVCFs in five batches of 200 using GATK CombineGVCFs, followed by cohort genotyping with GATK GenotypeGVCFs and variant quality score recalibration for SNVs and indels using VariantRecalibrator and ApplyRecalibration.
At UCSC, the hg38 VCF was downloaded
from SweFreq.
NPM Singapore: Whole Genome Sequencing (WGS) data processing followed GATK4 best practices. The GATK4 germline variant analysis workflow, written in WDL, was adapted to use Nextflow and deployed at the National Supercomputing Centre, Singapore (NSCC). In short, WGS reads were aligned against GRCh38 using the BWA-MEM algorithm and used as input to GATK HaplotypeCaller to produce single-sample gVCFs. The gVCF files were joint-called, then loaded in Hail, an open-source Python-based data analysis library suited to work with population-scale genomic data collections. Low-quality WGS libraries and low-quality variants were removed. QC-ed variants were functionally annotated using Ensembl Variant Effect Predictor (VEP) (version 95). Functional annotations for variants impacting protein-coding genes were also complemented with information on the potential alteration to their cognate protein's 3D structure and drug binding ability.
-Saudi Genome Program: Data was downloaded
+Saudi Genome Program: Data were downloaded
from Figshare, and converted to VCF.
MXB: We thank the Center for Research and Advanced Studies (Cinvestav) of Mexico for generating and providing the frequency data, the National Institute of Medical Sciences and Nutrition (INCMNSZ) for DNA extraction, and the Ministry of Health together with the National Institute of Public Health (INSP) for the design and implementation of the National Health Survey 2000 (ENSA 2000). We also thank the ENSA-Genomics Consortium for their contributions to sample collection and data processing that made possible the construction of the MXB genomic resource.
@@ -486,28 +579,35 @@ Nat Genet. 2023 Feb;55(2):178-186. PMID: 36658435
Malomane DK, Williams MP, Huber CD, Mangul S, Abedalthagafi M, Chiang CWK. Patterns of population structure and genetic variation within the Saudi Arabian population. bioRxiv. 2025 Jan 13;. PMID: 39868174; PMC: PMC11761371
- -Ameur A, Dahlberg J, Olason P, Vezzi F, Karlsson R, Martin M, Viklund J, Kähäri AK, Lundin P, Che H et al. SweGen: a whole-genome data resource of genetic variability in a cross-section of the Swedish population. Eur J Hum Genet. 2017 Nov;25(11):1253-1260. PMID: 28832569; PMC: PMC5765326
++SPARK Consortium. Electronic address: pfeliciano@simonsfoundation.org, SPARK Consortium. + +SPARK: A US Cohort of 50,000 Families to Accelerate Autism Research. +Neuron. 2018 Feb 7;97(3):488-493. +PMID: 29420931; PMC: PMC7444276 +
+