85a3ec13e80a0e61f16e691afb878956e0483892 max Fri Nov 28 08:53:18 2025 -0800 adding Finnland to var freqs track, refs #36642 diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html index 053f7ef112c..8dc145fd00a 100644 --- src/hg/makeDb/trackDb/human/varFreqs.html +++ src/hg/makeDb/trackDb/human/varFreqs.html @@ -57,33 +57,38 @@ <li> <b><a href="https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/" target="_blank">ALFA</a></b>: The NCBI ALlele Frequency Aggregator pipeline computes allele frequencies from approved, unrestricted dbGaP studies and makes them publicly available through dbSNP. Its goal is to release frequency data from over one million dbGaP subjects to aid discoveries involving common and rare variants with biological or disease relevance. The R4 release includes 408,709 subjects and allele frequencies for 15.5 million rs sites, including nearly one million ClinVar variants. Genotype and associated individual-level data are accessible through dbGaP <a href="https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login" target="_blank">authorized access</a>. </li> <li> - <b><a href="https://jmorp.megabank.tohoku.ac.jp/downloads" - target="_blank">JPN To61k Japan Tohoku University Tohoku Medical Megabank Organization - 61k Allele frequency panel (JPN 61k)</a></b>: + <b><a href="https://www.finngen.fi/en" target="_blank">FinnGen</a></b>: + Imputed variants from 500,348 biobank samples obtained using genotyping arrays + in Finland, 10% of the population. The imputation used phased variants obtained from 8,554 + high-quality whole genome sequences, also from Finland. For details, see (Kurki et al, Nature 2023). + Phenotype associations can be browsed at <a href="https://r12.finngen.fi/">FinnGen PheWeb</a>. 
+ </li> + <li> + <b><a href="https://jmorp.megabank.tohoku.ac.jp/downloads" target="_blank">JPN To61k Japan Tohoku University Tohoku Medical Megabank Organization 61k Allele frequency panel (JPN 61k)</a></b>: An allele frequency panel based on short-read WGS analysis of 61,000 Japanese individuals. The project includes other datatypes, such as STRs, long-read SVs and short-read CNVs. Data can be downloaded from the <a href="https://jmorp.megabank.tohoku.ac.jp" target="_blank">jMorp Website</a>, specifically the <a href="https://jmorp.megabank.tohoku.ac.jp/downloads" target="_blank">Downloads</a> section. For details, see (Tadaka et al, NAR 2023). </li> <li> <b><a href="https://abraom.ib.usp.br/" target="_blank">Brazil Arquivo Brasileiro Online de Mutações (ABraOM)</a></b>: Genomic variants obtained with whole-genome sequencing from SABE, a census-based sample of elderly individuals from São Paulo, Brazil's largest city. The Brazilian population is the product of ~500 years of admixture between Africans, Europeans, and Native Americans. @@ -99,42 +104,39 @@ "IndiGen" program. Data can be downloaded from the <a href="https://clingen.igib.res.in/indigen/" target="_blank">IndiGen Website</a>. For details see (Jain et al, NAR 2020). Only the allele frequency is available from this project. The website also provides SV call and Alu insertion VCFs. </li> <li> <b><a href="https://www.kobic.re.kr/kova/" target="_blank">Korean Variant Archive (KOVA)</a></b>: 1,896 whole genome sequencing and 3,409 whole exome sequencing data from healthy individuals of Korean ethnicity. Most of the samples originated from normal tissue of cancer patients (40.16 %), healthy parents of rare disease patients (28.4 %), or healthy volunteers (31.44 %). Japanese ancestry is broken down - in the INFO field. - TSV data can be requested on the <a href="https://www.kobic.re.kr/kova/downloads" - target="_blank">KOVA Downloads</a> website. Coverage 100x for WES, 30x for WGS. 
- For details see (Lee et al, Exp Mol Med 2022). - </li> + in the INFO field. Coverage 100x for WES, 30x for WGS. + For details see (Lee et al, Exp Mol Med 2022).</li> <li> - <b><a href="" + <b><a href="https://www.npm.sg/" target="_blank">NPM Singapore</a></b>: 9,770 whole genomes, mostly of Chinese, Indian and Malay ancestry. - VCF access can be requested on the <a href="https://chorus.grids-platform.io/" - target="_blank">Chorus Browser</a> website, which requires an account and access request. - For details see (Wong et al, Nat Genetics 2023). + A minimum allele count cutoff of > 5 was applied. + Data is available for download from the CHORUS browser, see "Data access" below. + For details see (Wong et al, Nat Genetics 2023). CNV data is also available there. </li> </ul> </p> <h2>Display Conventions</h2> <p>Most tracks only show the variant and allele frequencies on mouseover or clicks. When zoomed in, tracks display alleles with base-specific coloring. Homozygote data are shown as one letter, while heterozygotes will be displayed with both letters. </p> <p> For <b>NCBI ALFA:</b> This track has no single VCF with INFO fields, but uses multiple subtracks instead, one per ancestry. @@ -165,53 +167,71 @@ <b>Regeneron one million exomes:</b> VCFs with summarized allele frequencies are available from the <a target="_blank" href="https://rgc-research.regeneron.com/me/resources">RGC ME website</a>. </p> <p> <b>TOPMED:</b> VCFs with summarized allele frequencies are available from the <a target="_blank" href="https://bravo.sph.umich.edu/">TOPMED BRAVO website</a>. They require a login. </p> <p> <b>GenomeAsia Pilot:</b> VCFs are available from UCSC and also from the <a target="_blank" href="https://browser.genomeasia100k.org/#tid=download">GenomeAsia 100K website</a>. No license nor login. </p> +<p><b>KOVA:</b> + TSV data can be requested on the <a href="https://www.kobic.re.kr/kova/downloads" + target="_blank">KOVA Downloads</a> website. 
+</p> + +<p><b>FinnGen:</b> TSV data can be requested via the form at https://finngen.gitbook.io/documentation/data-download, which triggers an email with the download link.</p> + +<p><b>NPM:</b> + VCF access can be requested on the + <a href="https://chorus.grids-platform.io/" target="_blank">Chorus Browser</a> website, which requires an + <a href="https://npm.a-star.edu.sg/" target="_blank">account and data access request</a>. +</p> + <h2>Methods</h2> <p> <b>MXB:</b> Genotyping was performed with the Illumina Multi-Ethnic Global Array (MEGA, ~1.8M SNPs), optimized for admixed populations and enriched for ancestry-informative and medically relevant variants. Only autosomal, biallelic SNPs passing quality control are included. Samples were selected from 898 recruitment sites, with prioritization of indigenous language speakers. Data processing included GenomeStudio → PLINK conversion, strand alignment, removal of duplicates and low-quality variants/individuals, update of map positions using dbSNP Build 151, and relatedness filtering. </p> <p> <b>SGDP:</b> The version used was <a target="_blank" href="https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/" >https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/</a>, merged with bcftools and lifted to hg38 with CrossMap. </p> <p> <b>KOVA:</b> V7 of the TSV.gz was obtained from the KOVA staff and converted to VCF. It is not available for download from our site but can be requested from the KOVA website. </p> -<p><b>Finngen:</b> R12 was downloaded from https://finngen.gitbook.io/documentation/data-download and converted to VCF with a Python script. 
</p> +<p><b>FinnGen:</b> R12 annotated variants were downloaded from the Google Cloud +bucket link received through an email after filling out the form linked from +https://finngen.gitbook.io/documentation/data-download and converted to VCF +with a <a +href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/finngen_to_vcf.py" +target="_blank">custom Python script</a>. </p> <p><b>NPM Singapore:</b> Whole Genome Sequencing (WGS) data processing followed GATK4 best practices. The GATK4 germline variant analysis workflow written in WDL was adapted to use Nextflow and deployed at the National Supercomputing Centre, Singapore (NSCC). In short, WGS reads were aligned against GRCh38 using the BWA-MEM algorithm and used as input to GATK HaplotypeCaller to produce single sample gVCFs. The gVCF files were joint-called and then loaded into Hail, an open-source Python-based data analysis library suited to work with population-scale genomic data collections. Low-quality WGS libraries and low-quality variants were removed. QC-ed variants were functionally annotated using Ensembl Variant Effect Predictor (VEP) (version 95). Functional annotations for variants affecting protein-coding regions were also complemented with information on the potential alteration to their cognate protein's 3D structure and drug binding ability. </p>
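The Methods text above says that the FinnGen and KOVA tables were converted from TSV to VCF. The following is only a minimal sketch of what such a conversion might look like, not the script linked from the page. The function name `tsv_to_vcf`, the column names `chrom`, `pos`, `ref`, `alt`, `af`, `ac`, and the `min_ac` parameter are all hypothetical; the real input files use their own headers. The optional allele count filter merely illustrates the kind of cutoff mentioned for the NPM data.

```python
import csv
import io


def tsv_to_vcf(tsv_text, min_ac=0):
    """Convert a tab-separated variant table to minimal VCF text.

    Hypothetical input columns: chrom, pos, ref, alt, af, ac.
    Rows are written in input order, so the input is assumed to be
    sorted by chromosome and position already.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = [
        "##fileformat=VCFv4.2",
        '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
        '##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for row in reader:
        ac = int(row["ac"])
        if ac < min_ac:
            # Skip variants below the (assumed) allele count cutoff.
            continue
        info = f"AF={float(row['af']):g};AC={ac}"
        # ID, QUAL and FILTER are unknown here, so they are set to ".".
        lines.append(
            "\t".join(
                [row["chrom"], row["pos"], ".", row["ref"], row["alt"], ".", ".", info]
            )
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = (
        "chrom\tpos\tref\talt\taf\tac\n"
        "chr1\t100\tA\tG\t0.25\t10\n"
        "chr1\t200\tC\tT\t0.001\t2\n"
    )
    print(tsv_to_vcf(example, min_ac=5), end="")
```

A real conversion would also need `##contig` header lines and bgzip/tabix indexing before the file can be loaded as a track; those steps are omitted here.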