85a3ec13e80a0e61f16e691afb878956e0483892
max
  Fri Nov 28 08:53:18 2025 -0800
adding Finland to var freqs track, refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index 053f7ef112c..8dc145fd00a 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -57,33 +57,38 @@
 
     <li>
         <b><a href="https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/" target="_blank">ALFA</a></b>:
         The NCBI ALlele Frequency Aggregator pipeline computes allele frequencies from
         approved, unrestricted dbGaP studies and makes them publicly available through
         dbSNP. Its goal is to release frequency data from over one million dbGaP
         subjects to aid discoveries involving common and rare variants with biological
         or disease relevance. The R4 release includes 408,709 subjects and allele
         frequencies for 15.5 million rs sites, including nearly one million ClinVar
         variants. Genotype and associated individual-level data are accessible through dbGaP
         <a href="https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login"
         target="_blank">authorized access</a>.
     </li>
 
     <li>
-        <b><a href="https://jmorp.megabank.tohoku.ac.jp/downloads"
-        target="_blank">JPN To61k Japan Tohoku University Tohoku Medical Megabank Organization
-        61k Allele frequency panel (JPN 61k)</a></b>:
+        <b><a href="https://www.finngen.fi/en" target="_blank">FinnGen</a></b>:
+        Imputed variants from 500,348 biobank samples genotyped with arrays
+        in Finland, 10% of the population. The imputation used phased variants obtained from 8,554
+        high-quality whole genome sequences, also from Finland. For details, see (Kurki et al, Nature 2023).
+        Phenotype associations can be viewed at <a href="https://r12.finngen.fi/">FinnGen PheWeb</a>.
+    </li>
+    <li>
+        <b><a href="https://jmorp.megabank.tohoku.ac.jp/downloads" target="_blank">JPN To61k Japan Tohoku University Tohoku Medical Megabank Organization 61k Allele frequency panel (JPN 61k)</a></b>:
         An allele frequency panel based on short-read WGS analysis of 61,000 Japanese individuals.
         The project includes other datatypes, such as STRs, long-read SVs and short-read CNVs.
         Data can be downloaded from the <a href="https://jmorp.megabank.tohoku.ac.jp"
         target="_blank">jMorp Website</a>, specifically the
         <a href="https://jmorp.megabank.tohoku.ac.jp/downloads" target="_blank">Downloads</a>
         section. For details, see (Tadaka et al, NAR 2023).
     </li>
 
     <li>
         <b><a href="https://abraom.ib.usp.br/"
         target="_blank">Brazil Arquivo Brasileiro Online de Muta&ccedil;&otilde; (ABraOM)</a></b>:
         Genomic variants obtained with whole-genome sequencing from SABE, a
         census-based sample of elderly individuals from S&atilde;o Paulo, Brazil's
         largest city. Brazilian population is constituted by ~500 years of
         admixture between Africans, Europeans, and Native Americans.
@@ -99,42 +104,39 @@
         &quot;IndiGen&quot; program.
         Data can be downloaded from the <a href="https://clingen.igib.res.in/indigen/"
         target="_blank">IndiGen Website</a>. For details see (Jain et al, NAR 2020). Only
         the allele frequency is available from this project. The website also provides SV call
         and Alu insertion VCFs.
     </li>
 
     <li>
         <b><a href="https://www.kobic.re.kr/kova/"
         target="_blank">Korean Variant Archive (KOVA)</a></b>:
         1,896 whole genome sequencing and 3,409 whole exome sequencing data from healthy individuals
         of Korean ethnicity.
         Most of the samples were originated from normal tissue of cancer
         patients (40.16 %), healthy parents of rare disease patients (28.4 %),
         or healthy volunteers (31.44 %). Japanese ancestry is broken down
-        in the INFO field.
-        TSV data can be requested on the <a href="https://www.kobic.re.kr/kova/downloads"
-        target="_blank">KOVA Downloads</a> website. Coverage 100x for WES, 30x for WGS.
-        For details see (Lee et al, Exp Mol Med 2022).
-    </li>
+        in the INFO field. Coverage 100x for WES, 30x for WGS.
+        For details see (Lee et al, Exp Mol Med 2022).</li>
     <li>
-        <b><a href=""
+        <b><a href="https://www.npm.sg/"
         target="_blank">NPM Singapore</a></b>:
         9,770 whole genomes, mostly of Chinese, Indian and Malay ancestry. 
-        VCF access can be requested on the <a href="https://chorus.grids-platform.io/"
-        target="_blank">Chorus Browser</a> website, which requires an account and access request. 
-        For details see (Wong et al, Nat Genetics 2023).
+        A minimum allele count cutoff of &gt; 5 was applied.
+        Data is available for download from the CHORUS browser; see "Data access" below.
+        For details see (Wong et al, Nat Genetics 2023). CNV data is also available there.
     </li>
 </ul>
 </p>
 
 <h2>Display Conventions</h2>
 
 <p>Most tracks only show the variant and allele frequencies on mouseover or clicks.
 When zoomed in, tracks display alleles with base-specific coloring. Homozygote
 data are shown as one letter, while heterozygotes will be displayed with both
 letters.
 </p>
 
 <p>
 For <b>NCBI ALFA:</b> This track has no single VCF with INFO fields, but uses multiple subtracks
 instead, one per ancestry.
@@ -165,53 +167,71 @@
 <b>Regeneron one million exomes:</b> VCFs with summarized allele frequencies are available from
 the <a target="_blank" href="https://rgc-research.regeneron.com/me/resources">RGC ME website</a>.
 </p>
 <p>
 <b>TOPMED:</b> VCFs with summarized allele frequencies are available from
 the <a target="_blank" href="https://bravo.sph.umich.edu/">TOPMED BRAVO website</a>. They require a
 login.
 </p>
 <p>
 <b>GenomeAsia Pilot:</b> VCFs are available from UCSC and also from
 the <a target="_blank"
 href="https://browser.genomeasia100k.org/#tid=download">GenomeAsia 100K website</a>.
 No license nor login.
 </p>
 
+<p><b>KOVA:</b> 
+        TSV data can be requested on the <a href="https://www.kobic.re.kr/kova/downloads"
+        target="_blank">KOVA Downloads</a> website. 
+</p>
+
+<p><b>Finngen:</b> TSV data can be requested via the form at https://finngen.gitbook.io/documentation/data-download, which triggers an email containing the download link.</p>
+
+<p><b>NPM:</b> 
+        VCF access can be requested on the 
+        <a href="https://chorus.grids-platform.io/" target="_blank">Chorus Browser</a> website, which requires an 
+        <a href="https://npm.a-star.edu.sg/" target="_blank">account and data access request</a>. 
+</p>
+
 <h2>Methods</h2>
 <p>
 <b>MXB:</b> Genotyping was performed with the Illumina Multi-Ethnic Global Array
 (MEGA, ~1.8M SNPs), optimized for admixed populations and enriched for
 ancestry-informative and medically relevant variants. Only autosomal, biallelic
 SNPs passing quality control are included. Samples were selected from 898
 recruitment sites, with prioritization of indigenous language speakers. Data
 processing included GenomeStudio &rarr; PLINK conversion, strand alignment, removal
 of duplicates, update of map positions using dbSNP Build 151 and low-quality
 variants/individuals, and relatedness filtering.
 </p>
 <p>
 <b>SGDP:</b> The version used was
 <a target="_blank" href="https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/"
 >https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/</a>,
 merged with bcftools and lifted to hg38 with CrossMap. 
 </p>
 <p>
 <b>KOVA:</b> V7 of the TSV.gz was obtained from the KOVA staff and converted to VCF. It is not
 available for download from our site but can be requested from the KOVA website.
 </p>
 
-<p><b>Finngen:</b> R12 was downloaded from https://finngen.gitbook.io/documentation/data-download and converted to VCF with a Python script. </p>
+<p><b>Finngen:</b> R12 annotated variants were downloaded from the Google Cloud
+bucket link received through an email after filling out the form linked from
+https://finngen.gitbook.io/documentation/data-download and converted to VCF
+with a <a
+href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/finngen_to_vcf.py"
+target=_blank>custom Python script</a>. </p>
 
 <p><b>NPM Singapore:</b> Whole Genome Sequencing (WGS) data processing followed
 GATK4 best practices. GATK4 germline variant analysis workflow written in WDL
 was adapted to use Nextflow and deployed at the National Supercomputing Centre,
 Singapore (NSCC). In short, WGS reads were aligned against GRCh38 using the
 BWA-MEM algorithm and used as input to GATK HaplotypeCaller to produce single
 sample gVCFs. The gVCF files were joint-called then loaded in Hail, an
 open-source python-based data analysis library suited to work with
 population-scale with genomic data collections. Low-quality WGS libraries and
 low-quality variants were removed.  QC-ed variants were functionally annotated
 using Ensembl Variant Effect Predictor (VEP) (version 95). Functional
 annotations for variant impacting protein-coding were also complemented with
 information on the potential alteration to their cognate protein's 3D structure
 and drug binding ability.
 </p>