150d6eac0d7fa08368b11b25ad8f6b4e84143243 max Tue Dec 16 15:13:55 2025 -0800 more docs for var freqs track diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html index f47b5d89e6b..177933905e0 100644 --- src/hg/makeDb/trackDb/human/varFreqs.html +++ src/hg/makeDb/trackDb/human/varFreqs.html @@ -63,30 +63,43 @@ subjects to aid discoveries involving common and rare variants with biological or disease relevance. The R4 release includes 408,709 subjects and allele frequencies for 15.5 million rs sites, including nearly one million ClinVar variants. Genotype and associated individual-level data are accessible through dbGaP <a href="https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login" target="_blank">authorized access</a>. </li> <li> <b><a href="https://www.finngen.fi/en" target="_blank">FinnGen</a></b>: Imputed variants from 500,348 Biobank samples obtained using genotyping arrays in Finland, 10% of the population. The imputation used phased variants obtained from 8,554 high-quality whole genome sequences, also from Finland. For details, see (Kurki et al, Nature 2023). Phenotype links can be shown at <a href="https://r12.finngen.fi/">FinnGen PheWeb</a>. </li> + + <li> + <b><a href="https://swefreq.nbis.se/dataset/SweGen" target="_blank">SweGen</a></b>: + Whole-genome sequencing variant frequencies for 1000 Swedish individuals generated within the SweGen project. + The 1000 individuals included in the SweGen project represent a + cross-section of the Swedish population, and no disease information + was used for their selection. The frequency data may therefore + include genetic variants that are associated with, or causative of, + disease. SweGen also provides SV calls, TEs, MELT results for TEs, HLAs and new sequence. + For details, see (Ameur et al, Eur J Hum Genet 2017). + The dataset can be browsed at the <a href="https://swefreq.nbis.se/dataset/SweGen/browser">SweGen Browser</a>. 
+ </li> + <li> <b><a href="https://jmorp.megabank.tohoku.ac.jp/downloads" target="_blank">JPN To61k Japan Tohoku University Tohoku Medical Megabank Organization 61k Allele frequency panel (JPN 61k)</a></b>: An allele frequency panel based on short-read WGS analysis of 61,000 Japanese individuals. The project includes other datatypes, such as STRs, long-read SVs and short-read CNVs. Data can be downloaded from the <a href="https://jmorp.megabank.tohoku.ac.jp" target="_blank">jMorp Website</a>, specifically the <a href="https://jmorp.megabank.tohoku.ac.jp/downloads" target="_blank">Downloads</a> section. For details, see (Tadaka et al, NAR 2023). </li> <li> <b><a href="https://abraom.ib.usp.br/" target="_blank">Brazil Arquivo Brasileiro Online de Mutações (ABraOM)</a></b>: Genomic variants obtained with whole-genome sequencing from SABE, a census-based sample of elderly individuals from São Paulo, Brazil's @@ -181,30 +194,32 @@ </p> <p> <b>GenomeAsia Pilot:</b> VCFs are available from UCSC and also from the <a target="_blank" href="https://browser.genomeasia100k.org/#tid=download">GenomeAsia 100K website</a>. No license or login is required. </p> <p><b>KOVA:</b> TSV data can be requested on the <a href="https://www.kobic.re.kr/kova/downloads" target="_blank">KOVA Downloads</a> website. </p> <p><b>Finngen:</b> TSV data can be requested via the form at https://finngen.gitbook.io/documentation/data-download which triggers an email with the download link.</p> +<p><b>SweGen:</b> We may redistribute the VCF on the condition that the file terms_of_use.txt is distributed with it. You can find that file <a target=_blank href="https://hgdownload.soe.ucsc.edu/gbdb/hg38/varFreqs/swegen">on our download server</a>, alongside the VCF file. 
</p> + <p><b>NPM:</b> VCF access can be requested on the <a href="https://chorus.grids-platform.io/" target="_blank">Chorus Browser</a> website, which requires an <a href = "https://npm.a-star.edu.sg/" target=_blank>account and data access request</a>. </p> <h2>Methods</h2> <p> <b>MXB:</b> Genotyping was performed with the Illumina Multi-Ethnic Global Array (MEGA, ~1.8M SNPs), optimized for admixed populations and enriched for ancestry-informative and medically relevant variants. Only autosomal, biallelic SNPs passing quality control are included. Samples were selected from 898 recruitment sites, with prioritization of indigenous language speakers. Data processing included GenomeStudio → PLINK conversion, strand alignment, removal of duplicates, update of map positions using dbSNP Build 151 and low-quality @@ -216,30 +231,35 @@ >https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/</a>, merged with bcftools and lifted to hg38 with CrossMap. </p> <p> <b>KOVA:</b> V7 of the TSV.gz was obtained from the KOVA staff and converted to VCF. It is not available for download from our site but can be requested from the KOVA website. </p> <p><b>Finngen:</b> R12 annotated variants were downloaded from the Google Cloud bucket link received through an email after filling out the form linked from https://finngen.gitbook.io/documentation/data-download and converted to VCF with a <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/finngen_to_vcf.py" target=_blank>custom Python script</a>. </p> +<p><b>SweGen:</b> DNA was fragmented to a size of 350 bp on a Covaris E220. Paired-end sequencing with 150 bp read length was performed on Illumina HiSeq X (HiSeq Control Software 3.3.39/RTA 2.7.1) with v2.5 sequencing chemistry. 
Raw whole-genome reads were aligned to the GRCh37 reference using BWA-MEM v0.7.12, then sorted and indexed with samtools v0.1.19 and assessed with qualimap v2.2.20; per-sample alignments from multiple lanes and flow cells were merged using Picard MergeSamFiles v1.120. Processing followed GATK best practices with GATK v3.3, including indel realignment (RealignerTargetCreator, IndelRealigner), duplicate marking (Picard MarkDuplicates v1.120), and base quality score recalibration (BaseRecalibrator), producing one finalized BAM per sample. Per-sample gVCFs were generated with GATK HaplotypeCaller v3.3 using reference files from the GATK v2.8 resource bundle, with all steps coordinated via Piper v1.4.0. Joint genotyping of 1,000 samples was performed by merging gVCFs in five batches of 200 using GATK CombineGVCFs, followed by cohort genotyping with GATK GenotypeGVCFs and variant quality score recalibration for SNVs and indels using VariantRecalibrator and ApplyRecalibration. +<BR>At UCSC, the hg38 VCF was downloaded +from <a target=_blank href="https://swefreq.nbis.se/dataset/SweGen/download">SweFreq</a>. +</p> + <p><b>NPM Singapore:</b> Whole Genome Sequencing (WGS) data processing followed GATK4 best practices. The GATK4 germline variant analysis workflow, written in WDL, was adapted to use Nextflow and deployed at the National Supercomputing Centre, Singapore (NSCC). In short, WGS reads were aligned against GRCh38 using the BWA-MEM algorithm and used as input to GATK HaplotypeCaller to produce single-sample gVCFs. The gVCF files were joint-called, then loaded into Hail, an open-source Python-based data analysis library suited to working with population-scale genomic data collections. Low-quality WGS libraries and low-quality variants were removed. QC-ed variants were functionally annotated using Ensembl Variant Effect Predictor (VEP) (version 95). 
Functional annotations for variants impacting protein-coding genes were also complemented with information on the potential alteration to their cognate protein's 3D structure and drug binding ability. </p> @@ -278,30 +298,39 @@ Collaborators. This research has been conducted using the UK Biobank Resource under application number 26041. </p> <p> <b>SGDP:</b> This project was funded by the Simons Foundation. Thanks to David Reich and Swapan Mallick for help with importing the data. </p> <p> <b>KOVA:</b> Thanks to Insu Jang and the KOVA director for providing variant frequencies in TSV format. </p> <p> <b>Finngen:</b> We want to acknowledge the participants and investigators of the FinnGen study. </p> +<p> +<b>SweGen:</b> The SweGen allele frequency data was generated by Science for +Life Laboratory. The data may be redistributed in original or modified form, +but must always be distributed together with the file "terms_of_use.txt" that +is stored together with the data on our download server, and any redistributed +data derived from the SweGen data set must follow those terms and conditions. +The data may not be used to attempt to identify any individual in this or other studies. +</p> + <p> <b>NPM Singapore:</b> Thanks to the NPM Data Access Committee and Eleanor for granting our data request. By browsing the data, you agree to use the data only for academic, non-commercial research to improve human health (biology/disease). We request that all data users agree to protect the confidentiality of the data subjects in any research papers or publications that they may prepare, by taking all reasonable care to limit the possibility of identification. In particular, the data users shall not use, or attempt to use, the data to deliberately compromise or otherwise infringe the confidentiality of information on data subjects and their right to privacy. 
If you use any of the data obtained from the CHORUS variant browser, we request that you cite the NPM flagship paper (Wong et al, 2023). All data users of the data must take note that the data provider and relevant SG10K_Health cohort owners bear no responsibility for the further analysis or interpretation of the data. </p> @@ -457,15 +486,28 @@ <em>Nat Genet</em>. 2023 Feb;55(2):178-186. PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/36658435" target="_blank">36658435</a> </p> <p> Malomane DK, Williams MP, Huber CD, Mangul S, Abedalthagafi M, Chiang CWK. <a href="https://doi.org/10.1101/2025.01.10.632500" target="_blank"> Patterns of population structure and genetic variation within the Saudi Arabian population</a>. <em>bioRxiv</em>. 2025 Jan 13;. PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/39868174" target="_blank">39868174</a>; PMC: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11761371/" target="_blank">PMC11761371</a> </p> + + +<p> +Ameur A, Dahlberg J, Olason P, Vezzi F, Karlsson R, Martin M, Viklund J, Kähäri AK, Lundin P, Che H +<em>et al</em>. +<a href="https://doi.org/10.1038/ejhg.2017.130" target="_blank"> +SweGen: a whole-genome data resource of genetic variability in a cross-section of the Swedish +population</a>. +<em>Eur J Hum Genet</em>. 2017 Nov;25(11):1253-1260. +PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/28832569" target="_blank">28832569</a>; PMC: <a +href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5765326/" target="_blank">PMC5765326</a> +</p> +