d4e7e1a69b17ccebfc8b558f220dd0e1cc3ba1d0
max
  Mon Feb 2 06:14:27 2026 -0800
removing regeneron data after rejection of request, code review feedback, refs #36978

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index 1a73429546f..625e2c0bf0e 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -1,707 +1,712 @@
 <h2>Description</h2>
 <p>
 This container shows results from projects where the variant frequencies, aka allele frequencies,
 are publicly available. The tracks were collected from the 
-projects listed below. Projects that provide haplotype-phased genotypes/variants can be found
-elsewhere: 1000 Genomes is a separate track, and the projects HGDP, SGDP,
-HGDP+1000 Genomes and Mexico Biobank can be found in the &quot;Phased Variants&quot; track.
+projects listed below. For projects that provide haplotype-phased genotypes/variants, more detailed data can also be found
+in other tracks: 1000 Genomes is a separate track, and the projects HGDP, SGDP,
+HGDP+1000 Genomes and Mexico Biobank can be found in the &quot;Phased Variants&quot; track, which shows the linkage between variants.
 </p>
-<p>If you want us to add other projects, please contact us. We asked and were
-unable to obtain variant frequencies from the following projects: UK Biobank (request pending), All
-of Us (granted, ongoing).
+
+<p>If you want us to add other projects, please contact us. We were
+unable to obtain variant frequencies from the following projects: UK Biobank (request pending), 
+Regeneron's Million Exomes and Mexico City Studies (request rejected).
 </p>
 
 <p>
 The following projects were added:
 <ul>
     <li>
         <b><a href="https://rgc-mcps.regeneron.com/home"
         target="_blank">Mexico City Prospective Study (MCPS)</a></b>:
         9,950 whole genome sequenced individuals and 141,046 exome sequenced and genotyped
         individuals from the Mexico City Prospective Study (MCPS), a collaboration between the
         Regeneron Genetics Center, University of Oxford, Universidad Nacional Aut&oacute;noma de
         M&eacute;xico (UNAM), National Institute of Genomic Medicine in Mexico, Abbvie Inc. and
         AstraZeneca UK. For details see (Ziyatdinov A, Nature 2023), the reference section.
     </li>
 
     <li>
         <b><a href="https://rgc-research.regeneron.com/me/home"
         target="_blank">Regeneron Million Exomes Project (ME)</a></b>:
         Whole-exomes of 983,578 individuals sequenced by the Regeneron Genetics Center (RGC).
         These data span dozens of collaborations including large biobanks and
         health systems. All data were generated by the RGC on a single, harmonized
         sequencing and informatics protocol. The dataset includes individuals across
         diverse ancestral populations, encompassing outbred and founder populations and
         cohorts with high rates of consanguinity. See (Sun et al, Nature 2024) for details.
     </li>
 
     <li>
         <b><a href="https://topmed.nhlbi.nih.gov/" target="_blank">NHLBI TOPMED Freeze 10</a></b>:
         NHLBI TOPMed (Trans-Omics for Precision
         Medicine) program, launched by the U.S. National Heart, Lung, and Blood
         Institute, integrates whole-genome sequencing with molecular, clinical,
         and environmental data from large, well-phenotyped cohorts. Its goal is to
         uncover the biological mechanisms underlying heart, lung, blood, and sleep
         disorders to advance precision medicine and improve population health. Freeze
         10 contains 868,581,653 variants from 150,899 whole genomes. VCFs were
         downloaded from <a href="https://bravo.sph.umich.edu/terms.html"
         target="_blank">BRAVO</a>.
     </li>
 
     <li>
         <b><a href="https://sparkforautism.org/" target="_blank">SFARI SPARK</a></b>:
         The Simons Foundation Autism Research Initiative (SFARI) recruited
         a large cohort of families with autistic children who provided DNA
         samples and phenotypes.  54,558 families, parents and their children
         were sequenced, a total of 142,357 individuals with whole-exome (WES)
         and 12,519 with whole-genome sequencing (WGS).  The data contains
         32,559 trios and 8,895 quads (one sibling without autism), and 824
         twins. The same frequencies shown here
         are also available publicly on the <a href="https://genomes.sfari.org/" target=_blank>SFARI Genome Browser</a>. 
        See (SPARK et al, Neuron 2018) for details or the methods below on this page.
     </li>
 
     <li>
         <b><a href="https://www.genomeasia100k.org/"
         target="_blank">GenomeAsia Pilot (GAsP)</a></b>:
         Whole-genome sequencing data of 1,739 individuals from 219 population groups across Asia.
         See (GenomeAsia Consortium, Nature 2019) for details.
     </li>
 
     <li>
         <b><a href="https://www.genomeasia100k.org/"
         target="_blank">Australia MGRB</a></b>:
         The Australian Medical Genome Reference Bank collected
         whole-genome sequencing data of 4,011 healthy elderly individuals who
         lived &ge;70 years, to make sure that the dataset is depleted of damaging
         genetic variants. Age and sex summary graphs are available from
         <a href="https://sgc.garvan.org.au/initiatives/mgrb/index.html">the MGRB website</a>.
         See (Lacaze et al, Eur J Hum Genet 2019) for details.
     </li>
 
     <li>
         <b><a href="https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/" target="_blank">ALFA</a></b>:
         The NCBI ALlele Frequency Aggregator pipeline computes allele frequencies from
         approved, unrestricted dbGaP studies and makes them publicly available through
         dbSNP. Its goal is to release frequency data from over one million dbGaP
         subjects to aid discoveries involving common and rare variants with biological
         or disease relevance. The R4 release includes 408,709 subjects and allele
         frequencies for 15.5 million rs sites, including nearly one million ClinVar
         variants. We converted the NCBI track hub to VCF format; the data is freely available.
         Genotype and associated individual-level data are accessible through the dbGaP
         <a href="https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login"
         target="_blank">authorized access request</a> system.
     </li>
 
     <li>
         <b><a href="https://www.finngen.fi/en" target="_blank">FinnGen</a></b>:
         Imputed variants from 500,348 Biobank samples obtained using genotyping arrays
         in Finland, 10% of the population. The imputation used phased variants obtained from 8,554
         high-quality whole genome sequences, also from Finland. For details, see (Kurki et al,
         Nature 2023).
         Phenotype links can be shown at <a href="https://r12.finngen.fi/">FinnGen PheWeb</a>.
     </li>
 
     <li>
         <b><a href="https://swefreq.nbis.se/dataset/SweGen" target="_blank">SweGen</a></b>:
         Whole-genome sequencing variant frequencies for 1000 Swedish individuals generated within
         the SweGen project.
         The 1000 individuals included in the SweGen project represent a
         cross-section of the Swedish population, and no disease information
         was used for the selection. The frequency data may therefore
         include genetic variants that are associated with, or causative of,
         disease. SweGen also provides SV calls, TEs, MELT results for TEs, HLAs and new sequence.
         For details, see (Ameur et al, Eur J Hum Genet 2017).
         The dataset can be browsed at the
         <a href="https://swefreq.nbis.se/dataset/SweGen/browser">SweGen Browser</a>.
     </li>
 
     <li>
         <b><a href="https://jmorp.megabank.tohoku.ac.jp/downloads" target="_blank">JPN To61k Japan
         Tohoku University Tohoku Medical Megabank Organization 61k Allele frequency panel
         (JPN 61k)</a></b>:
         An allele frequency panel based on short-read WGS analysis of 61,000 Japanese individuals.
         The project includes other datatypes, such as STRs, long-read SVs and short-read CNVs.
         Data can be downloaded from the <a href="https://jmorp.megabank.tohoku.ac.jp"
         target="_blank">jMorp Website</a>, specifically the
         <a href="https://jmorp.megabank.tohoku.ac.jp/downloads" target="_blank">Downloads</a>
         section. For details, see (Tadaka et al, NAR 2023).
     </li>
 
     <li>
         <b><a href="https://abraom.ib.usp.br/"
         target="_blank">Brazil Arquivo Brasileiro Online de Muta&ccedil;&otilde;es (ABraOM)</a></b>:
         Genomic variants obtained with whole-genome sequencing from SABE, a
         census-based sample of elderly individuals from S&atilde;o Paulo, Brazil's
         largest city. The Brazilian population is constituted by ~500 years of
         admixture between Africans, Europeans, and Native Americans.
         Additionally, the cohort presents ~3% of individuals with non-admixed
         Japanese ancestry (early 20th century migration). Coverage 38.6x.  Data
         can be downloaded from the <a href="https://abraom.ib.usp.br/download/"
         target="_blank">ABraOM Website</a>. TEs, HLAs and new sequence are also available.
         For details see (Naslavsky et al, Nat Comm 2022).
     </li>
 
     <li>
         <b><a href="https://clingen.igib.res.in/indigen/" target="_blank">IndiGenomes</a></b>:
         Whole genome sequencing of 1,029 healthy Indian individuals under the pilot phase of the
         &quot;IndiGen&quot; program.
         Data can be downloaded from the <a href="https://clingen.igib.res.in/indigen/"
         target="_blank">IndiGen Website</a>. For details see (Jain et al, NAR 2020). Only
         the allele frequency is available from this project. The website also provides SV call
         and Alu insertion VCFs.
     </li>
 
     <li>
         <b><a href="https://www.kobic.re.kr/kova/"
         target="_blank">Korean Variant Archive (KOVA)</a></b>:
         1,896 whole genome sequencing and 3,409 whole exome sequencing data from healthy individuals
         of Korean ethnicity.
         Most of the samples originated from normal tissue of cancer
         patients (40.16 %), healthy parents of rare disease patients (28.4 %),
         or healthy volunteers (31.44 %). Japanese ancestry is broken down
         in the INFO field. Coverage 100x for WES, 30x for WGS. SVs called with Manta
         are also available. For details see (Lee et al, Exp Mol Med 2022).</li>
     <li>
         <b><a href="https://www.npm.sg/"
         target="_blank">NPM Singapore</a></b>:
         9,770 whole genomes, mostly of Chinese, Indian and Malay ancestry. 
         A minimum allele count cutoff of &gt; 5 was applied.
         Data is available for download from the CHORUS browser, see &quot;Data access&quot; below.
         For details see (Wong et al, Nat Genetics 2023). CNV data is also available there.
     </li>
     <li>
         <b><a href="https://www.vision2030.gov.sa/en/explore/projects/the-saudi-genome-program"
         target="_blank">Saudi Genome Program</a></b>:
         Variant frequencies from 302 whole genomes at 30x coverage, on Saudi Genome Program Samples.
         The genotyping data and imputations from 3,352 individuals do not seem to be available
         publicly. For details see (Malomane et al 2025). 
     </li>
 </ul>
 </p>
 
 <h2>Display Conventions</h2>
 
 <p>Most tracks only show the variant and allele frequencies on mouseover or clicks.
 When zoomed in, tracks display alleles with base-specific coloring. Homozygote
 data are shown as one letter, while heterozygotes will be displayed with both
 letters. All VCF files are normalized, with one single allele per annotation (no multi-allele lines).
 </p>
 
 
 <h2>Data Access</h2>
-<p>Most of the data in these tracks are not available for download from UCSC.
-Data can be browsed on our website.
-But the data can be downloaded for free from the original projects. Accessing the 
-data usually requires a click-through license or filling out an access request form on the respective websites, links are either provided above in the project description or with more details here:
+<p>Most of the data in these tracks cannot be downloaded from UCSC and can only be browsed on our website.
+However, all variant data can be downloaded for free from the original project websites. Access usually requires accepting a click-through license or filling out an access request form on the respective website, as described below:
 </p>
 
 <p>
 <b>MXB:</b> Allele frequencies by geographical state and ancestry are available via
 the <a target="_blank" href="https://morenolab.shinyapps.io/mexvar/">MexVar platform</a>.
 Raw genotype data are available under controlled access at the
 EGA (Study: EGAS00001005797; Dataset: EGAD00010002361). For the VCFs, email
 andres.moreno@cinvestav.mx.
 </p>
+
+<!--
 <p>
 <b>MCPS:</b> VCFs with summarized allele frequencies are available from
 the <a target="_blank" href="https://rgc-mcps.regeneron.com/">MCPS website</a>.
 </p>
 <p>
 <b>Regeneron one million exomes:</b> VCFs with summarized allele frequencies are available from
 the <a target="_blank" href="https://rgc-research.regeneron.com/me/resources">RGC ME website</a>.
 </p>
+-->
+
 <p>
 <b>TOPMED:</b> VCFs with summarized allele frequencies are available from
 the <a target="_blank" href="https://bravo.sph.umich.edu/">TOPMED BRAVO website</a>. They require a
 login.
 </p>
 <p>
 <b>SFARI SPARK:</b> Allele frequencies can be displayed on the
         <a href="https://genomes.sfari.org/" target=_blank>SFARI Genome Browser</a>.
         Full CRAMs and VCFs with genotypes are available from <a target="_blank"
         href="https://base.sfari.org/">SFARI Base</a>. 
 They require a data access request, which is usually reviewed quickly. More information is available
 in the <a href="https://cohorts-cdn.simonsfoundation.org/spark/researcher_packets/SPARK_SFARI_Researcher_Welcome_Packet.pdf"
 target=_blank>SPARK Welcome Packet</a>.
 </p>
 
 <p>
 <b>Australia MGRB:</b> VCF access can be requested via a form from 
 <a target="_blank" href="https://sgc.garvan.org.au/terms/mgrb/index.html">Sydney Genomics</a>.
 </p>
 
 <p>
 <b>GenomeAsia Pilot:</b> VCFs are available from UCSC and also from
 the <a target="_blank"
 href="https://browser.genomeasia100k.org/#tid=download">GenomeAsia 100K website</a>.
 No license or login is required.
 </p>
 
 <p><b>KOVA:</b> 
 TSV data can be requested on the <a href="https://www.kobic.re.kr/kova/downloads"
-        target="_blank">KOVA Downloads</a> website. 
+target="_blank">KOVA Downloads</a> website. Our GitHub repo contains a script that
+converts this format to VCF.
 </p>
 
 <p><b>Finngen:</b> TSV data can be requested via the form at
-https://finngen.gitbook.io/documentation/data-download which triggers an email with the download
-link.</p>
+<a href="https://finngen.gitbook.io/documentation/data-download" target=_blank>FinnGen</a>,
+which triggers an automated email containing the download
+link. A script in our GitHub repo converts this file to VCF (see methods below).</p>
 
-<p><b>SweGen:</b> We are not allowed to redistribute the VCF, you can request it at
-<a target="_blank" href="https://swefreq.nbis.se/dataset/SweGen">SweGen</a>, alongside the VCF file.
+<p><b>SweGen:</b> VCF files can be requested at
+<a target="_blank" href="https://swefreq.nbis.se/dataset/SweGen">SweGen</a> via a form. The request needs manual approval, which is usually quick. If there is no reply, email SweGen directly.
 </p>
 
 <p><b>NPM:</b> 
-   VCF access can be requested on the 
-   <a href="https://chorus.grids-platform.io/" target="_blank">Chorus Browser</a> website, which
-   requires an <a href = "https://npm.a-star.edu.sg/" target=_blank>account and data access
-   request</a>. 
+VCF download can be requested on the <a href="https://chorus.grids-platform.io/" target="_blank">Chorus Browser</a> website, which requires an <a href = "https://npm.a-star.edu.sg/" target=_blank>account and data access request</a>. 
 </p>
 
 <h2>Methods</h2>
+<p>The following are quotes from the respective papers and/or websites of the datasets:</p>
+
 <p>
 <b>MXB:</b> Genotyping was performed with the Illumina Multi-Ethnic Global Array
 (MEGA, ~1.8M SNPs), optimized for admixed populations and enriched for
 ancestry-informative and medically relevant variants. Only autosomal, biallelic
 SNPs passing quality control are included. Samples were selected from 898
 recruitment sites, with prioritization of indigenous language speakers. Data
 processing included GenomeStudio &rarr; PLINK conversion, strand alignment, removal
 of duplicates and low-quality variants/individuals, update of map positions using
 dbSNP Build 151, and relatedness filtering.
 </p>
 <p>
 <b>SGDP:</b> The version used was
 <a target="_blank" href="https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/"
 >https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/</a>,
 merged with bcftools and lifted to hg38 with CrossMap. 
 </p>
 <p>
 <b>KOVA:</b> Raw reads were aligned to the GRCh38+decoy reference using BWA-MEM v0.7.17 with default
 parameters, followed by duplicate marking and coordinate sorting with MarkDuplicatesSpark, and base
 quality score recalibration using BQSRPipelineSpark in GATK v4.1.3.0; mapping quality control
 metrics were generated with Qualimap v2.2.1. Single-nucleotide variants and small
 insertions/deletions were called per sample using GATK HaplotypeCaller in GVCF mode (-ERC GVCF), and
 joint genotyping was performed by creating a GenomicsDB with GenomicsDBImport and following GATK
 Best Practices, including variant quality score recalibration (VQSR) retaining 99.7% of true SNVs
 and 99.0% of true indels based on training sets (workflow detailed in Supplementary Fig. 1).
 Downstream analyses followed a modified version of the gnomAD quality-control framework and were
 primarily conducted using Hail, an open-source Python library for large-scale genome analysis; after
 merging WES and WGS data in Hail, multiallelic variants and variants with genotype quality &lt;20,
 read depth &lt;10, allelic balance &lt;0.2, or overlapping low-complexity regions were excluded
 (Supplementary Fig. 2).
 <br>
 At UCSC, V7 of the TSV.gz was obtained from the KOVA staff by email and converted to VCF. It is not
 available for download from our site but can be requested from the KOVA website.
 </p>
 
 <p>
 <b>ABraOM:</b> For Academic use only. Licensing for commercial use might be available under request and agreement.
 By using this resource you agree to cite the flagship paper (Naslavsky et al. Nat Comm 2022).
 Whole-genome sequencing was performed at Human Longevity Inc. using TruSeq Nano DNA HT libraries
 sequenced on Illumina HiSeqX instruments with 150 bp paired-end reads targeting 30x coverage, and
 reads were mapped to GRCh38 using ISIS software. Sample sex was validated by comparing CPMs of X
 chromosome and male-specific Y (MSY) reads relative to autosomes, yielding the expected female
 (~55,000 X CPM, &lt;200 MSY CPM) and male (~27,500 X CPM, &gt;550 MSY CPM) patterns. Germline SNVs
 and indels were called following GATK Best Practices (GATK v3.7) via per-sample GVCFs
 (HaplotypeCaller), joint genotyping (CombineGVCFs, GenotypeGVCFs), and Variant Quality Score
 Recalibration (VQSR-AS); multiallelic variants were split with an in-house script, left-aligned with
 BCFtools, and annotated using Annovar and custom scripts against dbSNP, 1000 Genomes, and gnomAD,
 with putative loss-of-function variants identified using LOFTEE v0.3-beta irrespective of confidence
 labels. Variant and genotype quality was further assessed using the in-house CEGH-Filter two-step
 algorithm based on depth and allele balance, and analyses retained only GATK VQSR-AS PASS variants
 and higher-confidence CEGH-Filter calls. Relatedness was assessed using KING and PC-Relate
 (GENESIS), retaining a single proband per related pair and excluding one contaminated sample
 (&gt;3% by verifyBAMID), resulting in a final dataset of 1,171 unrelated individuals. Final samples
 achieved mean coverages ranging from 31.3x to 64.8x, with an average of 38.65x and a median of
 36.6x.
 </p>
 
 <p><b>SFARI SPARK:</b> The genome browser track project was approved by the Simons 
-Foundation as 14584.1. WES and WGS Data were downloaded from 
+Foundation under request number 14584.1. WES and WGS Data were downloaded from 
 <a href="https://base.sfari.org/" target="_blank">SFARI Base</a>.
-pVCFs were downloaded, anonymized with a script using bcftools and the fill-tags plugin and
-normalized, without a minimum allele frequency cutoff.<br>
+pVCFs were downloaded, anonymized with a script using bcftools and its "fill-tags" plugin and
+normalized. There was no minimum allele frequency cutoff.<br>
 The methods are documented as follows by SFARI:<br>
 <ul>
   <li>
     <b>WGS</b>:
     This release consists of sequence and variant call data for 12,519
     unique individuals, of which 12,517 (99.98%) have available genome-wide
     SNP genotype data. Sequencing and genotyping of all samples in this
     release was performed at New York Genome Center (NYGC). DNA from saliva
     samples was extracted and prepared with PCR-free methods and sequenced
     with paired-end sequencing of 150 bases on the Illumina NovaSeq 6000
     system.  Alignment of reads to the human reference genome version
     GRCh38, duplicate read marking, and Base Quality Score Recalibration
     (BQSR) were performed by New York Genome Center (NYGC). Whole-genome
     sequencing data were processed using a standardized, functionally
     equivalent CCDG pipeline with alignment to the GRCh38DH (1000 Genomes)
     reference using BWA-MEM v0.7.15 (deterministic settings, no -M, use of
     .alt contigs), Picard-equivalent duplicate marking (Picard &ge;2.4.1 or
     equivalent), no indel realignment, and base quality score recalibration
     with GATK (dbSNP138, Mills and 1000G gold-standard indels, known
     indels).  Final outputs were stored as lossless CRAM files with
     complete SAM-compliant read-group annotations and mandatory 4-bin
     base-quality compression (Q2&mdash;6, 10, 20, 30), and all implementations
     were validated for functional equivalence across centers before use.
     Variant Calling was performed using DeepVariant. See
     <a href="https://github.com/CCDG/Pipeline-Standardization/blob/master/PipelineStandard.md"
     target="_blank">CCDG pipeline details</a>.<br>
   </li>
   <li>
     <b>WES</b>: This release contains
     sequence data for 142,357 individuals and genotyping data for
     141,368 individuals. DNA was sequenced from saliva for all
     samples and all participants consented to having their genetic
     data shared by Regeneron. Exomes for all samples were sequenced with
     short-read, paired-end sequencing of 150 bases on Illumina
     NovaSeq 6000 machines using S2/S4 flow cells. Sequencing and
     genotyping was performed across nine batches (WES1 through
     WES9) at the Regeneron Genetics Center (RGC) and integrated
     together for this data release. All sequencing batches were
     processed using the same DNA extraction methods and sequencing
     machines, however two different exome capture panels were used,
     as described below. Genotyping was performed using a SNP
     genotyping array for WES1 through WES4 and using
     &quot;genotyping-by-sequencing&quot; (GxS) for WES5 through WES9.  The
     first four sequencing batches were sequenced at Regeneron using
     custom NEB/Kapa reagents with the IDT (Integrated DNA
     Technologies) xGen capture platform, including custom exome
     capture regions. Samples starting with batch WES5 were
     sequenced using the Twist Bioscience Human
     Comprehensive Exome panel, combined with spike-ins for
     sequencing genotyping sites (see Genotyping Methods), the full
     mitochondrial genome, and coverage boosted at selected sites
     for assaying clonal hematopoiesis of indeterminate potential
     (CHIP).  SFARI performed SNV/indel calling via DeepVariant and
     GATK to generate gVCFs, pairwise relatedness inferred using
     PLINK v1.9 IBD estimates from common SNPs (AF &ge; 0.01, dbSNP
     v151) with &ge;15% relatedness flagged, and comprehensive
     individual- and family-level quality control executed using the
     internal GenomeCheckMate pipeline to exclude samples based on
     contamination (&ge;5%), insufficient coverage (&lt;20x in &lt;80% of
     targets), sex discordance, pedigree/IBD inconsistencies,
     unregistered relationships, unexpected duplicates, or excess
     relatedness, after which QC-passing individuals (selecting the
     most recent passing sample per person) were retained for
     variant calling and joint genotyping.
     </p></li>
 </ul>
 
 <p><b>Finngen:</b> R12 annotated variants were downloaded from the Google Cloud
-bucket link received though an email after filling out the form linked from
-https://finngen.gitbook.io/documentation/data-download and converted to VCF
+bucket link received through an email and
+converted to VCF
 with a <a
 href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/finngen_to_vcf.py"
 target="_blank">custom Python script</a>. </p>
 
 <p><b>SweGen:</b> Fragment size 350bp on a Covaris E220. Paired-end sequencing with 150bp read
 length was performed on Illumina HiSeq X (HiSeq Control Software 3.3.39/RTA 2.7.1) with v2.5
 sequencing chemistry. Raw whole-genome reads were aligned to the GRCh37 reference using BWA-MEM
 v0.7.12, then sorted and indexed with samtools v0.1.19 and assessed with qualimap v2.2.20;
 per-sample alignments from multiple lanes and flow cells were merged using Picard MergeSamFiles
 v1.120. Processing followed GATK best practices with GATK v3.3, including indel realignment
 (RealignerTargetCreator, IndelRealigner), duplicate marking (Picard MarkDuplicates v1.120), and
 base quality score recalibration (BaseRecalibrator), producing one finalized BAM per sample.
 Per-sample gVCFs were generated with GATK HaplotypeCaller v3.3 using reference files from the GATK
 v2.8 resource bundle, with all steps coordinated via Piper v1.4.0. Joint genotyping of 1,000 samples
 was performed by merging gVCFs in five batches of 200 using GATK CombineGVCFs, followed by cohort
 genotyping with GATK GenotypeGVCFs and variant quality score recalibration for SNVs and indels using
 VariantRecalibrator and ApplyRecalibration.
 <BR>At UCSC, the hg38 VCF was downloaded 
 from <a target="_blank" href="https://swefreq.nbis.se/dataset/SweGen/download">SweFreq</a>.
 </p>
 
 <p><b>Australia MGRB:</b> MGRB samples underwent whole-genome sequencing on
 Illumina HiSeq X instruments at KCCG under ISO 15189 accreditation, using
 paired-end TruSeq DNA Nano libraries sequenced one lane per sample. Reads were
 aligned to human reference genome Build 37 (GRCh37) and processed following
 GATK best practices, including indel realignment and base quality score
 recalibration, with variant calling performed using GATK HaplotypeCaller to
 generate g.vcf files. Data processing utilized the Genome One Discovery
 pipeline and analysis was conducted using the Hail framework.
 </p>
 
 <p><b>NPM Singapore:</b> Whole Genome Sequencing (WGS) data processing followed
 GATK4 best practices. GATK4 germline variant analysis workflow written in WDL
 was adapted to use Nextflow and deployed at the National Supercomputing Centre,
 Singapore (NSCC). In short, WGS reads were aligned against GRCh38 using the
 BWA-MEM algorithm and used as input to GATK HaplotypeCaller to produce single
 sample gVCFs. The gVCF files were joint-called then loaded in Hail, an
 open-source Python-based data analysis library suited to work with
 population-scale genomic data collections. Low-quality WGS libraries and
 low-quality variants were removed.  QC-ed variants were functionally annotated
 using Ensembl Variant Effect Predictor (VEP) (version 95). Functional
 annotations for variant impacting protein-coding were also complemented with
 information on the potential alteration to their cognate protein's 3D structure
 and drug binding ability.
 </p>
 
 <p><b>Saudi Genome Program:</b> Data were downloaded 
 from <a href="https://figshare.com/articles/dataset/A_list_of_Saudi_Arabian_variants_and_their_allele_frequencies/28059686/1?file=51297884">Figshare</a>,
 and converted to VCF.
 </p>
 
 <h2>Credits</h2>
 <p>
 <b>MXB:</b> We thank the Center for Research and Advanced Studies (Cinvestav) of Mexico for
 generating and providing the frequency data, the National Institute of Medical
 Sciences and Nutrition (INCMNSZ) for DNA extraction, and the Ministry of Health
 together with the National Institute of Public Health (INSP) for the design and
 implementation of the National Health Survey 2000 (ENSA 2000). We also thank
 the ENSA-Genomics Consortium for their contributions to sample collection and
 data processing that made possible the construction of the MXB genomic
 resource.
 </p>
 <p>
 <b>MCPS:</b> Data produced by Regeneron RGC and collaborators, which are the
 University of Oxford, Universidad Nacional Aut&oacute;noma de M&eacute;xico (UNAM) and
 National Institute of Genomic Medicine in Mexico.
 The Regeneron Genetics Center, University of Oxford, Universidad Nacional
 Aut&oacute;noma de M&eacute;xico (UNAM), National Institute of Genomic Medicine in Mexico,
 Abbvie Inc. and AstraZeneca UK Limited (collectively, the &quot;Collaborators&quot;) bear
 no responsibility for the analyses or interpretations of the data presented
 here. Any opinions, insights, or conclusions presented herein are those of the
 authors and not of the Collaborators.
 </p>
 <p>
 <b>Regeneron Million Exomes:</b> The Regeneron Genetics Center, and its collaborators
 (collectively, the &quot;Collaborators&quot;) bear no responsibility for the analyses or
 interpretations of the data presented here. Any opinions, insights, or
 conclusions presented herein are those of the authors and not of the
 Collaborators. This research has been conducted using the UK Biobank Resource
 under application number 26041.
 </p>
 <p>
 <b>SGDP:</b> This project was funded by the Simons Foundation. Thanks to David Reich and Swapan 
 Mallick for help with importing the data.
 </p>
 <p>
 <b>KOVA:</b> Thanks to Insu Jang and the KOVA director for providing variant frequencies in TSV
 format.
 </p>
 <p>
 <b>Finngen:</b> We want to acknowledge the participants and investigators of the FinnGen study.
 </p>
 
 <p>
 <b>SweGen:</b> The SweGen allele frequency data was generated by Science for
 Life Laboratory. The data may be redistributed in original or modified form,
 but must always be distributed together with the file &quot;terms_of_use.txt&quot; that
 is stored together with the data on our download server, and any redistributed
 data derived from the SweGen data set must follow those terms and conditions.
 The data may not be used to attempt to identify any individual in this or other studies.
 </p>
 
 <p>
 <b>NPM Singapore:</b> Thanks to the NPM Data Access Committee and Eleanor for granting our data
 request. 
 By browsing the data, you agree to use the data only for academic, non-commercial
 research to improve human health (biology/disease).  We request all data users
 agree to protect the
 confidentiality of the data subjects in any research papers or publications
 that they may prepare, by taking all reasonable care to limit the possibility
 of identification. In particular, the data users shall not use, or attempt
 to use, the data to deliberately compromise or otherwise infringe the
 confidentiality of information on data subjects and their right to privacy.
 If you use any of the data obtained from the CHORUS variant browser, we request
 that you cite the NPM flagship paper (Wong et al., 2023). All users of the
 data must take note that the data provider and relevant SG10K_Health cohort
 owners bear no responsibility for the further analysis or interpretation of the
 data.  </p>
 
-<p>Thanks to Alex Ioannidis, UCSC, and Andreas Lahner, MGZ, for feedback on this track.</p>
+<p>Thanks to Alex Ioannidis, UCSC, for the idea and motivation for this track. 
+Thanks to Andreas Lahner, MGZ, for feedback and suggestions.</p>
 
 <h2>References</h2>
 <p>
 Barberena-Jonas, C. et al. (2025). MexVar database: Clinical genetic variation beyond the
 Hispanic label in the Mexican Biobank. <em>Nature Medicine (in press)</em>.
 </p>
 
 <p>
 Sohail M, Moreno-Estrada A.
 <a href="https://journals.biologists.com/dmm/article-lookup/doi/10.1242/dmm.050522" target="_blank">
 The Mexican Biobank Project promotes genetic discovery, inclusive science and local capacity
 building</a>.
 <em>Dis Model Mech</em>. 2024 Jan 1;17(1).
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/38299665" target="_blank">38299665</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10855211/" target="_blank">PMC10855211</a>
 </p>
 
 <p>
 Sohail M, Palma-Mart&iacute;nez MJ, Chong AY, Quinto-Cor&eacute;s CD, Barberena-Jonas C, Medina-Mu&ntilde;oz SG,
 Ragsdale A, Delgado-S&aacute;nchez G, Cruz-Hervert LP, Ferreyra-Reyes L <em>et al</em>.
 <a href="https://doi.org/10.1038/s41586-023-06560-0" target="_blank">
 Mexican Biobank advances population and medical genomics of diverse ancestries</a>.
 <em>Nature</em>. 2023 Oct;622(7984):775-783.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/37821706" target="_blank">37821706</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10600006/" target="_blank">PMC10600006</a>
 </p>
 
 <p>
 Ziyatdinov A, Torres J, Alegre-D&iacute;az J, Backman J, Mbatchou J, Turner M, Gaynor SM, Joseph T, Zou Y,
 Liu D <em>et al</em>.
 <a href="https://doi.org/10.1038/s41586-023-06595-3" target="_blank">
 Genotyping, sequencing and analysis of 140,000 adults from Mexico City</a>.
 <em>Nature</em>. 2023 Oct;622(7984):784-793.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/37821707" target="_blank">37821707</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10600010/" target="_blank">PMC10600010</a>
 </p>
 
 <p>
 GenomeAsia100K Consortium.
 <a href="https://doi.org/10.1038/s41586-019-1793-z" target="_blank">
 The GenomeAsia 100K Project enables genetic discoveries across Asia</a>.
 <em>Nature</em>. 2019 Dec;576(7785):106-111.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/31802016" target="_blank">31802016</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7054211/" target="_blank">PMC7054211</a>
 </p>
 
 <p>
 Sun KY, Bai X, Chen S, Bao S, Zhang C, Kapoor M, Backman J, Joseph T, Maxwell E, Mitra G <em>et
 al</em>.
 <a href="https://doi.org/10.1038/s41586-024-07556-0" target="_blank">
 A deep catalogue of protein-coding variation in 983,578 individuals</a>.
 <em>Nature</em>. 2024 Jul;631(8021):583-592.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/38768635" target="_blank">38768635</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11254753/" target="_blank">PMC11254753</a>
 </p>
 
 <p>
 Tadaka S, Kawashima J, Hishinuma E, Saito S, Okamura Y, Otsuki A, Kojima K, Komaki S, Aoki Y, Kanno
 T <em>et al</em>.
 <a href="https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkad978" target="_blank">
 jMorp: Japanese Multi-Omics Reference Panel update report 2023</a>.
 <em>Nucleic Acids Res</em>. 2024 Jan 5;52(D1):D622-D632.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/37930845" target="_blank">37930845</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10767895/" target="_blank">PMC10767895</a>
 </p>
 
 
 
 <p>
 Naslavsky MS, Scliar MO, Yamamoto GL, Wang JYT, Zverinova S, Karp T, Nunes K, Ceroni JRM, de
 Carvalho DL, da Silva Sim&otilde;es CE <em>et al</em>.
 <a href="https://doi.org/10.1038/s41467-022-28648-3" target="_blank">
 Whole-genome sequencing of 1,171 elderly admixed individuals from S&atilde;o Paulo, Brazil</a>.
 <em>Nat Commun</em>. 2022 Mar 4;13(1):1004.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/35246524" target="_blank">35246524</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8897431/" target="_blank">PMC8897431</a>
 </p>
 
 
 
 <p>
 Jain A, Bhoyar RC, Pandhare K, Mishra A, Sharma D, Imran M, Senthivel V, Divakar MK, Rophina M,
 Jolly B <em>et al</em>.
 <a href="https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkaa923" target="_blank">
 IndiGenomes: a comprehensive resource of genetic variants from over 1000 Indian genomes</a>.
 <em>Nucleic Acids Res</em>. 2021 Jan 8;49(D1):D1225-D1232.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/33095885" target="_blank">33095885</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7778947/" target="_blank">PMC7778947</a>
 </p>
 
 
 
 <p>
 Bergstr&ouml;m A, McCarthy SA, Hui R, Almarri MA, Ayub Q, Danecek P, Chen Y, Felkel S, Hallast P, Kamm J
 <em>et al</em>.
 <a href="https://www.science.org/doi/10.1126/science.aay5012" target="_blank">
 Insights into human genetic variation and population history from 929 diverse genomes</a>.
 <em>Science</em>. 2020 Mar 20;367(6484).
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/32193295" target="_blank">32193295</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7115999/" target="_blank">PMC7115999</a>
 </p>
 
 <p>
 Koenig Z, Yohannes MT, Nkambule LL, Zhao X, Goodrich JK, Kim HA, Wilson MW, Tiao G, Hao SP, Sahakian
 N <em>et al</em>.
 <a href="https://pmc.ncbi.nlm.nih.gov/articles/pmid/38749656/" target="_blank">
 A harmonized public resource of deeply sequenced diverse human genomes</a>.
 <em>Genome Res</em>. 2024 Jun 25;34(5):796-809.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/38749656" target="_blank">38749656</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11216312/" target="_blank">PMC11216312</a>
 </p>
 
 <p>
 Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S,
 Tandon A <em>et al</em>.
 <a href="https://doi.org/10.1038/nature18964" target="_blank">
 The Simons Genome Diversity Project: 300 genomes from 142 diverse populations</a>.
 <em>Nature</em>. 2016 Oct 13;538(7624):201-206.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/27654912" target="_blank">27654912</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5161557/" target="_blank">PMC5161557</a>
 </p>
 
 <p>
 Lee J, Lee J, Jeon S, Lee J, Jang I, Yang JO, Park S, Lee B, Choi J, Choi BO <em>et al</em>.
 <a href="https://doi.org/10.1038/s12276-022-00871-4" target="_blank">
 A database of 5305 healthy Korean individuals reveals genetic and clinical implications for an East
 Asian population</a>.
 <em>Exp Mol Med</em>. 2022 Nov;54(11):1862-1871.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/36323850" target="_blank">36323850</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9628380/" target="_blank">PMC9628380</a>
 </p>
 
 <p>
 Kurki MI, Karjalainen J, Palta P, Sipil&auml; TP, Kristiansson K, Donner KM, Reeve MP, Laivuori H,
 Aavikko M, Kaunisto MA <em>et al</em>.
 <a href="https://doi.org/10.1038/s41586-022-05473-8" target="_blank">
 FinnGen provides genetic insights from a well-phenotyped isolated population</a>.
 <em>Nature</em>. 2023 Jan;613(7944):508-518.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/36653562" target="_blank">36653562</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9849126/" target="_blank">PMC9849126</a>
 </p>
 
 <p>
 Wong E, Bertin N, Hebrard M, Tirado-Magallanes R, Bellis C, Lim WK, Chua CY, Tong PML, Chua R, Mak K
 <em>et al</em>.
 <a href="https://doi.org/10.1038/s41588-022-01274-x" target="_blank">
 The Singapore National Precision Medicine Strategy</a>.
 <em>Nat Genet</em>. 2023 Feb;55(2):178-186.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/36658435" target="_blank">36658435</a>
 </p>
 
 <p>
 Malomane DK, Williams MP, Huber CD, Mangul S, Abedalthagafi M, Chiang CWK.
 <a href="https://doi.org/10.1101/2025.01.10.632500" target="_blank">
 Patterns of population structure and genetic variation within the Saudi Arabian population</a>.
 <em>bioRxiv</em>. 2025 Jan 13.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/39868174" target="_blank">39868174</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11761371/" target="_blank">PMC11761371</a>
 </p>
 
 <p>
 Ameur A, Dahlberg J, Olason P, Vezzi F, Karlsson R, Martin M, Viklund J, K&auml;h&auml;ri AK,
 Lundin P, Che H
 <em>et al</em>.
 <a href="https://doi.org/10.1038/ejhg.2017.130" target="_blank">
 SweGen: a whole-genome data resource of genetic variability in a cross-section of the Swedish
 population</a>.
 <em>Eur J Hum Genet</em>. 2017 Nov;25(11):1253-1260.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/28832569" target="_blank">28832569</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5765326/" target="_blank">PMC5765326</a>
 </p>
 
 <p>
 SPARK Consortium.
 <a href="https://linkinghub.elsevier.com/retrieve/pii/S0896-6273(18)30018-7" target="_blank">
 SPARK: A US Cohort of 50,000 Families to Accelerate Autism Research</a>.
 <em>Neuron</em>. 2018 Feb 7;97(3):488-493.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/29420931" target="_blank">29420931</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7444276/" target="_blank">PMC7444276</a>
 </p>
 
 
 
 <p>
 Lacaze P, Pinese M, Kaplan W, Stone A, Brion MJ, Woods RL, McNamara M, McNeil JJ, Dinger ME, Thomas
 DM.
 <a href="https://doi.org/10.1038/s41431-018-0279-z" target="_blank">
 The Medical Genome Reference Bank: a whole-genome data resource of 4000 healthy elderly individuals.
 Rationale and cohort design</a>.
 <em>Eur J Hum Genet</em>. 2019 Feb;27(2):308-316.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/30353151" target="_blank">30353151</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6336775/" target="_blank">PMC6336775</a>
 </p>