src/hg/makeDb/trackDb/human/varFreqs.html d4e7e1a69b17ccebfc8b558f220dd0e1cc3ba1d0

d4e7e1a69b17ccebfc8b558f220dd0e1cc3ba1d0
max
  Mon Feb 2 06:14:27 2026 -0800
removing regeneron data after rejection of request, code review feedback, refs #36978

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index 1a73429546f..625e2c0bf0e 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -1,26 +1,27 @@
 <h2>Description</h2>
 <p>
 This container shows results from projects where the variant frequencies, aka allele frequencies,
 are publicly available. The tracks were collected from the 
-projects listed below. Projects that provide haplotype-phased genotypes/variants can be found
-elsewhere: 1000 Genomes is a separate track, and the projects HGDP, SGDP,
-HGDP+1000 Genomes and Mexico Biobank can be found in the &quot;Phased Variants&quot; track.
+projects listed below. More detailed data for projects that provide haplotype-phased genotypes/variants can also be found
+in other tracks: 1000 Genomes is a separate track, and the projects HGDP, SGDP,
+HGDP+1000 Genomes and Mexico Biobank can be found in the &quot;Phased Variants&quot; track, which shows the linkage between variants.
 </p>
-<p>If you want us to add other projects, please contact us. We asked and were
-unable to obtain variant frequencies from the following projects: UK Biobank (request pending), All
-of Us (granted, ongoing).
+
+<p>If you want us to add other projects, please contact us. We were
+unable to obtain variant frequencies from the following projects: UK Biobank (request pending), 
+Regeneron's Million Exomes and Mexico City Studies (request rejected).
 </p>
 
 <p>
 The following projects were added:
 <ul>
     <li>
         <b><a href="https://rgc-mcps.regeneron.com/home"
         target="_blank">Mexico City Prospective Study (MCPS)</a></b>:
         9,950 whole genome sequenced individuals and 141,046 exome sequenced and genotyped
         individuals from the Mexico City Prospective Study (MCPS), a collaboration between the
         Regeneron Genetics Center, University of Oxford, Universidad Nacional Aut&oacute;noma de
         M&eacute;xico (UNAM), National Institute of Genomic Medicine in Mexico, Abbvie Inc. and
         AstraZeneca UK. For details see (Ziyatdinov A, Nature 2023), the reference section.
     </li>
 
@@ -178,99 +179,102 @@
         publicly. For details see (Malomane et al 2025). 
     </li>
 </ul>
 </p>
 
 <h2>Display Conventions</h2>
 
 <p>Most tracks only show the variant and allele frequencies on mouseover or clicks.
 When zoomed in, tracks display alleles with base-specific coloring. Homozygote
 data are shown as one letter, while heterozygotes will be displayed with both
 letters. All VCF files are normalized, with one single allele per annotation (no multi-allele lines).
 </p>
 
 
 <h2>Data Access</h2>
-<p>Most of the data in these tracks are not available for download from UCSC.
-Data can be browsed on our website.
-But the data can be downloaded for free from the original projects. Accessing the 
-data usually requires a click-through license or filling out an access request form on the respective websites, links are either provided above in the project description or with more details here:
+<p>Most of the data in these tracks are not available for download from UCSC and can only be browsed on our website.
+However, all variant data can be downloaded for free from the original project websites. Access usually requires accepting a click-through license or filling out an access request form on the respective website, as described in the project-specific instructions below:
 </p>
 
 <p>
 <b>MXB:</b> Allele frequencies by geographical state and ancestry are available via
 the <a target="_blank" href="https://morenolab.shinyapps.io/mexvar/">MexVar platform</a>.
 Raw genotype data are available under controlled access at the
 EGA (Study: EGAS00001005797; Dataset: EGAD00010002361). For the VCFs, email
 andres.moreno@cinvestav.mx.
 </p>
+
+<!--
 <p>
 <b>MCPS:</b> VCFs with summarized allele frequencies are available from
 the <a target="_blank" href="https://rgc-mcps.regeneron.com/">MCPS website</a>.
 </p>
 <p>
 <b>Regeneron one million exomes:</b> VCFs with summarized allele frequencies are available from
 the <a target="_blank" href="https://rgc-research.regeneron.com/me/resources">RGC ME website</a>.
 </p>
+-->
+
 <p>
 <b>TOPMED:</b> VCFs with summarized allele frequencies are available from
 the <a target="_blank" href="https://bravo.sph.umich.edu/">TOPMED BRAVO website</a>. They require a
 login.
 </p>
 <p>
 <b>SFARI SPARK:</b> Allele frequencies can be displayed on the
         <a href="https://genomes.sfari.org/" target=_blank>SFARI Genome Browser</a>.
         Full CRAMs and VCFs with genotypes are available from <a target="_blank"
         href="https://base.sfari.org/">SFARI Base</a>. 
 They require a data access request, which is usually reviewed quickly. More information is available
 in the <a href="https://cohorts-cdn.simonsfoundation.org/spark/researcher_packets/SPARK_SFARI_Researcher_Welcome_Packet.pdf"
 target=_blank>SPARK Welcome Packet</a>.
 </p>
 
 <p>
 <b>Australia MGRB:</b> VCF access can be requested via a form from 
 <a target="_blank" href="https://sgc.garvan.org.au/terms/mgrb/index.html">Sydney Genomics</a>.
 </p>
 
 <p>
 <b>GenomeAsia Pilot:</b> VCFs are available from UCSC and also from
 the <a target="_blank"
 href="https://browser.genomeasia100k.org/#tid=download">GenomeAsia 100K website</a>.
 No license nor login.
 </p>
 
 <p><b>KOVA:</b> 
 TSV data can be requested on the <a href="https://www.kobic.re.kr/kova/downloads"
-        target="_blank">KOVA Downloads</a> website. 
+target="_blank">KOVA Downloads</a> website. Our GitHub repo contains a script that 
+converts this format to VCF.
 </p>
 
 <p><b>Finngen:</b> TSV data can be requested via the form at
-https://finngen.gitbook.io/documentation/data-download which triggers an email with the download
-link.</p>
+<a href="https://finngen.gitbook.io/documentation/data-download" target=_blank>Finngen</a>
+which triggers an automated email containing the download
+link. A script in our GitHub repo converts this file to VCF (see Methods below).</p>
 
-<p><b>SweGen:</b> We are not allowed to redistribute the VCF, you can request it at
-<a target="_blank" href="https://swefreq.nbis.se/dataset/SweGen">SweGen</a>, alongside the VCF file.
+<p><b>SweGen:</b> VCF files can be requested at
+<a target="_blank" href="https://swefreq.nbis.se/dataset/SweGen">SweGen</a> via a form. The request needs manual approval, which is usually quick. If there is no reply, email SweGen directly.
 </p>
 
 <p><b>NPM:</b> 
-   VCF access can be requested on the 
-   <a href="https://chorus.grids-platform.io/" target="_blank">Chorus Browser</a> website, which
-   requires an <a href = "https://npm.a-star.edu.sg/" target=_blank>account and data access
-   request</a>. 
+VCF download can be requested on the <a href="https://chorus.grids-platform.io/" target="_blank">Chorus Browser</a> website, which requires an <a href = "https://npm.a-star.edu.sg/" target=_blank>account and data access request</a>. 
 </p>
 
 <h2>Methods</h2>
+<p>The following are quotes from the respective papers and/or websites of the datasets:</p>
+
 <p>
 <b>MXB:</b> Genotyping was performed with the Illumina Multi-Ethnic Global Array
 (MEGA, ~1.8M SNPs), optimized for admixed populations and enriched for
 ancestry-informative and medically relevant variants. Only autosomal, biallelic
 SNPs passing quality control are included. Samples were selected from 898
 recruitment sites, with prioritization of indigenous language speakers. Data
 processing included GenomeStudio &rarr; PLINK conversion, strand alignment, removal
 of duplicates, update of map positions using dbSNP Build 151 and low-quality
 variants/individuals, and relatedness filtering.
 </p>
 <p>
 <b>SGDP:</b> The version used was
 <a target="_blank" href="https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/"
 >https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/</a>,
 merged with bcftools and lifted to hg38 with CrossMap. 
@@ -305,34 +309,34 @@
 and indels were called following GATK Best Practices (GATK v3.7) via per-sample GVCFs
 (HaplotypeCaller), joint genotyping (CombineGVCFs, GenotypeGVCFs), and Variant Quality Score
 Recalibration (VQSR-AS); multiallelic variants were split with an in-house script, left-aligned with
 BCFtools, and annotated using Annovar and custom scripts against dbSNP, 1000 Genomes, and gnomAD,
 with putative loss-of-function variants identified using LOFTEE v0.3-beta irrespective of confidence
 labels. Variant and genotype quality was further assessed using the in-house CEGH-Filter two-step
 algorithm based on depth and allele balance, and analyses retained only GATK VQSR-AS PASS variants
 and higher-confidence CEGH-Filter calls. Relatedness was assessed using KING and PC-Relate
 (GENESIS), retaining a single proband per related pair and excluding one contaminated sample
 (&gt;3% by verifyBAMID), resulting in a final dataset of 1,171 unrelated individuals. Final samples
 achieved mean coverages ranging from 31.3x to 64.8x, with an average of 38.65x and a median of
 36.6x.
 </p>
 
 <p><b>SFARI SPARK:</b> The genome browser track project was approved by the Simons 
-Foundation as 14584.1. WES and WGS Data were downloaded from 
+Foundation under request number 14584.1. WES and WGS data were downloaded from 
 <a href="https://base.sfari.org/" target="_blank">SFARI Base</a>.
-pVCFs were downloaded, anonymized with a script using bcftools and the fill-tags plugin and
-normalized, without a minimum allele frequency cutoff.<br>
+pVCFs were downloaded, anonymized with a script using bcftools and its "fill-tags" plugin and
+normalized. There was no minimum allele frequency cutoff.<br>
 The methods are documented as follows by SFARI:<br>
 <ul>
   <li>
     <b>WES</b>:
     This release consists of sequence and variant call data for 12,519
     unique individuals, of which 12,517 (99.98%) have available genome-wide
     SNP genotype data. Sequencing and genotyping of all samples in this
     release was performed at New York Genome Center (NYGC). DNA from saliva
     samples were extracted and prepared with PCR-free methods and sequenced
     with paired-end sequencing of 150 bases on the Illumina NovaSeq 6000
     system.  Alignment of reads to the human reference genome version
     GRCh38, duplicate read marking, and Base Quality Score Recalibration
     (BQSR) were performed by New York Genome Center (NYCG). Whole-genome
     sequencing data were processed using a standardized, functionally
     equivalent CCDG pipeline with alignment to the GRCh38DH (1000 Genomes)
@@ -377,32 +381,32 @@
     GATK to generate gVCFs, pairwise relatedness inferred using
     PLINK v1.9 IBD estimates from common SNPs (AF &ge; 0.01, dbSNP
     v151) with &ge;15% relatedness flagged, and comprehensive
     individual- and family-level quality control executed using the
     internal GenomeCheckMate pipeline to exclude samples based on
     contamination (&ge;5%), insufficient coverage (&lt;20x in &lt;80% of
     targets), sex discordance, pedigree/IBD inconsistencies,
     unregistered relationships, unexpected duplicates, or excess
     relatedness, after which QC-passing individuals (selecting the
     most recent passing sample per person) were retained for
     variant calling and joint genotyping.
     </p></li>
 </ul>
 
 <p><b>Finngen:</b> R12 annotated variants were downloaded from the Google Cloud
-bucket link received though an email after filling out the form linked from
-https://finngen.gitbook.io/documentation/data-download and converted to VCF
+bucket link received through an email and
+converted to VCF
 with a <a
 href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/finngen_to_vcf.py"
 target="_blank">custom Python script</a>. </p>
 
 <p><b>SweGen:</b> Fragment size 350bp on a Covaris E220. Paired-end sequencing with 150bp read
 length was performed on Illumina HiSeq X (HiSeq Control Software 3.3.39/RTA 2.7.1) with v2.5
 sequencing chemistry. Raw whole-genome reads were aligned to the GRCh37 reference using BWA-MEM
 v0.7.12, then sorted and indexed with samtools v0.1.19 and assessed with qualimap v2.2.20;
 per-sample alignments from multiple lanes and flow cells were merged using Picard MergeSamFiles
 v1.120. Processing followed GATK best practices with GATK v3.3, including indel realignment
 (RealignerTargetCreator, IndelRealigner), duplicate marking (Picard MarkDuplicates v1.120), and
 base quality score recalibration (BaseRecalibrator), producing one finalized BAM per sample.
 Per-sample gVCFs were generated with GATK HaplotypeCaller v3.3 using reference files from the GATK
 v2.8 resource bundle, with all steps coordinated via Piper v1.4.0. Joint genotyping of 1,000 samples
 was performed by merging gVCFs in five batches of 200 using GATK CombineGVCFs, followed by cohort
@@ -498,31 +502,32 @@
 request. 
 By browsing the data, you agree to use the data only for academic, non-commercial
 research to improve human health (biology/disease).  We request all data users
 agree to protect the
 confidentiality of the data subjects in any research papers or publications
 that they may prepare, by taking all reasonable care to limit the possibility
 of identification. In particular, the data users shall not to use, or attempt
 to use, the data to deliberately compromise or otherwise infringe the
 confidentiality of information on data subjects and their right to privacy.
 If you use any of the data obtained from the CHORUS variant browser, we request
 that you cite the NPM flagship paper (Wong et al, 2023). All data users of the
 data must take note that the data provider and relevant SG10K_Health cohort
 owners bear no responsibility for the further analysis or interpretation of the
 data.  </p>
 
-<p>Thanks to Alex Ioannidis, UCSC, and Andreas Lahner, MGZ, for feedback on this track.</p>
+<p>Thanks to Alex Ioannidis, UCSC, for the idea and motivation for this track. 
+Thanks to Andreas Lahner, MGZ, for feedback and suggestions.</p>
 
 <h2>References</h2>
 <p>
 Barberena-Jonas, C. et al. (2025). MexVar database: Clinical genetic variation beyond the
 Hispanic label in the Mexican Biobank. <em>Nature Medicine (in press)</em>.
 </p>
 
 <p>
 Sohail M, Moreno-Estrada A.
 <a href="https://journals.biologists.com/dmm/article-lookup/doi/10.1242/dmm.050522" target="_blank">
 The Mexican Biobank Project promotes genetic discovery, inclusive science and local capacity
 building</a>.
 <em>Dis Model Mech</em>. 2024 Jan 1;17(1).
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/38299665" target="_blank">38299665</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10855211/" target="_blank">PMC10855211</a>