src/hg/makeDb/trackDb/human/varFreqs.html 9cfd62d70caa8a7ce7610fcd55c1cf2cc6ef942b

9cfd62d70caa8a7ce7610fcd55c1cf2cc6ef942b
jnavarr5
  Wed Feb 4 10:52:00 2026 -0800
Making lines less than 100 characters. Fixing typos from code review, refs #37057

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index 6d672bc37cb..c86b4361e97 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -1,40 +1,42 @@
 <h2>Description</h2>
 <p>
 This container shows results from projects where the variant frequencies, aka allele frequencies,
 are publicly available. The tracks were collected from the 
-projects listed below. More detailed data for projects that provide haplotype-phased genotypes/variants can also be found
+projects listed below. More detailed data for projects that provide haplotype-phased
+genotypes/variants can also be found
 in other tracks: 1000 Genomes is a separate track, and the projects HGDP, SGDP,
-HGDP+1000 Genomes and Mexico Biobank can be found in the &quot;Phased Variants&quot; track, showing the linkage between variants.
+HGDP+1000 Genomes and Mexico Biobank can be found in the &quot;Phased Variants&quot; track, showing
+the linkage between variants.
 </p>
 
 <p>If you want us to add other projects, please contact us. We were
 unable to obtain variant frequencies from the following projects: UK Biobank (request pending), 
 Regeneron's Million Exomes and Mexico City Studies (request rejected).
 </p>
 
 <p>
 The following projects were added:
 <ul>
     <li>
         <b><a href="https://rgc-mcps.regeneron.com/home"
         target="_blank">Mexico City Prospective Study (MCPS)</a></b>:
         9,950 whole genome sequenced individuals and 141,046 exome sequenced and genotyped
         individuals from the Mexico City Prospective Study (MCPS), a collaboration between the
         Regeneron Genetics Center, University of Oxford, Universidad Nacional Aut&oacute;noma de
         M&eacute;xico (UNAM), National Institute of Genomic Medicine in Mexico, Abbvie Inc. and
-        AstraZeneca UK. For details see (Ziyatdinov A, Nature 2023), the reference section.
+        AstraZeneca UK. For details, see (Ziyatdinov A, Nature 2023) in the reference section.
     </li>
 
     <li>
         <b><a href="https://rgc-research.regeneron.com/me/home"
         target="_blank">Regeneron Million Exomes Project (ME)</a></b>:
         Whole-exomes of 983,578 individuals sequenced by the Regeneron Genetics Center (RGC).
         These data span dozens of collaborations including large biobanks and
         health systems. All data were generated by the RGC on a single, harmonized
         sequencing and informatics protocol. The dataset includes individuals across
         diverse ancestral populations, encompassing outbred and founder populations and
         cohorts with high rates of consanguinity. See (Sun et al, Nature 2024) for details.
     </li>
 
     <li>
         <b><a href="https://topmed.nhlbi.nih.gov/" target="_blank">NHLBI TOPMED Freeze 10</a></b>:
@@ -46,31 +48,32 @@
         disorders to advance precision medicine and improve population health. Freeze
         10 contains 868,581,653 variants from 150,899 whole genomes. VCFs were
         downloaded from <a href="https://bravo.sph.umich.edu/terms.html"
         target="_blank">BRAVO</a>.
     </li>
 
     <li>
         <b><a href="https://sparkforautism.org/" target="_blank">SFARI SPARK</a></b>:
         The Simons Foundation Autism Research Initiative (SFARI) recruited
         a large cohort of families with autistic children who provided DNA
         samples and phenotypes.  54,558 families, parents and their children
         were sequenced, a total of 142,357 individuals with whole-exome (WES)
         and 12,519 with whole-genome sequencing (WGS).  The data contains
         32,559 trios and 8,895 quads (one sibling without autism), and 824
         twins. The same frequencies shown here
-        are also available publicly on the <a href="https://genomes.sfari.org/" target=_blank>SFARI Genome Browser</a>. 
+        are also available publicly on the
+        <a href="https://genomes.sfari.org/" target="_blank">SFARI Genome Browser</a>. 
        See (SPARK et al, Neuron 2018) for details or the methods below on this page.
     </li>
 
     <li>
         <b><a href="https://www.genomeasia100k.org/"
         target="_blank">GenomeAsia Pilot (GAsP)</a></b>:
         Whole-genome sequencing data of 1,739 individuals from 219 population groups across Asia.
         See (GenomeAsia Consortium, Nature 2019) for details.
     </li>
 
     <li>
         <b><a href="https://sgc.garvan.org.au/initiatives/mgrb/index.html"
         target="_blank">Australia MGRB</a></b>:
         The Australian Medical Genome Reference Bank collected
         whole-genome sequencing data of 4,011 healthy elderly individuals who
@@ -85,126 +88,130 @@
         The NCBI ALlele Frequency Aggregator pipeline computes allele frequencies from
         approved, unrestricted dbGaP studies and makes them publicly available through
         dbSNP. Its goal is to release frequency data from over one million dbGaP
         subjects to aid discoveries involving common and rare variants with biological
         or disease relevance. The R4 release includes 408,709 subjects and allele
         frequencies for 15.5 million rs sites, including nearly one million ClinVar
         variants. We converted the NCBI track hub to VCF format; the data is freely available.
         Genotype and associated individual-level data are accessible through the dbGaP
         <a href="https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login"
         target="_blank">authorized access request</a> system.
     </li>
 
     <li>
         <b><a href="https://www.finngen.fi/en" target="_blank">FinnGen</a></b>:
         Imputed variants from 500,348 Biobank samples obtained using genotyping arrays
-        in Finnland, 10% of the population. The imputation used phased variants obtained from 8,554
-        high-quality whole genome sequences, also from Finnland. For details, see (Kurki et al,
+        in Finland, 10% of the population. The imputation used phased variants obtained from 8,554
+        high-quality whole genome sequences, also from Finland. For details, see (Kurki et al,
         Nature 2023).
         Phenotype links can be shown at <a href="https://r12.finngen.fi/">FinnGen PheWeb</a>.
     </li>
 
     <li>
         <b><a href="https://swefreq.nbis.se/dataset/SweGen" target="_blank">SweGen</a></b>:
         Whole-genome sequencing variant frequencies for 1000 Swedish individuals generated within
         the SweGen project.
         The 1000 individuals included in the SweGen project represent a
         cross-section of the Swedish population, and no disease information
         was used for the selection. The frequency data may therefore
         include genetic variants that are associated with, or causative of,
         disease. SweGen also provides SV calls, TEs, MELT results for TEs, HLAs and new sequence.
         For details, see (Ameur et al, Eur J Hum Genet 2017).
-        Dataset can be browsed at the
+        The dataset can be browsed at the
         <a href="https://swefreq.nbis.se/dataset/SweGen/browser">SweGen Browser</a>.
     </li>
 
     <li>
         <b><a href="https://jmorp.megabank.tohoku.ac.jp/downloads" target="_blank">JPN To61k Japan
         Tohoku University Tohoku Medical Megabank Organization 61k Allele frequency panel
         (JPN 61k)</a></b>:
         An allele frequency panel based on short-read WGS analysis of 61,000 Japanese individuals.
         The project includes other datatypes, such as STRs, long-read SVs and short-read CNVs.
         Data can be downloaded from the <a href="https://jmorp.megabank.tohoku.ac.jp"
         target="_blank">jMorp Website</a>, specifically the
         <a href="https://jmorp.megabank.tohoku.ac.jp/downloads" target="_blank">Downloads</a>
         section. For details, see (Tadaka et al, NAR 2023).
     </li>
 
     <li>
         <b><a href="https://abraom.ib.usp.br/"
         target="_blank">Brazil Arquivo Brasileiro Online de Muta&ccedil;&otilde;es (ABraOM)</a></b>:
         Genomic variants obtained with whole-genome sequencing from SABE, a
         census-based sample of elderly individuals from S&atilde;o Paulo, Brazil's
         largest city. The Brazilian population is constituted by ~500 years of
         admixture between Africans, Europeans, and Native Americans.
         Additionally, the cohort presents ~3% of individuals with non-admixed
         Japanese ancestry (early 20th century migration). Coverage 38.6x.  Data
         can be downloaded from the <a href="https://abraom.ib.usp.br/download/"
         target="_blank">ABraOM Website</a>. TEs, HLAs and new sequence are also available.
-        For details see (Naslavsky et al, Nat Comm 2022).
+        For details, see (Naslavsky et al, Nat Comm 2022).
     </li>
 
     <li>
         <b><a href="https://clingen.igib.res.in/indigen/" target="_blank">IndiGenomes</a></b>:
         Whole genome sequencing of 1,029 healthy Indian individuals under the pilot phase of the
         &quot;IndiGen&quot; program.
         Data can be downloaded from the <a href="https://clingen.igib.res.in/indigen/"
-        target="_blank">IndiGen Website</a>. For details see (Jain et al, NAR 2020). Only
+        target="_blank">IndiGen Website</a>. For details, see (Jain et al, NAR 2020). Only
         the allele frequency is available from this project. The website also provides SV call
         and Alu insertion VCFs.
     </li>
 
     <li>
         <b><a href="https://www.kobic.re.kr/kova/"
         target="_blank">Korean Variant Archive (KOVA)</a></b>:
         1,896 whole genome sequencing and 3,409 whole exome sequencing data from healthy individuals
         of Korean ethnicity.
         Most of the samples originated from normal tissue of cancer
         patients (40.16 %), healthy parents of rare disease patients (28.4 %),
         or healthy volunteers (31.44 %). Japanese ancestry is broken down
         in the INFO field. Coverage 100x for WES, 30x for WGS. SVs called with Manta
-        are also available. For details see (Lee et al, Exp Mol Med 2022).</li>
+        are also available. For details, see (Lee et al, Exp Mol Med 2022).</li>
     <li>
         <b><a href="https://www.npm.sg/"
         target="_blank">NPM Singapore</a></b>:
         9,770 whole genomes, mostly of Chinese, Indian and Malay ancestry. 
         A minimum allele count cutoff of &gt; 5 was applied.
         Data is available for download from the CHORUS browser, see &quot;Data access&quot; below.
-        For details see (Wong et al, Nat Genetics 2023). CNV data is also available there.
+        For details, see (Wong et al, Nat Genetics 2023). CNV data is also available there.
     </li>
     <li>
         <b><a href="https://www.vision2030.gov.sa/en/explore/projects/the-saudi-genome-program"
         target="_blank">Saudi Genome Program</a></b>:
         Variant frequencies from 302 whole genomes at 30x coverage, on Saudi Genome Program Samples.
         The genotyping data and imputations from 3,352 individuals do not seem to be available
-        publicly. For details see (Malomane et al 2025). 
+        publicly. For details, see (Malomane et al 2025). 
     </li>
 </ul>
 </p>
 
 <h2>Display Conventions</h2>
 
 <p>Most tracks only show the variant and allele frequencies on mouseover or clicks.
 When zoomed in, tracks display alleles with base-specific coloring. Homozygote
 data are shown as one letter, while heterozygotes will be displayed with both
-letters. All VCF files are normalized, with one single allele per annotation (no multi-allele lines).
+letters. All VCF files are normalized, with one single allele per annotation (no multi-allele
+lines).
 </p>
 
 
 <h2>Data Access</h2>
-<p>Most of the data in these tracks are not available for download from UCSC and the data can only be browsed on our website.
-But all variant data can be downloaded for free from the original project websites. Accessing it usually requires a click-through license or filling out an access request form on the respective websites, by following these instructions:
+<p>
+Most of the data in these tracks are not available for download from UCSC and the data can only be
+browsed on our website. But all variant data can be downloaded for free from the original project
+websites. Accessing it usually requires a click-through license or filling out an access request
+form on the respective websites, by following these instructions:
 </p>
 
 <p>
 <b>MXB:</b> Allele frequencies by geographical state and ancestry are available via
 the <a target="_blank" href="https://morenolab.shinyapps.io/mexvar/">MexVar platform</a>.
 Raw genotype data are available under controlled access at the
 EGA (Study: EGAS00001005797; Dataset: EGAD00010002361). For the VCFs, email
 andres.moreno@cinvestav.mx.
 </p>
 
 <!--
 <p>
 <b>MCPS:</b> VCFs with summarized allele frequencies are available from
 the <a target="_blank" href="https://rgc-mcps.regeneron.com/">MCPS website</a>.
 </p>
@@ -236,40 +243,43 @@
 
 <p>
 <b>GenomeAsia Pilot:</b> VCFs are available from UCSC and also from
 the <a target="_blank"
 href="https://browser.genomeasia100k.org/#tid=download">GenomeAsia 100K website</a>.
 No license or login is required.
 </p>
 
 <p><b>KOVA:</b> 
 TSV data can be requested on the <a href="https://www.kobic.re.kr/kova/downloads"
 target="_blank">KOVA Downloads</a> website. Our Github repo contains a script that 
 converts this format to VCF.
 </p>
 
 <p><b>FinnGen:</b> TSV data can be requested via the form at
-<a href="https://finngen.gitbook.io/documentation/data-download" target=_blank>Finngen</a>
+<a href="https://finngen.gitbook.io/documentation/data-download" target="_blank">FinnGen</a>,
 which triggers an automated email containing the download
 link. A script in our Github repo converts this file to VCF (see methods below).</p>
 
 <p><b>SweGen:</b> VCF files can be requested at
-<a target="_blank" href="https://swefreq.nbis.se/dataset/SweGen">SweGen</a> via a form, the request needs manual approval, which usually is quick. If there is no reply, email SweGen directly.
+<a target="_blank" href="https://swefreq.nbis.se/dataset/SweGen">SweGen</a> via a form. The request
+needs manual approval, which is usually quick. If there is no reply, email SweGen directly.
 </p>
 
 <p><b>NPM:</b> 
-VCF download can be requested on the <a href="https://chorus.grids-platform.io/" target="_blank">Chorus Browser</a> website, which requires an <a href = "https://npm.a-star.edu.sg/" target=_blank>account and data access request</a>. 
+VCF download can be requested on the <a href="https://chorus.grids-platform.io/"
+target="_blank">Chorus Browser</a> website, which requires an <a href="https://npm.a-star.edu.sg/"
+target="_blank">account and data access request</a>.
 </p>
 
 <h2>Methods</h2>
 <p>The following are quotes from the respective papers and/or websites of the datasets:</p>
 
 <p>
 <b>MXB:</b> Genotyping was performed with the Illumina Multi-Ethnic Global Array
 (MEGA, ~1.8M SNPs), optimized for admixed populations and enriched for
 ancestry-informative and medically relevant variants. Only autosomal, biallelic
 SNPs passing quality control are included. Samples were selected from 898
 recruitment sites, with prioritization of indigenous language speakers. Data
 processing included GenomeStudio &rarr; PLINK conversion, strand alignment, removal
 of duplicates and low-quality variants/individuals, update of map positions using
 dbSNP Build 151, and relatedness filtering.
 </p>
@@ -313,31 +323,31 @@
 with putative loss-of-function variants identified using LOFTEE v0.3-beta irrespective of confidence
 labels. Variant and genotype quality was further assessed using the in-house CEGH-Filter two-step
 algorithm based on depth and allele balance, and analyses retained only GATK VQSR-AS PASS variants
 and higher-confidence CEGH-Filter calls. Relatedness was assessed using KING and PC-Relate
 (GENESIS), retaining a single proband per related pair and excluding one contaminated sample
 (&gt;3% by verifyBAMID), resulting in a final dataset of 1,171 unrelated individuals. Final samples
 achieved mean coverages ranging from 31.3x to 64.8x, with an average of 38.65x and a median of
 36.6x.
 </p>
 
 <p><b>SFARI SPARK:</b> The genome browser track project was approved by the Simons 
 Foundation under request number 14584.1. WES and WGS Data were downloaded from 
 <a href="https://base.sfari.org/" target="_blank">SFARI Base</a>.
 pVCFs were downloaded, anonymized with a script using bcftools and its "fill-tags" plugin and
 normalized. There was no minimum allele frequency cutoff.<br>
-The methods are documented as follows by SFARI:<br>
+The methods are documented as follows by SFARI:</p>
 <ul>
   <li>
     <b>WES</b>:
     This release consists of sequence and variant call data for 12,519
     unique individuals, of which 12,517 (99.98%) have available genome-wide
     SNP genotype data. Sequencing and genotyping of all samples in this
     release was performed at New York Genome Center (NYGC). DNA from saliva
     samples were extracted and prepared with PCR-free methods and sequenced
     with paired-end sequencing of 150 bases on the Illumina NovaSeq 6000
     system.  Alignment of reads to the human reference genome version
     GRCh38, duplicate read marking, and Base Quality Score Recalibration
     (BQSR) were performed by New York Genome Center (NYGC). Whole-genome
     sequencing data were processed using a standardized, functionally
     equivalent CCDG pipeline with alignment to the GRCh38DH (1000 Genomes)
     reference using BWA-MEM v0.7.15 (deterministic settings, no -M, use of
@@ -377,31 +387,31 @@
     sequencing genotyping sites (see Genotyping Methods), the full
     mitochondrial genome, and coverage boosted at selected sites
     for assaying clonal hematopoiesis of indeterminate potential
     (CHIP).  SFARI performed SNV/indel calling via DeepVariant and
     GATK to generate gVCFs, pairwise relatedness inferred using
     PLINK v1.9 IBD estimates from common SNPs (AF &ge; 0.01, dbSNP
     v151) with &ge;15% relatedness flagged, and comprehensive
     individual- and family-level quality control executed using the
     internal GenomeCheckMate pipeline to exclude samples based on
     contamination (&ge;5%), insufficient coverage (&lt;20x in &lt;80% of
     targets), sex discordance, pedigree/IBD inconsistencies,
     unregistered relationships, unexpected duplicates, or excess
     relatedness, after which QC-passing individuals (selecting the
     most recent passing sample per person) were retained for
     variant calling and joint genotyping.
-    </p></li>
+    </li>
 </ul>
 
 <p><b>FinnGen:</b> R12 annotated variants were downloaded from the Google Cloud
 bucket link received through an email and
 converted to VCF
 with a <a
 href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/finngen_to_vcf.py"
 target="_blank">custom Python script</a>. </p>
 
 <p><b>SweGen:</b> Fragment size 350bp on a Covaris E220. Paired-end sequencing with 150bp read
 length was performed on Illumina HiSeq X (HiSeq Control Software 3.3.39/RTA 2.7.1) with v2.5
 sequencing chemistry. Raw whole-genome reads were aligned to the GRCh37 reference using BWA-MEM
 v0.7.12, then sorted and indexed with samtools v0.1.19 and assessed with qualimap v2.2.20;
 per-sample alignments from multiple lanes and flow cells were merged using Picard MergeSamFiles
 v1.120. Processing followed GATK best practices with GATK v3.3, including indel realignment