src/hg/makeDb/trackDb/human/varFreqs.html 6ba06b49cf98953915f487b3f9080e4dedd2df60

6ba06b49cf98953915f487b3f9080e4dedd2df60
jnavarr5
  Tue Jan 27 16:31:41 2026 -0800
Making each line less than 100 characters. Replacing double quotes, <, and > with the HTML entities. Using HTML characters for special characters. Making a section of the Methods for Sfari Spark to use bullet points, refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index 39d2482b47a..29c83610b23 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -1,25 +1,26 @@
 <h2>Description</h2>
 <p>
-This container shows results from projects
-where the variant frequencies, aka allele frequencies, are publicly available. The tracks were collected from the 
+This container shows results from projects where the variant frequencies, aka allele frequencies,
+are publicly available. The tracks were collected from the 
 projects listed below. Projects that provide haplotype-phased genotypes/variants can be found
 elsewhere: 1000 Genomes is a separate track, and the projects HGDP, SGDP,
-HGDP+1000 Genomes and Mexico Biobank can be found in the "Phased Variants" track.
+HGDP+1000 Genomes and Mexico Biobank can be found in the &quot;Phased Variants&quot; track.
 </p>
 <p>If you want us to add other projects, please contact us. We asked and were
-unable to obtain variant frequencies from the following projects: UK Biobank (request pending), All of Us (granted, ongoing).
+unable to obtain variant frequencies from the following projects: UK Biobank (request pending), All
+of Us (granted, ongoing).
 </p>
 
 <p>
 The following projects were added:
 <ul>
     <li>
         <b><a href="https://rgc-mcps.regeneron.com/home"
         target="_blank">Mexico City Prospective Study (MCPS)</a></b>:
         9,950 whole genome sequenced individuals and 141,046 exome sequenced and genotyped
         individuals from the Mexico City Prospective Study (MCPS), a collaboration between the
         Regeneron Genetics Center, University of Oxford, Universidad Nacional Aut&oacute;noma de
         M&eacute;xico (UNAM), National Institute of Genomic Medicine in Mexico, Abbvie Inc. and
         AstraZeneca UK. For details see (Ziyatdinov A, Nature 2023) in the reference section.
     </li>
 
@@ -72,48 +73,53 @@
         The NCBI ALlele Frequency Aggregator pipeline computes allele frequencies from
         approved, unrestricted dbGaP studies and makes them publicly available through
         dbSNP. Its goal is to release frequency data from over one million dbGaP
         subjects to aid discoveries involving common and rare variants with biological
         or disease relevance. The R4 release includes 408,709 subjects and allele
         frequencies for 15.5 million rs sites, including nearly one million ClinVar
         variants. Genotype and associated individual-level data are accessible through dbGaP
         <a href="https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login"
         target="_blank">authorized access</a>.
     </li>
 
     <li>
         <b><a href="https://www.finngen.fi/en" target="_blank">FinnGen</a></b>:
         Imputed variants from 500,348 Biobank samples obtained using genotyping arrays
         in Finland, 10% of the population. The imputation used phased variants obtained from 8,554
-        high-quality whole genome sequences, also from Finland. For details, see (Kurki et al, Nature 2023).
+        high-quality whole genome sequences, also from Finland. For details, see (Kurki et al,
+        Nature 2023).
         Phenotype links can be shown at <a href="https://r12.finngen.fi/">FinnGen PheWeb</a>.
     </li>
 
     <li>
         <b><a href="https://swefreq.nbis.se/dataset/SweGen" target="_blank">SweGen</a></b>:
-        Whole-genome sequencing variant frequencies for 1000 Swedish individuals generated within the SweGen project.
+        Whole-genome sequencing variant frequencies for 1000 Swedish individuals generated within
+        the SweGen project.
         The 1000 individuals included in the SweGen project represent a
         cross-section of the Swedish population, and no disease information
         was used for the selection. The frequency data may therefore
         include genetic variants that are associated with, or causative of,
         disease. SweGen also provides SV calls, TEs, MELT results for TEs, HLAs and new sequence.
         For details, see (Ameur et al, Eur J Hum Genet 2017).
-        Dataset can be browsed at the <a href="https://swefreq.nbis.se/dataset/SweGen/browser">SweGen Browser</a>.
+        Dataset can be browsed at the
+        <a href="https://swefreq.nbis.se/dataset/SweGen/browser">SweGen Browser</a>.
     </li>
 
     <li>
-        <b><a href="https://jmorp.megabank.tohoku.ac.jp/downloads" target="_blank">JPN To61k Japan Tohoku University Tohoku Medical Megabank Organization 61k Allele frequency panel (JPN 61k)</a></b>:
+        <b><a href="https://jmorp.megabank.tohoku.ac.jp/downloads" target="_blank">JPN To61k Japan
+        Tohoku University Tohoku Medical Megabank Organization 61k Allele frequency panel
+        (JPN 61k)</a></b>:
         An allele frequency panel based on short-read WGS analysis of 61,000 Japanese individuals.
         The project includes other datatypes, such as STRs, long-read SVs and short-read CNVs.
         Data can be downloaded from the <a href="https://jmorp.megabank.tohoku.ac.jp"
         target="_blank">jMorp Website</a>, specifically the
         <a href="https://jmorp.megabank.tohoku.ac.jp/downloads" target="_blank">Downloads</a>
         section. For details, see (Tadaka et al, NAR 2023).
     </li>
 
     <li>
         <b><a href="https://abraom.ib.usp.br/"
         target="_blank">Brazil Arquivo Brasileiro Online de Muta&ccedil;&otilde;es (ABraOM)</a></b>:
         Genomic variants obtained with whole-genome sequencing from SABE, a
         census-based sample of elderly individuals from S&atilde;o Paulo, Brazil's
         largest city. The Brazilian population is constituted by ~500 years of
         admixture between Africans, Europeans, and Native Americans.
@@ -137,39 +143,39 @@
     <li>
         <b><a href="https://www.kobic.re.kr/kova/"
         target="_blank">Korean Variant Archive (KOVA)</a></b>:
         1,896 whole genome sequencing and 3,409 whole exome sequencing data from healthy individuals
         of Korean ethnicity.
         Most of the samples originated from normal tissue of cancer
         patients (40.16 %), healthy parents of rare disease patients (28.4 %),
         or healthy volunteers (31.44 %). Japanese ancestry is broken down
         in the INFO field. Coverage 100x for WES, 30x for WGS. SVs called with Manta
         are also available. For details see (Lee et al, Exp Mol Med 2022).</li>
     <li>
         <b><a href="https://www.npm.sg/"
         target="_blank">NPM Singapore</a></b>:
         9,770 whole genomes, mostly of Chinese, Indian and Malay ancestry. 
         A minimum allele count cutoff of &gt; 5 was applied.
-        Data is available for download from the CHORUS browser, see "Data access" below.
+        Data is available for download from the CHORUS browser, see &quot;Data access&quot; below.
         For details see (Wong et al, Nat Genetics 2023). CNV data is also available there.
     </li>
     <li>
         <b><a href="https://www.vision2030.gov.sa/en/explore/projects/the-saudi-genome-program"
         target="_blank">Saudi Genome Program</a></b>:
         Variant frequencies from 302 whole genomes at 30x coverage, on Saudi Genome Program Samples.
-        The genotyping data and imputations from 3,352 individuals do not seem to be available publicly.
-        For details see (Malomane et al 2025). 
+        The genotyping data and imputations from 3,352 individuals do not seem to be available
+        publicly. For details see (Malomane et al 2025). 
     </li>
 </ul>
 </p>
 
 <h2>Display Conventions</h2>
 
 <p>Most tracks only show the variant and allele frequencies on mouseover or clicks.
 When zoomed in, tracks display alleles with base-specific coloring. Homozygote
 data are shown as one letter, while heterozygotes will be displayed with both
 letters.
 </p>
 
 <p>
 For <b>NCBI ALFA:</b> This track has no single VCF with INFO fields, but uses multiple subtracks
 instead, one per ancestry.
@@ -195,163 +201,215 @@
 <b>MCPS:</b> VCFs with summarized allele frequencies are available from
 the <a target="_blank" href="https://rgc-mcps.regeneron.com/">MCPS website</a>.
 </p>
 <p>
 <b>Regeneron one million exomes:</b> VCFs with summarized allele frequencies are available from
 the <a target="_blank" href="https://rgc-research.regeneron.com/me/resources">RGC ME website</a>.
 </p>
 <p>
 <b>TOPMED:</b> VCFs with summarized allele frequencies are available from
 the <a target="_blank" href="https://bravo.sph.umich.edu/">TOPMED BRAVO website</a>. They require a
 login.
 </p>
 <p>
 <b>SFARI SPARK:</b> Allele frequencies can be displayed on the
         <a href="https://genomes.sfari.org/" target=_blank>SFARI Genome Browser</a>.
-        Full CRAMs and VCFs with genotypes are available from <a target="_blank" href="https://base.sfari.org/">SFARI Base</a>. 
-They require a data access request, which is usually reviewed quickly. More information is available in the 
-<a href="https://cohorts-cdn.simonsfoundation.org/spark/researcher_packets/SPARK_SFARI_Researcher_Welcome_Packet.pdf" target=_blank>SPARK Welcome Packet</a>.
+        Full CRAMs and VCFs with genotypes are available from <a target="_blank"
+        href="https://base.sfari.org/">SFARI Base</a>. 
+They require a data access request, which is usually reviewed quickly. More information is available
+in the <a href="https://cohorts-cdn.simonsfoundation.org/spark/researcher_packets/SPARK_SFARI_Researcher_Welcome_Packet.pdf"
+target=_blank>SPARK Welcome Packet</a>.
 </p>
 <p>
 <b>GenomeAsia Pilot:</b> VCFs are available from UCSC and also from
 the <a target="_blank"
 href="https://browser.genomeasia100k.org/#tid=download">GenomeAsia 100K website</a>.
 No license nor login.
 </p>
 
 <p><b>KOVA:</b> 
         TSV data can be requested on the <a href="https://www.kobic.re.kr/kova/downloads"
         target="_blank">KOVA Downloads</a> website. 
 </p>
 
-<p><b>Finngen:</b> TSV data can be requested via the form at https://finngen.gitbook.io/documentation/data-download which triggers an email with the download link.</p>
+<p><b>Finngen:</b> TSV data can be requested via the form at
+https://finngen.gitbook.io/documentation/data-download which triggers an email with the download
+link.</p>
 
-<p><b>SweGen:</b> We are not allowed to redistribute the VCF, you can request it at <a target=_blank href="https://swefreq.nbis.se/dataset/SweGen">SweGen</a>, alongside the VCF file. </p>
+<p><b>SweGen:</b> We are not allowed to redistribute the VCF; you can request it at
+<a target="_blank" href="https://swefreq.nbis.se/dataset/SweGen">SweGen</a>, alongside the VCF file.
+</p>
 
 <p><b>NPM:</b> 
    VCF access can be requested on the 
-        <a href="https://chorus.grids-platform.io/" target="_blank">Chorus Browser</a> website, which requires an 
-        <a href = "https://npm.a-star.edu.sg/" target=_blank>account and data access request</a>. 
+   <a href="https://chorus.grids-platform.io/" target="_blank">Chorus Browser</a> website, which
+   requires an <a href = "https://npm.a-star.edu.sg/" target=_blank>account and data access
+   request</a>. 
 </p>
 
 <h2>Methods</h2>
 <p>
 <b>MXB:</b> Genotyping was performed with the Illumina Multi-Ethnic Global Array
 (MEGA, ~1.8M SNPs), optimized for admixed populations and enriched for
 ancestry-informative and medically relevant variants. Only autosomal, biallelic
 SNPs passing quality control are included. Samples were selected from 898
 recruitment sites, with prioritization of indigenous language speakers. Data
 processing included GenomeStudio &rarr; PLINK conversion, strand alignment, removal
 of duplicates, update of map positions using dbSNP Build 151 and low-quality
 variants/individuals, and relatedness filtering.
 </p>
 <p>
 <b>SGDP:</b> The version used was
 <a target="_blank" href="https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/"
 >https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/</a>,
 merged with bcftools and lifted to hg38 with CrossMap. 
 </p>
 <p>
-<b>KOVA:</b>Raw reads were aligned to the GRCh38+decoy reference using BWA-MEM v0.7.17 with default parameters, followed by duplicate marking and coordinate sorting with MarkDuplicatesSpark, and base quality score recalibration using BQSRPipelineSpark in GATK v4.1.3.0; mapping quality control metrics were generated with Qualimap v2.2.1. Single-nucleotide variants and small insertions/deletions were called per sample using GATK HaplotypeCaller in GVCF mode (-ERC GVCF), and joint genotyping was performed by creating a GenomicsDB with GenomicsDBImport and following GATK Best Practices, including variant quality score recalibration (VQSR) retaining 99.7% of true SNVs and 99.0% of true indels based on training sets (workflow detailed in Supplementary Fig. 1). Downstream analyses followed a modified version of the gnomAD quality-control framework and were primarily conducted using Hail, an open-source Python library for large-scale genome analysis; after merging WES and WGS data in Hail, multiallelic variants and variants with genotype quality <20, read depth <10, allelic balance <0.2, or overlapping low-complexity regions were excluded (Supplementary Fig. 2).
+<b>KOVA:</b> Raw reads were aligned to the GRCh38+decoy reference using BWA-MEM v0.7.17 with default
+parameters, followed by duplicate marking and coordinate sorting with MarkDuplicatesSpark, and base
+quality score recalibration using BQSRPipelineSpark in GATK v4.1.3.0; mapping quality control
+metrics were generated with Qualimap v2.2.1. Single-nucleotide variants and small
+insertions/deletions were called per sample using GATK HaplotypeCaller in GVCF mode (-ERC GVCF), and
+joint genotyping was performed by creating a GenomicsDB with GenomicsDBImport and following GATK
+Best Practices, including variant quality score recalibration (VQSR) retaining 99.7% of true SNVs
+and 99.0% of true indels based on training sets (workflow detailed in Supplementary Fig. 1).
+Downstream analyses followed a modified version of the gnomAD quality-control framework and were
+primarily conducted using Hail, an open-source Python library for large-scale genome analysis; after
+merging WES and WGS data in Hail, multiallelic variants and variants with genotype quality &lt;20,
+read depth &lt;10, allelic balance &lt;0.2, or overlapping low-complexity regions were excluded
+(Supplementary Fig. 2).
 <br>
 At UCSC, V7 of the TSV.gz was obtained from the KOVA staff by email and converted to VCF. It is not
 available for download from our site but can be requested from the KOVA website.
 </p>
 
 <p>
 <b>ABraOM:</b> For Academic use only. Licensing for commercial use might be available under request and agreement.
 By using this resource you agree to cite the flagship paper (Naslavsky et al. Nat Comm 2022).
-Whole-genome sequencing was performed at Human Longevity Inc. using TruSeq Nano DNA HT libraries sequenced on Illumina HiSeqX instruments with 150 bp paired-end reads targeting 30× coverage, and reads were mapped to GRCh38 using ISIS software. Sample sex was validated by comparing CPMs of X chromosome and male-specific Y (MSY) reads relative to autosomes, yielding the expected female (~55,000 X CPM, <200 MSY CPM) and male (~27,500 X CPM, >550 MSY CPM) patterns. Germline SNVs and indels were called following GATK Best Practices (GATK v3.7) via per-sample GVCFs (HaplotypeCaller), joint genotyping (CombineGVCFs, GenotypeGVCFs), and Variant Quality Score Recalibration (VQSR-AS); multiallelic variants were split with an in-house script, left-aligned with BCFtools, and annotated using Annovar and custom scripts against dbSNP, 1000 Genomes, and gnomAD, with putative loss-of-function variants identified using LOFTEE v0.3-beta irrespective of confidence labels. Variant and genotype quality was further assessed using the in-house CEGH-Filter two-step algorithm based on depth and allele balance, and analyses retained only GATK VQSR-AS PASS variants and higher-confidence CEGH-Filter calls. Relatedness was assessed using KING and PC-Relate (GENESIS), retaining a single proband per related pair and excluding one contaminated sample (>3% by verifyBAMID), resulting in a final dataset of 1,171 unrelated individuals. Final samples achieved mean coverages ranging from 31.3× to 64.8×, with an average of 38.65× and a median of 36.6×.
+Whole-genome sequencing was performed at Human Longevity Inc. using TruSeq Nano DNA HT libraries
+sequenced on Illumina HiSeqX instruments with 150 bp paired-end reads targeting 30x coverage, and
+reads were mapped to GRCh38 using ISIS software. Sample sex was validated by comparing CPMs of X
+chromosome and male-specific Y (MSY) reads relative to autosomes, yielding the expected female
+(~55,000 X CPM, &lt;200 MSY CPM) and male (~27,500 X CPM, &gt;550 MSY CPM) patterns. Germline SNVs
+and indels were called following GATK Best Practices (GATK v3.7) via per-sample GVCFs
+(HaplotypeCaller), joint genotyping (CombineGVCFs, GenotypeGVCFs), and Variant Quality Score
+Recalibration (VQSR-AS); multiallelic variants were split with an in-house script, left-aligned with
+BCFtools, and annotated using Annovar and custom scripts against dbSNP, 1000 Genomes, and gnomAD,
+with putative loss-of-function variants identified using LOFTEE v0.3-beta irrespective of confidence
+labels. Variant and genotype quality was further assessed using the in-house CEGH-Filter two-step
+algorithm based on depth and allele balance, and analyses retained only GATK VQSR-AS PASS variants
+and higher-confidence CEGH-Filter calls. Relatedness was assessed using KING and PC-Relate
+(GENESIS), retaining a single proband per related pair and excluding one contaminated sample
+(&gt;3% by verifyBAMID), resulting in a final dataset of 1,171 unrelated individuals. Final samples
+achieved mean coverages ranging from 31.3x to 64.8x, with an average of 38.65x and a median of
+36.6x.
 </p>
 
 <p><b>SFARI SPARK:</b> The genome browser track project was approved by the Simons 
 Foundation as 14584.1. WES and WGS Data were downloaded from 
 <a href="https://base.sfari.org/" target="_blank">SFARI Base</a>.
-pVCFs were downloaded, anonymized with a script using bcftools and the fill-tags plugin and normalized,
-without a minimum allele frequency cutoff.<br>
+pVCFs were downloaded, anonymized with a script using bcftools and the fill-tags plugin and
+normalized, without a minimum allele frequency cutoff.<br>
 The methods are documented as follows by SFARI:<br>
+<ul>
+  <li>
     <b>WES</b>:
     This release consists of sequence and variant call data for 12,519
     unique individuals, of which 12,517 (99.98%) have available genome-wide
     SNP genotype data. Sequencing and genotyping of all samples in this
     release was performed at New York Genome Center (NYGC). DNA from saliva
     samples were extracted and prepared with PCR-free methods and sequenced
     with paired-end sequencing of 150 bases on the Illumina NovaSeq 6000
     system.  Alignment of reads to the human reference genome version
     GRCh38, duplicate read marking, and Base Quality Score Recalibration
     (BQSR) were performed by New York Genome Center (NYGC). Whole-genome
     sequencing data were processed using a standardized, functionally
     equivalent CCDG pipeline with alignment to the GRCh38DH (1000 Genomes)
     reference using BWA-MEM v0.7.15 (deterministic settings, no -M, use of
-.alt contigs), Picard-equivalent duplicate marking (Picard ≥2.4.1 or
+    .alt contigs), Picard-equivalent duplicate marking (Picard &ge;2.4.1 or
     equivalent), no indel realignment, and base quality score recalibration
     with GATK (dbSNP138, Mills and 1000G gold-standard indels, known
     indels).  Final outputs were stored as lossless CRAM files with
     complete SAM-compliant read-group annotations and mandatory 4-bin
-base-quality compression (Q2–6, 10, 20, 30), and all implementations
+    base-quality compression (Q2&ndash;6, 10, 20, 30), and all implementations
     were validated for functional equivalence across centers before use.
-Variant Calling was performed using DeepVariant. See <a href="https://github.com/CCDG/Pipeline-Standardization/blob/master/PipelineStandard.md" target=_blank>CCDG pipeline details</a>.<br>
-
+    Variant Calling was performed using DeepVariant. See
+    <a href="https://github.com/CCDG/Pipeline-Standardization/blob/master/PipelineStandard.md"
+    target="_blank">CCDG pipeline details</a>.<br>
+  </li>
+  <li>
     <b>WGS</b>: This release contains
     sequence data for 142,357 individuals and genotyping data for
     141,368 individuals. DNA was sequenced from saliva for all
     samples and all participants consented to having their genetic
     data shared by Regeneron. Exomes for all samples were sequenced with
     short-read, paired-end sequencing of 150 bases on Illumina
     NovaSeq 6000 machines using S2/S4 flow cells. Sequencing and
     genotyping was performed across nine batches (WES1 through
     WES9) at the Regeneron Genetics Center (RGC) and integrated
     together for this data release. All sequencing batches were
     processed using the same DNA extraction methods and sequencing
     machines, however two different exome capture panels were used,
     as described below. Genotyping was performed using a SNP
     genotyping array for WES1 through WES4 and using
-“genotyping-by-sequencing” (GxS) for WES5 through WES9.  The
+    &quot;genotyping-by-sequencing&quot; (GxS) for WES5 through WES9.  The
     first four sequencing batches were sequenced at Regeneron using
     custom NEB/Kapa reagents with the IDT (Integrated DNA
     Technologies) xGen capture platform, including custom exome
     capture regions. Samples starting with batch WES5 were
     sequenced using the Twist Bioscience Human
     Comprehensive Exome panel, combined with spike-ins for
     sequencing genotyping sites (see Genotyping Methods), the full
     mitochondrial genome, and coverage boosted at selected sites
     for assaying clonal hematopoiesis of indeterminate potential
     (CHIP).  SFARI performed SNV/indel calling via DeepVariant and
     GATK to generate gVCFs, pairwise relatedness inferred using
-PLINK v1.9 IBD estimates from common SNPs (AF ≥ 0.01, dbSNP
-v151) with ≥15% relatedness flagged, and comprehensive
+    PLINK v1.9 IBD estimates from common SNPs (AF &ge; 0.01, dbSNP
+    v151) with &ge;15% relatedness flagged, and comprehensive
     individual- and family-level quality control executed using the
     internal GenomeCheckMate pipeline to exclude samples based on
-contamination (≥5%), insufficient coverage (<20× in <80% of
+    contamination (&ge;5%), insufficient coverage (&lt;20x in &lt;80% of
     targets), sex discordance, pedigree/IBD inconsistencies,
     unregistered relationships, unexpected duplicates, or excess
     relatedness, after which QC-passing individuals (selecting the
     most recent passing sample per person) were retained for
     variant calling and joint genotyping.
-</p>
-
+    </p></li>
+</ul>
 
 <p><b>Finngen:</b> R12 annotated variants were downloaded from the Google Cloud
 bucket link received through an email after filling out the form linked from
 https://finngen.gitbook.io/documentation/data-download and converted to VCF
 with a <a
 href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/finngen_to_vcf.py"
-target=_blank>custom Python script</a>. </p>
-
-<p><b>SweGen:</b> Fragment size 350bp on a Covaris E220. Paired-end sequencing with 150 bp read length was performed on Illumina HiSeq X (HiSeq Control Software 3.3.39/RTA 2.7.1) with v2.5 sequencing chemistry. Raw whole-genome reads were aligned to the GRCh37 reference using BWA-MEM v0.7.12, then sorted and indexed with samtools v0.1.19 and assessed with qualimap v2.2.20; per-sample alignments from multiple lanes and flow cells were merged using Picard MergeSamFiles v1.120. Processing followed GATK best practices with GATK v3.3, including indel realignment (RealignerTargetCreator, IndelRealigner), duplicate marking (Picard MarkDuplicates v1.120), and base quality score recalibration (BaseRecalibrator), producing one finalized BAM per sample. Per-sample gVCFs were generated with GATK HaplotypeCaller v3.3 using reference files from the GATK v2.8 resource bundle, with all steps coordinated via Piper v1.4.0. Joint genotyping of 1,000 samples was performed by merging gVCFs in five batches of 200 using GATK CombineGVCFs, followed by cohort genotyping with GATK GenotypeGVCFs and variant quality score recalibration for SNVs and indels using VariantRecalibrator and ApplyRecalibration.
+target="_blank">custom Python script</a>. </p>
+
+<p><b>SweGen:</b> Fragment size 350bp on a Covaris E220. Paired-end sequencing with 150bp read
+length was performed on Illumina HiSeq X (HiSeq Control Software 3.3.39/RTA 2.7.1) with v2.5
+sequencing chemistry. Raw whole-genome reads were aligned to the GRCh37 reference using BWA-MEM
+v0.7.12, then sorted and indexed with samtools v0.1.19 and assessed with qualimap v2.2.20;
+per-sample alignments from multiple lanes and flow cells were merged using Picard MergeSamFiles
+v1.120. Processing followed GATK best practices with GATK v3.3, including indel realignment
+(RealignerTargetCreator, IndelRealigner), duplicate marking (Picard MarkDuplicates v1.120), and
+base quality score recalibration (BaseRecalibrator), producing one finalized BAM per sample.
+Per-sample gVCFs were generated with GATK HaplotypeCaller v3.3 using reference files from the GATK
+v2.8 resource bundle, with all steps coordinated via Piper v1.4.0. Joint genotyping of 1,000 samples
+was performed by merging gVCFs in five batches of 200 using GATK CombineGVCFs, followed by cohort
+genotyping with GATK GenotypeGVCFs and variant quality score recalibration for SNVs and indels using
+VariantRecalibrator and ApplyRecalibration.
 <BR>At UCSC, the hg38 VCF was downloaded 
-from <a target=_blank href="https://swefreq.nbis.se/dataset/SweGen/download">SweFreq</a>.
+from <a target="_blank" href="https://swefreq.nbis.se/dataset/SweGen/download">SweFreq</a>.
 </p>
 
 <p><b>NPM Singapore:</b> Whole Genome Sequencing (WGS) data processing followed
 GATK4 best practices. GATK4 germline variant analysis workflow written in WDL
 was adapted to use Nextflow and deployed at the National Supercomputing Centre,
 Singapore (NSCC). In short, WGS reads were aligned against GRCh38 using the
 BWA-MEM algorithm and used as input to GATK HaplotypeCaller to produce single
 sample gVCFs. The gVCF files were joint-called, then loaded into Hail, an
 open-source Python-based data analysis library suited to working with
 population-scale genomic data collections. Low-quality WGS libraries and
 low-quality variants were removed.  QC-ed variants were functionally annotated
 using Ensembl Variant Effect Predictor (VEP) (version 95). Functional
 annotations for variants impacting protein-coding regions were also complemented with
 information on the potential alteration to their cognate protein's 3D structure
 and drug binding ability.
@@ -395,38 +453,39 @@
 <p>
 <b>SGDP:</b> This project was funded by the Simons Foundation. Thanks to David Reich and Swapan 
 Mallick for help with importing the data.
 </p>
 <p>
 <b>KOVA:</b> Thanks to Insu Jang and the KOVA director for providing variant frequencies in TSV
 format.
 </p>
 <p>
 <b>Finngen:</b> We want to acknowledge the participants and investigators of the FinnGen study.
 </p>
 
 <p>
 <b>SweGen:</b> The SweGen allele frequency data was generated by Science for
 Life Laboratory. The data may be redistributed in original or modified form,
-but must always be distributed together with the file "terms_of_use.txt" that
+but must always be distributed together with the file &quot;terms_of_use.txt&quot; that
 is stored together with the data on our download server, and any redistributed
 data derived from the SweGen data set must follow those terms and conditions.
 The data may not be used to attempt to identify any individual in this or other studies.
 </p>
 
 <p>
-<b>NPM Singapore:</b> Thanks to the NPM Data Access Committee and Eleanor for granting our data request. 
+<b>NPM Singapore:</b> Thanks to the NPM Data Access Committee and Eleanor for granting our data
+request. 
 By browsing the data, you agree to use the data only for academic, non-commercial
 research to improve human health (biology/disease).  We request all data users
 agree to protect the
 confidentiality of the data subjects in any research papers or publications
 that they may prepare, by taking all reasonable care to limit the possibility
 of identification. In particular, the data users shall not use, or attempt
 to use, the data to deliberately compromise or otherwise infringe the
 confidentiality of information on data subjects and their right to privacy.
 If you use any of the data obtained from the CHORUS variant browser, we request
 that you cite the NPM flagship paper (Wong et al, 2023). All data users of the
 data must take note that the data provider and relevant SG10K_Health cohort
 owners bear no responsibility for the further analysis or interpretation of the
 data.  </p>
 
 <p>Thanks to Alex Ioannidis, UCSC, and Andreas Lahner, MGZ, for feedback on this track.</p>
@@ -551,61 +610,60 @@
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/27654912" target="_blank">27654912</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5161557/" target="_blank">PMC5161557</a>
 </p>
 
 <p>
 Lee J, Lee J, Jeon S, Lee J, Jang I, Yang JO, Park S, Lee B, Choi J, Choi BO <em>et al</em>.
 <a href="https://doi.org/10.1038/s12276-022-00871-4" target="_blank">
 A database of 5305 healthy Korean individuals reveals genetic and clinical implications for an East
 Asian population</a>.
 <em>Exp Mol Med</em>. 2022 Nov;54(11):1862-1871.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/36323850" target="_blank">36323850</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9628380/" target="_blank">PMC9628380</a>
 </p>
 
 <p>
-Kurki MI, Karjalainen J, Palta P, Sipilä TP, Kristiansson K, Donner KM, Reeve MP, Laivuori H,
+Kurki MI, Karjalainen J, Palta P, Sipil&auml; TP, Kristiansson K, Donner KM, Reeve MP, Laivuori H,
 Aavikko M, Kaunisto MA <em>et al</em>.
 <a href="https://doi.org/10.1038/s41586-022-05473-8" target="_blank">
 FinnGen provides genetic insights from a well-phenotyped isolated population</a>.
 <em>Nature</em>. 2023 Jan;613(7944):508-518.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/36653562" target="_blank">36653562</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9849126/" target="_blank">PMC9849126</a>
 </p>
 
 <p>
 Wong E, Bertin N, Hebrard M, Tirado-Magallanes R, Bellis C, Lim WK, Chua CY, Tong PML, Chua R, Mak K
 <em>et al</em>.
 <a href="https://doi.org/10.1038/s41588-022-01274-x" target="_blank">
 The Singapore National Precision Medicine Strategy</a>.
 <em>Nat Genet</em>. 2023 Feb;55(2):178-186.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/36658435" target="_blank">36658435</a>
 </p>
 
-
-
 <p>
 Malomane DK, Williams MP, Huber CD, Mangul S, Abedalthagafi M, Chiang CWK.
 <a href="https://doi.org/10.1101/2025.01.10.632500" target="_blank">
 Patterns of population structure and genetic variation within the Saudi Arabian population</a>.
 <em>bioRxiv</em>. 2025 Jan 13;.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/39868174" target="_blank">39868174</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11761371/" target="_blank">PMC11761371</a>
 </p>
 
 <p>
-Ameur A, Dahlberg J, Olason P, Vezzi F, Karlsson R, Martin M, Viklund J, Kähäri AK, Lundin P, Che H
+Ameur A, Dahlberg J, Olason P, Vezzi F, Karlsson R, Martin M, Viklund J, K&auml;h&auml;ri AK,
+Lundin P, Che H
 <em>et al</em>.
 <a href="https://doi.org/10.1038/ejhg.2017.130" target="_blank">
 SweGen: a whole-genome data resource of genetic variability in a cross-section of the Swedish
 population</a>.
 <em>Eur J Hum Genet</em>. 2017 Nov;25(11):1253-1260.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/28832569" target="_blank">28832569</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5765326/" target="_blank">PMC5765326</a>
 </p>
 
 <p>
 SPARK Consortium. Electronic address: pfeliciano@simonsfoundation.org, SPARK Consortium.
 <a href="https://linkinghub.elsevier.com/retrieve/pii/S0896-6273(18)30018-7" target="_blank">
 SPARK: A US Cohort of 50,000 Families to Accelerate Autism Research</a>.
 <em>Neuron</em>. 2018 Feb 7;97(3):488-493.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/29420931" target="_blank">29420931</a>; PMC: <a