src/hg/makeDb/trackDb/human/varFreqsAll.html 68c5b3b5dfc4053ff78a6b1d236bd1ac90251cfa

68c5b3b5dfc4053ff78a6b1d236bd1ac90251cfa
lrnassar
  Mon Jun 1 14:40:45 2026 -0700
varFreqs: description pages for the three combined tracks and "SNV" rename
sweep.

Add varFreqsDisease.html and varFreqsArray.html so the two new combined
tracks have full Description/Display/Methods/Data Access/References. Add a
Caveats section on varFreqsArray about chip-data quality vs sequencing.

Update varFreqsAll.html and the supertrack varFreqs.html to reflect the
three-combined-track family (cross-links between siblings, new "Combined
Tracks" section, new table rows, and updated source/variant counts). Add a
GoNL row to the supertrack table.

Sweep 37 subtrack longLabels and four cross-referencing description pages
(colorsDbSnv.html, mei.html, meiSwegen.html, phasedVars.html) from
"Variant Frequencies:" to "SNV Frequencies:" to match the supertrack
shortLabel. refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsAll.html src/hg/makeDb/trackDb/human/varFreqsAll.html
index 9a07af3d6b9..d6ee5fdd42d 100644
--- src/hg/makeDb/trackDb/human/varFreqsAll.html
+++ src/hg/makeDb/trackDb/human/varFreqsAll.html
@@ -1,214 +1,244 @@
 <h2>Description</h2>
 <p>
-This track merges variants from all individual variant frequency databases into a single
-bigBed file with predicted protein consequences and cross-database filtering. It contains
-over 1.1 billion variants from 26 source databases worldwide. For a summary of
-all available databases, see the
-<a href="hgTrackUi?g=varFreqs">Variant Frequencies</a> supertrack page.
+This track merges variants from 28 sequencing-based variant frequency databases into a
+single bigBed file with predicted protein consequences and cross-database filtering. It
+contains 1.34 billion variants from WGS, WES, and long-read sequencing cohorts worldwide.
+For a summary of all available databases, see the
+<a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> supertrack page.
 </p>
 
+<p>
+Two companion combined tracks split out the cohorts that don't belong in a general
+sequencing-based summary:
+</p>
+<ul>
+  <li><a href="hgTrackUi?g=varFreqsDisease">Disease-related Databases Combined</a> &mdash;
+      932 M variants from six disease-focused cohorts (SPARK, SFARI WGS, TOPMed, SCHEMA,
+      GREGoR, GA4K), with phenotype-stratified AC/AF where the source provides it.</li>
+  <li><a href="hgTrackUi?g=varFreqsArray">Genotyping Array Databases Combined</a> &mdash;
+      14.7 M variants from three array cohorts (TPMI Taiwan, Mexico Biobank, UK Biobank
+      imputed). Kept separate because chip data carries different per-variant confidence
+      than sequencing data.</li>
+</ul>
+
 <p>
 Each variant is annotated with its predicted consequence on protein-coding genes
 (using <a href="https://samtools.github.io/bcftools/howtos/csq-calling.html"
 target="_blank">bcftools csq</a> with
 <a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a>
-gene models), and colored by severity.
-Allele counts and frequencies are shown for each source database and, where available,
-broken down by ancestry or population group.
+gene models), and colored by severity. Allele counts and frequencies are shown for each
+source database and, where available, broken down by ancestry, population, or phenotype.
 </p>
 
 <h2>Display Conventions</h2>
 
 <h3>Color by Consequence</h3>
 <p>Variants are colored by their most severe predicted consequence:</p>
 <table class="stdTbl">
 <tr><th>Color</th><th>Consequence class</th><th>Examples</th></tr>
 <tr>
   <td style="background-color: rgb(255,0,0); color: white; text-align: center;"><b>Red</b></td>
   <td>Protein-truncating / Loss-of-function</td>
   <td>stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost</td>
 </tr>
 <tr>
   <td style="background-color: rgb(31,119,180); color: white; text-align: center;"><b>Blue</b></td>
   <td>Missense / In-frame</td>
   <td>missense, inframe_insertion, inframe_deletion, protein_altering</td>
 </tr>
 <tr>
   <td style="background-color: rgb(0,128,0); color: white; text-align: center;"><b>Green</b></td>
   <td>Synonymous</td>
   <td>synonymous, stop_retained</td>
 </tr>
 <tr>
   <td style="background-color: rgb(128,128,128); color: white; text-align: center;"><b>Grey</b></td>
   <td>Non-coding / Intergenic</td>
   <td>intron, non_coding, intergenic, UTR</td>
 </tr>
 </table>
 
 <h3>Amino Acid Change Notation</h3>
 <p>
 The &quot;AA change&quot; field uses bcftools csq notation: <b>23I&gt;23V</b> means position
 23 changed from Isoleucine (I) to Valine (V) (missense). <b>23I</b> alone (no arrow)
 means position 23 is Isoleucine and unchanged (synonymous). A &quot;*&quot; indicates a
 stop codon (e.g. 45R&gt;45* is a stop_gained).
 </p>
 
 <h2>Filters</h2>
 <p>
 This track supports extensive filtering via the track settings page. Click on the track
 title or use the &quot;Configure&quot; button to access filters:
 </p>
 
 <h3>Variant Type and Consequence</h3>
 <ul>
   <li><b>Variant Type</b>: Filter by SNV, Insertion, Deletion, or MNV.</li>
   <li><b>Consequence</b>: Filter by predicted consequence (Missense, Synonymous, Stop Gained,
       Frameshift, Splice Donor, Splice Acceptor, Intron, 3' UTR, 5' UTR, Non-coding,
       Intergenic, Other). The filter uses OR logic across the comma-separated consequence
       tokens on each variant: a variant tagged
       <code>stop_gained,frameshift</code> is selected by either the &quot;Stop Gained&quot;
       or the &quot;Frameshift&quot; filter. The &quot;Other&quot; bucket catches the less
       common <a href="http://www.sequenceontology.org/" target="_blank">Sequence Ontology</a>
       consequence terms emitted by <code>bcftools csq</code> that don't fit the named
       buckets above. Examples include
       <code>splice_region</code> (variant near a splice site but outside the canonical
       donor/acceptor),
       <code>start_lost</code> / <code>stop_lost</code> (variant disrupts the start codon
       or replaces the stop codon with a coding amino acid),
       <code>stop_retained</code> (variant changes the stop codon but keeps it a stop),
       <code>inframe_insertion</code> / <code>inframe_deletion</code> (in-frame indel
       that adds or removes whole codons), and
       <code>coding_sequence</code> (CDS variant where the precise impact is undetermined).
       If you include &quot;Other&quot; in the filter selection, no records will be
       hidden by the consequence filter.</li>
 </ul>
 
 <p><b>How to find protein-truncating variants:</b> Set the Consequence filter to include
 only &quot;Stop Gained&quot;, &quot;Frameshift&quot;, &quot;Splice Donor&quot;, and
 &quot;Splice Acceptor&quot;. These will appear as red items in the track display.</p>
 
 <h3>Frequency and Count Filters</h3>
 <ul>
   <li><b>Max Allele Frequency</b>: Filter by the maximum allele frequency observed across
       all databases. Useful for finding rare variants (e.g., set max to 0.01 for variants
       with AF &lt; 1% in all databases).</li>
   <li><b>Total Allele Count</b>: Filter by the sum of allele counts across all databases.
       Useful for excluding singletons (e.g., set minimum to 2 to remove AC=1 variants
       that may be sequencing errors).</li>
   <li><b>Per-database AF and AC</b>: Filter by allele frequency or allele count in any
       specific database. For example, filter to variants with TOPMed AF &gt; 0.001.</li>
 </ul>
 
 <h3>Source Database</h3>
 <p>
 The <b>Source Database</b> filter lets you restrict to variants present in specific databases.
 For example, select only &quot;GREGoR&quot; to see variants found in the rare disease cohort.
 This filter uses OR logic: selecting multiple databases shows variants found in
 <em>any</em> of the selected databases.
 </p>
 
-<h3>Population-Specific Filters</h3>
+<h3>Population- and Phenotype-Specific Filters</h3>
 <p>
 Several databases provide ancestry-specific allele frequencies:
 </p>
 <ul>
-  <li><b>AllOfUs</b>: African, Indigenous American, East Asian, European, Oceanian, South Asian
-      (from local ancestry inference)</li>
-  <li><b>GenomeAsia</b>: Northeast Asian, Southeast Asian, South Asian</li>
+  <li><b>AllOfUs</b>: African, Indigenous American, East Asian, European, Oceanian, South
+      Asian (from local ancestry inference)</li>
+  <li><b>GenomeAsia</b>: Northeast Asian, Southeast Asian, South Asian, Oceanian, American,
+      African, Western European Reference</li>
   <li><b>gnomAD HGDP+1kG</b>: African, Amish, Latino, Ashkenazi Jewish, East Asian, Finnish,
       Middle Eastern, Non-Finnish European, Other, South Asian</li>
-  <li><b>GREGoR</b>: Affected, Unaffected, Unknown (disease status, not ancestry)</li>
+  <li><b>NPM Singapore</b>: Chinese, Malay, Indian</li>
+  <li><b>WBBC</b>: North Han, Central Han, South Han, Lingnan Han</li>
 </ul>
+<p>
+Three sources also expose phenotype-stratified counts:
+</p>
+<ul>
+  <li><b>SPARK WES</b> and <b>SFARI WGS</b>: ASD proband AC/AF versus non-ASD family
+      member AC/AF.</li>
+  <li><b>GREGoR</b>: Affected, Unaffected, and Unknown disease-status AC/AF.</li>
+</ul>
+<p>
+The <a href="hgTrackUi?g=varFreqsDisease">Disease-related Databases Combined</a>
+track exposes additional phenotype splits for SCHEMA (schizophrenia case vs control).
+</p>
 
 <h3>Length Filters</h3>
 <ul>
   <li><b>Reference/Alternate Length</b>: Filter by the length of the reference or alternate allele.</li>
   <li><b>Length Change</b>: Filter by the size difference between alternate and reference
       (positive = insertion, negative = deletion, zero = SNV or MNV).</li>
 </ul>
 
 <h2>Methods</h2>
 <p>
-Variant frequency VCF files from 26 databases were stripped of their INFO fields
-(to reduce size), normalized with <code>bcftools norm</code> (splitting multi-allelic sites),
-and merged with <code>bcftools merge</code>. The merged VCF was then annotated with predicted
-protein consequences using <code>bcftools csq</code> with the
+Variant frequency VCF files from 28 sequencing-based databases were stripped of their INFO
+fields (to reduce size), normalized with <code>bcftools norm</code> (splitting multi-allelic
+sites), and merged with <code>bcftools merge</code>. The merged VCF was then annotated with
+predicted protein consequences using <code>bcftools csq</code> with the
 <a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a>
-GRCh38 release 115 gene annotation (GFF3).
+GRCh38 release 115 gene annotation (GFF3). The same pipeline, run on different subsets of
+source VCFs, produces the
+<a href="hgTrackUi?g=varFreqsDisease">Disease-related Databases Combined</a> and
+<a href="hgTrackUi?g=varFreqsArray">Genotyping Array Databases Combined</a> tracks.
 </p>
 
 <p>
 The annotated VCF was converted to bigBed format using a custom Python script
 (<code>vcfToBigBed.py</code>) that reads frequency data from each source VCF in parallel,
 matches variants by position/ref/alt, and writes a BED file with consequence coloring,
 per-database allele counts and frequencies, and population breakdowns.
 The database configuration (which VCFs to include, field mappings, and population definitions)
 is stored in two TSV files
 (<a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs/databases.tsv"
 target="_blank">databases.tsv</a> and
 <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs/populations.tsv"
 target="_blank">populations.tsv</a>)
 so that future updates only require editing these files.
 </p>
 
 <p>
 The track's
 <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt"
 target="_blank">makeDoc file</a> documents how each source VCF was converted.
 Scripts are available from
 <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs"
 target="_blank">GitHub</a>.
 </p>
 
 <h2>Data Access</h2>
 <p>
 The data can be explored interactively with the
 <a href="../cgi-bin/hgTables">Table Browser</a> or the
 <a href="../cgi-bin/hgIntegrator">Data Integrator</a>.
 For programmatic access, our <a href="https://api.genome.ucsc.edu" target="_blank">REST API</a>
 can be used; the track name is <em>varFreqsAll</em>.
 </p>
 <p>
 Because the merged callset includes data from multiple sources whose redistribution
 licenses differ, the combined bigBed is <b>not available for download</b> from our
 download server. The combined track can be reconstructed from the individual source VCFs
 using the
 <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs"
 target="_blank">conversion scripts on GitHub</a> together with the
 <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt"
 target="_blank">build documentation</a>. Where individual source data is downloadable from UCSC,
 the per-subtrack description page indicates the path on our download server.
 </p>
 
 <h2>Credits</h2>
 <p>
 This track is only possible thanks to the data from millions of volunteers around the world,
 who donated blood, signed consent forms and provided health information about themselves and
 sometimes their families. Click on any of the individual tracks in the
-<a href="hgTrackUi?g=varFreqs">Variant Frequencies</a> supertrack to see the specific
+<a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> supertrack to see the specific
 credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track
 and to Andreas Lahner, MGZ, for feedback.
 </p>
 
 <h2>References</h2>
 <p>
 For primary citations of each source dataset, see the References section on the
-<a href="hgTrackUi?g=varFreqs">Variant Frequencies</a> supertrack page. The merged-track
+<a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> supertrack page. The merged-track
 build itself uses the following tools:
 </p>
 <p>
 Danecek P, McCarthy SA.
 <a href="https://doi.org/10.1093/bioinformatics/btx100" target="_blank">
 BCFtools/csq: haplotype-aware variant consequences</a>.
 <em>Bioinformatics</em>. 2017 Jul 1;33(13):2037-2039.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/28205675" target="_blank">28205675</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870570/" target="_blank">PMC5870570</a>
 </p>
 <p>
 McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F.
 <a href="https://doi.org/10.1186/s13059-016-0974-4" target="_blank">
 The Ensembl Variant Effect Predictor</a>.
 <em>Genome Biol</em>. 2016 Jun 6;17(1):122.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/27268795" target="_blank">27268795</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4893825/" target="_blank">PMC4893825</a>
 </p>