src/hg/makeDb/trackDb/human/varFreqsAll.html 64a3f9e7813e823cf724ea188c3928a911578286

64a3f9e7813e823cf724ea188c3928a911578286
max
  Thu Jun 4 00:32:22 2026 -0700
varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease
track) with two matched tracks for visual case-vs-background comparison:
  varFreqsAffected   - variants seen in the affected/case arms of disease
                       cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA cases,
                       GREGoR affected, GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control
                       arms of disease cohorts ("all other variants");
                       ~1.5 million individuals
A variant seen in both groups appears in both tracks. Genotyping-array cohorts
stay out of both (varFreqsArray unchanged).

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it reads
phenotype tags (affected/unaffected/unknown) from populations.tsv and
is_disease/disease_role from databases.tsv, and derives the length-filter
ranges from the observed data. TOPMed reclassified as a population cohort.
SPARK WGS display name changed to SFARI SPARK WGS for consistency with the
standalone subtracks. Fixed the trackDb mouseOver $-substitution prefix
collision by wrapping fields in ${}. New description pages for both tracks.
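
A minimal sketch of the split rule described above, not the actual
vcfToBigBed.py code. The Observation record, its field names, and
split_variants() are hypothetical. Sending unknown-phenotype arms to the
background track is an assumption based on the "all other variants" wording.

```python
from dataclasses import dataclass

AFFECTED = "varFreqsAffected"
BACKGROUND = "varFreqsBackground"


@dataclass(frozen=True)
class Observation:
    """One allele count for a variant in one cohort arm (hypothetical shape)."""
    variant: str            # e.g. "chr1:12345:A:G"
    cohort: str             # source database name
    phenotype: str          # "affected", "unaffected", "unknown", or "" for population cohorts
    is_disease: bool        # True for disease cohorts
    ac: int                 # allele count in this arm
    is_array: bool = False  # genotyping-array cohorts stay out of both tracks


def track_for(obs):
    """Return the track name for one observation, or None if it is excluded."""
    if obs.is_array:
        return None
    if obs.is_disease and obs.phenotype == "affected":
        return AFFECTED
    # Assumption: population cohorts, control arms, and unknown-phenotype
    # arms all count as background.
    return BACKGROUND


def split_variants(observations):
    """Collect the variants seen (AC > 0) in each track.

    A variant observed in both groups ends up in both sets.
    """
    tracks = {AFFECTED: set(), BACKGROUND: set()}
    for obs in observations:
        name = track_for(obs)
        if name is not None and obs.ac > 0:
            tracks[name].add(obs.variant)
    return tracks


if __name__ == "__main__":
    demo = [
        Observation("v1", "SFARI SPARK WES", "affected", True, 3),
        Observation("v1", "gnomAD", "", False, 10),
        Observation("v2", "SCHEMA", "unaffected", True, 1),
    ]
    for name, variants in split_variants(demo).items():
        print(name, sorted(variants))
```

Sets are used so that a variant reported by several cohorts in the same group
is emitted only once per track.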

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsAll.html src/hg/makeDb/trackDb/human/varFreqsAll.html
deleted file mode 100644
index d6ee5fdd42d..00000000000
--- src/hg/makeDb/trackDb/human/varFreqsAll.html
+++ /dev/null
@@ -1,244 +0,0 @@
-<h2>Description</h2>
-<p>
-This track merges variants from 28 sequencing-based variant frequency databases into a
-single bigBed file with predicted protein consequences and cross-database filtering. It
-contains 1.34 billion variants from WGS, WES, and long-read sequencing cohorts worldwide.
-For a summary of all available databases, see the
-<a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> supertrack page.
-</p>
-
-<p>
-Two companion combined tracks split out the cohorts that don't belong in a general
-sequencing-based summary:
-</p>
-<ul>
-  <li><a href="hgTrackUi?g=varFreqsDisease">Disease-related Databases Combined</a> &mdash;
-      932 M variants from six disease-focused cohorts (SPARK, SFARI WGS, TOPMed, SCHEMA,
-      GREGoR, GA4K), with phenotype-stratified AC/AF where the source provides it.</li>
-  <li><a href="hgTrackUi?g=varFreqsArray">Genotyping Array Databases Combined</a> &mdash;
-      14.7 M variants from three array cohorts (TPMI Taiwan, Mexico Biobank, UK Biobank
-      imputed). Kept separate because chip data has different per-variant confidence
-      than sequencing.</li>
-</ul>
-
-<p>
-Each variant is annotated with its predicted consequence on protein-coding genes
-(using <a href="https://samtools.github.io/bcftools/howtos/csq-calling.html"
-target="_blank">bcftools csq</a> with
-<a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a>
-gene models), and colored by severity. Allele counts and frequencies are shown for each
-source database and, where available, broken down by ancestry, population, or phenotype.
-</p>
-
-<h2>Display Conventions</h2>
-
-<h3>Color by Consequence</h3>
-<p>Variants are colored by their most severe predicted consequence:</p>
-<table class="stdTbl">
-<tr><th>Color</th><th>Consequence class</th><th>Examples</th></tr>
-<tr>
-  <td style="background-color: rgb(255,0,0); color: white; text-align: center;"><b>Red</b></td>
-  <td>Protein-truncating / Loss-of-function</td>
-  <td>stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost</td>
-</tr>
-<tr>
-  <td style="background-color: rgb(31,119,180); color: white; text-align: center;"><b>Blue</b></td>
-  <td>Missense / In-frame</td>
-  <td>missense, inframe_insertion, inframe_deletion, protein_altering</td>
-</tr>
-<tr>
-  <td style="background-color: rgb(0,128,0); color: white; text-align: center;"><b>Green</b></td>
-  <td>Synonymous</td>
-  <td>synonymous, stop_retained</td>
-</tr>
-<tr>
-  <td style="background-color: rgb(128,128,128); color: white; text-align: center;"><b>Grey</b></td>
-  <td>Non-coding / Intergenic</td>
-  <td>intron, non_coding, intergenic, UTR</td>
-</tr>
-</table>
-
-<h3>Amino Acid Change Notation</h3>
-<p>
-The &quot;AA change&quot; field uses bcftools csq notation: <b>23I&gt;23V</b> means position
-23 changed from Isoleucine (I) to Valine (V) (missense). <b>23I</b> alone (no arrow)
-means position 23 is Isoleucine and unchanged (synonymous). A &quot;*&quot; indicates a
-stop codon (e.g. 45R&gt;45* is a stop_gained).
-</p>
-
-<h2>Filters</h2>
-<p>
-This track supports extensive filtering via the track settings page. Click on the track
-title or use the &quot;Configure&quot; button to access filters:
-</p>
-
-<h3>Variant Type and Consequence</h3>
-<ul>
-  <li><b>Variant Type</b>: Filter by SNV, Insertion, Deletion, or MNV.</li>
-  <li><b>Consequence</b>: Filter by predicted consequence (Missense, Synonymous, Stop Gained,
-      Frameshift, Splice Donor, Splice Acceptor, Intron, 3' UTR, 5' UTR, Non-coding,
-      Intergenic, Other). The filter uses OR logic across the comma-separated consequence
-      tokens on each variant: a variant tagged
-      <code>stop_gained,frameshift</code> is selected by either the &quot;Stop Gained&quot;
-      or the &quot;Frameshift&quot; filter. The &quot;Other&quot; bucket catches the less
-      common <a href="http://www.sequenceontology.org/" target="_blank">Sequence Ontology</a>
-      consequence terms emitted by <code>bcftools csq</code> that don't fit the named
-      buckets above. Examples include
-      <code>splice_region</code> (variant near a splice site but outside the canonical
-      donor/acceptor),
-      <code>start_lost</code> / <code>stop_lost</code> (variant disrupts the start codon
-      or replaces the stop codon with a coding amino acid),
-      <code>stop_retained</code> (variant changes the stop codon but keeps it a stop),
-      <code>inframe_insertion</code> / <code>inframe_deletion</code> (in-frame indel
-      that adds or removes whole codons), and
-      <code>coding_sequence</code> (CDS variant where the precise impact is undetermined).
-      If you include &quot;Other&quot; in the filter selection, no records will be
-      hidden by the consequence filter.</li>
-</ul>
-
-<p><b>How to find protein-truncating variants:</b> Set the Consequence filter to include
-only &quot;Stop Gained&quot;, &quot;Frameshift&quot;, &quot;Splice Donor&quot;, and
-&quot;Splice Acceptor&quot;. These will appear as red items in the track display.</p>
-
-<h3>Frequency and Count Filters</h3>
-<ul>
-  <li><b>Max Allele Frequency</b>: Filter by the maximum allele frequency observed across
-      all databases. Useful for finding rare variants (e.g., set max to 0.01 for variants
-      with AF &lt; 1% in all databases).</li>
-  <li><b>Total Allele Count</b>: Filter by the sum of allele counts across all databases.
-      Useful for excluding singletons (e.g., set minimum to 2 to remove AC=1 variants
-      that may be sequencing errors).</li>
-  <li><b>Per-database AF and AC</b>: Filter by allele frequency or allele count in any
-      specific database. For example, filter to variants with TOPMed AF &gt; 0.001.</li>
-</ul>
-
-<h3>Source Database</h3>
-<p>
-The <b>Source Database</b> filter lets you restrict to variants present in specific databases.
-For example, select only &quot;GREGoR&quot; to see variants found in the rare disease cohort.
-This filter uses OR logic: selecting multiple databases shows variants found in
-<em>any</em> of the selected databases.
-</p>
-
-<h3>Population- and Phenotype-Specific Filters</h3>
-<p>
-Several databases provide ancestry-specific allele frequencies:
-</p>
-<ul>
-  <li><b>AllOfUs</b>: African, Indigenous American, East Asian, European, Oceanian, South
-      Asian (from local ancestry inference)</li>
-  <li><b>GenomeAsia</b>: Northeast Asian, Southeast Asian, South Asian, Oceanian, American,
-      African, Western European Reference</li>
-  <li><b>gnomAD HGDP+1kG</b>: African, Amish, Latino, Ashkenazi Jewish, East Asian, Finnish,
-      Middle Eastern, Non-Finnish European, Other, South Asian</li>
-  <li><b>NPM Singapore</b>: Chinese, Malay, Indian</li>
-  <li><b>WBBC</b>: North Han, Central Han, South Han, Lingnan Han</li>
-</ul>
-<p>
-Three sources also expose phenotype-stratified counts:
-</p>
-<ul>
-  <li><b>SPARK WES</b> and <b>SFARI WGS</b>: ASD proband AC/AF versus non-ASD family
-      member AC/AF.</li>
-  <li><b>GREGoR</b>: Affected, Unaffected, and Unknown disease-status AC/AF.</li>
-</ul>
-<p>
-The <a href="hgTrackUi?g=varFreqsDisease">Disease-related Databases Combined</a>
-track exposes additional phenotype splits for SCHEMA (schizophrenia case vs. control).
-</p>
-
-<h3>Length Filters</h3>
-<ul>
-  <li><b>Reference/Alternate Length</b>: Filter by the length of the reference or alternate allele.</li>
-  <li><b>Length Change</b>: Filter by the size difference between alternate and reference
-      (positive = insertion, negative = deletion, zero = SNV or MNV).</li>
-</ul>
-
-<h2>Methods</h2>
-<p>
-Variant frequency VCF files from 28 sequencing-based databases were stripped of their INFO
-fields (to reduce size), normalized with <code>bcftools norm</code> (splitting multi-allelic
-sites), and merged with <code>bcftools merge</code>. The merged VCF was then annotated with
-predicted protein consequences using <code>bcftools csq</code> with the
-<a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a>
-GRCh38 release 115 gene annotation (GFF3). The same pipeline, run on different subsets of
-source VCFs, produces the
-<a href="hgTrackUi?g=varFreqsDisease">Disease-related Databases Combined</a> and
-<a href="hgTrackUi?g=varFreqsArray">Genotyping Array Databases Combined</a> tracks.
-</p>
-
-<p>
-The annotated VCF was converted to bigBed format using a custom Python script
-(<code>vcfToBigBed.py</code>) that reads frequency data from each source VCF in parallel,
-matches variants by position/ref/alt, and writes a BED file with consequence coloring,
-per-database allele counts and frequencies, and population breakdowns.
-The database configuration (which VCFs to include, field mappings, and population definitions)
-is stored in two TSV files
-(<a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs/databases.tsv"
-target="_blank">databases.tsv</a> and
-<a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs/populations.tsv"
-target="_blank">populations.tsv</a>)
-so that future updates only require editing these files.
-</p>
-
-<p>
-The track's
-<a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt"
-target="_blank">makeDoc file</a> documents how each source VCF was converted.
-Scripts are available from
-<a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs"
-target="_blank">GitHub</a>.
-</p>
-
-<h2>Data Access</h2>
-<p>
-The data can be explored interactively with the
-<a href="../cgi-bin/hgTables">Table Browser</a> or the
-<a href="../cgi-bin/hgIntegrator">Data Integrator</a>.
-For programmatic access, our <a href="https://api.genome.ucsc.edu" target="_blank">REST API</a>
-can be used; the track name is <em>varFreqsAll</em>.
-</p>
-<p>
-Because the merged callset includes data from multiple sources whose redistribution
-licenses differ, the combined bigBed is <b>not available for download</b> from our
-download server. The combined track can be reconstructed from the individual source VCFs
-using the
-<a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs"
-target="_blank">conversion scripts on GitHub</a> together with the
-<a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt"
-target="_blank">build documentation</a>. Where individual source data is downloadable from UCSC,
-the per-subtrack description page indicates the path on our download server.
-</p>
-
-<h2>Credits</h2>
-<p>
-This track is only possible thanks to the data from millions of volunteers around the world,
-who donated blood, signed consent forms and provided health information about themselves and
-sometimes their families. Click on any of the individual tracks in the
-<a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> supertrack to see the specific
-credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track
-and to Andreas Lahner, MGZ, for feedback.
-</p>
-
-<h2>References</h2>
-<p>
-For primary citations of each source dataset, see the References section on the
-<a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> supertrack page. The merged-track
-build itself uses the following tools:
-</p>
-<p>
-Danecek P, McCarthy SA.
-<a href="https://doi.org/10.1093/bioinformatics/btx100" target="_blank">
-BCFtools/csq: haplotype-aware variant consequences</a>.
-<em>Bioinformatics</em>. 2017 Jul 1;33(13):2037-2039.
-PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/28205675" target="_blank">28205675</a>; PMC: <a
-href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870570/" target="_blank">PMC5870570</a>
-</p>
-<p>
-McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F.
-<a href="https://doi.org/10.1186/s13059-016-0974-4" target="_blank">
-The Ensembl Variant Effect Predictor</a>.
-<em>Genome Biol</em>. 2016 Jun 6;17(1):122.
-PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/27268795" target="_blank">27268795</a>; PMC: <a
-href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4893825/" target="_blank">PMC4893825</a>
-</p>