src/hg/makeDb/trackDb/human/varFreqs.html 64a3f9e7813e823cf724ea188c3928a911578286

64a3f9e7813e823cf724ea188c3928a911578286
max
  Thu Jun 4 00:32:22 2026 -0700
varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease
track) with two matched tracks for visual case-vs-background comparison:
  varFreqsAffected   - variants seen in the affected/case arms of disease
                       cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA cases,
                       GREGoR affected, GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control
                       arms of disease cohorts ("all other variants");
                       ~1.5 million individuals
A variant seen in both groups appears in both tracks. Genotyping-array cohorts
stay out of both (varFreqsArray unchanged).
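
The routing described above can be sketched as follows. This is a minimal
illustration, not the code in vcfToBigBed.py: the function names, the dict
layout and the summing of allele counts are hypothetical, and leaving
"unknown" phenotype arms of disease cohorts out of both tracks is an
assumption that the text above does not confirm.

```python
from collections import defaultdict

def route_arm(is_array, is_disease, phenotype):
    """Return the track a cohort arm contributes to, or None for neither."""
    if is_array:
        return None                      # array cohorts stay out of both tracks
    if not is_disease:
        return "varFreqsBackground"      # population reference cohort
    if phenotype == "affected":
        return "varFreqsAffected"
    if phenotype == "unaffected":
        return "varFreqsBackground"
    return None                          # "unknown": skipped here (assumption)

def split_variant(observations):
    """Sum allele counts per track for one variant.

    observations is a list of (arm, allele_count) pairs, where arm is a dict
    with the hypothetical keys is_array, is_disease and phenotype.
    """
    tracks = defaultdict(int)
    for arm, ac in observations:
        track = route_arm(arm["is_array"], arm["is_disease"], arm["phenotype"])
        if track is not None and ac > 0:
            tracks[track] += ac
    return dict(tracks)

# One variant seen in cases, controls, a population cohort and an array cohort.
example = [
    ({"is_array": False, "is_disease": True, "phenotype": "affected"}, 3),
    ({"is_array": False, "is_disease": True, "phenotype": "unaffected"}, 1),
    ({"is_array": False, "is_disease": False, "phenotype": "unknown"}, 10),
    ({"is_array": True, "is_disease": False, "phenotype": "unknown"}, 5),
]
result = split_variant(example)
```

In this sketch the variant lands in both tracks because it is seen in both
groups, and the array observation is dropped.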

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it reads
phenotype tags (affected/unaffected/unknown) from populations.tsv and
is_disease/disease_role from databases.tsv, and derives the length-filter
ranges from the observed data. TOPMed reclassified as a population cohort.
SPARK WGS display name changed to SFARI SPARK WGS for consistency with the
standalone subtracks. Fixed the trackDb mouseOver $-substitution prefix
collision by wrapping fields in ${}. New description pages for both tracks.
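
A small illustration of the prefix collision mentioned above. The field names
AF and AF_bg and both helper functions are hypothetical, and the naive
replace loop only mimics the kind of $-substitution that can go wrong; it is
not the Genome Browser's actual substitution code.

```python
def naive_substitute(template, row):
    """Replace $name tokens one field at a time, in dict order."""
    out = template
    for name, value in row.items():
        out = out.replace("$" + name, str(value))
    return out

def braced_substitute(template, row):
    """Replace ${name} tokens; the closing brace ends the name unambiguously."""
    out = template
    for name, value in row.items():
        out = out.replace("${" + name + "}", str(value))
    return out

row = {"AF": 0.01, "AF_bg": 0.002}

# "$AF" is a prefix of "$AF_bg", so the first replacement corrupts the second.
naive = naive_substitute("case $AF, background $AF_bg", row)

# With braces, "${AF}" no longer matches inside "${AF_bg}".
braced = braced_substitute("case ${AF}, background ${AF_bg}", row)
```

Here naive ends up as "case 0.01, background 0.01_bg", while braced gives
"case 0.01, background 0.002".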

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index bb8288f2744..8a6261da7ed 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -1,90 +1,94 @@
 <h2>Description</h2>
 <p>
-This supertrack collects variant allele frequencies from population-scale sequencing and
-genotyping projects worldwide, from a total of ~1.7 million genomes/exomes/arrays.
+This track collection gathers variant allele frequencies from population-scale sequencing
+and genotyping projects worldwide, from a total of ~1.7 million genomes/exomes/arrays.
 The data was not reprocessed in a harmonized way; the variant VCFs were collected from the
 projects as-is. The goal is a single place to compare how common a variant is across
 different populations, ancestries, and cohorts, for projects that cannot be recomputed by
-gnomAD soon. Three combined tracks aggregate the source data along different lines, and
+gnomAD soon. Three combined tracks (two phenotype-split, one for genotyping arrays) aggregate the source data, and
 there is also one subtrack per project with the original VCF data and all the annotations
 that the project provides. The different projects use different pipelines and sequencing
 technologies. Click any of the projects above or below for a summary of their sample
 selection, sequencing assay and software pipeline. Many projects do not allow us to
 distribute the data, but we document how to request it and provide all converters.
 </p>
 
 <p>
 Data from projects that provide haplotype-phased genotypes can also be found
 elsewhere: 1000 Genomes is also a separate track, and the phased genotypes HGDP, SGDP,
 HGDP+1000 Genomes and Mexico Biobank can also be found in the &quot;Phased Variants&quot; track.
 Their VCF versions below show only the isolate frequency per variant.
 </p>
 
 <p>Please contact us (<A HREF="mailto:&#103;en&#111;&#109;&#101;&#64;&#115;&#111;&#101;.&#117;&#99;s&#99;.&#101;&#100;u">&#103;en&#111;&#109;&#101;&#64;&#115;&#111;&#101;.&#117;&#99;s&#99;.&#101;&#100;u</A><!-- above address is genome at soe.ucsc.edu -->) if you know of a project that we should add. So far,
 we have contacted Regeneron&apos;s Million Exomes and Mexico City Studies (request rejected) and Taiwan Biobank (pending).
 </p>
 
 <h2>Combined Tracks</h2>
 <p>
 Three combined tracks merge variants from the individual subtracks into single bigBed files
 with predicted protein consequences and cross-database filtering. All three use the same
 filter conventions (variant type, consequence, source database, allele frequency, allele
 count, and per-database AF/AC).
 </p>
 <ul>
-  <li><a href="hgTrackUi?g=varFreqsAll"><b>All Databases Combined</b></a> &mdash; 1.34
-      billion variants from 28 sequencing-based cohorts (WGS, WES, long-read). The default
-      summary view of the supertrack. Excludes the genotyping-array cohorts.</li>
-  <li><a href="hgTrackUi?g=varFreqsDisease"><b>Disease-related Databases Combined</b></a>
-      &mdash; 932 million variants from six disease-focused cohorts (SPARK, SFARI WGS,
-      TOPMed, SCHEMA, GREGoR, GA4K), with phenotype-stratified AC/AF where the source
-      provides it.</li>
+  <li><a href="hgTrackUi?g=varFreqsAffected"><b>Affected/Case Individuals</b></a> &mdash;
+      variants seen in the affected or case arm of five disease-study cohorts (SFARI SPARK
+      WES and WGS autism probands, SCHEMA schizophrenia cases, GREGoR affected, GA4K
+      rare-disease). Each variant also carries its frequency in the background, so
+      case-enriched variants can be isolated.</li>
+  <li><a href="hgTrackUi?g=varFreqsBackground"><b>Population + Unaffected</b></a> &mdash;
+      the matched background: variants seen in the population reference cohorts (gnomAD
+      HGDP+1kG, TOPMed, ALFA, HRC and the national WGS projects) and in the
+      unaffected/control arms of the disease cohorts. Showing this together with the
+      Affected track lets you compare case versus background frequency across a gene. Both
+      tracks exclude the genotyping-array cohorts.</li>
   <li><a href="hgTrackUi?g=varFreqsArray"><b>Genotyping Array Databases Combined</b></a>
       &mdash; 14.7 million variants from three array cohorts (TPMI Taiwan, Mexico Biobank,
       UK Biobank imputed). Kept separate because chip data has different per-variant
       confidence than sequencing.</li>
 </ul>
 
 <h3>Available Datasets</h3>
 
 <table class="stdTbl">
 <tr>
   <th>Database</th>
   <th>Region</th>
   <th>N</th>
   <th>Data Type</th>
   <th>Cohort</th>
   <th>Sub-populations</th>
   <th>Downloadable from UCSC</th>
 </tr>
 <tr>
-  <td><a href="hgTrackUi?g=varFreqsAll">All Databases Combined</a></td>
-  <td>Sequencing-based, all below</td>
-  <td>~1.7mil</td>
+  <td><a href="hgTrackUi?g=varFreqsAffected">Affected/Case Individuals</a></td>
+  <td>Sequencing-based disease cohorts</td>
+  <td>~130k</td>
   <td>WGS/WES/long-read</td>
-  <td>1.34B variants</td>
-  <td>Phenotype splits for SPARK, SFARI WGS, GREGoR</td>
+  <td>Affected/case arms of SFARI SPARK WES/WGS, SCHEMA, GREGoR, GA4K</td>
+  <td>Affected/case AF and AC; background AF for contrast</td>
   <td>No</td>
 </tr>
 <tr>
-  <td><a href="hgTrackUi?g=varFreqsDisease">Disease-related Databases Combined</a></td>
-  <td>SPARK, SFARI WGS, TOPMed, SCHEMA, GREGoR, GA4K</td>
-  <td>~300k</td>
+  <td><a href="hgTrackUi?g=varFreqsBackground">Population + Unaffected</a></td>
+  <td>Sequencing-based, population + unaffected</td>
+  <td>~1.5mil</td>
   <td>WGS/WES/long-read</td>
-  <td>932M variants</td>
-  <td>SPARK ASD/Non-ASD, SFARI WGS ASD/Non-ASD, SCHEMA case/control, GREGoR aff/unaff/unknown</td>
+  <td>Population cohorts + unaffected/control arms</td>
+  <td>Background AF and AC; per-cohort and ancestry breakdowns</td>
   <td>No</td>
 </tr>
 <tr>
   <td><a href="hgTrackUi?g=varFreqsArray">Genotyping Array Databases Combined</a></td>
   <td>TPMI, MexBB, UKBB</td>
   <td>~530k</td>
   <td>Array / imputed</td>
   <td>14.7M variants</td>
   <td>&mdash;</td>
   <td>No</td>
 </tr>
 <tr>
   <td><a href="hgTrackUi?g=allofus">AllOfUs v7</a></td>
   <td>USA</td>
   <td>245k</td>
@@ -435,31 +439,31 @@
 multi-sample callset, consequence annotations are recomputed against Ensembl with <code>bcftools csq</code>,
 and the result is converted to bigBed via <code>vcfToBigBed.py</code> + <code>bedToBigBed</code>.
 The mapping from upstream INFO fields to bigBed columns is driven by two configuration files in the
 scripts directory: <code>databases.tsv</code> (one row per source dataset) and
 <code>populations.tsv</code> (per-population AC/AF columns within each source).
 Editing those two files and rerunning <code>mergeAndAnnotate.sh</code> followed by
 <code>vcfToBigBed.py</code> rebuilds the combined track.
 </p>
 
 <h2>Data Access</h2>
 <p>All the data is publicly available. The table above indicates if we are allowed to distribute it in VCF format. Most of the databases do not allow us to redistribute the data files directly from our website, but it can always be downloaded from the original websites in some form. Click the database link in the table above and see the &quot;Data Access&quot; section of the respective track for a description of where to download the data. When the data is freely available from our website, the Data Access section will also indicate the VCF file location on our download server. Because it contains some licensed data, the combined track is not available for download, but can be recreated using the conversion scripts in our <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs" target="_blank">GitHub repository</a> and the accompanying <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt" target="_blank">documentation file</a>.
 </p>
 
 <h2>Credits</h2>
 
-<p>This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click any of the tracks in the list above to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and to Andreas Lahner, MGZ, for feedback.</p>
+<p>This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click any of the tracks in the list above to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the inspiration for this track and to Andreas Lahner, MGZ, for feedback.</p>
 
 <h2>References</h2>
 
 <p>
 All of Us Research Program Genomics Investigators.
 <a href="https://doi.org/10.1038/s41586-023-06957-x" target="_blank">
 Genomic data in the All of Us Research Program</a>.
 <em>Nature</em>. 2024 Mar;627(8003):340-346.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/38374255" target="_blank">38374255</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10937371/" target="_blank">PMC10937371</a>
 </p>
 
 <p>
 Bhattacharyya C, Subramanian K, Uppili B, Biswas NK, Ramdas S, Tallapaka KB, Arvind P, Rupanagudi
 KV, Maitra A, Nagabandi T <em>et al</em>.