src/hg/makeDb/trackDb/human/varFreqsAffected.html 64a3f9e7813e823cf724ea188c3928a911578286

64a3f9e7813e823cf724ea188c3928a911578286
max
  Thu Jun 4 00:32:22 2026 -0700
varFreqs: replace All Databases Combined with two phenotype-split tracks

Replace the single varFreqsAll combined track (and drop the varFreqsDisease
track) with two matched tracks for visual case-vs-background comparison:
  varFreqsAffected   - variants seen in the affected/case arms of disease
                       cohorts (SFARI SPARK WES/WGS ASD probands, SCHEMA cases,
                       GREGoR affected, GA4K); ~130,000 individuals
  varFreqsBackground - population reference cohorts + the unaffected/control
                       arms of disease cohorts ("all other variants");
                       ~1.5 million individuals
A variant seen in both groups appears in both tracks. Genotyping-array cohorts
stay out of both (varFreqsArray unchanged).

vcfToBigBed.py gains --split-affected to emit both tracks in one pass; it reads
phenotype tags (affected/unaffected/unknown) from populations.tsv and
is_disease/disease_role from databases.tsv, and derives the length-filter
ranges from the observed data. TOPMed reclassified as a population cohort.
SPARK WGS display name changed to SFARI SPARK WGS for consistency with the
standalone subtracks. Fixed the trackDb mouseOver $-substitution prefix
collision by wrapping fields in ${}. New description pages for both tracks.
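
A rough sketch of the per-variant split, for readers of this message. It is not
the code in vcfToBigBed.py: the CohortObs record, the "group" field and the
function names are hypothetical, and the rounding of the score is an assumption.
It only illustrates the rules stated in the description page: maximum AF and
summed AC per group, score = AF times 1000, and a variant seen in both groups
appearing in both tracks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CohortObs:
    """One cohort's report for a single variant (hypothetical record layout)."""
    cohort: str
    group: str            # "affected" or "background", assumed resolved upstream
    ac: int               # allele count
    af: Optional[float]   # None when the cohort reports counts only


def summarize(obs: List[CohortObs]) -> dict:
    """Collapse one group's observations into max AF, summed AC and a score."""
    afs = [o.af for o in obs if o.af is not None]
    max_af = max(afs) if afs else 0.0
    return {
        "af": max_af,
        "ac": sum(o.ac for o in obs),
        # Score is AF times 1000; rounding and the 1000 cap are assumptions.
        "score": min(1000, int(round(max_af * 1000))),
    }


def split_variant(obs: List[CohortObs]) -> dict:
    """Return one summary per track in which the variant was observed."""
    tracks = {}
    for group in ("affected", "background"):
        members = [o for o in obs if o.group == group]
        if members:  # a variant seen in both groups lands in both tracks
            tracks[group] = summarize(members)
    return tracks
```

A variant reported only by a counts-only cohort gets an allele count but a score
of 0 in this sketch, matching the GREGoR note in the description page.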

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsAffected.html src/hg/makeDb/trackDb/human/varFreqsAffected.html
new file mode 100644
index 00000000000..0099d2fda9d
--- /dev/null
+++ src/hg/makeDb/trackDb/human/varFreqsAffected.html
@@ -0,0 +1,134 @@
+<h2>Description</h2>
+<p>
+This track shows small variants (SNVs and short indels) that were observed in
+<b>affected or case individuals</b> of disease-study cohorts, annotated with their
+predicted protein consequence and colored by severity. It is one half of a matched pair:
+the companion
+<a href="hgTrackUi?g=varFreqsBackground">Population + Unaffected</a> track shows the same
+kind of variants seen in population reference cohorts and in unaffected relatives or
+controls. Displaying the two together lets you compare, for example, how often a
+loss-of-function variant in a gene of interest is seen in affected individuals versus the
+general/unaffected background. For the full list of contributing projects, see the
+<a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page.
+</p>
+<p>
+The affected counts are drawn from the affected or case arm of five disease-study cohorts:
+SFARI SPARK WES and SFARI SPARK WGS (autism spectrum disorder probands), SCHEMA
+(schizophrenia cases), GREGoR (affected rare-disease participants), and GA4K (a pediatric
+rare-disease cohort). For SFARI SPARK WES, SFARI SPARK WGS, SCHEMA, and GREGoR, the source data carries an
+explicit affected/unaffected (or case/control) label and only the affected arm feeds this
+track. GA4K reports a single cohort-wide frequency with no per-individual label; because it
+is a rare-disease cohort it is counted as affected here, with the caveat that it enrolls
+parent-child trios, so a minority of its carriers are unaffected parents. Genotyping-array
+cohorts are not included in either track.
+</p>
+
+<h2>Display Conventions</h2>
+<h3>Color by Consequence</h3>
+<p>Variants are colored by their most severe predicted consequence:</p>
+<table class="stdTbl">
+<tr><th>Color</th><th>Consequence class</th><th>Examples</th></tr>
+<tr><th style="background-color:#FF0000;width:2em">&nbsp;</th>
+    <td>Protein-truncating / loss-of-function</td>
+    <td>stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost</td></tr>
+<tr><th style="background-color:#1F77B4;width:2em">&nbsp;</th>
+    <td>Missense / in-frame</td>
+    <td>missense, inframe_insertion, inframe_deletion, protein_altering</td></tr>
+<tr><th style="background-color:#008000;width:2em">&nbsp;</th>
+    <td>Synonymous</td>
+    <td>synonymous, stop_retained</td></tr>
+<tr><th style="background-color:#808080;width:2em">&nbsp;</th>
+    <td>Non-coding / intergenic</td>
+    <td>intron, non_coding, intergenic, UTR</td></tr>
+</table>
+<p>
+The score (used for shading) is the affected/case allele frequency times 1000. Variants
+contributed only by a cohort that reports allele counts but no allele frequency (GREGoR)
+have a score of 0 but are still drawn in their consequence color.
+</p>
+
+<h3>Finding case-enriched loss-of-function variants</h3>
+<p>
+To look for protein-truncating variants that are common in affected individuals but rare
+in the background, set the Consequence filter to Stop Gained, Frameshift, Splice Donor and
+Splice Acceptor (these appear red), then add an upper limit on the
+<b>Background AF</b> filter. Each variant here carries both its affected frequency and its
+background frequency, so this isolates variants seen in cases with little or no presence in
+the population/unaffected set. Comparing visually against the
+<a href="hgTrackUi?g=varFreqsBackground">Population + Unaffected</a> track shows the same
+contrast across a whole gene.
+</p>
+
+<h2>Filters</h2>
+<ul>
+  <li><b>Variant Type</b> and <b>Consequence</b>: restrict to SNV/insertion/deletion/MNV
+      and to predicted consequence classes (the Consequence filter uses OR logic over the
+      comma-separated tokens on each variant).</li>
+  <li><b>Affected/case AF</b> and <b>Affected/case AC</b>: the maximum allele frequency and
+      summed allele count across the affected arms.</li>
+  <li><b>Background AF</b> and <b>Background AC</b>: the same variant's frequency in the
+      population + unaffected background, for filtering case-enriched variants.</li>
+  <li><b>Affected/case cohort</b>: restrict to variants seen in specific disease cohorts
+      (for example, only the two autism cohorts).</li>
+  <li><b>Reference/Alternate Length</b> and <b>Length Change</b>: filter by allele length.</li>
+</ul>
+
+<h2>Methods</h2>
+<p>
+Variant-frequency VCFs from the contributing cohorts were stripped of unneeded INFO fields,
+normalized with <code>bcftools norm</code> (splitting multi-allelic sites), and merged with
+<code>bcftools merge</code>. The merged callset was annotated with predicted protein
+consequences using <a href="https://samtools.github.io/bcftools/howtos/csq-calling.html"
+target="_blank">bcftools csq</a> against the
+<a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a>
+GRCh38 release 115 gene models.
+</p>
+<p>
+A custom Python script (<code>vcfToBigBed.py</code>) then read the per-cohort allele
+frequencies and counts and, for each variant, took the maximum allele frequency and the summed
+allele count across the affected arms (case/proband subgroups, plus the whole GA4K cohort) to produce this track, and across the
+population cohorts and unaffected/control subgroups to produce the companion
+<a href="hgTrackUi?g=varFreqsBackground">Population + Unaffected</a> track. A variant seen in
+both groups appears in both tracks. The build is documented in the
+<a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt"
+target="_blank">makeDoc</a>, and the scripts are on
+<a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs"
+target="_blank">GitHub</a>.
+</p>
+
+<h2>Data Access</h2>
+<p>
+Because the merged callset combines cohorts whose redistribution licenses differ, this
+track is <b>not available for download</b> and is not in the Table Browser. It can be
+reconstructed from the individual source VCFs using the
+<a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs"
+target="_blank">conversion scripts</a> and the
+<a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt"
+target="_blank">build documentation</a>. The per-project subtracks on the
+<a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page document how to obtain
+each source dataset.
+</p>
+
+<h2>Credits</h2>
+<p>
+This track is only possible thanks to the data from the participants and families of the
+SFARI SPARK, SCHEMA, GREGoR and GA4K studies. Click the individual project subtracks on the
+<a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page for the specific credits
+and citations of each cohort. Thanks to Alex Ioannidis, UCSC, for the inspiration for this
+track and to Andreas Lahner, MGZ, for feedback.
+</p>
+
+<h2>References</h2>
+<p>
+For the primary citation of each source cohort, see the References section on the
+<a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page. The merged-track build
+uses the following tools:
+</p>
+<p>
+Danecek P, McCarthy SA.
+<a href="https://doi.org/10.1093/bioinformatics/btx100" target="_blank">
+BCFtools/csq: haplotype-aware variant consequences</a>.
+<em>Bioinformatics</em>. 2017 Jul 1;33(13):2037-2039.
+PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/28205675" target="_blank">28205675</a>;
+PMC: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870570/" target="_blank">PMC5870570</a>
+</p>
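
Reviewer note: the filter combination described under "Finding case-enriched loss-of-function variants" can be sketched as a simple predicate. This is only an illustration, not Genome Browser code: the function name, its arguments and the default threshold are hypothetical, and the token names are taken from the consequence table in the page. It assumes the Consequence filter matches when any comma-separated token on the variant is one of the selected classes.

```python
# Consequence tokens for the four classes named in the description page.
LOF_TOKENS = {"stop_gained", "frameshift", "splice_donor", "splice_acceptor"}


def is_case_enriched_lof(consequence: str, background_af: float,
                         max_background_af: float = 0.0001) -> bool:
    """True if any consequence token is a selected loss-of-function class
    and the background AF is at or below the chosen upper limit.

    max_background_af is an arbitrary example value, not a recommended cutoff.
    """
    tokens = {t.strip() for t in consequence.split(",")}
    has_lof = bool(tokens & LOF_TOKENS)  # OR logic over the variant's tokens
    return has_lof and background_af <= max_background_af
```

For instance, a variant annotated `stop_gained,missense` with a background AF of 0 passes, while a `missense`-only variant or a frameshift with a background AF of 0.01 does not.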