src/hg/makeDb/trackDb/human/varFreqsAffected.html 1259dcfba3a263d92d2602665fd866dc44b47996

1259dcfba3a263d92d2602665fd866dc44b47996
lrnassar
  Sun Jun 21 11:17:10 2026 -0700
Clarify varFreqs description page wording per code review feedback. refs #37733

Reword the default_an sentence in the Pooled allele frequency sections of
varFreqsAffected.html and varFreqsBackground.html to explain that cohorts
publishing only AF are pooled via an assigned default_an, with per-arm AC
derived as round(AF * default_an). Change "tokens" to "terms" in the
Consequence filter section of varFreqs.html.

diff --git src/hg/makeDb/trackDb/human/varFreqsAffected.html src/hg/makeDb/trackDb/human/varFreqsAffected.html
index 53bec716f0a..e1cd9bbefab 100644
--- src/hg/makeDb/trackDb/human/varFreqsAffected.html
+++ src/hg/makeDb/trackDb/human/varFreqsAffected.html
@@ -1,147 +1,149 @@
 <h2>Description</h2>
 <p>
 This track shows small variants (SNVs and short indels) that were observed in
 <b>affected or case individuals</b> of disease-study cohorts, annotated with their
 predicted protein consequence and colored by severity. It is one half of a matched pair:
 the companion
 <a href="hgTrackUi?g=varFreqsBackground">Population reference</a> track shows the same
 kind of variants seen in population reference cohorts and in unaffected relatives or
 controls. Displaying the two together lets you compare, for example, how often a
 loss-of-function variant in a gene of interest is seen in affected individuals versus the
 general/unaffected background. For the full list of contributing projects, see the
 <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page.
 </p>
 <p>
 The affected counts are drawn from the affected or case arm of five disease-study cohorts:
 SFARI SPARK WES and SFARI SPARK WGS (autism spectrum disorder probands), SCHEMA
 (schizophrenia cases), GREGoR (affected rare-disease participants), and GA4K (a pediatric
 rare-disease cohort). For SPARK, SFARI WGS, SCHEMA, and GREGoR, the source data carries an
 explicit affected/unaffected (or case/control) label, and only the affected arm feeds this
 track. GA4K reports a single cohort-wide frequency with no per-individual label; because it
 is a rare-disease cohort, it is counted as affected here, with the caveat that it enrolls
 parent-child trios, so a minority of its carriers are unaffected parents. Genotyping-array
 cohorts are not included in either track.
 </p>
 
 <h2>Display Conventions</h2>
 <h3>Color by Consequence</h3>
 <p>Variants are colored by their most severe predicted consequence:</p>
 <table class="stdTbl">
 <tr><th>Color</th><th>Consequence class</th><th>Examples</th></tr>
 <tr><th style="background-color:#FF0000;width:2em">&nbsp;</th>
     <td>Protein-truncating / loss-of-function</td>
     <td>stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost</td></tr>
 <tr><th style="background-color:#1F77B4;width:2em">&nbsp;</th>
     <td>Missense / in-frame</td>
     <td>missense, inframe_insertion, inframe_deletion, protein_altering</td></tr>
 <tr><th style="background-color:#008000;width:2em">&nbsp;</th>
     <td>Synonymous</td>
     <td>synonymous, stop_retained</td></tr>
 <tr><th style="background-color:#808080;width:2em">&nbsp;</th>
     <td>Non-coding / intergenic</td>
     <td>intron, non_coding, intergenic, UTR</td></tr>
 </table>
 <p>
 The score (used for shading) is the pooled affected/case allele frequency times 1000.
 </p>
 
 <h3>Pooled allele frequency</h3>
 <p>
 <b>Affected AF</b> is the pooled rate across contributing affected arms:
 <code>affectedAF = sum(AC) / sum(AN)</code>, where <b>affectedAC</b> sums the allele counts
 and <b>affectedAN</b> sums the allele numbers across each cohort/arm that provides both AC and
 AF (the per-arm AN is derived as <code>round(AC / AF)</code>). Cohorts that publish only AF
-contribute via a configured <code>default_an</code> in the build configuration. Cohorts
+(with no AC or AN of their own) are still pooled by assigning them an assumed allele number,
+set as a <code>default_an</code> in the build configuration; their per-arm AC is then derived
+as <code>round(AF &times; default_an)</code>. Cohorts
 that publish only AC and have no <code>default_an</code> set (currently GREGoR's per-arm
 AC_AFFECTED/UNAFFECTED/UNKNOWN) are listed in <b>affectedCohorts</b> but do not contribute
 to the pool numerator or denominator; their carriers are visible in the per-database AC
 column instead. The pooled rate is preferred over a max-across-cohorts statistic so a
 small cohort with a high local AF cannot dominate the displayed frequency.
 </p>
 
 <h3>Finding case-enriched loss-of-function variants</h3>
 <p>
 To look for protein-truncating variants that are common in affected individuals but rare
 in the background, set the Consequence filter to Stop Gained, Frameshift, Splice Donor and
 Splice Acceptor (these appear red), then add an upper limit on the
 <b>Background AF</b> filter. Each variant here carries both its affected frequency and its
 background frequency, so this isolates variants seen in cases with little or no presence in
 the population/unaffected set. Comparing visually against the
 <a href="hgTrackUi?g=varFreqsBackground">Population reference</a> track shows the same
 contrast across a whole gene.
 </p>
 
 <h2>Filters</h2>
 <ul>
   <li><b>Variant Type</b> and <b>Consequence</b>: restrict to SNV/insertion/deletion/MNV
       and to predicted consequence classes (the Consequence filter uses OR logic over the
       comma-separated tokens on each variant).</li>
   <li><b>Affected/case AF</b>, <b>AC</b>, <b>AN</b>: the pooled allele frequency
       (sum AC / sum AN), summed allele count, and summed allele number across the
       contributing affected arms. See &quot;Pooled allele frequency&quot; above.</li>
   <li><b>Background AF</b>, <b>AC</b>, <b>AN</b>: the same triple computed across the
       population + unaffected background, for filtering case-enriched variants.</li>
   <li><b>Affected/case cohort</b>: restrict to variants seen in specific disease cohorts
       (for example, only the two autism cohorts).</li>
   <li><b>Reference/Alternate Length</b> and <b>Length Change</b>: filter by allele length.</li>
 </ul>
 
 <h2>Methods</h2>
 <p>
 Variant-frequency VCFs from the contributing cohorts were stripped of unneeded INFO fields,
 normalized with <code>bcftools norm</code> (splitting multi-allelic sites), and merged with
 <code>bcftools merge</code>. The merged callset was annotated with predicted protein
 consequences using <a href="https://samtools.github.io/bcftools/howtos/csq-calling.html"
 target="_blank">bcftools csq</a> against the
 <a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a>
 GRCh38 release 115 gene models.
 </p>
 <p>
 A custom Python script (<code>vcfToBigBed.py</code>) then read the per-cohort allele
 counts and frequencies and, for each variant, pooled the allele counts and allele numbers
 across the affected arms (case/proband subgroups, plus GA4K whole-cohort) to produce this
 track, and across the population cohorts and unaffected/control subgroups to produce the
 companion <a href="hgTrackUi?g=varFreqsBackground">Population reference</a> track. A variant
 seen in both groups appears in both tracks. The build is documented in the
 <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt"
 target="_blank">makeDoc</a>, and the scripts are on
 <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs"
 target="_blank">GitHub</a>.
 </p>
 
 <h2>Data Access</h2>
 <p>
 Because the merged callset combines cohorts whose redistribution licenses differ, this
 track is <b>not available for download</b> and is not in the Table Browser. It can be
 reconstructed from the individual source VCFs using the
 <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs"
 target="_blank">conversion scripts</a> and the
 <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt"
 target="_blank">build documentation</a>. The per-project subtracks on the
 <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page document how to obtain
 each source dataset.
 </p>
 
 <h2>Credits</h2>
 <p>
 This track is only possible thanks to the data from the participants and families of the
 SFARI SPARK, SCHEMA, GREGoR and GA4K studies. Click the individual project subtracks on the
 <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page for the specific credits
 and citations of each cohort. Thanks to Alex Ioannidis, UCSC, for the inspiration for this
 track and to Andreas Lahner, MGZ, for feedback.
 </p>
 
 <h2>References</h2>
 <p>
 For the primary citation of each source cohort, see the References section on the
 <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page. The merged-track build
 uses the following tools:
 </p>
 <p>
 Danecek P, McCarthy SA.
 <a href="https://doi.org/10.1093/bioinformatics/btx100" target="_blank">
 BCFtools/csq: haplotype-aware variant consequences</a>.
 <em>Bioinformatics</em>. 2017 Jul 1;33(13):2037-2039.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/28205675" target="_blank">28205675</a>;
 PMC: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870570/" target="_blank">PMC5870570</a>
 </p>