1259dcfba3a263d92d2602665fd866dc44b47996 lrnassar Sun Jun 21 11:17:10 2026 -0700
Clarify varFreqs description page wording per code review feedback. refs #37733

Reword the default_an sentence in the Pooled allele frequency sections of varFreqsAffected.html and varFreqsBackground.html to explain that cohorts publishing only AF are pooled via an assigned default_an, with per-arm AC derived as round(AF * default_an). Change "tokens" to "terms" in the Consequence filter section of varFreqs.html.

diff --git src/hg/makeDb/trackDb/human/varFreqsAffected.html src/hg/makeDb/trackDb/human/varFreqsAffected.html
index 53bec716f0a..e1cd9bbefab 100644
--- src/hg/makeDb/trackDb/human/varFreqsAffected.html
+++ src/hg/makeDb/trackDb/human/varFreqsAffected.html
@@ -1,147 +1,149 @@
<h2>Description</h2> <p> This track shows small variants (SNVs and short indels) that were observed in <b>affected or case individuals</b> of disease-study cohorts, annotated with their predicted protein consequence and colored by severity. It is one half of a matched pair: the companion <a href="hgTrackUi?g=varFreqsBackground">Population reference</a> track shows the same kind of variants seen in population reference cohorts and in unaffected relatives or controls. Displaying the two together lets you compare, for example, how often a loss-of-function variant in a gene of interest is seen in affected individuals versus the general/unaffected background. For the full list of contributing projects, see the <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page. </p> <p> The affected counts are drawn from the affected or case arm of five disease-study cohorts: SFARI SPARK WES and SFARI SPARK WGS (autism spectrum disorder probands), SCHEMA (schizophrenia cases), GREGoR (affected rare-disease participants), and GA4K (a pediatric rare-disease cohort).
For SPARK, SFARI WGS, SCHEMA, and GREGoR, the source data carries an explicit affected/unaffected (or case/control) label, and only the affected arm feeds this track. GA4K reports a single cohort-wide frequency with no per-individual label; because it is a rare-disease cohort, it is counted as affected here, with the caveat that it enrolls parent-child trios, so a minority of its carriers are unaffected parents. Genotyping-array cohorts are not included in either track. </p> <h2>Display Conventions</h2> <h3>Color by Consequence</h3> <p>Variants are colored by their most severe predicted consequence:</p> <table class="stdTbl"> <tr><th>Color</th><th>Consequence class</th><th>Examples</th></tr> <tr><th style="background-color:#FF0000;width:2em"> </th> <td>Protein-truncating / loss-of-function</td> <td>stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost</td></tr> <tr><th style="background-color:#1F77B4;width:2em"> </th> <td>Missense / in-frame</td> <td>missense, inframe_insertion, inframe_deletion, protein_altering</td></tr> <tr><th style="background-color:#008000;width:2em"> </th> <td>Synonymous</td> <td>synonymous, stop_retained</td></tr> <tr><th style="background-color:#808080;width:2em"> </th> <td>Non-coding / intergenic</td> <td>intron, non_coding, intergenic, UTR</td></tr> </table> <p> The score (used for shading) is the pooled affected/case allele frequency times 1000. </p> <h3>Pooled allele frequency</h3> <p> <b>Affected AF</b> is the pooled rate across contributing affected arms: <code>affectedAF = sum(AC) / sum(AN)</code>, where <b>affectedAC</b> sums the allele counts and <b>affectedAN</b> sums the allele numbers across each cohort/arm that provides both AC and AF (the per-arm AN is derived as <code>round(AC / AF)</code>). Cohorts that publish only AF
-contribute via a configured <code>default_an</code> in the build configuration. Cohorts
+(with no AC or AN of their own) are still pooled by assigning them an assumed allele number,
+set as a <code>default_an</code> in the build configuration; their per-arm AC is then derived
+as <code>round(AF × default_an)</code>. Cohorts
that publish only AC and have no <code>default_an</code> set (currently GREGoR's per-arm AC_AFFECTED/UNAFFECTED/UNKNOWN) are listed in <b>affectedCohorts</b> but do not contribute to the pool numerator or denominator; their carriers are visible in the per-database AC column instead. The pooled rate is preferred over a max-across-cohorts statistic so a small cohort with a high local AF cannot dominate the displayed frequency. </p> <h3>Finding case-enriched loss-of-function variants</h3> <p> To look for protein-truncating variants that are common in affected individuals but rare in the background, set the Consequence filter to Stop Gained, Frameshift, Splice Donor and Splice Acceptor (these appear red), then add an upper limit on the <b>Background AF</b> filter. Each variant here carries both its affected frequency and its background frequency, so this isolates variants seen in cases with little or no presence in the population/unaffected set. Comparing visually against the <a href="hgTrackUi?g=varFreqsBackground">Population reference</a> track shows the same contrast across a whole gene. </p> <h2>Filters</h2> <ul> <li><b>Variant Type</b> and <b>Consequence</b>: restrict to SNV/insertion/deletion/MNV and to predicted consequence classes (the Consequence filter uses OR logic over the comma-separated tokens on each variant).</li> <li><b>Affected/case AF</b>, <b>AC</b>, <b>AN</b>: the pooled allele frequency (sum AC / sum AN), summed allele count, and summed allele number across the contributing affected arms.
See "Pooled allele frequency" above.</li> <li><b>Background AF</b>, <b>AC</b>, <b>AN</b>: the same triple computed across the population + unaffected background, for filtering case-enriched variants.</li> <li><b>Affected/case cohort</b>: restrict to variants seen in specific disease cohorts (for example, only the two autism cohorts).</li> <li><b>Reference/Alternate Length</b> and <b>Length Change</b>: filter by allele length.</li> </ul> <h2>Methods</h2> <p> Variant-frequency VCFs from the contributing cohorts were stripped of unneeded INFO fields, normalized with <code>bcftools norm</code> (splitting multi-allelic sites), and merged with <code>bcftools merge</code>. The merged callset was annotated with predicted protein consequences using <a href="https://samtools.github.io/bcftools/howtos/csq-calling.html" target="_blank">bcftools csq</a> against the <a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a> GRCh38 release 115 gene models. </p> <p> A custom Python script (<code>vcfToBigBed.py</code>) then read the per-cohort allele counts and frequencies and, for each variant, pooled the allele counts and allele numbers across the affected arms (case/proband subgroups, plus GA4K whole-cohort) to produce this track, and across the population cohorts and unaffected/control subgroups to produce the companion <a href="hgTrackUi?g=varFreqsBackground">Population reference</a> track. A variant seen in both groups appears in both tracks. The build is documented in the <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt" target="_blank">makeDoc</a>, and the scripts are on <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs" target="_blank">GitHub</a>. </p> <h2>Data Access</h2> <p> Because the merged callset combines cohorts whose redistribution licenses differ, this track is <b>not available for download</b> and is not in the Table Browser. 
It can be reconstructed from the individual source VCFs using the <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs" target="_blank">conversion scripts</a> and the <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt" target="_blank">build documentation</a>. The per-project subtracks on the <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page document how to obtain each source dataset. </p> <h2>Credits</h2> <p> This track is only possible thanks to the data from the participants and families of the SFARI SPARK, SCHEMA, GREGoR and GA4K studies. Click the individual project subtracks on the <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page for the specific credits and citations of each cohort. Thanks to Alex Ioannidis, UCSC, for the inspiration for this track and to Andreas Lahner, MGZ, for feedback. </p> <h2>References</h2> <p> For the primary citation of each source cohort, see the References section on the <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page. The merged-track build uses the following tools: </p> <p> Danecek P, McCarthy SA. <a href="https://doi.org/10.1093/bioinformatics/btx100" target="_blank"> BCFtools/csq: haplotype-aware variant consequences</a>. <em>Bioinformatics</em>. 2017 Jul 1;33(13):2037-2039. PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/28205675" target="_blank">28205675</a>; PMC: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870570/" target="_blank">PMC5870570</a> </p>
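The pooling rule in the "Pooled allele frequency" section of this page can be illustrated with a minimal Python sketch. It is not taken from <code>vcfToBigBed.py</code>: the <code>Arm</code> record, the function names, and the example numbers are hypothetical and serve only to show the three cases the text describes (AC and AF both published, AF only with a <code>default_an</code>, and AC only with no <code>default_an</code>).

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Arm:
    """Hypothetical per-variant record for one cohort arm (not the real script's type)."""
    name: str
    ac: Optional[int] = None          # allele count, if the cohort publishes it
    af: Optional[float] = None        # allele frequency, if the cohort publishes it
    default_an: Optional[int] = None  # assumed allele number from the build configuration


def arm_counts(arm: Arm) -> Optional[Tuple[int, int]]:
    """Return (AC, AN) for the pool, or None if the arm cannot contribute."""
    if arm.ac is not None and arm.af:
        # Both AC and a non-zero AF published: derive AN as round(AC / AF).
        return arm.ac, round(arm.ac / arm.af)
    if arm.af is not None and arm.default_an:
        # Only AF published: assume AN = default_an, derive AC as round(AF * default_an).
        return round(arm.af * arm.default_an), arm.default_an
    # For example AC only with no default_an: listed as a cohort, but not pooled.
    return None


def pooled_af(arms: Iterable[Arm]) -> Tuple[int, int, float]:
    """Pool the contributing arms: AF = sum(AC) / sum(AN)."""
    total_ac = 0
    total_an = 0
    for arm in arms:
        counts = arm_counts(arm)
        if counts is None:
            continue
        total_ac += counts[0]
        total_an += counts[1]
    af = total_ac / total_an if total_an else 0.0
    return total_ac, total_an, af


# Made-up numbers, for illustration only.
example_arms = [
    Arm("cohortA_cases", ac=3, af=0.0003),           # AN derived as 10000
    Arm("cohortB_all", af=0.001, default_an=2000),   # AC derived as 2
    Arm("cohortC_affected", ac=5),                   # AC only, no default_an: skipped
]
ac, an, af = pooled_af(example_arms)
print(ac, an, af)
```

In this made-up example the third arm adds nothing to the numerator or denominator, matching the description of AC-only cohorts without a <code>default_an</code>. The sketch deliberately leaves out any case the page does not describe, such as an AC-only cohort that does have a <code>default_an</code>.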