1259dcfba3a263d92d2602665fd866dc44b47996 lrnassar Sun Jun 21 11:17:10 2026 -0700
Clarify varFreqs description page wording per code review feedback. refs #37733
Reword the default_an sentence in the Pooled allele frequency sections of
varFreqsAffected.html and varFreqsBackground.html to explain that cohorts
publishing only AF are pooled via an assigned default_an, with per-arm AC
derived as round(AF * default_an). Change "tokens" to "terms" in the
Consequence filter section of varFreqs.html.
diff --git src/hg/makeDb/trackDb/human/varFreqsBackground.html src/hg/makeDb/trackDb/human/varFreqsBackground.html
index 2b7d13f8ecb..5d2f82eb6ef 100644
--- src/hg/makeDb/trackDb/human/varFreqsBackground.html
+++ src/hg/makeDb/trackDb/human/varFreqsBackground.html
@@ -1,139 +1,141 @@
<h2>Description</h2>
<p>
This track shows small variants (SNVs and short indels) seen in
<b>population reference cohorts and in unaffected or control individuals</b>
of disease-study cohorts, annotated with their predicted protein consequence and
colored by severity. It is the background half of a matched pair: the companion
<a href="hgTrackUi?g=varFreqsAffected">Disease cohorts</a> track shows the same kind of
variants seen in affected or case individuals. Displaying the two together lets you see
how common a variant is in the general/unaffected population compared with affected
individuals. For the full list of contributing projects, see the
<a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page.
</p>
<p>
The background combines two kinds of data: the population/biobank reference cohorts
(such as gnomAD HGDP+1kG, TOPMed, ALFA, HRC and the many national WGS projects), and the
unaffected/control or unknown-phenotype arms of the disease-study cohorts (non-ASD family
members in SFARI SPARK WES/WGS, SCHEMA controls, and GREGoR unaffected/unknown
participants). Genotyping-array cohorts are not included.
A variant that also appears in affected individuals is shown in both this track and the
<a href="hgTrackUi?g=varFreqsAffected">Disease cohorts</a> track.
</p>
<h2>Display Conventions</h2>
<h3>Color by Consequence</h3>
<p>Variants are colored by their most severe predicted consequence:</p>
<table class="stdTbl">
<tr><th>Color</th><th>Consequence class</th><th>Examples</th></tr>
<tr><th style="background-color:#FF0000;width:2em"> </th>
<td>Protein-truncating / loss-of-function</td>
<td>stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost</td></tr>
<tr><th style="background-color:#1F77B4;width:2em"> </th>
<td>Missense / in-frame</td>
<td>missense, inframe_insertion, inframe_deletion, protein_altering</td></tr>
<tr><th style="background-color:#008000;width:2em"> </th>
<td>Synonymous</td>
<td>synonymous, stop_retained</td></tr>
<tr><th style="background-color:#808080;width:2em"> </th>
<td>Non-coding / intergenic</td>
<td>intron, non_coding, intergenic, UTR</td></tr>
</table>
<p>
The score (used for shading) is the pooled background allele frequency times 1000.
</p>
<h3>Pooled allele frequency</h3>
<p>
<b>Background AF</b> is the pooled rate across contributing population cohorts and
unaffected/control arms: <code>backgroundAF = sum(AC) / sum(AN)</code>, where
<b>backgroundAC</b> sums the allele counts and <b>backgroundAN</b> sums the allele numbers
across each cohort/arm that provides both AC and AF (the per-arm AN is derived as
-<code>round(AC / AF)</code>). Two cohorts that publish only AF (ABraOM, ALFA) contribute
-via a configured <code>default_an</code> in the build configuration. Cohorts that publish
+<code>round(AC / AF)</code>). Two cohorts that publish only AF (ABraOM, ALFA) are still
+pooled by assigning them an assumed allele number, set as a <code>default_an</code> in the
+build configuration; their per-arm AC is then derived as <code>round(AF × default_an)</code>.
+Cohorts that publish
only AC with no <code>default_an</code> set (currently MGRB and the GREGoR unaffected and
unknown arms), and cohorts that contribute only through per-population AC/AF (currently
AllOfUs), are listed in <b>backgroundSources</b> but do not contribute to the pool
numerator or denominator; their data remain visible in the per-database and
per-population AC/AF columns. The pooled rate is preferred over a max-across-cohorts
statistic so a small cohort with a high local AF (for example AllOfUs Oceanian) cannot
dominate the displayed frequency.
</p>
<h2>Filters</h2>
<ul>
<li><b>Variant Type</b> and <b>Consequence</b>: restrict to SNV/insertion/deletion/MNV and
to predicted consequence classes (the Consequence filter uses OR logic over the
comma-separated tokens on each variant).</li>
<li><b>Background AF</b>, <b>AC</b>, <b>AN</b>: the pooled allele frequency
(sum AC / sum AN), summed allele count, and summed allele number across the contributing
population cohorts and unaffected/control arms. See "Pooled allele frequency" above.</li>
<li><b>Affected/case AF</b>, <b>AC</b>, <b>AN</b>: the same triple computed across
affected individuals, for context.</li>
<li><b>Background source</b>: restrict to variants seen in specific cohorts.</li>
<li><b>Per-database AF/AC</b> and ancestry-specific allele frequencies (AllOfUs,
GenomeAsia, gnomAD HGDP+1kG, NPM Singapore, WBBC) let you filter to a single cohort or
ancestry group.</li>
<li><b>Reference/Alternate Length</b> and <b>Length Change</b>: filter by allele length.</li>
</ul>
<h2>Methods</h2>
<p>
Variant-frequency VCFs from the contributing cohorts were stripped of unneeded INFO
fields, normalized with <code>bcftools norm</code> (splitting multi-allelic sites), and
merged with <code>bcftools merge</code>.
The merged callset was annotated with predicted protein consequences using
<a href="https://samtools.github.io/bcftools/howtos/csq-calling.html" target="_blank">bcftools csq</a>
against the
<a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a>
GRCh38 release 115 gene models.
</p>
<p>
A custom Python script (<code>vcfToBigBed.py</code>) then read the per-cohort allele
counts and frequencies and, for each variant, pooled the allele counts and allele numbers
across the population cohorts and unaffected/control subgroups to produce this track, and
across the affected arms to produce the companion
<a href="hgTrackUi?g=varFreqsAffected">Disease cohorts</a> track. A variant seen in both
groups appears in both tracks. The build is documented in the
<a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt" target="_blank">makeDoc</a>,
and the scripts are on
<a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs" target="_blank">GitHub</a>.
</p>
<h2>Data Access</h2>
<p>
Because the merged callset combines cohorts whose redistribution licenses differ, this
track is <b>not available for download</b> and is not in the Table Browser. It can be
reconstructed from the individual source VCFs using the
<a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs" target="_blank">conversion scripts</a>
and the
<a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt" target="_blank">build documentation</a>.
The per-project subtracks on the <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a>
collection page document how to obtain each source dataset.
</p>
<h2>Credits</h2>
<p>
This track is only possible thanks to the data from millions of volunteers around the
world who contributed to the population reference projects and to the unaffected/control
arms of the disease cohorts.
Click the individual project subtracks on the
<a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page for the specific
credits and citations of each cohort. Thanks to Alex Ioannidis, UCSC, for the inspiration
for this track and to Andreas Lahner, MGZ, for feedback.
</p>
<h2>References</h2>
<p>
For the primary citation of each source cohort, see the References section on the
<a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page. The merged-track build
uses the following tools:
</p>
<p>
Danecek P, McCarthy SA.
<a href="https://doi.org/10.1093/bioinformatics/btx100" target="_blank"> BCFtools/csq: haplotype-aware variant consequences</a>.
<em>Bioinformatics</em>. 2017 Jul 1;33(13):2037-2039.
PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/28205675" target="_blank">28205675</a>;
PMC: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870570/" target="_blank">PMC5870570</a>
</p>
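
Review note (illustrative only, not part of the patch): the pooling rule that the reworded "Pooled allele frequency" paragraph describes can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the logic of vcfToBigBed.py: the function name pool_background, the per-arm dict layout, and the default_an value in the example are all hypothetical, and skipping arms whose AF is 0 is an assumption made here because AN cannot be derived from them.

```python
def pool_background(arms, default_an=None):
    """Pool per-arm allele counts into (AC, AN, AF, sources).

    arms: list of dicts with a 'name' key and optional 'ac' / 'af' keys
          (hypothetical layout, chosen for illustration).
    default_an: dict mapping cohort name -> assumed allele number, used
          only for cohorts that publish AF without AC.
    """
    default_an = default_an or {}
    total_ac = 0
    total_an = 0
    sources = []
    for arm in arms:
        name = arm["name"]
        ac = arm.get("ac")
        af = arm.get("af")
        # Every arm is listed as a source, even if it is not pooled.
        sources.append(name)
        if ac is not None and af:
            # Both AC and AF published: derive AN as round(AC / AF).
            total_ac += ac
            total_an += round(ac / af)
        elif ac is None and af is not None and name in default_an:
            # AF only: assume AN = default_an, derive AC = round(AF * default_an).
            an = default_an[name]
            total_ac += round(af * an)
            total_an += an
        # Otherwise (AC only, no default_an): listed but not pooled.
    pooled_af = total_ac / total_an if total_an else None
    return total_ac, total_an, pooled_af, sources


# Example with made-up numbers: one AC+AF arm, one AF-only arm with an
# assumed default_an of 10000, and one AC-only arm that is not pooled.
example = pool_background(
    [
        {"name": "cohortA", "ac": 5, "af": 0.001},
        {"name": "ALFA", "af": 0.0002},
        {"name": "MGRB", "ac": 3},
    ],
    default_an={"ALFA": 10000},
)
```

Note that Python's built-in round() rounds exact halves to the nearest even integer, so a real build script may round differently at those boundaries.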