af9a5b388259e680dd34bc47b2cad4ff6e3d162f lrnassar Sat Jun 13 03:00:51 2026 -0700

varFreqs: pre-release polish from comprehensive sanity check.

* Sync the new combined-track shortLabels into the four description pages: "Affected/Case Individuals" -> "Disease cohorts" and "Population + Unaffected" -> "Population reference" (matches the trackdb shortLabels users now see).
* Add a paragraph in the supertrack Methods section describing the pooled affectedAF / backgroundAF formulation (sum AC / sum AN) and the default_an configuration that handles AF-only cohorts.
* Update the in-track Methods paragraphs on varFreqsAffected.html and varFreqsBackground.html: replace "summed/maximized" with "pooled".
* Fix supertrack table downloadability column to match the underscore-prefix convention: allofus "Yes" -> "No" (description page already says license restricted); gregor "No" -> "Yes" (description page says VCF is on our download server, and the gbdb path is not underscore-prefixed).
* Add a 2026-06-12 makedoc section documenting the pooled-AF rebuild, the default_an mechanism, the new affectedAN/backgroundAN columns, the before/after spot-check at APOE rs429358, and the build commands.

refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsAffected.html src/hg/makeDb/trackDb/human/varFreqsAffected.html
index 36af10b8e43..cd7631d1313 100644
--- src/hg/makeDb/trackDb/human/varFreqsAffected.html
+++ src/hg/makeDb/trackDb/human/varFreqsAffected.html
@@ -1,22 +1,22 @@
 <h2>Description</h2>
 <p>
 This track shows small variants (SNVs and short indels) that were observed in <b>affected or case individuals</b> of disease-study cohorts, annotated with their predicted protein consequence and colored by severity.
 It is one half of a matched pair: the companion
-<a href="hgTrackUi?g=varFreqsBackground">Population + Unaffected</a> track shows the same
+<a href="hgTrackUi?g=varFreqsBackground">Population reference</a> track shows the same
 kind of variants seen in population reference cohorts and in unaffected relatives or controls. Displaying the two together lets you compare, for example, how often a loss-of-function variant in a gene of interest is seen in affected individuals versus the general/unaffected background. For the full list of contributing projects, see the <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> collection page.
 </p>
 <p>
 The affected counts are drawn from the affected or case arm of five disease-study cohorts: SFARI SPARK WES and SFARI SPARK WGS (autism spectrum disorder probands), SCHEMA (schizophrenia cases), GREGoR (affected rare-disease participants), and GA4K (a pediatric rare-disease cohort). For SPARK, SFARI WGS, SCHEMA, and GREGoR the source data carries an explicit affected/unaffected (or case/control) label and only the affected arm feeds this track. GA4K reports a single cohort-wide frequency with no per-individual label; because it is a rare-disease cohort it is counted as affected here, with the caveat that it enrolls parent-child trios, so a minority of its carriers are unaffected parents. Genotyping-array
@@ -55,66 +55,66 @@
 that publish only AC and have no <code>default_an</code> set (currently GREGoR's per-arm AC_AFFECTED/UNAFFECTED/UNKNOWN) are listed in <b>affectedCohorts</b> but do not contribute to the pool numerator or denominator; their carriers are visible in the per-database AC column instead. The pooled rate is preferred over a max-across-cohorts statistic so a small cohort with a high local AF cannot dominate the displayed frequency.
 </p>
 <h3>Finding case-enriched loss-of-function variants</h3>
 <p>
 To look for protein-truncating variants that are common in affected individuals but rare in the background, set the Consequence filter to Stop Gained, Frameshift, Splice Donor and Splice Acceptor (these appear red), then add an upper limit on the <b>Background AF</b> filter. Each variant here carries both its affected frequency and its background frequency, so this isolates variants seen in cases with little or no presence in the population/unaffected set. Comparing visually against the
-<a href="hgTrackUi?g=varFreqsBackground">Population + Unaffected</a> track shows the same
+<a href="hgTrackUi?g=varFreqsBackground">Population reference</a> track shows the same
 contrast across a whole gene.
 </p>
 <h2>Filters</h2>
 <ul>
 <li><b>Variant Type</b> and <b>Consequence</b>: restrict to SNV/insertion/deletion/MNV and to predicted consequence classes (the Consequence filter uses OR logic over the comma-separated tokens on each variant).</li>
 <li><b>Affected/case AF</b>, <b>AC</b>, <b>AN</b>: the pooled allele frequency (sum AC / sum AN), summed allele count, and summed allele number across the contributing affected arms. See "Pooled allele frequency" above.</li>
 <li><b>Background AF</b>, <b>AC</b>, <b>AN</b>: the same triple computed across the population + unaffected background, for filtering case-enriched variants.</li>
 <li><b>Affected/case cohort</b>: restrict to variants seen in specific disease cohorts (for example, only the two autism cohorts).</li>
 <li><b>Reference/Alternate Length</b> and <b>Length Change</b>: filter by allele length.</li>
 </ul>
 <h2>Methods</h2>
 <p>
 Variant-frequency VCFs from the contributing cohorts were stripped of unneeded INFO fields, normalized with <code>bcftools norm</code> (splitting multi-allelic sites), and merged with <code>bcftools merge</code>.
 The merged callset was annotated with predicted protein consequences using <a href="https://samtools.github.io/bcftools/howtos/csq-calling.html" target="_blank">bcftools csq</a> against the <a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a> GRCh38 release 115 gene models.
 </p>
 <p>
 A custom Python script (<code>vcfToBigBed.py</code>) then read the per-cohort allele
-frequencies and, for each variant, summed/maximized the counts across the affected arms
-(case/proband subgroups, plus GA4K whole-cohort) to produce this track, and across the
-population cohorts and unaffected/control subgroups to produce the companion
-<a href="hgTrackUi?g=varFreqsBackground">Population + Unaffected</a> track. A variant seen in
-both groups appears in both tracks. The build is documented in the
+counts and frequencies and, for each variant, pooled the allele counts and allele numbers
+across the affected arms (case/proband subgroups, plus GA4K whole-cohort) to produce this
+track, and across the population cohorts and unaffected/control subgroups to produce the
+companion <a href="hgTrackUi?g=varFreqsBackground">Population reference</a> track. A variant
+seen in both groups appears in both tracks. The build is documented in the
 <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt" target="_blank">makeDoc</a>, and the scripts are on <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs" target="_blank">GitHub</a>.
 </p>
 <h2>Data Access</h2>
 <p>
 Because the merged callset combines cohorts whose redistribution licenses differ, this track is <b>not available for download</b> and is not in the Table Browser.
It can be reconstructed from the individual source VCFs using the <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs" target="_blank">conversion scripts</a> and the <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt" target="_blank">build documentation</a>. The per-project subtracks on the