src/hg/makeDb/trackDb/human/varFreqsDisease.html 753e4fdfc8b960c8a8775e2282b0f87c73a95449

753e4fdfc8b960c8a8775e2282b0f87c73a95449
lrnassar
  Tue Jun 2 07:49:03 2026 -0700
varFreqsDisease.html: list six disease cohorts separately to match the
"six cohorts" count in the opening sentence and the six per-source AC/AF
columns in the bigBed schema. SPARK WES and SFARI WGS are two distinct
sample sets, not one combined cohort. Per QA feedback. refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsDisease.html src/hg/makeDb/trackDb/human/varFreqsDisease.html
index 613013eac96..010d65019b7 100644
--- src/hg/makeDb/trackDb/human/varFreqsDisease.html
+++ src/hg/makeDb/trackDb/human/varFreqsDisease.html
@@ -1,198 +1,198 @@
 <h2>Description</h2>
 <p>
 This track merges variants from six disease-focused or clinically recruited cohorts into a
 single bigBed file with predicted protein consequences and cross-database filtering. It
-contains 932 million variants from SFARI SPARK (WES + WGS, autism families), TOPMed
-(NHLBI heart, lung and blood disease cohorts), SCHEMA (schizophrenia case/control),
-GREGoR (rare-disease families), and GA4K (PacBio long-read pediatric rare disease). Where
-the source dataset provides per-phenotype counts, those are exposed as separate AC/AF
-columns and as filter widgets.
+contains 932 million variants from SPARK WES (140k autism families), SFARI WGS (12.5k
+autism families), TOPMed (NHLBI heart, lung and blood disease cohorts), SCHEMA
+(schizophrenia case/control), GREGoR (rare-disease families), and GA4K (PacBio long-read
+pediatric rare disease). Where the source dataset provides per-phenotype counts, those are
+exposed as separate AC/AF columns and as filter widgets.
 </p>
 
 <p>
 For a summary of all available variant frequency databases, including the population-scale
 control track and the genotyping-array track, see the
 <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> supertrack page.
 </p>
 
 <p>
 Each variant is annotated with its predicted consequence on protein-coding genes
 (using <a href="https://samtools.github.io/bcftools/howtos/csq-calling.html"
 target="_blank">bcftools csq</a> with
 <a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a>
 gene models), and colored by severity. Allele counts and frequencies are shown for each
 source database and, where available, broken down by phenotype.
 </p>
 
 <h2>Display Conventions</h2>
 
 <h3>Color by Consequence</h3>
 <p>Variants are colored by their most severe predicted consequence:</p>
 <table class="stdTbl">
 <tr><th>Color</th><th>Consequence class</th><th>Examples</th></tr>
 <tr>
   <td style="background-color: rgb(255,0,0); color: white; text-align: center;"><b>Red</b></td>
   <td>Protein-truncating / Loss-of-function</td>
   <td>stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost</td>
 </tr>
 <tr>
   <td style="background-color: rgb(31,119,180); color: white; text-align: center;"><b>Blue</b></td>
   <td>Missense / In-frame</td>
   <td>missense, inframe_insertion, inframe_deletion, protein_altering</td>
 </tr>
 <tr>
   <td style="background-color: rgb(0,128,0); color: white; text-align: center;"><b>Green</b></td>
   <td>Synonymous</td>
   <td>synonymous, stop_retained</td>
 </tr>
 <tr>
   <td style="background-color: rgb(128,128,128); color: white; text-align: center;"><b>Grey</b></td>
   <td>Non-coding / Intergenic</td>
   <td>intron, non_coding, intergenic, UTR</td>
 </tr>
 </table>
 
 <h3>Amino Acid Change Notation</h3>
 <p>
 The &quot;AA change&quot; field uses bcftools csq notation: <b>23I&gt;23V</b> means position
 23 changed from Isoleucine (I) to Valine (V) (missense). <b>23I</b> alone (no arrow)
 means position 23 is Isoleucine and unchanged (synonymous). A &quot;*&quot; indicates a
 stop codon (e.g. 45R&gt;45* is a stop_gained).
 </p>
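The notation described above can be split into its parts with a small parser. This is a minimal sketch: the function name `parse_aa_change` and the regular expression are hypothetical, inferred only from the examples in this section (the HTML entity &gt; is written as a plain ">" here), and not taken from bcftools itself.

```python
import re

# Hypothetical pattern inferred from the examples "23I>23V", "23I" and "45R>45*".
# It is an assumption, not the official bcftools csq grammar.
_AA_CHANGE = re.compile(r"^(\d+)([A-Z*]+)(?:>(\d+)([A-Z*]+))?$")


def parse_aa_change(text):
    """Split an "AA change" string into position, reference and alternate residues."""
    match = _AA_CHANGE.match(text)
    if match is None:
        raise ValueError(f"unrecognized AA change: {text!r}")
    pos, ref, _alt_pos, alt = match.groups()
    if alt is None:
        # No ">" part: the residue is unchanged (synonymous).
        alt = ref
    return {
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "changed": ref != alt,
        # A "*" that appears only on the alternate side marks a new stop codon.
        "introduces_stop": "*" in alt and "*" not in ref,
    }
```

For example, `parse_aa_change("45R>45*")` reports a changed residue with `introduces_stop` set, while `parse_aa_change("23I")` reports an unchanged one.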
 
 <h2>Filters</h2>
 <p>
 This track supports filtering via the track settings page. Click the track title or use the
 &quot;Configure&quot; button to access filters.
 </p>
 
 <h3>Variant Type and Consequence</h3>
 <ul>
   <li><b>Variant Type</b>: SNV, Insertion, Deletion, or MNV.</li>
   <li><b>Consequence</b>: Missense, Synonymous, Stop Gained, Frameshift, Splice Donor,
       Splice Acceptor, Intron, 3' UTR, 5' UTR, Non-coding, Intergenic, or Other. The filter
       uses OR logic across the comma-separated consequence tokens on each variant. See the
       <a href="hgTrackUi?g=varFreqsAll">All Databases Combined</a> description page for a
       complete description of the &quot;Other&quot; bucket.</li>
 </ul>
 
 <h3>Frequency and Count Filters</h3>
 <ul>
   <li><b>Max Allele Frequency</b>: Filter by the maximum allele frequency observed across
       the six disease cohorts. Useful for finding rare variants enriched in cases.</li>
   <li><b>Total Allele Count</b>: Filter by the sum of allele counts across all six
       databases.</li>
   <li><b>Per-database AF and AC</b>: Filter by allele frequency or count in any specific
       source. For example, restrict to variants with SCHEMA case AF &gt; 0.001.</li>
 </ul>
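The two aggregate filters above can be illustrated with a short calculation. This is a sketch under stated assumptions: the function name `summarize_variant`, the input layout (a mapping from source name to an AC/AF pair) and the handling of sources where the variant is absent are all hypothetical, since the page does not describe how the track stores these values.

```python
def summarize_variant(per_source):
    """Return (total allele count, maximum allele frequency) for one variant.

    per_source maps a source name to an (AC, AF) tuple. AF may be None when
    the source does not report the variant; that convention is an assumption
    made for this sketch.
    """
    total_ac = sum(ac for ac, _af in per_source.values())
    observed = [af for _ac, af in per_source.values() if af is not None]
    max_af = max(observed) if observed else None
    return total_ac, max_af


def is_rare(per_source, max_af_cutoff):
    """True when the highest AF across sources does not exceed the cutoff."""
    _total_ac, max_af = summarize_variant(per_source)
    return max_af is not None and max_af <= max_af_cutoff
```

With `{"SCHEMA": (3, 0.0005), "TOPMed": (10, 0.002)}` the total allele count is 13 and the maximum allele frequency is 0.002, so the variant passes a 0.01 cutoff but not a 0.001 cutoff.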
 
 <h3>Phenotype-stratified Filters</h3>
 <p>
 Four of the six sources publish counts split by phenotype, which lets you compare allele
 frequencies between affected and unaffected groups within the same cohort:
 </p>
 <ul>
   <li><b>SPARK WES</b> and <b>SFARI WGS</b>: ASD proband counts versus non-ASD family
       members (mostly parents and unaffected siblings). The split is taken from the
       <code>asd</code> column of the SPARK <code>individuals_registration</code> TSV.</li>
   <li><b>SCHEMA</b>: Schizophrenia case counts versus controls, summed across the 39
       analysis cohorts in the original meta-analysis.</li>
   <li><b>GREGoR</b>: Affected, Unaffected, and Unknown disease-status counts.</li>
 </ul>
 
 <h3>Source Database</h3>
 <p>
 The <b>Source Database</b> filter restricts the display to variants present in specific
 databases. It uses OR logic: selecting multiple databases shows variants found in any of
 the selected sources.
 </p>
 
 <h3>Length Filters</h3>
 <ul>
   <li><b>Reference/Alternate Length</b>: Filter by the length of the reference or alternate allele.</li>
   <li><b>Length Change</b>: Filter by the size difference between alternate and reference
       (positive = insertion, negative = deletion, zero = SNV or MNV).</li>
 </ul>
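The sign convention for Length Change can be written out as a few lines of code. This is a minimal sketch: the function names are hypothetical, and the mapping from length change to variant type is an assumption based only on the wording above. How the track itself classifies complex alleles is not documented here.

```python
def length_change(ref, alt):
    """Alternate length minus reference length, as described for the filter."""
    return len(alt) - len(ref)


def variant_type(ref, alt):
    """Classify an allele pair using only the length-change sign (an assumption)."""
    delta = length_change(ref, alt)
    if delta > 0:
        return "Insertion"
    if delta < 0:
        return "Deletion"
    # Equal lengths: a single base is an SNV, several bases an MNV.
    return "SNV" if len(ref) == 1 else "MNV"
```

For instance, `length_change("ATT", "A")` is -2, which `variant_type` reports as a deletion.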
 
 <h2>Methods</h2>
 <p>
 The same merge-and-annotate pipeline used for the
 <a href="hgTrackUi?g=varFreqsAll">All Databases Combined</a> track was run on the
 disease-cohort subset of source VCFs. Each VCF was stripped of its INFO fields,
 normalized with <code>bcftools norm</code> (splitting multi-allelic sites), and merged with
 <code>bcftools merge</code>. The merged VCF was then annotated with predicted protein
 consequences using <code>bcftools csq</code> with the
 <a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a>
 GRCh38 release 115 gene annotation (GFF3).
 </p>
 
 <p>
 The SPARK WES and SFARI WGS sites VCFs were rebuilt for this track so each variant carries
 phenotype-stratified counts in addition to overall AC/AN/AF. The split uses the
 <code>asd</code> column of the SPARK <code>individuals_registration</code> TSV via
 <code>bcftools +fill-tags -S</code>, producing AC_AUT / AN_AUT / AF_AUT and
 AC_NON_AUT / AN_NON_AUT / AF_NON_AUT. SCHEMA was processed the same way, summing
 AC_CASE/AN_CASE/AF_CASE and AC_CTRL/AN_CTRL/AF_CTRL across its 39 analysis cohorts.
 GREGoR ships AC/AN/AF triples for affected, unaffected and unknown disease status
 directly in its release.
 </p>
 
 <p>
 The track's
 <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt"
 target="_blank">makeDoc file</a> documents how each source VCF was converted. Scripts are
 available from
 <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs"
 target="_blank">GitHub</a>.
 </p>
 
 <h2>Data Access</h2>
 <p>
 The data can be explored interactively with the
 <a href="../cgi-bin/hgTables">Table Browser</a> or the
 <a href="../cgi-bin/hgIntegrator">Data Integrator</a>. For programmatic access, our
 <a href="https://api.genome.ucsc.edu" target="_blank">REST API</a> can be used; the track
 name is <em>varFreqsDisease</em>.
 </p>
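A REST API request for a region of this track could be built as sketched below. The helper name `track_data_url` is hypothetical. The `/getData/track` endpoint and its semicolon-separated parameters follow the public UCSC API documentation, and the `hg38` genome is an assumption taken from the makeDoc path. Whether the API returns data for this track, given the download restriction described on this page, has not been verified here.

```python
from urllib.parse import quote

API_BASE = "https://api.genome.ucsc.edu"


def track_data_url(chrom, start, end, genome="hg38", track="varFreqsDisease"):
    """Build a getData/track URL for one region; no network request is made."""
    params = [
        ("genome", genome),  # assumed assembly, inferred from the makeDoc path
        ("track", track),
        ("chrom", chrom),
        ("start", start),
        ("end", end),
    ]
    query = ";".join(f"{name}={quote(str(value))}" for name, value in params)
    return f"{API_BASE}/getData/track?{query}"
```

The resulting URL can be opened in a browser or passed to any HTTP client. Keeping the region small is advisable with a track of this size.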
 <p>
 Because the merged callset includes data from multiple sources whose redistribution
 licenses differ, the combined bigBed is <b>not available for download</b> from our download
 server. The combined track can be reconstructed from the individual source VCFs using the
 <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs"
 target="_blank">conversion scripts on GitHub</a> together with the
 <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt"
 target="_blank">build documentation</a>.
 </p>
 
 <h2>Credits</h2>
 <p>
 This track is only possible thanks to the data from millions of volunteers around the
 world, who donated blood, signed consent forms and provided health information about
 themselves and sometimes their families. Click on any of the individual tracks in the
 <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> supertrack to see the specific credits
 for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and
 to Andreas Lahner, MGZ, for feedback.
 </p>
 
 <h2>References</h2>
 <p>
 For primary citations of each source dataset, see the References section on the
 <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> supertrack page. The merged-track
 build itself uses the following tools:
 </p>
 <p>
 Danecek P, McCarthy SA.
 <a href="https://doi.org/10.1093/bioinformatics/btx100" target="_blank">
 BCFtools/csq: haplotype-aware variant consequences</a>.
 <em>Bioinformatics</em>. 2017 Jul 1;33(13):2037-2039.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/28205675" target="_blank">28205675</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870570/" target="_blank">PMC5870570</a>
 </p>
 <p>
 McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F.
 <a href="https://doi.org/10.1186/s13059-016-0974-4" target="_blank">
 The Ensembl Variant Effect Predictor</a>.
 <em>Genome Biol</em>. 2016 Jun 6;17(1):122.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/27268795" target="_blank">27268795</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4893825/" target="_blank">PMC4893825</a>
 </p>