753e4fdfc8b960c8a8775e2282b0f87c73a95449 lrnassar Tue Jun 2 07:49:03 2026 -0700
varFreqsDisease.html: list six disease cohorts separately to match the "six cohorts" count in the opening sentence and the six per-source AC/AF columns in the bigBed schema. SPARK WES and SFARI WGS are two distinct sample sets, not one combined cohort. Per QA feedback. refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqsDisease.html src/hg/makeDb/trackDb/human/varFreqsDisease.html
index 613013eac96..010d65019b7 100644
--- src/hg/makeDb/trackDb/human/varFreqsDisease.html
+++ src/hg/makeDb/trackDb/human/varFreqsDisease.html
@@ -1,198 +1,198 @@
 <h2>Description</h2>
 <p>
 This track merges variants from six disease-focused or clinically-recruited cohorts into a
 single bigBed file with predicted protein consequences and cross-database filtering. It
-contains 932 million variants from SFARI SPARK (WES + WGS, autism families), TOPMed
-(NHLBI heart, lung and blood disease cohorts), SCHEMA (schizophrenia case/control),
-GREGoR (rare-disease families), and GA4K (PacBio long-read pediatric rare disease). Where
-the source dataset provides per-phenotype counts, those are exposed as separate AC/AF
-columns and as filter widgets.
+contains 932 million variants from SPARK WES (140k autism families), SFARI WGS (12.5k
+autism families), TOPMed (NHLBI heart, lung and blood disease cohorts), SCHEMA
+(schizophrenia case/control), GREGoR (rare-disease families), and GA4K (PacBio long-read
+pediatric rare disease). Where the source dataset provides per-phenotype counts, those are
+exposed as separate AC/AF columns and as filter widgets.
 </p>
 <p>
 For a summary of all available variant frequency databases, including the population-scale
 control track and the genotyping-array track, see the
 <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> supertrack page.
</p> <p> Each variant is annotated with its predicted consequence on protein-coding genes (using <a href="https://samtools.github.io/bcftools/howtos/csq-calling.html" target="_blank">bcftools csq</a> with <a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a> gene models), and colored by severity. Allele counts and frequencies are shown for each source database and, where available, broken down by phenotype. </p> <h2>Display Conventions</h2> <h3>Color by Consequence</h3> <p>Variants are colored by their most severe predicted consequence:</p> <table class="stdTbl"> <tr><th>Color</th><th>Consequence class</th><th>Examples</th></tr> <tr> <td style="background-color: rgb(255,0,0); color: white; text-align: center;"><b>Red</b></td> <td>Protein-truncating / Loss-of-function</td> <td>stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost</td> </tr> <tr> <td style="background-color: rgb(31,119,180); color: white; text-align: center;"><b>Blue</b></td> <td>Missense / In-frame</td> <td>missense, inframe_insertion, inframe_deletion, protein_altering</td> </tr> <tr> <td style="background-color: rgb(0,128,0); color: white; text-align: center;"><b>Green</b></td> <td>Synonymous</td> <td>synonymous, stop_retained</td> </tr> <tr> <td style="background-color: rgb(128,128,128); color: white; text-align: center;"><b>Grey</b></td> <td>Non-coding / Intergenic</td> <td>intron, non_coding, intergenic, UTR</td> </tr> </table> <h3>Amino Acid Change Notation</h3> <p> The "AA change" field uses bcftools csq notation: <b>23I>23V</b> means position 23 changed from Isoleucine (I) to Valine (V) (missense). <b>23I</b> alone (no arrow) means position 23 is Isoleucine and unchanged (synonymous). A "*" indicates a stop codon (e.g. 45R>45* is a stop_gained). </p> <h2>Filters</h2> <p> This track supports filtering via the track settings page. Click the track title or use the "Configure" button to access filters. 
</p> <h3>Variant Type and Consequence</h3> <ul> <li><b>Variant Type</b>: SNV, Insertion, Deletion, or MNV.</li> <li><b>Consequence</b>: Missense, Synonymous, Stop Gained, Frameshift, Splice Donor, Splice Acceptor, Intron, 3' UTR, 5' UTR, Non-coding, Intergenic, or Other. The filter uses OR logic across the comma-separated consequence tokens on each variant. See the <a href="hgTrackUi?g=varFreqsAll">All Databases Combined</a> description page for a complete definition of the "Other" bucket.</li> </ul> <h3>Frequency and Count Filters</h3> <ul> <li><b>Max Allele Frequency</b>: Filter by the maximum allele frequency observed across the six disease cohorts. Useful for finding rare variants enriched in cases.</li> <li><b>Total Allele Count</b>: Filter by the sum of allele counts across all six databases.</li> <li><b>Per-database AF and AC</b>: Filter by allele frequency or count in any specific source. For example, restrict to variants with SCHEMA case AF > 0.001.</li> </ul> <h3>Phenotype-stratified Filters</h3> <p> Four of the six sources publish counts split by phenotype, which lets you compare allele frequencies between affected and unaffected groups within the same cohort: </p> <ul> <li><b>SPARK WES</b> and <b>SFARI WGS</b>: ASD proband counts versus non-ASD family members (mostly parents and unaffected siblings). The split is taken from the <code>asd</code> column of the SPARK <code>individuals_registration</code> TSV.</li> <li><b>SCHEMA</b>: Schizophrenia case counts versus controls, summed across the 39 analysis cohorts in the original meta-analysis.</li> <li><b>GREGoR</b>: Affected, Unaffected, and Unknown disease-status counts.</li> </ul> <h3>Source Database</h3> <p> The <b>Source Database</b> filter restricts the display to variants present in specific databases. It uses OR logic: selecting multiple databases shows variants found in any of the selected sources. 
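As a minimal sketch of that OR logic (not the Genome Browser's actual filter code), the check below keeps a variant when at least one selected database appears among its sources. The function name <code>passes_source_filter</code>, the comma-separated layout of the source field, and the rule that an empty selection disables the filter are all assumptions made for illustration.
</p>

```python
def passes_source_filter(variant_sources: str, selected: set[str]) -> bool:
    """Return True when the variant is present in at least one selected database.

    variant_sources: comma-separated source names for one variant
                     (assumed layout, not taken from the bigBed schema).
    selected: database names chosen in the filter; an empty set is treated
              as "no filtering" (an assumption of this sketch).
    """
    if not selected:
        return True
    # Split the field into individual source names, ignoring stray whitespace.
    present = {name.strip() for name in variant_sources.split(",") if name.strip()}
    # OR logic: one shared name is enough to keep the variant.
    return not present.isdisjoint(selected)
```

<p>
With <code>selected = {"SCHEMA", "GREGoR"}</code>, a variant seen only in TOPMed is hidden, while a variant seen in both TOPMed and GREGoR is shown.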
</p> <h3>Length Filters</h3> <ul> <li><b>Reference/Alternate Length</b>: Filter by the length of the reference or alternate allele.</li> <li><b>Length Change</b>: Filter by the size difference between alternate and reference (positive = insertion, negative = deletion, zero = SNV or MNV).</li> </ul> <h2>Methods</h2> <p> The same merge-and-annotate pipeline used for the <a href="hgTrackUi?g=varFreqsAll">All Databases Combined</a> track was run on the disease-cohort subset of source VCFs. Each VCF was stripped of its INFO fields, normalized with <code>bcftools norm</code> (splitting multi-allelic sites), and merged with <code>bcftools merge</code>. The merged VCF was then annotated with predicted protein consequences using <code>bcftools csq</code> with the <a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a> GRCh38 release 115 gene annotation (GFF3). </p> <p> The SPARK WES and WGS sites VCFs were rebuilt for this track so each variant carries phenotype-stratified counts in addition to overall AC/AN/AF. The split uses the <code>asd</code> column of the SPARK <code>individuals_registration</code> TSV via <code>bcftools +fill-tags -S</code>, producing AC_AUT / AN_AUT / AF_AUT and AC_NON_AUT / AN_NON_AUT / AF_NON_AUT. SCHEMA was processed the same way, summing AC_CASE/AN_CASE/AF_CASE and AC_CTRL/AN_CTRL/AF_CTRL across its 39 analysis cohorts. GREGoR ships AC/AN/AF triples for affected, unaffected and unknown disease status directly in its release. </p> <p> The track's <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt" target="_blank">makeDoc file</a> documents how each source VCF was converted. Scripts are available from <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs" target="_blank">GitHub</a>. 
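The strip, normalize, merge and annotate steps described above can be sketched as the command lists below. This is a hedged illustration, not the track's build script: the file names, the helper <code>build_pipeline_commands</code>, the intermediate <code>merged.vcf.gz</code>, and the specific options chosen (<code>annotate -x INFO</code>, <code>norm -m -any</code>, <code>merge -m none</code>) are assumptions, and the makeDoc file remains the authoritative record.
</p>

```python
from typing import List


def build_pipeline_commands(source_vcfs: List[str], ref_fasta: str, gff3: str,
                            out_vcf: str) -> List[List[str]]:
    """Build bcftools argument lists for strip -> norm -> merge -> csq.

    Nothing is executed here; all paths are hypothetical placeholders and
    the option choices are assumptions, not the track's actual settings.
    """
    commands: List[List[str]] = []
    normalized: List[str] = []
    for vcf in source_vcfs:
        stripped = vcf.replace(".vcf.gz", ".noinfo.vcf.gz")
        normed = vcf.replace(".vcf.gz", ".norm.vcf.gz")
        # Remove the INFO fields, as described in Methods.
        commands.append(["bcftools", "annotate", "-x", "INFO",
                         "-Oz", "-o", stripped, vcf])
        # Split multi-allelic sites into one record per alternate allele.
        commands.append(["bcftools", "norm", "-m", "-any", "-f", ref_fasta,
                         "-Oz", "-o", normed, stripped])
        # bcftools merge reads indexed inputs.
        commands.append(["bcftools", "index", "-t", normed])
        normalized.append(normed)
    merged = "merged.vcf.gz"  # hypothetical intermediate file
    # "-m none" keeps the split records separate instead of re-joining them.
    commands.append(["bcftools", "merge", "-m", "none",
                     "-Oz", "-o", merged, *normalized])
    # Annotate predicted protein consequences against the GFF3 gene models.
    commands.append(["bcftools", "csq", "-f", ref_fasta, "-g", gff3,
                     "-Oz", "-o", out_vcf, merged])
    return commands
```

<p>
Each returned list can be passed to <code>subprocess.run(cmd, check=True)</code> once bcftools is installed and the placeholder paths point at real files.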
</p> <h2>Data Access</h2> <p> The data can be explored interactively with the <a href="../cgi-bin/hgTables">Table Browser</a> or the <a href="../cgi-bin/hgIntegrator">Data Integrator</a>. For programmatic access, our <a href="https://api.genome.ucsc.edu" target="_blank">REST API</a> can be used; the track name is <em>varFreqsDisease</em>. </p> <p> Because the merged callset includes data from multiple sources whose redistribution licenses differ, the combined bigBed is <b>not available for download</b> from our download server. The combined track can be reconstructed from the individual source VCFs using the <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs" target="_blank">conversion scripts on GitHub</a> together with the <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt" target="_blank">build documentation</a>. </p> <h2>Credits</h2> <p> This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click on any of the individual tracks in the <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> supertrack to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and to Andreas Lahner, MGZ, for feedback. </p> <h2>References</h2> <p> For primary citations of each source dataset, see the References section on the <a href="hgTrackUi?g=varFreqs">SNV Frequencies</a> supertrack page. The merged-track build itself uses the following tools: </p> <p> Danecek P, McCarthy SA. <a href="https://doi.org/10.1093/bioinformatics/btx100" target="_blank"> BCFtools/csq: haplotype-aware variant consequences</a>. <em>Bioinformatics</em>. 2017 Jul 1;33(13):2037-2039. 
PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/28205675" target="_blank">28205675</a>; PMC: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870570/" target="_blank">PMC5870570</a> </p> <p> McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. <a href="https://doi.org/10.1186/s13059-016-0974-4" target="_blank"> The Ensembl Variant Effect Predictor</a>. <em>Genome Biol</em>. 2016 Jun 6;17(1):122. PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/27268795" target="_blank">27268795</a>; PMC: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4893825/" target="_blank">PMC4893825</a> </p>
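<p>
The amino acid change notation described under Display Conventions can be read programmatically. The sketch below is an illustration only: the names <code>AaChange</code>, <code>parse_aa_change</code> and <code>is_stop_gained</code> are hypothetical, and the parser assumes the field holds either <code>23I</code> or <code>23I>23V</code> style strings with a literal ">" character.
</p>

```python
import re
from typing import NamedTuple, Optional


class AaChange(NamedTuple):
    ref_pos: int
    ref_aa: str
    alt_pos: Optional[int]  # None when the residue is unchanged (synonymous)
    alt_aa: Optional[str]


# One side of the notation: a position followed by residue letters or "*".
_SIDE = re.compile(r"^(\d+)([A-Z*]+)$")


def parse_aa_change(text: str) -> AaChange:
    """Parse strings such as '23I>23V' or '23I' into their components."""
    ref_side, sep, alt_side = text.strip().partition(">")
    ref = _SIDE.match(ref_side)
    if ref is None:
        raise ValueError(f"unrecognized AA change: {text!r}")
    if not sep:
        # No '>' present: the position is listed but the residue is unchanged.
        return AaChange(int(ref.group(1)), ref.group(2), None, None)
    alt = _SIDE.match(alt_side)
    if alt is None:
        raise ValueError(f"unrecognized AA change: {text!r}")
    return AaChange(int(ref.group(1)), ref.group(2),
                    int(alt.group(1)), alt.group(2))


def is_stop_gained(change: AaChange) -> bool:
    """True when a stop codon ('*') appears only on the alternate side."""
    return (change.alt_aa is not None
            and "*" in change.alt_aa
            and "*" not in change.ref_aa)
```

<p>
For example, <code>parse_aa_change("45R>45*")</code> yields a change for which <code>is_stop_gained</code> is true, while <code>parse_aa_change("23I")</code> has no alternate residue.
</p>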