68c5b3b5dfc4053ff78a6b1d236bd1ac90251cfa lrnassar Mon Jun 1 14:40:45 2026 -0700 varFreqs: description pages for the three combined tracks and "SNV" rename sweep. Add varFreqsDisease.html and varFreqsArray.html so the two new combined tracks have full Description/Display/Methods/Data Access/References. Add a Caveats section on varFreqsArray about chip-data quality vs sequencing. Update varFreqsAll.html and the supertrack varFreqs.html to reflect the three-combined-track family (cross-links between siblings, new "Combined Tracks" section, new table rows, and updated source/variant counts). Add a GoNL row to the supertrack table. Sweep 37 subtrack longLabels and four cross-referencing description pages (colorsDbSnv.html, mei.html, meiSwegen.html, phasedVars.html) from "Variant Frequencies:" to "SNV Frequencies:" to match the supertrack shortLabel. refs #36642 diff --git src/hg/makeDb/trackDb/human/varFreqsDisease.html src/hg/makeDb/trackDb/human/varFreqsDisease.html new file mode 100644 index 00000000000..613013eac96 --- /dev/null +++ src/hg/makeDb/trackDb/human/varFreqsDisease.html @@ -0,0 +1,198 @@ +
+This track merges variants from six disease-focused or clinically recruited sources into a +single bigBed file with predicted protein consequences and cross-database filtering. It +contains 932 million variants from SFARI SPARK (WES and WGS, counted as two sources; autism families), TOPMed +(NHLBI heart, lung and blood disease cohorts), SCHEMA (schizophrenia case/control), +GREGoR (rare-disease families), and GA4K (PacBio long-read pediatric rare disease). Where +the source dataset provides per-phenotype counts, those are exposed as separate AC/AF +columns and as filter widgets. +
+ ++For a summary of all available variant frequency databases, including the population-scale +control track and the genotyping-array track, see the +SNV Frequencies supertrack page. +
+ ++Each variant is annotated with its predicted consequence on protein-coding genes +(using bcftools csq with +Ensembl +gene models), and colored by severity. Allele counts and frequencies are shown for each +source database and, where available, broken down by phenotype. +
+ +Variants are colored by their most severe predicted consequence:
| Color | Consequence class | Examples |
|---|---|---|
| Red | Protein-truncating / Loss-of-function | stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost |
| Blue | Missense / In-frame | missense, inframe_insertion, inframe_deletion, protein_altering |
| Green | Synonymous | synonymous, stop_retained |
| Grey | Non-coding / Intergenic | intron, non_coding, intergenic, UTR |
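
The coloring rule in the table above can be sketched as a lookup followed by a "most severe wins" choice. This is a minimal illustration, not the track's build code: the names `CLASS_BY_TERM`, `SEVERITY_ORDER` and `color_for` are hypothetical, the severity ranking is assumed to follow the row order of the table, and any term not listed in the table is assumed to fall into the grey class.

```python
# Hypothetical lookup from consequence terms to the color classes in the
# table; the names here are illustrative, not taken from the build scripts.
SEVERITY_ORDER = ["red", "blue", "green", "grey"]  # assumed: most to least severe

CLASS_BY_TERM = {
    "stop_gained": "red", "frameshift": "red", "splice_donor": "red",
    "splice_acceptor": "red", "stop_lost": "red", "start_lost": "red",
    "missense": "blue", "inframe_insertion": "blue",
    "inframe_deletion": "blue", "protein_altering": "blue",
    "synonymous": "green", "stop_retained": "green",
}


def color_for(consequences):
    """Return the color class of the most severe consequence in the list."""
    # Assumption: terms missing from the table are treated as grey.
    classes = [CLASS_BY_TERM.get(term, "grey") for term in consequences]
    if not classes:
        return "grey"
    return min(classes, key=SEVERITY_ORDER.index)
```

For example, a variant annotated as both `synonymous` and `stop_gained` on different transcripts would be drawn in the red class under these assumptions.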
+The "AA change" field uses bcftools csq notation: 23I>23V means position +23 changed from Isoleucine (I) to Valine (V) (missense). 23I alone (no arrow) +means position 23 is Isoleucine and unchanged (synonymous). A "*" indicates a +stop codon (e.g. 45R>45* is a stop_gained). +
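
The notation described above is regular enough to parse with a short pattern. The sketch below covers only the three forms mentioned in the text (a change, an unchanged residue, and a stop codon written as "*"); the function name `parse_aa_change` and the returned dictionary keys are hypothetical, and other forms that bcftools csq may emit are not handled.

```python
import re

# "<position><ref>" optionally followed by "><position><alt>";
# "*" stands for a stop codon.
_AA_CHANGE = re.compile(r"^(\d+)([A-Z*]+)(?:>(\d+)([A-Z*]+))?$")


def parse_aa_change(text):
    """Split an 'AA change' string into position, reference and alternate residues."""
    match = _AA_CHANGE.match(text)
    if match is None:
        raise ValueError(f"unrecognized AA change: {text!r}")
    pos, ref, _alt_pos, alt = match.groups()
    changed = alt is not None  # no arrow means the residue is unchanged
    return {
        "position": int(pos),
        "ref": ref,
        "alt": alt if changed else ref,
        "changed": changed,
        "stop_gained": changed and alt == "*" and ref != "*",
    }
```

With this sketch, `parse_aa_change("23I")` reports an unchanged Isoleucine at position 23, and `parse_aa_change("45R>45*")` sets `stop_gained`.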
+ ++This track supports filtering via the track settings page. Click the track title or use the +"Configure" button to access filters. +
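+ ++Four of the six sources publish counts split by phenotype, which lets you compare allele +frequencies between affected and unaffected groups within the same cohort: SPARK WES and +SPARK WGS (autism versus non-autism, split on the asd column of the SPARK +individuals_registration TSV), SCHEMA (case versus control) and GREGoR (affected, +unaffected and unknown disease status). +
+The Source Database filter restricts the display to variants present in specific +databases. It uses OR logic: selecting multiple databases shows variants found in any of +the selected sources. +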
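
The OR logic of the Source Database filter amounts to a set-intersection test. The sketch below is only an illustration of that behavior: the function name and the source labels in the usage line are hypothetical, and treating an empty selection as "no restriction" is an assumption rather than documented browser behavior.

```python
def matches_source_filter(variant_sources, selected_sources):
    """OR logic: keep a variant if it occurs in at least one selected source."""
    if not selected_sources:
        # Assumption: with nothing selected, the filter does not restrict.
        return True
    return not set(variant_sources).isdisjoint(selected_sources)
```

For example, `matches_source_filter(["SCHEMA"], ["SCHEMA", "TOPMed"])` is true, because the variant is present in one of the two selected sources.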
+ +
+The same merge-and-annotate pipeline used for the
+All Databases Combined track was run on the
+disease-cohort subset of source VCFs. Each VCF was stripped of its INFO fields,
+normalized with bcftools norm (splitting multi-allelic sites), and merged with
+bcftools merge. The merged VCF was then annotated with predicted protein
+consequences using bcftools csq with the
+Ensembl
+GRCh38 release 115 gene annotation (GFF3).
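
The four steps described above (strip INFO, normalize, merge, annotate) can be sketched as a list of bcftools command lines. This is a rough outline under stated assumptions, not the commands recorded in the makeDoc: the file naming scheme, the choice of `-m -any` for splitting multi-allelic sites, the output options and the function name `build_pipeline` are all illustrative. The function only builds the argument lists and does not run anything.

```python
def build_pipeline(source_vcfs, ref_fasta, gff3, out_prefix="merged"):
    """Return bcftools command lines for strip -> norm -> merge -> csq."""
    commands = []
    normalized = []
    for vcf in source_vcfs:
        stem = vcf.removesuffix(".vcf.gz")  # hypothetical naming scheme
        stripped = f"{stem}.noinfo.vcf.gz"
        normed = f"{stem}.norm.vcf.gz"
        # Remove all INFO fields from the source VCF.
        commands.append(["bcftools", "annotate", "-x", "INFO",
                         "-Oz", "-o", stripped, vcf])
        # Split multi-allelic sites into one record per alternate allele.
        commands.append(["bcftools", "norm", "-m", "-any", "-f", ref_fasta,
                         "-Oz", "-o", normed, stripped])
        # bcftools merge needs indexed inputs.
        commands.append(["bcftools", "index", "-t", normed])
        normalized.append(normed)
    merged = f"{out_prefix}.vcf.gz"
    commands.append(["bcftools", "merge", "-Oz", "-o", merged, *normalized])
    # Annotate predicted protein consequences from the Ensembl GFF3.
    commands.append(["bcftools", "csq", "-f", ref_fasta, "-g", gff3,
                     "-Oz", "-o", f"{out_prefix}.csq.vcf.gz", merged])
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)`. The makeDoc remains the authoritative record of the options actually used.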
+
+The SPARK WES and WGS sites VCFs were rebuilt for this track so each variant carries
+phenotype-stratified counts in addition to overall AC/AN/AF. The split uses the
+asd column of the SPARK individuals_registration TSV via
+bcftools +fill-tags -S, producing AC_AUT / AN_AUT / AF_AUT and
+AC_NON_AUT / AN_NON_AUT / AF_NON_AUT. SCHEMA was processed the same way, summing
+AC_CASE/AN_CASE/AF_CASE and AC_CTRL/AN_CTRL/AF_CTRL across its 39 analysis cohorts.
+GREGoR ships AC/AN/AF triples for affected, unaffected and unknown disease status
+directly in its release.
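
The SPARK split relies on the sample-groups file accepted by `bcftools +fill-tags -S`, which lists one sample per line followed by its group; tags are then computed per group, giving names such as AC_AUT and AC_NON_AUT. The sketch below shows how such a file could be derived from the registration TSV. The column name `subject_sp_id`, the encoding of the `asd` column as "1"/"True" and both function names are assumptions made for illustration.

```python
import csv
import io


def sample_groups(tsv_text, id_col="subject_sp_id", asd_col="asd"):
    """Build 'sample<TAB>group' lines for bcftools +fill-tags -S."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        # Assumption: the asd column marks affected individuals as 1/True.
        is_asd = row[asd_col].strip().lower() in {"1", "true"}
        lines.append(f"{row[id_col]}\t{'AUT' if is_asd else 'NON_AUT'}")
    return lines


def fill_tags_command(in_vcf, out_vcf, groups_file):
    """Command line that adds per-group AN/AC/AF tags (not executed here)."""
    return ["bcftools", "+fill-tags", in_vcf, "-Oz", "-o", out_vcf,
            "--", "-S", groups_file, "-t", "AN,AC,AF"]
```

The lines returned by `sample_groups` would be written to a text file and passed as `groups_file`.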
+
+The track's +makeDoc file documents how each source VCF was converted. Scripts are +available from +GitHub. +
+ ++The data can be explored interactively with the +Table Browser or the +Data Integrator. For programmatic access, our +REST API can be used; the track +name is varFreqsDisease. +
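
As a minimal sketch of programmatic access, the helper below builds a request URL for the REST API's `/getData/track` endpoint. The genome name `hg38` is an assumption (the page only states that the annotation is GRCh38), and the function name and example coordinates are illustrative. The helper only constructs the URL, so it can be inspected before any request is sent.

```python
from urllib.parse import urlencode

API_BASE = "https://api.genome.ucsc.edu"


def track_data_url(chrom, start, end, genome="hg38", track="varFreqsDisease"):
    """Build a /getData/track URL for a region (0-based start, exclusive end)."""
    query = urlencode({
        "genome": genome,  # assumption: the track lives on hg38
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    })
    return f"{API_BASE}/getData/track?{query}"
```

The resulting URL can be fetched with `urllib.request.urlopen` or any HTTP client and the response decoded as JSON; keeping the region small is advisable given the number of variants in the track.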
++Because the merged callset includes data from multiple sources whose redistribution +licenses differ, the combined bigBed is not available for download from our download +server. The combined track can be reconstructed from the individual source VCFs using the +conversion scripts on GitHub together with the +build documentation. +
+ ++This track is only possible thanks to the data from millions of volunteers around the +world, who donated blood, signed consent forms and provided health information about +themselves and sometimes their families. Click on any of the individual tracks in the +SNV Frequencies supertrack to see the specific credits +for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and +to Andreas Lahner, MGZ, for feedback. +
+ ++For primary citations of each source dataset, see the References section on the +SNV Frequencies supertrack page. The merged-track +build itself uses the following tools: +
++Danecek P, McCarthy SA. + +BCFtools/csq: haplotype-aware variant consequences. +Bioinformatics. 2017 Jul 1;33(13):2037-2039. +PMID: 28205675; PMC: PMC5870570 +
++McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. + +The Ensembl Variant Effect Predictor. +Genome Biol. 2016 Jun 6;17(1):122. +PMID: 27268795; PMC: PMC4893825 +