ec5c73f4dc3ef4beae16fa1c12b7e5bf872bb73d lrnassar Tue May 5 15:04:39 2026 -0700

varFreqs: fix gaspIndel bigDataUrl after Max's GenomeAsia hg38 lift; add Tishkoff180 to combined-track filter UI; sync databases.tsv with deployed bigBed; minor description-page corrections. refs #36642

GenomeAsia hg38 lift (May 5 2026, by Max):
- gaspIndel.bigDataUrl was pointing at the old GRCh37 filename "All.indels.annot.cont_withmaf.vcf.gz" which was renamed to "ga100k.indels.vcf.gz" during the lift; this left the gaspIndel track broken on the sandbox until the trackdb stanza was updated to match.
- gasp/gaspIndel dataVersion strings updated from "Pilot 2019 (GRCh37 - to be lifted)" to "Pilot 2019 (lifted to hg38, May 2026)".
- databases.tsv: also updated GenomeAsiaIndel path to ga100k.indels.vcf.gz so the next varFreqsAll rebuild reads from the lifted file.

Tishkoff180 in varFreqsAll.bb but unfilterable (fresh-eyes audit finding):
- Added Tishkoff180 to filterValues.sources and added filterByRange.Tishkoff180AF / Tishkoff180AC entries.
- Added Tishkoff180 (and SVatalog) rows to databases.tsv to match the deployed bigBed (which already has those columns).

Description-page corrections:
- varFreqsAll.html: "20 population databases" -> "25 source databases" (matches actual count); HGDP+1kG bullet "European" -> "Non-Finnish European" to disambiguate from Finnish (gnomAD's nfe).
- varFreqs.html: GenomeAsia row in the Available Datasets table updated from 3 to 7 sub-populations (NEA/SEA/SAS plus the previously hidden OCE/AMR/AFR/WER) so the table matches what the data exposes once Max's rebuild populates the new filter columns.
- KOVA longLabel: "1.9k WGS+3.5k WES" -> "1.9k WGS+3.4k WES" (3.4k is correct per Lee 2017 and kova.html).
diff --git src/hg/makeDb/trackDb/human/varFreqsAll.html src/hg/makeDb/trackDb/human/varFreqsAll.html
index 75b48c59f13..8ebfc217aaa 100644
--- src/hg/makeDb/trackDb/human/varFreqsAll.html
+++ src/hg/makeDb/trackDb/human/varFreqsAll.html
@@ -1,20 +1,20 @@
 <h2>Description</h2>
 <p>
 This track merges variants from all individual variant frequency databases into a single bigBed file with predicted protein consequences and cross-database filtering. It contains
-over 1.1 billion variants from 20 population databases worldwide. For a summary of
+over 1.1 billion variants from 25 source databases worldwide. For a summary of
 all available databases, see the <a href="hgTrackUi?g=varFreqs">Variant Frequencies</a> supertrack page.
 </p>
 <p>
 Each variant is annotated with its predicted consequence on protein-coding genes (using <a href="https://samtools.github.io/bcftools/howtos/csq-calling.html" target="_blank">bcftools csq</a> with <a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a> gene models), and colored by severity. Allele counts and frequencies are shown for each source database and, where available, broken down by ancestry or population group.
 </p>
 <h2>Display Conventions</h2>
@@ -87,44 +87,44 @@ The <b>Source Database</b> filter lets you restrict to variants present in specific databases. For example, select only "GREGoR" to see variants found in the rare disease cohort. This filter uses OR logic: selecting multiple databases shows variants found in <em>any</em> of the selected databases.
 </p>
 <h3>Population-Specific Filters</h3>
 <p>
 Several databases provide ancestry-specific allele frequencies:
 </p>
 <ul>
 <li><b>AllOfUs</b>: African, Indigenous American, East Asian, European, Oceanian, South Asian (from local ancestry inference)</li>
 <li><b>GenomeAsia</b>: Northeast Asian, Southeast Asian, South Asian</li>
 <li><b>gnomAD HGDP+1kG</b>: African, Amish, Latino, Ashkenazi Jewish, East Asian, Finnish,
- Middle Eastern, European, Other, South Asian</li>
+ Middle Eastern, Non-Finnish European, Other, South Asian</li>
 <li><b>GREGoR</b>: Affected, Unaffected, Unknown (disease status, not ancestry)</li>
 </ul>
 <h3>Length Filters</h3>
 <ul>
 <li><b>Reference/Alternate Length</b>: Filter by the length of the reference or alternate allele.</li>
 <li><b>Length Change</b>: Filter by the size difference between alternate and reference (positive = insertion, negative = deletion, zero = SNV or MNV).</li>
 </ul>
 <h2>Methods</h2>
 <p>
-Variant frequency VCF files from 20 databases were stripped of their INFO fields
+Variant frequency VCF files from 25 databases were stripped of their INFO fields
 (to reduce size), normalized with <code>bcftools norm</code> (splitting multi-allelic sites), and merged with <code>bcftools merge</code>. The merged VCF was then annotated with predicted protein consequences using <code>bcftools csq</code> with the <a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a> GRCh38 release 115 gene annotation (GFF3).
 </p>
 <p>
 The annotated VCF was converted to bigBed format using a custom Python script (<code>vcfToBigBed.py</code>) that reads frequency data from each source VCF in parallel, matches variants by position/ref/alt, and writes a BED file with consequence coloring, per-database allele counts and frequencies, and population breakdowns.
The database configuration (which VCFs to include, field mappings, and population definitions) is stored in two TSV files (<a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs/databases.tsv"