src/hg/makeDb/trackDb/human/varFreqsAll.html 65da29c9d74d4dd832ab7f16899ad3b209b92da4

65da29c9d74d4dd832ab7f16899ad3b209b92da4
max
  Wed May 6 08:43:57 2026 -0700
varFreqs: 5 vcfToBigBed.py fixes + add NPM Singapore to combined track

vcfToBigBed.py and mergeAndAnnotate.sh moved into kent (they were
hive-only); the build is now reproducible from a fresh kent checkout.

Five vcfToBigBed.py fixes (all caught by Lou's QA pass on #36642):

- normalize_consequence(): bcftools csq emits "&"-joined compound terms
like "stop_gained&frameshift", which failed exact matching against the
old 8-bucket consequence filter and orphaned ~8.5M records. The function
now rewrites "&" to "," so a single record can match multiple buckets,
and appends ",others" to any token list with no named-filter token.
Trackdb gains 4 buckets (3' UTR, 5' UTR, Non-coding, Other) and switches
to filterType.consequence multipleListOr.

- Source-attribution bug: the old check only inspected the unified AC/AF
slot. AllOfUs ships only per-population fields ("." in the unified
slot), so all 67M+ AllOfUs variants got no source attribution -- ~43M
rows in the previous bigBed had an empty "sources" column. The fix
scans per-population slots before declaring "no data".

- parse_bcsq() returns "" instead of "." for aaChange/dnaChange on
non-coding variants, so the mouseOver and detail page render a clean
blank line.

- maxAF format: "{:.6g}" -> "{:.6f}" so very small AFs render as
"0.000003" instead of "3.31347e-06".

- autoSql `table varFreqs` -> `table varFreqsAll` (matches the bigBed
filename; required for hgIntegrator wiring).
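
A minimal sketch of the normalize_consequence() and maxAF fixes
described above, for illustration only. It is not the code in
vcfToBigBed.py: the NAMED_TOKENS set, the exact token spellings, and
the format_max_af() helper are assumptions made for this example.

```python
# Hypothetical sketch: token spellings below are assumed, not copied
# from vcfToBigBed.py or the trackDb filter definition.
NAMED_TOKENS = frozenset({
    "missense", "synonymous", "stop_gained", "frameshift",
    "splice_donor", "splice_acceptor", "intron",
    "3_prime_utr", "5_prime_utr", "non_coding", "intergenic",
})


def normalize_consequence(csq: str) -> str:
    """Turn an '&'-joined bcftools csq term into a comma-separated list.

    If none of the resulting tokens belongs to a named filter bucket,
    append "others" so the record still matches one bucket.
    """
    tokens = [t for t in csq.replace("&", ",").split(",") if t]
    if not any(t in NAMED_TOKENS for t in tokens):
        tokens.append("others")
    return ",".join(tokens)


def format_max_af(af: float) -> str:
    """Render an allele frequency in fixed-point, never scientific, notation."""
    return "{:.6f}".format(af)


if __name__ == "__main__":
    print(normalize_consequence("stop_gained&frameshift"))
    print(normalize_consequence("splice_region"))
    print(format_max_af(3.31347e-06))
```

Under these assumptions a compound term such as
"stop_gained&frameshift" becomes "stop_gained,frameshift", which a
multipleListOr filter can match on either token, while a term with no
named token gains the catch-all "others" token.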

NPM Singapore (SG10K_Health, 9.7k WGS) added to databases.tsv,
files.txt, populations.tsv (SgChinese / SgMalay / SgIndian) and the
trackDb filter UI. The NPM individual subtrack stays tableBrowser off
(license); its data is folded into varFreqsAll the same way as finngen /
kova / mgrb / swefreq / tishkoff180.

varFreqsAll bigBed rebuild is in progress at /hive/data/genomes/hg38/
bed/varFreqs/all/; will land in /gbdb when the bedToBigBed step
completes.

refs #36642

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

diff --git src/hg/makeDb/trackDb/human/varFreqsAll.html src/hg/makeDb/trackDb/human/varFreqsAll.html
index 8ebfc217aaa..849c9abd374 100644
--- src/hg/makeDb/trackDb/human/varFreqsAll.html
+++ src/hg/makeDb/trackDb/human/varFreqsAll.html
@@ -1,20 +1,20 @@
 <h2>Description</h2>
 <p>
 This track merges variants from all individual variant frequency databases into a single
 bigBed file with predicted protein consequences and cross-database filtering. It contains
-over 1.1 billion variants from 25 source databases worldwide. For a summary of
+over 1.1 billion variants from 26 source databases worldwide. For a summary of
 all available databases, see the
 <a href="hgTrackUi?g=varFreqs">Variant Frequencies</a> supertrack page.
 </p>
 
 <p>
 Each variant is annotated with its predicted consequence on protein-coding genes
 (using <a href="https://samtools.github.io/bcftools/howtos/csq-calling.html"
 target="_blank">bcftools csq</a> with
 <a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a>
 gene models), and colored by severity.
 Allele counts and frequencies are shown for each source database and, where available,
 broken down by ancestry or population group.
 </p>
 
 <h2>Display Conventions</h2>
@@ -51,31 +51,48 @@
 23 changed from Isoleucine (I) to Valine (V) (missense). <b>23I</b> alone (no arrow)
 means position 23 is Isoleucine and unchanged (synonymous). A &quot;*&quot; indicates a
 stop codon (e.g. 45R&gt;45* is a stop_gained).
 </p>
 
 <h2>Filters</h2>
 <p>
 This track supports extensive filtering via the track settings page. Click on the track
 title or use the &quot;Configure&quot; button to access filters:
 </p>
 
 <h3>Variant Type and Consequence</h3>
 <ul>
   <li><b>Variant Type</b>: Filter by SNV, Insertion, Deletion, or MNV.</li>
   <li><b>Consequence</b>: Filter by predicted consequence (Missense, Synonymous, Stop Gained,
-      Frameshift, Splice Donor, Splice Acceptor, Intron, Intergenic).</li>
+      Frameshift, Splice Donor, Splice Acceptor, Intron, 3' UTR, 5' UTR, Non-coding,
+      Intergenic, Other). The filter uses OR logic across the comma-separated consequence
+      tokens on each variant: a variant tagged
+      <code>stop_gained,frameshift</code> is selected by either the &quot;Stop Gained&quot;
+      or the &quot;Frameshift&quot; filter. The &quot;Other&quot; bucket catches the less
+      common <a href="http://www.sequenceontology.org/" target="_blank">Sequence Ontology</a>
+      consequence terms emitted by <code>bcftools csq</code> that don't fit the named
+      buckets above &mdash; for example
+      <code>splice_region</code> (variant near a splice site but outside the canonical
+      donor/acceptor),
+      <code>start_lost</code> / <code>stop_lost</code> (variant disrupts the start codon
+      or replaces the stop codon with a coding amino acid),
+      <code>stop_retained</code> (variant changes the stop codon but keeps it a stop),
+      <code>inframe_insertion</code> / <code>inframe_deletion</code> (in-frame indel
+      adding or removing whole codons), and
+      <code>coding_sequence</code> (CDS variant where the precise impact is undetermined).
+      Selecting every bucket, including &quot;Other&quot;, guarantees that no records are
+      hidden by the consequence filter.</li>
 </ul>
 
 <p><b>How to find protein-truncating variants:</b> Set the Consequence filter to include
 only &quot;Stop Gained&quot;, &quot;Frameshift&quot;, &quot;Splice Donor&quot;, and
 &quot;Splice Acceptor&quot;. These will appear as red items in the track display.</p>
 
 <h3>Frequency and Count Filters</h3>
 <ul>
   <li><b>Max Allele Frequency</b>: Filter by the maximum allele frequency observed across
       all databases. Useful for finding rare variants (e.g., set max to 0.01 for variants
       with AF &lt; 1% in all databases).</li>
   <li><b>Total Allele Count</b>: Filter by the sum of allele counts across all databases.
       Useful for excluding singletons (e.g., set minimum to 2 to remove AC=1 variants
       that may be sequencing errors).</li>
   <li><b>Per-database AF and AC</b>: Filter by allele frequency or allele count in any
@@ -100,31 +117,31 @@
   <li><b>GenomeAsia</b>: Northeast Asian, Southeast Asian, South Asian</li>
   <li><b>gnomAD HGDP+1kG</b>: African, Amish, Latino, Ashkenazi Jewish, East Asian, Finnish,
       Middle Eastern, Non-Finnish European, Other, South Asian</li>
   <li><b>GREGoR</b>: Affected, Unaffected, Unknown (disease status, not ancestry)</li>
 </ul>
 
 <h3>Length Filters</h3>
 <ul>
   <li><b>Reference/Alternate Length</b>: Filter by the length of the reference or alternate allele.</li>
   <li><b>Length Change</b>: Filter by the size difference between alternate and reference
       (positive = insertion, negative = deletion, zero = SNV or MNV).</li>
 </ul>
 
 <h2>Methods</h2>
 <p>
-Variant frequency VCF files from 25 databases were stripped of their INFO fields
+Variant frequency VCF files from 26 databases were stripped of their INFO fields
 (to reduce size), normalized with <code>bcftools norm</code> (splitting multi-allelic sites),
 and merged with <code>bcftools merge</code>. The merged VCF was then annotated with predicted
 protein consequences using <code>bcftools csq</code> with the
 <a href="https://www.ensembl.org/info/data/ftp/index.html" target="_blank">Ensembl</a>
 GRCh38 release 115 gene annotation (GFF3).
 </p>
 
 <p>
 The annotated VCF was converted to bigBed format using a custom Python script
 (<code>vcfToBigBed.py</code>) that reads frequency data from each source VCF in parallel,
 matches variants by position/ref/alt, and writes a BED file with consequence coloring,
 per-database allele counts and frequencies, and population breakdowns.
 The database configuration (which VCFs to include, field mappings, and population definitions)
 is stored in two TSV files
 (<a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs/databases.tsv"