src/hg/makeDb/doc/hg38/varFreqs.txt 695f40f9d6139a4df393522c067f1702aff8d3bd

695f40f9d6139a4df393522c067f1702aff8d3bd
max
  Wed Apr 22 03:13:39 2026 -0700
varFreqs: add SVatalog 101 short-read SNV frequencies subtrack

SNV/indel allele frequencies from the 101-sample GWAS SVatalog cohort
(Chirmade et al. 2026, Heredity, PMID 41203876), called from 10X
Genomics linked short-read WGS with GATK HaplotypeCaller v4.0.0.0 and
phased with SHAPEIT v4.2.0. Sibling of the lrSv chirmade101Sv
structural-variant track, which is built from the same 101 samples.

8,814,835 autosomal + chrX sites. Source release ships only AF; AC and
AN are synthesized in the emitted VCF as AC=round(AF*202) and AN=202
(2*101 diploid), with the gnomAD v3.1 non-Finnish European AF and dbSNP
rsID passed through as GNOMAD_NFE_AF and RSID info fields. VCF is
bgzipped + tabix-indexed (172 MB + 1.6 MB .tbi).
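
The AC/AN synthesis described above can be sketched as follows. This is a
minimal illustration, not the code in svatalogFreqToVcf.py: the helper names
synth_ac_an and info_field are hypothetical, and it assumes Python's built-in
round(), which rounds exact halves to the nearest even integer.

```python
# Hypothetical sketch of the AC/AN synthesis; the real script may differ.
N_SAMPLES = 101
AN = 2 * N_SAMPLES  # 202 alleles, treating every sample as diploid


def synth_ac_an(af):
    """Return (AC, AN) synthesized from an allele frequency in [0, 1]."""
    if not 0.0 <= af <= 1.0:
        raise ValueError(f"AF out of range: {af}")
    # Built-in round() uses round-half-to-even, so 50.5 becomes 50.
    return round(af * AN), AN


def info_field(af):
    """Format the AF/AC/AN portion of a VCF INFO column."""
    ac, an = synth_ac_an(af)
    return f"AF={af:g};AC={ac};AN={an}"
```

Because AN is fixed at 202, the synthesized AC is only an approximation
wherever the true number of called alleles at a site was lower.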

Files:
- scripts/varFreqs/svatalogFreqToVcf.py (new): per-chrom allele-freq
TSV -> single VCF with hg38 ##contig header
- trackDb/human/varFreqs.ra: new svatalogSnv vcfTabix subtrack
- trackDb/human/svatalogSnv.html (new): doc page
- trackDb/human/varFreqs.html: new row in Available Datasets table
- doc/hg38/varFreqs.txt: wget-free build block (input files were
downloaded manually from Zenodo 13367574)

Note: the All Databases Combined varFreqs bigBed has NOT been rebuilt
to include this new source yet; a subsequent merge pass will add it.

refs #36258

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

diff --git src/hg/makeDb/doc/hg38/varFreqs.txt src/hg/makeDb/doc/hg38/varFreqs.txt
index 990cf38c473..f93c73f4f26 100644
--- src/hg/makeDb/doc/hg38/varFreqs.txt
+++ src/hg/makeDb/doc/hg38/varFreqs.txt
@@ -275,15 +275,38 @@
 #      ~/kent/src/hg/makeDb/scripts/varFreqs/databases.tsv
 #    and appended their /gbdb paths to files.txt.
 # 2. Deleted merged.vcf.gz and merged.annotated.vcf.gz to force a full
 #    merge + bcftools csq re-annotation (per-sample normalized VCFs
 #    from the previous run were kept; only the two new VCFs were
 #    normalized in Step 4).
 # 3. Ran ./mergeAndAnnotate.sh (~55 min: 5 min per-file, ~15 min merge,
 #    ~35 min csq).
 # 4. Ran ./vcfToBigBed.py --output-prefix varFreqsAll --threads 8
 #    (Phase 1 pre-extract ~90 min, Phase 2 chrom BED build ~30 min).
 # 5. bedToBigBed on 275 GB sorted BED (~2 h) to produce 37.7 GB
 #    varFreqsAll.bb with 1,165,666,478 records and 113 fields.
 # 6. Updated varFreqs.ra filterValues.sources and added
 #    filterByRange.GA4KAF/AC and filterByRange.CoLoRSdbAF/AC.
 # Existing /gbdb/hg38/varFreqs/varFreqsAll.bb symlink was preserved.
+
+# 2026-04-22 Claude max
+# GWAS SVatalog small-variant (SNV/indel) allele frequencies from the 101
+# SVatalog samples, sibling of the lrSv chirmade101Sv structural-variant
+# track. Paper: Chirmade et al. 2026, Heredity, PMID 41203876.
+# SNVs/indels were called from 10X Genomics linked short-read WGS of the 101
+# samples with GATK HaplotypeCaller v4.0.0.0 and phased with SHAPEIT v4.2.0.
+# Data: the per-chromosome allele-frequency text files were downloaded
+# manually into /hive/data/genomes/hg38/bed/varFreqs/svCatalog/ alongside the
+# LD-stats files (see the companion Zenodo deposit 13367574).
+cd /hive/data/genomes/hg38/bed/varFreqs/svCatalog/
+# Convert the 23 per-chrom *_allele_freq.txt files to a single sites-only
+# VCF with AF/AC/AN plus gnomAD v3.1 NFE AF and dbSNP RSID as INFO fields.
+# AC is synthesized as round(AF * 202) and AN is fixed at 202 (2 * 101 diploid
+# samples) because the source release does not ship AC/AN.
+python3 ~/kent/src/hg/makeDb/scripts/varFreqs/svatalogFreqToVcf.py \
+    svatalog.vcf chr{1..22}_allele_freq.txt chrX_allele_freq.txt
+bcftools sort svatalog.vcf -Oz -m 16G -T /tmp/ -o svatalog.vcf.gz
+tabix -p vcf svatalog.vcf.gz
+rm -f svatalog.vcf
+# 8,814,835 variants -> 172 MB bgzipped + 1.6 MB tabix index.
+# Symlinks placed under /gbdb/hg38/varFreqs/svatalog/ for the svatalogSnv
+# stanza in trackDb/human/varFreqs.ra.