695f40f9d6139a4df393522c067f1702aff8d3bd max Wed Apr 22 03:13:39 2026 -0700

varFreqs: add SVatalog 101 short-read SNV frequencies subtrack

SNV/indel allele frequencies from the 101-sample GWAS SVatalog cohort
(Chirmade et al. 2026, Heredity, PMID 41203876), called from 10X Genomics
linked short-read WGS with GATK HaplotypeCaller v4.0.0.0 and phased with
SHAPEIT v4.2.0. Sibling of the lrSv chirmade101Sv structural-variant
track, which is built from the same 101 samples.

8,814,835 autosomal + chrX sites. Source release ships only AF; AC and AN
are synthesized in the emitted VCF as AC=round(AF*202) and AN=202 (2*101
diploid), with the gnomAD v3.1 non-Finnish European AF and dbSNP rsID
passed through as GNOMAD_NFE_AF and RSID info fields. VCF is bgzipped +
tabix-indexed (172 MB + 1.6 MB .tbi).

Files:
- scripts/varFreqs/svatalogFreqToVcf.py (new): per-chrom allele-freq TSV
  -> single VCF with hg38 ##contig header
- trackDb/human/varFreqs.ra: new svatalogSnv vcfTabix subtrack
- trackDb/human/svatalogSnv.html (new): doc page
- trackDb/human/varFreqs.html: new row in Available Datasets table
- doc/hg38/varFreqs.txt: wget-free build block (input files were
  downloaded manually from Zenodo 13367574)

Note: the All Databases Combined varFreqs bigBed has NOT been rebuilt to
include this new source yet; a subsequent merge pass will add it.

refs #36258

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

diff --git src/hg/makeDb/doc/hg38/varFreqs.txt src/hg/makeDb/doc/hg38/varFreqs.txt
index 990cf38c473..f93c73f4f26 100644
--- src/hg/makeDb/doc/hg38/varFreqs.txt
+++ src/hg/makeDb/doc/hg38/varFreqs.txt
@@ -275,15 +275,38 @@
 #    ~/kent/src/hg/makeDb/scripts/varFreqs/databases.tsv
 #    and appended their /gbdb paths to files.txt.
 # 2. Deleted merged.vcf.gz and merged.annotated.vcf.gz to force a full
 #    merge + bcftools csq re-annotation (per-sample normalized VCFs
 #    from the previous run were kept; only the two new VCFs were
 #    normalized in Step 4).
 # 3. Ran ./mergeAndAnnotate.sh (~55 min: 5 min per-file, ~15 min merge,
 #    ~35 min csq).
 # 4. Ran ./vcfToBigBed.py --output-prefix varFreqsAll --threads 8
 #    (Phase 1 pre-extract ~90 min, Phase 2 chrom BED build ~30 min).
 # 5. bedToBigBed on 275 GB sorted BED (~2 h) to produce 37.7 GB
 #    varFreqsAll.bb with 1,165,666,478 records and 113 fields.
 # 6. Updated varFreqs.ra filterValues.sources and added
 #    filterByRange.GA4KAF/AC and filterByRange.CoLoRSdbAF/AC.
 #    Existing /gbdb/hg38/varFreqs/varFreqsAll.bb symlink was preserved.
+
+# 2026-04-22 Claude max
+# GWAS SVatalog small-variant (SNV/indel) allele frequencies from the 101
+# SVatalog samples, sibling of the lrSv chirmade101Sv structural-variant
+# track. Paper: Chirmade et al. 2026, Heredity, PMID 41203876.
+# SNPs were called from 10X Genomics linked short-read WGS of the 101
+# samples with GATK HaplotypeCaller v4.0.0.0 and phased with SHAPEIT v4.2.0.
+# Data: the per-chromosome allele-frequency text files were downloaded into
+# /hive/data/genomes/hg38/bed/varFreqs/svCatalog/ alongside the LD-stats
+# files (see the companion Zenodo deposit 13367574).
+cd /hive/data/genomes/hg38/bed/varFreqs/svCatalog/
+# Convert the 23 per-chrom *_allele_freq.txt files to a single sites-only
+# VCF with AF/AC/AN plus gnomAD v3.1 NFE AF and dbSNP RSID as INFO fields.
+# AC is synthesized as round(AF * 202) and AN is fixed at 202 since the
+# source release does not ship AC/AN.
+python3 ~/kent/src/hg/makeDb/scripts/varFreqs/svatalogFreqToVcf.py \
+    svatalog.vcf chr{1..22}_allele_freq.txt chrX_allele_freq.txt
+bcftools sort svatalog.vcf -Oz -m 16G -T /tmp/ -o svatalog.vcf.gz
+tabix -p vcf svatalog.vcf.gz
+rm -f svatalog.vcf
+# 8,814,835 variants -> 172 MB bgzipped + 1.6 MB tabix index.
+# Symlinks placed under /gbdb/hg38/varFreqs/svatalog/ for the svatalogSnv
+# stanza in trackDb/human/varFreqs.ra.
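
The AC/AN synthesis described in the commit message can be illustrated with a minimal Python sketch. This is not the code of svatalogFreqToVcf.py, which is not shown in this diff. The helper names synth_ac_an and info_field are hypothetical, and the example values are made up. Only the formulas AC=round(AF*202) and AN=202 and the INFO keys AF, AC, AN, GNOMAD_NFE_AF and RSID come from the commit text. One caveat if the script relies on Python 3's built-in round(): exact ties go to the nearest even integer, so AF=0.25 yields AC=50 rather than 51.

```python
N_SAMPLES = 101          # cohort size stated in the commit message
AN = 2 * N_SAMPLES       # 202 alleles, assuming every sample is diploid at every site


def synth_ac_an(af, an=AN):
    """Return (AC, AN) synthesized from an allele frequency (hypothetical helper)."""
    if not 0.0 <= af <= 1.0:
        raise ValueError(f"allele frequency out of range: {af}")
    # Built-in round() uses round-half-to-even on exact ties.
    return round(af * an), an


def info_field(af, gnomad_nfe_af=None, rsid=None):
    """Build a VCF INFO string; optional fields are omitted when missing."""
    ac, an = synth_ac_an(af)
    parts = [f"AF={af:g}", f"AC={ac}", f"AN={an}"]
    if gnomad_nfe_af is not None:
        parts.append(f"GNOMAD_NFE_AF={gnomad_nfe_af:g}")
    if rsid:
        parts.append(f"RSID={rsid}")
    return ";".join(parts)


if __name__ == "__main__":
    # Illustrative values only; not taken from the SVatalog release.
    print(info_field(0.5, gnomad_nfe_af=0.48, rsid="rs0000"))
```

Because AN is fixed rather than counted, the synthesized AC is only as precise as the AF shipped in the source files. Any site with missing genotypes in the original call set would have a true AN below 202.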