366afa4a74c46ec6fb2b667a2902a873feec40cf max Mon Apr 20 23:00:05 2026 -0700

varFreqsAll: rebuild combined bigBed to include GA4K and CoLoRSdb

Regenerate the All Databases Combined track with the two long-read
PacBio subtracks (GA4K 552 samples and CoLoRSdb v1.2.0 1,027 samples)
that were added to varFreqs since the March build. Source count rises
from 21 to 23 databases; final bigBed is 37.7 GB with 1.17B records and
113 fields. Updates varFreqs.ra filterValues.sources and per-database
AF/AC filters for the two new sources, and databases.tsv + varFreqs.txt
(build notes).

refs #36642

diff --git src/hg/makeDb/doc/hg38/varFreqs.txt src/hg/makeDb/doc/hg38/varFreqs.txt
index 0dcc1232735..990cf38c473 100644
--- src/hg/makeDb/doc/hg38/varFreqs.txt
+++ src/hg/makeDb/doc/hg38/varFreqs.txt
@@ -250,15 +250,40 @@
 # /hive/data/genomes/hg38/bed/lrSv/colorsDb/ (placed there when the
 # CoLoRSdb SV track was first built under lrSv). We just add VCF
 # symlinks under each assembly's varFreqs directory using a consistent
 # filename so the shared trackDb stanza can use $D.

 mkdir -p /gbdb/hg38/varFreqs/colorsDb /gbdb/hs1/varFreqs/colorsDb
 ln -sf /hive/data/genomes/hg38/bed/lrSv/colorsDb/CoLoRSdb.GRCh38.v1.2.0.deepvariant.glnexus.vcf.gz /gbdb/hg38/varFreqs/colorsDb/colorsDbSnv.vcf.gz
 ln -sf /hive/data/genomes/hg38/bed/lrSv/colorsDb/CoLoRSdb.GRCh38.v1.2.0.deepvariant.glnexus.vcf.gz.tbi /gbdb/hg38/varFreqs/colorsDb/colorsDbSnv.vcf.gz.tbi
 ln -sf /hive/data/genomes/hg38/bed/lrSv/colorsDb/CoLoRSdb.CHM13.v1.2.0.deepvariant.glnexus.vcf.gz /gbdb/hs1/varFreqs/colorsDb/colorsDbSnv.vcf.gz
 ln -sf /hive/data/genomes/hg38/bed/lrSv/colorsDb/CoLoRSdb.CHM13.v1.2.0.deepvariant.glnexus.vcf.gz.tbi /gbdb/hs1/varFreqs/colorsDb/colorsDbSnv.vcf.gz.tbi

 # The varFreqs.ra trackDb file is already in human/ (shared for both
 # hg38 and hs1 via the human/trackDb.ra include), so no move was needed.
 # Only colorsDbSnv is expected to render on hs1 - the other varFreqs
 # subtracks have hg38-only data and will silently show nothing there.
+
+##########
+# 2026-04-20 Claude max
+#
+# Rebuilt varFreqsAll combined bigBed to include GA4K and CoLoRSdb
+# long-read PacBio subtracks that were added to varFreqs since the
+# last build (Mar 20).
+#
+# Steps (in /hive/data/genomes/hg38/bed/varFreqs/all):
+#   1. Added GA4K and CoLoRSdb rows to
+#      ~/kent/src/hg/makeDb/scripts/varFreqs/databases.tsv
+#      and appended their /gbdb paths to files.txt.
+#   2. Deleted merged.vcf.gz and merged.annotated.vcf.gz to force a full
+#      merge + bcftools csq re-annotation (per-sample normalized VCFs
+#      from the previous run were kept; only the two new VCFs were
+#      normalized in Step 3).
+#   3. Ran ./mergeAndAnnotate.sh (~55 min: 5 min per-file, ~15 min merge,
+#      ~35 min csq).
+#   4. Ran ./vcfToBigBed.py --output-prefix varFreqsAll --threads 8
+#      (Phase 1 pre-extract ~90 min, Phase 2 chrom BED build ~30 min).
+#   5. bedToBigBed on 275 GB sorted BED (~2 h) to produce 37.7 GB
+#      varFreqsAll.bb with 1,165,666,478 records and 113 fields.
+#   6. Updated varFreqs.ra filterValues.sources and added
+#      filterByRange.GA4KAF/AC and filterByRange.CoLoRSdbAF/AC.
+#      Existing /gbdb/hg38/varFreqs/varFreqsAll.bb symlink was preserved.
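
The record and field counts quoted in step 5 can be re-checked after a rebuild. The following is only a sketch, not part of the commit: `check_bb_counts` is a hypothetical helper name, and it assumes `bigBedInfo` prints `itemCount:` (with thousands separators) and `fieldCount:` lines, one per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the kent tree): read bigBedInfo-style
# "key: value" lines on stdin and compare them with the expected counts.
# Usage: check_bb_counts EXPECTED_ITEMS EXPECTED_FIELDS
check_bb_counts() {
    awk -v wantItems="$1" -v wantFields="$2" '
        # itemCount is assumed to carry thousands separators; strip them.
        $1 == "itemCount:"  { gsub(",", "", $2); items = $2 }
        $1 == "fieldCount:" { fields = $2 }
        END {
            if (items == wantItems && fields == wantFields) {
                print "OK"
                exit 0
            }
            printf "MISMATCH items=%s fields=%s\n", items, fields
            exit 1
        }'
}
```

It could be used as `bigBedInfo /gbdb/hg38/varFreqs/varFreqsAll.bb | check_bb_counts 1165666478 113`, which should print `OK` if the file matches the counts recorded in the build notes.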