6b0d68657267f1e02c47d4224ea62446bbbb2ba0 max Fri May 22 06:55:52 2026 -0700
small non-AI changes to the html docs pages of the long-read SV tracks

diff --git src/hg/makeDb/doc/hg38/lrSv.txt src/hg/makeDb/doc/hg38/lrSv.txt
index f19e8c2e813..67d923f950d 100644
--- src/hg/makeDb/doc/hg38/lrSv.txt
+++ src/hg/makeDb/doc/hg38/lrSv.txt
@@ -426,15 +426,57 @@
 #   directly (placed in /hive/data/genomes/hg38/bed/lrSv/colorsDb/).
 # The previous bigBed came from an older build and declared `af` as a
 # string; the new build uses a checked-in converter, stores AF as a
 # float so the numeric filter works, and adds a derived `insLen`
 # column so the shared lrSv supertrack-level filter.insLen does not
 # error for this subtrack.
 cd /hive/data/genomes/hg38/bed/lrSv/colorsDb
 # Upstream VCFs (same pbsv.jasmine release, one per reference path):
 #   CoLoRSdb.GRCh38.v1.2.0.pbsv.jasmine.vcf.gz (hg38, 426,239 SVs)
 #   CoLoRSdb.CHM13.v1.2.0.pbsv.jasmine.vcf.gz (hs1, 839,714 SVs)
 bash ~/kent/src/hg/makeDb/scripts/lrSv/lrSvColorsDbSvBuild.sh
 # hg38: 59 MB, 192,534 DEL + 232,973 INS + 732 INV
 # hs1 : 87 MB (more variants due to T2T-added regions)
 # Existing /gbdb symlinks (sv.hg38.bb, sv.hs1.bb) are unchanged.
+
+##########
+# 2026-05-21 Claude max
+#
+# hprc2JasmineSv: SV callsets from 231 HPRC v2 haplotype-resolved
+# assemblies. The Hall lab (Wen-Wei Liao) ran 14 SV callers per sample:
+# DELLY, DeBreak, DeepVariant, PAV, SVDSS, SVIM, SVIM-asm, Sniffles2,
+# cuteSV, cuteSV-asm, dipcall, longcallD, pbsv, sawfish.
+# The per-sample multi-caller output was harmonized into three per-sample
+# VCFs (dipcall, PAV, longcallD pipelines). One file per sample per
+# assembly path (GRCh38_no_alt, CHM13v2) was placed into
+# /hive/data/genomes/hg38/bed/lrSv/hprc2jasmine/input/ via download_inputs.sh
+# (S3 URLs from merged_callsets.index.csv). No publication yet.
+#
+# Build steps:
+#   1. Split each per-sample VCF by chromosome and filter to SV-sized
+#      records (|alt-ref| >= 30 bp), keeping the REF/ALT sequences
+#      intact. Route by assembly tag in the filename into
+#      split2-hg38/<chr>/<sample>.vcf and split2-hs1/<chr>/<sample>.vcf.
+#      Filtering at this stage drops ~135x of records (SNVs + small
+#      indels) which is what makes the next step tractable.
+#   2. Jasmine-merge across samples per chromosome with default
+#      sequence-aware options: --ignore_merged_inputs --normalize_type.
+#      Outputs go to output2/<asm>/merged.<chr>.vcf, then bcftools
+#      concat + sort produce output2/<asm>/merged_all.vcf.gz.
+#   3. Convert merged VCFs to BED9+ with the multi-caller fields
+#      preserved (SUPP -> nSamples and AC, NCALLERS, CALLERS, SOURCES,
+#      MR), bedSort + bedToBigBed.
+cd /hive/data/genomes/hg38/bed/lrSv/hprc2jasmine
+bash splitVcfsFilterSv.sh
+bash processJasmineSvSeq.sh hg38
+bash processJasmineSvSeq.sh hs1
+bash ~/kent/src/hg/makeDb/scripts/lrSv/lrSvHprc2JasmineBuild.sh
+# hg38: 335,494 SVs merged (full 22 autosomes; chrX/chrY absent from inputs)
+# hs1: (built same way from CHM13v2 per-sample calls)
+#
+# Note: an earlier symbolic-ALT pipeline (splitVcfs.sh + symbolizeVcfs.sh
+#   + processJasmine.sh, output/) was used as a workaround for a Jasmine
+# NPE in sequence comparison. Once the inputs are pre-filtered to
+# SV-sized records the NPE no longer fires, so the current pipeline runs
+# Jasmine with its normal sequence-aware merging. The symbolic-pipeline
+# scripts and output/ tree are retained for comparison.
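
The size filter described in build step 1 (|alt-ref| >= 30 bp, with REF/ALT sequences left intact) could look roughly like the sketch below. This is an illustration only, not the contents of splitVcfsFilterSv.sh, which is not shown in this diff. The names `is_sv_sized` and `filter_vcf_lines` are hypothetical, and skipping symbolic alleles such as `<DEL>` is an assumption made for the example, since they carry no sequence length to compare.

```python
MIN_SV_LEN = 30  # threshold quoted in build step 1: |alt-ref| >= 30 bp


def is_sv_sized(ref: str, alt: str, min_len: int = MIN_SV_LEN) -> bool:
    """True when the REF/ALT length difference reaches the SV threshold."""
    return abs(len(alt) - len(ref)) >= min_len


def filter_vcf_lines(lines, min_len: int = MIN_SV_LEN):
    """Yield header lines plus records with at least one SV-sized ALT allele.

    Records are passed through unmodified, so REF/ALT sequences stay intact.
    Symbolic ALT alleles (e.g. <DEL>) are ignored here; that is an assumption
    of this sketch, not documented behavior of the real script.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep all header lines
            continue
        fields = line.rstrip("\n").split("\t")
        ref = fields[3]              # VCF column 4: REF
        alts = fields[4].split(",")  # VCF column 5: ALT, comma-separated
        if any(
            is_sv_sized(ref, alt, min_len)
            for alt in alts
            if not alt.startswith("<")
        ):
            yield line
```

With a filter of this shape, SNVs and small indels are dropped before the merge step, which is the reduction the build notes credit for making the Jasmine merge tractable.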