f058c8fe4601b223ff47468eb3525c05ccd03850 max Wed Apr 22 09:17:17 2026 -0700

srSv: new short-read SV supertrack, split out of lrSv

Move the three short-read SV/CNV subtracks (abelSv, onekg3202Sr,
tommoJpCnv) out of the Long-read SV supertrack into a new sibling
supertrack srSv (Short-read SVs), so the lrSv collection contains only
long-read callsets. Filter fields (svType, svLen, insLen, AC) are
mirrored at the srSv supertrack level to keep the UX parallel to lrSv.

- trackDb: new human/srSv.ra with the three subtrack stanzas and
  updated /gbdb/$D/srSv/... bigDataUrls; corresponding stanzas removed
  from human/lrSv.ra. human/trackDb.ra now includes srSv.ra. Also a new
  human/srSv.html overview page; the SR rows and SR-specific paragraphs
  removed from human/lrSv.html.
- Scripts: abelSv/{abelSv.as,vcfToBed.py,build.sh} and lrSv/
  {lrSv1kg3202Sr*, lrSvTommoJpCnvVcfToBedGraph.py} moved to
  scripts/srSv/ with git mv (history preserved) and renamed to drop the
  "lrSv" prefix. Internal path references in abelSvBuild.sh and
  abelSvVcfToBed.py updated.
- makeDoc: doc/hg38/abelSv.txt renamed to doc/hg38/srSv.txt and
  extended with the onekg3202Sr and tommoJpCnv sections moved from
  lrSv.txt. lrSv.txt leaves a pointer.
- Data: /hive/data/genomes/hg38/bed/{abelSv,lrSv/onekg3202sr,
  lrSv/tommoJpCnv} moved to /hive/data/genomes/hg38/bed/srSv/*.
  /gbdb/hg38/lrSv/{onekg3202sr.bb,tommoJpCnv{Loss,Gain}.bw} and
  /gbdb/hg38/abelSv/ removed and re-linked under /gbdb/hg38/srSv/.
refs #36258

diff --git src/hg/makeDb/doc/hg38/lrSv.txt src/hg/makeDb/doc/hg38/lrSv.txt
index 914cb1d001b..f19e8c2e813 100644
--- src/hg/makeDb/doc/hg38/lrSv.txt
+++ src/hg/makeDb/doc/hg38/lrSv.txt
@@ -98,55 +98,32 @@
 # VCF downloaded from jMorp:
 # https://jmorp.megabank.tohoku.ac.jp/datasets/tommo-jsv1-20211208-af
 # File: tommo-JSV1-20211208-GRCh38-without-genotype-count.vcf.gz
 # 74,201 SVs: 37,981 DEL, 36,220 INS
 # Site-only VCF, merged with SURVIVOR v1.0.6
 # Native GRCh38 coordinates (confirmed via contig headers)
 # Trio-based: 111 families, includes Mendelian error rates
 
 # Convert VCF to BED and build bigBed
 python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSvTommoJpVcfToBed.py \
     tommo-JSV1-20211208-GRCh38-without-genotype-count.vcf.gz tommoJp.bed
 bedSort tommoJp.bed tommoJp.sorted.bed
 bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSvTommoJp.as \
     -tab tommoJp.sorted.bed /hive/data/genomes/hg38/chrom.sizes tommoJp.bb
 
-##########
-# 2026-04-21 Claude max
-
-# ToMMo 48KJPN CNV Frequency Panel - short-read CNV comparator to the
-# long-read tommoJpSv track above. 48,874 Japanese individuals,
-# short-read WGS, GATK CNV germline workflow at 1 kb bin resolution.
-# Data page: https://jmorp.megabank.tohoku.ac.jp/downloads/tommo-jcnvv1-20230828
-
-mkdir -p /hive/data/genomes/hg38/bed/lrSv/tommoJpCnv
-cd /hive/data/genomes/hg38/bed/lrSv/tommoJpCnv
-wget https://jmorp.megabank.tohoku.ac.jp/datasets/tommo-jcnvv1-20230828/files/tommo-jcnvv1-20230828-GRCh38.vcf.gz
-
-# The VCF has one record per 1 kb non-N bin with per-ALT sample counts
-# (SC) for each observed CN state (CN0..CN5). The converter collapses
-# the per-CN counts into two per-bin values (samples with CN<2, samples
-# with CN>2) and writes two bedGraphs, skipping bins with no CNV
-# carrier. Displayed as a multiWig transparent overlay via trackDb
-# (loss red / gain green) so CNV carrier-count density is visible at
-# any zoom level. 2,006,905 bins kept.
-python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSvTommoJpCnvVcfToBedGraph.py \
-    tommo-jcnvv1-20230828-GRCh38.vcf.gz tommoJpCnvLoss.bg tommoJpCnvGain.bg
-sort -k1,1 -k2,2n tommoJpCnvLoss.bg > tommoJpCnvLoss.sorted.bg
-sort -k1,1 -k2,2n tommoJpCnvGain.bg > tommoJpCnvGain.sorted.bg
-bedGraphToBigWig tommoJpCnvLoss.sorted.bg /hive/data/genomes/hg38/chrom.sizes tommoJpCnvLoss.bw
-bedGraphToBigWig tommoJpCnvGain.sorted.bg /hive/data/genomes/hg38/chrom.sizes tommoJpCnvGain.bw
+# ToMMo 48K CNV short-read comparator moved to the srSv supertrack.
+# See doc/hg38/srSv.txt for that build.
 
 ##########
 # 2026-03-26 Claude max
 
 # Fourth subtrack: AoU 1K - SVs from 1,027 AoU individuals (PacBio HiFi)
 # Paper: Garimella et al. 2025, medRxiv, PMID 41256123
 # Data: Supplementary media-2 from preprint
 
 mkdir -p /hive/data/genomes/hg38/bed/lrSv/aou1k
 cd /hive/data/genomes/hg38/bed/lrSv/aou1k
 
 # Downloaded supplementary CSV from preprint (media-2.gz)
 # 541,049 SVs: 444,524 INS, 96,525 DEL (autosomes only)
 # Population-specific AFs (AFR, AMR, EAS, EUR, SAS)
 # Gene annotations (OMIM, disease, cancer, ACMG), regulatory elements
@@ -321,53 +298,32 @@
 wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC2/release/v2.0/integrated_callset/variants_freeze4_sv_inv.tsv.gz
 
 # Two annotation tables are complementary (same structure as HGSVC3): the
 # insdel table holds DEL + INS with POP_*_AF population allele frequencies
 # imputed back into the 1000 Genomes cohort; the inv table holds INV with
 # an RGN_REF_INNER column. The converter merges them into a single bigBed.
 
 python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSvHgsvc2TsvToBed.py \
     variants_freeze4_sv_insdel.tsv.gz \
     variants_freeze4_sv_inv.tsv.gz \
     hgsvc2.bed
 bedSort hgsvc2.bed hgsvc2.sorted.bed
 bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSvHgsvc2.as \
     -tab hgsvc2.sorted.bed /hive/data/genomes/hg38/chrom.sizes hgsvc2.bb
 
-##########
-# 2026-04-20 Claude max
-
-# Twelfth subtrack: 1000 Genomes 3,202-sample Illumina SHORT-READ GATK-SV
-# release. Included in the lrSv collection solely as a short-read
-# comparator; it is NOT a long-read dataset.
-# Paper: Byrska-Bishop et al. 2022, Cell, PMID 36055201
-# Data: 1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.vcf.gz (IGSR FTP)
-
-mkdir -p /hive/data/genomes/hg38/bed/lrSv/onekg3202sr
-cd /hive/data/genomes/hg38/bed/lrSv/onekg3202sr
-wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20210124.SV_Illumina_Integration/1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.vcf.gz
-
-# 173,366 site-level SVs across 7 classes (DEL, INS, DUP, INV, CPX, CNV,
-# CTX) with AC/AN/AF and per-superpopulation AFs (AFR/AMR/ASN/EUR/SAN).
-# The converter extracts site-level INFO into bed9+, preserving the
-# FILTER column so users can see PASS vs LowQual / HWE / etc.
-
-python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSv1kg3202SrVcfToBed.py \
-    1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.vcf.gz onekg3202sr.bed
-bedSort onekg3202sr.bed onekg3202sr.sorted.bed
-bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSv1kg3202Sr.as \
-    -tab onekg3202sr.sorted.bed /hive/data/genomes/hg38/chrom.sizes onekg3202sr.bb
+# 1KG 3202 short-read comparator moved to the srSv supertrack.
+# See doc/hg38/srSv.txt for that build.
 
 ##########
 # 2026-04-20 Claude max
 
 # Thirteenth subtrack: HPRC release-2 pangenome SVs (233 samples).
 # No peer-reviewed publication yet; see HPRC release page:
 # https://humanpangenome.org/hprc-data-release-2/
 # Sample list (alignments v2.0):
 # https://github.com/human-pangenomics/hprc_intermediate_assembly/blob/main/data_tables/pangenomes/alignments_v2.0.csv
 
 mkdir -p /hive/data/genomes/hg38/bed/lrSv/hprc2
 cd /hive/data/genomes/hg38/bed/lrSv/hprc2
 
 # Pangenome graph (referenced in the doc html):
 wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/release2/minigraph-cactus/hprc-v2.0-mc-grch38.sv.gfa.gz