5e4ca58df1b5bfe554fe5cc3309a39736ca256ee max Tue Apr 21 08:08:52 2026 -0700

cpc1Sv: restrict to the 58 CPC samples, drop HPRC-specific SVs

Rewrite lrSvCpc1VcfToBed.py to identify the 58 CPC sample columns by name
prefix (HIFI032* or RY*), recompute AC/AN/NS from those GT columns only, and
skip any snarl that no CPC sample carries. The HPRC portion is already
represented elsewhere in lrSv, so this keeps the track population-consistent
with its label.

Rebuild results: 46,092 snarl sites on hs1 (down from 97,205 when combined
with HPRC), 36,030 lifted to hg38 (down from 81,261; 10,062 unmapped).

Updates cpc1Sv.html, lrSv.ra labels, and the makeDoc. refs #36258

diff --git src/hg/makeDb/doc/hg38/lrSv.txt src/hg/makeDb/doc/hg38/lrSv.txt
index 36b9b1fc364..b164679d86d 100644
--- src/hg/makeDb/doc/hg38/lrSv.txt
+++ src/hg/makeDb/doc/hg38/lrSv.txt
@@ -196,30 +196,33 @@
 # variants_GRCh38_sv_inv_HGSVC2024v1.0.tsv.gz (300 INV)
 # The two tables are complementary: the insdel table holds all
 # insertions+deletions (with HOM_REF/HOM_TIG/TE columns specific to
 # insertions+deletions), while the inv table holds inversions (with an
 # RGN_REF_INNER column describing the inner inverted region). The lrSv
 # subtrack merges them into a single bigBed.

 python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSvHgsvc3TsvToBed.py \
     variants_GRCh38_sv_insdel_HGSVC2024v1.0.tsv.gz \
     variants_GRCh38_sv_inv_HGSVC2024v1.0.tsv.gz \
     hgsvc3.bed
 bedSort hgsvc3.bed hgsvc3.sorted.bed
 bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSvHgsvc3.as \
     -tab hgsvc3.sorted.bed /hive/data/genomes/hg38/chrom.sizes hgsvc3.bb

+# The same process is applied natively to T2T-CHM13 (hs1) using the HGSVC3
+# T2T-CHM13 annotation tables. See ~/kent/src/hg/makeDb/doc/hs1/lrSv.txt.
+
 ##########
 # 2026-04-17 Claude max
 # Eighth subtrack: Kim et al. 2026 - PacBio HiFi long-read SVs from 100
 # post-mortem brain samples (Parkinson's disease / ILBD / healthy controls).
 # Paper: Kim et al. 2026, bioRxiv, PMID 41929179
 # Data: Supplementary Table 13 (media-13.txt) from the preprint.

 mkdir -p /hive/data/genomes/hg38/bed/lrSv/kwanho2026
 cd /hive/data/genomes/hg38/bed/lrSv/kwanho2026
 # media-13.txt holds the final high-confidence catalog of 74,552 SVs
 # (34,056 INS, 29,545 DEL, 9,707 DUP, 1,244 INV) across three cohorts
 # (PD: 35, ILBD: 31, HC: 34; 100 samples total). paper.txt has the preprint
 # text for reference. Numeric fields use comma thousands-separators inside
 # quoted strings, so the converter parses the TSV with the csv module.
@@ -400,15 +403,32 @@
 mkdir -p /hive/data/genomes/hg38/bed/lrSv/apr
 cd /hive/data/genomes/hg38/bed/lrSv/apr
 # (VCF placed here by the user from the MBRU SharePoint download)

 # Run converter + liftOver + bigBed for both hs1 (native) and hg38 (lifted).
 # The script iterates the comma-separated ALT alleles of each row,
 # classifies each by length delta (>=50 bp -> INS, <=-50 bp -> DEL,
 # |d|<50 and max(len)>=50 -> CPX, else drop), then emits one row per
 # snarl (VCF line) with AC summed across passing alts.
 bash ~/kent/src/hg/makeDb/scripts/lrSv/lrSvAprBuild.sh

 mkdir -p /gbdb/hs1/lrSv /gbdb/hg38/lrSv
 ln -sf /hive/data/genomes/hg38/bed/lrSv/apr/apr.hs1.bb /gbdb/hs1/lrSv/apr.bb
 ln -sf /hive/data/genomes/hg38/bed/lrSv/apr/apr.hg38.bb /gbdb/hg38/lrSv/apr.bb
+
+##########
+# 2026-04-21 Claude max
+#
+# cpc1Sv rebuilt as CPC-only (58 samples). The upstream VCF contains
+# 105 samples (58 CPC + 47 HPRC Phase 1). For this version we
+# identify the 58 CPC columns by sample name prefix (HIFI032* or
+# RY*), recompute AC/AN/NS from those GT columns only, and drop
+# snarls where no CPC sample carries any alt. HPRC-specific SVs are
+# therefore excluded; the HPRC contribution is already represented
+# in the HPRC SV tracks elsewhere in this lrSv supertrack.
+#
+# Pipeline (same build script, updated Python converter):
+cd /hive/data/genomes/hg38/bed/lrSv/cpc1
+bash ~/kent/src/hg/makeDb/scripts/lrSv/lrSvCpc1Build.sh
+# hs1 sites: 46,092 (down from 97,205 combined)
+# hg38 lifted: 36,030 (down from 81,261); 10,062 unmapped
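
Note on the CPC-only recount: the makeDoc entry above describes selecting the CPC sample columns by name prefix and recomputing AC/AN/NS from their GT fields. The following is a minimal Python sketch of that idea, not the contents of lrSvCpc1VcfToBed.py, which this diff does not include. The function names (cpc_columns, recompute_counts, convert_row) are hypothetical, and the definitions used are assumptions: AC is counted per ALT allele, AN is the number of called alleles, and NS is the number of samples with at least one called allele.

```python
import re

# Sample-name prefixes named in the commit message for the CPC cohort.
CPC_PREFIXES = ("HIFI032", "RY")


def cpc_columns(sample_names):
    """Return the indices of sample columns whose name has a CPC prefix."""
    return [i for i, name in enumerate(sample_names)
            if name.startswith(CPC_PREFIXES)]


def recompute_counts(gt_fields, keep, n_alts):
    """Recount per-ALT AC, AN and NS using only the kept sample columns.

    gt_fields: per-sample VCF fields for one row, GT first (e.g. "0|1:35").
    keep:      column indices to include (the CPC samples).
    n_alts:    number of ALT alleles on the row.
    """
    ac = [0] * n_alts
    an = 0
    ns = 0
    for i in keep:
        gt = gt_fields[i].split(":", 1)[0]
        # Split phased ("|") or unphased ("/") genotypes; skip missing calls.
        called = [a for a in re.split(r"[/|]", gt) if a not in (".", "")]
        if called:
            ns += 1
        for allele in called:
            an += 1
            idx = int(allele)
            if idx > 0:  # 0 is the reference allele
                ac[idx - 1] += 1
    return ac, an, ns


def convert_row(sample_names, gt_fields, n_alts):
    """Return recomputed counts, or None if no CPC sample carries an alt."""
    keep = cpc_columns(sample_names)
    ac, an, ns = recompute_counts(gt_fields, keep, n_alts)
    if sum(ac) == 0:
        return None  # site is carried only by non-CPC samples: dropped
    return {"AC": ac, "AN": an, "NS": ns}
```

For example, with samples HIFI032A, RY01 and HG00438, a row where only HG00438 carries the alt would return None and be skipped, which corresponds to the drop in site count recorded in the makeDoc.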