src/hg/makeDb/doc/hg38/lrSv.txt 5e4ca58df1b5bfe554fe5cc3309a39736ca256ee

5e4ca58df1b5bfe554fe5cc3309a39736ca256ee
max
  Tue Apr 21 08:08:52 2026 -0700
cpc1Sv: restrict to the 58 CPC samples, drop HPRC-specific SVs

Rewrite lrSvCpc1VcfToBed.py to identify the 58 CPC sample columns by
name prefix (HIFI032* or RY*), recompute AC/AN/NS from those GT
columns only, and skip any snarl for which no CPC sample carries an
alt allele. The HPRC portion is already represented elsewhere in
lrSv, so this keeps the track's content consistent with its CPC
label.
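
For illustration only, a minimal Python sketch of the column selection
and recount described above. This is not the code in
lrSvCpc1VcfToBed.py: the function names and the sample names in the
usage lines are hypothetical, and the sketch assumes that GT is the
first FORMAT subfield and that NS counts samples with at least one
called allele.

```python
CPC_PREFIXES = ("HIFI032", "RY")  # prefixes named in this commit message
FIRST_SAMPLE_COL = 9              # VCF sample columns start after FORMAT


def cpc_columns(header_fields):
    """Return indexes of sample columns whose name has a CPC prefix."""
    return [i for i, name in enumerate(header_fields)
            if i >= FIRST_SAMPLE_COL and name.startswith(CPC_PREFIXES)]


def recount(record_fields, columns):
    """Recompute (AC per alt, AN, NS) from the GT of the given columns."""
    n_alts = len(record_fields[4].split(","))
    ac, an, ns = [0] * n_alts, 0, 0
    for i in columns:
        gt = record_fields[i].split(":")[0]  # assumption: GT comes first
        # Keep only called alleles; "." (missing) is skipped.
        called = [a for a in gt.replace("|", "/").split("/") if a.isdigit()]
        if called:
            ns += 1  # assumption: NS = samples with at least one call
        an += len(called)
        for a in called:
            if int(a) > 0:
                ac[int(a) - 1] += 1
    return ac, an, ns


def keep_snarl(ac):
    """A snarl is kept only if some CPC sample carries an alt allele."""
    return sum(ac) > 0


# Hypothetical header and record; HG00438 stands in for an HPRC sample.
header = ("#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT "
          "HIFI032A HG00438 RY01").split()
record = "chr1 100 s1 A AT,ATT . . . GT 0|1 2|2 .|.".split()
cols = cpc_columns(header)
ac, an, ns = recount(record, cols)
```

Because the counts come only from the selected columns, a snarl whose
alt alleles occur solely in non-CPC samples ends up with AC of zero
and is dropped by keep_snarl().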

Rebuild results: 46,092 snarl sites on hs1 (down from 97,205 when
combined with HPRC); 36,030 of these lift to hg38 (down from 81,261)
and 10,062 are unmapped. Updates cpc1Sv.html, lrSv.ra labels, and
the makeDoc.

refs #36258

diff --git src/hg/makeDb/doc/hg38/lrSv.txt src/hg/makeDb/doc/hg38/lrSv.txt
index 36b9b1fc364..b164679d86d 100644
--- src/hg/makeDb/doc/hg38/lrSv.txt
+++ src/hg/makeDb/doc/hg38/lrSv.txt
@@ -196,30 +196,33 @@
 #   variants_GRCh38_sv_inv_HGSVC2024v1.0.tsv.gz     (300 INV)
 # The two tables are complementary: the insdel table holds all
 # insertions+deletions (with HOM_REF/HOM_TIG/TE columns specific to
 # insertions+deletions), while the inv table holds inversions (with an
 # RGN_REF_INNER column describing the inner inverted region). The lrSv
 # subtrack merges them into a single bigBed.
 
 python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSvHgsvc3TsvToBed.py \
     variants_GRCh38_sv_insdel_HGSVC2024v1.0.tsv.gz \
     variants_GRCh38_sv_inv_HGSVC2024v1.0.tsv.gz \
     hgsvc3.bed
 bedSort hgsvc3.bed hgsvc3.sorted.bed
 bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSvHgsvc3.as \
     -tab hgsvc3.sorted.bed /hive/data/genomes/hg38/chrom.sizes hgsvc3.bb
 
+# The same process is applied natively to T2T-CHM13 (hs1) using the HGSVC3
+# T2T-CHM13 annotation tables. See ~/kent/src/hg/makeDb/doc/hs1/lrSv.txt.
+
 ##########
 # 2026-04-17 Claude max
 
 # Eighth subtrack: Kim et al. 2026 - PacBio HiFi long-read SVs from 100
 # post-mortem brain samples (Parkinson's disease / ILBD / healthy controls).
 # Paper: Kim et al. 2026, bioRxiv, PMID 41929179
 # Data: Supplementary Table 13 (media-13.txt) from the preprint.
 
 mkdir -p /hive/data/genomes/hg38/bed/lrSv/kwanho2026
 cd /hive/data/genomes/hg38/bed/lrSv/kwanho2026
 # media-13.txt holds the final high-confidence catalog of 74,552 SVs
 # (34,056 INS, 29,545 DEL, 9,707 DUP, 1,244 INV) across three cohorts
 # (PD: 35, ILBD: 31, HC: 34; 100 samples total). paper.txt has the preprint
 # text for reference. Numeric fields use comma thousands-separators inside
 # quoted strings, so the converter parses the TSV with the csv module.
@@ -400,15 +403,32 @@
 mkdir -p /hive/data/genomes/hg38/bed/lrSv/apr
 cd /hive/data/genomes/hg38/bed/lrSv/apr
 
 # (VCF placed here by the user from the MBRU SharePoint download)
 
 # Run converter + liftOver + bigBed for both hs1 (native) and hg38 (lifted).
 # The script iterates the comma-separated ALT alleles of each row,
 # classifies each by length delta (>=50 bp -> INS, <=-50 bp -> DEL,
 # |d|<50 and max(len)>=50 -> CPX, else drop), then emits one row per
 # snarl (VCF line) with AC summed across passing alts.
 bash ~/kent/src/hg/makeDb/scripts/lrSv/lrSvAprBuild.sh
 
 mkdir -p /gbdb/hs1/lrSv /gbdb/hg38/lrSv
 ln -sf /hive/data/genomes/hg38/bed/lrSv/apr/apr.hs1.bb  /gbdb/hs1/lrSv/apr.bb
 ln -sf /hive/data/genomes/hg38/bed/lrSv/apr/apr.hg38.bb /gbdb/hg38/lrSv/apr.bb
+
+##########
+# 2026-04-21 Claude max
+#
+# cpc1Sv rebuilt as CPC-only (58 samples). The upstream VCF contains
+# 105 samples (58 CPC + 47 HPRC Phase 1). For this version we
+# identify the 58 CPC columns by sample name prefix (HIFI032* or
+# RY*), recompute AC/AN/NS from those GT columns only, and drop
+# snarls where no CPC sample carries any alt. HPRC-specific SVs are
+# therefore excluded; the HPRC contribution is already represented
+# in the HPRC SV tracks elsewhere in this lrSv supertrack.
+#
+# Pipeline (same build script, updated Python converter):
+cd /hive/data/genomes/hg38/bed/lrSv/cpc1
+bash ~/kent/src/hg/makeDb/scripts/lrSv/lrSvCpc1Build.sh
+#   hs1 sites: 46,092 (down from 97,205 combined)
+#   hg38 lifted: 36,030 (down from 81,261); 10,062 unmapped