5e4ca58df1b5bfe554fe5cc3309a39736ca256ee max Tue Apr 21 08:08:52 2026 -0700

cpc1Sv: restrict to the 58 CPC samples, drop HPRC-specific SVs

Rewrite lrSvCpc1VcfToBed.py to identify the 58 CPC sample columns by name prefix (HIFI032* or RY*), recompute AC/AN/NS from those GT columns only, and skip any snarl that no CPC sample carries. The HPRC portion is already represented elsewhere in lrSv, so this keeps the track population-consistent with its label.

Rebuild results: 46,092 snarl sites on hs1 (down from 97,205 when combined with HPRC), 36,030 lifted to hg38 (down from 81,261; 10,062 unmapped). Updates cpc1Sv.html, lrSv.ra labels, and the makeDoc.

refs #36258

diff --git src/hg/makeDb/trackDb/human/cpc1Sv.html src/hg/makeDb/trackDb/human/cpc1Sv.html
index e3f5e3f01a7..470d401eb9d 100644
--- src/hg/makeDb/trackDb/human/cpc1Sv.html
+++ src/hg/makeDb/trackDb/human/cpc1Sv.html
@@ -1,64 +1,71 @@
-This track displays structural variants (SVs) — deletions, insertions, and
-complex substitutions of at least 50 bp — identified by the Chinese
-Pangenome Consortium (CPC) from a pangenome graph built from 58 core samples
-representing 36 Chinese minority ethnic groups, jointly with 47 samples from
-Phase 1 of the Human Pangenome Reference Consortium (HPRC). After
-decomposition of the graph bubbles, each distinct graph site (snarl) is
-displayed as one variant record, with genotypes aggregated across 105
-samples.
+This track displays structural variants (SVs) — deletions, insertions,
+and complex substitutions of at least 50 bp — identified by the
+Chinese Pangenome Consortium (CPC) in 58 samples representing 36 Chinese
+minority ethnic groups.
+
+
+The upstream release combined the 58 CPC samples with 47 samples from
+Phase 1 of the Human Pangenome Reference Consortium (HPRC) into a single
+pangenome graph built on the T2T-CHM13v2 assembly with Minigraph-Cactus.
+For this track we recomputed allele counts (AC), allele numbers (AN) and
+sample counts (NS) using only the 58 CPC sample columns (those with
+HIFI032* or RY* prefixes in the source VCF) and dropped
+all snarls that no CPC sample carries (HPRC-specific SVs). To see the
+HPRC data on its own, use the HPRC SV tracks elsewhere in this collection.
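
Reviewer note: the recomputation described in the added paragraph can be sketched roughly as below. This is a minimal illustration, not the contents of lrSvCpc1VcfToBed.py. It assumes biallelic rows (as produced by the `bcftools norm -m -any` decomposition), a tab-separated VCF, and NS counted as samples with at least one called allele. The function names, sample names and the example record are hypothetical.

```python
import re

# Prefixes named in the commit message; everything else here is illustrative.
CPC_PREFIXES = ("HIFI032", "RY")


def cpc_columns(header_line):
    """Return indexes of sample columns whose names start with a CPC prefix."""
    fields = header_line.rstrip("\n").split("\t")
    # Columns 0-8 are the fixed VCF fields (#CHROM .. FORMAT); samples start at 9.
    return [i for i, name in enumerate(fields)
            if i >= 9 and name.startswith(CPC_PREFIXES)]


def recompute_counts(record_line, sample_idx):
    """Recompute (AC, AN, NS) for one biallelic record from the chosen GT columns."""
    fields = record_line.rstrip("\n").split("\t")
    gt_pos = fields[8].split(":").index("GT")
    ac = an = ns = 0
    for i in sample_idx:
        gt = fields[i].split(":")[gt_pos]
        # Split phased (|) or unphased (/) genotypes and ignore missing calls.
        alleles = [a for a in re.split(r"[/|]", gt) if a != "."]
        if alleles:
            ns += 1
        an += len(alleles)
        ac += sum(1 for a in alleles if a == "1")
    return ac, an, ns


# Hypothetical header and record: two CPC samples and one HPRC sample.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHIFI032A\tRY01\tHG002"
record = "chr1\t1000\t>2541>2547\tA\tACGT\t.\t.\t.\tGT\t0|1\t1\t1|1"

idx = cpc_columns(header)
ac, an, ns = recompute_counts(record, idx)
keep = ac > 0  # a row with no CPC carrier would be skipped
```

In this sketch a row is kept only when the CPC-only AC is above zero, which corresponds to dropping variants that no CPC sample carries.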
A pangenome is a graph that represents many genomes simultaneously, letting variants that are missing from a single linear reference be captured and
-typed directly. Because the CPC pangenome was built on the T2T-CHM13v2
-assembly, variants are shown natively on the hs1 browser and lifted to hg38
-using the UCSC hs1ToHg38.over.chain.gz chain. About 16% of the
-97,205 hs1 sites did not lift over cleanly (usually in highly repetitive
-regions added to T2T-CHM13).
+typed directly. Variants are shown natively on the hs1 browser and lifted
+to hg38 using the UCSC hs1ToHg38.over.chain.gz chain. The track
+contains 46,092 snarl sites on hs1 and 36,030 lifted to hg38 (10,062 did
+not lift, typically in T2T-added repetitive regions).
Items are colored by SV type:
Each bed item spans from the start of the REF allele to its end on the reference. Pure insertions (where REF is a single base) therefore appear as narrow single-base marks; DELs and CPX items span the affected reference interval.
The name field is the graph snarl ID (two node identifiers separated by strand arrows, e.g. >2541>2547). It is stable across the graph but has no meaning outside the CPC pangenome graph file.
The source VCF was decomposed with bcftools norm -m -any, so each graph snarl appears as one VCF row per alternative allele (a single
-bubble in the graph may have 2-20+ alt paths). For display, all alternative
-alleles sharing the same snarl ID are collapsed into one track item:
+bubble in the graph may have 2-20+ alt paths). For this track we first
+compute the CPC-only allele count per alt, drop any alt that no CPC sample
+carries, then collapse all remaining alts sharing the same snarl ID into
+one track item:
Available filters: