5e4ca58df1b5bfe554fe5cc3309a39736ca256ee max Tue Apr 21 08:08:52 2026 -0700

cpc1Sv: restrict to the 58 CPC samples, drop HPRC-specific SVs

Rewrite lrSvCpc1VcfToBed.py to identify the 58 CPC sample columns by name prefix (HIFI032* or RY*), recompute AC/AN/NS from those GT columns only, and skip any snarl that no CPC sample carries. The HPRC portion is already represented elsewhere in lrSv, so this keeps the track population-consistent with its label.

Rebuild results: 46,092 snarl sites on hs1 (down from 97,205 when combined with HPRC), 36,030 lifted to hg38 (down from 81,261; 10,062 unmapped). Updates cpc1Sv.html, lrSv.ra labels, and the makeDoc.

refs #36258

diff --git src/hg/makeDb/trackDb/human/cpc1Sv.html src/hg/makeDb/trackDb/human/cpc1Sv.html
index e3f5e3f01a7..470d401eb9d 100644
--- src/hg/makeDb/trackDb/human/cpc1Sv.html
+++ src/hg/makeDb/trackDb/human/cpc1Sv.html
@@ -1,120 +1,127 @@

Description

-This track displays structural variants (SVs) — deletions, insertions, and
-complex substitutions of at least 50 bp — identified by the Chinese
-Pangenome Consortium (CPC) from a pangenome graph built from 58 core samples
-representing 36 Chinese minority ethnic groups, jointly with 47 samples from
-Phase 1 of the Human Pangenome Reference Consortium (HPRC). After
-decomposition of the graph bubbles, each distinct graph site (snarl) is
-displayed as one variant record, with genotypes aggregated across 105
-samples.

+This track displays structural variants (SVs) — deletions, insertions,
+and complex substitutions of at least 50 bp — identified by the
+Chinese Pangenome Consortium (CPC) in 58 samples representing 36 Chinese
+minority ethnic groups.

+ +

+The upstream release combined the 58 CPC samples with 47 samples from
+Phase 1 of the Human Pangenome Reference Consortium (HPRC) into a single
+pangenome graph built on the T2T-CHM13v2 assembly with Minigraph-Cactus.
+For this track we recomputed allele counts (AC), allele numbers (AN) and
+sample counts (NS) using only the 58 CPC sample columns (those with
+HIFI032* or RY* prefixes in the source VCF) and dropped
+all snarls that no CPC sample carries (HPRC-specific SVs). To see the
+HPRC data on its own, use the HPRC SV tracks elsewhere in this collection.
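The CPC-only recount described in the added paragraph can be sketched roughly as follows. This is a minimal illustration, not the code in lrSvCpc1VcfToBed.py: the function names, the sample names in the usage note, and the reading of NS as "samples with at least one called allele" are assumptions, and each VCF row is assumed to be biallelic (as after `bcftools norm -m -any`).

```python
def is_cpc_sample(name):
    # Prefixes come from the track description; matching them as plain
    # string prefixes is an assumption of this sketch.
    return name.startswith(("HIFI032", "RY"))


def cpc_counts(sample_names, genotypes):
    """Return (AC, AN, NS) for one biallelic VCF row, CPC columns only.

    sample_names: sample column names from the VCF header.
    genotypes:    per-sample fields for this row, GT first (e.g. "0|1").
    """
    ac = an = ns = 0
    for name, field in zip(sample_names, genotypes):
        if not is_cpc_sample(name):
            continue  # skip HPRC (and any other non-CPC) columns
        gt = field.split(":")[0]
        # Treat phased and unphased separators alike; "." is a missing allele.
        alleles = [a for a in gt.replace("|", "/").split("/") if a != "."]
        if alleles:
            ns += 1  # assumption: NS counts samples with any called allele
        an += len(alleles)
        ac += sum(1 for a in alleles if a == "1")
    return ac, an, ns
```

With hypothetical columns `["HIFI032A", "RY01", "HG002"]` and genotypes `["0|1", "1|1", "1|1"]`, only the first two columns contribute, and a row whose CPC-only AC is 0 would be dropped.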

A pangenome is a graph that represents many genomes simultaneously, letting variants that are missing from a single linear reference be captured and
-typed directly. Because the CPC pangenome was built on the T2T-CHM13v2
-assembly, variants are shown natively on the hs1 browser and lifted to hg38
-using the UCSC hs1ToHg38.over.chain.gz chain. About 16% of the
-97,205 hs1 sites did not lift over cleanly (usually in highly repetitive
-regions added to T2T-CHM13).

+typed directly. Variants are shown natively on the hs1 browser and lifted
+to hg38 using the UCSC hs1ToHg38.over.chain.gz chain. The track
+contains 46,092 snarl sites on hs1 and 36,030 lifted to hg38 (10,062 did
+not lift, typically in T2T-added repetitive regions).

Display conventions

Items are colored by SV type:

INS insertion (net ALT longer by ≥50 bp)
DEL deletion (net REF longer by ≥50 bp)
CPX complex substitution (similar-length REF and ALT but at least one ≥50 bp)
MIXED snarl whose collapsed alt alleles belong to different classes
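One literal reading of the per-allele classes listed above can be sketched as follows. The function name and the exact tie-breaking are assumptions; the build script may apply its thresholds differently.

```python
MIN_SV_LEN = 50  # size threshold stated in the track description


def classify_alt(ref, alt):
    """Classify one REF/ALT pair as INS, DEL or CPX, or None if too small."""
    delta = len(alt) - len(ref)
    if delta >= MIN_SV_LEN:
        return "INS"  # net ALT longer by >= 50 bp
    if delta <= -MIN_SV_LEN:
        return "DEL"  # net REF longer by >= 50 bp
    if max(len(ref), len(alt)) >= MIN_SV_LEN:
        return "CPX"  # similar lengths, but at least one allele >= 50 bp
    return None       # below the SV size threshold
```

MIXED is not assigned at this level; it only arises when the alts of one snarl are collapsed and their classes disagree.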

Each bed item spans from the start of the REF allele to its end on the reference. Pure insertions (where REF is a single base) therefore appear as narrow single-base marks; DELs and CPX items span the affected reference interval.

The name field is the graph snarl ID (two node identifiers separated by strand arrows, e.g. >2541>2547). It is stable across the graph but has no meaning outside the CPC pangenome graph file.

Collapsing of multi-allelic sites

The source VCF was decomposed with bcftools norm -m -any, so each graph snarl appears as one VCF row per alternative allele (a single
-bubble in the graph may have 2-20+ alt paths). For display, all alternative
-alleles sharing the same snarl ID are collapsed into one track item:

+bubble in the graph may have 2-20+ alt paths). For this track we first
+compute the CPC-only allele count per alt, drop any alt that no CPC sample
+carries, then collapse all remaining alts sharing the same snarl ID into
+one track item:

SV type is the common class of all alts, or MIXED if they disagree (for example one alt is a DEL and another is an INS).
SV length is the maximum |len(ALT) − len(REF)| across alts.
Allele count is the sum of the per-alt allele counts.
Number of alts records how many alternative alleles were merged.
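The four collapsing rules above can be sketched as below. The record layout (dicts with `snarl`, `sv_type`, `ref_len`, `alt_len` and `ac` keys) is hypothetical and chosen only for illustration, and `ac` is assumed to already be the CPC-only allele count.

```python
from collections import defaultdict


def collapse_snarls(alts):
    """Collapse per-alt records into one item per snarl ID."""
    groups = defaultdict(list)
    for a in alts:
        if a["ac"] > 0:  # drop alts that no CPC sample carries
            groups[a["snarl"]].append(a)

    items = {}
    for snarl, members in groups.items():
        types = {m["sv_type"] for m in members}
        items[snarl] = {
            # Common class of all alts, or MIXED if they disagree.
            "sv_type": types.pop() if len(types) == 1 else "MIXED",
            # Maximum |len(ALT) - len(REF)| across alts.
            "sv_len": max(abs(m["alt_len"] - m["ref_len"]) for m in members),
            # Sum of the per-alt allele counts.
            "ac": sum(m["ac"] for m in members),
            # Number of alternative alleles merged.
            "n_alts": len(members),
        }
    return items
```

A snarl whose alts all have a zero count never enters `groups`, so it produces no track item at all.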

Filters

Available filters:

SV type — any combination of INS, DEL, CPX, MIXED.
SV length — maximum allele-length difference.
Allele frequency and allele count across the 58 CPC samples.

Methods

The CPC assemblies were produced from PacBio HiFi long-read sequencing (mean ~30× coverage) with hifiasm in trio or Hi-C-phased mode, then combined with HPRC Phase 1 assemblies and built into a variation graph with pggb/Minigraph-Cactus. Bubbles in the graph were decomposed into variant records with vcfwave, producing the source VCF used here. For this UCSC track, the decomposed VCF was parsed, filtered to variants with an allele-length delta of at least 50 bp, and collapsed by graph snarl ID (see the build documentation linked below for details).

Data Access

The data can be explored interactively with the Table Browser or Data Integrator, and accessed from scripts via our API (track=cpc1Sv).

For automated download, the bigBed files are at http://hgdownload.soe.ucsc.edu/gbdb/hs1/lrSv/cpc1.bb (native) and http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/cpc1.bb (lifted). Use bigBedToBed to extract features: e.g. bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hs1/lrSv/cpc1.bb -chrom=chr21 -start=0 -end=100000000 stdout

The original pangenome VCF is distributed by the Chinese Pangenome Consortium; see the CPC Phase I repository.

Credits

Thanks to the Chinese Pangenome Consortium and the HPRC Phase 1 team for producing and releasing the combined pangenome and its decomposed variant calls.

References

Gao Y, Yang X, Chen H, Tan X, Yang Z, Deng L, Wang B, Kong S, Li S, Cui Y et al. A pangenome reference of 36 Chinese populations. Nature. 2023 Jul;619(7968):112-121. PMID: 37316654; PMC: PMC10322713