bac95a147f49cd331052e597006e04b3deee40fc max Wed Apr 22 10:43:20 2026 -0700

lrSv/srSv: human-readable SV type filter labels, script cleanups

Add human-readable labels to the supertrack-level svType filter on both
the lrSv and srSv supertracks using the "CODE|CODE (Long name)"
filterValues syntax: DEL -> "DEL (Deletion)", INS -> "INS (Insertion)",
etc. Labels keep the short code up front so users can match what hgTracks
shows next to each feature.

Also sweep in the in-progress converter/as-file cleanups under
scripts/lrSv/ and scripts/srSv/ (introduction of lrSvCommon.py helpers,
consistent insLen / svLen / AC column naming, tightened field-description
text) that had been piling up as an unstaged working tree.

refs #36258

diff --git src/hg/makeDb/trackDb/human/cpc1Sv.html src/hg/makeDb/trackDb/human/cpc1Sv.html
index 470d401eb9d..167e890b81d 100644
--- src/hg/makeDb/trackDb/human/cpc1Sv.html
+++ src/hg/makeDb/trackDb/human/cpc1Sv.html
@@ -1,127 +1,153 @@

Description

This track displays structural variants (SVs) — deletions, insertions, and complex substitutions of at least 50 bp — identified by the Chinese Pangenome Consortium (CPC) in 58 samples representing 36 Chinese minority ethnic groups.

The upstream release combined the 58 CPC samples with 47 samples from Phase 1 of the Human Pangenome Reference Consortium (HPRC) into a single pangenome graph built on the T2T-CHM13v2 assembly with Minigraph-Cactus. For this track we recomputed allele counts (AC), allele numbers (AN) and sample counts (NS) using only the 58 CPC sample columns (those with HIFI032* or RY* prefixes in the source VCF) and dropped all snarls that no CPC sample carries (HPRC-specific SVs). To see the HPRC data on its own, use the HPRC SV tracks elsewhere in this collection.
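The CPC-only recount described above can be sketched in a few lines of Python. This is a minimal illustration, not the converter used for the track: the function name `cpc_counts`, the example sample names, and the choice to count NS as "samples with at least one called allele" are assumptions. It also assumes each VCF row has already been decomposed to a single ALT, so allele `1` is the ALT.

```python
# Sample-name prefixes for CPC columns, as named in the text.
CPC_PREFIXES = ("HIFI032", "RY")


def cpc_counts(sample_names, genotypes):
    """Return (AC, AN, NS) over CPC samples for one single-ALT VCF row.

    genotypes holds per-sample fields such as "0|1", "1", or ".|.";
    anything after the first ":" is ignored (GT is assumed to come first).
    """
    ac = an = ns = 0
    for name, field in zip(sample_names, genotypes):
        if not name.startswith(CPC_PREFIXES):
            continue  # skip non-CPC (HPRC) columns
        gt = field.split(":")[0].replace("/", "|")
        alleles = [a for a in gt.split("|") if a != "."]
        if alleles:
            ns += 1  # sample has at least one called allele
        an += len(alleles)
        ac += sum(1 for a in alleles if a == "1")
    return ac, an, ns


# Hypothetical sample names and genotypes, for illustration only.
names = ["HIFI032A", "RY01", "HG00438"]
gts = ["0|1", "1|.", "1|1"]
print(cpc_counts(names, gts))  # (2, 3, 2)
```

A row whose recomputed AC is 0 corresponds to an HPRC-specific alt and would be dropped.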

A pangenome is a graph that represents many genomes simultaneously, letting variants that are missing from a single linear reference be captured and typed directly. Variants are shown natively on the hs1 browser and lifted to hg38 using the UCSC hs1ToHg38.over.chain.gz chain. The track contains 46,092 snarl sites on hs1 and 36,030 lifted to hg38 (10,062 did not lift, typically in T2T-added repetitive regions).

Display conventions

Items are colored by SV type:

INS insertion (net ALT longer by ≥50 bp)
DEL deletion (net REF longer by ≥50 bp)
CPX complex substitution (similar-length REF and ALT but at least one ≥50 bp)
MIXED snarl whose collapsed alt alleles belong to different classes

Each BED item spans the REF allele on the reference, from its start to its end. Pure insertions (where REF is a single base) therefore appear as narrow single-base marks, whereas DEL and CPX items span the affected reference interval.

The name field is the graph snarl ID (two node identifiers separated by strand arrows, e.g. >2541>2547). It is stable across the graph but has no meaning outside the CPC pangenome graph file.
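For scripts that need the two node identifiers separately, a snarl ID of the form shown above can be split with a regular expression. This is an illustrative sketch: `parse_snarl_id` is a hypothetical helper, and accepting `<` as the reverse-strand arrow is an assumption (the example in the text uses only `>`).

```python
import re

# One step is a strand arrow followed by a numeric node identifier.
_STEP = re.compile(r"([<>])(\d+)")


def parse_snarl_id(name):
    """Split a snarl ID such as '>2541>2547' into [(arrow, node_id), ...]."""
    steps = _STEP.findall(name)
    rebuilt = "".join(arrow + node for arrow, node in steps)
    if len(steps) != 2 or rebuilt != name:
        raise ValueError(f"not a two-node snarl ID: {name!r}")
    return [(arrow, int(node)) for arrow, node in steps]


print(parse_snarl_id(">2541>2547"))  # [('>', 2541), ('>', 2547)]
```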

Collapsing of multi-allelic sites

The source VCF was decomposed with bcftools norm -m -any, so each graph snarl appears as one VCF row per alternative allele (a single bubble in the graph may have anywhere from 2 to more than 20 alt paths). For this track we first compute the CPC-only allele count for each alt, drop any alt that no CPC sample carries, and then collapse all remaining alts that share a snarl ID into one track item:

SV type is the common class of all alts, or MIXED if they disagree (for example one alt is a DEL and another is an INS).
SV length is the maximum |len(ALT) − len(REF)| across alts.
Allele count is the sum of the per-alt allele counts.
Number of alts records how many alternative alleles were merged.
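The four collapsing rules above can be sketched as follows. This is a minimal illustration rather than the track's actual converter: the function name `collapse` and the dictionary keys (`snarl`, `sv_type`, `ref_len`, `alt_len`, `ac`) are hypothetical, and `ac` is assumed to be the CPC-only allele count already computed per alt.

```python
from collections import defaultdict


def collapse(alts):
    """Merge per-alt records into one item per snarl ID."""
    groups = defaultdict(list)
    for alt in alts:
        if alt["ac"] > 0:  # drop alts that no CPC sample carries
            groups[alt["snarl"]].append(alt)

    items = {}
    for snarl, rows in groups.items():
        types = {r["sv_type"] for r in rows}
        items[snarl] = {
            # common class of all alts, or MIXED if they disagree
            "sv_type": types.pop() if len(types) == 1 else "MIXED",
            # maximum |len(ALT) - len(REF)| across alts
            "sv_len": max(abs(r["alt_len"] - r["ref_len"]) for r in rows),
            # sum of the per-alt allele counts
            "ac": sum(r["ac"] for r in rows),
            # number of alternative alleles merged
            "n_alts": len(rows),
        }
    return items


# Made-up records, for illustration only.
rows = [
    {"snarl": ">1>4", "sv_type": "DEL", "ref_len": 120, "alt_len": 1, "ac": 3},
    {"snarl": ">1>4", "sv_type": "INS", "ref_len": 1, "alt_len": 75, "ac": 2},
    {"snarl": ">1>4", "sv_type": "INS", "ref_len": 1, "alt_len": 300, "ac": 0},
    {"snarl": ">7>9", "sv_type": "INS", "ref_len": 1, "alt_len": 61, "ac": 5},
]
print(collapse(rows))
```

In this example the third record is dropped before merging, so the first snarl becomes a MIXED item with length 119, allele count 5, and two alts.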

Filters

Available filters:

SV type — any combination of INS, DEL, CPX, MIXED.
SV length — maximum allele-length difference.
Allele frequency and allele count, computed from the 58 CPC samples only (see Description).

Methods

-The CPC assemblies were produced from PacBio HiFi long-read sequencing
-(mean ~30× coverage) with hifiasm
-in trio or Hi-C-phased mode, then combined with HPRC Phase 1 assemblies and
-built into a variation graph with pggb/Minigraph-Cactus.
-Bubbles in the graph were decomposed into variant records with
-vcfwave,
-producing the source VCF used here. For this UCSC track, the decomposed
-VCF was parsed, filtered to variants with an allele-length delta of at
-least 50 bp, and collapsed by graph snarl ID (see the build documentation
-linked below for details).

+Gao et al. 2023 generated PacBio HiFi long reads (mean ~30.65x,
+Sequel II/IIe platforms) for 58 QC-passed samples representing 36
+minority Chinese ethnic groups, complemented with Illumina short reads
+and Oxford Nanopore ultralong reads. Haplotype-phased de novo assemblies
+were produced with
+hifiasm
+v0.16.1 (116 high-quality haplotype assemblies retained after QC) and
+combined with 47 HPRC Phase 1 assemblies into a single variation graph
+built on T2T-CHM13v2 with the Minigraph-Cactus pipeline (Minigraph v0.19
+for the SV skeleton, Cactus v2.1.1 base alignment, hal2vg).
+Graph bubbles were decomposed into variant records with vcfwave
+and normalized with bcftools norm -m -any, yielding the source
+VCF (CPC.HPRC.Phase1.processed.SVs.normed.vcf.gz). The upstream
+Gao et al. release identified 78,072 SVs across the combined 105-sample
+graph. For this track we restrict to the 58 CPC samples (columns matching
+HIFI032* or RY*), recompute AC/AN/NS from those columns
+only, drop snarls with no CPC carrier (HPRC-specific sites), filter to
+alts with ≥50 bp REF/ALT length difference, and collapse by graph snarl
+ID. The final track contains 46,092 snarl sites on hs1; the hg38 version
+is lifted with the UCSC hs1ToHg38.over.chain.gz chain (36,030
+sites, 10,062 did not lift).


+The source VCF is distributed by the
+Chinese-Pangenome-Consortium-Phase-I GitHub repository.


+The step-by-step build commands (CPC-only recount, liftOver, snarl
+collapse, bigBed build) are recorded in the UCSC makeDoc for this track
+container:
+doc/hg38/lrSv.txt. The conversion scripts and autoSql schemas live in
+makeDb/scripts/lrSv.

Data Access

The data can be explored interactively with the Table Browser or Data Integrator, and accessed from scripts via our API (track=cpc1Sv).
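As a rough sketch of scripted access, the following Python builds a request URL for the track through the UCSC REST API. The `/getData/track` endpoint and its `genome`, `track`, `chrom`, `start`, and `end` parameters follow the public API documentation as we understand it. The helper name `track_url` and the example region are illustrative assumptions, and the block only constructs the URL rather than fetching it.

```python
from urllib.parse import urlencode

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def track_url(genome, chrom, start, end, track="cpc1Sv"):
    """Build a getData/track URL for a region of the given assembly."""
    params = {
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return f"{API_BASE}?{urlencode(params)}"


url = track_url("hs1", "chr21", 0, 100000)
print(url)
# The JSON reply could then be retrieved with, for example:
#   import json, urllib.request
#   data = json.load(urllib.request.urlopen(url))
```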

For automated download, the bigBed files are at http://hgdownload.soe.ucsc.edu/gbdb/hs1/lrSv/cpc1.bb (native) and http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/cpc1.bb (lifted). Use bigBedToBed to extract features: e.g. bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hs1/lrSv/cpc1.bb -chrom=chr21 -start=0 -end=100000000 stdout

The original pangenome VCF is distributed by the Chinese Pangenome Consortium; see the CPC Phase I repository.

Credits

Thanks to the Chinese Pangenome Consortium and the HPRC Phase 1 team for producing and releasing the combined pangenome and its decomposed variant calls.

References

Gao Y, Yang X, Chen H, Tan X, Yang Z, Deng L, Wang B, Kong S, Li S, Cui Y et al. A pangenome reference of 36 Chinese populations. Nature. 2023 Jul;619(7968):112-121. PMID: 37316654; PMC: PMC10322713