ef61e73fc416622d8557ec2439df2344a1cc80c3 max Tue Jun 9 15:10:01 2026 -0700 lrSv: replace HPRC v2.0 pangenome SV track with v2.1 (hprc2v21Sv) Drop the v2.0 wave-decomposed hprc2Sv track and add hprc2v21Sv built from the HPRC v2.1 minigraph-cactus raw vg deconstruct VCFs (gref95.ro), on both hg38 (GRCh38 path, 596,063 SVs) and hs1 (T2T-CHM13 path, 608,435 SVs). The v2.1 files lack per-allele TYPE/LEN, so the new converter classifies INS/DEL by parsimony-trimming REF/ALT and the net length change. The v2.0 build recipe, converter and schema are kept but commented out in the makeDocs in case wave-decomposed VCFs are released again, refs #36258 diff --git src/hg/makeDb/trackDb/human/hprc2v21Sv.html src/hg/makeDb/trackDb/human/hprc2v21Sv.html new file mode 100644 index 00000000000..9d3955f33c8 --- /dev/null +++ src/hg/makeDb/trackDb/human/hprc2v21Sv.html @@ -0,0 +1,108 @@ +

Description

+

+A pangenome graph holds many human genomes at once. Sequence that the +genomes share collapses onto common paths, and the places where they +differ show up as bubbles in the graph. This track shows the structural +variants found in version 2.1 of the Human Pangenome Reference Consortium +(HPRC) minigraph-cactus graph, which was built from haplotype-resolved +PacBio HiFi assemblies of 233 samples. Only larger events are shown here: +insertions and deletions of at least 50 bp. HPRC produces one variant file +per reference path, so the events are measured against GRCh38 on hg38 and +against T2T-CHM13 on hs1, and each assembly shows its own native callset. +

+

+On hg38 there are about 596,000 such alleles (roughly 448,000 insertions and +148,000 deletions). On hs1 there are about 608,000 (roughly 363,000 +insertions and 245,000 deletions). The two sets are not lifted between +assemblies; the counts differ because an insertion against one reference can +be a deletion against the other. +

+ +

Display Conventions and Configuration

+

+Items are colored by SV type: +

+ + + + + +
 Insertion (INS)
 Deletion (DEL)
+

+An insertion is drawn as a 1 bp anchor at the point where the extra +sequence goes in. A deletion spans the stretch of reference that is +missing. Each variant keeps its allele count, allele frequency, the +number of samples with data, and the level it sits at in the graph's +snarl tree. A snarl level of 0 is a top-level bubble; higher numbers are +bubbles nested inside a parent bubble. All of these can be used as +filters. +

+ +

Methods

+

+HPRC release 2 does not yet have a peer-reviewed paper. The graph was +built with minigraph-cactus from haplotype-resolved PacBio HiFi assemblies +of 233 samples, including T2T-CHM13 and the diverse 1000 Genomes Project +panel, using GRCh38 as the reference path. Variants were called from the +graph with vg deconstruct. HPRC keeps the sample list and assembly +provenance in + +alignments_v2.0.csv. +

+

+We started from the per-reference files provided by the HPRC graph team, +hprc-v2.1-mc-grch38.gref95.ro.vcf.gz for hg38 and +hprc-v2.1-mc-chm13.gref95.ro.vcf.gz for hs1. These are the raw +vg deconstruct output: each graph bubble is one multi-allelic +record with its graph traversals attached, and there are no per-allele type +or length fields. To turn a file into a track, we compared every alternate +allele to the reference allele after trimming the sequence they share at +each end. An allele was kept when the absolute net length change was at +least 50 bp, and was labeled an insertion when the alternate was longer or a +deletion when it was shorter. At this size no balanced, equal-length +substitutions were found, and the files carry no inversion calls, so the +track has only insertions and deletions. On hg38, 596,063 alleles were kept +(43,580 at nested snarl levels); on hs1, 608,435 (75,809 nested). Because +these files are not broken down into atomic indels, one bubble can appear as +a single large allele rather than several small ones, so the counts are not +comparable to a wave-decomposed callset. Allele counts, frequencies and +sample counts come straight from the VCF. +
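The trimming and classification rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the converter in makeDb/scripts/lrSv: the function names, the order of trimming (suffix first, then prefix) and the returned prefix offset are all assumptions made for the example. Note that trimming removes the same number of bases from both alleles, so it does not change the net length; it only isolates the part of the allele that differs.

```python
MIN_SV_LEN = 50  # size cutoff stated in the text


def trim_shared(ref, alt):
    """Strip sequence shared at both ends of REF and ALT.

    Hypothetical helper: trims the shared suffix first, then the shared
    prefix. Returns (trimmed_ref, trimmed_alt, prefix_bases_removed).
    """
    # Remove matching bases from the right-hand end.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Count matching bases at the left-hand end.
    n = 0
    while n < len(ref) and n < len(alt) and ref[n] == alt[n]:
        n += 1
    return ref[n:], alt[n:], n


def classify(ref, alt, min_len=MIN_SV_LEN):
    """Return ("INS" | "DEL", size) or None if the allele is below cutoff."""
    ref_t, alt_t, _ = trim_shared(ref, alt)
    delta = len(alt_t) - len(ref_t)  # net length change of the allele
    if abs(delta) < min_len:
        return None  # too small, or a balanced equal-length change
    return ("INS" if delta > 0 else "DEL", abs(delta))


# One multi-allelic record: every ALT is compared to the same REF.
ref = "A" + "G" * 75 + "T"
alts = ["AT", "A" + "G" * 75 + "C" * 60 + "T", "A" + "G" * 70 + "T"]
calls = [classify(ref, alt) for alt in alts]
```

In this toy record the first alternate loses 75 bp and the second gains 60 bp, so both pass the cutoff, while the third differs by only 5 bp and is dropped.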

+

+The conversion script and autoSql schema are in + +makeDb/scripts/lrSv and the build steps are in the makeDoc at + +doc/hg38/lrSv.txt. +

+ +

Data Access

+

+The data can be explored interactively in table format with the +Table Browser or the +Data Integrator, and read programmatically +through our API using +track=hprc2v21Sv. For automated download and analysis, the variants +are available as bigBed files on our download server, one per assembly: + +hg38 and + +hs1. You can extract one region or the whole set with +bigBedToBed, for example +bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/hprc2v21.bb -chrom=chr21 -start=0 -end=100000000 stdout. +
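For the API route, a request can be assembled as in the sketch below. It assumes the public UCSC REST endpoint at api.genome.ucsc.edu and its getData/track function with genome, track, chrom, start and end parameters; the helper names are made up for the example, and the layout of the returned JSON is not assumed here.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the UCSC REST API.
API_BASE = "https://api.genome.ucsc.edu/getData/track"


def track_query_url(genome, chrom, start, end, track="hprc2v21Sv"):
    """Build a getData/track URL for one region (hypothetical helper)."""
    params = {
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return API_BASE + "?" + urllib.parse.urlencode(params)


def fetch_region(genome, chrom, start, end, track="hprc2v21Sv"):
    """Fetch one region and return the parsed JSON response as a dict."""
    url = track_query_url(genome, chrom, start, end, track)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# Building the URL needs no network access; fetch_region() does.
url = track_query_url("hg38", "chr21", 0, 100000)
```

Use genome "hs1" instead of "hg38" to query the T2T-CHM13 callset.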

+ +

Credits

+

+Thanks to the Human Pangenome Reference Consortium for building and +publishing the release-2 minigraph-cactus pangenome, and to Glenn Hickey +for the v2.1 deconstructed VCFs. +

+ +

References

+

+HPRC release 2 is not yet described in a peer-reviewed publication. The +release announcement has background and data-access details: + +HPRC data release 2. +