src/hg/makeDb/trackDb/human/hprc2v21Sv.html 2e0addd016cfcbf61485b90d8980a8d75be622c2

2e0addd016cfcbf61485b90d8980a8d75be622c2
lrnassar
  Sun Jun 14 00:10:06 2026 -0700
lrSv: sync description-page counts to the deduped data; drop Kim PD from the supertrack page. refs #36258

After the QA dedup, update the SV counts cited on the description pages to the
unique (post-dedup) totals for the tracks served, while leaving the upstream
release/paper counts in the Methods sections:
decodeSv     133,886 -> 119,453 displayed
gustafsonSv  113,696 -> 113,159 displayed
chirmade101  87,183  -> 87,068  displayed
aou1k        541,049 -> 540,155 displayed
hprc2v21Sv   596,063 -> 549,649 (hg38) and 608,435 -> 541,176 (hs1), updated
throughout the page (no upstream publication count to preserve), incl.
recomputed nested-snarl counts
lrSv.html: update the Available Datasets table count cells to match, set the
lrSvAll merged cell to 2,317,508 (post Kim PD removal), and remove the Kim PD
Brain row, blurb and reference from the supertrack page (the track is staged on
dev/alpha only, kept out of the merge and the description, and is not released).
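
For illustration only, a minimal sketch of what dropping byte-identical
duplicate records and reporting the before/after counts could look like. The
actual QA dedup procedure is not part of this commit; the function name and
the toy tab-separated records below are hypothetical.

```python
def dedup_records(lines):
    """Drop byte-identical duplicates, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


# Toy records (hypothetical field layout, not taken from the real tracks).
records = [
    "chr1\t100\t160\tINS\t60",
    "chr1\t100\t160\tINS\t60",    # byte-identical duplicate of the line above
    "chr1\t500\t620\tDEL\t120",
]

unique = dedup_records(records)
removed = len(records) - len(unique)
print(f"{len(records):,} -> {len(unique):,} displayed ({removed:,} removed)")
```

Only exact string matches are collapsed here; records that differ in any
field are kept as separate items.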

diff --git src/hg/makeDb/trackDb/human/hprc2v21Sv.html src/hg/makeDb/trackDb/human/hprc2v21Sv.html
index 9d3955f33c8..196796a19a3 100644
--- src/hg/makeDb/trackDb/human/hprc2v21Sv.html
+++ src/hg/makeDb/trackDb/human/hprc2v21Sv.html
@@ -1,31 +1,31 @@
 <h2>Description</h2>
 <p>
 A pangenome graph holds many human genomes at once. Sequence that the
 genomes share collapses onto common paths, and the places where they
 differ show up as bubbles in the graph. This track shows the structural
 variants found in version 2.1 of the Human Pangenome Reference Consortium
 (HPRC) minigraph-cactus graph, which was built from haplotype-resolved
 PacBio HiFi assemblies of 233 samples. Only larger events are shown here:
 insertions and deletions of at least 50 bp. HPRC produces one variant file
 per reference path, so the events are measured against GRCh38 on hg38 and
 against T2T-CHM13 on hs1, and each assembly shows its own native callset.
 </p>
 <p>
-On hg38 there are about 596,000 such alleles (roughly 448,000 insertions and
-148,000 deletions). On hs1 there are about 608,000 (roughly 363,000
-insertions and 245,000 deletions). The two sets are not lifted between
+On hg38 there are about 550,000 such alleles (roughly 422,000 insertions and
+128,000 deletions). On hs1 there are about 541,000 (roughly 348,000
+insertions and 193,000 deletions). The two sets are not lifted between
 assemblies; the counts differ because an insertion against one reference can
 be a deletion against the other.
 </p>
 
 <h2>Display Conventions and Configuration</h2>
 <p>
 Items are colored by SV type:
 </p>
 <table class="stdTbl">
   <tr><th style="background-color:#0000C8;width:2em">&nbsp;</th>
       <td>Insertion (INS)</td></tr>
   <tr><th style="background-color:#C80000;width:2em">&nbsp;</th>
       <td>Deletion (DEL)</td></tr>
 </table>
 <p>
@@ -49,32 +49,33 @@
 <a href="https://github.com/human-pangenomics/hprc_intermediate_assembly/blob/main/data_tables/pangenomes/alignments_v2.0.csv" target="_blank">
 alignments_v2.0.csv</a>.
 </p>
 <p>
 We started from the per-reference files provided by the HPRC graph team,
 <tt>hprc-v2.1-mc-grch38.gref95.ro.vcf.gz</tt> for hg38 and
 <tt>hprc-v2.1-mc-chm13.gref95.ro.vcf.gz</tt> for hs1. These are the raw
 <tt>vg deconstruct</tt> output: each graph bubble is one multi-allelic
 record with its graph traversals attached, and there are no per-allele type
 or length fields. To turn a file into a track, we compared every alternate
 allele to the reference allele after trimming the sequence they share at
 each end. An allele was kept when the net length change was at least 50 bp,
 and labeled an insertion when the alternate is longer or a deletion when it
 is shorter. At this size no balanced, equal-length substitutions came up,
 and the files carry no inversion calls, so the track has only insertions and
-deletions. On hg38, 596,063 alleles were kept (43,580 at nested snarl
-levels); on hs1, 608,435 (75,809 nested). Because these files are not broken
+deletions. After removing byte-identical duplicate records, 549,649 alleles
+were kept on hg38 (40,678 at nested snarl levels) and 541,176 on hs1 (70,200
+nested). Because these files are not broken
 down into atomic indels, one bubble can appear as a single large allele
 rather than several small ones, so the counts are not comparable to a
 wave-decomposed callset. Allele counts, frequencies and sample counts come
 straight from the VCF.
 </p>
 <p>
 The conversion script and autoSql schema are in
 <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/lrSv" target="_blank">
 makeDb/scripts/lrSv</a> and the build steps are in the makeDoc at
 <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/lrSv.txt" target="_blank">
 doc/hg38/lrSv.txt</a>.
 </p>
 
 <h2>Data Access</h2>
 <p>