src/hg/makeDb/trackDb/human/cpc1Sv.html c81011d4a8f57db347e15aa1248c501b2c8a6fea

c81011d4a8f57db347e15aa1248c501b2c8a6fea
lrnassar
  Mon Jun 1 13:16:15 2026 -0700
QA fixes for the lrSv long-read SV supertrack: labels and description cleanups. refs #36258

Trim six subtrack longLabels to the 85-char limit (ga4kSv, hprc2Sv, hgsvc2Sv,
chirmade101Sv, cpc1Sv, and lrSvAll; the lrSvAll change is also made in the
lrSvMergeAll.py generator so a re-run reproduces it).
Standardize the APR dataset name to "Arab Pangenome Reference (APR)" across
lrSv.ra, lrSv.html, aprSv.html, and the makeDoc comment (was a mix of "Arabic"
and "UAE UPR").
lrSv1kgOnt.html: state per-assembly SV counts (hg38 lifted 148,375 vs hs1
native 161,332, each with its own type breakdown) and encode non-ASCII author
names as numeric entities.
hgsvc3Sv.html: correct the hg38 counts to match the served bigBed (176,231
DEL+INS, 176,531 total).
colorsDbSv.html: use $db in the hgdownload path so it resolves on hs1 as well
as hg38.
cpc1Sv.html: encode a Unicode minus sign as a numeric entity.
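
A minimal sketch of the entity encoding described above, assuming the goal is simply to turn every non-ASCII character into a decimal numeric entity. The helper name encode_non_ascii is hypothetical and is not taken from the kent tree; it relies only on Python's built-in "xmlcharrefreplace" error handler.

```python
def encode_non_ascii(text: str) -> str:
    """Return text with each non-ASCII character replaced by a decimal
    numeric character reference (for example U+2212 becomes &#8722;)."""
    # The "xmlcharrefreplace" handler substitutes &#NNNN; for anything
    # that the ASCII codec cannot represent; ASCII passes through unchanged.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# The line changed in cpc1Sv.html, written with an explicit U+2212 escape.
before = "maximum |len(ALT) \u2212 len(REF)| across alts."
print(encode_non_ascii(before))
```

Running a whole description page through such a helper would also convert accented author names, which is the same kind of change the lrSv1kgOnt.html item describes.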

diff --git src/hg/makeDb/trackDb/human/cpc1Sv.html src/hg/makeDb/trackDb/human/cpc1Sv.html
index b5d40795e95..d4d9ec3d216 100644
--- src/hg/makeDb/trackDb/human/cpc1Sv.html
+++ src/hg/makeDb/trackDb/human/cpc1Sv.html
@@ -1,153 +1,153 @@
 <h2>Description</h2>
 
 <p>
 This track displays structural variants (SVs) at least 50 bp long
 (deletions, insertions, and complex substitutions) identified by the
 Chinese Pangenome Consortium (CPC) in 58 samples representing 36 Chinese
 minority ethnic groups.</p>
 
 <p>
 The upstream release combined the 58 CPC samples with 47 samples from
 Phase 1 of the Human Pangenome Reference Consortium (HPRC) into a single
 pangenome graph built on the T2T-CHM13v2 assembly with Minigraph-Cactus.
 For this track we recomputed allele counts (AC), allele numbers (AN) and
 sample counts (NS) using only the 58 CPC sample columns (those with
 <tt>HIFI032*</tt> or <tt>RY*</tt> prefixes in the source VCF) and dropped
 all snarls that no CPC sample carries (HPRC-specific SVs). To see the
 HPRC data on its own, use the HPRC SV tracks elsewhere in this collection.</p>
 
 <p>
 A pangenome is a graph that represents many genomes simultaneously, letting
 variants that are missing from a single linear reference be captured and
 typed directly. Variants are shown natively on the hs1 browser and lifted
 to hg38 using the UCSC <tt>hs1ToHg38.over.chain.gz</tt> chain. The track
 contains 46,092 snarl sites on hs1 and 36,030 lifted to hg38 (10,062 did
 not lift, typically in T2T-added repetitive regions).</p>
 
 <h2>Display Conventions and Configuration</h2>
 
 <p>Items are colored by SV type:</p>
 <ul>
   <li><span style="background-color:rgb(0,0,200);color:white;padding:1px 6px">INS</span> insertion (net ALT longer by &ge;50 bp)</li>
   <li><span style="background-color:rgb(200,0,0);color:white;padding:1px 6px">DEL</span> deletion (net REF longer by &ge;50 bp)</li>
   <li><span style="background-color:rgb(230,140,0);color:white;padding:1px 6px">CPX</span> complex substitution (similar-length REF and ALT but at least one &ge;50 bp)</li>
   <li><span style="background-color:rgb(120,120,120);color:white;padding:1px 6px">MIXED</span> snarl whose collapsed alt alleles belong to different classes</li>
 </ul>
 
 <p>
 Each bed item spans from the start of the REF allele to its end on the
 reference. Pure insertions (where REF is a single base) therefore appear
 as narrow single-base marks; DELs and CPX items span the affected reference
 interval.</p>
 
 <p>
 The <i>name</i> field is the graph snarl ID (two node identifiers separated
 by strand arrows, e.g. <tt>&gt;2541&gt;2547</tt>). It is stable across the
 graph but has no meaning outside the CPC pangenome graph file.</p>
 
 <h2>Collapsing of Multi-allelic Sites</h2>
 
 <p>
 The source VCF was decomposed with <tt>bcftools norm -m -any</tt>, so each
 graph snarl appears as one VCF row per alternative allele (a single
 bubble in the graph may have 2-20+ alt paths). For this track we first
 compute the CPC-only allele count per alt, drop any alt that no CPC sample
 carries, then collapse all remaining alts sharing the same snarl ID into
 one track item:</p>
 <ul>
   <li><b>SV type</b> is the common class of all alts, or <tt>MIXED</tt> if
       they disagree (for example one alt is a DEL and another is an INS).</li>
-  <li><b>SV length</b> is the maximum |len(ALT) − len(REF)| across alts.</li>
+  <li><b>SV length</b> is the maximum |len(ALT) &#8722; len(REF)| across alts.</li>
   <li><b>Allele count</b> is the sum of the per-alt allele counts.</li>
   <li><b>Number of alts</b> records how many alternative alleles were merged.</li>
 </ul>
 
 <h2>Filters</h2>
 
 <p>Available filters:</p>
 <ul>
   <li><b>SV type</b>: any combination of INS, DEL, CPX, MIXED.</li>
   <li><b>SV length</b>: maximum allele-length difference.</li>
   <li><b>Allele frequency</b> and <b>allele count</b> across the combined
       105 samples.</li>
 </ul>
 
 <h2>Methods</h2>
 
 <p>
 Gao et al. 2023 generated PacBio HiFi long reads (mean ~30.65x,
 Sequel II/IIe platforms) for 58 QC-passed samples representing 36
 minority Chinese ethnic groups, complemented with Illumina short reads
 and Oxford Nanopore ultralong reads. Haplotype-phased de novo assemblies
 were produced with
 <a href="https://github.com/chhylp123/hifiasm" target="_blank">hifiasm</a>
 v0.16.1 (116 high-quality haplotype assemblies retained after QC) and
 combined with 47 HPRC Phase 1 assemblies into a single variation graph
 built on T2T-CHM13v2 with the Minigraph-Cactus pipeline (Minigraph v0.19
 for the SV skeleton, Cactus v2.1.1 base alignment, <tt>hal2vg</tt>).
 Graph bubbles were decomposed into variant records with <tt>vcfwave</tt>
 and normalized with <tt>bcftools norm -m -any</tt>, yielding the source
 VCF (<tt>CPC.HPRC.Phase1.processed.SVs.normed.vcf.gz</tt>). The upstream
 Gao et al. release identified 78,072 SVs across the combined 105-sample
 graph. For this track we restrict to the 58 CPC samples (columns matching
 <tt>HIFI032*</tt> or <tt>RY*</tt>), recompute AC/AN/NS from those columns
 only, drop snarls with no CPC carrier (HPRC-specific sites), filter to
 alts with &ge;50 bp REF/ALT length difference, and collapse by graph snarl
 ID. The final track contains 46,092 snarl sites on hs1; the hg38 version
 is lifted with the UCSC <tt>hs1ToHg38.over.chain.gz</tt> chain (36,030
 sites, 10,062 did not lift).</p>
 
 <p>
 The source VCF is distributed by the
 <a href="https://github.com/Shuhua-Group/Chinese-Pangenome-Consortium-Phase-I" target="_blank">
 Chinese-Pangenome-Consortium-Phase-I GitHub repository</a>.</p>
 
 <p>
 The step-by-step build commands (CPC-only recount, liftOver, snarl
 collapse, bigBed build) are recorded in the UCSC makeDoc for this track
 container:
 <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/lrSv.txt" target="_blank">
 doc/hg38/lrSv.txt</a>. The conversion scripts and autoSql schemas live in
 <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/lrSv" target="_blank">
 makeDb/scripts/lrSv</a>.
 </p>
 
 <h2>Data Access</h2>
 
 <p>The data can be explored interactively with the
 <a href="../cgi-bin/hgTables">Table Browser</a> or
 <a href="../cgi-bin/hgIntegrator">Data Integrator</a>, and accessed from
 scripts via our <a href="https://api.genome.ucsc.edu">API</a>
 (track=<i>cpc1Sv</i>).</p>
 
 <p>For automated download, the bigBed files are at
 <a href="http://hgdownload.soe.ucsc.edu/gbdb/hs1/lrSv/cpc1.bb" target="_blank">
 http://hgdownload.soe.ucsc.edu/gbdb/hs1/lrSv/cpc1.bb</a> (native) and
 <a href="http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/cpc1.bb" target="_blank">
 http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/cpc1.bb</a> (lifted).
 Use <tt>bigBedToBed</tt> to extract features: e.g.
 <tt>bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hs1/lrSv/cpc1.bb -chrom=chr21 -start=0 -end=100000000 stdout</tt></p>
 
 <p>The original pangenome VCF is distributed by the Chinese Pangenome
 Consortium; see the
 <a href="https://github.com/Shuhua-Group/Chinese-Pangenome-Consortium-Phase-I" target="_blank">
 CPC Phase I repository</a>.</p>
 
 <h2>Credits</h2>
 
 <p>Thanks to the Chinese Pangenome Consortium and the HPRC Phase 1 team
 for producing and releasing the combined pangenome and its decomposed
 variant calls.</p>
 
 <h2>References</h2>
 
 
 <p>
 Gao Y, Yang X, Chen H, Tan X, Yang Z, Deng L, Wang B, Kong S, Li S, Cui Y <em>et al</em>.
 <a href="https://doi.org/10.1038/s41586-023-06173-7" target="_blank">
 A pangenome reference of 36 Chinese populations</a>.
 <em>Nature</em>. 2023 Jul;619(7968):112-121.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/37316654" target="_blank">37316654</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10322713/" target="_blank">PMC10322713</a>
 </p>
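
The collapse rules listed in the page's "Collapsing of Multi-allelic Sites" section can be sketched as below. This is an illustration only, not the conversion script from makeDb/scripts/lrSv: the input tuple layout, the function names, the output keys (svType, svLen, alleleCount, numAlts), and the exact thresholds used in classify() are assumptions drawn from the wording of the page.

```python
from collections import defaultdict

MIN_SV_LEN = 50  # the page describes SVs as at least 50 bp long


def classify(ref: str, alt: str):
    """Assumed classification: INS/DEL by net length change, CPX when the
    lengths are similar but at least one allele is >= 50 bp, else None."""
    diff = len(alt) - len(ref)
    if diff >= MIN_SV_LEN:
        return "INS"
    if diff <= -MIN_SV_LEN:
        return "DEL"
    if max(len(ref), len(alt)) >= MIN_SV_LEN:
        return "CPX"
    return None  # not treated as an SV in this sketch


def collapse(rows):
    """rows: iterable of (snarl_id, ref, alt, cpc_allele_count) tuples,
    one per decomposed alt allele. Returns one summary dict per snarl ID."""
    groups = defaultdict(list)
    for snarl_id, ref, alt, ac in rows:
        sv_type = classify(ref, alt)
        # Drop alts that no CPC sample carries, and alts that are not SVs.
        if ac == 0 or sv_type is None:
            continue
        groups[snarl_id].append((sv_type, abs(len(alt) - len(ref)), ac))

    items = {}
    for snarl_id, alts in groups.items():
        types = {sv_type for sv_type, _, _ in alts}
        items[snarl_id] = {
            # Common class of all alts, or MIXED if they disagree.
            "svType": types.pop() if len(types) == 1 else "MIXED",
            # Maximum |len(ALT) - len(REF)| across alts.
            "svLen": max(length for _, length, _ in alts),
            # Sum of the per-alt allele counts.
            "alleleCount": sum(ac for _, _, ac in alts),
            # How many alternative alleles were merged.
            "numAlts": len(alts),
        }
    return items


example = [
    (">1>2", "A", "A" * 61, 3),   # insertion, 60 bp longer
    (">1>2", "A" * 101, "A", 2),  # deletion, 100 bp shorter
    (">1>2", "A", "A" * 71, 0),   # no CPC carrier, dropped
    (">3>4", "A" * 80, "A", 5),   # deletion, 79 bp shorter
]
print(collapse(example))
```

In this toy input the first snarl mixes an insertion and a deletion, so it is reported as MIXED with the larger of the two length differences, while the zero-count alt does not contribute to the allele count or the number of alts.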