6b0d68657267f1e02c47d4224ea62446bbbb2ba0 max Fri May 22 06:55:52 2026 -0700
small non-AI changes to the html docs pages of the long-read SV tracks

diff --git src/hg/makeDb/trackDb/human/cpc1Sv.html src/hg/makeDb/trackDb/human/cpc1Sv.html
index 167e890b81d..b5d40795e95 100644
--- src/hg/makeDb/trackDb/human/cpc1Sv.html
+++ src/hg/makeDb/trackDb/human/cpc1Sv.html
@@ -1,85 +1,85 @@
 <h2>Description</h2>
 <p>
-This track displays structural variants (SVs) — deletions, insertions,
-and complex substitutions of at least 50 bp — identified by the
+This track displays structural variants (SVs) at least 50 bp long
+(deletions, insertions, and complex substitutions) identified by the
 Chinese Pangenome Consortium (CPC) in 58 samples representing 36 Chinese
 minority ethnic groups.</p>
 <p>
 The upstream release combined the 58 CPC samples with 47 samples from
 Phase 1 of the Human Pangenome Reference Consortium (HPRC) into a single
 pangenome graph built on the T2T-CHM13v2 assembly with Minigraph-Cactus.
 For this track we recomputed allele counts (AC), allele numbers (AN) and
 sample counts (NS) using only the 58 CPC sample columns (those with
 <tt>HIFI032*</tt> or <tt>RY*</tt> prefixes in the source VCF) and dropped
 all snarls that no CPC sample carries (HPRC-specific SVs). To see the
 HPRC data on its own, use the HPRC SV tracks elsewhere in this
 collection.</p>
 <p>
 A pangenome is a graph that represents many genomes simultaneously,
 letting variants that are missing from a single linear reference be
 captured and typed directly. Variants are shown natively on the hs1
 browser and lifted to hg38 using the UCSC
 <tt>hs1ToHg38.over.chain.gz</tt> chain.
 The track contains 46,092 snarl sites on hs1 and 36,030 lifted to hg38
 (10,062 did not lift, typically in T2T-added repetitive regions).</p>
-<h2>Display conventions</h2>
+<h2>Display Conventions and Configuration</h2>
 <p>Items are colored by SV type:</p>
 <ul>
  <li><span style="background-color:rgb(0,0,200);color:white;padding:1px 6px">INS</span> insertion (net ALT longer by ≥50 bp)</li>
  <li><span style="background-color:rgb(200,0,0);color:white;padding:1px 6px">DEL</span> deletion (net REF longer by ≥50 bp)</li>
  <li><span style="background-color:rgb(230,140,0);color:white;padding:1px 6px">CPX</span> complex substitution (similar-length REF and ALT but at least one ≥50 bp)</li>
  <li><span style="background-color:rgb(120,120,120);color:white;padding:1px 6px">MIXED</span> snarl whose collapsed alt alleles belong to different classes</li>
 </ul>
 <p>
 Each bed item spans from the start of the REF allele to its end on the
 reference. Pure insertions (where REF is a single base) therefore appear
 as narrow single-base marks; DELs and CPX items span the affected
 reference interval.</p>
 <p>
 The <i>name</i> field is the graph snarl ID (two node identifiers
 separated by strand arrows, e.g. <tt>>2541>2547</tt>). It is stable
 across the graph but has no meaning outside the CPC pangenome graph
 file.</p>
-<h2>Collapsing of multi-allelic sites</h2>
+<h2>Collapsing of Multi-allelic Sites</h2>
 <p>
 The source VCF was decomposed with <tt>bcftools norm -m -any</tt>, so
 each graph snarl appears as one VCF row per alternative allele (a single
 bubble in the graph may have 2-20+ alt paths).
 For this track we first compute the CPC-only allele count per alt, drop
 any alt that no CPC sample carries, then collapse all remaining alts
 sharing the same snarl ID into one track item:</p>
 <ul>
  <li><b>SV type</b> is the common class of all alts, or <tt>MIXED</tt> if they disagree (for example one alt is a DEL and another is an INS).</li>
  <li><b>SV length</b> is the maximum |len(ALT) − len(REF)| across alts.</li>
  <li><b>Allele count</b> is the sum of the per-alt allele counts.</li>
  <li><b>Number of alts</b> records how many alternative alleles were merged.</li>
 </ul>
 <h2>Filters</h2>
 <p>Available filters:</p>
 <ul>
- <li><b>SV type</b> — any combination of INS, DEL, CPX, MIXED.</li>
- <li><b>SV length</b> — maximum allele-length difference.</li>
+ <li><b>SV type</b>: any combination of INS, DEL, CPX, MIXED.</li>
+ <li><b>SV length</b>: maximum allele-length difference.</li>
  <li><b>Allele frequency</b> and <b>allele count</b> across the combined 105 samples.</li>
 </ul>
 <h2>Methods</h2>
 <p>
 Gao et al. 2023 generated PacBio HiFi long reads (mean ~30.65x, Sequel
 II/IIe platforms) for 58 QC-passed samples representing 36 minority
 Chinese ethnic groups, complemented with Illumina short reads and Oxford
 Nanopore ultralong reads. Haplotype-phased de novo assemblies were
 produced with
 <a href="https://github.com/chhylp123/hifiasm" target="_blank">hifiasm</a>
 v0.16.1 (116 high-quality haplotype assemblies retained after QC) and
 combined with 47 HPRC Phase 1 assemblies into a single variation graph