9a11061ca6b40fe16bdfd09b1af53192f6c7c85b max Tue Apr 21 08:13:02 2026 -0700 lrSv: add HTML doc pages and conversion scripts for recent subtracks, + hs1 HGSVC3 Subtrack stanzas for these SV callsets landed in earlier commits but the conversion scripts and per-track HTML description pages were never added; trackDb therefore had no doc to serve. This commit catches up. Docs (new): - colorsDbSv.html CoLoRSdb 1,427-sample long-read SVs - gustafsonSv.html 1KG ONT 100 (Gustafson 2024, PMID 39358015) - hgsvc2Sv.html HGSVC2 (Ebert 2021, PMID 33632895) - hprc2Sv.html HPRC release-2 pangenome SVs (no PMID yet; see humanpangenome.org/hprc-data-release-2/) - onekg3202Sr.html 1KG 3202 Illumina SHORT-READ GATK-SV (Byrska-Bishop 2022, PMID 36055201) Scripts (new): - lrSvGustafson.as / lrSvGustafsonVcfToBed.py - lrSvHgsvc2.as / lrSvHgsvc2TsvToBed.py (merges insdel + inv tables) - lrSvHprc2.as / lrSvHprc2VcfToBed.py (streams wave-decomposed VCF, explodes multi-allelic rows, filters to SV-sized or INV) - lrSv1kg3202Sr.as / lrSv1kg3202SrVcfToBed.py HGSVC3 also on hs1: - hgsvc3Sv.html: note that the hs1 build is native (not lifted): HGSVC3 aligned all assemblies to both GRCh38 and T2T-CHM13 and released separate annotation tables per reference. Added the T2T-CHM13 source URL to the Methods section and the hs1 hgsvc3.bb download link to Data Access. - doc/hs1/lrSv.txt (new): hs1-specific wget + build steps; refers back to doc/hg38/lrSv.txt for the full process. refs #36258 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> diff --git src/hg/makeDb/trackDb/human/hgsvc2Sv.html src/hg/makeDb/trackDb/human/hgsvc2Sv.html new file mode 100644 index 00000000000..500ba0a04f9 --- /dev/null +++ src/hg/makeDb/trackDb/human/hgsvc2Sv.html @@ -0,0 +1,122 @@ +<h2>Description</h2> +<p> +This track shows structural variants (SVs) from the second phase of the +Human Genome Structural Variation Consortium (HGSVC2). 
The callset is +derived from 32 haplotype-resolved diploid genomes (64 phased haplotypes) +spanning five 1000 Genomes superpopulations (African, Admixed American, +East Asian, European, South Asian). Each genome was sequenced with +PacBio long reads (continuous long-read and HiFi) and phased with +Strand-seq, enabling comprehensive characterization of SVs that short-read +approaches miss. +</p> +<p> +The track merges the two SV annotation tables from the HGSVC2 v2.0 +integrated callset freeze 4: 111,330 insertions/deletions and 416 +inversions, for a total of 111,746 SVs. Each row is a site-level variant +with per-site allele count, carrier haplotypes, population-scale allele +frequencies (imputed from the phased callset back into 1000 Genomes, +insertions and deletions only) and structural annotations. +</p> + +<h2>Display Conventions and Configuration</h2> +<p> +Items are colored by SV type: +<ul> +<li><span style="color: rgb(200,0,0);">Deletions (DEL)</span> - red</li> +<li><span style="color: rgb(0,0,200);">Insertions (INS)</span> - blue</li> +<li><span style="color: rgb(230,140,0);">Inversions (INV)</span> - orange</li> +</ul> +</p> +<p> +Insertions are placed at the insertion site with a width of 1 bp; deletions +and inversions span the affected reference interval. Filters are available +for SV type, SV length, carrier-haplotype count, distinct sample count, +whether the site falls in a Tandem Repeats Finder region, and the fraction +of the variant overlapping segmental duplications. 
+</p> +<p> +The detail page shows, where available: +<ul> +<li><b>Allele / Sample Count</b>: carrier-haplotype count (MERGE_AC) and +the number of distinct samples carrying the variant.</li> +<li><b>Population Allele Frequencies</b> (insertions and deletions only): +overall and per-population (AFR, AMR, EAS, EUR, SAS) allele frequencies +computed from the imputed 1000 Genomes callset.</li> +<li><b>RefSeq Gene Overlaps</b>: bases of overlap with CDS, 5'/3' UTRs, +introns, non-coding RNAs, and +/- 5 kb windows around each gene.</li> +<li><b>Gene Constraint</b>: maximum gnomAD pLI and minimum LOEUF upper +bound for genes overlapping the SV.</li> +<li><b>Reference Context</b>: cytoband, segmental-duplication overlap, and +whether the SV falls in a Tandem Repeats Finder region.</li> +<li><b>Carrier Haplotypes</b>: full list of sample-haplotype IDs (e.g. +<tt>HG00096-h1</tt>, <tt>HG00514-un</tt>) carrying the variant.</li> +<li><b>Inner Inversion Region</b> (INV only): coordinates of the inner +inverted sequence, distinct from the outer breakpoint interval.</li> +</ul> +</p> + +<h2>Methods</h2> +<p> +HGSVC2 generated phased haplotype-resolved de novo assemblies for 32 +diploid samples across five 1000 Genomes superpopulations. Assemblies +were built from PacBio continuous long reads and HiFi reads and phased +with Strand-seq. Structural variants were discovered from each haplotype +assembly using PAV and validated with multiple orthogonal callers +(including PBSV, Bionano, DeepVariant, PAV-LRA, and others recorded in +per-site validation columns). The final SV set was merged to produce the +integrated callset used here. +</p> +<p> +Population-scale allele frequencies (POP_*_AF) were derived by imputing +the HGSVC2 SVs back into the full 1000 Genomes short-read cohort. These +fields are only available for insertions and deletions. 
+</p> +<p> +Two tables were merged for display here: +<tt>variants_freeze4_sv_insdel.tsv.gz</tt> (DEL + INS, 111,330 records) and +<tt>variants_freeze4_sv_inv.tsv.gz</tt> (INV, 416 records). Type-specific +columns (POP_*_AF for insdel, RGN_REF_INNER for inversions) are shown as +empty on the detail page when they do not apply. +</p> + +<h2>Data Access</h2> +<p> +The data can be explored interactively in table format with the +<a href="../cgi-bin/hgTables">Table Browser</a> or the +<a href="../cgi-bin/hgIntegrator">Data Integrator</a>, and accessed +programmatically through our <a href="https://api.genome.ucsc.edu">API</a>, +track=<i>hgsvc2Sv</i>. +</p> +<p> +The bigBed is available from +<a href="http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/" target="_blank">our +download server</a> as <tt>hgsvc2.bb</tt>. Example: +<tt>bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/hgsvc2.bb -chrom=chr21 -start=0 -end=100000000 stdout</tt>. +</p> +<p> +The original annotation tables and VCFs are available from the +<a href="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC2/release/v2.0/integrated_callset/" target="_blank"> +HGSVC2 v2.0 integrated callset</a> on the IGSR FTP site. +</p> + +<h2>Credits</h2> +<p> +Thanks to the Human Genome Structural Variation Consortium (HGSVC) and +the 1000 Genomes Project for releasing this dataset. Later HGSVC releases +are also available as UCSC tracks: +<a href="hgTrackUi?g=hgsvc3Sv">HGSVC3 65 SVs</a>. +</p> + +<h2>References</h2> + + +<p> +Ebert P, Audano PA, Zhu Q, Rodriguez-Martin B, Porubsky D, Bonder MJ, Sulovari A, Ebler J, Zhou W, +Serra Mari R <em>et al</em>. +<a href="https://www.science.org/doi/10.1126/science.abf7117" target="_blank"> +Haplotype-resolved diverse human genomes and integrated analysis of structural variation</a>. +<em>Science</em>. 2021 Apr 2;372(6537). 
+PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/33632895" target="_blank">33632895</a>; PMC: <a +href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8026704/" target="_blank">PMC8026704</a> +</p> +