9a11061ca6b40fe16bdfd09b1af53192f6c7c85b max Tue Apr 21 08:13:02 2026 -0700

lrSv: add HTML doc pages and conversion scripts for recent subtracks, + hs1 HGSVC3

Subtrack stanzas for these SV callsets landed in earlier commits but the
conversion scripts and per-track HTML description pages were never added;
trackDb therefore had no doc to serve. This commit catches up.

Docs (new):
- colorsDbSv.html   CoLoRSdb 1,427-sample long-read SVs
- gustafsonSv.html  1KG ONT 100 (Gustafson 2024, PMID 39358015)
- hgsvc2Sv.html     HGSVC2 (Ebert 2021, PMID 33632895)
- hprc2Sv.html      HPRC release-2 pangenome SVs (no PMID yet; see
                    humanpangenome.org/hprc-data-release-2/)
- onekg3202Sr.html  1KG 3202 Illumina SHORT-READ GATK-SV
                    (Byrska-Bishop 2022, PMID 36055201)

Scripts (new):
- lrSvGustafson.as / lrSvGustafsonVcfToBed.py
- lrSvHgsvc2.as / lrSvHgsvc2TsvToBed.py (merges insdel + inv tables)
- lrSvHprc2.as / lrSvHprc2VcfToBed.py (streams wave-decomposed VCF,
  explodes multi-allelic rows, filters to SV-sized or INV)
- lrSv1kg3202Sr.as / lrSv1kg3202SrVcfToBed.py

HGSVC3 also on hs1:
- hgsvc3Sv.html: note that the hs1 build is native (not lifted): HGSVC3
  aligned all assemblies to both GRCh38 and T2T-CHM13 and released separate
  annotation tables per reference. Added the T2T-CHM13 source URL to the
  Methods section and the hs1 hgsvc3.bb download link to Data Access.
- doc/hs1/lrSv.txt (new): hs1-specific wget + build steps; refers back to
  doc/hg38/lrSv.txt for the full process.

refs #36258

Co-Authored-By: Claude Opus 4.7 (1M context)

diff --git src/hg/makeDb/trackDb/human/onekg3202Sr.html src/hg/makeDb/trackDb/human/onekg3202Sr.html
new file mode 100644
index 00000000000..b0bc17186ad
--- /dev/null
+++ src/hg/makeDb/trackDb/human/onekg3202Sr.html
@@ -0,0 +1,112 @@
+

Description

+

+This track is the only short-read dataset in the Long-Read Variants
+track collection; it is included for comparison with the long-read
+callsets.
+

+

+This track shows structural variants (SVs) from the expanded 1000 Genomes
+Project cohort of 3,202 high-coverage Illumina short-read
+whole-genome sequences (including 602 trios), sequenced at ~30x on NovaSeq
+6000 and described in Byrska-Bishop et al. 2022. SVs were called with the
+GATK-SV / svtools integrated pipeline; this release adds re-genotyped
+novel insertions and recomputed allele frequencies per continental group.
+

+

+The track contains 173,366 SVs across seven classes: 90,259 deletions
+(DEL), 49,693 insertions (INS), 28,242 duplications (DUP), 3,568 complex
+events (CPX), 920 inversions (INV), 673 multi-allelic copy-number variants
+(CNV) and 11 reciprocal translocations (CTX). Allele counts, allele
+frequencies and per-superpopulation frequencies (AFR, AMR, EAS/ASN, EUR,
+SAS/SAN) are provided for each site.
+

+ +

Display Conventions and Configuration

+

+Items are colored by SV type:
+

+

+

+Insertions are placed at the insertion site; deletions, duplications,
+inversions, complex and copy-number variants span the affected reference
+interval. Translocations show only the chr1-side breakpoint; the partner
+chromosome is reported on the detail page.
+
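The placement rule above can be sketched in Python roughly as follows. This is an illustration only, not the code in lrSv1kg3202SrVcfToBed.py: the function name `sv_to_bed_interval` is hypothetical, the inputs are assumed to be a 1-based VCF POS and END, and drawing point events (insertions and the single displayed translocation breakpoint) as one-base items is an assumption.

```python
def sv_to_bed_interval(sv_type, pos, end):
    """Return a 0-based, half-open (chromStart, chromEnd) pair for one SV.

    pos and end are assumed to be 1-based, inclusive VCF coordinates.
    """
    start = pos - 1  # VCF is 1-based, BED is 0-based
    if sv_type in ("INS", "CTX"):
        # Point events: assumed to be drawn as a one-base item at the
        # insertion site or at the single displayed breakpoint.
        return start, start + 1
    # DEL, DUP, INV, CPX and CNV span the affected reference interval.
    return start, max(end, pos)
```

For example, a deletion with POS=1001 and END=2000 would map to the BED interval 1000-2000, while an insertion at POS=500 would map to 499-500.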

+

+Filters are available for SV type, SV length, overall allele frequency,
+population-max allele frequency and per-population AFs (African and
+European). The detail page also shows heterozygous / homozygous-alternate
+carrier counts, the set of upstream SV callers, the upstream pipeline
+source and the VCF FILTER status.
+
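The same kind of filtering could be reproduced offline on exported rows. The sketch below assumes that population-max allele frequency means the largest of the per-population AFs; the dictionary keys (`svType`, `svLen`, `af`, `popAf`) and both function names are hypothetical and are not taken from the track's .as file.

```python
def pop_max_af(pop_afs):
    """Return the largest allele frequency among the per-population values."""
    return max(pop_afs.values())


def passes_filters(item, sv_types=None, min_len=0, min_af=0.0, min_pop_max_af=0.0):
    """Apply type, length and frequency filters to one SV.

    item is a dict with hypothetical keys: svType, svLen, af, popAf.
    """
    if sv_types is not None and item["svType"] not in sv_types:
        return False
    if abs(item["svLen"]) < min_len:  # lengths may be signed in VCF-derived data
        return False
    if item["af"] < min_af:
        return False
    return pop_max_af(item["popAf"]) >= min_pop_max_af
```

A deletion of 1,200 bp with AFR AF 0.05 and EUR AF 0.01 would pass `passes_filters(item, sv_types={"DEL"}, min_len=1000, min_pop_max_af=0.05)`.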

+ +

Methods

+

+The 1000 Genomes expanded cohort was sequenced on Illumina NovaSeq 6000
+at ~30x coverage with 2x150 bp reads. Structural variants were called
+with the GATK-SV cohort pipeline and merged with svtools; novel insertions
+were re-genotyped to produce the integrated callset used here
+(1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.vcf.gz).
+Allele frequencies were computed genome-wide and per-population
+(AFR, AMR, EAS/ASN, EUR, SAS/SAN).
+

+

+Why a short-read track in a long-read collection? Short-read SV
+callsets such as this one generally have high precision for deletions
+and duplications but miss many insertions, repeat expansions and
+variants in complex/low-mappability regions that long-read technologies
+can resolve. Displaying this callset alongside the long-read tracks in
+this collection makes it easier to spot variants that are unique to
+long-read data or that have substantially different breakpoints when
+called from short reads.
+

+ +

Data Access

+

+The data can be explored interactively in table format with the
+Table Browser or the
+Data Integrator, and accessed
+programmatically through our API,
+track=onekg3202Sr.
+
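As a rough sketch of programmatic access, the helper below builds a request URL for the UCSC REST API `getData/track` endpoint. The helper name `track_data_url` is hypothetical, and `genome="hg38"` is an assumption based on the hg38 download path used for this track; only the track name comes from this page.

```python
from urllib.parse import urlencode

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def track_data_url(chrom, start, end, genome="hg38", track="onekg3202Sr"):
    """Build a getData/track URL for one region (0-based start, half-open end)."""
    params = {
        "genome": genome,  # assumed assembly
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return API_BASE + "?" + urlencode(params)
```

The resulting URL can be opened with `urllib.request.urlopen` or any HTTP client; the response is JSON.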

+

+The bigBed is available from
+our
+download server as onekg3202sr.bb. Example:
+bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/onekg3202sr.bb -chrom=chr21 -start=0 -end=100000000 stdout.
+
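The bigBedToBed example above could be wrapped in a script along these lines. This is a minimal sketch that assumes the `bigBedToBed` binary is installed and on PATH; the function names are hypothetical, and no assumption is made about what each output column contains.

```python
import subprocess

BIGBED_URL = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/onekg3202sr.bb"


def bigbed_to_bed_cmd(chrom, start, end, url=BIGBED_URL):
    """Return the bigBedToBed argument list for one region."""
    return ["bigBedToBed", url, f"-chrom={chrom}", f"-start={start}",
            f"-end={end}", "stdout"]


def fetch_bed_rows(chrom, start, end):
    """Run bigBedToBed and return its output as lists of tab-separated fields."""
    result = subprocess.run(bigbed_to_bed_cmd(chrom, start, end),
                            check=True, capture_output=True, text=True)
    return [line.split("\t") for line in result.stdout.splitlines()]
```

Calling `fetch_bed_rows("chr21", 0, 100000000)` would run the same command as the example and return one list per SV.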

+

+The original joint-genotyped VCF is available from the
+
+IGSR 1000 Genomes Illumina SV integration folder.
+

+ +

Credits

+

+Thanks to Byrska-Bishop, Marth and the 1000 Genomes / NYGC team for
+releasing this dataset, and to the GATK-SV developers for the cohort
+calling pipeline.
+

+ +

References

+ + +

+Byrska-Bishop M, Evani US, Zhao X, Basile AO, Abel HJ, Regier AA, Corvelo A, Clarke WE, Musunuri R,
+Nagulapalli K et al.
+
+High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602
+trios.
+Cell. 2022 Sep 1;185(18):3426-3440.e19.
+PMID: 36055201; PMC: PMC9439720
+

+