src/hg/makeDb/doc/hg38/lrSv.txt 7594507ca126d5242346787e42e13c52ea7709b1

7594507ca126d5242346787e42e13c52ea7709b1
max
  Fri Apr 17 08:40:31 2026 -0700
Add lrSv supertrack: long-read structural variants from 9 studies (hg38).

Sub-tracks (all bigBed 9+):
han945Sv      - 945 Han Chinese, ONT (Gong 2025, PMID 39929826)
lrSv1kgOnt    - 1019 samples from 1000 Genomes, ONT, SVAN-annotated
                (Schloissnig 2025, PMID 40702182; lifted from hs1)
tommoJpSv     - 333 Japanese (111 trios), ONT (Otsuki 2022, PMID 36127505)
aou1kSv       - 1027 All of Us, PacBio HiFi (Garimella 2025, PMID 41256123)
ga4kSv        - 502 GA4K pediatric rare disease, PacBio HiFi
                (Cohen 2022, PMID 35305867)
decodeSv      - 3622 Icelanders, ONT (Beyter 2021, PMID 33972781)
hgsvc3Sv      - 65 HGSVC3 diverse haplotype-resolved assemblies, HiFi+ONT
                (Logsdon 2025, PMID 40702183; merges insdel+inv tables)
kwanhoSv      - 100 post-mortem brains (PD/ILBD/HC), PacBio HiFi
                (Kim 2026, PMID 41929179)
chirmade101Sv - 101 long-read WGS samples, GWAS SVatalog cohort
                (Chirmade 2026, PMID 41203876)

Includes per-track conversion scripts and autoSql under
scripts/lrSv/, the supertrack summary table in lrSv.html, and a
consolidated makeDoc at doc/hg38/lrSv.txt.

refs #36258

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

diff --git src/hg/makeDb/doc/hg38/lrSv.txt src/hg/makeDb/doc/hg38/lrSv.txt
new file mode 100644
index 00000000000..e37c9a1701a
--- /dev/null
+++ src/hg/makeDb/doc/hg38/lrSv.txt
@@ -0,0 +1,252 @@
+# 2026-03-25 Claude max
+
+# Long-read structural variants supertrack
+# First subtrack: Han 945 - SVs from 945 Han Chinese individuals
+# Paper: Gong et al. 2025, Nat Commun, PMID 39929826
+# Data: OMIX repository, NGDC
+
+# Download VCF
+mkdir -p /hive/data/genomes/hg38/bed/lrSv/han945
+cd /hive/data/genomes/hg38/bed/lrSv/han945
+
+# VCF was downloaded from OMIX (accession OED00945268)
+# File: OED00945268_Han_945samples_SV.vcf.gz
+# 111,288 SVs: 49,518 DEL, 42,300 INS, 13,503 DUP, 5,595 INV, 372 TRA
+# Site-only VCF (no per-sample genotypes), merged with SURVIVOR v1.0.6
+
+# Convert VCF to BED and build bigBed
+python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSvVcfToBed.py \
+    OED00945268_Han_945samples_SV.vcf.gz han945.bed
+bedSort han945.bed han945.sorted.bed
+bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSv.as \
+    -tab han945.sorted.bed /hive/data/genomes/hg38/chrom.sizes han945.bb
+
+# Symlink
+mkdir -p /gbdb/hg38/lrSv
+ln -sf /hive/data/genomes/hg38/bed/lrSv/han945/han945.bb /gbdb/hg38/lrSv/han945.bb
+
+# Convert SUPP_VEC to per-sample genotype VCF for vcfTabix display
+python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSvHan945SuppVecToVcf.py \
+    OED00945268_Han_945samples_SV.vcf.gz han945genotypes.vcf
+bcftools sort han945genotypes.vcf -Oz -o han945genotypes.sorted.vcf.gz
+tabix -p vcf han945genotypes.sorted.vcf.gz
+
+##########
+# 2026-03-26 Claude max
+
+# Second subtrack: 1KG ONT - SVs from 1,019 diverse humans (1000 Genomes ONT)
+# Paper: Schloissnig et al. 2025, Nature, PMID 40702182
+# Data: 1000 Genomes ONT Vienna, IGSR/EBI FTP
+
+mkdir -p /hive/data/genomes/hg38/bed/lrSv/1k-1019
+cd /hive/data/genomes/hg38/bed/lrSv/1k-1019
+
+# VCF downloaded from:
+# https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1KG_ONT_VIENNA/release/v1.1/svan-annotation/
+# File: final-vcf.unphased.SVAN_1.3.vcf.gz
+# 161,332 SVAN-annotated SVs: 75,324 INS, 66,192 DEL, 19,816 COMPLEX
+# Site-only VCF (no genotypes), SVAN v1.3 annotations
+# Called with SAGA framework against pangenome graph
+# NOTE: VCF contig sizes match hs1 (CHM13/T2T), not hg38.
+# bigGuessDb confirms hs1. So we build a native hs1 bigBed and liftOver to hg38.
+
+# Convert SVAN VCF to BED, adding allele counts from the phased VCF.
+# The phased VCF (shapeit5) has AC/AN/AF for 164,625 variants.
+# Of the 161,332 SVAN variants, 158,469 (98.2%) have a matching phased variant.
+# The 2,863 unmatched SVs get alleleCount=-1 (displayed as "Unknown").
+# Fields kept from SVAN: svClass, svLen, insType, family, percResolved, tsdLen,
+# polyaLen, conformation, rtLen, nbMotifs, srcGene, nbExons, notCanonical.
+# Dropped: SOURCE_COORD (0% populated), all *_SEQ fields, *_MAPQ, REPEAT_BKP,
+# DUP_COORD, MOTIFS, CONFORMATION_EXT, HEXAMER_*, *_TD/TEMP_COORD (all rare/long).
+python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSv1kgOntVcfToBed.py \
+    final-vcf.unphased.SVAN_1.3.vcf.gz 1kgOnt.hs1.bed \
+    /hive/data/genomes/hs1/chrom.sizes \
+    shapeit5-phased-callset_final-vcf.phased.vcf.gz
+# 161,332 records, 158,469 with allele counts
+
+# Build hs1 bigBed
+bedSort 1kgOnt.hs1.bed 1kgOnt.hs1.sorted.bed
+bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSv1kgOnt.as \
+    -tab 1kgOnt.hs1.sorted.bed /hive/data/genomes/hs1/chrom.sizes 1kgOnt.hs1.bb
+
+# LiftOver to hg38
+liftOver -tab -bedPlus=9 1kgOnt.hs1.bed \
+    /gbdb/hs1/liftOver/hs1ToHg38.over.chain.gz \
+    1kgOnt.hg38.bed 1kgOnt.unmapped.bed
+# 148,375 mapped, ~13K unmapped
+
+# Build hg38 bigBed
+bedSort 1kgOnt.hg38.bed 1kgOnt.hg38.sorted.bed
+bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSv1kgOnt.as \
+    -tab 1kgOnt.hg38.sorted.bed /hive/data/genomes/hg38/chrom.sizes 1kgOnt.hg38.bb
+
+# Symlinks for both assemblies
+mkdir -p /gbdb/hg38/lrSv /gbdb/hs1/lrSv
+ln -sf /hive/data/genomes/hg38/bed/lrSv/1k-1019/1kgOnt.hg38.bb /gbdb/hg38/lrSv/1kgOnt.bb
+ln -sf /hive/data/genomes/hg38/bed/lrSv/1k-1019/1kgOnt.hs1.bb /gbdb/hs1/lrSv/1kgOnt.bb
+
+##########
+# 2026-03-26 Claude max
+
+# Third subtrack: ToMMo Japanese SVs - 333 individuals (111 trios)
+# Paper: Otsuki et al. 2022, Commun Biol, PMID 36127505
+# Data: jMorp portal, ToMMo
+
+mkdir -p /hive/data/genomes/hg38/bed/lrSv/tommoJp
+cd /hive/data/genomes/hg38/bed/lrSv/tommoJp
+
+# VCF downloaded from jMorp:
+# https://jmorp.megabank.tohoku.ac.jp/datasets/tommo-jsv1-20211208-af
+# File: tommo-JSV1-20211208-GRCh38-without-genotype-count.vcf.gz
+# 74,201 SVs: 37,981 DEL, 36,220 INS
+# Site-only VCF, merged with SURVIVOR v1.0.6
+# Native GRCh38 coordinates (confirmed via contig headers)
+# Trio-based: 111 families, includes Mendelian error rates
+
+# Convert VCF to BED and build bigBed
+python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSvTommoJpVcfToBed.py \
+    tommo-JSV1-20211208-GRCh38-without-genotype-count.vcf.gz tommoJp.bed
+bedSort tommoJp.bed tommoJp.sorted.bed
+bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSvTommoJp.as \
+    -tab tommoJp.sorted.bed /hive/data/genomes/hg38/chrom.sizes tommoJp.bb
+
+##########
+# 2026-03-26 Claude max
+
+# Fourth subtrack: AoU 1K - SVs from 1,027 AoU individuals (PacBio HiFi)
+# Paper: Garimella et al. 2025, medRxiv, PMID 41256123
+# Data: Supplementary media-2 from preprint
+
+mkdir -p /hive/data/genomes/hg38/bed/lrSv/aou1k
+cd /hive/data/genomes/hg38/bed/lrSv/aou1k
+
+# Downloaded supplementary CSV from preprint (media-2.gz)
+# 541,049 SVs: 444,524 INS, 96,525 DEL (autosomes only)
+# Population-specific AFs (AFR, AMR, EAS, EUR, SAS)
+# Gene annotations (OMIM, disease, cancer, ACMG), regulatory elements
+# eQTL, GWAS, and SV-trait associations
+# Native GRCh38 coordinates
+
+# Convert CSV to BED and build bigBed
+python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSvAou1kCsvToBed.py media-2.gz aou1k.bed
+bedSort aou1k.bed aou1k.sorted.bed
+bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSvAou1k.as \
+    -tab aou1k.sorted.bed /hive/data/genomes/hg38/chrom.sizes aou1k.bb
+
+##########
+# 2026-04-16 Claude max
+
+# Fifth subtrack: Genomic Answers for Kids (GA4K) - Children's Mercy
+# PacBio HiFi long-read SVs. 502-sample site-only release.
+# Primary reference for the program: Cohen et al. 2022, Genet Med, PMID 35305867
+# Data release: https://github.com/ChildrensMercyResearchInstitute/GA4K
+# (The matched GA4K small-variant release is handled in the Variant
+#  Frequencies collection; see ~/kent/src/hg/makeDb/doc/hg38/varFreqs.txt.)
+
+mkdir -p /hive/data/genomes/hg38/bed/lrSv/GA4K
+cd /hive/data/genomes/hg38/bed/lrSv/GA4K
+# Data cloned from the ChildrensMercyResearchInstitute/GA4K GitHub repo.
+# pacbio_sv_vcf/pb_joint_merged.sv.vcf.gz:
+#   115,554 replicated SVs from 502 samples (52,564 DEL, 58,219 INS,
+#   4,408 DUP, 363 INV). Jasmine v1.1.4 merge, filtered to SVs observed in
+#   2+ unrelated GA4K individuals or matching a deCODE/HPRC SV (svpack match).
+
+python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSvGa4kSvVcfToBed.py \
+    pacbio_sv_vcf/pb_joint_merged.sv.vcf.gz ga4kSv.bed
+bedSort ga4kSv.bed ga4kSv.sorted.bed
+bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSvGa4kSv.as \
+    -tab ga4kSv.sorted.bed /hive/data/genomes/hg38/chrom.sizes ga4kSv.bb
+
+##########
+# 2026-04-17 Claude max
+
+# Sixth subtrack: deCODE Icelandic high-confidence long-read SVs.
+# Paper: Beyter et al. 2021, Nat Genet, PMID 33972781
+# Data: https://github.com/DecodeGenetics/LRS_SV_sets
+
+mkdir -p /hive/data/genomes/hg38/bed/lrSv/decode
+cd /hive/data/genomes/hg38/bed/lrSv/decode
+# Downloaded from the DecodeGenetics/LRS_SV_sets GitHub repo:
+#   ont_sv_high_confidence_SVs.sorted.vcf.gz (+ .tbi)
+#   ont_sv_high_confidence_tandemdup.csv  (auxiliary tandem-duplication
+#       annotations; not currently displayed as a browser track)
+# 133,886 high-confidence SVs: 55,649 DEL, 75,050 INS, 3,187 INSDEL.
+# Site-only, native GRCh38 coordinates. INFO fields: SVTYPE, END, SVLEN,
+# TRRBEGIN, TRREND (surrounding tandem-repeat region).
+
+python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSvDecodeVcfToBed.py \
+    ont_sv_high_confidence_SVs.sorted.vcf.gz decodeSv.bed
+bedSort decodeSv.bed decodeSv.sorted.bed
+bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSvDecode.as \
+    -tab decodeSv.sorted.bed /hive/data/genomes/hg38/chrom.sizes decodeSv.bb
+
+##########
+# 2026-04-17 Claude max
+
+# Seventh subtrack: HGSVC3 - Human Genome Structural Variation Consortium
+# phase 3. 65 diverse samples, PacBio HiFi + ONT, PAV-based SV discovery.
+# Paper: Logsdon et al. 2025, Nature, PMID 40702183
+# Data: IGSR FTP release v1.0 (annotation_table/)
+
+mkdir -p /hive/data/genomes/hg38/bed/lrSv/hgsvc3
+cd /hive/data/genomes/hg38/bed/lrSv/hgsvc3
+# Downloaded the two SV annotation tables from:
+# https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC3/release/Variant_Calls/1.0/GRCh38/annotation_table/
+#   variants_GRCh38_sv_insdel_HGSVC2024v1.0.tsv.gz  (176,232 DEL+INS)
+#   variants_GRCh38_sv_inv_HGSVC2024v1.0.tsv.gz     (300 INV)
+# The two tables are complementary: the insdel table holds all
+# insertions and deletions (with HOM_REF/HOM_TIG/TE columns specific to
+# those variant types), while the inv table holds inversions (with an
+# RGN_REF_INNER column describing the inner inverted region). The lrSv
+# subtrack merges them into a single bigBed.
+
+python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSvHgsvc3TsvToBed.py \
+    variants_GRCh38_sv_insdel_HGSVC2024v1.0.tsv.gz \
+    variants_GRCh38_sv_inv_HGSVC2024v1.0.tsv.gz \
+    hgsvc3.bed
+bedSort hgsvc3.bed hgsvc3.sorted.bed
+bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSvHgsvc3.as \
+    -tab hgsvc3.sorted.bed /hive/data/genomes/hg38/chrom.sizes hgsvc3.bb
+
+##########
+# 2026-04-17 Claude max
+
+# Eighth subtrack: Kim et al. 2026 - PacBio HiFi long-read SVs from 100
+# post-mortem brain samples (Parkinson's disease / ILBD / healthy controls).
+# Paper: Kim et al. 2026, bioRxiv, PMID 41929179
+# Data: Supplementary Table 13 (media-13.txt) from the preprint.
+
+mkdir -p /hive/data/genomes/hg38/bed/lrSv/kwanho2026
+cd /hive/data/genomes/hg38/bed/lrSv/kwanho2026
+# media-13.txt holds the final high-confidence catalog of 74,552 SVs
+# (34,056 INS, 29,545 DEL, 9,707 DUP, 1,244 INV) across three cohorts
+# (PD: 35, ILBD: 31, HC: 34; 100 samples total). paper.txt has the preprint
+# text for reference. Numeric fields use comma thousands-separators inside
+# quoted strings, so the converter parses the TSV with the csv module.
+
+python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSvKwanhoTsvToBed.py \
+    media-13.txt kwanho.bed
+bedSort kwanho.bed kwanho.sorted.bed
+bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSvKwanho.as \
+    -tab kwanho.sorted.bed /hive/data/genomes/hg38/chrom.sizes kwanho.bb
+
+##########
+# 2026-04-17 Claude max
+
+# Ninth subtrack: GWAS SVatalog - 101 long-read WGS samples from SickKids
+# Paper: Chirmade et al. 2026, Heredity, PMID 41203876
+# Data: zenodo.org/records/13367574 (sv_annotations.tsv)
+
+mkdir -p /hive/data/genomes/hg38/bed/lrSv/shirmade101
+cd /hive/data/genomes/hg38/bed/lrSv/shirmade101
+# sv_annotations.tsv holds 87,183 SVs (del, ins, dup, inv, complex) from 101
+# long-read WGS samples, annotated with gene overlaps, ClinGen / gnomAD
+# constraints, OMIM / ClinVar / DGV / Decipher regional overlaps.
+# Coordinates in the source TSV are 1-based closed; the converter shifts to
+# standard 0-based half-open BED.
+
+python3 ~/kent/src/hg/makeDb/scripts/lrSv/lrSvChirmade101TsvToBed.py \
+    sv_annotations.tsv chirmade101.bed
+bedSort chirmade101.bed chirmade101.sorted.bed
+bedToBigBed -type=bed9+ -as=$HOME/kent/src/hg/makeDb/scripts/lrSv/lrSvChirmade101.as \
+    -tab chirmade101.sorted.bed /hive/data/genomes/hg38/chrom.sizes chirmade101.bb
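
The coordinate shift noted for the ninth subtrack (1-based closed source coordinates to 0-based half-open BED) can be sketched as below. This is only an illustration under stated assumptions, not the code in lrSvChirmade101TsvToBed.py: the function name closed_to_bed and the example coordinates are hypothetical. The rule itself is the standard one: subtract 1 from the start and leave the end unchanged.

```python
def closed_to_bed(chrom, start, end):
    """Convert a 1-based closed interval to 0-based half-open BED coordinates.

    Hypothetical helper for illustration; not taken from the lrSv converter.
    """
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based closed interval: {chrom}:{start}-{end}")
    # Only the start shifts: the inclusive end at 1-based position N is the
    # same number as the exclusive end in 0-based coordinates.
    return chrom, start - 1, end


# Made-up example: a deletion covering 1-based positions 1001..1500 (500 bp).
chrom, bed_start, bed_end = closed_to_bed("chr1", 1001, 1500)
print(chrom, bed_start, bed_end, sep="\t")

# The BED length (end - start) equals the closed-interval length (end - start + 1).
assert bed_end - bed_start == 1500 - 1001 + 1
```

Under this rule a record whose source start equals its end becomes a 1-bp BED item. Whether the actual converter treats single-position records this way is not stated in the makeDoc.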