9fbdfa3416ffde377072fafd2de44059155c3b44 max Thu Apr 30 06:57:35 2026 -0700

lrSv: add lrSvAll merged track combining all long-read SV subtracks

Variants are merged on exact (chrom, start, end, svType, svLen, insLen). Per-database AC columns are stored as strings; "unknown" is used where the source dataset has only placeholder AC values (deCODE, SVatalog 101, 1KG ONT 100). Kim PD Brain is split into affected (PD+ILBD) and healthy (HC) AC columns. Gustafson contributes sampleCount instead of AC.

Output: 2,694,871 unique SVs from 3,706,100 input rows across 15 subtracks (27% dedup).

The merged track sits as the first subtrack of the lrSv supertrack with filters on sources, svType, svLen, insLen, maxAF/minAF, AC, and sourceCount.

The trackDb stanza is generated by the build script directly into human/lrSvAll.ra and pulled in via 'include lrSvAll.ra' from lrSv.ra, so labels in databases.tsv stay the single source of truth.

lrSv.html: add a "Disease cases" column to the dataset summary, strip parenthesized internal track names from the section headers, and shorten exact SV counts to ~Nk / ~N.NM in the prose.

refs #36642

diff --git src/hg/makeDb/trackDb/human/lrSv.html src/hg/makeDb/trackDb/human/lrSv.html
index 65ed0a19bcc..ae8d30a51e9 100644
--- src/hg/makeDb/trackDb/human/lrSv.html
+++ src/hg/makeDb/trackDb/human/lrSv.html
@@ -12,306 +12,340 @@
SV length statistics (min / median / max) are computed from the svLen field of each track, in base pairs. Some tracks include sites with svLen=0 (complex events where the reference and alternate alleles differ in sequence but not in length).
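As a rough illustration of how the min / median / max columns could be derived, the sketch below computes the three statistics from a list of svLen values. The function name `sv_len_stats` and the sample values are hypothetical. The build script itself is not part of this diff, so taking absolute values is an assumption, and the real script may treat signs, rounding, or svLen=0 sites differently.

```python
from statistics import median


def sv_len_stats(sv_lens):
    """Return (min, median, max) of SV lengths in base pairs.

    Assumption: lengths are compared as absolute values, so a deletion
    stored with a negative svLen is treated like an insertion of the
    same size. Sites with svLen=0 are kept.
    """
    lengths = [abs(n) for n in sv_lens]
    if not lengths:
        raise ValueError("no svLen values")
    return min(lengths), median(lengths), max(lengths)


# Hypothetical svLen values, including one svLen=0 complex event.
stats = sv_len_stats([0, -120, 54, 300, 75])
print(stats)
```

With an even number of sites, `statistics.median` returns the mean of the two middle values, so a published integer median would imply some rounding step that this sketch does not attempt.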

For short-read structural-variant comparators (CCDG 17,795, 1KG 3202, ToMMo 48K CNV) see the companion Short-read SVs supertrack.

Dataset | N samples | Cohort / disease | {+Disease cases+} | Sequencing | SVs | Min | Median | Max
{+All merged | | All long-read SV datasets merged on identical position+type+length, with per-database AC | mixed | mixed (PacBio HiFi, ONT) | 2,694,871 | 50 | 200 | 190,088,223+}
CoLoRSdb | 1,427 | Consortium of Long-Read Sequencing, joint callset | {+No+} | PacBio HiFi | 426,239 | 20 | 33 | 101,381
Han 945 | 945 | Han Chinese, general population | {+No+} | ONT (PromethION) | 111,288 | 0 | 254 | 99,743
1KG ONT 100 | 100 | [-1000 Genomes, 5 superpopulations / 19 subpopulations-]{+1000 Genomes, 5 superpopulations / 19 subpop., high 37x seq. coverage+} | {+No+} | ONT (R9.4.1) | 113,696 | 0 | 164 | 98,289
1KG ONT Vienna | 1,019 | [-1000 Genomes, globally diverse-]{+1000 Genomes, diverse, normal 17x seq. coverage+} | {+No+} | ONT | 148,375 | 2 | 177 | 49,171
ToMMo Japanese | 333 (111 trios) | Japanese, general population | {+No+} | ONT | 74,201 | 51 | 162 | 99,980
AoU 1K | 1,027 | [-All of Us, self-identified Black/African American-]{+All of Us, self-identified Black/African American, 8x cov.; biobank includes a variety of conditions (diabetes, hearing loss, etc.)+} | {+Yes (mixed)+} | PacBio HiFi | 541,049 | 50 | 152 | 9,998
GA4K | 502 | Children's Mercy, pediatric rare disease probands + families | {+Yes (probands)+} | PacBio HiFi | 115,554 | 50 | 186 | 809,711
deCODE 3,622 | 3,622 | Icelandic general population | {+No+} | ONT | 133,886 | 0 | 127 | 861,080
HPRC v2 | 233 | HPRC release-2 pangenome (CHM13 + diverse 1KG assemblies) | {+No+} | PacBio HiFi (pangenome graph) | 1,483,114 | 50 | 280 | 97,718
HGSVC2 | 32 | HGSVC2 haplotype-resolved assemblies (5 superpopulations) | {+No+} | PacBio CLR + HiFi + Strand-seq | 111,746 | 50 | 168 | 57,207,414
HGSVC3 | 65 | HGSVC3 diverse reference assemblies | {+No+} | PacBio HiFi + ONT | 176,531 | 50 | 154 | 30,176,500
[-Arab APR-]{+Arab UPR+} | 53 | [-UAE-resident Arabs from 8 countries (Arab Pangenome Reference)-]{+UAE-resident Arabs from 8 countries (UAE Pangenome Reference)+} | {+No+} | PacBio HiFi + ONT + Hi-C (pangenome graph) | 72,656 | 1 | 21 | 99,885
CPC | 58 | Chinese Pangenome Consortium, 36 minority ethnic groups (HPRC-specific SVs removed) | {+No+} | PacBio HiFi (pangenome graph) | 36,030 | 1 | 53 | 8,998,096
Kim PD Brain | 100 | Parkinson's disease, ILBD, controls (post-mortem brain) | {+Yes (PD + ILBD)+} | PacBio HiFi | 74,552 | 50 | 160 | 190,088,222
SVatalog 101 | 101 | [-Long-read WGS cohort for GWAS LD fine-mapping (SickKids)-]{+Cystic fibrosis (CF) patients from the CF Canada-Sick Kids Program in Individual CF Therapy (CFIT). Long-read WGS used for GWAS LD fine-mapping+} | {+Yes (all CF)+} | long-read | 87,183 | 4 | 160 | 1,321,484

Note: there is likely some overlap in sample composition across these collections. For example, 1000 Genomes samples are also included in HPRC and CoLoRSdb.
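The exact-key merge behind the "All merged" row can be sketched as follows. This is a minimal illustration under stated assumptions, not the build script from this commit: the function name `merge_svs`, the dict-based row layout, and the example coordinates and AC values are hypothetical. Only the key fields (chrom, start, end, svType, svLen, insLen) and the use of string AC values with "unknown" as a placeholder come from the commit message.

```python
from collections import defaultdict


def merge_svs(rows):
    """Group SV rows that share an identical merge key.

    Each row is a dict with chrom, start, end, svType, svLen, insLen,
    source (subtrack name) and ac (allele count as a string, or
    "unknown"). Returns {key: {source: ac}}, one entry per unique SV.
    Assumption: if one source repeats a key, the later row wins.
    """
    merged = defaultdict(dict)
    for row in rows:
        key = (row["chrom"], row["start"], row["end"],
               row["svType"], row["svLen"], row["insLen"])
        merged[key][row["source"]] = row["ac"]
    return dict(merged)


# Hypothetical input: the first two rows share a key, the third differs by 1 bp.
rows = [
    {"chrom": "chr1", "start": 1000, "end": 1200, "svType": "DEL",
     "svLen": 200, "insLen": 0, "source": "colorsDbSv", "ac": "12"},
    {"chrom": "chr1", "start": 1000, "end": 1200, "svType": "DEL",
     "svLen": 200, "insLen": 0, "source": "decodeSv", "ac": "unknown"},
    {"chrom": "chr1", "start": 1000, "end": 1201, "svType": "DEL",
     "svLen": 201, "insLen": 0, "source": "hprc2Sv", "ac": "3"},
]
merged = merge_svs(rows)
dedup = 1 - len(merged) / len(rows)  # fraction of input rows collapsed
print(len(merged), f"{dedup:.0%}")
```

Because the key must match exactly, calls that differ by a single base pair stay separate, as the third row shows. The same dedup formula applied to the totals in the commit message gives 1 - 2,694,871 / 3,706,100 ≈ 0.27, consistent with the reported 27%.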

-CoLoRSdb SVs (colorsDbSv)
+CoLoRSdb SVs

Structural variants from the Consortium of Long-Read Sequencing database (CoLoRSdb), from 1,427 PacBio HiFi long-read whole-genome sequences.
-426,239 SVs (insertions, deletions, inversions) called with pbsv and
+~426k SVs (insertions, deletions, inversions) called with pbsv and
merged with Jasmine, with allele frequencies, genotype counts and Hardy-Weinberg statistics across the cohort.

-Han 945 SVs (han945Sv)
+Han 945 SVs

-Structural variants from 945 Han Chinese individuals. 111,288 SVs
+Structural variants from 945 Han Chinese individuals. ~111k SVs
(deletions, insertions, duplications, inversions, translocations) merged with SURVIVOR. Includes allele frequencies and per-sample support.

-1KG ONT 100 SVs (gustafsonSv)
+1KG ONT 100 SVs

Structural variants from Oxford Nanopore long-read sequencing of 100 1000 Genomes samples (5 superpopulations, 19 subpopulations) released by the 1000 Genomes ONT Sequencing Consortium and described in
-Gustafson et al. 2024. 113,696 SVs (insertions, deletions, duplications,
+Gustafson et al. 2024. ~114k SVs (insertions, deletions, duplications,
inversions) called with five callers and merged with Jasmine. This is a separate dataset from the Vienna 1KG-ONT release below; the 100 samples here do not overlap with the 1,019 samples in the Vienna release.

-1KG ONT Vienna SVs (lrSv1kgOnt)
+1KG ONT Vienna SVs

Structural variants from 1,019 individuals across 26 populations (1000 Genomes ONT).
-161,332 SVs annotated with SVAN, classifying insertions and deletions by mechanism
+~161k SVs annotated with SVAN, classifying insertions and deletions by mechanism
of origin (mobile elements, VNTRs, processed pseudogenes, etc.). Original coordinates are on T2T-CHM13 (hs1); the hg38 version was created via liftOver. This is a separate dataset from the 1KG ONT 100 (Gustafson et al.) track above; the 1,019 samples here do not overlap with the 100 samples in that release.

-ToMMo Japanese SVs (tommoJpSv)
+ToMMo Japanese SVs

Structural variants from 333 Japanese individuals (111 trios) from the Tohoku Medical
-Megabank (ToMMo). 74,201 SVs (deletions and insertions) with trio-based Mendelian
+Megabank (ToMMo). ~74k SVs (deletions and insertions) with trio-based Mendelian
error rates and allele frequencies.

-AoU 1K SVs (aou1kSv)
+AoU 1K SVs

Structural variants from 1,027 individuals from the All of Us (AoU) Research Program,
-sequenced with PacBio HiFi long reads. 541,049 SVs (insertions and deletions)
-with population-specific allele frequencies, gene annotations, and clinical
-trait associations.
+sequenced with PacBio HiFi long reads. AoU is a deeply phenotyped biobank
+that includes participants with a range of conditions (e.g. diabetes,
+hearing loss, hypertension), so the cohort is not disease-free.
+~541k SVs (insertions and deletions) with population-specific allele
+frequencies, gene annotations, and clinical trait associations.

-GA4K SVs (ga4kSv)
+GA4K SVs

Structural variants from 502 probands and family members enrolled in the Genomic Answers for Kids (GA4K) pediatric rare-disease program at Children's
-Mercy Research Institute, sequenced with PacBio HiFi long reads. 115,554
+Mercy Research Institute, sequenced with PacBio HiFi long reads. ~116k
replicated SVs (deletions, insertions, duplications, inversions) called with pbsv and merged with JASMINE. The matched GA4K small-variant callset (SNVs and short indels) lives alongside other population allele-frequency resources as GA4K 552 PacBio LR in the Variant Frequencies track collection.

-deCODE 3,622 SVs (decodeSv)
+deCODE 3,622 SVs

High-confidence structural variants from 3,622 Icelanders (deCODE genetics),
-sequenced with Oxford Nanopore long reads. 133,886 SVs (deletions, insertions
+sequenced with Oxford Nanopore long reads. ~134k SVs (deletions, insertions
and combined insertion/deletion events). Site-only callset with annotated surrounding tandem-repeat regions.

-HPRC v2 SVs (hprc2Sv)
+HPRC v2 SVs

Structural variants derived from the Human Pangenome Reference Consortium release-2 minigraph-cactus pangenome graph, built from 233 PacBio HiFi haplotype-resolved assemblies (CHM13 + diverse 1000 Genomes samples).
-1,483,114 SV-sized alleles (INS, DEL, COMPLEX, INV) extracted with
+~1.5M SV-sized alleles (INS, DEL, COMPLEX, INV) extracted with
vg deconstruct and decomposed with vcfwave (WFA2).

-HGSVC2 32 SVs (hgsvc2Sv)
+HGSVC2 32 SVs

Structural variants from 32 haplotype-resolved diploid genomes (HGSVC2
-freeze 4, Ebert et al. 2021). 111,746 SVs (deletions, insertions and
+freeze 4, Ebert et al. 2021). ~112k SVs (deletions, insertions and
inversions) called from phased de novo assemblies with PAV, with per-variant 1000 Genomes population allele frequencies (insertions and deletions) and rich structural/gene annotations. An earlier HGSVC release complementary to HGSVC3.

-HGSVC3 65 SVs (hgsvc3Sv)
+HGSVC3 65 SVs

Structural variants from 65 diverse individuals sequenced and de novo assembled by the Human Genome Structural Variation Consortium phase 3
-(HGSVC3). 176,532 haplotype-resolved SVs (deletions, insertions and
+(HGSVC3). ~177k haplotype-resolved SVs (deletions, insertions and
inversions) called with PAV and cross-validated with ten additional callers, with per-site carrier haplotype lists and structural annotations.

-Kim PD Brain SVs (kwanhoSv)
+Kim PD Brain SVs

Structural variants from 100 post-mortem brain samples (Parkinson's disease, incidental Lewy body disease, and healthy controls) sequenced with PacBio
-HiFi long reads. 74,552 high-confidence SVs (deletions, insertions,
+HiFi long reads. ~75k high-confidence SVs (deletions, insertions,
duplications, inversions) with per-cohort allele frequencies and case-control carrier-rate differentials, from Kim et al. 2026.

-SVatalog 101 SVs (chirmade101Sv)
+SVatalog 101 SVs

Structural variants from 101 long-read whole-genome sequences released
-alongside the GWAS SVatalog tool (Chirmade et al. 2026). 87,183 SVs
+alongside the GWAS SVatalog tool (Chirmade et al. 2026). The samples come
+from the CF Canada-Sick Kids Program in Individual CF Therapy (CFIT), a
+cystic-fibrosis (CF) patient cohort assembled to model patient-specific
+responses to CFTR modulator therapies (most participants are F508del
+homozygotes or F508del / minimal-function compound heterozygotes; a smaller
+number carry rare nonsense or missense CFTR mutations). ~87k SVs
(deletions, insertions, duplications, inversions and complex events) annotated with gene overlaps, ClinGen / gnomAD constraint scores, OMIM / ClinVar / DGV / Decipher regional annotations.
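The rounded counts in the section prose (~426k, ~1.5M, and so on) follow the ~Nk / ~N.NM pattern named in the commit message. A minimal sketch of one way to produce them is shown below. The helper name `short_count` is hypothetical, and the commit does not say whether the counts were shortened by a script or by hand.

```python
def short_count(n):
    """Abbreviate an exact count as ~Nk (thousands) or ~N.NM (millions).

    Assumption: thousands are rounded to the nearest integer and
    millions to one decimal place. Values just under one million
    (for example 999,600) would print as "~1000k"; that edge case is
    not handled here because no count in this page is near it.
    """
    if n >= 1_000_000:
        return f"~{n / 1_000_000:.1f}M"
    if n >= 1_000:
        return f"~{round(n / 1_000)}k"
    return str(n)


# Exact counts taken from the removed lines of this diff.
for exact in (426_239, 1_483_114, 74_552, 115_554):
    print(exact, short_count(exact))
```

Python's `round` uses round-half-to-even, so a count ending in exactly 500 could round down. None of the counts in this page falls on that boundary.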

Data Access

Each subtrack has its own documentation page with details on how to download and intersect the underlying annotations.

References