9fbdfa3416ffde377072fafd2de44059155c3b44 max Thu Apr 30 06:57:35 2026 -0700

lrSv: add lrSvAll merged track combining all long-read SV subtracks

Variants are merged on exact (chrom, start, end, svType, svLen, insLen). Per-database AC columns are stored as strings; "unknown" is used where the source dataset has only placeholder AC values (deCODE, SVatalog 101, 1KG ONT 100). Kim PD Brain is split into affected (PD+ILBD) and healthy (HC) AC columns. Gustafson contributes sampleCount instead of AC.

Output: 2,694,871 unique SVs from 3,706,100 input rows across 15 subtracks (27% dedup).

The merged track sits as the first subtrack of the lrSv supertrack with filters on sources, svType, svLen, insLen, maxAF/minAF, AC, and sourceCount. The trackDb stanza is generated by the build script directly into human/lrSvAll.ra and pulled in via 'include lrSvAll.ra' from lrSv.ra, so labels in databases.tsv stay the single source of truth.

lrSv.html: add a "Disease cases" column to the dataset summary, strip parenthesized internal track names from the section headers, and shorten exact SV counts to ~Nk / ~N.NM in the prose.

refs #36642

diff --git src/hg/makeDb/trackDb/human/lrSv.html src/hg/makeDb/trackDb/human/lrSv.html
index 65ed0a19bcc..ae8d30a51e9 100644
--- src/hg/makeDb/trackDb/human/lrSv.html
+++ src/hg/makeDb/trackDb/human/lrSv.html
@@ -12,306 +12,340 @@
SV length statistics (min / median / max) are computed from the svLen field of each track, in base pairs. Some tracks include sites with svLen=0 (complex events where the reference and alternate alleles differ in sequence but not in length).
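The min / median / max figures described above could be computed along these lines. This is a minimal sketch, not the build script itself: `sv_len_stats` is a hypothetical helper, and it assumes the svLen values have already been read from a track as non-negative integers in base pairs.

```python
from statistics import median


def sv_len_stats(sv_lens):
    """Return (min, median, max) of a collection of SV lengths in bp.

    Assumption: values are non-negative integers; svLen=0 sites are kept,
    matching the note that some tracks include them.
    """
    lens = [int(x) for x in sv_lens]
    if not lens:
        raise ValueError("no svLen values given")
    return min(lens), median(lens), max(lens)


# Made-up lengths, for illustration only.
print(sv_len_stats([0, 50, 254, 1000, 99743]))
```

With an even number of sites, `statistics.median` returns the mean of the two middle values, so the median may not be an integer.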

For short-read structural-variant comparators (CCDG 17,795, 1KG 3202, ToMMo 48K CNV) see the companion Short-read SVs supertrack.


Dataset	N samples	Cohort / disease	Disease cases	Sequencing	SVs	Min	Median	Max
All merged	—	All long-read SV datasets merged on identical position+type+length, with per-database AC	mixed	mixed (PacBio HiFi, ONT)	2,694,871	50	200	190,088,223
CoLoRSdb	1,427	Consortium of Long-Read Sequencing, joint callset	No	PacBio HiFi	426,239	20	33	101,381
Han 945	945	Han Chinese, general population	No	ONT (PromethION)	111,288	0	254	99,743
1KG ONT 100	100	1000 Genomes, 5 superpopulations / 19 subpop., high 37x seq. coverage	No	ONT (R9.4.1)	113,696	0	164	98,289
1KG ONT Vienna	1,019	1000 Genomes, diverse, normal 17x seq. coverage	No	ONT	148,375	2	177	49,171
ToMMo Japanese	333 (111 trios)	Japanese, general population	No	ONT	74,201	51	162	99,980
AoU 1K	1,027	All of Us, self-identified Black/African American, 8x cov.; biobank includes a variety of conditions (diabetes, hearing loss, etc.)	Yes (mixed)	PacBio HiFi	541,049	50	152	9,998
GA4K	502	Children's Mercy, pediatric rare disease probands + families	Yes (probands)	PacBio HiFi	115,554	50	186	809,711
deCODE 3,622	3,622	Icelandic general population	No	ONT	133,886	0	127	861,080
HPRC v2	233	HPRC release-2 pangenome (CHM13 + diverse 1KG assemblies)	No	PacBio HiFi (pangenome graph)	1,483,114	50	280	97,718
HGSVC2	32	HGSVC2 haplotype-resolved assemblies (5 superpopulations)	No	PacBio CLR + HiFi + Strand-seq	111,746	50	168	57,207,414
HGSVC3	65	HGSVC3 diverse reference assemblies	No	PacBio HiFi + ONT	176,531	50	154	30,176,500
Arab UPR	53	UAE-resident Arabs from 8 countries (UAE Pangenome Reference)	No	PacBio HiFi + ONT + Hi-C (pangenome graph)	72,656	1	21	99,885
CPC	58	Chinese Pangenome Consortium, 36 minority ethnic groups (HPRC-specific SVs removed)	No	PacBio HiFi (pangenome graph)	36,030	1	53	8,998,096
Kim PD Brain	100	Parkinson's disease, ILBD, controls (post-mortem brain)	Yes (PD + ILBD)	PacBio HiFi	74,552	50	160	190,088,222
SVatalog 101	101	Cystic fibrosis (CF) patients from the CF Canada-Sick Kids Program in Individual CF Therapy (CFIT). Long-read WGS used for GWAS LD fine-mapping	Yes (all CF)	long-read	87,183	4	160	1,321,484

Note: there is likely some overlap in sample composition across these collections. For example, 1000 Genomes samples are also included in HPRC and CoLoRSdb.
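The "All merged" row in the table above comes from merging variants on identical position, type and length, as the commit message describes. A minimal sketch of that kind of exact-key merge follows. It is not the actual build script: the row dictionaries, the `AC` field name, the helpers `merge_exact` and `dedup_fraction`, and the example source names are all assumptions made for illustration.

```python
# Key fields taken from the commit message; everything else is hypothetical.
KEY_FIELDS = ("chrom", "start", "end", "svType", "svLen", "insLen")


def merge_exact(rows_by_source, placeholder_sources=()):
    """Merge SV rows from several sources on an exact key.

    rows_by_source: dict mapping a source name to a list of row dicts.
    placeholder_sources: sources whose AC values are only placeholders;
    their AC is recorded as the string "unknown".
    Returns a dict: key tuple -> {"sources": [...], "ac": {source: str}}.
    """
    merged = {}
    for source, rows in rows_by_source.items():
        for row in rows:
            key = tuple(row[f] for f in KEY_FIELDS)
            rec = merged.setdefault(key, {"sources": [], "ac": {}})
            if source not in rec["sources"]:
                rec["sources"].append(source)
            # AC is stored as a string so "unknown" can share the column.
            if source in placeholder_sources:
                rec["ac"][source] = "unknown"
            else:
                rec["ac"][source] = str(row["AC"])
    return merged


def dedup_fraction(n_input, n_unique):
    """Fraction of input rows removed by merging."""
    return 1 - n_unique / n_input


# Toy input: one variant shared by two sources, one private variant.
shared = {"chrom": "chr1", "start": 100, "end": 350,
          "svType": "DEL", "svLen": 250, "insLen": 0}
private = {"chrom": "chr2", "start": 500, "end": 500,
           "svType": "INS", "svLen": 80, "insLen": 80}
toy = {
    "sourceA": [dict(shared, AC=12), dict(private, AC=3)],
    "sourceB": [dict(shared, AC=1)],
}
result = merge_exact(toy, placeholder_sources={"sourceB"})
print(len(result))  # number of unique SVs in the toy input
print(round(100 * dedup_fraction(3_706_100, 2_694_871)))  # percent dedup from the commit's totals
```

Because the key is exact, two calls of the same event that differ by a single base in start, end or length stay separate records; any fuzzy matching would need a different approach.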

-CoLoRSdb SVs (colorsDbSv)
+CoLoRSdb SVs

Structural variants from the Consortium of Long-Read Sequencing database (CoLoRSdb), from 1,427 PacBio HiFi long-read whole-genome sequences.
-426,239 SVs (insertions, deletions, inversions) called with pbsv and
+~426k SVs (insertions, deletions, inversions) called with pbsv and
merged with Jasmine, with allele frequencies, genotype counts and Hardy-Weinberg statistics across the cohort.
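The count shortening applied throughout this change (exact counts to ~Nk / ~N.NM) can be sketched as below. `short_count` is a hypothetical helper, and the rounding rule (nearest thousand, or one decimal place for millions) is an assumption inferred from the values in this diff rather than taken from the build script.

```python
def short_count(n):
    """Format an exact SV count as ~Nk or ~N.NM.

    Assumption: counts from 999,500 upward are shown in millions so that
    rounding never produces "~1000k".
    """
    if n >= 999_500:
        return f"~{n / 1_000_000:.1f}M"
    # Integer arithmetic rounds to the nearest thousand, halves up.
    return f"~{(n + 500) // 1000}k"


for exact in (426_239, 113_696, 74_552, 1_483_114):
    print(exact, short_count(exact))
```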

-Han 945 SVs (han945Sv)
+Han 945 SVs

-Structural variants from 945 Han Chinese individuals. 111,288 SVs
+Structural variants from 945 Han Chinese individuals. ~111k SVs
(deletions, insertions, duplications, inversions, translocations) merged with SURVIVOR. Includes allele frequencies and per-sample support.

-1KG ONT 100 SVs (gustafsonSv)
+1KG ONT 100 SVs

Structural variants from Oxford Nanopore long-read sequencing of 100 1000 Genomes samples (5 superpopulations, 19 subpopulations) released by the 1000 Genomes ONT Sequencing Consortium and described in
-Gustafson et al. 2024. 113,696 SVs (insertions, deletions, duplications,
+Gustafson et al. 2024. ~114k SVs (insertions, deletions, duplications,
inversions) called with five callers and merged with Jasmine. This is a separate dataset from the Vienna 1KG-ONT release below; the 100 samples here do not overlap with the 1,019 samples in the Vienna release.

-1KG ONT Vienna SVs (lrSv1kgOnt)
+1KG ONT Vienna SVs

Structural variants from 1,019 individuals across 26 populations (1000 Genomes ONT).
-161,332 SVs annotated with SVAN, classifying insertions and deletions by mechanism
+~161k SVs annotated with SVAN, classifying insertions and deletions by mechanism
of origin (mobile elements, VNTRs, processed pseudogenes, etc.). Original coordinates are on T2T-CHM13 (hs1); the hg38 version was created via liftOver. This is a separate dataset from the 1KG ONT 100 (Gustafson et al.) track above; the 1,019 samples here do not overlap with the 100 samples in that release.

-ToMMo Japanese SVs (tommoJpSv)
+ToMMo Japanese SVs

Structural variants from 333 Japanese individuals (111 trios) from the Tohoku Medical
-Megabank (ToMMo). 74,201 SVs (deletions and insertions) with trio-based Mendelian
+Megabank (ToMMo). ~74k SVs (deletions and insertions) with trio-based Mendelian
error rates and allele frequencies.

-AoU 1K SVs (aou1kSv)
+AoU 1K SVs

Structural variants from 1,027 individuals from the All of Us (AoU) Research Program,
-sequenced with PacBio HiFi long reads. 541,049 SVs (insertions and deletions)
-with population-specific allele frequencies, gene annotations, and clinical
-trait associations.
+sequenced with PacBio HiFi long reads. AoU is a deeply phenotyped biobank
+that includes participants with a range of conditions (e.g. diabetes,
+hearing loss, hypertension), so the cohort is not disease-free.
+~541k SVs (insertions and deletions) with population-specific allele
+frequencies, gene annotations, and clinical trait associations.

-GA4K SVs (ga4kSv)
+GA4K SVs

Structural variants from 502 probands and family members enrolled in the Genomic Answers for Kids (GA4K) pediatric rare-disease program at Children's
-Mercy Research Institute, sequenced with PacBio HiFi long reads. 115,554
+Mercy Research Institute, sequenced with PacBio HiFi long reads. ~116k
replicated SVs (deletions, insertions, duplications, inversions) called with pbsv and merged with JASMINE. The matched GA4K small-variant callset (SNVs and short indels) lives alongside other population allele-frequency resources as GA4K 552 PacBio LR in the Variant Frequencies track collection.

-deCODE 3,622 SVs (decodeSv)
+deCODE 3,622 SVs

High-confidence structural variants from 3,622 Icelanders (deCODE genetics),
-sequenced with Oxford Nanopore long reads. 133,886 SVs (deletions, insertions
+sequenced with Oxford Nanopore long reads. ~134k SVs (deletions, insertions
and combined insertion/deletion events). Site-only callset with annotated surrounding tandem-repeat regions.

-HPRC v2 SVs (hprc2Sv)
+HPRC v2 SVs

Structural variants derived from the Human Pangenome Reference Consortium release-2 minigraph-cactus pangenome graph, built from 233 PacBio HiFi haplotype-resolved assemblies (CHM13 + diverse 1000 Genomes samples).
-1,483,114 SV-sized alleles (INS, DEL, COMPLEX, INV) extracted with
+~1.5M SV-sized alleles (INS, DEL, COMPLEX, INV) extracted with
vg deconstruct and decomposed with vcfwave (WFA2).

-HGSVC2 32 SVs (hgsvc2Sv)
+HGSVC2 32 SVs

Structural variants from 32 haplotype-resolved diploid genomes (HGSVC2
-freeze 4, Ebert et al. 2021). 111,746 SVs (deletions, insertions and
+freeze 4, Ebert et al. 2021). ~112k SVs (deletions, insertions and
inversions) called from phased de novo assemblies with PAV, with per-variant 1000 Genomes population allele frequencies (insertions and deletions) and rich structural/gene annotations. An earlier HGSVC release complementary to HGSVC3.

-HGSVC3 65 SVs (hgsvc3Sv)
+HGSVC3 65 SVs

Structural variants from 65 diverse individuals sequenced and de novo assembled by the Human Genome Structural Variation Consortium phase 3
-(HGSVC3). 176,532 haplotype-resolved SVs (deletions, insertions and
+(HGSVC3). ~177k haplotype-resolved SVs (deletions, insertions and
inversions) called with PAV and cross-validated with ten additional callers, with per-site carrier haplotype lists and structural annotations.

-Kim PD Brain SVs (kwanhoSv)
+Kim PD Brain SVs

Structural variants from 100 post-mortem brain samples (Parkinson's disease, incidental Lewy body disease, and healthy controls) sequenced with PacBio
-HiFi long reads. 74,552 high-confidence SVs (deletions, insertions,
+HiFi long reads. ~75k high-confidence SVs (deletions, insertions,
duplications, inversions) with per-cohort allele frequencies and case-control carrier-rate differentials, from Kim et al. 2026.

-SVatalog 101 SVs (chirmade101Sv)
+SVatalog 101 SVs

Structural variants from 101 long-read whole-genome sequences released
-alongside the GWAS SVatalog tool (Chirmade et al. 2026). 87,183 SVs
+alongside the GWAS SVatalog tool (Chirmade et al. 2026). The samples come
+from the CF Canada-Sick Kids Program in Individual CF Therapy (CFIT), a
+cystic-fibrosis (CF) patient cohort assembled to model patient-specific
+responses to CFTR modulator therapies (most participants are F508del
+homozygotes or F508del / minimal-function compound heterozygotes; a smaller
+number carry rare nonsense or missense CFTR mutations). ~87k SVs
(deletions, insertions, duplications, inversions and complex events) annotated with gene overlaps, ClinGen / gnomAD constraint scores, OMIM / ClinVar / DGV / Decipher regional annotations.

Data Access

Each subtrack has its own documentation page with details on how to download and intersect the underlying annotations.