bac95a147f49cd331052e597006e04b3deee40fc max Wed Apr 22 10:43:20 2026 -0700

lrSv/srSv: human-readable SV type filter labels, script cleanups

Add human-readable labels to the supertrack-level svType filter on both
the lrSv and srSv supertracks using the "CODE|CODE (Long name)"
filterValues syntax: DEL -> "DEL (Deletion)", INS -> "INS (Insertion)",
etc. Labels keep the short code up front so users can match what
hgTracks shows next to each feature.

Also sweep in the in-progress converter/as-file cleanups under
scripts/lrSv/ and scripts/srSv/ (introduction of lrSvCommon.py helpers,
consistent insLen / svLen / AC column naming, tightened
field-description text) that had been piling up as an unstaged working
tree.

refs #36258

diff --git src/hg/makeDb/trackDb/human/abelSv.html src/hg/makeDb/trackDb/human/abelSv.html
index 7d5913fffb5..858d071de16 100644
--- src/hg/makeDb/trackDb/human/abelSv.html
+++ src/hg/makeDb/trackDb/human/abelSv.html
@@ -1,145 +1,161 @@

Description

Structural variants (SVs) are large changes in DNA — deletions, duplications, inversions, insertions of mobile elements, and translocations — that are at least 50 base pairs in size. They are a major source of genetic variation between individuals and can affect gene dosage, disrupt coding sequence, or rearrange regulatory elements. Because SVs are harder to detect than small variants, population-scale SV maps lag behind single-nucleotide variant resources.

This track displays site-frequency data for 737,998 SVs that Abel et al. (Nature, 2020) identified in 17,795 deeply sequenced human genomes (mean coverage > 20×) using Illumina short-read sequencing. The samples were sequenced by the four sequencing centers of the NHGRI Centers for Common Disease Genomics (CCDG) program, supplemented with ancestrally diverse samples from the PAGE consortium and the Simons Genome Diversity Project. The composition includes roughly 24% African, 16% Latino, 11% Finnish, 39% non-Finnish European, and 9% other ancestries.

Two non-overlapping public callsets are combined into this track. The upstream release contains 738,624 unique primary SV records across the two callsets; 626 B37 records did not lift over to GRCh38, leaving the 737,998 shown here:

B38 (native GRCh38): 458,106 SVs from 14,623 samples, called directly on the GRCh38 assembly.
B37lift (GRCh37, lifted): 279,892 SVs from 8,417 samples originally called on GRCh37, with coordinates lifted to GRCh38 using the standard UCSC hg19→hg38 liftOver chain.
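The record counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in this description and reads the B37lift figure as the post-liftOver count, which is the reading under which the totals agree; the variable names are illustrative.

```python
# Record counts as stated in the track description (no other data assumed).
UPSTREAM_RECORDS = 738_624   # unique primary SV records in the upstream release
FAILED_LIFTOVER = 626        # B37 records that did not lift over to GRCh38
B38_RECORDS = 458_106        # native GRCh38 callset
B37LIFT_RECORDS = 279_892    # GRCh37 callset, read here as the post-liftOver count

# Total shown in the track: upstream records minus liftOver failures.
track_total = UPSTREAM_RECORDS - FAILED_LIFTOVER

# Implied size of the GRCh37 callset before liftOver.
b37_before_lift = B37LIFT_RECORDS + FAILED_LIFTOVER

print(track_total)                       # 737998
print(B38_RECORDS + B37LIFT_RECORDS)     # 737998
print(B38_RECORDS + b37_before_lift)     # 738624
```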

Important: the B38 and B37 callsets share 5,245 samples. When inspecting a variant present in both callsets, users should not simply sum the allele counts; the AC/AN reported for each callset reflects that callset's sample set. The callset filter can be used to restrict display to one source.
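As a rough illustration of why allele counts should stay per-callset, the sketch below keeps AC and AN paired with their source callset and computes a frequency for each one separately. The class, its field names, and the example counts are hypothetical and are not taken from the track schema; only the sample numbers come from this description.

```python
from dataclasses import dataclass


@dataclass
class CallsetCounts:
    """Allele count (AC) and allele number (AN) as reported by one callset."""
    callset: str
    ac: int
    an: int

    @property
    def af(self) -> float:
        # Frequency within this callset's own samples; 0.0 if AN is zero.
        return self.ac / self.an if self.an else 0.0


# Sample bookkeeping from the description: 5,245 samples are in both callsets,
# so the number of distinct genomes is smaller than the sum of the two sets.
B38_SAMPLES, B37_SAMPLES, SHARED_SAMPLES = 14_623, 8_417, 5_245
distinct_samples = B38_SAMPLES + B37_SAMPLES - SHARED_SAMPLES
print(distinct_samples)  # 17795

# Made-up counts for one variant reported by both callsets.
b38 = CallsetCounts("B38", ac=30, an=29_246)
b37 = CallsetCounts("B37lift", ac=12, an=16_834)

# Report each callset on its own instead of summing AC across callsets.
for rec in (b38, b37):
    print(f"{rec.callset}: AC={rec.ac} AN={rec.an} AF={rec.af:.6f}")
```

Summing the two AC values would count alleles from the shared samples twice, which is why the frequencies are kept separate here.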

Display conventions

Items are colored by SV type:

DEL deletion
DUP duplication
INV inversion
MEI mobile-element insertion or MEI-derived deletion
BND breakend / translocation

Deletions, duplications, inversions, and mobile-element variants are drawn as intervals spanning from the variant start to its end. Breakend (BND) records are drawn as single-base items at the variant breakpoint; the mate chromosome and position are shown on the details page for each BND. Each BND pair from LUMPY is shown only once (the secondary mate record is suppressed).
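The drawing rule above can be sketched as a small coordinate helper. This is not the track's actual converter: the function name is hypothetical, and it assumes 1-based VCF-style POS and END values as input, converted to 0-based half-open BED coordinates.

```python
def sv_to_bed_interval(sv_type: str, pos: int, end: int) -> tuple[int, int]:
    """Return (chromStart, chromEnd) in 0-based, half-open BED coordinates.

    Assumes `pos` and `end` are 1-based VCF-style coordinates; the real
    converter for this track may handle them differently.
    """
    start = pos - 1
    if sv_type == "BND":
        # Breakends are drawn as a single base at the breakpoint.
        return start, start + 1
    # DEL, DUP, INV and MEI span from variant start to variant end;
    # guard against an empty interval if END does not exceed the start.
    return start, max(end, start + 1)


print(sv_to_bed_interval("DEL", 1001, 2000))  # (1000, 2000)
print(sv_to_bed_interval("BND", 1001, 1001))  # (1000, 1001)
```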

Filters

The following filters are available from the track configuration page:

SV type — any combination of DEL, DUP, INV, MEI, BND.
Callset — B38 native, B37lift, or both.
Filter — PASS (high confidence) and/or LOW (low confidence, as flagged by the authors based on Mendelian-error rate).
Allele frequency (AF), Allele count (AC), SV length, and Mean sample quality (MSQ).

Per-population allele counts and numbers are shown on the details page for 8 ancestry groups: AFR (African), AMR (Latino/Admixed-American), NFE (non-Finnish European), FE (Finnish European), EAS (East-Asian), SAS (South-Asian), PI (Pacific Islander), and Other.

Methods

-The authors used their open-source
-svtools pipeline to jointly call SVs across all samples. Per-sample
-calls were produced with LUMPY (v0.2.13), CNVnator (v0.3.3), and svtyper
-(v0.1.4); calls were merged across samples and refined with svtools. Low-
-and high-confidence variants were distinguished using a Mendelian-error
-cutoff on mean sample quality, calibrated against a set of 409 CEPH trios.
-Per-sample validation was performed against a PacBio long-read truth set
-derived from three HGSVC samples.

+Abel et al. 2020 jointly called SVs from Illumina short-read sequencing
+(mean coverage >20x) of 17,795 genomes from the NHGRI Centers for
+Common Disease Genomics program with per-sample calls from LUMPY v0.2.13,
+CNVnator v0.3.3 and svtyper v0.1.4, integrated across the cohort by the
+svtools
+pipeline. Low- and high-confidence variants were separated by a
+Mendelian-error cutoff on mean sample quality, calibrated against 409
+CEPH trios, and per-sample calls were validated against a PacBio
+long-read truth set from three HGSVC samples. Two non-overlapping
+callsets were released: 458,106 SVs from 14,623 samples called natively
+on GRCh38 (B38) and 279,892 SVs from 8,417 samples called on GRCh37
+(B37). The site-frequency callsets span DELs, DUPs, INVs, mobile-element
+variants and breakends/translocations.

-For this UCSC track, VCF INFO fields were parsed and converted to BED9+
-format. Variants originally called on GRCh37 (B37 callset) were lifted
-to GRCh38 using the UCSC hg19ToHg38.over.chain.gz chain. See the
+The B38 and B37 site-frequency VCFs (plus BEDPE companion files) were
+downloaded from the authors' supplementary-data GitHub repository,
+
+github.com/hall-lab/sv_paper_042020. For the hg38 track, INFO fields
+were parsed into BED9+ columns; B37 records were lifted to hg38 with the
+UCSC hg19ToHg38.over.chain.gz chain (626 B37 records failed to
+lift, leaving 737,998 SVs total in the track).

+
+

+The step-by-step build commands (download, liftOver, format conversion,
+bigBed build) are recorded in the UCSC makeDoc for this track:
-track build documentation for full details.

+doc/hg38/abelSv.txt. The conversion scripts and autoSql schemas live in
+
+makeDb/scripts/lrSv.
+

Data Access

The data can be explored interactively in table format with the Table Browser or the Data Integrator and exported from there to spreadsheet or tab-separated tables. From scripts, the data can be accessed through our API with track=abelSv.
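For script access, a request URL might be assembled as in the sketch below. The endpoint path and the parameter names other than track=abelSv (genome, chrom, start, end) are assumptions based on the public UCSC REST API and are not stated in this description; check them against the API documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint of the UCSC REST API; not given in the text above.
API_BASE = "https://api.genome.ucsc.edu/getData/track"


def track_url(chrom, start, end, genome="hg38", track="abelSv"):
    """Build a query URL for one region of the track (no network access here)."""
    params = {
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return f"{API_BASE}?{urlencode(params)}"


url = track_url("chr21", 0, 100000)
print(url)
```

The resulting URL can then be fetched with any HTTP client; the API is expected to return JSON.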

For automated download and analysis, the annotation is stored in a bigBed file that can be downloaded from our download server. The file for this track is called abelSv.bb. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain features within a given range, e.g.
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/abelSv/abelSv.bb -chrom=chr21 -start=0 -end=100000000 stdout
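The bigBedToBed call can also be wrapped for use from a script. The following is a minimal sketch that assumes the bigBedToBed binary is installed and on the PATH; the helper names are hypothetical, and the flags are exactly those of the example command.

```python
import subprocess

BIGBED_URL = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/abelSv/abelSv.bb"


def bigbed_to_bed_cmd(chrom, start, end, url=BIGBED_URL):
    """Return the argument list for the bigBedToBed range query shown above."""
    return [
        "bigBedToBed",
        url,
        f"-chrom={chrom}",
        f"-start={start}",
        f"-end={end}",
        "stdout",
    ]


def fetch_rows(chrom, start, end):
    """Run bigBedToBed and split its tab-separated output into rows of fields."""
    result = subprocess.run(
        bigbed_to_bed_cmd(chrom, start, end),
        check=True,           # raise if the tool exits with an error
        capture_output=True,
        text=True,
    )
    return [line.split("\t") for line in result.stdout.splitlines()]


print(" ".join(bigbed_to_bed_cmd("chr21", 0, 100000000)))
```

Calling fetch_rows("chr21", 0, 100000000) would run the tool and return one list of column values per feature.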

The original site-frequency VCF and BEDPE files are distributed by the authors from their supplementary-data GitHub repository.

Credits

Thanks to Haley J. Abel, David E. Larson, Ira M. Hall and colleagues at the McDonnell Genome Institute (Washington University in St. Louis), the Broad Institute, Baylor College of Medicine, the New York Genome Center, and the University of Washington for producing this resource and making the site-frequency callsets publicly available.

References

Abel HJ, Larson DE, Regier AA, Chiang C, Das I, Kanchi KL, Layer RM, Neale BM, Salerno WJ, Reeves C et al. Mapping and characterization of structural variation in 17,795 human genomes. Nature. 2020 Jul;583(7814):83-89. PMID: 32460305; PMC: PMC7547914