bac95a147f49cd331052e597006e04b3deee40fc max Wed Apr 22 10:43:20 2026 -0700

lrSv/srSv: human-readable SV type filter labels, script cleanups

Add human-readable labels to the supertrack-level svType filter on both
the lrSv and srSv supertracks using the "CODE|CODE (Long name)"
filterValues syntax: DEL -> "DEL (Deletion)", INS -> "INS (Insertion)",
etc. Labels keep the short code up front so users can match what hgTracks
shows next to each feature.

Also sweep in the in-progress converter/as-file cleanups under
scripts/lrSv/ and scripts/srSv/ (introduction of lrSvCommon.py helpers,
consistent insLen / svLen / AC column naming, tightened field-description
text) that had been piling up as an unstaged working tree.

refs #36258

diff --git src/hg/makeDb/trackDb/human/aprSv.html src/hg/makeDb/trackDb/human/aprSv.html
index 255e1ae719d..ac90c471788 100644
--- src/hg/makeDb/trackDb/human/aprSv.html
+++ src/hg/makeDb/trackDb/human/aprSv.html
@@ -1,103 +1,128 @@

Description

This track displays structural variants (SVs) — deletions, insertions, and complex substitutions of at least 50 bp — from the Arabic Pangenome Reference (APR), a pangenome graph built from 53 UAE-resident Arab individuals drawn from eight countries (UAE, Saudi Arabia, Oman, Jordan, Egypt, Morocco, Syria, Yemen). Each bubble in the graph that contains an SV-sized alternative allele is shown as a single variant site, with allele counts aggregated across the 53 samples (the GRCh38 reference haplotype, present as an extra sample column in the source VCF, is excluded from the aggregation).

The APR pangenome was built on the T2T-CHM13v2 reference. Variants are shown natively on the hs1 browser and lifted to hg38 using the UCSC hs1ToHg38.over.chain.gz chain; variants that do not lift cleanly (often in T2T-added euchromatic sequence) are omitted from the hg38 version of the track.
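The lift from hs1 to hg38 described above can be sketched as a call to the UCSC `liftOver` command-line tool, whose basic form is `liftOver oldFile map.chain newFile unMapped`. This is only an illustration, not the command recorded in the makeDoc: the BED file names are hypothetical, and the use and value of `-bedPlus` is an assumption about how the extra columns would be carried over.

```python
import subprocess
from typing import List, Optional


def lift_over_cmd(in_bed: str, chain: str, out_bed: str, unmapped: str,
                  bed_plus: Optional[int] = None) -> List[str]:
    """Build the argument list for a UCSC liftOver run.

    bed_plus, if given, adds -bedPlus=N so that only the first N columns
    are treated as standard BED fields (an assumption for this sketch).
    """
    cmd = ["liftOver"]
    if bed_plus is not None:
        cmd.append(f"-bedPlus={bed_plus}")
    cmd += [in_bed, chain, out_bed, unmapped]
    return cmd


def run_lift_over(cmd: List[str]) -> None:
    """Run liftOver; raises CalledProcessError if the tool exits non-zero."""
    subprocess.run(cmd, check=True)


# Hypothetical file names; only the chain file name comes from the text.
example_cmd = lift_over_cmd(
    "apr.hs1.bed", "hs1ToHg38.over.chain.gz", "apr.hg38.bed", "apr.unmapped.bed"
)
```

Records that cannot be mapped are written to the fourth file, which corresponds to the variants omitted from the hg38 version of the track.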

Display conventions

Items are colored by SV type:

INS insertion (net ALT longer by ≥50 bp)
DEL deletion (net REF longer by ≥50 bp)
CPX complex substitution (similar-length REF and ALT but at least one ≥50 bp)
MIXED snarl whose alt alleles belong to different classes
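The four classes above can be expressed as a small classifier. The following is a minimal sketch, not the code in scripts/lrSv: the function names are hypothetical, and it assumes that lengths are taken directly from the VCF REF and ALT strings and that "similar-length" means a net difference below 50 bp.

```python
from typing import List, Optional

MIN_SV_LEN = 50  # size threshold stated in the track description


def classify_alt(ref: str, alt: str) -> Optional[str]:
    """Return INS, DEL, or CPX for one ALT allele, or None if it is too small."""
    ref_len, alt_len = len(ref), len(alt)
    net = alt_len - ref_len
    if net >= MIN_SV_LEN:
        return "INS"   # ALT is longer by at least 50 bp
    if -net >= MIN_SV_LEN:
        return "DEL"   # REF is longer by at least 50 bp
    if max(ref_len, alt_len) >= MIN_SV_LEN:
        return "CPX"   # similar lengths, but at least one side is SV-sized
    return None


def classify_site(ref: str, alts: List[str]) -> Optional[str]:
    """Return the shared class of the SV-sized alts, MIXED, or None."""
    classes = {c for c in (classify_alt(ref, a) for a in alts) if c is not None}
    if not classes:
        return None
    return classes.pop() if len(classes) == 1 else "MIXED"
```

For example, a site with one 99 bp net deletion and one 100 bp net insertion would be labeled MIXED under these assumptions.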

Each item spans from the start of REF to its end on the reference. The name field is the graph snarl ID (e.g. <951452<1012008), which identifies the variant site in the APR pangenome graph.

Per-site alt-allele aggregation

The source VCF is multi-allelic: a single graph snarl appears as one row with a comma-separated ALT list. For this track, each ALT is classified individually using the 50 bp threshold, and the row is emitted as a single bed item with:

svType: the common class, or MIXED if alts disagree;
-svLen: the maximum |len(ALT)-len(REF)| across alts that passed;
-alleleCount: sum of per-alt allele counts (AC) that passed;
+svLen: reference span (chromEnd - chromStart);
+insLen: maximum inserted-sequence length across passing INS alts (0 otherwise);
+AC: sum of per-alt allele counts (AC) that passed;
numAlts: number of alt alleles that passed the 50 bp filter.

Rows whose alts are all smaller than 50 bp are not shown.
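The per-site aggregation can be sketched as follows, using the svLen (reference span), insLen, AC, and numAlts definitions. This is an illustration under stated assumptions rather than the converter itself: the function name and the input layout (one dict per ALT with an already assigned class) are hypothetical, and insLen is assumed to be len(ALT) - len(REF) for INS alts.

```python
from typing import Dict, List, Optional


def aggregate_site(chrom_start: int, chrom_end: int,
                   alts: List[Dict]) -> Optional[Dict]:
    """Collapse the alts of one snarl into a single site record.

    Each alt is a dict with keys:
      cls     - "INS", "DEL", "CPX", or None if it failed the 50 bp filter
      ref_len - length of REF
      alt_len - length of this ALT
      ac      - allele count of this ALT
    Returns None when no alt passed, so the row is not shown.
    """
    passing = [a for a in alts if a["cls"] is not None]
    if not passing:
        return None
    classes = {a["cls"] for a in passing}
    sv_type = next(iter(classes)) if len(classes) == 1 else "MIXED"
    # Assumed definition of inserted-sequence length for INS alts.
    ins_lens = [a["alt_len"] - a["ref_len"] for a in passing if a["cls"] == "INS"]
    return {
        "svType": sv_type,
        "svLen": chrom_end - chrom_start,    # reference span
        "insLen": max(ins_lens, default=0),  # 0 when no INS alt passed
        "AC": sum(a["ac"] for a in passing),
        "numAlts": len(passing),
    }


example = aggregate_site(1000, 1100, [
    {"cls": "DEL", "ref_len": 101, "alt_len": 1, "ac": 3},
    {"cls": "INS", "ref_len": 101, "alt_len": 301, "ac": 2},
    {"cls": None, "ref_len": 101, "alt_len": 99, "ac": 7},
])
```

In this example the third alt is below the threshold, so its allele count does not contribute to AC and it is not counted in numAlts.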

Methods

-The APR pangenome was assembled from 53 individuals sequenced with an
-average 35× PacBio HiFi, 54× ultralong ONT and 65× Hi-C
-coverage, producing haplotype-phased de novo assemblies with N50 > 124 Mb.
-The pangenome graph was built with Minigraph-Cactus v2.7.2 seeded on
-CHM13v2 (backbone) and GRCh38; variants were extracted and deconstructed
-from the graph. For this UCSC track, the decomposed VCF was parsed,
-filtered to alt alleles with ≥50 bp REF/ALT length difference, and
-merged per snarl site. See the build documentation in the kent source
-tree at src/hg/makeDb/doc/hg38/lrSv.txt for details.

+Nassir et al. 2025 built the Arabic Pangenome Reference (APR) from 53
+UAE-resident Arab individuals drawn from eight countries, sequenced with
+~35x PacBio HiFi on Sequel IIe/Revio (30-h movies), ~54x Oxford Nanopore
+ultralong reads on R10.4.1 PromethION flow cells (96-h runs), and ~65x
+Hi-C (Illumina NovaSeq 6000). Haplotype-phased de novo assemblies were
+produced with hifiasm v0.19.5 (primary) and Verkko v1.3.1 (for
+comparison), with a median N50 of 124 Mb. The pangenome graph was built
+with Minigraph-Cactus seeded on T2T-CHM13v2 and augmented with GRCh38,
+and SVs were extracted by graph deconstruction. The released decomposed
+VCF (apr_review_v1_2902_chm13.vcf.gz) contains ~21 million
+variants on CHM13v2 contigs; after filtering to alt alleles with ≥50 bp
+length difference and collapsing the alts of each snarl into a single
+site, the APR SV track is obtained. Variants are shown natively on hs1
+and lifted to hg38 with the UCSC hs1ToHg38.over.chain.gz chain
+(variants not lifting cleanly are omitted from the hg38 version).

+
+The source APR VCF was downloaded from the Mohammed Bin Rashid
+University SharePoint page,
+mbru.ac.ae/the-arab-pangenome-reference; the accompanying project
+source code is at
+github.com/muddinmbru/arab_pangenome_reference.
+
+The step-by-step build commands (download, graph-VCF conversion, liftOver,
+bigBed build) are recorded in the UCSC makeDoc for this track container:
+doc/hg38/lrSv.txt. The conversion scripts and autoSql schemas live in
+makeDb/scripts/lrSv.

Data Access

The data can be explored interactively with the Table Browser or Data Integrator, and accessed from scripts via our API (track=aprSv).
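As a hedged illustration of script access, the sketch below builds a request URL for the UCSC REST API `getData/track` endpoint with `track=aprSv`. The helper names and the example region are hypothetical, and it assumes the track is served under the `hs1` and `hg38` genome names; the response layout should be checked against the API documentation rather than taken from this sketch.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def apr_sv_url(genome: str, chrom: str, start: int, end: int,
               track: str = "aprSv") -> str:
    """Build a getData/track URL for one region of the APR SV track."""
    params = {
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return f"{API_BASE}?{urlencode(params)}"


def fetch_apr_sv(genome: str, chrom: str, start: int, end: int) -> dict:
    """Fetch the region and return the decoded JSON response (needs network)."""
    with urlopen(apr_sv_url(genome, chrom, start, end), timeout=30) as resp:
        return json.load(resp)


# Hypothetical region, used only to show the URL shape.
example_url = apr_sv_url("hs1", "chr1", 1000000, 2000000)
```

Calling `fetch_apr_sv("hs1", "chr1", 1000000, 2000000)` would then return the parsed JSON for that window, assuming the service is reachable.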

For automated download, the bigBed files are at http://hgdownload.soe.ucsc.edu/gbdb/hs1/lrSv/apr.bb (native) and http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/apr.bb (lifted).

The original APR pangenome VCF and assemblies can be downloaded from https://www.mbru.ac.ae/the-arab-pangenome-reference/, and the project source code is at https://github.com/muddinmbru/arab_pangenome_reference.

Credits

Thanks to the Arabic Pangenome Reference team at Mohammed Bin Rashid University (Dubai), led by Mohammed Uddin, for producing and releasing the pangenome and its variant calls.

References

Nassir N, Almarri MA, Kumail M, Mohamed N, Balan B, Hanif S, AlObathani M, Jamalalail B, Elsokary H, Kondaramage D et al. A draft UAE-based Arab pangenome reference. Nat Commun. 2025 Jul 24;16(1):6747. PMID: 40707445; PMC: PMC12290100