bac95a147f49cd331052e597006e04b3deee40fc max Wed Apr 22 10:43:20 2026 -0700

lrSv/srSv: human-readable SV type filter labels, script cleanups

Add human-readable labels to the supertrack-level svType filter on both the lrSv and srSv supertracks using the "CODE|CODE (Long name)" filterValues syntax: DEL -> "DEL (Deletion)", INS -> "INS (Insertion)", etc. Labels keep the short code up front so users can match what hgTracks shows next to each feature.

Also sweep in the in-progress converter/as-file cleanups under scripts/lrSv/ and scripts/srSv/ (introduction of lrSvCommon.py helpers, consistent insLen / svLen / AC column naming, tightened field-description text) that had been piling up as an unstaged working tree.

refs #36258

diff --git src/hg/makeDb/trackDb/human/ga4kSv.html src/hg/makeDb/trackDb/human/ga4kSv.html
index 0a4b668a46e..1f6ef9d049e 100644
--- src/hg/makeDb/trackDb/human/ga4kSv.html
+++ src/hg/makeDb/trackDb/human/ga4kSv.html
@@ -1,103 +1,118 @@
This track shows structural variants (SVs) identified by PacBio HiFi long-read sequencing of probands and their families enrolled in the Genomic Answers for Kids (GA4K) program at Children's Mercy Research Institute. GA4K is a longitudinal pediatric genomics initiative that aims to enroll 30,000 children with suspected rare genetic disorders, together with their parents, to build a large-scale resource of clinical and genomic data.
The callset contains 115,554 SVs (52,564 deletions, 58,219 insertions, 4,408 duplications, 363 inversions) from 502 sequenced samples. Variants are site-level (no per-sample genotypes) and each SV has been replicated, meaning that it was either observed in two or more unrelated GA4K individuals, or matched an SV from an external long-read reference set (Decode or the Human Pangenome Reference Consortium).
Items are colored by SV type:
Insertions are placed at the insertion site with a width of 1 bp; deletions, duplications and inversions span the affected interval. Filters are available for SV type, SV length, carrier-sample count and allele frequency. The detail page also shows the total number of samples genotyped at each site.
-Samples were sequenced on PacBio Revio and Sequel II instruments with HiFi
-chemistry. Single-sample SV callsets were produced with pbsv and then merged
-across the cohort with JASMINE v1.1.4 (jasmine --output-genotypes),
-which clusters equivalent SVs across samples and writes a site-level multi-sample
-VCF.
+The Genomic Answers for Kids (GA4K) program at Children's Mercy Research
+Institute is a longitudinal pediatric rare-disease initiative described in
+Cohen et al. 2022. GA4K probands and their families are sequenced with
+PacBio HiFi long reads (Revio and Sequel II), and the 502-sample GA4K
+PacBio SV release (pb_joint_merged.sv.vcf.gz) is produced by
+running
+pbsv per sample and merging with
+JASMINE
+v1.1.4 (--output-genotypes). The merged site-level VCF is
+filtered to SVs replicated in at least two independent observations
+(either matching a second unrelated CMH individual in the same Jasmine
+cluster, or matching an SV in the deCODE Icelandic or HPRC callsets via
+
+svpack match). The released catalog contains 115,554 replicated SVs
+(52,564 deletions, 58,219 insertions, 4,408 duplications and 363
+inversions) with recomputed carrier counts (SVC), total sample counts
+(SVN) and allele frequencies (SVF = SVC/SVN).
-To reduce false positives, the merged VCF was filtered to retain only SVs that
-were replicated in at least two independent observations: either (1) matching a
-second SV from another unrelated Children's Mercy (CMH) individual within the
-same Jasmine cluster, or (2) matching an SV from the Decode Icelandic or Human
-Pangenome Reference Consortium (HPRC) callsets using
-svpack match with default settings.
+The source VCF was cloned from the Children's Mercy Research Institute
+GA4K GitHub repository,
+
+github.com/ChildrensMercyResearchInstitute/GA4K
+(pacbio_sv_vcf/pb_joint_merged.sv.vcf.gz).
-Carrier counts (SVC), total sample counts (SVN) and allele frequencies
-(SVF = SVC/SVN) were recomputed on the replicated callset.
+The step-by-step build commands (download, format conversion, bigBed build)
+are recorded in the UCSC makeDoc for this track container:
+
+doc/hg38/lrSv.txt. The conversion scripts and autoSql schemas live in
+
+makeDb/scripts/lrSv.
The data can be explored interactively in table format with the Table Browser or the Data Integrator and exported from there to spreadsheet or tab-sep tables. From scripts, the data can be accessed through our API, track=ga4kSv.
For automated download and analysis, the annotation is stored in a bigBed file that can be downloaded from our download server. The file for this track is called ga4kSv.bb. Individual regions or the whole annotation can be obtained using the bigBedToBed utility, available as a precompiled binary or from source as described on our utilities page. Example: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/ga4kSv.bb -chrom=chr21 -start=0 -end=100000000 stdout.
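For scripted access, the region query above can be assembled programmatically. The following is a minimal Python sketch, not part of the track documentation: the helper names bigbed_to_bed_cmd and api_url are hypothetical, and the API endpoint (https://api.genome.ucsc.edu/getData/track with genome, track, chrom, start and end parameters) is an assumption based on the public UCSC REST API rather than something stated on this page. The sketch only builds the command line and the URL; it does not run bigBedToBed or contact any server.

```python
from urllib.parse import urlencode

# bigBed location taken from the example command in the track description.
BIGBED_URL = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/lrSv/ga4kSv.bb"
# Assumed REST endpoint for track data; not named in the track description.
API_BASE = "https://api.genome.ucsc.edu/getData/track"


def bigbed_to_bed_cmd(chrom, start, end, out="stdout"):
    """Return the bigBedToBed argument list for one region (hypothetical helper)."""
    return [
        "bigBedToBed",
        BIGBED_URL,
        f"-chrom={chrom}",
        f"-start={start}",
        f"-end={end}",
        out,
    ]


def api_url(chrom, start, end, genome="hg38", track="ga4kSv"):
    """Return a query URL for the same region (hypothetical helper)."""
    params = {
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return f"{API_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the command and the URL for the region used in the example above.
    print(" ".join(bigbed_to_bed_cmd("chr21", 0, 100000000)))
    print(api_url("chr21", 0, 100000000))
```

The argument list can be passed to subprocess.run once bigBedToBed is installed and on the PATH; the URL can be fetched with any HTTP client.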
The original VCF is available from the Children's Mercy Research Institute GA4K data release at github.com/ChildrensMercyResearchInstitute/GA4K.
Thanks to the Children's Mercy Research Institute and the Genomic Answers for Kids participants and their families for making this dataset publicly available.
Cohen ASA, Farrow EG, Abdelmoity AT, Alaimo JT, Amudhavalli SM, Anderson JT, Bansal L, Bartik L, Baybayan P, Belden B et al. Genomic answers for children: Dynamic analyses of >1000 pediatric rare disease genomes. Genet Med. 2022 Jun;24(6):1336-1348. PMID: 35305867