e4696ced7edf4d0a2a995f203a751a3782cc2f16 lrnassar Fri May 29 15:00:59 2026 -0700

Add EVA SNP Release 9 - unified pipeline for native + GenArk contributed tracks. refs #37517

Adds the unified evaSnp9.py pipeline at src/hg/makeDb/scripts/evaSnp/evaSnp9.py that builds both native UCSC db bigBeds and GenArk contributed bigBeds from one driver, replacing the two separate v8 scripts (evaSnp8.py + evaSnpGenArk.py). Built 40 native + 115 contributed tracks for v9.

trackDb: evaSnp9 subtrack added to the evaSnpContainer composite with the new SNV/indel/MNV varClass labels, searchTable stanza, and parent=on/visibility=dense defaults; evaSnp8 left on per QA preference.

evaSnpContainer.html: transition note explaining the v9 SO-label refresh (SNV/deletion/insertion/indel/MNV/sequence_alteration replacing the legacy substitution/delins/multipleNucleotideVariant labels in v3-v8) and the new single-most-severe ucscClass convention; SO term list updated to dual-label each entry; download example URL bumped to evaSnp9.bb.

Makedoc at src/hg/makeDb/doc/evaSnp9.txt documents the unified build, deploy, and per-clade GenArk make steps.

diff --git src/hg/makeDb/doc/evaSnp9.txt src/hg/makeDb/doc/evaSnp9.txt
new file mode 100644
index 00000000000..d1f9ee569c8
--- /dev/null
+++ src/hg/makeDb/doc/evaSnp9.txt
@@ -0,0 +1,95 @@
+# Track for EVA snp release 9 - https://www.ebi.ac.uk/eva/?RS-Release&releaseVersion=9
+# Tracks built by Lou - RM #37517
+
+# First release built by a unified pipeline that produces BOTH the native
+# tracks (on UCSC databases) and the GenArk contributed tracks. Replaces the
+# two separate v8 scripts (evaSnp8.py and evaSnpGenArk.py).
+
+# Unified pipeline lives at:
+#   ~/kent/src/hg/makeDb/scripts/evaSnp/evaSnp9.py
+
+# Discovery (three-bucket classification):
+#   ./evaSnp9.py classify
+#   - native: EVA assembly matched an active UCSC db -> /gbdb deployment
+#   - contrib: EVA assembly matched a GenArk hub only -> contrib deployment
+#   - skip: no UCSC db or GenArk hub for that EVA assembly
+# Overlap policy: native wins. An assembly that resolves to both a UCSC db
+# and a GenArk hub is built ONLY as a native track.
+
+# Build:
+#   ./evaSnp9.py build all -j 8
+#   Builds every assembly in native + contrib buckets in parallel.
+#   Per-assembly logs at /hive/data/outside/eva9/.../pipeline.log
+#   Failed builds get renamed to <workDir>.failed so logs survive.
+
+# Deploy native (after `build all` and trackDb commit):
+#   ./evaSnp9.py deploy native
+#   Symlinks /hive/data/outside/eva9/<db>/evaSnp9.bb into /gbdb/<db>/bbi/
+#   Writes /hive/data/outside/eva9/assemblyReleaseList.txt
+#   Then add the evaSnp9 stanza to ~/kent/src/hg/makeDb/trackDb/evaSnp.ra
+#   under the evaSnpContainer composite (parent on; flip evaSnp8 to off),
+#   commit, and run `make alpha` from src/hg/makeDb/trackDb to push.
+
+# Deploy contrib (after `build all` and trackDb commit):
+#   ./evaSnp9.py deploy contrib
+#   Generates /hive/data/outside/genark/evaSnp9/{contributedTracks->,
+#   evaSnp9.trackDb.txt, mkLinks.sh} and runs mkLinks.sh,
+#   which injects symlinks into each GenArk hub's contrib/evaSnp9/ dir.
+#   Then add 'evaSnp9' to:
+#     ~/kent/src/hg/makeDb/trackDb/betaGenArk.txt
+#     ~/kent/src/hg/makeDb/trackDb/publicGenArk.txt
+#   Then run the per-clade GenArk make steps:
+
+cd ~/kent/src/hg/makeDb/doc
+for D in plantsAsmHub birdsAsmHub fishAsmHub primatesAsmHub legacyAsmHub mammalsAsmHub invertebrateAsmHub fungiAsmHub bacteriaAsmHub
+do
+  cd "${D}"
+  time (make) > dbg 2>&1
+  egrep --color=auto -i "error|fail|missing|cannot|clade|class|real" dbg
+  time (make verifyTestDownload) >> test.down.log 2>&1
+  egrep --color=auto -i "error|fail|missing|cannot|clade|class|real" test.down.log
+  time (make sendDownload) >> send.down.log 2>&1
+  egrep --color=auto -i "error|fail|missing|cannot|clade|class|real" send.down.log
+  time (make verifyDownload) >> verify.down.log 2>&1
+  egrep --color=auto -i "error|fail|missing|cannot|clade|class|real" verify.down.log
+  cd ~/kent/src/hg/makeDb/doc
+done
+
+# Run those four make steps in parallel across each clade dir; bacteriaAsmHub
+# (~22k orderList entries) is the long pole.
+
+# --- Key v9 changes vs v8 ---
+# 1. Unified pipeline (one script instead of two).
+# 2. varClass labels updated: SNV / deletion / insertion / indel / MNV /
+#    sequence_alteration (was substitution / delins / multipleNucleotideVariant
+#    / sequence alteration in native v3-v8; the contrib v8 already used new
+#    labels). The trackDb filterValues.varClass on evaSnp9 reflects the new set;
+#    older subtracks (evaSnp..evaSnp8) keep their existing filterValues because
+#    those bigBeds still encode the legacy terms.
+# 3. ucscClass field now stores the single most-severe consequence per rsID,
+#    ranked by Sequence Ontology severity (was a comma-separated list in
+#    native v3-v8). The filterValues.ucscClass + multipleListOnlyOr filter
+#    semantics remain unchanged.
+# 4. Native chromAlias lookups now use hgsql chromAlias directly (alias/chrom/
+#    source schema) instead of chromToUcsc with multi-step fallbacks. No more
+#    per-db hardcoded hacks (galGal5, bosTau9, ce11, mm10, oviAri3, bosTau6
+#    no longer need special cases).
+# 5. REF allele validation: every build samples 200 SNVs and rejects if
+#    <95% of REF alleles match the assembly 2bit. Caught two v8 mis-version
+#    VCFs (maize, wheat) in retrospect; now a standard part of every build.
+# 6. Version-mismatch chrom-coverage threshold: builds require >=10% of EVA
+#    VCF chroms to map onto the assembly's chrom names. Any successful build
+#    where >40% of chroms didn't map is flagged in the final summary for QA.
+# 7. hgVai chunks chromosomes at 5 MB to bound SIGSEGV-related data loss.
+
+# --- Counts at v9 ---
+# EVA-9 assemblies on FTP: 244
+# Bucket counts (from `./evaSnp9.py classify`):
+#   Native: 43
+#   Contrib: 125
+#   Skipped: 76
+
+# v8 had 41 native + 118 contrib. The +2 native is mostly EVA-9 picking up
+# new assemblies; the +7 contrib reflects the synthetic GCF-prefix fallback
+# in the new discovery (catches assemblies where the NCBI summary lacks the
+# GCA->GCF mapping but the GCF hub exists).
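
The overlap policy and the numeric gates described in the makedoc (native wins; items 5, 6, and 7) can be sketched in Python as below. This is an illustration only, not code taken from evaSnp9.py: every function and constant name here is hypothetical, "5 MB" is assumed to mean 5,000,000 bases, and the comparisons use integer arithmetic so the stated thresholds are applied exactly at the boundary.

```python
# Hypothetical sketch of the rules stated in the makedoc; names are not from evaSnp9.py.

REF_SAMPLE_SIZE = 200       # SNVs sampled per build (item 5)
REF_MIN_MATCH_PCT = 95      # reject when fewer than 95% of REF alleles match (item 5)
CHROM_MIN_MAPPED_PCT = 10   # build requires >=10% of VCF chroms to map (item 6)
CHROM_QA_UNMAPPED_PCT = 40  # flag for QA when >40% of chroms did not map (item 6)
CHUNK_SIZE = 5_000_000      # assumed reading of "5 MB" as bases per hgVai chunk (item 7)


def classify_bucket(has_ucsc_db, has_genark_hub):
    """Three-bucket discovery: native wins when both a db and a hub exist."""
    if has_ucsc_db:
        return "native"
    if has_genark_hub:
        return "contrib"
    return "skip"


def ref_check_passes(matches, sampled=REF_SAMPLE_SIZE):
    """True when at least 95% of the sampled REF alleles match the 2bit."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    return matches * 100 >= REF_MIN_MATCH_PCT * sampled


def chrom_coverage_status(mapped, total):
    """Return 'reject', 'flag', or 'ok' from the chrom-name mapping counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    if mapped * 100 < CHROM_MIN_MAPPED_PCT * total:
        return "reject"
    unmapped = total - mapped
    if unmapped * 100 > CHROM_QA_UNMAPPED_PCT * total:
        return "flag"
    return "ok"


def chunk_chrom(size, chunk=CHUNK_SIZE):
    """Yield 0-based half-open (start, end) ranges covering one chromosome."""
    for start in range(0, size, chunk):
        yield start, min(start + chunk, size)
```

Under these assumptions, a build with 190 of 200 matching REF alleles passes the check and one with 189 does not; a build with 30 of 100 chroms mapped succeeds but is flagged for QA, since 70% did not map.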