2cb2236b2ebb81e95c173a35ec590f5a086b86d0 lrnassar Thu Mar 26 17:32:56 2026 -0700
Adding ENCODE4 Integrated Regulation tracks for hg38 (wgEncodeReg4) and mm10 (encode4Reg). Each supertrack contains 6 organ-averaged multiWig signal tracks (H3K27ac, DNase, ATAC, H3K4me3, CTCF, Transcription) and 3 bigComposite faceted individual experiment composites (Epigenetics, RNA-seq, TF ChIP-seq) using S3 URLs and the new faceted composite UI. hg38 also includes a TF rPeaks track. ENCODE3 regulation tracks are release-tagged to show a snowflake deprecation notice on alpha while remaining unchanged on beta/public. Includes generation scripts, makedocs, HTML descriptions, relatedTracks, and metadata TSVs. refs #34923

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

diff --git src/hg/makeDb/doc/hg38/encode4.regulation.txt src/hg/makeDb/doc/hg38/encode4.regulation.txt
new file mode 100644
index 00000000000..b28ac58efbe
--- /dev/null
+++ src/hg/makeDb/doc/hg38/encode4.regulation.txt
@@ -0,0 +1,249 @@
+# ENCODE4 Integrated Regulation Track (wgEncodeReg4) for hg38
+# Redmine #34923
+# Lou Nassar, 2026-03-12
+
+# This track converts the ENCODE V4 Regulation hub into a native UCSC browser
+# supertrack containing organ-averaged multiWig signal tracks, TF rPeak clusters,
+# and individual experiment composites (bigComposite/faceted format) for
+# epigenetics, RNA-seq, and TF ChIP-seq.
+
+# The original hub was prepared by Mingshi Gao (Weng lab, UMass Chan Medical School):
+#   https://users.wenglab.org/gaomingshi/ENCODE_Reg/hub.txt
+# It was cloned locally for processing:
+#   /hive/data/outside/encode4/ccre/ENCODE_V4_Regulation/hub.txt (93K lines)
+
+# Scripts are located at:
+#   kent/src/hg/makeDb/scripts/encode4regulation/
+
+##############################################################################
+# Step 1: Clone the hub data locally
+##############################################################################
+
+# The hub was cloned with hubClone -download:
+mkdir -p /hive/data/outside/encode4/ccre
+cd /hive/data/outside/encode4/ccre
+hubClone -download https://users.wenglab.org/gaomingshi/ENCODE_Reg/hub.txt \
+    ENCODE_V4_Regulation
+
+# Total data: ~5.6 TB across ~7,000 files (bigWig + bigBed)
+
+##############################################################################
+# Step 2: Create gbdb symlinks
+##############################################################################
+
+# Only organ-averaged multiWig files and TF rPeak files need gbdb symlinks.
+# Individual experiment tracks use S3 URLs directly.
+
+# Organ-averaged multiWig signals (163 organ signals + 102 RNA strand files)
+mkdir -p /gbdb/hg38/encode4/regulation/organAve
+cd /hive/data/outside/encode4/ccre/ENCODE_V4_Regulation
+for f in $(grep -oP '^\S+\.bigWig' hub.txt | grep -v ENCFF | sort -u); do
+    [ -f "$f" ] && ln -s $(pwd)/$f /gbdb/hg38/encode4/regulation/organAve/$f
+done
+# Also symlink strand-specific RNA files:
+for f in *.minus.bigWig *.plus.bigWig; do
+    [ -f "$f" ] && ln -s $(pwd)/$f /gbdb/hg38/encode4/regulation/organAve/$f
+done
+# Total: 265 symlinks
+
+# TF rPeak clusters (from existing ENCODEv4TFrPeaks bed)
+mkdir -p /gbdb/hg38/encode4/regulation/tfRpeak
+ln -s /hive/data/genomes/hg38/bed/ENCODEv4TFrPeaks/no_trim.TF_name.rPeaks.bb \
+    /gbdb/hg38/encode4/regulation/tfRpeak/TFrPeakClusters.bb
+ln -s /hive/data/genomes/hg38/bed/ENCODEv4TFrPeaks/no_trim.TF_name.decorator.bb \
+    /gbdb/hg38/encode4/regulation/tfRpeak/TFrPeakClustersDecorator.bb
+
+# Final symlink counts:
+#   organAve/  265
+#   tfRpeak/     2
+#   Total:     267
+
+##############################################################################
+# Step 3: Validate local files against ENCODE portal
+##############################################################################
+
+# Run the validation script to verify md5sums of local files against the
+# ENCODE REST API and generate S3 URL mapping for bigComposite tracks.
+
+cd /hive/users/lrnassar/claude/RM34923
+python3 kent/src/hg/makeDb/scripts/encode4regulation/validate_encode_urls.py
+
+# This produces:
+#   encode4_url_mapping.tsv — maps 6,747 accessions to S3 URLs with md5 validation
+#   encode4_validation.log — detailed log of any mismatches
+# The mapping file is used by the bigComposite conversion script (Step 5).
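The md5 check described in Step 3 can be illustrated with a minimal sketch. The diff does not show the body of validate_encode_urls.py, so this is not that script: the function names `md5_of_file` and `find_mismatches` are hypothetical, and the expected checksums are assumed to be supplied as a plain dict (in the real run they would come from the ENCODE portal metadata).

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, reading in chunks so large bigWigs fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected_md5_by_path):
    """Compare local files against expected md5 values (hypothetical helper).

    Returns a list of (path, expected, actual) tuples; actual is None when
    the local file is missing.
    """
    mismatches = []
    for path, expected in expected_md5_by_path.items():
        if not Path(path).is_file():
            mismatches.append((path, expected, None))
            continue
        actual = md5_of_file(path)
        if actual != expected.lower():
            mismatches.append((path, expected, actual))
    return mismatches
```

An empty list from `find_mismatches` would correspond to a clean validation log; any tuples returned are the entries one would expect to investigate before trusting the S3 URL mapping.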
+
+##############################################################################
+# Step 4: Generate multiWig trackDb stanzas
+##############################################################################
+
+# The multiWig organ-averaged tracks (H3K27ac, DNase, ATAC, H3K4me3, CTCF, Txn)
+# are generated from the hub.txt and assembled into the main wgEncodeReg4.ra.
+
+cd /hive/users/lrnassar/claude/RM34923
+python3 kent/src/hg/makeDb/scripts/encode4regulation/generate_multiwig_ra.py \
+    > multiwig_output.ra
+
+# The output was manually integrated into wgEncodeReg4.ra along with:
+#   - SuperTrack definition (priority 1.5, group=regulation)
+#   - TF rPeak track (bigBed 12+ with decorator, 912-factor filterValues)
+#   - Include directives for the 3 bigComposite files
+# The main file is hand-maintained:
+#   kent/src/hg/makeDb/trackDb/human/hg38/wgEncodeReg4.ra (~3,327 lines)
+
+# multiWig track details:
+#   wgEncodeReg4MarkH3k27ac — full visibility, 5 organs default ON
+#   wgEncodeReg4Dnase — hidden
+#   wgEncodeReg4Atac — hidden
+#   wgEncodeReg4MarkH3k4me3 — hidden
+#   wgEncodeReg4MarkCtcf — hidden
+#   wgEncodeReg4Txn — hidden
+# Each multiWig has both tissue-only and all-biosamples variants as subtracks.
+
+##############################################################################
+# Step 5: Generate bigComposite (faceted) individual experiment tracks
+##############################################################################
+
+# The three individual experiment composites use the new bigComposite/faceted
+# format (refs #36320). This was a two-step process:
+#
+# Step 5a: Generate traditional composites from hub
+#   generate_composites.py parses hub.txt and creates compositeTrack-on-style
+#   .ra files with subGroups, views, dimensions, etc.
+#
+# Step 5b: Convert to bigComposite faceted format
+#   convert_to_bigcomposite.py reads those .ra files, strips subGroups/views,
+#   adds metaDataUrl + primaryKey, generates metadata TSVs, and uses S3 URLs
+#   from the validation mapping (Step 3).
+
+# First regenerate the traditional composites (if needed as intermediate):
+cd /hive/users/lrnassar/claude/RM34923
+python3 kent/src/hg/makeDb/scripts/encode4regulation/generate_composites.py
+
+# Then convert to bigComposite format:
+python3 kent/src/hg/makeDb/scripts/encode4regulation/convert_to_bigcomposite.py
+
+# This overwrites the .ra files in place and creates metadata TSVs:
+#   /gbdb/hg38/encode4/regulation/wgEncodeReg4Epigenetics_metadata.tsv
+#   /gbdb/hg38/encode4/regulation/wgEncodeReg4RnaSeq_metadata.tsv
+#   /gbdb/hg38/encode4/regulation/wgEncodeReg4TfChip_metadata.tsv
+#
+# Output .ra files (in kent/src/hg/makeDb/trackDb/human/hg38/):
+#   wgEncodeReg4Epigenetics.ra — 3,199 subtracks, facets: Assay, Organ,
+#                                Biosample Type, Life Stage
+#   wgEncodeReg4RnaSeq.ra      — 1,046 subtracks, facets: Organ,
+#                                Biosample Type, Life Stage, Strand
+#   wgEncodeReg4TfChip.ra      — 2,502 subtracks, facets: TF, Organ,
+#                                Biosample Type, Life Stage
+#
+# The "Biosample" column is prefixed with underscore (_Biosample) to hide
+# it from the faceted UI while retaining it as metadata (per Jonathan Casper's
+# recommendation, refs #36320).
+#
+# Default-ON subtracks (5 per composite):
+#   Epigenetics: untreated K562, one per assay (DNase, ATAC, H3K4me3, H3K27ac, CTCF)
+#   RNA-seq: K562 +/- strand, GM12878 +/- strand, HepG2 + strand
+#   TF ChIP: K562 CTCF, POLR2A, MYC, MAX, EP300
+#
+# All subtracks use S3 URLs (encode-public.s3.amazonaws.com) for bigDataUrl.
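The underscore convention for hiding a metadata column from the faceted UI can be sketched as follows. This is only an illustration under stated assumptions: the actual column layout written by convert_to_bigcomposite.py is not shown in this diff, so the key column name `track`, the function `write_metadata_tsv`, and the sample row are all hypothetical.

```python
import csv
import io


def write_metadata_tsv(rows, facet_columns, hidden_columns, key_column="track"):
    """Render metadata rows as TSV text (hypothetical layout).

    Facet columns keep their names; hidden columns get a leading underscore
    so they stay in the file without appearing as facets.
    """
    header = [key_column] + list(facet_columns) + ["_" + c for c in hidden_columns]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow(
            [row[key_column]]
            + [row[c] for c in facet_columns]
            + [row[c] for c in hidden_columns]
        )
    return out.getvalue()


# Usage with a made-up row (not a real ENCODE accession):
example_rows = [
    {"track": "exampleTrack1", "Organ": "liver", "Life Stage": "adult",
     "Biosample": "hepatocyte"},
]
tsv_text = write_metadata_tsv(example_rows, ["Organ", "Life Stage"], ["Biosample"])
```

Keeping the hidden column in the same file, rather than dropping it, means the biosample name remains available as metadata for each subtrack while the facet list stays short.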
+
+##############################################################################
+# Step 6: Create HTML description pages
+##############################################################################
+
+# 11 HTML files were created in kent/src/hg/makeDb/trackDb/human/hg38/:
+#   wgEncodeReg4.html — SuperTrack overview
+#   wgEncodeReg4MarkH3k27ac.html — H3K27ac layered signal
+#   wgEncodeReg4Dnase.html — DNase layered signal
+#   wgEncodeReg4Atac.html — ATAC layered signal
+#   wgEncodeReg4MarkH3k4me3.html — H3K4me3 layered signal
+#   wgEncodeReg4MarkCtcf.html — CTCF layered signal
+#   wgEncodeReg4Txn.html — Transcription layered signal
+#   wgEncodeReg4TfPeaks.html — TF rPeaks
+#   wgEncodeReg4Epigenetics.html — Individual epigenetics composite
+#   wgEncodeReg4RnaSeq.html — Individual RNA-seq composite
+#   wgEncodeReg4TfChip.html — Individual TF ChIP composite
+
+##############################################################################
+# Step 7: Add related tracks and trackDb include
+##############################################################################
+
+# Added to kent/src/hg/makeDb/trackDb/human/hg38/trackDb.ra:
+#   include wgEncodeReg4.ra alpha
+
+# Added reciprocal entries to relatedTracks.ra:
+#   hg38 wgEncodeReg4 wgEncodeReg ENCODE4 update of ENCODE3 Regulation
+#   hg38 wgEncodeReg wgEncodeReg4 ENCODE4 update of ENCODE3 Regulation
+#   hg38 wgEncodeReg4 cCREs Related ENCODE4 cCRE annotations
+#   hg38 cCREs wgEncodeReg4 Related ENCODE4 regulation data
+
+##############################################################################
+# Step 8: Release tags (ENCODE3 transition)
+##############################################################################
+
+# On alpha (dev): ENCODE4 visible, ENCODE3 hidden with snowflake
+# On beta/public: ENCODE3 visible as-is, ENCODE4 not visible
+#
+# Approach: Duplicate wgEncodeReg.ra into wgEncodeReg.alpha.ra with:
+#   - superTrack on hide (hidden by default)
+#   - pennantIcon snowflake.png (deprecation notice)
+#   - Inner includes tagged with release alpha
+# The original wgEncodeReg.ra gets inner includes tagged beta,public.
+#
+# trackDb.ra includes:
+#   include wgEncodeReg.ra beta,public
+#   include wgEncodeReg.alpha.ra alpha
+#   include wgEncodeReg4.ra alpha
+
+##############################################################################
+# Step 9: Load trackDb
+##############################################################################
+
+cd /cluster/home/lrnassar/kent/src/hg/makeDb/trackDb
+
+# Sandbox (personal testing):
+make DBS=hg38
+
+# Dev (hgwdev):
+make alpha DBS=hg38
+
+# 20,658 track descriptions loaded
+
+##############################################################################
+# Track hierarchy summary
+##############################################################################
+
+# wgEncodeReg4 (superTrack, priority 1.5, group=regulation)
+# ├── wgEncodeReg4MarkH3k27ac (multiWig, full, 5 organs ON)
+# ├── wgEncodeReg4Dnase (multiWig, hide)
+# ├── wgEncodeReg4Atac (multiWig, hide)
+# ├── wgEncodeReg4MarkH3k4me3 (multiWig, hide)
+# ├── wgEncodeReg4MarkCtcf (multiWig, hide)
+# ├── wgEncodeReg4Txn (multiWig, hide)
+# ├── wgEncodeReg4TfPeaks (bigBed 12+ with decorator, hide)
+# ├── wgEncodeReg4Epigenetics (bigComposite faceted, 3,199 subtracks, 5 ON)
+# ├── wgEncodeReg4RnaSeq (bigComposite faceted, 1,046 subtracks, 5 ON)
+# └── wgEncodeReg4TfChip (bigComposite faceted, 2,502 subtracks, 5 ON)
+
+# gbdb contents (/gbdb/hg38/encode4/regulation/):
+#   organAve/ — 265 multiWig symlinks
+#   tfRpeak/ — 2 TF rPeak symlinks
+#   wgEncodeReg4Epigenetics_metadata.tsv
+#   wgEncodeReg4RnaSeq_metadata.tsv
+#   wgEncodeReg4TfChip_metadata.tsv
+
+##############################################################################
+# Known upstream hub issues (reported to Weng lab)
+##############################################################################
+
+# Report at:
+#   https://hgwdev.gi.ucsc.edu/~lrnassar/temp/encode4_regulation_hub_issues.md
+#
+# 1. Duplicate bigDataUrl — tissue-only and all-biosamples variants share
+#    same file for 5 non-RNA assays (83 pairs, 166 subtracks)
+# 2. tp. prefix inconsistency — RNA correctly uses tp. prefix; other assays
+#    don't, causing issue 1
+# 3. "paraythroid" typo — should be "parathyroid"; affects 3 assays,
+#    3 filenames, 21 hub lines
+# 4. Bad track name: ATAC_ENCFF128Muscle (should be ATAC_ENCFF128OID) —
+#    handled by convert_to_bigcomposite.py fallback logic
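Issue 1 above (several subtracks pointing at the same bigDataUrl) is the kind of problem a short scan of hub.txt can surface. The sketch below is one possible way to do that, not the method used to produce the linked report: the function name `find_duplicate_big_data_urls` is hypothetical, and it assumes the usual trackDb stanza layout in which a `track` line precedes that track's `bigDataUrl` line.

```python
from collections import defaultdict


def find_duplicate_big_data_urls(hub_text):
    """Map each bigDataUrl used by more than one track to the track names using it."""
    tracks_by_url = defaultdict(list)
    current_track = None
    for raw_line in hub_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # setting name, then the rest of the line
        if len(parts) != 2:
            continue
        key, value = parts[0], parts[1].strip()
        if key == "track":
            current_track = value
        elif key == "bigDataUrl" and current_track is not None:
            tracks_by_url[value].append(current_track)
    return {url: names for url, names in tracks_by_url.items() if len(names) > 1}


# Usage on a made-up three-stanza hub fragment (track and file names are invented):
sample_hub = """
track sigA
bigDataUrl shared.bigWig

track sigB
bigDataUrl shared.bigWig

track sigC
bigDataUrl other.bigWig
"""
duplicates = find_duplicate_big_data_urls(sample_hub)
```

Run against the real hub.txt, the size of the returned dict could be compared with the 83 pairs noted in the report, though that comparison has not been made here.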