2cb2236b2ebb81e95c173a35ec590f5a086b86d0
lrnassar
  Thu Mar 26 17:32:56 2026 -0700
Adding ENCODE4 Integrated Regulation tracks for hg38 (wgEncodeReg4) and mm10
(encode4Reg). Each supertrack contains 6 organ-averaged multiWig signal tracks
(H3K27ac, DNase, ATAC, H3K4me3, CTCF, Transcription) and 3 bigComposite
faceted individual experiment composites (Epigenetics, RNA-seq, TF ChIP-seq)
using S3 URLs and the new faceted composite UI. hg38 also includes a TF rPeaks
track. ENCODE3 regulation tracks are release-tagged to show a snowflake
deprecation notice on alpha while remaining unchanged on beta/public. Includes
generation scripts, makedocs, HTML descriptions, relatedTracks, and metadata
TSVs. refs #34923

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

diff --git src/hg/makeDb/doc/hg38/encode4.regulation.txt src/hg/makeDb/doc/hg38/encode4.regulation.txt
new file mode 100644
index 00000000000..b28ac58efbe
--- /dev/null
+++ src/hg/makeDb/doc/hg38/encode4.regulation.txt
@@ -0,0 +1,249 @@
+# ENCODE4 Integrated Regulation Track (wgEncodeReg4) for hg38
+# Redmine #34923
+# Lou Nassar, 2026-03-12
+
+# This track converts the ENCODE V4 Regulation hub into a native UCSC browser
+# supertrack containing organ-averaged multiWig signal tracks, TF rPeak clusters,
+# and individual experiment composites (bigComposite/faceted format) for
+# epigenetics, RNA-seq, and TF ChIP-seq.
+
+# The original hub was prepared by Mingshi Gao (Weng lab, UMass Chan Medical School):
+#   https://users.wenglab.org/gaomingshi/ENCODE_Reg/hub.txt
+# It was cloned locally for processing:
+#   /hive/data/outside/encode4/ccre/ENCODE_V4_Regulation/hub.txt (93K lines)
+
+# Scripts are located at:
+#   kent/src/hg/makeDb/scripts/encode4regulation/
+
+##############################################################################
+# Step 1: Clone the hub data locally
+##############################################################################
+
+# The hub was cloned with hubClone -download:
+mkdir -p /hive/data/outside/encode4/ccre
+cd /hive/data/outside/encode4/ccre
+hubClone -download https://users.wenglab.org/gaomingshi/ENCODE_Reg/hub.txt \
+    ENCODE_V4_Regulation
+
+# Total data: ~5.6 TB across ~7,000 files (bigWig + bigBed)
+
+##############################################################################
+# Step 2: Create gbdb symlinks
+##############################################################################
+
+# Only organ-averaged multiWig files and TF rPeak files need gbdb symlinks.
+# Individual experiment tracks use S3 URLs directly.
+
+# Organ-averaged multiWig signals (163 organ signals + 102 RNA strand files)
+mkdir -p /gbdb/hg38/encode4/regulation/organAve
+cd /hive/data/outside/encode4/ccre/ENCODE_V4_Regulation
+for f in $(grep -oP '^\S+\.bigWig' hub.txt | grep -v ENCFF | sort -u); do
+    [ -f "$f" ] && ln -s "$(pwd)/$f" "/gbdb/hg38/encode4/regulation/organAve/$f"
+done
+# Also symlink strand-specific RNA files:
+for f in *.minus.bigWig *.plus.bigWig; do
+    [ -f "$f" ] && ln -s "$(pwd)/$f" "/gbdb/hg38/encode4/regulation/organAve/$f"
+done
+# Total: 265 symlinks
+
+# TF rPeak clusters (from existing ENCODEv4TFrPeaks bed)
+mkdir -p /gbdb/hg38/encode4/regulation/tfRpeak
+ln -s /hive/data/genomes/hg38/bed/ENCODEv4TFrPeaks/no_trim.TF_name.rPeaks.bb \
+    /gbdb/hg38/encode4/regulation/tfRpeak/TFrPeakClusters.bb
+ln -s /hive/data/genomes/hg38/bed/ENCODEv4TFrPeaks/no_trim.TF_name.decorator.bb \
+    /gbdb/hg38/encode4/regulation/tfRpeak/TFrPeakClustersDecorator.bb
+
+# Final symlink counts:
+#   organAve/  265
+#   tfRpeak/     2
+#   Total:     267
+
+##############################################################################
+# Step 3: Validate local files against ENCODE portal
+##############################################################################
+
+# Run the validation script to verify md5sums of local files against the
+# ENCODE REST API and generate the S3 URL mapping for bigComposite tracks.
+
+cd /hive/users/lrnassar/claude/RM34923
+python3 kent/src/hg/makeDb/scripts/encode4regulation/validate_encode_urls.py
+
+# This produces:
+#   encode4_url_mapping.tsv  — maps 6,747 accessions to S3 URLs with md5 validation
+#   encode4_validation.log   — detailed log of any mismatches
+# The mapping file is used by the bigComposite conversion script (Step 5).
+
+##############################################################################
+# Step 4: Generate multiWig trackDb stanzas
+##############################################################################
+
+# The multiWig organ-averaged tracks (H3K27ac, DNase, ATAC, H3K4me3, CTCF, Txn)
+# are generated from hub.txt and assembled into the main wgEncodeReg4.ra.
+
+cd /hive/users/lrnassar/claude/RM34923
+python3 kent/src/hg/makeDb/scripts/encode4regulation/generate_multiwig_ra.py \
+    > multiwig_output.ra
+
+# The output was manually integrated into wgEncodeReg4.ra along with:
+#   - SuperTrack definition (priority 1.5, group=regulation)
+#   - TF rPeak track (bigBed 12+ with decorator, 912-factor filterValues)
+#   - Include directives for the 3 bigComposite files
+# The main file is hand-maintained:
+#   kent/src/hg/makeDb/trackDb/human/hg38/wgEncodeReg4.ra (~3,327 lines)
+
+# multiWig track details:
+#   wgEncodeReg4MarkH3k27ac — full visibility, 5 organs default ON
+#   wgEncodeReg4Dnase       — hidden
+#   wgEncodeReg4Atac        — hidden
+#   wgEncodeReg4MarkH3k4me3 — hidden
+#   wgEncodeReg4MarkCtcf    — hidden
+#   wgEncodeReg4Txn         — hidden
+# Each multiWig has both tissue-only and all-biosamples variants as subtracks.
+
+##############################################################################
+# Step 5: Generate bigComposite (faceted) individual experiment tracks
+##############################################################################
+
+# The three individual experiment composites use the new bigComposite/faceted
+# format (refs #36320). This was a two-step process:
+#
+# Step 5a: Generate traditional composites from hub
+#   generate_composites.py parses hub.txt and creates compositeTrack-on-style
+#   .ra files with subGroups, views, dimensions, etc.
+#
+# Step 5b: Convert to bigComposite faceted format
+#   convert_to_bigcomposite.py reads those .ra files, strips subGroups/views,
+#   adds metaDataUrl + primaryKey, generates metadata TSVs, and uses S3 URLs
+#   from the validation mapping (Step 3).
+
+# First regenerate the traditional composites, if needed, as the intermediate input:
+cd /hive/users/lrnassar/claude/RM34923
+python3 kent/src/hg/makeDb/scripts/encode4regulation/generate_composites.py
+
+# Then convert to bigComposite format:
+python3 kent/src/hg/makeDb/scripts/encode4regulation/convert_to_bigcomposite.py
+
+# This overwrites the .ra files in place and creates metadata TSVs:
+#   /gbdb/hg38/encode4/regulation/wgEncodeReg4Epigenetics_metadata.tsv
+#   /gbdb/hg38/encode4/regulation/wgEncodeReg4RnaSeq_metadata.tsv
+#   /gbdb/hg38/encode4/regulation/wgEncodeReg4TfChip_metadata.tsv
+#
+# Output .ra files (in kent/src/hg/makeDb/trackDb/human/hg38/):
+#   wgEncodeReg4Epigenetics.ra — 3,199 subtracks, facets: Assay, Organ,
+#                                 Biosample Type, Life Stage
+#   wgEncodeReg4RnaSeq.ra      — 1,046 subtracks, facets: Organ,
+#                                 Biosample Type, Life Stage, Strand
+#   wgEncodeReg4TfChip.ra      — 2,502 subtracks, facets: TF, Organ,
+#                                 Biosample Type, Life Stage
+#
+# The "Biosample" column is prefixed with an underscore (_Biosample) to hide
+# it from the faceted UI while retaining it as metadata (per Jonathan Casper's
+# recommendation, refs #36320).
+#
+# Default-ON subtracks (5 per composite):
+#   Epigenetics: untreated K562, one per assay (DNase, ATAC, H3K4me3, H3K27ac, CTCF)
+#   RNA-seq: K562 +/- strand, GM12878 +/- strand, HepG2 + strand
+#   TF ChIP: K562 CTCF, POLR2A, MYC, MAX, EP300
+#
+# All subtracks use S3 URLs (encode-public.s3.amazonaws.com) for bigDataUrl.
+
+##############################################################################
+# Step 6: Create HTML description pages
+##############################################################################
+
+# 11 HTML files were created in kent/src/hg/makeDb/trackDb/human/hg38/:
+#   wgEncodeReg4.html                — SuperTrack overview
+#   wgEncodeReg4MarkH3k27ac.html     — H3K27ac layered signal
+#   wgEncodeReg4Dnase.html           — DNase layered signal
+#   wgEncodeReg4Atac.html            — ATAC layered signal
+#   wgEncodeReg4MarkH3k4me3.html     — H3K4me3 layered signal
+#   wgEncodeReg4MarkCtcf.html        — CTCF layered signal
+#   wgEncodeReg4Txn.html             — Transcription layered signal
+#   wgEncodeReg4TfPeaks.html         — TF rPeaks
+#   wgEncodeReg4Epigenetics.html     — Individual epigenetics composite
+#   wgEncodeReg4RnaSeq.html          — Individual RNA-seq composite
+#   wgEncodeReg4TfChip.html          — Individual TF ChIP composite
+
+##############################################################################
+# Step 7: Add related tracks and trackDb include
+##############################################################################
+
+# Added to kent/src/hg/makeDb/trackDb/human/hg38/trackDb.ra:
+#   include wgEncodeReg4.ra alpha
+
+# Added reciprocal entries to relatedTracks.ra:
+#   hg38 wgEncodeReg4 wgEncodeReg ENCODE4 update of ENCODE3 Regulation
+#   hg38 wgEncodeReg wgEncodeReg4 ENCODE4 update of ENCODE3 Regulation
+#   hg38 wgEncodeReg4 cCREs Related ENCODE4 cCRE annotations
+#   hg38 cCREs wgEncodeReg4 Related ENCODE4 regulation data
+
+##############################################################################
+# Step 8: Release tags (ENCODE3 transition)
+##############################################################################
+
+# On alpha (dev): ENCODE4 visible, ENCODE3 hidden with a snowflake pennant icon
+# On beta/public: ENCODE3 visible as-is, ENCODE4 not visible
+#
+# Approach: Duplicate wgEncodeReg.ra into wgEncodeReg.alpha.ra with:
+#   - superTrack on hide (hidden by default)
+#   - pennantIcon snowflake.png (deprecation notice)
+#   - Inner includes tagged with release alpha
+# The original wgEncodeReg.ra gets inner includes tagged beta,public.
+#
+# trackDb.ra includes:
+#   include wgEncodeReg.ra beta,public
+#   include wgEncodeReg.alpha.ra alpha
+#   include wgEncodeReg4.ra alpha
+
+##############################################################################
+# Step 9: Load trackDb
+##############################################################################
+
+cd /cluster/home/lrnassar/kent/src/hg/makeDb/trackDb
+
+# Sandbox (personal testing):
+make DBS=hg38
+
+# Dev (hgwdev):
+make alpha DBS=hg38
+
+# 20,658 track descriptions loaded
+
+##############################################################################
+# Track hierarchy summary
+##############################################################################
+
+# wgEncodeReg4 (superTrack, priority 1.5, group=regulation)
+# ├── wgEncodeReg4MarkH3k27ac (multiWig, full, 5 organs ON)
+# ├── wgEncodeReg4Dnase       (multiWig, hide)
+# ├── wgEncodeReg4Atac        (multiWig, hide)
+# ├── wgEncodeReg4MarkH3k4me3 (multiWig, hide)
+# ├── wgEncodeReg4MarkCtcf    (multiWig, hide)
+# ├── wgEncodeReg4Txn         (multiWig, hide)
+# ├── wgEncodeReg4TfPeaks     (bigBed 12+ with decorator, hide)
+# ├── wgEncodeReg4Epigenetics (bigComposite faceted, 3,199 subtracks, 5 ON)
+# ├── wgEncodeReg4RnaSeq      (bigComposite faceted, 1,046 subtracks, 5 ON)
+# └── wgEncodeReg4TfChip      (bigComposite faceted, 2,502 subtracks, 5 ON)
+
+# gbdb contents (/gbdb/hg38/encode4/regulation/):
+#   organAve/                          — 265 multiWig symlinks
+#   tfRpeak/                           — 2 TF rPeak symlinks
+#   wgEncodeReg4Epigenetics_metadata.tsv
+#   wgEncodeReg4RnaSeq_metadata.tsv
+#   wgEncodeReg4TfChip_metadata.tsv
+
+##############################################################################
+# Known upstream hub issues (reported to Weng lab)
+##############################################################################
+
+# Report at:
+#   https://hgwdev.gi.ucsc.edu/~lrnassar/temp/encode4_regulation_hub_issues.md
+#
+# 1. Duplicate bigDataUrl — tissue-only and all-biosamples variants share
+#    the same file for 5 non-RNA assays (83 pairs, 166 subtracks)
+# 2. tp. prefix inconsistency — RNA correctly uses tp. prefix; other assays
+#    don't, causing issue 1
+# 3. "paraythroid" typo — should be "parathyroid"; affects 3 assays,
+#    3 filenames, 21 hub lines
+# 4. Bad track name: ATAC_ENCFF128Muscle (should be ATAC_ENCFF128OID) —
+#    handled by convert_to_bigcomposite.py fallback logic