af291f82582af89f7c623eb4d13b8141d28e9862 lrnassar Mon Apr 13 10:19:16 2026 -0700 Update ENCODE4 makedocs with defaultSortField step. refs #34923 diff --git src/hg/makeDb/doc/hg38/encode4.regulation.txt src/hg/makeDb/doc/hg38/encode4.regulation.txt index 0d43025038c..f7eb3e10839 100644 --- src/hg/makeDb/doc/hg38/encode4.regulation.txt +++ src/hg/makeDb/doc/hg38/encode4.regulation.txt @@ -1,331 +1,348 @@ # ENCODE4 Integrated Regulation Track (wgEncodeReg4) for hg38 # Redmine #34923 # Lou Nassar, 2026-03-12 (updated 2026-04-10) # This track converts the ENCODE V4 Regulation hub into a native UCSC browser # supertrack containing organ-averaged multiWig signal tracks, TF rPeak clusters, # and individual experiment composites (bigComposite/faceted format) for # epigenetics, RNA-seq, and TF ChIP-seq. # The original hub was prepared by Mingshi Gao (Weng lab, UMass Chan Medical School): # https://users.wenglab.org/gaomingshi/ENCODE_Reg/hub.txt # It was cloned locally for processing: # /hive/data/outside/encode4/ccre/ENCODE_V4_Regulation/hub.txt # Scripts are located at: # kent/src/hg/makeDb/scripts/encode4regulation/ ############################################################################## # Step 1: Clone the hub data locally ############################################################################## # The hub was cloned with hubClone -download: mkdir -p /hive/data/outside/encode4/ccre cd /hive/data/outside/encode4/ccre hubClone -download https://users.wenglab.org/gaomingshi/ENCODE_Reg/hub.txt \ ENCODE_V4_Regulation # Total data: ~5.6 TB across ~7,000 files (bigWig + bigBed) ############################################################################## # Step 2: Create gbdb symlinks for organ-averaged multiWig files ############################################################################## # Only organ-averaged multiWig files and TF rPeak files need gbdb symlinks. # Individual experiment tracks use S3 URLs directly. mkdir -p /gbdb/hg38/encode4/regulation/organAve cd /hive/data/outside/encode4/ccre/ENCODE_V4_Regulation # Symlinks were created using rename_symlinks.py which maps hub filenames # to UCSC convention (camelCase, .bw extension): # ave.adipose.H3K27ac.bigWig -> adiposeH3K27ac.bw (all-biosamples) # tp.blood.H3K27ac.bigWig -> tpBloodH3K27ac.bw (tissue-only) # adrenal_gland.minus.bigWig -> adrenalGlandMinus.bw (RNA strand) # tp.blood.plus.bigWig -> tpBloodPlus.bw (RNA tp strand) # TF rPeak clusters (from existing ENCODEv4TFrPeaks bed) mkdir -p /gbdb/hg38/encode4/regulation/tfRpeak ln -s /hive/data/genomes/hg38/bed/ENCODEv4TFrPeaks/no_trim.TF_name.rPeaks.bb \ /gbdb/hg38/encode4/regulation/tfRpeak/TFrPeakClusters.bb ln -s /hive/data/genomes/hg38/bed/ENCODEv4TFrPeaks/no_trim.TF_name.decorator.bb \ /gbdb/hg38/encode4/regulation/tfRpeak/TFrPeakClustersDecorator.bb # Symlink summary: # organAve/ 363 symlinks (113 tissue-only + 250 all-biosamples/RNA) # tfRpeak/ 2 symlinks # Total: 365 symlinks + 5 metadata/JSON files + 1 nose DNase = 371 files ############################################################################## # Step 2b: Download tissue-only (tp) organ-averaged files (2026-04-08) ############################################################################## # The initial hub had a bug where tissue-only and all-biosamples variants pointed # to the same file. The Weng lab fixed this in April 2026 with proper separate # tp.{organ}.{Assay}.bigWig files. These were downloaded and symlinked: cd /hive/users/lrnassar/claude/RM34923 bash download_tp_files.sh # downloads 83 tp bigWig files (~733 GB) bash create_tp_symlinks.sh # creates 83 symlinks: tp{Organ}{Assay}.bw # Also downloaded 1 new file: ave.nose.DNase.bigWig (nose organ added to hub) # Removed 1 orphan: lymphoidTissueDNase.bw (hub removed from all-bio group) # Renamed 2 old symlinks: boneMarrowTpH3K27ac.bw -> tpBoneMarrowH3K27ac.bw # boneMarrowTpH3K4me3.bw -> tpBoneMarrowH3K4me3.bw ############################################################################## # Step 3: Validate local files against ENCODE portal ############################################################################## # Run the validation script to verify md5sums of local files against the # ENCODE REST API and generate S3 URL mapping for bigComposite tracks. cd /hive/users/lrnassar/claude/RM34923 python3 kent/src/hg/makeDb/scripts/encode4regulation/validate_encode_urls.py # This produces: # encode4_url_mapping.tsv — maps accessions to S3 URLs with md5 validation # encode4_validation.log — detailed log of any mismatches # All S3 URLs verified accessible via bigWigInfo/bigBedInfo (0 failures). ############################################################################## # Step 4: Generate multiWig trackDb stanzas ############################################################################## # The multiWig organ-averaged tracks (H3K27ac, DNase, ATAC, H3K4me3, CTCF, Txn) # are generated from the hub.txt and assembled into the main wgEncodeReg4.ra. cd /hive/users/lrnassar/claude/RM34923 python3 kent/src/hg/makeDb/scripts/encode4regulation/generate_multiwig_ra.py \ > multiwig_output.ra # The output was manually integrated into wgEncodeReg4.ra along with: # - SuperTrack definition (priority 0.9, group=regulation) # - TF rPeak track (bigBed 12+ with decorator, 911-factor filterValues) # - Include directives for the 3 bigComposite files # The main file is hand-maintained: # kent/src/hg/makeDb/trackDb/human/hg38/wgEncodeReg4.ra (~3,461 lines) # multiWig track details: # wgEncodeReg4MarkH3k27ac — full visibility (priority 1.4) # wgEncodeReg4Dnase — hidden (priority 1.1) # wgEncodeReg4Atac — hidden (priority 1.2) # wgEncodeReg4MarkH3k4me3 — hidden (priority 1.3) # wgEncodeReg4MarkCtcf — hidden (priority 1.5) # wgEncodeReg4Txn — hidden (priority 1.6) # wgEncodeReg4TfPeaks — hidden (priority 1.7) # Each multiWig has tissue-only subtracks first, then all-biosamples subtracks. # Tissue-only subtracks use tp{Organ}{Assay}.bw gbdb symlinks pointing to # tp.{organ}.{Assay}.bigWig files (genuine tissue/primary cell averages). # All-biosamples subtracks use {organ}{Assay}.bw symlinks pointing to # ave.{organ}.{Assay}.bigWig files (averages including cell lines). # # Subtrack counts per assay (tissue-only + all-biosamples): # H3K27ac: 17 + 38 = 55 # DNase: 24 + 40 = 64 # ATAC: 8 + 26 = 34 # H3K4me3: 19 + 38 = 57 # CTCF: 15 + 36 = 51 # RNA: 30 + 72 = 102 (strand pairs: Plus/Minus) ############################################################################## # Step 5: Generate bigComposite (faceted) individual experiment tracks ############################################################################## # The three individual experiment composites use the new bigComposite/faceted # format (refs #36320). This was a two-step process: # # Step 5a: Generate traditional composites from hub # generate_composites.py parses hub.txt and creates compositeTrack-on-style # .ra files with subGroups, views, dimensions, etc. # # Step 5b: Convert to bigComposite faceted format # convert_to_bigcomposite.py reads those .ra files, strips subGroups/views, # adds metaDataUrl + primaryKey, generates metadata TSVs, and uses S3 URLs # from the validation mapping (Step 3). cd /hive/users/lrnassar/claude/RM34923 python3 kent/src/hg/makeDb/scripts/encode4regulation/generate_composites.py python3 kent/src/hg/makeDb/scripts/encode4regulation/convert_to_bigcomposite.py # This overwrites the .ra files in place and creates metadata TSVs: # /gbdb/hg38/encode4/regulation/wgEncodeReg4Epigenetics_metadata.tsv # /gbdb/hg38/encode4/regulation/wgEncodeReg4RnaSeq_metadata.tsv # /gbdb/hg38/encode4/regulation/wgEncodeReg4TfChip_metadata.tsv # # Output .ra files (in kent/src/hg/makeDb/trackDb/human/hg38/): # wgEncodeReg4Epigenetics.ra — 6,353 subtracks (3,199 signal + 3,154 peak) # Facets: Assay, Organ, Biosample Type, Life Stage, Data Type # wgEncodeReg4RnaSeq.ra — 1,046 subtracks # Facets: Organ, Biosample Type, Life Stage, Strand # wgEncodeReg4TfChip.ra — 4,964 subtracks (2,502 peak + 2,462 signal) # Facets: TF, Organ, Biosample Type, Life Stage, Data Type # # Faceted UI features: # - _Biosample column hidden from facets via underscore prefix (refs #36320) # - All facet values capitalized (e.g., "Cell line", "Adult") # - longLabels cleaned: underscores replaced with spaces for readability # - colorSettingsUrl for facet color indicators: # Epigenetics: colored by Assay (epi_colors.json) # RnaSeq: colored by Organ (organ_colors.json) # TfChip: no facet colors (uses score-based spectrum coloring) # # Default-ON subtracks (tissue samples per Weng lab request): # Epigenetics (28): brain/heart/liver × 5 assays × signal+peak # Brain: middle frontal area 46, mild cognitive impairment, male 89y (4 assays, no ATAC) # Heart: right atrium auricular region, female 51y (5 assays) # Liver: right lobe, female 53y (5 assays) # RNA-seq (6): brain/heart/liver × Plus+Minus strand # Brain: dorsolateral prefrontal cortex, female 75y # Heart: right atrium auricular region, female 51y # Liver: right lobe, female 53y # TF ChIP (8): brain/heart/liver tissue CTCF + liver POLR2A # # All subtracks use S3 URLs (encode-public.s3.amazonaws.com) for bigDataUrl. ############################################################################## # Step 5c: Track color adjustments ############################################################################## # Organ colors follow the Weng lab canonical color mapping: # https://wiki.wenglab.org/references/color-mappings/ # 18 colors were darkened for contrast against white browser background while # preserving hue (canonical colors designed for dark portal background). # 1 organ previously using default gray was given its canonical color: # urinary bladder: 194,33,39 # Color mapping wiki linked in Display Conventions of all multiWig HTML pages. ############################################################################## # Step 5d: Biosample column cleanup (2026-04-08) ############################################################################## # The _Biosample column in the Epigenetics metadata TSV had redundant # assay+type suffixes on peak entries (e.g., "Homo sapiens A549 ATAC peak"). # This information is already in the Assay and Data Type columns. # Removed suffixes from 3,154 entries using: cd /hive/users/lrnassar/claude/RM34923 python3 -c " import re with open('/gbdb/hg38/encode4/regulation/wgEncodeReg4Epigenetics_metadata.tsv') as f: lines = f.readlines() out = [lines[0]] for line in lines[1:]: cols = line.rstrip('\n').split('\t') if len(cols) >= 6: cols[5] = re.sub(r' (ATAC|CTCF|DNase|H3K27ac|H3K4me3) peak$', '', cols[5]) out.append('\t'.join(cols) + '\n') with open('/gbdb/hg38/encode4/regulation/wgEncodeReg4Epigenetics_metadata.tsv', 'w') as f: f.writelines(out) " ############################################################################## # Step 6: Create HTML description pages ############################################################################## # 11 HTML files in kent/src/hg/makeDb/trackDb/human/hg38/: # wgEncodeReg4.html — SuperTrack overview # wgEncodeReg4MarkH3k27ac.html — H3K27ac layered signal # wgEncodeReg4Dnase.html — DNase layered signal # wgEncodeReg4Atac.html — ATAC layered signal # wgEncodeReg4MarkH3k4me3.html — H3K4me3 layered signal # wgEncodeReg4MarkCtcf.html — CTCF layered signal # wgEncodeReg4Txn.html — Transcription layered signal # wgEncodeReg4TfPeaks.html — TF rPeaks # wgEncodeReg4Epigenetics.html — Individual epigenetics composite # wgEncodeReg4RnaSeq.html — Individual RNA-seq composite # wgEncodeReg4TfChip.html — Individual TF ChIP composite # # Each layered track HTML includes an organ/tissue availability table. # All HTMLs include Data Access sections with bigWigToWig/bigBedToBed examples. # Production lab credits are assay-specific, matching the upstream hub. # ENCODE color mapping wiki linked in Display Conventions sections. ############################################################################## # Step 7: Add related tracks, trackDb include, ENCODE3 rename ############################################################################## # Added to kent/src/hg/makeDb/trackDb/human/hg38/trackDb.ra: # include wgEncodeReg4.ra alpha # Added reciprocal entries to relatedTracks.ra: # hg38 wgEncodeReg4 wgEncodeReg Previous ENCODE3 Regulation track # hg38 wgEncodeReg wgEncodeReg4 New ENCODE4 Regulation track # hg38 wgEncodeReg4 cCREs Related ENCODE4 cCRE annotations # hg38 cCREs wgEncodeReg4 Related ENCODE4 regulation data # Note: track1's "why" text describes track2 (the link destination). # ENCODE3 renamed to "ENCODE3 Regulation" via alpha release tags: # wgEncodeReg.alpha.ra: shortLabel "ENCODE3 Regulation", snowflake pennant ############################################################################## # Step 8: Release tags (ENCODE3 transition) ############################################################################## # On alpha (dev): ENCODE4 visible, ENCODE3 hidden with snowflake # On beta/public: ENCODE3 visible as-is, ENCODE4 not visible # # Approach: Duplicate wgEncodeReg.ra into wgEncodeReg.alpha.ra with: # - superTrack on hide (hidden by default) # - shortLabel "ENCODE3 Regulation" # - pennantIcon snowflake.png (deprecation notice) # - Inner includes tagged with release alpha # The original wgEncodeReg.ra gets inner includes tagged beta,public. # Key: inner include directives must also carry release tags. # # trackDb.ra includes: # include wgEncodeReg.ra beta,public # include wgEncodeReg.alpha.ra alpha # include wgEncodeReg4.ra alpha ############################################################################## # Step 9: Load trackDb ############################################################################## cd /cluster/home/lrnassar/kent/src/hg/makeDb/trackDb # Sandbox (personal testing): make DBS=hg38 # Dev (hgwdev): make alpha DBS=hg38 ############################################################################## # Step 10: Cleanup — delete local ENCFF files (now S3-served) ############################################################################## # Individual experiment files (ENCFF*.bigWig, ENCFF*.bigBed) and ?proxy=TRUE # download artifacts were deleted from the source hub directory, since they # are now served via S3 URLs. Only the organ-averaged multiWig files # (referenced by gbdb symlinks) and hub.txt/HTML docs were preserved. # # hg38: Deleted 7,598 files (3.6 TB freed) ############################################################################## -# Track hierarchy summary +# Step 10: Default sort by experiment to group peak/signal pairs (2026-04-13) +############################################################################## + +# The Epigenetics and TF ChIP bigComposite tracks each contain both peak and +# signal subtracks for the same experiments. To group these pairs adjacent in +# the configure page, added an _Experiment column to the metadata TSVs +# containing the ENCSR experiment accession (extracted from shortLabel in the +# .ra stanzas). The underscore prefix hides it from facet filters. +# Added defaultSortField _Experiment to the 2 composite stanzas. +# Also fixed pre-existing CRLF line endings in the Epigenetics metadata TSV. +# +# Composites updated: +# wgEncodeReg4Epigenetics: 3,152 peak/signal pairs + 49 singletons +# wgEncodeReg4TfChip: 2,461 peak/signal pairs + 42 singletons +# RNA-seq not changed (signal only, no peak/signal pairing). + +############################################################################## +# Track hierarchy and disk usage summary ############################################################################## # Regulation group priority order: cCREs (0.8) > wgEncodeReg4 (0.9) > wgEncodeReg (1) # # wgEncodeReg4 (superTrack, priority 0.9, group=regulation) # ├── wgEncodeReg4MarkH3k27ac (multiWig, full) priority 1.4 # ├── wgEncodeReg4Dnase (multiWig, hide) priority 1.1 # ├── wgEncodeReg4Atac (multiWig, hide) priority 1.2 # ├── wgEncodeReg4MarkH3k4me3 (multiWig, hide) priority 1.3 # ├── wgEncodeReg4MarkCtcf (multiWig, hide) priority 1.5 # ├── wgEncodeReg4Txn (multiWig, hide) priority 1.6 # ├── wgEncodeReg4TfPeaks (bigBed 12+ with decorator, hide) priority 1.7 # ├── wgEncodeReg4Epigenetics (bigComposite faceted, 6,353, 28 ON) priority 2.0 # ├── wgEncodeReg4RnaSeq (bigComposite faceted, 1,046, 6 ON) priority 2.1 # └── wgEncodeReg4TfChip (bigComposite faceted, 4,964, 8 ON) priority 2.2 # Disk usage (/gbdb/hg38/encode4/regulation/): # organAve tp (tissue-only): 1.8 TB (113 files) # organAve ave (all-bio + RNA): 2.6 TB (250 files) # tfRpeak: 3 GB (2 files) # metadata + JSON: 3 MB (5 files) # nose DNase: 680 MB (1 file) # Total: 4.4 TB (371 files) # File list: /hive/users/lrnassar/claude/RM34923/gbdb_file_list.txt