17b7d3c37be41135afaf8e91e365e3847af96ca5
lrnassar
  Mon Jun 22 10:56:56 2026 -0700
Add TAD (topologically associating domains) track set on hg19, hg38, mm10, mm39. refs #21599

New "tads" superTrack collecting published TAD calls, alpha-gated via "include tad.ra alpha"
in each assembly's trackDb.ra.

hg38 (all five sources): Dixon 2012 domains, Schmitt 2016 boundaries, McArthur & Capra
2021 boundary stability, ENCODE contact domains (faceted composite over 117 biosamples),
and 3D Genome Browser 2.0 domains (faceted composite over 464 datasets).
hg19: the three sources with hg19-compatible data (Dixon, Schmitt, McArthur).
mm10/mm39 (domains only; the boundary sources have no mouse data): Dixon, ENCODE
(faceted, 16 biosamples), and 3D Genome Browser (faceted, 30 datasets); mm39 lifted
from mm10, lift noted in the long labels.

Faceted composites are organ-colored from a TAD-owned organ_colors.json symlinked into
/gbdb/<asm>/bbi/tad/. Build scripts and autoSql are version-controlled under
makeDb/scripts/tad/ and symlinked into the per-source build dirs. Provenance and fetch
for every dataset are documented in the makedocs (doc/hg38/tad.txt, doc/mm10/tad.txt,
doc/mm39/tad.txt, and the hg19 TAD section in doc/hg19.txt).

diff --git src/hg/makeDb/doc/hg38/tad.txt src/hg/makeDb/doc/hg38/tad.txt
new file mode 100644
index 00000000000..77e586ac84b
--- /dev/null
+++ src/hg/makeDb/doc/hg38/tad.txt
@@ -0,0 +1,211 @@
+# TADs supertrack (tads) - topologically associating domains and boundaries
+# Redmine #21599
+
+##############################################################################
+# 3D Genome Browser TAD domains - faceted composite (tads3dgb)  2026-06-21 (lou)
+##############################################################################
+
+# Source: 3D Genome Browser 2.0 (Yu et al. 2026, NAR 54:D48-D54, PMID 41206958),
+# http://3dgenome.fsm.northwestern.edu/ . CC BY-NC.
+#
+# DATA PROVENANCE + FETCH (reproducible from hgwdev; the download endpoint is reachable here):
+#   1. Dataset catalog: the full 3DGB curated dataset list (716 records: id, name, species,
+#      assembly, organ, cellType, dataType, resolutions, year, doi, refNo, ...) is preserved at
+#      /hive/users/lrnassar/claude/RM21599/3dgenome/datasets_api.json (obtained from the 3DGB API).
+#   2. Selection: human hg38 Hi-C/Micro-C datasets ->
+#        python3 -c "import json;[print(r['id']) for r in json.load(open('datasets_api.json')) \
+#          if r['species']=='Human' and r['assembly']=='hg38' and r['dataType'] in ('Hi-C','Micro-C')]"
+#      yields 465 candidate ids; 464 returned a TAD bed (1 dataset ships no *_tad.bed) -> tad_beds/.
+#   3. Per-dataset download (the verified, load-bearing fetch):
+#        curl -L "http://3dgenome.fsm.northwestern.edu/api/data/download?dataset_ids=<id>" -o <id>.zip
+#      Each zip holds 3 files: <name>.bedpe (loops), <name>_cis_pc1.bw (compartment PC1), and
+#      <name>_tad.bed (the TAD domains). We extract ONLY <name>_tad.bed into
+#      /hive/users/lrnassar/claude/RM21599/3dgenome/tad_beds/ ; the .bedpe/.bw are not used.
+#   4. classification.tsv (3dgenome/classification.tsv: normal/cancer, baseline/perturbation,
+#      already-in-UCSC have_novel flag, year) is project-derived from the catalog + manual curation.
+# The 464 *_tad.bed are the build input below.
+
+# All 464 datasets are native hg38 TAD domains - no liftOver, re-calling, merging,
+# or recurrence scoring is performed. The only transform is a format normalization:
+# the raw 10-column 3DGB bed (placeholder cols + an alternating two-color itemRgb
+# with no biological meaning) is reshaped to plain bed4 (chrom,start,end,datasetName).
+
+# Subtracks are colored by organ via a TAD-owned color map,
+# /hive/data/outside/tad/organ_colors.json, symlinked into /gbdb/<asm>/bbi/tad/organ_colors.json
+# for each assembly (hg38, mm10, mm39). It started as a copy of the wgEncodeReg4 map
+# (/gbdb/hg38/encode4/regulation/organ_colors.json) so colors stay consistent across tracks, but
+# is kept separate so TAD-specific additions never touch the shared file. Three organ keys used by
+# 3DGB were added to the TAD copy (additive only; colors chosen for contrast against the white
+# background):
+#   Colon  #565624   (= the existing "Large intestine" color)
+#   Bladder #A91D22  (GTEx/ENCODE color; better contrast than grey "Urinary bladder")
+#   Cervix #BA6FA5   (= the existing "Uterus" color; no dedicated Cervix key existed)
+# Five 3DGB organ labels differ from existing JSON keys only in letter case (Blood Vessel,
+# Lymphoid Tissue, Small Intestine, Adrenal Gland, Connective Tissue); the build script
+# normalizes them to the existing keys (Blood vessel, etc.).
+
+# The build script converts all 464 to bigBed, writes the faceted metadata TSV
+# (primaryKey = the unique 3DGB integer dataset id; facets auto-derive from the
+# remaining non-underscore columns), and generates the faceted composite stanza
+# (one subtrack per dataset; each carries an explicit per-subtrack color resolved
+# from the organ JSON). Reproducible; rerunnable.
+
+# Build scripts and autoSql are version-controlled at ~/kent/src/hg/makeDb/scripts/tad/ and
+# symlinked into the per-source build dirs, so the commands below run the in-tree copies.
+cd /hive/data/outside/tad/3dgenome/build
+python3 buildTads3dgb.py
+# -> hg38/tads3dgb/<id>.bb  (464 bigBeds, bed4)
+# -> hg38/tads3dgb_metadata.tsv  (465 rows; cols DatasetId Organ Cell_type Assay
+#    Condition Treatment Provenance Year Study _Description)
+# -> hg38/tads3dgb.ra  (faceted composite header + 464 subtrack stanzas)
+
+# Symlink bigBeds (directory symlink) and the metadata TSV into /gbdb
+cd /gbdb/hg38/bbi/tad
+ln -sfn /hive/data/outside/tad/3dgenome/build/hg38/tads3dgb tads3dgb
+ln -sfn /hive/data/outside/tad/3dgenome/build/hg38/tads3dgb_metadata.tsv tads3dgb_metadata.tsv
+
+# trackDb: generated stanza copied into the trackDb dir, included from tad.ra
+#   ~/kent/src/hg/makeDb/trackDb/human/hg38/tads3dgb.ra   (composite + 464 subtracks)
+#   ~/kent/src/hg/makeDb/trackDb/human/hg38/tads3dgb.html (description page)
+#   tad.ra ends with:  include tads3dgb.ra
+#   tad.ra itself is gated alpha-only from trackDb.ra:  include tad.ra alpha
+cp /hive/data/outside/tad/3dgenome/build/hg38/tads3dgb.ra \
+   ~/kent/src/hg/makeDb/trackDb/human/hg38/tads3dgb.ra
+
+# Load trackDb (new/untracked .ra + .html require FIND=find to deploy)
+cd ~/kent/src/hg/makeDb/trackDb
+make DBS=hg38 FIND=find
+
+##############################################################################
+# ENCODE contact domains - faceted composite rebuild (tadsEncode)  2026-06-21 (lou)
+##############################################################################
+
+# Replaces the old 14-subtrack subGroup composite with a faceted composite over ALL
+# 117 single-biosample ENCODE contact-domain datasets on hg38. Source: ENCODE portal
+# (https://www.encodeproject.org/), Hi-C pipeline (Aiden lab / Juicer; domains by Arrowhead).
+#
+# DATA PROVENANCE + FETCH (reproducible; ENCODE is open, no login):
+#   1. Manifest of all released GRCh38 contact-domain files (602) via the ENCODE search API:
+#        https://www.encodeproject.org/search/?type=File&output_type=contact+domains&assembly=GRCh38&format=json&limit=all
+#      saved as source/_cd_select.json (each record: accession, file_format, dataset[experiment],
+#      biosample_ontology, assay_title, preferred_default, href, file_size).
+#   2. Per-file download (href from the manifest):
+#        curl -L "https://www.encodeproject.org/files/<ENCFF>/@@download/<ENCFF>.bedpe.gz" -o <ENCFF>.bedpe.gz
+#      (11 of the 602 are .bed.gz lifted-from-hg19 files.) All 602 are in
+#      /hive/users/lrnassar/claude/RM21599/encode/contact_domains/ .
+#   3. Per-experiment facet metadata + perturbed flag via the ENCODE search API (one batched call
+#      over the chosen experiment accessions), saved as source/encode_meta_all.json and
+#      source/encode_perturbed.json (fields: biosample_ontology.organ_slims, .classification,
+#      replicates.library.biosample.life_stage, biosample_summary, perturbed).
+
+# Selection: exclude the 45 multi-biosample (list biosample_ontology) and 11 aggregate-series
+# (null biosample_ontology) records -> 117 distinct single biosamples. Per biosample, pick one
+# representative experiment, preferring the canonical 16-column Juicer/Arrowhead "blocks" bedpe
+# (full 5-score format), then an untreated baseline experiment over an ENCODE-flagged perturbed
+# one (encode_perturbed.json), then preferred_default, then assay priority (intact > in situ),
+# then total file size. 111 biosamples have 16-col Arrowhead files; A549 has only a 12-col bedpe
+# (cornerScore only); 5 biosamples (LNCaP clone FGC, NCI-H460, RPMI7951, SJCRH30, SK-N-MC) have
+# only hg19-lifted 5-col bed (no Arrowhead scores, flagged "Lifted from hg19" in the Calls facet).
+# 24-col loop files are excluded (tier 0). (The ENCODE "perturbed" flag is used only as a
+# selection tiebreaker -- it is unreliable for display, so no Treatment facet is shown.)
+
+# Conversion + pooling (buildTadsEncode.py): for the chosen experiment, pool its files of the
+# best tier; keep only domain rows (both bedpe anchors identical: chr1==chr2, x1==y1, x2==y2 --
+# this drops loop files), normalize chrom to "chr*" and validate against chrom.sizes; map
+# bedpe cols to bed4+5 (chrom, x1, x2, biosample, cornerScore=col12, uVarScore=13, lVarScore=14,
+# upSign=15, loSign=16); merge replicate domains whose endpoints both fall within one 5 kb bin,
+# keeping the higher cornerScore (scores taken verbatim, never blended -- verified against the
+# retired tadsEncodeGM12878.bb). autoSql = /hive/data/outside/tad/tadDomainEncode.as.
+# Facet metadata (organ_slims, classification, life_stage) fetched once from the ENCODE REST API
+# for all 228 candidate experiments -> source/encode_meta_all.json; perturbed flag ->
+# source/encode_perturbed.json. organ_slims (lowercase, multi-valued) mapped to organ_colors.json
+# keys and reduced to one organ per biosample by an anatomical-specificity priority (Blood vessel
+# and Muscle rank above Limb/Placenta so HUVEC->Blood vessel, tibial artery->Blood vessel,
+# gastrocnemius->Muscle); 4 biosamples with empty organ_slims assigned by known biology.
+# primaryKey = the chosen experiment Accession (ENCSR), linked to the ENCODE portal via
+# subtrackUrls (like wgEncodeReg4); the readable biosample name (_Biosample) and full ENCODE
+# summary (_Description) are shown in the metadata table but not faceted. shortLabels are
+# word-boundary summaries (no mid-word truncation; CD4-positive->CD4+ etc.); bigBed filenames
+# keep the readable biosample symbol. (A549 organ "musculature of body"->Muscle and similar
+# follow ENCODE's organ_slim verbatim; organ reflects ENCODE's annotation, not a disease-origin call.)
+
+cd /hive/data/outside/tad/encode/build
+python3 buildTadsEncode.py
+# -> hg38/tadsEncode/<symbol>.bb  (117 bigBeds, bigBed 4+5)
+# -> hg38/tadsEncode_metadata.tsv (118 rows; cols Biosample Organ Biosample_type Assay
+#    Life_stage Calls _Biosample)
+# -> hg38/tadsEncode.ra  (faceted composite header + 117 subtrack stanzas)
+
+# Symlink bigBeds (directory symlink) and metadata TSV into /gbdb
+cd /gbdb/hg38/bbi/tad
+ln -sfn /hive/data/outside/tad/encode/build/hg38/tadsEncode tadsEncode
+ln -sfn /hive/data/outside/tad/encode/build/hg38/tadsEncode_metadata.tsv tadsEncode_metadata.tsv
+
+# trackDb: the old inline tadsEncode composite in tad.ra was replaced by:  include tadsEncode.ra
+#   ~/kent/src/hg/makeDb/trackDb/human/hg38/tadsEncode.ra   (composite + 117 subtracks)
+#   ~/kent/src/hg/makeDb/trackDb/human/hg38/tadsEncode.html (description page)
+cp /hive/data/outside/tad/encode/build/hg38/tadsEncode.ra \
+   ~/kent/src/hg/makeDb/trackDb/human/hg38/tadsEncode.ra
+
+cd ~/kent/src/hg/makeDb/trackDb
+make DBS=hg38 FIND=find
+
+##############################################################################
+# Dixon 2012 TAD domains (tadsDixon)  built 2026-06-14, documented 2026-06-21 (lou)
+##############################################################################
+# Source: Dixon JR et al. 2012, Nature 485:376, PMID 22495300, doi:10.1038/nature11082.
+# GEO GSE35156. Freely downloadable, no login. The original host (promoter.bx.psu.edu /
+# chromosome.sdsc.edu) is dead; the authoritative copy is the Nature supplemental tables.
+#
+# DATA PROVENANCE + FETCH:
+#   1. Nature Supplementary Table S3 (Domains) = supplemental file MOESM330, downloaded as
+#      /hive/users/lrnassar/claude/RM21599/dixon2012/41586_2012_BFnature11082_MOESM330_ESM.xls
+#      (the full 9-file supplement MOESM329-337 is staged in that dir). Read with python xlrd.
+#   2. Human sheets = "hESC Combined" (3,127 domains) and "IMR90 Combined" (2,348), plain BED3,
+#      40 kb-binned, assembly hg18 (verified by chrom-length test). Other sheets are mouse (mm9)
+#      or replicates -> not used. Extracted to source/dixon_{hESC,IMR90}.hg18.bed.
+# BUILD: liftOver hg18->hg38 (hg18ToHg38.over.chain), bedClip to chrom.sizes, drop non-primary
+#   contigs, sort, bedToBigBed -type=bed4 -as=/hive/data/outside/tad/tadDomain.as ->
+#   /hive/data/outside/tad/dixon2012/build/hg38/tadsDixon{HESC,IMR90}.bb (3,051 / ~2.3k after lift).
+#   Symlinked to /gbdb/hg38/bbi/tad/ . trackDb: tadsDixon composite in human/hg38/tad.ra
+#   (longLabel notes "lifted from hg18"). html tadsDixon.html.
+
+##############################################################################
+# Schmitt 2016 TAD boundaries (tadsSchmitt)  built 2026-06-14, documented 2026-06-21 (lou)
+##############################################################################
+# Source: Schmitt AD et al. 2016, Cell Rep 17(8):2042-2059, PMID 27851967,
+# doi:10.1016/j.celrep.2016.10.061. GEO GSE87112 (raw Hi-C only; the calls live in the supplement).
+#
+# DATA PROVENANCE + FETCH:
+#   1. Cell Reports Supplementary Table S3 (TAD boundary annotations) = supplemental file mmc4.xlsx.
+#      ACCESS FRICTION: cell.com is Cloudflare-gated and PMC bins are behind proof-of-work/reCAPTCHA,
+#      so mmc4.xlsx was obtained by MANUAL BROWSER download (open access, CC-BY family, no paywall),
+#      staged at /hive/users/lrnassar/claude/RM21599/schmidtt2016/mmc4.xlsx (sic: the dir is spelled "schmidtt2016").
+#   2. 21 worksheets (one per human sample = 14 tissues + 7 cell lines), each plain BED3, uniform
+#      40 kb boundary bins, assembly hg19. Converted per-sheet -> source/schmitt_<code>.hg19.bed (21).
+# BUILD: liftOver hg19->hg38 (hg19ToHg38.over.chain), bedClip, primary-chrom filter, sort,
+#   bedToBigBed -type=bed4 -as=/hive/data/outside/tad/tadBoundary.as ->
+#   /hive/data/outside/tad/schmitt2016/build/hg38/tadsSchmitt<code>.bb (21 files). Symlinked to
+#   /gbdb/hg38/bbi/tad/ . trackDb: tadsSchmitt composite (subGroups sample-type + organ-system) in
+#   human/hg38/tad.ra. html tadsSchmitt.html.
+
+##############################################################################
+# McArthur 2021 TAD boundary stability (tadsMcArthur)  built 2026-06-14, documented 2026-06-21 (lou)
+##############################################################################
+# Source: McArthur E, Capra JA 2021, Am J Hum Genet 108(2):269-283, PMID 33545030,
+# doi:10.1016/j.ajhg.2021.01.001. GitHub emcarthur/TAD-stability-heritability (MIT). Direct, no login.
+#
+# DATA PROVENANCE + FETCH:
+#   1. Shallow git clone of https://github.com/emcarthur/TAD-stability-heritability (branch master)
+#      to /hive/users/lrnassar/claude/RM21599/mcarthur2021/ .
+#   2. KEY FILE: data/boundariesByStability/100kbBookendBoundaries_mainText/
+#      100kbBookendBoundaries_byStability.bed -> staged at
+#      /hive/data/outside/tad/mcarthur2021/source/100kbBookendBoundaries_byStability.bed .
+#      Format: 5-col TSV with header (chr, loc, loc2, counts, stability_percentile); 14,345 boundaries,
+#      100 kb-wide, assembly hg19. `counts` = number of the 37 cell-type maps sharing the boundary (1-37).
+# BUILD: drop header; liftOver hg19->hg38 (hg19ToHg38.over.chain), bedClip, primary-chrom filter, sort.
+#   Emit bigBed 5 + 2 (autoSql /hive/data/outside/tad/tadStability.as): name=boundary id, score=
+#   round(contexts/37*1000) (rendering proxy), contexts (= source `counts`, 1-37; the real datum), percentile.
+#   bedToBigBed -> /hive/data/outside/tad/mcarthur2021/build/hg38/tadsMcArthur.bb (14,287 after lift).
+#   Symlinked to /gbdb/hg38/bbi/tad/ . trackDb: tadsMcArthur (filter.contexts, spectrum) in
+#   human/hg38/tad.ra. html tadsMcArthur.html.
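
The score mapping described in the McArthur BUILD note above can be sketched in Python as follows. This is a minimal illustration and not the in-tree build script: the function names, the sample row, and the "boundary1" name are hypothetical, and it assumes the counts column parses as a number. Only the formula round(contexts/37*1000) and the bed5+2 column order come from the makedoc.

```python
# Sketch of the bed5+2 row layout for tadsMcArthur (hypothetical helper names).
N_MAPS = 37  # number of cell-type maps, as stated in the makedoc


def contexts_to_score(contexts, n_maps=N_MAPS):
    """Scale a context count (1..n_maps) to a 0-1000 BED score (rendering proxy)."""
    if not 1 <= contexts <= n_maps:
        raise ValueError(f"contexts out of range: {contexts}")
    return round(contexts / n_maps * 1000)


def to_bed5_plus2(fields, name):
    """Convert one source data row (chr, loc, loc2, counts, stability_percentile)
    into bed5+2 fields: chrom, start, end, name, score, contexts, percentile.

    Assumption: `counts` is numeric text; the percentile is passed through unchanged.
    """
    chrom, start, end, counts, percentile = fields
    contexts = int(float(counts))
    score = contexts_to_score(contexts)
    return [chrom, start, end, name, str(score), str(contexts), percentile]


if __name__ == "__main__":
    # Made-up example row, not taken from the real file.
    row = "chr1\t1000000\t1100000\t19\t0.5".split("\t")
    print("\t".join(to_bed5_plus2(row, "boundary1")))
```

Keeping contexts as its own column next to the derived score matches the note that the score is only a rendering proxy while contexts is the value the filter operates on.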