src/hg/makeDb/doc/hg38/mpra.txt 888e7470c14eeecdca310ed36bb45c3c00ae8052

888e7470c14eeecdca310ed36bb45c3c00ae8052
lrnassar
  Tue Apr 21 15:14:04 2026 -0700
QA fixes for MPRA superTrack. refs #37359

Fix broken mpraVarDb bigDataUrl — pointed at /gbdb/hg38/mpra/mpravardb.bb
but the file is at /gbdb/hg38/mpra/mpravardb/mpravardb.bb, causing
hgTrackDb -strict to silently drop the subtrack.
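
A check along these lines would catch a stale bigDataUrl before hgTrackDb
does. This is a minimal sketch and not part of the commit: the helper name
missing_big_data_urls is hypothetical, and it assumes bigDataUrl values are
local paths (such as /gbdb/...) rather than remote URLs.

```python
import os
import re


def missing_big_data_urls(ra_text, exists=os.path.exists):
    """Return (track, path) pairs whose bigDataUrl file does not exist.

    Hypothetical helper. Assumes each stanza has a 'track NAME' line before
    its 'bigDataUrl PATH' line, and that PATH is a local file path.
    """
    missing = []
    track = None
    for line in ra_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = re.match(r"(\S+)\s+(.*)", line)
        if not match:
            continue  # setting with no value, nothing to check
        key, value = match.group(1), match.group(2).strip()
        if key == "track":
            track = value
        elif key == "bigDataUrl" and not exists(value):
            missing.append((track, value))
    return missing
```

The `exists` parameter is injectable so the check can be exercised without
touching /gbdb; on a real host the default os.path.exists would be used.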

Rebuild mpravardb.bb after two fixes in mpravardbToBed.py. First, sanitize
UTF-8 in user-visible string fields (curly quotes, primes, NBSP mojibake)
that the browser does not transcode, eliminating ~246k non-ASCII
occurrences across 42% of rows. Second, change safe_float to write NaN
(was 0.0) for NA values and pval_to_score to return score 0 (was 1000) for
NA / out-of-range p-values; the old behavior pushed untested variants to
the top of score-sorted views.
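
The two fixes can be sketched as follows. This is an illustration under
stated assumptions, not the code in mpravardbToBed.py: the replacement
table, the handling of leftover non-ASCII characters, and the -log10
scaling inside pval_to_score are guesses. Only the NaN and score-0 behavior
for NA / out-of-range values is taken from this commit.

```python
import math

# Assumed replacement table; the real script's list may differ.
_ASCII_MAP = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2032": "'", "\u2033": '"',   # prime, double prime
    "\u00a0": " ",                  # no-break space
}


def sanitize(text):
    """Map common typographic characters to ASCII for display fields."""
    # "\u00c2\u00a0" is how a UTF-8 NBSP looks after a Latin-1 mis-decode.
    text = text.replace("\u00c2\u00a0", " ")
    text = "".join(_ASCII_MAP.get(ch, ch) for ch in text)
    # Assumption: any remaining non-ASCII character is simply dropped.
    return text.encode("ascii", "ignore").decode("ascii").strip()


def safe_float(value):
    """Parse a CSV cell as float; NA, empty, or non-numeric cells become NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


def pval_to_score(pvalue, max_score=1000):
    """Scale -log10(p) into 0..max_score; NA or out-of-range p scores 0."""
    if math.isnan(pvalue) or pvalue <= 0 or pvalue > 1:
        return 0
    # Assumed scaling: 100 points per order of magnitude, capped at max_score.
    return min(max_score, int(round(-100 * math.log10(pvalue))))
```

In Python, str(float("nan")) is "nan", which matches the literal the
makedoc says is written to the float fields.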

trackDb stanza cleanup: shorten mpraVarDb longLabel, drop superfluous
type bed 4 from superTrack, make bigBed 9+13 explicit, remove redundant
mouseOverField, align parent mpra on, add filterValues for
cell_line/assay/cellLine and filterByRange sliders for percentile_rank /
fdr / log2FC, add labelFields and maxWindowToDraw.

Description pages: add cross-species disclosure (mouse reporter cells
used to assay human sequences), update mpraVarDb header to post-liftOver
count 239,028 with Studies-table footnote, fix mpraVarDb.html
download-server paths, soften imprecise "51 MPRA experiments" claim in
mpra.html and mprabase.html.
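
The post-liftOver count can be reconciled with the figures recorded in the
mpra.txt makedoc in this commit (242,818 input variants, 3,676 with NA
coordinates, 114 unmapped by liftOver). A quick arithmetic check, assuming
the NA-coordinate rows are dropped before liftOver and that no other rows
are filtered:

```python
# Counts copied from the mpra.txt makedoc; the interpretation is an assumption.
total_input = 242_818   # variants in the MPRAVarDB CSV snapshot
hg19_rows = 213_689     # rows labeled hg19
hg38_rows = 29_129      # rows labeled hg38
na_coords = 3_676       # rows with NA coordinates (assumed skipped)
unmapped = 114          # rows liftOver could not map (from the prior makedoc)

# The assembly labels partition the input, so the NA rows are a subset of them.
assert hg19_rows + hg38_rows == total_input

final_rows = total_input - na_coords - unmapped
print(final_rows)  # 239028
```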

relatedTracks.ra: reciprocal mpra <-> wgEncodeReg4 and mpra <-> cCREs.

Expand mpra.txt makedoc with upstream provenance and QA-rebuild log.

diff --git src/hg/makeDb/doc/hg38/mpra.txt src/hg/makeDb/doc/hg38/mpra.txt
index e5f00925307..540e9659792 100644
--- src/hg/makeDb/doc/hg38/mpra.txt
+++ src/hg/makeDb/doc/hg38/mpra.txt
@@ -1,19 +1,97 @@
-# max Mar 30 2026
-# There was nothing to do, Varda from the Ahituv Lab provided the bigBed file at
+# MPRA superTrack (hg38) - Redmine #37359
+# -----------------------------------------------------------------------------
+# Two subtracks: mprabase (MPRA Base enhancer elements) and mpraVarDb (MPRA-tested
+# regulatory variants).  trackDb stanzas live in human/hg38/mpra.ra.  Description
+# pages: mpra.html, mprabase.html, mpraVarDb.html.
+
+# =============================================================================
+# mprabase subtrack - max Mar 30 2026
+# =============================================================================
+# No local processing. The bigBed was provided directly by Varda Singhal
+# (Ahituv Lab, UCSF) via UCSC hubspace and dropped into the gbdb path.
+#
+# Source (upstream bigBed):
 #   https://genome.ucsc.edu/hubspace/72/Varda006/Varda_Final_Hub/final_authorPMID.mean_v2.bb
-# and a hub at https://genome.ucsc.edu/hubspace/72/Varda006/Varda_Final_Hub/hub.txt
+# Full upstream hub:
+#   https://genome.ucsc.edu/hubspace/72/Varda006/Varda_Final_Hub/hub.txt
+# Upstream SQLite sits alongside the bigBed:
+#   /hive/data/genomes/hg38/bed/mpra/mprabase/mprabase_v4_9.3.db
+# That DB corresponds to MPRA Base v4.9.3 and is the source of truth for
+# reproducing the bigBed if Varda ever refreshes the upstream hub.
+
+mkdir -p /hive/data/genomes/hg38/bed/mpra/mprabase
+cd /hive/data/genomes/hg38/bed/mpra/mprabase
+wget https://genome.ucsc.edu/hubspace/72/Varda006/Varda_Final_Hub/final_authorPMID.mean_v2.bb -O mprabase.bb
+
+# gbdb symlink:
+#   /gbdb/hg38/mpra/mprabase/mprabase.bb -> /hive/data/genomes/hg38/bed/mpra/mprabase/mprabase.bb
+
+# Historical note: an earlier attempt lifted the elements from hg19 via a
+# custom SQLite liftover table (hg38CustomLiftover.RDS, preserved in the build
+# dir), but the result had one feature extending past the chromosome end.  It
+# was replaced by the pre-built hub file above, so the liftOver path is unused.
 
-# MPRAVarDB track
-# Mon Mar 10 2026 (claude/max)
+# =============================================================================
+# mpraVarDb subtrack - Mar 10 2026 (claude/max), QA rebuild Apr 21 2026 (lou)
+# =============================================================================
+# Source:
+#   https://mpravardb.rc.ufl.edu/ (UFL web server)
+# Snapshot date: Mar 10 2026 (CSV via the "download_all" endpoint).  The
+# MPRAVarDB project does not publish version numbers; track the snapshot
+# date and the session URL together as the provenance pair.
+#
+# Input CSV: 242,818 variants from 18 MPRA studies, labeled hg19 (213,689) or
+# hg38 (29,129); 3,676 of these have NA coordinates.
+# Script liftOvers hg19 -> hg38, merges with native hg38, and emits bigBed 9+13.
 
-# Download data from https://mpravardb.rc.ufl.edu/
 mkdir -p /hive/data/genomes/hg38/bed/mpra/mpravardb
 cd /hive/data/genomes/hg38/bed/mpra/mpravardb
 wget 'https://mpravardb.rc.ufl.edu/session/27d7af46df917aed91f4cca7bee378a2/download/download_all?w=' -O mpravardb.csv
 
-# 242,818 variants from 18 MPRA studies, with both hg19 and hg38 coordinates.
-# 213,689 are hg19, 29,129 are hg38, 3,676 have no coordinates (NA).
-
-# Convert to BED, liftOver hg19->hg38, merge, and create bigBed:
+# Convert, liftOver, merge, and build bigBed.  Output: mpravardb.bb (239,028 rows; 114 unmapped).
 python3 ~/kent/src/hg/makeDb/scripts/mpravardb/mpravardbToBed.py
-# Output: mpravardb.bb (239,028 variants after liftOver, 114 unmapped)
+
+# gbdb symlink:
+#   /gbdb/hg38/mpra/mpravardb/mpravardb.bb -> /hive/data/genomes/hg38/bed/mpra/mpravardb/mpravardb.bb
+
+# -----------------------------------------------------------------------------
+# QA rebuild Apr 21 2026 (RM #37359)
+# -----------------------------------------------------------------------------
+# mpravardbToBed.py updated to:
+#   - sanitize UTF-8 in user-visible string fields (curly quotes, primes,
+#     NBSP mojibake) before writing BED.  Prior build had ~246k non-ASCII
+#     byte occurrences across 100,961 rows (42% of track) including mangled
+#     rsIDs like "rs34425335NBSP-MOJIBAKE".
+#   - pval_to_score() now returns 0 (not 1000) for non-positive / out-of-range
+#     pvalue.  Prior build gave score=1000 to ~7,400 rows whose upstream pvalue
+#     was literal 0 (possibly NA coded as 0; see outstanding items below),
+#     which pushed those rows to the top of any score-sorted view.
+#   - safe_float() now returns NaN (was 0.0) for NA / empty / non-numeric
+#     upstream values.  27,065 rows whose upstream pvalue was literal "NA"
+#     now store pvalue="nan" instead of "0.0", so untested variants no longer
+#     masquerade as p=0 in the details page and are excluded by the default
+#     filter.fdr / filter.log2FC range sliders.  bedToBigBed accepts the
+#     literal string "nan" in float fields.
+#
+# Pre-rebuild backup preserved at:
+#   /hive/data/genomes/hg38/bed/mpra/mpravardb/mpravardb.bb.preQA-backup
+#
+# Reproduce QA rebuild:
+#   cd /hive/data/genomes/hg38/bed/mpra/mpravardb
+#   python3 ~/kent/src/hg/makeDb/scripts/mpravardb/mpravardbToBed.py
+
+# =============================================================================
+# Known outstanding items (see RM #37359)
+# =============================================================================
+# - mprabase reference field for Mattioli 2020 rows starts with "musculus ..."
+#   (species word merged into title upstream).  Flag to Varda for upstream fix.
+# - mprabase has 1 orphan element (Hela STARR-seq, PMID 23328393, Arnold 2013)
+#   not in the documented experiment table.  Flag to Varda.
+# - mpraVarDb preserves ~42k (chrom,start,end,name) duplicate rows (same rsID
+#   tested in multiple cells/studies).  Users disambiguate via the
+#   filterValues.cellLine / filterValues.mpraStudy filters in the trackDb.
+# - ~7,400 rows have upstream pvalue=0 and fdr=0 (not NA).  Could be genuine
+#   precision-floor significance or an upstream "not tested" encoding; the
+#   distinction is not recoverable from the CSV.  With pval_to_score returning
+#   0 for p<=0, these no longer dominate score-sorted views but their details
+#   page still reads "pvalue: 0.0".  Upstream clarification needed.