888e7470c14eeecdca310ed36bb45c3c00ae8052 lrnassar Tue Apr 21 15:14:04 2026 -0700
QA fixes for MPRA superTrack. refs #37359

Fix broken mpraVarDb bigDataUrl: it pointed at /gbdb/hg38/mpra/mpravardb.bb but the file is at /gbdb/hg38/mpra/mpravardb/mpravardb.bb, causing hgTrackDb -strict to silently drop the subtrack.

Rebuild mpravardb.bb after two fixes in mpravardbToBed.py: sanitize UTF-8 in user-visible string fields (curly quotes, primes, NBSP mojibake) that the browser does not transcode, eliminating ~246k non-ASCII occurrences across 42% of rows; and change safe_float / pval_to_score to write NaN and return score 0 for NA / out-of-range p-values instead of 0.0 and score 1000 (previously inflated untested variants to the top of score-sorted views).

trackDb stanza cleanup: shorten mpraVarDb longLabel, drop superfluous type bed 4 from superTrack, make bigBed 9+13 explicit, remove redundant mouseOverField, align parent mpra on, add filterValues for cell_line/assay/cellLine and filterByRange sliders for percentile_rank / fdr / log2FC, add labelFields and maxWindowToDraw.

Description pages: add cross-species disclosure (mouse reporter cells used to assay human sequences), update mpraVarDb header to post-liftOver count 239,028 with Studies-table footnote, fix mpraVarDb.html download-server paths, soften imprecise "51 MPRA experiments" claim in mpra.html and mprabase.html.

relatedTracks.ra: reciprocal mpra <-> wgEncodeReg4 and mpra <-> cCREs.

Expand mpra.txt makedoc with upstream provenance and QA-rebuild log.

diff --git src/hg/makeDb/trackDb/human/hg38/mprabase.html src/hg/makeDb/trackDb/human/hg38/mprabase.html
index 837554a8c0c..c316baee431 100644
--- src/hg/makeDb/trackDb/human/hg38/mprabase.html
+++ src/hg/makeDb/trackDb/human/hg38/mprabase.html
@@ -1,129 +1,136 @@
Massively Parallel Reporter Assays (MPRAs) and related methods such as STARR-seq enable quantitative testing of thousands of candidate regulatory DNA sequences in parallel by linking each sequence to a reporter gene and measuring transcriptional output using sequencing.
The MPRA Base track shows 41,275 experimentally tested cis-regulatory elements
-from the MPRA Base
+curated from the MPRA Base
database
-(Zhao et al., 2023).
+(Zhao et al., 2023),
+drawn from MPRA, STARR-seq, and related reporter assay experiments.
The database integrates data from multiple studies, assay platforms (lentiMPRA, plasmidMPRA, STARR-seq, CRE-seq, and others), and cell types while preserving experiment-level resolution. Only elements derived from genomic fragments that can be mapped to the reference genome are included; synthetic or designed oligonucleotide libraries without genomic coordinates are excluded.
+
+Note on cell lines: The cell line shown for each element is the reporter
+cell line in which the genomic fragment was assayed. One study (Mattioli et al.,
+2020) used mouse embryonic stem cells (mESC) as one of its reporter systems; the
+fragments retain their human (hg38) coordinates.
+
Each item represents a genomic fragment tested within a specific experiment, defined as a unique combination of cell line, assay type, and publication (PMID). The same genomic region may appear multiple times if tested in different experiments.
Items are colored by percentile rank of the mean raw activity score within each experiment:
The mouse-over shows the cell line, assay type, raw activity score, percentile rank, and citation for each element.
Within each experiment, replicate measurements for the same genomic fragment were aggregated by computing the mean raw activity score. The original dataset contained 211,053 replicate-level measurements; after aggregation, the final track contains 41,275 unique experiment-level genomic elements.
Elements are ranked by mean raw activity score independently within each experiment, and a percentile rank (0–100) is computed per experiment to avoid cross-study distortions caused by differing assay dynamic ranges.
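The aggregation and ranking steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used to build the track: the function name, the input tuple layout, and the percentile definition (the share of elements in the same experiment whose mean is at or below the element's own mean) are assumptions, because the exact formula is not given here.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean


def aggregate_and_rank(measurements):
    """Aggregate replicates and rank elements within each experiment.

    `measurements` is an iterable of hypothetical tuples:
    (pmid, cell_line, assay, fragment, raw_score).
    Returns {((pmid, cell_line, assay), fragment): (mean_score, percentile)}.
    """
    # An experiment is a unique (PMID, cell line, assay) combination.
    by_experiment = defaultdict(lambda: defaultdict(list))
    for pmid, cell_line, assay, fragment, score in measurements:
        by_experiment[(pmid, cell_line, assay)][fragment].append(score)

    result = {}
    for experiment, fragments in by_experiment.items():
        # Collapse replicate measurements to one mean score per fragment.
        means = {frag: mean(scores) for frag, scores in fragments.items()}
        ordered = sorted(means.values())
        n = len(ordered)
        for frag, m in means.items():
            # Assumed definition: percent of this experiment's elements
            # with a mean score at or below this element's mean.
            percentile = 100.0 * bisect_right(ordered, m) / n
            result[(experiment, frag)] = (m, percentile)
    return result


# Toy input with made-up PMIDs, fragments, and scores.
example = [
    ("PMID_A", "HepG2", "lentiMPRA", "chr1:100-200", 1.0),
    ("PMID_A", "HepG2", "lentiMPRA", "chr1:100-200", 3.0),
    ("PMID_A", "HepG2", "lentiMPRA", "chr1:300-400", 4.0),
    ("PMID_A", "HEK293FT", "lentiMPRA", "chr1:100-200", 10.0),
]
ranked = aggregate_and_rank(example)
```

Because ranking happens inside each experiment, the same fragment can receive different percentile ranks in different experiments, which is the intended behavior when assay dynamic ranges differ.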
The following table lists the experiments represented in this track.
| PMID | Author | Year | Lab | Cell type | Assay | Elements |
|---|---|---|---|---|---|---|
| 27831498 | Inoue et al. | 2017 | Shendure Lab | HepG2 | lentiMPRA | 2,241 |
| 30045748 | Klein et al. | 2018 | Shendure Lab | HepG2 | STARR-seq | 7,064 |
| 32483191 | Choi et al. | 2020 | Brown Lab | HEK293FT | lentiMPRA | 840 |
| 32483191 | Choi et al. | 2020 | Brown Lab | UACC903 | lentiMPRA | 840 |
| 32819422 | Mattioli et al. | 2020 | Mele Lab | HUES64 | plasmidMPRA | 6,954 |
| 32819422 | Mattioli et al. | 2020 | Mele Lab | mESC | plasmidMPRA | 6,954 |
| 33046894 | Klein et al. | 2020 | Shendure Lab | HepG2 | lentiMPRA | 8,116 |
| 33046894 | Klein et al. | 2020 | Shendure Lab | HepG2 | plasmidMPRA | 2,228 |
| 33046894 | Klein et al. | 2020 | Shendure Lab | HepG2 | STARR-seq | 2,230 |
| 36834916 | Koesterich et al. | 2023 | Kreimer Lab | NPC | lentiMPRA | 3,807 |
The data can be explored interactively in table format with the Table Browser or the Data Integrator and exported from there as spreadsheet or tab-separated tables. From scripts, the data can be accessed through our API using track=mprabase.
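As a rough sketch of scripted access, the snippet below builds a request for the UCSC REST API `getData/track` endpoint with `track=mprabase`. The helper names are hypothetical, and the assumption that the JSON response stores the items under a key named after the track should be checked against the API documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def build_track_url(genome, track, chrom, start, end):
    """Return the request URL for one region of one track."""
    query = urllib.parse.urlencode(
        {"genome": genome, "track": track, "chrom": chrom,
         "start": start, "end": end}
    )
    return f"{API_BASE}?{query}"


def items_from_response(payload, track):
    """Extract the item list, assuming it is keyed by the track name."""
    return payload.get(track, [])


def fetch_track(genome, track, chrom, start, end):
    """Download and decode one region (needs network access)."""
    url = build_track_url(genome, track, chrom, start, end)
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    return items_from_response(payload, track)


url = build_track_url("hg38", "mprabase", "chr21", 0, 100000)
```

Calling `fetch_track("hg38", "mprabase", "chr21", 0, 100000)` would then return the elements in that window as a list of dictionaries, if the response has the assumed shape.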
For automated download and analysis, the genome annotation is stored in a bigBed file that can be downloaded from our download server. The file for this track is called mprabase.bb. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain features within a given range, e.g. bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/mpra/mprabase/mprabase.bb -chrom=chr21 -start=0 -end=100000000 stdout
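The bigBedToBed call above can also be wrapped in a script. The following is a sketch under stated assumptions: the helper names are hypothetical, `bigBedToBed` must already be installed and on the PATH, and only the first four BED columns are interpreted because the layout of the remaining fields is not described here.

```python
import shutil
import subprocess

BIGBED_URL = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/mpra/mprabase/mprabase.bb"


def bigbed_command(url, chrom, start, end):
    """Build the same command line as shown in the text."""
    return ["bigBedToBed", url, f"-chrom={chrom}",
            f"-start={start}", f"-end={end}", "stdout"]


def parse_bed_line(line):
    """Split one tab-separated BED line; extra columns are kept as raw strings."""
    fields = line.rstrip("\n").split("\t")
    return {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "name": fields[3] if len(fields) > 3 else None,
        "extra": fields[4:],
    }


def fetch_region(chrom, start, end, url=BIGBED_URL):
    """Run bigBedToBed for one region and parse its output."""
    if shutil.which("bigBedToBed") is None:
        raise RuntimeError("bigBedToBed was not found on PATH")
    result = subprocess.run(
        bigbed_command(url, chrom, start, end),
        capture_output=True, text=True, check=True,
    )
    return [parse_bed_line(line) for line in result.stdout.splitlines() if line]
```

For example, `fetch_region("chr21", 0, 100000000)` would run the command from the text and return one dictionary per element.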
The original data can be downloaded from the MPRA Base web application.
Thanks to Varda Singhal, Jianyu Zhao, and the Ahituv Lab at the University of California San Francisco for creating and curating MPRA Base and for creating this track.
Zhao J, Baltoumas FA, Konnaris MA, Mouratidis I, Liu Z, Sims J, Agarwal V, Pavlopoulos GA, Georgakopoulos-Soares I, Ahituv N. MPRAbase: A Massively Parallel Reporter Assay Database. bioRxiv. 2023 Nov 22. PMID: 38045264; PMC: PMC10690217