888e7470c14eeecdca310ed36bb45c3c00ae8052 lrnassar Tue Apr 21 15:14:04 2026 -0700
QA fixes for MPRA superTrack. refs #37359

Fix broken mpraVarDb bigDataUrl: pointed at /gbdb/hg38/mpra/mpravardb.bb but the file is at /gbdb/hg38/mpra/mpravardb/mpravardb.bb, causing hgTrackDb -strict to silently drop the subtrack.

Rebuild mpravardb.bb after two fixes in mpravardbToBed.py: sanitize UTF-8 in user-visible string fields (curly quotes, primes, NBSP mojibake) that the browser does not transcode, eliminating ~246k non-ASCII occurrences across 42% of rows; and change safe_float / pval_to_score to write NaN and return score 0 for NA / out-of-range p-values instead of 0.0 and score 1000 (previously inflated untested variants to the top of score-sorted views).

trackDb stanza cleanup: shorten mpraVarDb longLabel, drop superfluous type bed 4 from superTrack, make bigBed 9+13 explicit, remove redundant mouseOverField, align parent mpra on, add filterValues for cell_line/assay/cellLine and filterByRange sliders for percentile_rank / fdr / log2FC, add labelFields and maxWindowToDraw.

Description pages: add cross-species disclosure (mouse reporter cells used to assay human sequences), update mpraVarDb header to post-liftOver count 239,028 with Studies-table footnote, fix mpraVarDb.html download-server paths, soften imprecise "51 MPRA experiments" claim in mpra.html and mprabase.html.

relatedTracks.ra: reciprocal mpra <-> wgEncodeReg4 and mpra <-> cCREs.

Expand mpra.txt makedoc with upstream provenance and QA-rebuild log.
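
The p-value change described in the commit message can be illustrated with a minimal sketch. The real mpravardbToBed.py is not shown in this diff, so the function bodies below are assumptions: in particular the -log10 scaling (times 100, capped at 1000) is hypothetical and only serves to show why a missing value parsed as 0.0 would land at score 1000, while NaN handling sends it to 0.

```python
import math


def safe_float(text):
    """Parse a numeric field; return NaN (rather than 0.0) for NA or unparseable input."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return math.nan


def pval_to_score(pval):
    """Map a p-value to a 0-1000 BED score.

    Missing (NaN) or out-of-range p-values get score 0, so untested
    variants sort to the bottom instead of the top.
    The scaling used here is an assumption made for this sketch.
    """
    if math.isnan(pval) or not (0.0 < pval <= 1.0):
        return 0
    return min(1000, int(round(-math.log10(pval) * 100)))
```

With this shape, `pval_to_score(safe_float("NA"))` yields 0, whereas a parser that turned "NA" into 0.0 and a scorer that capped -log10(0) would have produced the maximum score.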
diff --git src/hg/makeDb/trackDb/human/hg38/mprabase.html src/hg/makeDb/trackDb/human/hg38/mprabase.html
index 837554a8c0c..c316baee431 100644
--- src/hg/makeDb/trackDb/human/hg38/mprabase.html
+++ src/hg/makeDb/trackDb/human/hg38/mprabase.html
@@ -1,129 +1,136 @@
 <h2>Description</h2>
 <p>
 Massively Parallel Reporter Assays (MPRAs) and related methods such as STARR-seq enable quantitative testing of thousands of candidate regulatory DNA sequences in parallel by linking each sequence to a reporter gene and measuring transcriptional output using sequencing.
 </p>
 <p>
 The <b>MPRA Base</b> track shows 41,275 experimentally tested cis-regulatory elements
-from the <a href="http://mprabase.ucsf.edu/app/mprabase" target="_blank">MPRA Base</a>
+curated from the <a href="http://mprabase.ucsf.edu/app/mprabase" target="_blank">MPRA Base</a>
 database
-(<a href="https://pubmed.ncbi.nlm.nih.gov/38045264/" target="_blank">Zhao et al., 2023</a>).
+(<a href="https://pubmed.ncbi.nlm.nih.gov/38045264/" target="_blank">Zhao et al., 2023</a>),
+drawn from MPRA, STARR-seq, and related reporter assay experiments.
 The database integrates data from multiple studies, assay platforms (lentiMPRA, plasmidMPRA, STARR-seq, CRE-seq, and others), and cell types while preserving experiment-level resolution. Only elements derived from genomic fragments that can be mapped to the reference genome are included; synthetic or designed oligonucleotide libraries without genomic coordinates are excluded.
 </p>
+<p>
+<b>Note on cell lines:</b> The cell line shown for each element is the reporter
+cell line in which the genomic fragment was assayed. One study (Mattioli et al.,
+2020) used mouse embryonic stem cells (mESC) as one of its reporter systems; the
+fragments retain their human (hg38) coordinates.
+</p>
 <h2>Display Conventions</h2>
 <p>
 Each item represents a genomic fragment tested within a specific experiment, defined as a unique combination of cell line, assay type, and publication (PMID).
 The same genomic region may appear multiple times if tested in different experiments.
 </p>
 <p>
 Items are colored by percentile rank of the mean raw activity score within each experiment:
 </p>
 <ul>
 <li><span style="color:blue;"><b>Blue</b></span> — percentile < 50</li>
 <li><span style="color:orange;"><b>Orange</b></span> — percentile 50–74</li>
 <li><span style="color:red;"><b>Red</b></span> — percentile ≥ 75</li>
 </ul>
 <p>
 The mouse-over shows the cell line, assay type, raw activity score, percentile rank, and citation for each element.
 </p>
 <h2>Methods</h2>
 <p>
 Within each experiment, replicate measurements for the same genomic fragment were aggregated by computing the mean raw activity score. The original dataset contained 211,053 replicate-level measurements; after aggregation, the final track contains 41,275 unique experiment-level genomic elements.
 </p>
 <p>
 Elements are ranked by mean raw activity score independently within each experiment, and a percentile rank (0–100) is computed per experiment to avoid cross-study distortions caused by differing assay dynamic ranges.
 </p>
 <h2>Experiments</h2>
 <p>
 The following table lists the experiments represented in this track.
 </p>
 <table class="stdTbl">
 <tr>
 <th>PMID</th>
 <th>Author</th>
 <th>Year</th>
 <th>Lab</th>
 <th>Cell type</th>
 <th>Assay</th>
 <th>Elements</th>
 </tr>
 <tr><td><a href="https://pubmed.ncbi.nlm.nih.gov/27831498/" target="_blank">27831498</a></td><td>Inoue et al.</td><td>2017</td><td>Shendure Lab</td><td>HepG2</td><td>lentiMPRA</td><td>2,241</td></tr>
 <tr><td><a href="https://pubmed.ncbi.nlm.nih.gov/30045748/" target="_blank">30045748</a></td><td>Klein et al.</td><td>2018</td><td>Shendure Lab</td><td>HepG2</td><td>STARR-seq</td><td>7,064</td></tr>
 <tr><td><a href="https://pubmed.ncbi.nlm.nih.gov/32483191/" target="_blank">32483191</a></td><td>Choi et al.</td><td>2020</td><td>Brown Lab</td><td>HEK293FT</td><td>lentiMPRA</td><td>840</td></tr>
 <tr><td><a href="https://pubmed.ncbi.nlm.nih.gov/32483191/" target="_blank">32483191</a></td><td>Choi et al.</td><td>2020</td><td>Brown Lab</td><td>UACC903</td><td>lentiMPRA</td><td>840</td></tr>
 <tr><td><a href="https://pubmed.ncbi.nlm.nih.gov/32819422/" target="_blank">32819422</a></td><td>Mattioli et al.</td><td>2020</td><td>Mele Lab</td><td>HUES64</td><td>plasmidMPRA</td><td>6,954</td></tr>
 <tr><td><a href="https://pubmed.ncbi.nlm.nih.gov/32819422/" target="_blank">32819422</a></td><td>Mattioli et al.</td><td>2020</td><td>Mele Lab</td><td>mESC</td><td>plasmidMPRA</td><td>6,954</td></tr>
 <tr><td><a href="https://pubmed.ncbi.nlm.nih.gov/33046894/" target="_blank">33046894</a></td><td>Klein et al.</td><td>2020</td><td>Shendure Lab</td><td>HepG2</td><td>lentiMPRA</td><td>8,116</td></tr>
 <tr><td><a href="https://pubmed.ncbi.nlm.nih.gov/33046894/" target="_blank">33046894</a></td><td>Klein et al.</td><td>2020</td><td>Shendure Lab</td><td>HepG2</td><td>plasmidMPRA</td><td>2,228</td></tr>
 <tr><td><a href="https://pubmed.ncbi.nlm.nih.gov/33046894/" target="_blank">33046894</a></td><td>Klein et al.</td><td>2020</td><td>Shendure Lab</td><td>HepG2</td><td>STARR-seq</td><td>2,230</td></tr>
 <tr><td><a href="https://pubmed.ncbi.nlm.nih.gov/36834916/" target="_blank">36834916</a></td><td>Koesterich et al.</td><td>2023</td><td>Kreimer Lab</td><td>NPC</td><td>lentiMPRA</td><td>3,807</td></tr>
 </table>
 <h2>Data Access</h2>
 <p>
 The data can be explored interactively in table format with the <a href="../cgi-bin/hgTables">Table Browser</a> or the <a href="../cgi-bin/hgIntegrator">Data Integrator</a> and exported from there to spreadsheet or tab-sep tables. From scripts, the data can be accessed through our <a href="https://api.genome.ucsc.edu" target="_blank">API</a>, track=<i>mprabase</i>.
 </p>
 <p>
 For automated download and analysis, the genome annotation is stored in a bigBed file that can be downloaded from <a href="http://hgdownload.soe.ucsc.edu/gbdb/hg38/mpra/mprabase" target="_blank">our download server</a>. The file for this track is called <tt>mprabase.bb</tt>. Individual regions or the whole genome annotation can be obtained using our tool <tt>bigBedToBed</tt>, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found <a href="http://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads" target="_blank">here</a>. The tool can also be used to obtain features within a given range, e.g. <tt>bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/mpra/mprabase/mprabase.bb -chrom=chr21 -start=0 -end=100000000 stdout</tt>
 </p>
 <p>
 The original data can be downloaded from the <a href="http://mprabase.ucsf.edu/app/mprabase" target="_blank">MPRA Base web application</a>.
 </p>
 <h2>Credits</h2>
 <p>
 Thanks to Varda Singhal, Jianyu Zhao, and the <a href="https://pharm.ucsf.edu/ahituv" target="_blank">Ahituv Lab</a> at the University of California San Francisco for creating and curating MPRA Base and for creating this track.
 </p>
 <h2>References</h2>
 <p>
 Zhao J, Baltoumas FA, Konnaris MA, Mouratidis I, Liu Z, Sims J, Agarwal V, Pavlopoulos GA, Georgakopoulos-Soares I, Ahituv N.
 <a href="https://doi.org/10.1101/2023.11.19.567742" target="_blank"> MPRAbase: A Massively Parallel Reporter Assay Database</a>.
 <em>bioRxiv</em>. 2023 Nov 22;.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/38045264" target="_blank">38045264</a>;
 PMC: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10690217/" target="_blank">PMC10690217</a>
 </p>
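
For reviewers checking the Methods and Display Conventions text in mprabase.html, the replicate aggregation, per-experiment percentile ranking, and color binning can be sketched roughly as follows. This is not the track's build code: the function names, the input row layout, and the exact percentile formula (lowest element 0, highest 100, ties left in sort order) are assumptions made for illustration. Only the mean aggregation, the per-experiment ranking, and the 50 / 75 color thresholds come from the page.

```python
from collections import defaultdict
from statistics import mean


def aggregate_and_rank(measurements):
    """Aggregate replicates and rank elements within each experiment.

    measurements: iterable of (experiment, fragment, raw_score) replicate rows.
    Returns {(experiment, fragment): (mean_score, percentile)}.
    """
    # Collect replicate scores for each fragment within each experiment.
    replicates = defaultdict(list)
    for experiment, fragment, score in measurements:
        replicates[(experiment, fragment)].append(score)

    # Reduce replicates to one mean score per (experiment, fragment).
    by_experiment = defaultdict(list)
    for (experiment, fragment), scores in replicates.items():
        by_experiment[experiment].append((fragment, mean(scores)))

    # Rank independently inside each experiment so assays with different
    # dynamic ranges are never compared directly.
    result = {}
    for experiment, items in by_experiment.items():
        items.sort(key=lambda item: item[1])
        n = len(items)
        for rank, (fragment, mean_score) in enumerate(items):
            # Assumed formula for this sketch; ties are not averaged.
            percentile = 100.0 * rank / (n - 1) if n > 1 else 100.0
            result[(experiment, fragment)] = (mean_score, percentile)
    return result


def item_color(percentile):
    """Color bins from the Display Conventions list."""
    if percentile >= 75:
        return "red"
    if percentile >= 50:
        return "orange"
    return "blue"
```

Because ranking happens per experiment, the same genomic fragment can be red in one experiment and blue in another, which matches the note that a region may appear multiple times.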