7095388baafcb7a9caac1086e9d737ea53b7151d
jcasper
  Mon Mar 9 09:45:54 2026 -0700
Docs tutorial for building a heatmap, refs #36176

diff --git src/hg/htdocs/goldenPath/help/heatmap.html src/hg/htdocs/goldenPath/help/heatmap.html
new file mode 100755
index 00000000000..0a81bdecdea
--- /dev/null
+++ src/hg/htdocs/goldenPath/help/heatmap.html
@@ -0,0 +1,232 @@
+<!DOCTYPE html>
+<!--#set var="TITLE" value="Genome Browser Heatmaps" -->
+<!--#set var="ROOT" value="../.." -->
+
+<!-- Relative paths to support mirror sites with non-standard GB docs install -->
+<!--#include virtual="$ROOT/inc/gbPageStart.html" -->
+
+<h1>Positional Heatmap Display</h1>
+
+<h2>Overview</h2>
+<p>The standard display mode for a bigBed track is a simple block or exon/intron marker
+in the window.  Extra fields in the bigBed, however, can contain a variety of additional data.
+When the extra fields follow the schema described below, the simple block display
+can be replaced with a positional heatmap.  The heatmap provides a sparse 2-dimensional
+grid for information like expression of allele-specific point mutations across a transcript.</p>
+
+<div class="text-center">
+  <a href="http://genome.ucsc.edu/s/jcasper/heatmap_example" target="_blank">
+  <img src="/images/heatmap_example.png" style="width:80%;max-width:1083px"></a>
+</div>
+
+<h2>Contents</h2>
+
+<h6><a href="#annotating">The Extra Fields</a></h6>
+<h6><a href="#gettingStarted">Getting Started</a></h6>
+<h6><a href="#troubleshooting">Troubleshooting</a></h6>
+
+<a id="annotating"></a>
+<h2>The Extra Fields</h2>
+<p>A heatmap bigBed file starts with the standard 12 BED fields, but adds 7 more.  Moreover,
+the blockSizes and chromStarts fields take on a slightly different interpretation.
+</p>
+<pre>
+    string chrom;      "Chromosome (or contig, scaffold, etc.)"
+    uint   chromStart; "Start position in chromosome"
+    uint   chromEnd;   "End position in chromosome"
+    string name;       "Name of item"
+    uint   score;      "Score from 0-1000"
+    char[1] strand;    "+ or -"
+    uint thickStart;   "Start of where display should be thick (start codon)"
+    uint thickEnd;     "End of where display should be thick (stop codon)"
+    uint reserved;     "Used as itemRgb as of 2004-11-22"
+    int blockCount;    "Number of blocks"
+    int[blockCount] blockSizes; "Comma separated list of block sizes"
+    int[blockCount] chromStarts; "Start positions relative to chromStart"
+    int _rowCount;    "Number of heatmap rows"
+    string[_rowCount] _labels; "Comma separated list of row labels"
+    lstring _colorBounds; "Comma-separated list of threshold scores for colors"
+    lstring _colorValues; "Comma-separated list of colors, one for each threshold score"
+    lstring _scoreArray; "Comma-separated row-first list of scores, ,, indicates N/A"
+    lstring _labelArray; "Comma-separated row-first list of mouseover labels, ,, indicates N/A"
+    lstring legend; "Legend"
+</pre>
+<p>
+The <code>_rowCount</code> field describes how many rows exist in each item's heatmap, while the
+<code>blockCount</code> field describes the number of columns.  The <code>_labels</code> field provides
+the labels for each row of the heatmap; no column labels are currently supported, as they are already tied to positions.</p>
+<p>The next two fields are a bit more complicated.  Heatmaps convert numerical scores into
+different colors for the heatmap cells, but the way the scores are translated into colors
+is controlled by the <code>_colorBounds</code> and <code>_colorValues</code> arrays.  Conceptually,
+these settings pick some score thresholds and associate colors with those scores.  Any scores between
+two adjacent thresholds receive a color that is interpolated between the colors of those two thresholds.
+Scores outside the bounds take the color of the nearest boundary.</p>
+<p>For example, if <code>_colorBounds</code> is set to 0,500,1000 and <code>_colorValues</code>
+is set to #000000,#FF0000,#FFFFFF (black, red, and white, respectively), then the following will hold.
+Any score 0 or below will be drawn in black, the score 500 will be drawn in pure red, and any score
+of 1000 or more will be drawn in white.  The score 250 is halfway between the black and red thresholds,
+so it would be drawn in a color halfway between #000000 and #FF0000, or about #7F0000.</p>
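<p>The threshold-and-interpolation rule described above can be sketched in Python.  This is only
an illustration of the idea, not the Genome Browser's own drawing code: the function names are
hypothetical, the bounds are assumed to be at least two strictly ascending numbers, and truncating
each color channel to an integer is an assumption about rounding.</p>

```python
from bisect import bisect_right


def hex_to_rgb(color):
    """Convert '#RRGGBB' into an (r, g, b) tuple of integers."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def score_to_color(score, bounds, colors):
    """Map a score to '#RRGGBB' by linear interpolation between thresholds.

    Assumes bounds holds at least two strictly ascending numbers and
    colors holds one '#RRGGBB' string per bound.
    """
    # Scores outside the bounds are pinned to the nearest boundary.
    clamped = min(max(score, bounds[0]), bounds[-1])
    # Find the pair of adjacent thresholds that brackets the score.
    hi = min(bisect_right(bounds, clamped), len(bounds) - 1)
    lo = hi - 1
    frac = (clamped - bounds[lo]) / (bounds[hi] - bounds[lo])
    lo_rgb, hi_rgb = hex_to_rgb(colors[lo]), hex_to_rgb(colors[hi])
    # Interpolate each channel separately (truncation is an assumption).
    mixed = tuple(int(a + (b - a) * frac) for a, b in zip(lo_rgb, hi_rgb))
    return "#{:02X}{:02X}{:02X}".format(*mixed)


print(score_to_color(250, [0, 500, 1000], ["#000000", "#FF0000", "#FFFFFF"]))
```

<p>Calling the function with different scores is a quick way to preview a color scheme before
building the bigBed file.</p>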
+<p>The <code>_scoreArray</code> and <code>_labelArray</code> fields contain the actual scores
+and mouseover text for each cell within the heatmap in row-major order, meaning the list of scores
+will fill the first row, then overflow to begin filling the second row, then the third row and so
+on.  Because these are comma-separated lists, a cell can be left empty simply by placing nothing
+between the commas that demarcate it.</p>
+<p>For example, if blockCount is 3 and _rowCount is 2, then the line will describe a 6-cell heatmap
+with 2 rows and 3 columns.  If the accompanying <code>_scoreArray</code> is "1,0,2,,0.5,,", then
+that will describe a heatmap where the top-left corner has the score 1, the top-middle score is 0,
+the top-right score is 2, and the bottom-middle score is 0.5.  The bottom-left and bottom-right
+cells will be left empty, as no score was provided for them.</p>
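<p>To make the row-major layout concrete, here is a minimal Python sketch that splits a score
list back into rows.  The helper name <code>parse_score_array</code> is hypothetical and is not part
of any Genome Browser tool; it assumes that empty cells are written as nothing between two commas
and that a single trailing comma may follow the last cell.</p>

```python
def parse_score_array(score_array, row_count, column_count):
    """Split a row-major score list into rows; empty cells become None."""
    fields = score_array.split(",")
    cell_count = row_count * column_count
    # A trailing comma leaves one extra empty field at the end; drop it.
    if len(fields) == cell_count + 1 and fields[-1] == "":
        fields.pop()
    if len(fields) != cell_count:
        raise ValueError(f"expected {cell_count} cells, found {len(fields)}")
    cells = [float(field) if field else None for field in fields]
    # Slice the flat list into consecutive rows of column_count cells.
    return [
        cells[row * column_count:(row + 1) * column_count]
        for row in range(row_count)
    ]


# A hypothetical heatmap with two rows and four columns.
grid = parse_score_array(",2.8,,,,-4,8.9,,", row_count=2, column_count=4)
print(grid)  # [[None, 2.8, None, None], [None, -4.0, 8.9, None]]
```

<p>A check like this can catch a miscounted comma before the file ever reaches
<code>bedToBigBed</code>.</p>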
+<p>This format is a bit awkward to describe with only 6 cells; it becomes difficult or impossible
+to edit manually when the heatmaps grow to hundreds or even thousands of cells.
+We strongly recommend the use of automated scripts to create these files.
+</p>
+<p>
+The final field is <code>legend</code>, which is much simpler.  The text in this field is used to
+create a legend for the heatmap that will be displayed at the top after the "name" from the
+BED file.
+</p>
+
+<a id="gettingStarted"></a>
+<h2>Getting started with heatmaps</h2>
+<p>
+<strong>The basic BED fields</strong>:<br>
+As noted above, it is difficult to manage sizeable heatmap examples without resorting to
+scripting and automation.  Small examples are the easiest ones to experiment with.  For the
+sake of this example, imagine that we already have a list of two transcripts, T1 and T2, that
+we want to make heatmaps for.  T1 and T2 themselves are described in a BED file like so:</p>
+<pre>
+chr1 1000 2000 T1 1000 + 1000 2000 0 2 200,200 0,800
+chr1 1200 2400 T2 1000 - 1200 2400 0 3 500,200,300 0,600,900
+</pre>
+<p>Separately, we have a list of four heatmap boxes that we want to draw for each
+of these transcripts.  For T1, the heatmap boxes cover bases 100-200 of the
+transcript (1100-1200 in the genome) and 700-800 (1700-1800 in the genome),
+each with two boxes labeled Case 1 and Case 2.  For T2, the labels are the same
+but the bases covered are 200-300 and 900-1000 (1400-1500 and 2100-2200 in the genome).</p>
+<p>Creating heatmaps for
+the transcripts means changing the exon structure of each one - instead of
+representing exon boundaries, those fields will be used to describe the regions
+where we want to draw the heatmap boxes.  We'll also need to add extra fields
+with the remainder of the heatmap data, but let's start with the exons.  We
+immediately have a problem - in a BED file, the exons are expected to span the
+length of the item.  If our first line is intended to show a transcript on chr1
+from base 1000 to 2000, then the first exon needs to start at 1000 and the last
+exon needs to end at 2000, even though we don't have heatmap data for those
+"exons".  There are two ways to get around this.</p>
+<p>Option 1 is to simply reduce the size of each transcript to match the extent of where
+we want to draw the heatmap.  For that first line, if we only have heatmap data for bases
+100-200 and 700-800 in the transcript (bases 1100-1200 and 1700-1800 in the genome), then
+those are the new bounds for our transcript (note that this means changing chromStart/End,
+thickStart/End, and the relative chromStarts values for the exons):</p>
+<pre>
+chr1 1100 1800 T1 1000 + 1100 1800 0 2 100,100 0,600
+chr1 1400 2200 T2 1000 - 1400 2200 0 2 100,100 0,700
+</pre>
+<p>Note that in addition to changing the blockCount, blockSizes, and chromStarts, we also
+needed to change thickStart and thickEnd to indicate where the first exon "starts" and
+the last one "ends".
+</p>
+<p>Option 2 is to add a fake exon on each edge of the transcript to pad out the exon list
+(assuming that our heatmap data doesn't already reach the edges of the transcript - if
+it does, then this problem goes away).  These fake exons can then be associated with no
+score value in the list of heatmap scores, which means no heatmap color will be drawn
+there, but the bounding box of the heatmap will still extend for the full length of
+the transcript (1000-2000).</p>
+<pre>
+chr1 1000 2000 T1 1000 + 1000 2000 0 4 1,100,100,1 0,100,700,999
+chr1 1200 2400 T2 1000 - 1200 2400 0 4 1,100,100,1 0,200,900,1199
+</pre>
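<p>Option 2 is easy to automate.  The sketch below is one possible way to derive the padded
<code>blockCount</code>, <code>blockSizes</code>, and <code>chromStarts</code> values from a list of
heatmap regions; the function name and its arguments are hypothetical, and it assumes the regions are
sorted, non-overlapping, and given as 0-based half-open genomic coordinates inside the transcript.</p>

```python
def padded_blocks(tx_start, tx_end, regions):
    """Return (blockCount, blockSizes, chromStarts) padded with 1-base blocks.

    regions is a sorted list of non-overlapping (start, end) genomic
    intervals, 0-based and half-open like BED coordinates.
    """
    # Convert genomic intervals to (relative start, size) pairs.
    blocks = [(start - tx_start, end - start) for start, end in regions]
    length = tx_end - tx_start
    # Pad the left edge if no region starts at the transcript start.
    if not blocks or blocks[0][0] != 0:
        blocks.insert(0, (0, 1))
    # Pad the right edge if the last block stops short of the transcript end.
    last_start, last_size = blocks[-1]
    if last_start + last_size != length:
        blocks.append((length - 1, 1))
    sizes = ",".join(str(size) for _, size in blocks)
    starts = ",".join(str(start) for start, _ in blocks)
    return len(blocks), sizes, starts


print(padded_blocks(1000, 2000, [(1100, 1200), (1700, 1800)]))
# (4, '1,100,100,1', '0,100,700,999')
```

<p>The padding blocks produced this way still need empty entries in the score and label lists
so that no heatmap color is drawn for them.</p>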
+<p>Either approach might be suitable for your use case.  For this example, we'll continue
+with Option 2.
+</p>
+<p>
+<strong>The extra fields:</strong><br>
+Now that we've filled in the basic BED fields, we need to populate the remaining seven
+heatmap-specific fields.  As we said earlier, in this example we have two rows that
+we want to draw in each heatmap, which we've decided to label "Case 1" and "Case 2".
+This means that our <code>_rowCount</code> value will be 2, and the <code>_labels</code>
+value will be "Case 1","Case 2".  Here's the updated BED, though we're not done yet:</p>
+<pre>
+chr1 1000 2000 T1 1000 + 1000 2000 0 4 1,100,100,1 0,100,700,999 2 "Case 1","Case 2"
+chr1 1200 2400 T2 1000 - 1200 2400 0 4 1,100,100,1 0,200,900,1199 2 "Case 1","Case 2"
+</pre>
+<p>The next two fields, <code>_colorBounds</code> and <code>_colorValues</code>, are tied together.
+A heatmap displays a gradient of color, which is intended to convey score information.  Frequently
+only two color bounds are used, for the minimum and maximum values, and intermediate scores are transformed
+into a color shade between the two.  If the maximum score is 5, associated with red
+(<span style="background-color:red; color:red;">#</span>), and the minimum score is -5, associated with blue
+(<span style="background-color:blue; color:blue;">#</span>), then a score of 0 would be purple - midway
+between the two (<span style="background-color:#7f007f; color:#7f007f;">#</span>).  That would be a
+heatmap with two colorBounds and two colorValues - the bounds are 5 and -5, and the values are red and blue.
+If you narrowed the colorBounds to 3 and -3, you'd keep the same color scheme, but some of your heatmap scores
+would now saturate at full red or blue because they went past the outer score boundary on either side.</p>
+<p>In some situations, though, you might want to keep the association of red with positive scores and blue
+with negative ones without mixing the two.  A score of 0 should instead be drawn in white, positive scores
+should range from faint red (for positive scores close to 0) to full red (for positive scores at or exceeding
+the max threshold).  Negative scores, meanwhile, should be drawn in blue (faint blue close to 0, intense blue
+at the minimum threshold).  This display would use three colorBounds and three colorValues - the bounds
+are -5, 0, and 5, and the values are blue, white, and red.  Here is an example of what that would look like
+in our BED:</p>
+<pre>
+chr1 1000 2000 T1 1000 + 1000 2000 0 4 1,100,100,1 0,100,700,999 2 "Case 1","Case 2" 3 -5,0,5 #0000ff,#ffffff,#ff0000
+chr1 1200 2400 T2 1000 - 1200 2400 0 4 1,100,100,1 0,200,900,1199 2 "Case 1","Case 2" 3 -5,0,5 #0000ff,#ffffff,#ff0000
+</pre>
+<p>Note that we're using hexadecimal RGB colors here, and that the thresholds must be listed in ascending order by score.
+</p>
+<p>
+Finally, we need to add the scores themselves, the mouseover labels for each cell, and a legend for the figure.
+The scores are whatever your data scores are - we're using some random values here.  If you don't have data for a
+particular cell, for example if Case 1 doesn't have a score for the second exon, just leave that value empty in the
+comma-separated list of scores.  Also remember that in this example, we included phantom exons to pad out the BED
+window to the length of the full transcript, so we don't have scores for those end-caps either.  The scores list for
+each heatmap is just a long comma-separated list of the entire set of scores for the heatmap, which means we're compressing
+the scores for Case 1 and Case 2 together.  We do this by listing all of the scores for Case 1 first (the first row), followed
+by the scores for Case 2.  If we had a third row, we'd add those scores to the list after the ones for Case 2.  So here,
+if Case 1 has a score of 2.8 for the first exon and no score for the second exon, and Case 2 has a score of -4 for the
+first exon and 8.9 for the second exon, it would look like this:</p>
+<pre>
+chr1 1000 2000 T1 1000 + 1000 2000 0 4 1,100,100,1 0,100,700,999 2 "Case 1","Case 2" 3 -5,0,5 #0000ff,#ffffff,#ff0000 ,2.8,,,,-4,8.9,,
+chr1 1200 2400 T2 1000 - 1200 2400 0 4 1,100,100,1 0,200,900,1199 2 "Case 1","Case 2" 3 -5,0,5 #0000ff,#ffffff,#ff0000 ,0.1,-2,,,1.1,-3.5,,
+</pre>
+<p>
+If you are looking closely, you will note that there is an extra comma at the end of the scores list.
+We need that comma because the final score value is empty, and without it an intentionally empty
+final value would be indistinguishable from the end of the list.  The extra comma marks the empty
+cell explicitly.  If the final value had data (like 3.5), then we could end the list with 3.5
+and skip the final comma (though including it would also be okay - we assume there isn't a score after
+the final comma, so "3.5" and "3.5," will be treated the same way).</p>
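<p>When the scores are generated by a script, the trailing-comma rule can be applied automatically.
The following Python sketch flattens a grid of scores into a row-major list; the function name is
hypothetical, and it assumes missing cells are represented as <code>None</code>.</p>

```python
def format_score_array(rows):
    """Flatten rows of scores (None = no data) into a row-major string."""
    cells = [cell for row in rows for cell in row]
    text = ",".join("" if cell is None else str(cell) for cell in cells)
    # An empty final cell needs an explicit trailing comma to be counted.
    if cells and cells[-1] is None:
        text += ","
    return text


# Two rows, four columns; the first and last columns are padding blocks.
rows = [
    [None, 2.8, None, None],
    [None, -4, 8.9, None],
]
print(format_score_array(rows))  # ,2.8,,,,-4,8.9,,
```

<p>Appending the comma unconditionally would also work, since a trailing comma after a real value
is ignored.</p>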
+<p>
+All that is left now is to fill in a similarly-structured comma-separated list for cell-specific
+mouseover labels and a legend.  The mouseover labels can be used to do things like indicate the
+numerical score value (because viewers will otherwise only see the heatmap color) or provide other
+useful contextual information like "no data" or the HGVS term describing a mutation at that position.</p>
+<pre>
+chr1 1000 2000 T1 1000 + 1000 2000 0 4 1,100,100,1 0,100,700,999 2 "Case 1","Case 2" 3 -5,0,5 #0000ff,#ffffff,#ff0000 ,2.8,,,,-4,8.9,, ,"2.8, medium","no data",,,"-4, low","8.9, extreme",, "Example on transcript 1"
+chr1 1200 2400 T2 1000 - 1200 2400 0 4 1,100,100,1 0,200,900,1199 2 "Case 1","Case 2" 3 -5,0,5 #0000ff,#ffffff,#ff0000 ,0.1,-2,,,1.1,-3.5,, ,"0.1, negligible","-2, low",,,"1.1 marginal","-3.5 low",, "Example on transcript 2"
+</pre>
+<p>Here is this example in a <a href="examples/heatmap.bed" target="_blank">BED file</a> (using tabs as field
+separators), and the corresponding <a href="examples/heatmap.bb" target="_blank">bigBed file</a>.  The bigBed
+was created from the bed file using the following command:</p>
+<pre>
+bedToBigBed -tab -type=bed12+ -as=heatmap.as heatmap.bed chrom.sizes heatmap.bb
+</pre>
+<p>A copy of heatmap.as is available <a href="examples/heatmap.as" target="_blank">here</a>.  chrom.sizes
+files for most assemblies can be found on our
+<a href="https://hgdownload.soe.ucsc.edu" target="_blank">download server</a>.</p>
+
+<div class="text-center">
+  <a href="http://genome.ucsc.edu/s/jcasper/heatmap_example" target="_blank">
+  <img src="/images/heatmap_example2.png" style="width:80%;max-width:1083px"></a>
+</div>
+
+<a id="troubleshooting"></a>
+<h2>Troubleshooting</h2>
+<p>
+The most likely place to encounter errors when building a heatmap file is when running the
+<code>bedToBigBed</code> program.  The score and label arrays can be difficult to organize,
+and we highly recommend making use of a bit of scripting to automate the process.  The
+errors reported by bedToBigBed are usually helpful for identifying which part of the input
+isn't organized correctly, but please <a href="../../contacts.html">contact us</a> if you
+continue to have issues.
+</p>
+
+<!--#include virtual="$ROOT/inc/gbPageEnd.html" -->