7095388baafcb7a9caac1086e9d737ea53b7151d jcasper Mon Mar 9 09:45:54 2026 -0700 Docs tutorial for building a heatmap, refs #36176 diff --git src/hg/htdocs/goldenPath/help/heatmap.html src/hg/htdocs/goldenPath/help/heatmap.html new file mode 100755 index 00000000000..0a81bdecdea --- /dev/null +++ src/hg/htdocs/goldenPath/help/heatmap.html @@ -0,0 +1,232 @@ + + + + + + + +

Positional Heatmap Display

+ +

Overview

+

The standard display mode for a bigBed track is a simple block or exon/intron marker +in the window. Extra fields in the bigBed, however, can contain a variety of additional data. +When data in the extra fields meet the schema described below, then the simple block display +can be replaced with a positional heatmap. The heatmap provides a sparse 2-dimensional +grid for information like expression of allele-specific point mutations across a transcript.

+ +
+ + +
+ +

Contents

+ +
The Extra Fields
+
Getting Started
+
Troubleshooting
+ + +

The Extra Fields

+

A heatmap bigBed file starts with the standard 12 BED fields, but adds 7 more. Moreover, +the blockSizes and chromStarts fields take on a slightly different interpretation. +

+
+    string chrom;      "Chromosome (or contig, scaffold, etc.)"
+    uint   chromStart; "Start position in chromosome"
+    uint   chromEnd;   "End position in chromosome"
+    string name;       "Name of item"
+    uint   score;      "Score from 0-1000"
+    char[1] strand;    "+ or -"
+    uint thickStart;   "Start of where display should be thick (start codon)"
+    uint thickEnd;     "End of where display should be thick (stop codon)"
+    uint reserved;     "Used as itemRgb as of 2004-11-22"
+    int blockCount;    "Number of blocks"
+    int[blockCount] blockSizes; "Comma separated list of block sizes"
+    int[blockCount] chromStarts; "Start positions relative to chromStart"
+    int _rowCount;    "Number of heatmap rows"
+    string[_rowCount] _labels; "Comma separated list of row labels"
+    lstring _colorBounds; "Comma-separated list of threshold scores for colors"
+    lstring _colorValues; "Comma-separated list of colors, one for each threshold score"
+    lstring _scoreArray; "Comma-separated row-first list of scores, ,, indicates N/A"
+    lstring _labelArray; "Comma-separated row-first list of mouseover labels, ,, indicates N/A"
+    lstring legend; "Legend"
+
+

+The _rowCount field describes how many rows exist in each item's heatmap, while the blockCount field +describes the number of columns. The _labels field provides the labels for each row of the heatmap; +no column labels are currently supported, as the columns are already tied to positions.

+

The next two fields are a bit more complicated. Heatmaps convert numerical scores into +different colors for the heatmap cells, but the way the scores are translated into colors +is controlled by the _colorBounds and _colorValues arrays. Conceptually, +these settings pick some score thresholds and associate colors with those scores. Any score between +two adjacent thresholds receives a color that is interpolated between the colors of those two thresholds. Any +score outside the bounds takes the color of the nearest boundary.

+

For example, if _colorBounds is set to 0,500,1000 and _colorValues +is set to #000000,#FF0000,#FFFFFF (black, red, and white, respectively), then the following will hold. +Any score 0 or below will be drawn in black, the score 500 will be drawn in pure red, and any score +of 1000 or more will be drawn in white. The score 250 is halfway between the black and red thresholds, +so it would be drawn in a color halfway between #000000 and #FF0000, or about #490000.
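The threshold lookup described above can be sketched in Python. This is only an illustration, not the browser's own code: the function name locate_score is hypothetical, and because this page does not spell out how the two neighboring colors are blended, the sketch stops at reporting which two thresholds bracket a score and how far between them it falls.

```python
from bisect import bisect_right


def locate_score(score, bounds):
    """Return (lower_index, upper_index, fraction) for a score.

    bounds must be listed in ascending order. A score outside the range is
    clamped to the nearest boundary, so both indices are the same there.
    """
    if score <= bounds[0]:
        return 0, 0, 0.0
    if score >= bounds[-1]:
        last = len(bounds) - 1
        return last, last, 0.0
    upper = bisect_right(bounds, score)  # first threshold above the score
    lower = upper - 1
    fraction = (score - bounds[lower]) / (bounds[upper] - bounds[lower])
    return lower, upper, fraction


bounds = [0, 500, 1000]
colors = ["#000000", "#FF0000", "#FFFFFF"]

lower, upper, fraction = locate_score(250, bounds)
# 250 sits halfway between the thresholds for black and red
print(colors[lower], colors[upper], fraction)
```

A fraction of 0.0 means the score takes the color at lower_index unchanged, which is what happens at an exact threshold and for any score clamped at either end.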

+

The _scoreArray and _labelArray fields contain the actual scores +and mouseover text for each cell within the heatmap in row-major order, meaning the list of scores +will fill the first row, then overflow to begin filling the second row, then the third row and so +on. Because these are comma-separated lists, a cell can be left empty simply by placing nothing +between the commas that demarcate it.

+

For example, if blockCount is 3 and _rowCount is 2, then the line will describe a 6-cell heatmap +with 2 rows and 3 columns. If the accompanying _scoreArray is "1,0,2,,0.5,,", then +that will describe a heatmap where the top-left corner has the score 1, the top-middle score is 0, +the top-right score is 2, and the bottom-middle score is 0.5. The bottom-left and bottom-right +cells will be left empty, as no score was provided for them.
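As a rough sketch of the row-major layout, the following Python splits a score list into rows. The function name parse_score_array is hypothetical, and empty cells are represented here as None purely for illustration. A final comma is treated as the end of the list rather than as an additional cell.

```python
def parse_score_array(score_array, columns):
    """Split a comma-separated, row-major score list into rows of cells."""
    cells = score_array.split(",")
    # A trailing comma ends the list; it does not introduce another cell
    if cells and cells[-1] == "":
        cells.pop()
    values = [float(cell) if cell else None for cell in cells]
    return [values[i:i + columns] for i in range(0, len(values), columns)]


# The example list, laid out over three columns, fills two rows
for row in parse_score_array("1,0,2,,0.5,,", columns=3):
    print(row)
```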

+

This format is a bit awkward to describe with only 6 cells; it becomes difficult or impossible +to edit manually when the heatmaps reach sufficient size (hundreds or even thousands of cells). +We strongly recommend the use of automated scripts to create these files. +

+

+The final field is legend, which is much simpler. The text in this field is used to +create a legend for the heatmap that will be displayed at the top after the "name" from the +BED file. +

+ + +

Getting started with heatmaps

+

+The basic bed fields:
+As noted above, it is difficult to manage sizeable heatmap examples without resorting to +scripting and automation. Small examples are the easiest ones to experiment with. For the +sake of this example, imagine that we already have a list of two transcripts, T1 and T2, that +we want to make heatmaps for. T1 and T2 themselves are described in a bed file like so:

+
+chr1 1000 2000 T1 1000 + 1000 2000 0 2 200,200, 0,800
+chr1 1200 2400 T2 1000 - 1200 2400 0 3 500,200,300 0,600,900
+
+

Separately, we have a list of four heatmap boxes that we want to draw for each +of these transcripts. For T1, the heatmap boxes cover bases 100-200 of the +transcript (1100-1200 in the genome) and 700-800 (1700-1800 in the genome), +each with two boxes labeled Case 1 and Case 2. For T2, the labels are the same +but the bases covered are 200-300 and 900-1000 (1400-1500 and 2100-2200 in the genome).

Creating heatmaps for +the transcripts means changing the exon structure of each one - instead of +representing exon boundaries, those fields will be used to describe the regions +where we want to draw the heatmap boxes. We'll also need to add extra fields +with the remainder of the heatmap data, but let's start with the exons. We +immediately have a problem - in a BED file, the exons are expected to span the +length of the item. If our first line is intended to show a transcript on chr1 +from base 1000 to 2000, then the first exon needs to start at 1000 and the last +exon needs to end at 2000, even though we don't have heatmap data for those +"exons". There are two ways to get around this.

+

Option 1 is to simply reduce the size of each transcript to match the extent of where +we want to draw the heatmap. For that first line, if we only have heatmap data for bases +100-200 and 700-800 in the transcript (bases 1100-1200 and 1700-1800 in the genome), then +those are the new bounds for our transcript (note that this means changing chromStart/End, +thickStart/End, and the relative chromStarts values for the exons):

+
+chr1 1100 1800 T1 1000 + 1100 1800 0 2 100,100 0,600
+chr1 1400 2200 T2 1000 - 1400 2200 0 2 100,100 0,700
+
+

Note that in addition to changing chromStart, chromEnd, blockCount, blockSizes, and chromStarts, we also +needed to change thickStart and thickEnd so that they stay within the new bounds of each item. +
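The arithmetic behind Option 1 can be sketched as a small Python helper. The name trim_to_regions and its argument layout are hypothetical; regions are given as (start, end) offsets relative to the original transcript start, sorted and non-overlapping.

```python
def trim_to_regions(chrom_start, regions):
    """Shrink an item to the span covered by its heatmap regions.

    Returns (chromStart, chromEnd, blockSizes, chromStarts). thickStart and
    thickEnd can reuse the new chromStart and chromEnd.
    """
    first_offset = regions[0][0]
    new_start = chrom_start + first_offset
    new_end = chrom_start + regions[-1][1]
    block_sizes = [end - start for start, end in regions]
    # chromStarts are relative to the new chromStart, so the first is always 0
    block_starts = [start - first_offset for start, _ in regions]
    return new_start, new_end, block_sizes, block_starts


print(trim_to_regions(1000, [(100, 200), (700, 800)]))   # T1
print(trim_to_regions(1200, [(200, 300), (900, 1000)]))  # T2
```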

+

Option 2 is to add a fake exon on each edge of the transcript to pad out the exon list +(assuming that our heatmap data doesn't already reach the edges of the transcript - if +it does, then this problem goes away). These fake exons can then be associated with no +score value in the list of heatmap scores, which means no heatmap color will be drawn +there, but the bounding box of the heatmap will still extend for the full length of +the transcript (1000-2000).

+
+chr1 1000 2000 T1 1000 + 1000 2000 0 4 1,100,100,1 0,100,700,999
+chr1 1200 2400 T2 1000 - 1200 2400 0 4 1,100,100,1 0,200,900,1199
+
+

Either approach might be suitable for your use case. For this example, we'll continue +with Option 2. +
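For Option 2, the padding blocks can be generated rather than typed by hand. The sketch below assumes a hypothetical helper named pad_with_end_caps that takes the transcript length and the sorted (start, end) offsets of the heatmap regions, and adds a 1-base block at each end that the regions do not already reach.

```python
def pad_with_end_caps(length, regions):
    """Return (blockSizes, chromStarts) padded to span the whole item.

    length:  chromEnd - chromStart of the transcript
    regions: sorted, non-overlapping (start, end) offsets in the transcript
    """
    block_sizes = [end - start for start, end in regions]
    block_starts = [start for start, _ in regions]
    if block_starts[0] > 0:
        # 1-base cap at the very start of the item
        block_sizes.insert(0, 1)
        block_starts.insert(0, 0)
    if regions[-1][1] < length:
        # 1-base cap ending exactly at chromEnd
        block_sizes.append(1)
        block_starts.append(length - 1)
    return block_sizes, block_starts


print(pad_with_end_caps(1000, [(100, 200), (700, 800)]))   # T1
print(pad_with_end_caps(1200, [(200, 300), (900, 1000)]))  # T2
```

Each cap adds a column to the heatmap, so the score and label lists need an empty entry at the matching position in every row.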

+

+The extra fields:
+Now that we've filled in the basic BED fields, we need to populate the remaining seven +heatmap-specific fields. As we said earlier, in this example we have two rows that +we want to draw in each heatmap, which we've decided to label "Case 1" and "Case 2". +This means that our _rowCount value will be 2, and the _labels +value will be "Case 1","Case 2". Here's the updated BED, though we're not done yet:

+
+chr1 1000 2000 T1 1000 + 1000 2000 0 4 1,100,100,1 0,100,700,999 2 "Case 1","Case 2"
+chr1 1200 2400 T2 1000 - 1200 2400 0 4 1,100,100,1 0,200,900,1199 2 "Case 1","Case 2"
+
+

The next two fields, _colorBounds and _colorValues, are tied together. +A heatmap displays a gradient of color, which is intended to convey score information. Frequently +only two color bounds are used, for the minimum and maximum values, and intermediate scores are transformed +into a color shade between the two. If the maximum score is 5, associated with red +(#ff0000), and the minimum score is -5, associated with blue +(#0000ff), then a score of 0 would be purple, midway +between the two. That would be a +heatmap with two colorBounds and two colorValues - the bounds are -5 and 5, and the values are blue and red. +If you narrowed the colorBounds to -3 and 3, you'd keep the same color scheme, but some of your heatmap scores +would now saturate at full red or blue because they went past the outer score boundary on either side.

+

In some situations, though, you might want to keep the association of red with positive scores and blue +with negative ones without mixing the two. A score of 0 should instead be drawn in white, positive scores +should range from faint red (for positive scores close to 0) to full red (for positive scores at or exceeding +the max threshold). Negative scores, meanwhile, should be drawn in blue (faint blue close to 0, intense blue +at the minimum threshold). This display would use three colorBounds and three colorValues - the bounds +are -5, 0, and 5, and the values are blue, white, and red. Here is an example of what that would look like +in our BED:

+
+chr1 1000 2000 T1 1000 + 1000 2000 0 4 1,100,100,1 0,100,700,999 2 "Case 1","Case 2" -5,0,5 #0000ff,#ffffff,#ff0000
+chr1 1200 2400 T2 1000 - 1200 2400 0 4 1,100,100,1 0,200,900,1199 2 "Case 1","Case 2" -5,0,5 #0000ff,#ffffff,#ff0000
+
+

Note that we're using hexadecimal RGB colors here, and that the thresholds must be listed in ascending order by score. +

+

+Finally, we need to add the scores themselves, the mouseover labels for each cell, and a legend for the figure. +The scores are whatever your data scores are - we're using some random values here. If you don't have data for a +particular cell, for example if Case 1 doesn't have a score for the second exon, just leave that value empty in the +comma-separated list of scores. Also remember that in this example, we included phantom exons to pad out the BED +window to the length of the full transcript, so we don't have scores for those end-caps either. The scores list for +each heatmap is just a long comma-separated list of the entire set of scores for the heatmap, which means we're compressing +the scores for Case 1 and Case 2 together. We do this by listing all of the scores for Case 1 first (the first row), followed +by the scores for Case 2. If we had a third row, we'd add those scores to the list after the ones for Case 2. So here, +if Case 1 has a score of 2.8 for the first exon and no score for the second exon, and Case 2 has a score of -4 for the +first exon and 8.9 for the second exon, it would look like this:

+
+chr1 1000 2000 T1 1000 + 1000 2000 0 4 1,100,100,1 0,100,700,999 2 "Case 1","Case 2" -5,0,5 #0000ff,#ffffff,#ff0000 ,2.8,,,,-4,8.9,,
+chr1 1200 2400 T2 1000 - 1200 2400 0 4 1,100,100,1 0,200,900,1199 2 "Case 1","Case 2" -5,0,5 #0000ff,#ffffff,#ff0000 ,0.1,-2,,,1.1,-3.5,,
+
+

+If you are looking closely, you will note that there is an extra comma at the end of the scores list. +We need that comma because the final score value is empty; without it, there would be no way to tell +an intentionally empty final cell from the end of the list. If the final value had data (like 3.5), then we could end the list with 3.5 +and skip the final comma (though including it would also be okay - we assume there isn't a score after +the final comma, so "3.5" and "3.5," will be treated the same way).
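The trailing-comma rule is easy to get wrong by hand, so here is a minimal Python sketch of writing the scores list from a grid. The function name format_score_array is hypothetical, and None stands in for a cell with no score.

```python
def format_score_array(rows):
    """Flatten rows of scores into a row-major comma-separated list."""
    cells = ["" if value is None else str(value)
             for row in rows for value in row]
    text = ",".join(cells)
    # An empty final cell needs one more comma so it is not mistaken
    # for the end of the list
    if cells and cells[-1] == "":
        text += ","
    return text


t1_scores = [
    [None, 2.8, None, None],  # Case 1: end cap, exon 1, exon 2, end cap
    [None, -4, 8.9, None],    # Case 2
]
print(format_score_array(t1_scores))
```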

+

+All that is left now is to fill in a similarly-structured comma-separated list for cell-specific +mouseover labels and a legend. The mouseover labels can be used to do things like indicate the +numerical score value (because viewers will otherwise only see the heatmap color) or provide other +useful contextual information like "no data" or the HGVS term describing a mutation at that position.

+
+chr1 1000 2000 T1 1000 + 1000 2000 0 4 1,100,100,1 0,100,700,999 2 "Case 1","Case 2" -5,0,5 #0000ff,#ffffff,#ff0000 ,2.8,,,,-4,8.9,, ,"2.8, medium","no data",,,"-4, low","8.9, extreme",, "Example on transcript 1"
+chr1 1200 2400 T2 1000 - 1200 2400 0 4 1,100,100,1 0,200,900,1199 2 "Case 1","Case 2" -5,0,5 #0000ff,#ffffff,#ff0000 ,0.1,-2,,,1.1,-3.5,, ,"0.1, negligible","-2, low",,,"1.1, marginal","-3.5, low",, "Example on transcript 2"
+
+
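Since scripting is the recommended way to produce these files, the following Python sketch shows one way a full line might be assembled, with one tab-separated value per field of the schema listed earlier (12 standard BED fields plus the 7 heatmap fields). The function name heatmap_line and its parameters are hypothetical. The score and label lists are passed in already serialized, and the quoting of labels simply mirrors the example lines on this page.

```python
def heatmap_line(bed12, row_labels, bounds, colors, scores, labels, legend):
    """Join the 12 BED values and 7 heatmap values into one tab-separated line."""
    if len(bed12) != 12:
        raise ValueError("expected 12 standard BED values")
    if len(bounds) != len(colors):
        raise ValueError("each color bound needs exactly one color")
    if list(bounds) != sorted(bounds):
        raise ValueError("color bounds must be in ascending order")
    extra = [
        str(len(row_labels)),             # _rowCount
        ",".join(row_labels),             # _labels
        ",".join(str(b) for b in bounds), # _colorBounds
        ",".join(colors),                 # _colorValues
        scores,                           # _scoreArray
        labels,                           # _labelArray
        legend,                           # legend
    ]
    return "\t".join([str(value) for value in bed12] + extra)


line = heatmap_line(
    ["chr1", 1000, 2000, "T1", 1000, "+", 1000, 2000, 0, 4,
     "1,100,100,1", "0,100,700,999"],
    ['"Case 1"', '"Case 2"'],
    [-5, 0, 5],
    ["#0000ff", "#ffffff", "#ff0000"],
    ",2.8,,,,-4,8.9,,",
    ',"2.8, medium","no data",,,"-4, low","8.9, extreme",,',
    '"Example on transcript 1"',
)
print(line)
```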

Here is this example in a BED file (using tabs as field +separators), and the corresponding bigBed file. The bigBed +was created from the bed file using the following command:

+
+bedToBigBed -tab -type=bed12+ -as=heatmap.as heatmap.bed chrom.sizes heatmap.bb
+
+

A copy of heatmap.as is available here. chrom.sizes +files for most assemblies can be found on our +download server. + +

+ + +
+ + +

Troubleshooting

+

+The most likely place to encounter errors when building a heatmap file is when running the +bedToBigBed program. The score and label arrays can be difficult to organize, +and we highly recommend making use of a bit of scripting to automate the process. The +errors reported by bedToBigBed are usually helpful for identifying which part of the input +isn't organized correctly, but please contact us if you +continue to have issues. +
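Some common mistakes can be caught before running bedToBigBed at all. The Python sketch below is a hypothetical pre-check, not part of the UCSC tools. It assumes the 19-field layout from the schema above, tab-separated input, and that quoted labels may contain commas as in the example lines, which is why it counts cells with Python's csv module.

```python
import csv


def count_cells(array_text):
    """Count cells in a comma-separated list; commas inside quotes don't split."""
    cells = next(csv.reader([array_text]), [])
    if cells and cells[-1] == "":
        cells.pop()  # a trailing comma ends the list
    return len(cells)


def check_line(line):
    """Return a list of problems found in one tab-separated heatmap line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 19:
        return [f"expected 19 fields, found {len(fields)}"]
    problems = []
    block_count = int(fields[9])
    row_count = int(fields[12])
    checks = [
        ("blockSizes", 10, block_count),
        ("chromStarts", 11, block_count),
        ("_labels", 13, row_count),
        ("_scoreArray", 16, block_count * row_count),
        ("_labelArray", 17, block_count * row_count),
    ]
    for name, index, expected in checks:
        found = count_cells(fields[index])
        if found != expected:
            problems.append(f"{name}: expected {expected} entries, found {found}")
    bounds = [float(b) for b in fields[14].split(",")]
    if bounds != sorted(bounds):
        problems.append("_colorBounds: thresholds are not in ascending order")
    if len(bounds) != count_cells(fields[15]):
        problems.append("_colorValues: need one color per threshold")
    return problems
```

Running check_line over every line of the BED file and printing any non-empty result gives a quick list of rows to fix; bedToBigBed remains the final authority on whether the file is valid.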

+ +