8b8e3dd40b5f555d315dafce8a4bd64ad07bb1b2 dschmelt Fri Sep 9 12:51:21 2022 -0700 Proofread and edited for clarity refs #29356 diff --git src/hg/htdocs/goldenPath/help/bigRmsk.html src/hg/htdocs/goldenPath/help/bigRmsk.html index 839b9ea..8b061e0 100755 --- src/hg/htdocs/goldenPath/help/bigRmsk.html +++ src/hg/htdocs/goldenPath/help/bigRmsk.html @@ -1,155 +1,287 @@ <!DOCTYPE html> <!--#set var="TITLE" value="Genome Browser bigRmsk RepeatMasker Format" --> <!--#set var="ROOT" value="../.." --> <!-- Relative paths to support mirror sites with non-standard GB docs install --> <!--#include virtual="$ROOT/inc/gbPageStart.html" --> <h1>bigRmsk Track Format</h1> <p> The bigRmsk format allows for the display of annotations of a genome generated by the <a href="http://www.repeatmasker.org/" target="_blank">RepeatMasker</a> program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. -It is the recommend method of adding RepeatMaster tracks to assembly hubs. -For a descriptions of this features of this track type, with examples, see -<a href="bigRmskTrackDescExample.html">standard bigRmsk track description</a>. + It is the recommended method of adding RepeatMasker tracks to assembly hubs. +</p> <p> The bigRmsk format enables taking the annotation output of RepeatMasker and converting it into a compressed and indexed <a href="/goldenPath/help/bigBed.html">bigBed</a> file. Please see this page for details of the bigBed format, its use, and associated tools. </p> +<h2>Display Conventions and Configuration</h2> + +<h4>Context Sensitive Zooming</h4> +<p> + This track employs a technique which chooses the appropriate visual representation for the data based on the + zoom scale and the number of annotations currently in view. The track will automatically switch from the + most detailed visualization ('Full' mode) to the denser view ('Pack' mode) when the window size is greater + than 45kb of sequence. 
It will further switch to the even denser single line view ('Dense' mode) if more than + 500 annotations are present in the current view. +</p> +<h4>Dense Mode Visualization</h4> +<p> + In dense display mode, a single line is displayed denoting the coverage of repeats using a series + of colored boxes. The boxes are colored based on the classification of the repeat (see below for legend). +<br> +<br> +<img height="30" width="1250" src="/images/rmskDense.png"> +</p> +<h4>Pack Mode Visualization</h4> +<p> + In pack mode, repeats are represented as sets of joined features. These are color coded as above based on the + class of the repeat, and the further details such as orientation (denoted by chevrons) and a family label are provided. + This family label may be optionally turned off in the track configuration. +<br> +<br> +<img height="100" width="1250" src="/images/rmskPack.png"> +<br> +<br> + The pack display mode may also be configured to resemble the original UCSC repeat track. In this visualization, + repeat features are grouped by classes (see below), and displayed on separate track lines. The repeat ranges are + denoted as grayscale boxes, reflecting both the size of the repeat and + the amount of base mismatch, base deletion, and base insertion associated with a repeat element. + The higher the combined number of this divergence from the reference, the lighter the shading. +<br> +<br> +<img height="100" width="1250" src="/images/rmskOrigPack.png"> +</p> +<h4>Full Mode Visualization</h4> +<p> + In the most detailed visualization, repeats are displayed as chevron boxes, indicating the size and orientation of + the repeat. The interior grayscale shading represents the divergence of the repeat (see above) while the outline color + represents the class of the repeat. 
Dotted lines above the repeat and extending left or right + indicate the length of unaligned repeat model sequence and provide context for where a repeat fragment originates in its + consensus or pHMM model. If the length of the unaligned sequence + is large, an interruption line and bp size is indicated instead of drawing the extension to scale. +<br> +<br> +<img height="125" width="1250" src="/images/rmskFull.png"> +</p> + +<p> + For example, the following repeat is a SINE element in the forward orientation with average + divergence. Only the 5' proximal fragment of the consensus sequence is aligned to the genome. + The 3' unaligned length (384bp) is not drawn to scale and is instead displayed using a set of + interruption lines along with the length of the unaligned sequence. +</p> + +<img src="/images/rmskExample1.svg"> + +<p> + Repeats that have been fragmented by insertions or large internal deletions are now represented + by join lines. In the example below, a LINE element is found as two fragments. The solid + connection lines indicate that there are no unaligned consensus bases between the two fragments. + Also note these fragments form the 3' extremity of the repeat, as there is no unaligned consensus + sequence following the last fragment. +</p> + +<img src="/images/rmskExample2.svg"> + +<p> + In cases where there is unaligned consensus sequence between the fragments, the repeat will look like + the following. The dotted line indicates the length of the unaligned sequence between the two + fragments. In this case the unaligned consensus is longer than the actual genomic distance between + these two fragments. +</p> + +<img src="/images/rmskExample3.svg"> + +<p> + If there is consensus overlap between the two fragments, the joining lines will be drawn to indicate + how much of the left fragment is repeated in the right fragment. 
+</p> + +<img src="/images/rmskExample4.svg"> + +<p> + The following table lists the repeat class colors: +</p> + +<table> + <thead> + <tr> + <th style="border-bottom: 2px solid #6678B1;">Color</th> + <th style="border-bottom: 2px solid #6678B1;">Repeat Class</th> + </tr> + </thead> + <tr> + + <td bgcolor="#1F77B4"></td> + <td align="left"><b>SINE</b> - Short Interspersed Nuclear Element</td> + </tr> + <tr> + <td bgcolor="#FF7F0E"></td> + <td align="left"><b>LINE</b> - Long Interspersed Nuclear Element</td> + </tr> + <tr> + <td bgcolor="#2CA02C"></td> + <td align="left"><b>LTR</b> - Long Terminal Repeat</td> + </tr> + <tr> + <td bgcolor="#D62728"></td> + <td align="left"><b>DNA</b> - DNA Transposon</td> + </tr> + <tr> + <td bgcolor="#9467BD"></td> + <td align="left"><b>Simple</b> - Single Nucleotide Stretches and Tandem Repeats</td> + </tr> + <tr> + <td bgcolor="#8C564B"></td> + <td align="left"><b>Low_complexity</b> - Low Complexity DNA</td> + </tr> + <tr> + <td bgcolor="#E377C2"></td> + <td align="left"><b>Satellite</b> - Satellite Repeats</td> + </tr> + <tr> + <td bgcolor="#7F7F7F"></td> + <td align="left"><b>RNA</b> - RNA Repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA)</td> + </tr> + <tr> + <td bgcolor="#BCBD22"></td> + <td align="left"><b>Other</b> - Other Repeats (including class RC - Rolling Circle)</td> + </tr> + <tr> + <td bgcolor="#17BECF"></td> + <td align="left"><b>Unknown</b> - Unknown Classification</td> + </tr> +</table> + + +<p> + A "?" at the end of the "Family" or "Class" (for example, DNA?) + signifies that the curator was unsure of the classification. At some point in the future, + either the "?" 
will be removed or the classification will be changed.</p> + + + <h2 id="bigRmsk">bigRmsk track definitions</h2> <p> - The bigRmsk tracks consist of two bigBed files define by + The bigRmsk tracks consist of two bigBed files defined by <a href="http://www.linuxjournal.com/article/5949" target="_blank">autoSql</a> schema: </p> <ul> - <li>The primary bigRmsk file, define by <a href="examples/bigRmskBed.as"> + <li>The primary bigRmsk file, defined by <a href="examples/bigRmskBed.as"> <em>bigRmskBed.as</em></a>, which has the annotations of repeats. - <li>The secondary bigRmskAlign file, define by <a href="examples/bigRmskAlignBed.as"> + <li>The secondary bigRmskAlign file, defined by <a href="examples/bigRmskAlignBed.as"> <em>bigRmskAlignBed.as</em></a>, which contains the alignments of the consensus repeats to the genome. This file is optional, - if omitted, the bigRmsk track will function, without the ability to view the alignments. + if omitted, the bigRmsk track will function without the ability to view the alignments. </ul> <p> The input files for the bigRmsk files are created from the RepeatMasker <em>*.out</em> and <em>*.align</em> files using the <em>rmToTrackHub.pl</em> program that is included with RepeatMasker. The bigRmsk format is not designed to work with any other type of data. </p> <h2 id="steps">Creating a bigRmsk track</h2> <p> To create a bigRmsk track and its supporting files, follow the below steps. This assumes that you have already run RepeatMasker and have a <em>*.out</em>, and optionally <em>*.align</em> file. </p> <p> - RepeatMasker output files are convert to the bigRmsk textual form using the + RepeatMasker output files are converted to the bigRmsk textual form using the <em>RepeatMasker/util/rmToTrackHub.pl</em> program that is part of the <a href="http://www.repeatmasker.org/RepeatMasker/">RepeatMasker 4.1.3 or newer distribution</a>. 
</p> <p> <strong>Step 1.</strong> If you wish to experiment with quickly building an example track, download the example RepeatMasker output files for the human GRCh38 (hg38) assembly <a href="examples/bigRmskExample.out">bigRmskExample.out</a> and <a href="examples/bigRmskExample.align">bigRmskExample.align</a> used in this tutorial: <pre> - <code> wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskExample.out - wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskExample.align - </code> - </pre> + wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskExample.align</pre> <p> Otherwise, substitute your <em>*.out</em> and <em>*.align</em> in these instructions. Generating the alignment bigRmsk file is optional: if you don't have the <em>*.align</em> files from RepeatMasker, the track will still work, with reduced functionality. Just skip the steps involved in building the alignment files. <p> <strong>Step 2.</strong> Download the autoSql schemas <a href="examples/bigRmskBed.as">bigRmskBed.as</a> and <a href="examples/bigRmskAlignBed.as">bigRmskAlignBed.as</a>: <pre> - <code> wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskBed.as - wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlignBed.as - </code> - </pre> + wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlignBed.as</pre> <p> You will also need a file of chromosome sizes for your genome; for the example, download the hg38 file: <pre> - <code> - wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes - </code> - </pre> + wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes</pre> <p> <strong>Step 3.</strong> Convert the RepeatMasker files to text-format bigRmsk files with <em>rmToTrackHub.pl</em>, which sorts the output for direct input to <em>bedToBigBed</em>: <pre> - <code> - RepeatMasker/util/rmToTrackHub.pl -out bigRmskExample.out -align 
bigRmskExample.align - </code> - </pre> + RepeatMasker/util/rmToTrackHub.pl -out bigRmskExample.out -align bigRmskExample.align</pre> <p> <strong>Step 4.</strong> Build the bigRmsk and optional bigRmskAlign files: <pre> - <code> bedToBigBed -tab -type=bed9+5 -as=bigRmskBed.as bigRmskExample.join.tsv hg38.chrom.sizes bigRmskExample.bb - bedToBigBed -tab -type=bed3+14 -as=bigRmskAlignBed.as bigRmskExample.align.tsv hg38.chrom.sizes bigRmskExampleAlign.bb - </code> - </pre> + bedToBigBed -tab -type=bed3+14 -as=bigRmskAlignBed.as bigRmskExample.align.tsv hg38.chrom.sizes bigRmskExampleAlign.bb</pre> <p> <strong>Step 5.</strong> Place the newly created bigRmsk file (<em>bigRmskExample.bb</em>), and the optional bigRmskAlign file (<em>bigRmskExampleAlign.bb</em>), at a web-accessible http, https or ftp location. </p> <strong>Step 6.</strong> <p> As with other bigBed-based tracks, bigRmsk tracks can be displayed as <a href="hgTracksHelp.html#CustomTracks">custom tracks</a>, included in <a href="hubQuickStart.html">track hubs</a>, or <a href="hubQuickStartAssembly.html">assembly hubs</a>. </p> <p> - The following options are used for bigRmsk custom tracks or trackDb entries: + The following options are used for bigRmsk custom tracks (with an equals sign between key and value) or trackDb hub entries as below: <ul> <li> <code>type bigRmsk</code> <li> <code>bigDataUrl<em> <url></em></code> - URL or relative path of bigRmsk file <li> <code>xrefDataUrl<em> <url></em></code> - URL or relative path of optional bigRmskAlign file </ul> - +<p> A standard bigRmsk track description is available at <a href="../trackDescriptions/bigRmskTrackDesc.html">bigRmskTrackDesc.html</a>, - which can be directly to with as the URL:<br> + which can be directly linked to with the URL:<br> <em>http://genome.ucsc.edu/goldenPath/trackDescriptions/bigRmskTrackDesc.html</em>. - +</p> <p> See the <a href="#examples">Examples</a> section below for detailed examples of bigRmsk custom tracks and track hub definitions. 
</p> <h2 id="examples">Examples</h2> <h3 id="example1">Example of a bigRmsk custom track</h3> <p> Construct a <a href="hgTracksHelp.html#CustomTracks">custom track</a> using a single <a href="hgTracksHelp.html#TRACK">track line</a>. Note that any of the track attributes listed <a href="customTrack.html#TRACK">here</a> are applicable to tracks of type bigBed. <p> To create a custom track using the example bigRmsk file: <ol> @@ -174,22 +306,23 @@ with a stanza such as the following:</p> <pre> track bigRmskExample shortLabel Example bigRmsk longLabel This is an example bigRmsk Track Hub Stanza type bigRmsk visibility full html http://genome.ucsc.edu/goldenPath/trackDescriptions/bigRmskTrackDesc.html bigDataUrl http://genome.ucsc.edu/goldenPath/help/examples/bigRmskExample.bb xrefDataUrl http://genome.ucsc.edu/goldenPath/help/examples/bigRmskExampleAlign.bb </pre> <h2 id="share">Additional information</h2> <p> See the <a href="bigBed.html">bigBed documentation</a> for guidance on - sharing, trouble shooting and extracting data from bigRmsk files. + sharing, troubleshooting, and extracting data from bigRmsk files. </p> <h2 id="credits">Credits</h2> +<p> The bigRmsk system was developed by Robert Hubley of the Institute for Systems Biology. - +</p> <!--#include virtual="$ROOT/inc/gbPageEnd.html" -->
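Reviewer note on the new "Context Sensitive Zooming" text: the switching rule it describes can be sketched in a few lines of Python. This is only an illustration of the documented thresholds, not the Genome Browser's actual implementation. The function and constant names are hypothetical, "45kb" is assumed to mean 45,000 bases, and the 500-annotation check is assumed to apply only after the switch to pack mode (the text says "further switch", which may be read either way).

```python
from enum import Enum

# Thresholds taken from the help text; the names are hypothetical.
FULL_TO_PACK_WINDOW_BP = 45_000    # "greater than 45kb of sequence" (assumes 1 kb = 1,000 bp)
PACK_TO_DENSE_ANNOTATIONS = 500    # "more than 500 annotations"


class DisplayMode(Enum):
    FULL = "full"
    PACK = "pack"
    DENSE = "dense"


def choose_display_mode(window_bp: int, annotation_count: int) -> DisplayMode:
    """Pick a display mode from the window size and the annotations in view.

    Assumption: dense mode is only considered once the window is already
    wide enough to have left full mode.
    """
    if window_bp <= FULL_TO_PACK_WINDOW_BP:
        return DisplayMode.FULL
    if annotation_count > PACK_TO_DENSE_ANNOTATIONS:
        return DisplayMode.DENSE
    return DisplayMode.PACK


if __name__ == "__main__":
    for window, count in [(10_000, 20), (100_000, 120), (2_000_000, 3_500)]:
        print(window, count, choose_display_mode(window, count).value)
```

If the browser instead applies the 500-annotation limit at any zoom level, the two checks would simply swap order. The help text could say which reading is intended.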