e19b35681f137d5bcc141f968112bf77476fac28 markd Tue Sep 13 17:09:32 2022 -0700
added maxWindowToDraw to bigRmsk example
diff --git src/hg/htdocs/goldenPath/help/bigRmsk.html src/hg/htdocs/goldenPath/help/bigRmsk.html
index 8b061e0..497a48f 100755
--- src/hg/htdocs/goldenPath/help/bigRmsk.html
+++ src/hg/htdocs/goldenPath/help/bigRmsk.html
@@ -1,328 +1,329 @@
<!DOCTYPE html> <!--#set var="TITLE" value="Genome Browser bigRmsk RepeatMasker Format" --> <!--#set var="ROOT" value="../.." --> <!-- Relative paths to support mirror sites with non-standard GB docs install --> <!--#include virtual="$ROOT/inc/gbPageStart.html" --> <h1>bigRmsk Track Format</h1> <p> The bigRmsk format allows for the display of annotations of a genome generated by the <a href="http://www.repeatmasker.org/" target="_blank">RepeatMasker</a> program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. It is the recommended method of adding RepeatMasker tracks to assembly hubs. </p> <p> The bigRmsk format enables taking the annotation output of RepeatMasker and converting it into a compressed and indexed <a href="/goldenPath/help/bigBed.html">bigBed</a> file. Please see that page for details of the bigBed format, its use, and associated tools. </p> <h2>Display Conventions and Configuration</h2> <h4>Context Sensitive Zooming</h4> <p> This track employs a technique which chooses the appropriate visual representation for the data based on the zoom scale and the number of annotations currently in view. The track will automatically switch from the most detailed visualization ('Full' mode) to the denser view ('Pack' mode) when the window size is greater than 45kb of sequence. It will further switch to the even denser single line view ('Dense' mode) if more than 500 annotations are present in the current view. </p> <h4>Dense Mode Visualization</h4> <p> In dense display mode, a single line is displayed denoting the coverage of repeats using a series of colored boxes. 
The boxes are colored based on the classification of the repeat (see below for legend). <br> <br> <img height="30" width="1250" src="/images/rmskDense.png"> </p> <h4>Pack Mode Visualization</h4> <p> In pack mode, repeats are represented as sets of joined features. These are color coded as above based on the class of the repeat, and further details, such as orientation (denoted by chevrons) and a family label, are provided. This family label may be optionally turned off in the track configuration. <br> <br> <img height="100" width="1250" src="/images/rmskPack.png"> <br> <br> The pack display mode may also be configured to resemble the original UCSC repeat track. In this visualization, repeat features are grouped by classes (see below), and displayed on separate track lines. The repeat ranges are denoted as grayscale boxes, reflecting both the size of the repeat and the amount of base mismatch, base deletion, and base insertion associated with a repeat element. The higher the combined divergence from the reference, the lighter the shading. <br> <br> <img height="100" width="1250" src="/images/rmskOrigPack.png"> </p> <h4>Full Mode Visualization</h4> <p> In the most detailed visualization, repeats are displayed as chevron boxes, indicating the size and orientation of the repeat. The interior grayscale shading represents the divergence of the repeat (see above) while the outline color represents the class of the repeat. Dotted lines above the repeat and extending left or right indicate the length of unaligned repeat model sequence and provide context for where a repeat fragment originates in its consensus or pHMM model. If the length of the unaligned sequence is large, an interruption line and bp size is indicated instead of drawing the extension to scale. <br> <br> <img height="125" width="1250" src="/images/rmskFull.png"> </p> <p> For example, the following repeat is a SINE element in the forward orientation with average divergence. 
Only the 5' proximal fragment of the consensus sequence is aligned to the genome. The 3' unaligned length (384bp) is not drawn to scale and is instead displayed using a set of interruption lines along with the length of the unaligned sequence. </p> <img src="/images/rmskExample1.svg"> <p> Repeats that have been fragmented by insertions or large internal deletions are now represented by join lines. In the example below, a LINE element is found as two fragments. The solid connection lines indicate that there are no unaligned consensus bases between the two fragments. Also note these fragments form the 3' extremity of the repeat, as there is no unaligned consensus sequence following the last fragment. </p> <img src="/images/rmskExample2.svg"> <p> In cases where there is unaligned consensus sequence between the fragments, the repeat will look like the following. The dotted line indicates the length of the unaligned sequence between the two fragments. In this case the unaligned consensus is longer than the actual genomic distance between these two fragments. </p> <img src="/images/rmskExample3.svg"> <p> If there is consensus overlap between the two fragments, the joining lines will be drawn to indicate how much of the left fragment is repeated in the right fragment. 
</p> <img src="/images/rmskExample4.svg"> <p> The following table lists the repeat class colors: </p> <table> <thead> <tr> <th style="border-bottom: 2px solid #6678B1;">Color</th> <th style="border-bottom: 2px solid #6678B1;">Repeat Class</th> </tr> </thead> <tr> <td bgcolor="#1F77B4"></td> <td align="left"><b>SINE</b> - Short Interspersed Nuclear Element</td> </tr> <tr> <td bgcolor="#FF7F0E"></td> <td align="left"><b>LINE</b> - Long Interspersed Nuclear Element</td> </tr> <tr> <td bgcolor="#2CA02C"></td> <td align="left"><b>LTR</b> - Long Terminal Repeat</td> </tr> <tr> <td bgcolor="#D62728"></td> <td align="left"><b>DNA</b> - DNA Transposon</td> </tr> <tr> <td bgcolor="#9467BD"></td> <td align="left"><b>Simple</b> - Single Nucleotide Stretches and Tandem Repeats</td> </tr> <tr> <td bgcolor="#8C564B"></td> <td align="left"><b>Low_complexity</b> - Low Complexity DNA</td> </tr> <tr> <td bgcolor="#E377C2"></td> <td align="left"><b>Satellite</b> - Satellite Repeats</td> </tr> <tr> <td bgcolor="#7F7F7F"></td> <td align="left"><b>RNA</b> - RNA Repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA)</td> </tr> <tr> <td bgcolor="#BCBD22"></td> <td align="left"><b>Other</b> - Other Repeats (including class RC - Rolling Circle)</td> </tr> <tr> <td bgcolor="#17BECF"></td> <td align="left"><b>Unknown</b> - Unknown Classification</td> </tr> </table> <p> A "?" at the end of the "Family" or "Class" (for example, DNA?) signifies that the curator was unsure of the classification. At some point in the future, either the "?" will be removed or the classification will be changed.</p> <h2 id="bigRmsk">bigRmsk track definitions</h2> <p> The bigRmsk tracks consist of two bigBed files defined by <a href="http://www.linuxjournal.com/article/5949" target="_blank">autoSql</a> schemas: </p> <ul> <li>The primary bigRmsk file, defined by <a href="examples/bigRmskBed.as"> <em>bigRmskBed.as</em></a>, which has the annotations of repeats. 
<li>The secondary bigRmskAlign file, defined by <a href="examples/bigRmskAlignBed.as"> <em>bigRmskAlignBed.as</em></a>, which contains the alignments of the consensus repeats to the genome. This file is optional; if it is omitted, the bigRmsk track will function without the ability to view the alignments. </ul> <p> The input files for the bigRmsk files are created from the RepeatMasker <em>*.out</em> and <em>*.align</em> files using the <em>rmToTrackHub.pl</em> program that is included with RepeatMasker. The bigRmsk format is not designed to work with any other type of data. </p> <h2 id="steps">Creating a bigRmsk track</h2> <p> To create a bigRmsk track and its supporting files, follow the steps below. These steps assume that you have already run RepeatMasker and have a <em>*.out</em> file and, optionally, a <em>*.align</em> file. </p> <p> RepeatMasker output files are converted to the bigRmsk textual form using the <em>RepeatMasker/util/rmToTrackHub.pl</em> program that is part of the <a href="http://www.repeatmasker.org/RepeatMasker/">RepeatMasker 4.1.3 or newer distribution</a>. </p> <p> <strong>Step 1.</strong> If you wish to experiment with quickly building an example track, download the example RepeatMasker output files for the human GRCh38 (hg38) assembly, <a href="examples/bigRmskExample.out">bigRmskExample.out</a> and <a href="examples/bigRmskExample.align">bigRmskExample.align</a>, used in this tutorial: <pre>
wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskExample.out
wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskExample.align</pre>
<p> Otherwise, substitute your own <em>*.out</em> and <em>*.align</em> files in these instructions. Generating the bigRmskAlign file is optional: if you don't have the <em>*.align</em> file from RepeatMasker, skip the steps that build the alignment file, and the track will work with reduced functionality. 
<p> <strong>Step 2.</strong> Download the autoSql schemas <a href="examples/bigRmskBed.as">bigRmskBed.as</a> and <a href="examples/bigRmskAlignBed.as">bigRmskAlignBed.as</a>: <pre>
wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskBed.as
wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlignBed.as</pre>
<p> You will also need a file of chromosome sizes for your genome. For the example, download the hg38 file: <pre>
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes</pre>
<p> <strong>Step 3.</strong> Convert the RepeatMasker output files to the bigRmsk text format with <em>rmToTrackHub.pl</em>, which sorts its output so that it can be passed directly to <em>bedToBigBed</em>: <pre>
RepeatMasker/util/rmToTrackHub.pl -out bigRmskExample.out -align bigRmskExample.align</pre>
<p> <strong>Step 4.</strong> Build the bigRmsk and optional bigRmskAlign files: <pre>
bedToBigBed -tab -type=bed9+5 -as=bigRmskBed.as bigRmskExample.join.tsv hg38.chrom.sizes bigRmskExample.bb
bedToBigBed -tab -type=bed3+14 -as=bigRmskAlignBed.as bigRmskExample.align.tsv hg38.chrom.sizes bigRmskExampleAlign.bb</pre>
<p> <strong>Step 5.</strong> Place the newly created bigRmsk file (<em>bigRmskExample.bb</em>) and the optional bigRmskAlign file (<em>bigRmskExampleAlign.bb</em>) in a web-accessible http, https, or ftp location. </p> <p> <strong>Step 6.</strong> As with other bigBed-based tracks, bigRmsk tracks can be displayed as <a href="hgTracksHelp.html#CustomTracks">custom tracks</a> or included in <a href="hubQuickStart.html">track hubs</a> or <a href="hubQuickStartAssembly.html">assembly hubs</a>. 
</p> <p> The following options are used for bigRmsk custom tracks (with an equals sign between key and value) or trackDb hub entries as below: <ul> <li> <code>type bigRmsk</code> <li> <code>bigDataUrl<em> &lt;url&gt;</em></code> - URL or relative path of bigRmsk file <li> <code>xrefDataUrl<em> &lt;url&gt;</em></code> - URL or relative path of optional bigRmskAlign file </ul> <p> A standard bigRmsk track description is available at <a href="../trackDescriptions/bigRmskTrackDesc.html">bigRmskTrackDesc.html</a>, which can be directly linked to with the URL:<br> <em>http://genome.ucsc.edu/goldenPath/trackDescriptions/bigRmskTrackDesc.html</em>. </p> <p> See the <a href="#examples">Examples</a> section below for detailed examples of bigRmsk custom tracks and track hub definitions. </p> <h2 id="examples">Examples</h2> <h3 id="example1">Example of a bigRmsk custom track</h3> <p> Construct a <a href="hgTracksHelp.html#CustomTracks">custom track</a> using a single <a href="hgTracksHelp.html#TRACK">track line</a>. Note that any of the track attributes listed <a href="customTrack.html#TRACK">here</a> are applicable to tracks of type bigBed. <p> To create a custom track using the example bigRmsk file: <ol> <li> Construct a track line that references the file:<br> <pre><code>track type=bigRmsk name="bigRmsk Example" description="RepeatMasker example" visibility=full bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmskExample.bb xrefDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmskExampleAlign.bb</code></pre> </li> <li> Paste the track line into the <a href="../../cgi-bin/hgCustom?db=hg38">custom track management page</a> for the human assembly hg38 (Dec. 2013). </li> <li> Click the "submit" button. </li> <li> Navigate to <code>chr1:8,890-35,190</code> to see the track. 
</li> </ol> <h3 id="example2">Example of a bigRmsk track hub</h3> <p> This example can also be loaded in a Track or Assembly Hub <em>trackDb.txt</em> with a stanza such as the following:</p> <pre>
track bigRmskExample
shortLabel Example bigRmsk
longLabel This is an example bigRmsk Track Hub Stanza
type bigRmsk
+ maxWindowToDraw 10000000
visibility full
html http://genome.ucsc.edu/goldenPath/trackDescriptions/bigRmskTrackDesc.html
bigDataUrl http://genome.ucsc.edu/goldenPath/help/examples/bigRmskExample.bb
xrefDataUrl http://genome.ucsc.edu/goldenPath/help/examples/bigRmskExampleAlign.bb
</pre> <h2 id="share">Additional information</h2> <p> See the <a href="bigBed.html">bigBed documentation</a> for guidance on sharing, troubleshooting, and extracting data from bigRmsk files. </p> <h2 id="credits">Credits</h2> <p> The bigRmsk system was developed by Robert Hubley of the Institute for Systems Biology. </p> <!--#include virtual="$ROOT/inc/gbPageEnd.html" -->