8b8e3dd40b5f555d315dafce8a4bd64ad07bb1b2
dschmelt
Fri Sep 9 12:51:21 2022 -0700
Proofread and edited for clarity refs #29356

diff --git src/hg/htdocs/goldenPath/help/bigRmsk.html src/hg/htdocs/goldenPath/help/bigRmsk.html
index 839b9ea..8b061e0 100755
--- src/hg/htdocs/goldenPath/help/bigRmsk.html
+++ src/hg/htdocs/goldenPath/help/bigRmsk.html
@@ -1,155 +1,287 @@

bigRmsk Track Format

The bigRmsk format allows for the display of annotations of a genome generated by the RepeatMasker program, which screens DNA sequences for interspersed repeats and low complexity DNA sequences.
-It is the recommend method of adding RepeatMaster tracks to assembly hubs.
-For a descriptions of this features of this track type, with examples, see
-standard bigRmsk track description.
+ It is the recommended method of adding RepeatMasker tracks to assembly hubs.
+

The bigRmsk format enables taking the annotation output of RepeatMasker and converting it into a compressed and indexed bigBed file. Please see this page for details of the bigBed format, its use, and associated tools.

Display Conventions and Configuration

+ +

Context Sensitive Zooming

+ This track employs a technique which chooses the appropriate visual representation for the data based on the
+ zoom scale and the number of annotations currently in view. The track will automatically switch from the
+ most detailed visualization ('Full' mode) to the denser view ('Pack' mode) when the window size is greater
+ than 45kb of sequence. It will further switch to the even denser single line view ('Dense' mode) if more than
+ 500 annotations are present in the current view.
+
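The switching rule described above can be summarized as a small decision function. This is only an illustrative sketch in POSIX shell, not the Genome Browser's actual code: the function name `choose_mode` and its arguments are hypothetical, 45kb is taken to mean 45000 bp, and checking the annotation count before the window size is an assumption about how the two thresholds combine.

```shell
# Hypothetical helper: pick a display mode from the window size (bp)
# and the number of annotations in view. The thresholds come from the
# text; reading 45kb as 45000 bp and testing the annotation count
# first are assumptions of this sketch.
choose_mode() (
    window_bp=$1
    annotation_count=$2
    if [ "$annotation_count" -gt 500 ]; then
        echo dense
    elif [ "$window_bp" -gt 45000 ]; then
        echo pack
    else
        echo full
    fi
)

choose_mode 20000 120   # small window, few annotations
choose_mode 60000 120   # window wider than 45kb
choose_mode 60000 800   # more than 500 annotations in view
```

Under these assumptions the three calls select full, pack, and dense respectively.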

Dense Mode Visualization

+ In dense display mode, a single line is displayed denoting the coverage of repeats using a series
+ of colored boxes. The boxes are colored based on the classification of the repeat (see below for legend).
+
+
+ +

Pack Mode Visualization

+ In pack mode, repeats are represented as sets of joined features. These are color coded as above based on the
+ class of the repeat, and further details such as orientation (denoted by chevrons) and a family label are provided.
+ This family label may be optionally turned off in the track configuration.
+
+
+ +
+
+ The pack display mode may also be configured to resemble the original UCSC repeat track. In this visualization,
+ repeat features are grouped by classes (see below), and displayed on separate track lines. The repeat ranges are
+ denoted as grayscale boxes, reflecting both the size of the repeat and
+ the amount of base mismatch, base deletion, and base insertion associated with a repeat element.
+ The higher the combined divergence from the reference, the lighter the shading.
+
+
+ +

Full Mode Visualization

+ In the most detailed visualization, repeats are displayed as chevron boxes, indicating the size and orientation of
+ the repeat. The interior grayscale shading represents the divergence of the repeat (see above) while the outline color
+ represents the class of the repeat. Dotted lines above the repeat and extending left or right
+ indicate the length of unaligned repeat model sequence and provide context for where a repeat fragment originates in its
+ consensus or pHMM model. If the length of the unaligned sequence
+ is large, an interruption line and the bp size are shown instead of drawing the extension to scale.
+
+
+ +

+ +

+ For example, the following repeat is a SINE element in the forward orientation with average
+ divergence. Only the 5' proximal fragment of the consensus sequence is aligned to the genome.
+ The 3' unaligned length (384bp) is not drawn to scale and is instead displayed using a set of
+ interruption lines along with the length of the unaligned sequence.
+

+ +

+ Repeats that have been fragmented by insertions or large internal deletions are now represented
+ by join lines. In the example below, a LINE element is found as two fragments. The solid
+ connection lines indicate that there are no unaligned consensus bases between the two fragments.
+ Also note these fragments form the 3' extremity of the repeat, as there is no unaligned consensus
+ sequence following the last fragment.
+

+ +

+ In cases where there is unaligned consensus sequence between the fragments, the repeat will look like
+ the following. The dotted line indicates the length of the unaligned sequence between the two
+ fragments. In this case the unaligned consensus is longer than the actual genomic distance between
+ these two fragments.
+

+ +

+ If there is consensus overlap between the two fragments, the joining lines will be drawn to indicate
+ how much of the left fragment is repeated in the right fragment.
+
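The three join styles described above (solid line, dotted line, overlap) follow from comparing where the left fragment ends and the right fragment starts in consensus coordinates. The sketch below illustrates that comparison only; `join_style` and its arguments are hypothetical names, the positions are assumed to be 1-based and inclusive, and both fragments are assumed to be in the forward orientation. It is not taken from the browser's drawing code.

```shell
# Hypothetical helper: classify the join between two fragments of one
# repeat from their consensus coordinates (assumed 1-based, inclusive,
# forward orientation).
join_style() (
    left_cons_end=$1      # last consensus base aligned by the left fragment
    right_cons_start=$2   # first consensus base aligned by the right fragment
    gap=$((right_cons_start - left_cons_end - 1))
    if [ "$gap" -gt 0 ]; then
        echo "dotted: $gap unaligned consensus bp"
    elif [ "$gap" -eq 0 ]; then
        echo "solid: fragments are contiguous in the consensus"
    else
        echo "overlap: $((-gap)) consensus bp repeated"
    fi
)

join_style 300 301   # contiguous fragments
join_style 300 500   # unaligned consensus between fragments
join_style 300 251   # right fragment repeats part of the left fragment
```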

+ +

+ The following table lists the repeat class colors: +


Color	Repeat Class
	SINE - Short Interspersed Nuclear Element
	LINE - Long Interspersed Nuclear Element
	LTR - Long Terminal Repeat
	DNA - DNA Transposon
	Simple - Single Nucleotide Stretches and Tandem Repeats
	Low_complexity - Low Complexity DNA
	Satellite - Satellite Repeats
	RNA - RNA Repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA)
	Other - Other Repeats (including class RC - Rolling Circle)
	Unknown - Unknown Classification

+ + +

+ A "?" at the end of the "Family" or "Class" (for example, DNA?)
+ signifies that the curator was unsure of the classification. At some point in the future,
+ either the "?" will be removed or the classification will be changed.

+ + +

bigRmsk track definitions

- The bigRmsk tracks consist of two bigBed files define by
+ The bigRmsk tracks consist of two bigBed files defined by
autoSql schema:

- The primary bigRmsk file, define by
+ The primary bigRmsk file, defined by
bigRmskBed.as, which has the annotations of repeats.
- The secondary bigRmskAlign file, define by
+ The secondary bigRmskAlign file, defined by
bigRmskAlignBed.as, which contains the alignments of the consensus repeats to the genome. This file is optional,
- if omitted, the bigRmsk track will function, without the ability to view the alignments.
+ if omitted, the bigRmsk track will function without the ability to view the alignments.

The input files for the bigRmsk files are created from the RepeatMasker *.out and *.align files using the rmToTrackHub.pl program that is included with RepeatMasker. The bigRmsk format is not designed to work with any other type of data.

Creating a bigRmsk track

To create a bigRmsk track and its supporting files, follow the steps below. This assumes that you have already run RepeatMasker and have a *.out file and, optionally, a *.align file.

- RepeatMasker output files are convert to the bigRmsk textual form using the
+ RepeatMasker output files are converted to the bigRmsk textual form using the
RepeatMasker/util/rmToTrackHub.pl program that is part of the RepeatMasker 4.1.3 or newer distribution.

Step 1. If you wish to experiment with quickly building an example track, download the example RepeatMasker output files for the human GRCh38 (hg38) assembly bigRmskExample.out and bigRmskExample.align used in this tutorial:

-    
       wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskExample.out
-      wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskExample.align
-    
-

+ wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskExample.align

Otherwise, substitute your own *.out and *.align files in these instructions. Generating the alignment bigRmsk file is optional: if you don't have the *.align files from RepeatMasker, the track will still work, with reduced functionality. Just skip the steps involved in building the alignment files.

Step 2. Download the autoSql schemas bigRmskBed.as and bigRmskAlignBed.as:

-    
       wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskBed.as
-      wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlignBed.as
-    
-

+ wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlignBed.as

You will also need a file of chromosome sizes for your genome. For the example, download the hg38 file:

-    
-      wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
-    
-

+ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes

Step 3. Use rmToTrackHub.pl to convert the RepeatMasker files to the text-format files from which the bigRmsk files are built. The program sorts its output so that it can be passed directly to bedToBigBed:

-    
-      RepeatMasker/util/rmToTrackHub.pl -out bigRmskExample.out -align bigRmskExample.align
-    
-

+ RepeatMasker/util/rmToTrackHub.pl -out bigRmskExample.out -align bigRmskExample.align

Step 4. Build the bigRmsk and optional bigRmskAlign files:

-    
       bedToBigBed -tab -type=bed9+5 -as=bigRmskBed.as bigRmskExample.join.tsv hg38.chrom.sizes bigRmskExample.bb
-      bedToBigBed -tab -type=bed3+14 -as=bigRmskAlignBed.as bigRmskExample.align.tsv hg38.chrom.sizes bigRmskExampleAlign.bb
-    
-

+ bedToBigBed -tab -type=bed3+14 -as=bigRmskAlignBed.as bigRmskExample.align.tsv hg38.chrom.sizes bigRmskExampleAlign.bb
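Because the alignment file is optional, Step 4 can be wrapped in a small helper that skips the second command when no alignment input is present. The sketch below only prints the bedToBigBed commands from Step 4 rather than running them, so it can be tried without the tools installed. The function name `print_build_cmds` is hypothetical, and the `.join.tsv`, `.align.tsv`, `.bb`, and `Align.bb` file names are assumed to follow the naming used in the example.

```shell
# Hypothetical helper: print the Step 4 commands for a given file
# prefix and chromosome sizes file. Nothing is executed here.
print_build_cmds() (
    prefix=$1
    chrom_sizes=$2
    echo "bedToBigBed -tab -type=bed9+5 -as=bigRmskBed.as $prefix.join.tsv $chrom_sizes $prefix.bb"
    # The alignment track is optional: emit its command only when the
    # alignment input exists in the current directory.
    if [ -f "$prefix.align.tsv" ]; then
        echo "bedToBigBed -tab -type=bed3+14 -as=bigRmskAlignBed.as $prefix.align.tsv $chrom_sizes ${prefix}Align.bb"
    fi
)

print_build_cmds bigRmskExample hg38.chrom.sizes
```

Once the printed commands look right, the output could be piped to `sh` to run them.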

Step 6. Place the newly created bigRmsk file (bigRmskExample.bb) and the optional bigRmskAlign file (bigRmskExampleAlign.bb) in a web-accessible http, https, or ftp location.

Step 7.

As with other bigBed-based tracks, bigRmsk tracks can be displayed as custom tracks or included in track hubs or assembly hubs.

- The following options are used for bigRmsk custom tracks or trackDb entries:
+ The following options are used for bigRmsk custom tracks (with an equals sign between key and value) or trackDb hub entries as below:

type bigRmsk
bigDataUrl <url> - URL or relative path of bigRmsk file
xrefDataUrl <url> - URL or relative path of optional bigRmskAlign file

- +

A standard bigRmsk track description is available at bigRmskTrackDesc.html,
- which can be directly to with as the URL:
+ which can be directly linked to with the URL:
http://genome.ucsc.edu/goldenPath/trackDescriptions/bigRmskTrackDesc.html.
-
+

See the Examples section below for detailed examples of bigRmsk custom tracks and track hub definitions.

Examples

Example of a bigRmsk custom track

Construct a custom track using a single track line. Note that any of the track attributes listed here are applicable to tracks of type bigBed.

To create a custom track using the example bigRmsk file:

     track bigRmskExample
     shortLabel Example bigRmsk
     longLabel This is an example bigRmsk Track Hub Stanza
     type bigRmsk
     visibility full
     html http://genome.ucsc.edu/goldenPath/trackDescriptions/bigRmskTrackDesc.html
     bigDataUrl http://genome.ucsc.edu/goldenPath/help/examples/bigRmskExample.bb
     xrefDataUrl http://genome.ucsc.edu/goldenPath/help/examples/bigRmskExampleAlign.bb
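The stanza above is written in trackDb form, with a space between key and value. For a custom track, the same settings go on a single track line with an equals sign between key and value. The sketch below assembles such a line; the function name `make_track_line` and its argument order are hypothetical, and the exact set of attributes shown is an assumption based on the options listed earlier.

```shell
# Hypothetical helper: build a one-line custom track definition.
# The key=value layout follows the earlier note about equals signs;
# the function name and argument order are this sketch's own.
make_track_line() (
    name=$1
    big_data_url=$2
    xref_data_url=${3:-}   # optional; leave empty to omit the alignments
    line="track type=bigRmsk name=\"$name\" visibility=full bigDataUrl=$big_data_url"
    if [ -n "$xref_data_url" ]; then
        line="$line xrefDataUrl=$xref_data_url"
    fi
    printf '%s\n' "$line"
)

make_track_line bigRmskExample \
    http://genome.ucsc.edu/goldenPath/help/examples/bigRmskExample.bb \
    http://genome.ucsc.edu/goldenPath/help/examples/bigRmskExampleAlign.bb
```

Omitting the third argument produces a line without `xrefDataUrl`, matching the case where no bigRmskAlign file was built.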

See the bigBed documentation for guidance on
- sharing, trouble shooting and extracting data from bigRmsk files.
+ sharing, troubleshooting, and extracting data from bigRmsk files.

Credits

The bigRmsk system was developed by Robert Hubley of the Institute for Systems Biology. - +