d3de6531647d5797033474fa96e29cc88df3b37b brianlee Thu May 26 09:57:32 2022 -0700 Not for Code Review, see ticket #29356, check-in of temporary status of bigRmsk at request MarkD diff --git src/hg/htdocs/goldenPath/help/bigRmsk.html src/hg/htdocs/goldenPath/help/bigRmsk.html new file mode 100755 index 0000000..7e0361f --- /dev/null +++ src/hg/htdocs/goldenPath/help/bigRmsk.html @@ -0,0 +1,282 @@ + + + + + + + +

bigRmsk Track Format

+ +

This page is under development and is not ready for public use.

+The bigRmsk format allows for the display of annotations of a genome generated by the +RepeatMasker +program, which screens DNA sequences for interspersed repeats and low-complexity DNA sequences. +The output of RepeatMasker is a detailed annotation of the repeats that are present in +the "query" sequence as well as a modified version of this query sequence +in which all the annotated repeats have been masked, by default by replacing +the discovered repeats with Ns. The bigRmsk format takes the annotation output +of RepeatMasker and converts it into a compressed and indexed version of a +bigBed file. When such a file is +identified as type bigRmsk in a Track Hub, the results can be visualized as described +below.

+The bigRmsk files are created using the program bedToBigBed. It must be run with the +-as option to pull in a special autoSql (.as) file, bigRmskBed.as, that defines the fields +of bigRmsk. Alongside the bigRmsk file, an auxiliary data bigBed can be made, with its own .as +definitions file (bigRmskAlignBed.as) and referenced with a special xrefDataUrl +setting, whereas the bigRmsk file location is named with the standard bigDataUrl setting.

+The bigRmsk files are in an indexed binary format. The main advantage of this format is that only +those portions of the file needed to display a particular region are transferred to the Genome +Browser server. Because of this, bigRmsk files have considerably faster display performance than +if they were stored in a text-based format. The bigRmsk file remains on your local +web-accessible server (http, https or ftp), not on the UCSC server, and only the portion needed for +the currently displayed chromosomal position is locally cached as a "sparse file". If you +do not have access to a web-accessible server and need hosting space for your bigRmsk files, please +see the Hosting section of the Track Hub Help +documentation.

+ +

bigRmsk file definitions

+The following autoSql definition is used to specify the main bigRmsk files. This +definition, contained in the file bigRmsk.as, is +pulled in when the bedToBigBed utility is run with the -as=bigRmsk.as +option.

bigRmsk.as

table bigRmskBed
+"Repetitive Element Annotation" 
+    (
+    string  chrom;        "Reference sequence chromosome or scaffold" 
+    uint    chromStart;    "Start position of visualization on chromosome" 
+    uint    chromEnd;    "End position of visualization on chromosome" 
+    string  name;        "Name repeat, including the type/subtype suffix" 
+    uint    score;        "Divergence score" 
+    char[1] strand;        "+ or - for strand" 
+    uint    thickStart;    "Start position of aligned sequence on chromosome" 
+    uint    thickEnd;    "End position of aligned sequence on chromosome" 
+    uint      reserved;    "Reserved" 
+    uint    blockCount;    "Count of sequence blocks" 
+    lstring blockSizes;     "A comma-separated list of the block sizes(+/-)" 
+    lstring blockStarts;    "A comma-separated list of the block starts(+/-)" 
+    uint    id;             "A unique identifier for the joined annotations in this record" 
+    lstring description;    "A comma separated list of technical annotation descriptions"
+    )

An example: bedToBigBed -tab -as=bigRmsk.as -type=bed9+5 bigRmsk.txt +hg38.chrom.sizes bigRmsk.bb.
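
Before running the conversion, it can help to confirm that every row really has the 14 tab-separated fields implied by -type=bed9+5 (9 standard BED fields plus the 5 extra fields in the definition above). The following is a minimal sketch and not part of the UCSC tools; the function name check_field_count is made up for illustration.

```shell
# Hypothetical helper (not a UCSC utility): report rows whose tab-separated
# field count differs from the expected number.
check_field_count() {
    # $1 = input file, $2 = expected number of fields (14 for bed9+5)
    awk -F '\t' -v n="$2" '
        NF != n { printf "line %d: %d fields\n", NR, NF; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Calling check_field_count bigRmsk.txt 14 prints nothing and returns success when every row has 14 fields; otherwise it lists the offending line numbers and returns a non-zero status.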

+ +

Supporting bigRmskAlign.bb auxiliary data

+Alongside the bigRmsk file, a supporting bigBed can provide alignment data. The following autoSql +definition is used to create this supporting file, pointed to online with xrefDataUrl, +rather than the standard bigDataUrl used with bigRmsk. The file +bigRmskAlignBed.as is pulled in when +the bedToBigBed utility is run with the -as=bigRmskAlignBed.as +option.

bigRmskAlignBed.as

table bigRmskAlignBed
+"Repetitive Element Alignment Auxiliary Data" 
+    (
+    string  chrom;        "Reference sequence chromosome or scaffold" 
+    uint    chromStart;    "Start position of alignment on chromosome" 
+    uint    chromEnd;    "End position of alignment on chromosome" 
+    uint    chromRemain;    "Remaining bp in the chromosome or scaffold" 
+    float   score;          "alignment score (sw, bits or evalue)" 
+    float   percSubst;      "Base substitution percentage" 
+    float   percDel;        "Base deletion percentage" 
+    float   percIns;        "Bases insertion percentage" 
+    char[1] strand;         "Strand - either + or -" 
+    string  repName;        "Name of repeat" 
+    string  repType;        "Type of repeat" 
+    string  repSubtype;     "Subtype of repeat" 
+    uint    repStart;       "Start in repeat sequence" 
+    uint    repEnd;         "End in repeat sequence" 
+    uint    repRemain;      "Remaining unaligned bp in the repeat sequence" 
+    uint    id;             "The ID of the hit. Used to link related fragments" 
+    lstring calignData;     "The alignment data stored as a single string" 
+    )

An example: bedToBigBed -tab -as=bigRmskAlignBed.as -type=bed3+14 +bigRmskAlign.tsv.sorted.txt hg38.chrom.sizes bigRmskAlign.bb +.CHECK - ISSUE IS xrefDataUrl doesn't work on this data yet.

+Note that the bedToBigBed utility uses a substantial amount of memory: approximately +25% more RAM than the uncompressed BED input file.
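
The 25% figure translates into simple arithmetic on the input file size. The sketch below is only a rough planning aid based on that stated ratio; the function name estimate_ram_bytes is hypothetical, and actual memory use may differ.

```shell
# Hypothetical helper: rough RAM estimate for bedToBigBed, taken as the
# uncompressed input size plus 25% (the ratio quoted in the text).
estimate_ram_bytes() {
    # $1 = uncompressed BED/TSV input file
    size=$(wc -c < "$1")
    # Integer arithmetic: size * 125 / 100 is size + 25%, rounded down
    echo $(( size * 125 / 100 ))
}
```

For example, estimate_ram_bytes bigRmsk.txt prints the estimated number of bytes for that input.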

+ +

Creating a bigRmsk track

+To create a bigRmsk track and its supporting file, follow the steps below. All input +files to bedToBigBed must be sorted by chromosome name (first column) and then numerically by start coordinate (second column): +sort -k1,1 -k2,2n input.tsv.txt > input.tsv.sorted.txt. To learn about a Perl +program that can build the tab-separated values (tsv) input text files for bedToBigBed from the +RepeatMasker output files, contact Robert Hubley: https://github.com/rmhubley.
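
The sort step can be wrapped as follows. This is a sketch under one assumption not stated on this page: that chromosome names should be compared byte-wise, which is why LC_ALL=C is set. The function name sort_bed_input is made up for illustration.

```shell
# Hypothetical wrapper around the sort command shown in the text.
sort_bed_input() {
    # $1 = unsorted tab-separated input, $2 = sorted output
    # LC_ALL=C (an assumption here) makes the chromosome-name comparison
    # byte-wise, so the result does not depend on the user's locale.
    # -k1,1   sorts on the chromosome name
    # -k2,2n  then sorts numerically on the start coordinate
    LC_ALL=C sort -k1,1 -k2,2n "$1" > "$2"
}
```

With this ordering chr1 rows come before chr10 rows, which come before chr2 rows, and rows within a chromosome are ordered by start position.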

+Step 1. +If you already have an input file you would like to convert to a bigRmsk, skip to Step 3. +Otherwise, download this example bigRmsk.txt +file for the human GRCh38 (hg38) assembly.

+Step 2. +If you would like to include the optional auxiliary alignment data bigRmskAlign.bb file, +download this bigRmskAlign.txt file.

+Step 3. +Download the autoSql file bigRmsk.as needed by +bedToBigBed. If you have opted to include the optional auxiliary alignment data file, +bigRmskAlign.bb, with your bigRmsk file, you must also download the autoSql file +bigRmskAlignBed.as.

+Here are wget commands to obtain the above files and the hg38.chrom.sizes file mentioned below: +

+wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.txt
+wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlign.txt
+wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.as
+wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlign.as
+wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
+

+Step 4. +Download the bedToBigBed program from the UCSC +binary utilities directory.

+Step 5. +Download the chrom.sizes file for any assembly hosted at UCSC from our +downloads page (click on "Full +data set" for any assembly). For example, the hg38.chrom.sizes file for the hg38 +database is located at +http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes.

+Then run bedToBigBed to create the bigRmsk file: +bedToBigBed -tab -as=bigRmsk.as -type=bed9+5 bigRmsk.txt hg38.chrom.sizes bigRmsk.bb

+Step 6. +Move the newly created bigRmsk file (bigRmsk.bb) to a web-accessible http, https or ftp +location. If you generated the bigRmskAlign.bb file, move it to a web-accessible +location as well, most likely the same location as the bigRmsk.bb file.

+Step 7. +Construct a custom track using a single +track line. Note that any of the track attributes listed +here are applicable to tracks of type bigBed. The most basic +version of the track line will look something like this:

track type=bigRmsk name="My bigRmsk" description="A RepeatMasker Track" bigDataUrl=http://myorg.edu/mylab/bigRmsk.bb

+Step 8. +Paste the custom track line into the text box on the custom +track management page. Navigate to chr1:1-21,571 to see the example data for this track.

+The bedToBigBed program can be run with several additional options. For a full +list of the available options, type bedToBigBed (with no arguments) on the command line +to display the usage message.

+ +

Examples

+ +

Example #1

+In this example, you will create a bigRmsk custom track using an existing bigRmsk file, +bigRmsk.bb, located on the UCSC Genome Browser http server. This file contains data for +the hg38 assembly.

+To create a custom track using this bigRmsk file: +

+ Construct a track line that references the file:

track type=bigRmsk name="bigRmsk Example One" description="A bigRmsk file" visibility=full bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb

+ Paste the track line into the custom track management + page for the human assembly hg38 (Dec. 2013).
+ Click the "submit" button.
+ Navigate to chr1:1-21,571 to see the track. +

+Custom tracks can also be loaded via one URL line. +This link loads the same bigRmsk.bb track and sets additional display +parameters in the URL:

http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1:1-21,571&hgct_customText=track%20type=bigRmsk%20name=Example%20bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb%20visibility=full
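
The URL above is the track line with its spaces written as %20, appended to the hgct_customText parameter. A minimal sketch of that construction follows; the function name make_track_url is hypothetical, and only spaces are escaped, as in the example URL.

```shell
# Hypothetical helper: build an hgTracks URL that loads a custom track line
# through the hgct_customText parameter.
make_track_url() {
    # $1 = database, $2 = position, $3 = track line
    # Only spaces are escaped, matching the example URL; a track line with
    # other reserved characters would need fuller percent-encoding.
    encoded=$(printf '%s' "$3" | sed 's/ /%20/g')
    printf 'http://genome.ucsc.edu/cgi-bin/hgTracks?db=%s&position=%s&hgct_customText=%s\n' \
        "$1" "$2" "$encoded"
}

# Rebuild the example URL from its parts.
make_track_url hg38 'chr1:1-21,571' \
    'track type=bigRmsk name=Example bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb visibility=full'
```

Keeping the track line in readable form and encoding it at the last step makes it easier to adjust display settings such as visibility before sharing the link.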

+After this example bigRmsk is loaded in the Genome Browser, click on an item in the browser's +track display. Note that the details page display lacks information about the individual alignments, +as this example does not include the optional supporting alignment file.

+This example can also be loaded in a Track Hub with a stanza such as the following:

+track ExBigRmsk
+shortLabel Example bigRmsk
+longLabel This is an example Track Hub Stanza
+type bigRmsk
+visibility full
+bigDataUrl http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb
+

+NOTE: FOR WHEN REDOING PAGE, only Track Hubs now allow clicking into hgc. + + + +

+If you would like to share your bigRmsk data track with a colleague, learn how to create a URL by +looking at Example 6 on this page.

+ +

Extracting data from the bigRmsk format

+Because bigRmsk files are an extension of bigBed files, which are indexed binary files, it can +be difficult to extract data from them. UCSC has developed the following programs to assist +in working with bigBed formats, available from the +binary utilities directory.

+ bigBedToBed — converts a bigBed file to ASCII BED format.
+ bigBedSummary — extracts summary information from a bigBed file.
+ bigBedInfo — prints out information about a bigBed file.

+As with all UCSC Genome Browser programs, simply type the program name (with no parameters) at the +command line to view the usage statement.

+ +

Troubleshooting

+If you encounter an error when you run the bedToBigBed program, check your input +file for data coordinates that extend past the end of the chromosome. If these are present, run +the bedClip program +(available here) to remove the problematic +row(s) in your input file before running the bedToBigBed program.
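
To see which rows are affected before clipping, the input can be compared against the chrom.sizes file. The sketch below is an illustration only, not a replacement for bedClip; the function name find_out_of_bounds is made up, and it assumes chromEnd is the third column, as in both definitions above.

```shell
# Hypothetical helper: list rows whose chromEnd exceeds the chromosome
# length given in chrom.sizes, or whose chromosome is not listed there.
find_out_of_bounds() {
    # $1 = chrom.sizes (name<TAB>size), $2 = tab-separated bedToBigBed input
    awk -F '\t' '
        NR == FNR { size[$1] = $2; next }   # first file: load chromosome sizes
        !($1 in size) || $3 + 0 > size[$1] + 0 {
            print FILENAME ":" FNR ": " $1 "\t" $2 "\t" $3
        }
    ' "$1" "$2"
}
```

Running find_out_of_bounds hg38.chrom.sizes bigRmsk.txt prints one line per problematic row, prefixed with the file name and line number, and prints nothing when all rows fit within their chromosomes.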

+ +