d3de6531647d5797033474fa96e29cc88df3b37b brianlee Thu May 26 09:57:32 2022 -0700
Not for Code Review, see ticket #29356, check-in of temporary status of bigRmsk at request MarkD

diff --git src/hg/htdocs/goldenPath/help/bigRmsk.html src/hg/htdocs/goldenPath/help/bigRmsk.html
new file mode 100755
index 0000000..7e0361f
--- /dev/null
+++ src/hg/htdocs/goldenPath/help/bigRmsk.html
@@ -0,0 +1,282 @@
+
+The bigRmsk format allows for the display of annotations of a genome generated by the
+RepeatMasker
+program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.
+The output of RepeatMasker is a detailed annotation of the repeats that are present in
+the "query" sequence as well as a modified version of this query sequence
+in which all the annotated repeats have been masked; by default, the discovered
+repeats are replaced by Ns. The bigRmsk format takes the annotation output
+of RepeatMasker and converts it into a compressed and indexed version of a
+bigBed file. When the file is
+identified as type bigRmsk
 in a Track Hub, the results can be visualized as described
+below.
+The bigRmsk files are created using the program bedToBigBed
. It must be run with the
+-as
option to pull in a special autoSql (.as) file, bigRmskBed.as
that defines the fields
+of bigRmsk. Alongside the bigRmsk file, an auxiliary data bigBed can be made, with its own .as
+definitions file (bigRmskAlignBed.as
) and referenced with a special xrefDataUrl
+setting, whereas the bigRmsk file location is named with the standard bigDataUrl
setting.
+The bigRmsk files are in an indexed binary format. The main advantage of this format is that only
+those portions of the file needed to display a particular region are transferred to the Genome
+Browser server. Because of this, bigRmsk files have considerably faster display performance than
+if they were stored in a text-based format. The bigRmsk file remains on your local
+web-accessible server (http, https or ftp), not on the UCSC server, and only the portion needed for
+the currently displayed chromosomal position is locally cached as a "sparse file". If you
+do not have access to a web-accessible server and need hosting space for your bigRmsk files, please
+see the Hosting section of the Track Hub Help
+documentation.
+ +
+The following autoSql definition is used to specify the main bigRmsk files. This
+definition, contained in the file bigRmsk.as, is
+pulled in when the bedToBigBed
utility is run with the -as=bigRmsk.as
+option.
table bigRmskBed
+"Repetitive Element Annotation"
+ (
+ string chrom; "Reference sequence chromosome or scaffold"
+ uint chromStart; "Start position of visualization on chromosome"
+ uint chromEnd; "End position of visualation on chromosome"
+ string name; "Name repeat, including the type/subtype suffix"
+ uint score; "Divergence score"
+ char[1] strand; "+ or - for strand"
+ uint thickStart; "Start position of aligned sequence on chromosome"
+ uint thickEnd; "End position of aligned sequence on chromosome"
+ uint reserved; "Reserved"
+ uint blockCount; "Count of sequence blocks"
+ lstring blockSizes; "A comma-separated list of the block sizes(+/-)"
+ lstring blockStarts; "A comma-separated list of the block starts(+/-)"
+ uint id; "A unique identifier for the joined annotations in this record"
+ lstring description; "A comma separated list of technical annotation descriptions"
+ )
+An example: bedToBigBed -tab -as=bigRmsk.as -type=bed9+5 bigRmsk.txt
+hg38.chrom.sizes bigRmsk.bb
.
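Before running the conversion, it can help to confirm that each input row has the 14 tab-separated columns that `-type=bed9+5` implies. The following is a minimal Python sketch, not part of the UCSC tools: the field names are copied from the autoSql definition above, while the helper name `parse_bigrmsk_line` and the sample row are made up for illustration.

```python
# Field names copied from the bigRmskBed autoSql definition (9 BED fields + 5 extra).
BIG_RMSK_FIELDS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "reserved", "blockCount",
    "blockSizes", "blockStarts", "id", "description",
]


def parse_bigrmsk_line(line):
    """Split one tab-separated input line into a dict keyed by field name.

    Hypothetical helper: raises ValueError when the column count is wrong.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(BIG_RMSK_FIELDS):
        raise ValueError(
            f"expected {len(BIG_RMSK_FIELDS)} columns, got {len(values)}"
        )
    return dict(zip(BIG_RMSK_FIELDS, values))


# Made-up row for illustration only; it is not taken from the example data.
sample = "chr1\t100\t200\tAluY#SINE/Alu\t12\t+\t100\t200\t0\t1\t100\t0\t1\texample\n"
row = parse_bigrmsk_line(sample)
print(row["chrom"], row["chromStart"], row["chromEnd"], len(row))
```

A check like this only catches column-count problems; `bedToBigBed` itself remains the authority on whether a row is valid.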
+Alongside the bigRmsk file, a supporting bigBed can provide alignment data. The following autoSql
+definition is used to create this supporting file, which the track references with xrefDataUrl
,
+rather than the standard bigDataUrl
used with bigRmsk. The file
+bigRmskAlignBed.as is pulled in when
+the bedToBigBed
utility is run with the -as=bigRmskAlignBed.as
+option.
table bigRmskAlignBed
+"Repetitive Element Alignment Auxilary Data"
+ (
+ string chrom; "Reference sequence chromosome or scaffold"
+ uint chromStart; "Start position of alignment on chromosome"
+ uint chromEnd; "End position of alignment on chromosome"
+ uint chromRemain; "Remaining bp in the chromosome or scaffold"
+ float score; "alignment score (sw, bits or evalue)"
+ float percSubst; "Base substitution percentage"
+ float percDel; "Base deletion percentage"
+ float percIns; "Bases insertion percentage"
+ char[1] strand; "Strand - either + or -"
+ string repName; "Name of repeat"
+ string repType; "Type of repeat"
+ string repSubtype; "Subtype of repeat"
+ uint repStart; "Start in repeat sequence"
+ uint repEnd; "End in repeat sequence"
+ uint repRemain; "Remaining unaligned bp in the repeat sequence"
+ uint id; "The ID of the hit. Used to link related fragments"
+ lstring calignData; "The alignment data stored as a single string"
+ )
+An example: bedToBigBed -tab -as=bigRmskAlignBed.as -type=bed3+14
+bigRmskAlign.tsv.sorted.txt hg38.chrom.sizes bigRmskAlign.bb
+
.CHECK - ISSUE IS xrefDataUrl doesn't work on this data yet.
+Note that the bedToBigBed
utility uses a substantial amount of memory: approximately
+25% more RAM than the uncompressed BED input file.
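As a rough planning aid, the 25% figure above can be turned into a back-of-the-envelope estimate. This Python sketch assumes the overhead applies uniformly to the input file size; the function names are hypothetical and the result is only an approximation.

```python
import os


def estimate_peak_ram_bytes(input_size_bytes, overhead_fraction=0.25):
    """Rough estimate: uncompressed input size plus about 25%, per the note above."""
    return int(input_size_bytes * (1 + overhead_fraction))


def estimate_for_file(path):
    """Apply the same estimate to a file on disk (path supplied by the caller)."""
    return estimate_peak_ram_bytes(os.path.getsize(path))


# A 2 GiB uncompressed input file, expressed in GiB for readability.
two_gib = 2 * 1024**3
print(estimate_peak_ram_bytes(two_gib) / 1024**3, "GiB")
```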
+To create a bigRmsk track and its supporting file, follow the steps below. All input
+files to bedToBigBed
 must be sorted by the first two columns (chromosome name, then start coordinate), for example:
+sort -k1,1 -k2,2n input.tsv.txt > input.tsv.sorted.txt
. To learn about a Perl
+program that can build the tab-separated values (tsv) input bedToBigBed text files from the
+RepeatMasker output files, contact Robert Hubley: https://github.com/rmhubley.
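For readers who prepare the input in a script rather than with the `sort` command, the same ordering can be sketched in Python. This assumes tab-separated rows with no header line, and that chromosome names compare byte-wise (as `sort` does under `LC_ALL=C`); the function name and the sample rows are illustrative only.

```python
def sort_bed_rows(lines):
    """Order rows by chromosome name (as text), then by start position (as a number)."""
    def sort_key(line):
        fields = line.split("\t")
        return (fields[0], int(fields[1]))
    return sorted(lines, key=sort_key)


# Made-up rows: chrom, chromStart, chromEnd, name.
rows = [
    "chr2\t50\t60\tb",
    "chr1\t300\t400\tc",
    "chr1\t20\t30\ta",
    "chr10\t5\t9\td",
]
for line in sort_bed_rows(rows):
    print(line)
```

Note that the chromosome column is compared as text, so `chr10` sorts before `chr2`; this matches what `sort -k1,1` produces.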
+Step 1.
+If you already have an input file you would like to convert to a bigRmsk, skip to Step 3.
+Otherwise, download this example bigRmsk.txt
+file for the human GRCh38 (hg38) assembly.
+
+Step 2.
+If you would like to include the optional auxiliary alignment data bigRmskAlign.bb
file,
+download this bigRmskAlign.txt file.
+Step 3.
+Download the autoSql file bigRmsk.as needed by
+bedToBigBed
. If you have opted to include the optional auxiliary alignment data file,
+bigRmskAlign.bb, with your bigRmsk file, you must also download the autoSql file
+bigRmskAlignBed.as.
+Here are wget commands to obtain the above files and the hg38.chrom.sizes file mentioned below: +
wget https://genome.ucsc.edu/goldenPath/help/examples/
+wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.txt
+wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlign.txt
+wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.as
+wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlign.as
+wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
+
+
+Step 4.
+Download the bedToBigBed
program from the UCSC
+binary utilities directory.
+Step 5.
+Download the chrom.sizes file for any assembly hosted at UCSC from our
+downloads page (click on "Full
+data set" for any assembly). For example, the hg38.chrom.sizes file for the hg38
+database is located at
+http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes.
+Then run bedToBigBed on the sorted input to create the bigRmsk file:
+
+bedToBigBed -tab -as=bigRmsk.as -type=bed9+5 bigRmsk.txt hg38.chrom.sizes bigRmsk.bb
+
+Step 6.
+Move the newly created bigRmsk file (bigRmsk.bb) to a web-accessible http, https or ftp
+location. If you generated a bigRmskAlign.bb file, move it to a web-accessible
+location as well, most likely the same location as the bigRmsk.bb file.
+
+Step 7.
+Construct a custom track using a single
+track line. Note that any of the track attributes listed
+here are applicable to tracks of type bigBed. The most basic
+version of the track line will look something like this:
+track type=bigRmsk name="My bigRmsk" description="A RepeatMasker Track" bigDataUrl=http://myorg.edu/mylab/bigRmsk.bb+
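When several track lines need to be generated, the pattern above can be assembled programmatically. The following Python sketch uses a hypothetical helper, `make_track_line`; it simply concatenates the attributes shown in the track line above and does not validate them against the Genome Browser.

```python
def make_track_line(big_data_url, name, description, track_type="bigRmsk", **extra):
    """Assemble a one-line custom track definition.

    Hypothetical helper: extra keyword arguments (for example visibility="full")
    are written as key=value pairs before bigDataUrl.
    """
    parts = [
        f"track type={track_type}",
        f'name="{name}"',
        f'description="{description}"',
    ]
    parts += [f"{key}={value}" for key, value in extra.items()]
    parts.append(f"bigDataUrl={big_data_url}")
    return " ".join(parts)


basic = make_track_line(
    "http://myorg.edu/mylab/bigRmsk.bb",
    name="My bigRmsk",
    description="A RepeatMasker Track",
)
print(basic)
```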
+Step 8.
+Paste the custom track line into the text box on the custom
+track management page. Navigate to chr1:1-21,571 to see the example data for this track.
+
+The bedToBigBed
program can be run with several additional options. For a full
+list of the available options, type bedToBigBed
(with no arguments) on the command line
+to display the usage message.
+In this example, you will create a bigRmsk custom track using an existing bigRmsk file,
+bigRmsk.bb, located on the UCSC Genome Browser http server. This file contains data for
+the hg38 assembly.
+
+To create a custom track using this bigRmsk file, paste the following track line into the custom
+track management page for the hg38 assembly:
+track type=bigRmsk name="bigRmsk Example One" description="A bigRmsk file" visibility=full bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb
+Then navigate to chr1:1-21,571 to see the track.
+
+Custom tracks can also be loaded via one URL line.
+This link loads the same bigRmsk.bb track and sets additional display
+parameters in the URL:
+http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1:1-21,571&hgct_customText=track%20type=bigRmsk%20name=Example%20bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb%20visibility=full
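The URL above is the track line with its spaces percent-encoded and passed as `hgct_customText`. As a sketch of how such a link could be built in Python, the snippet below uses `urllib.parse.quote` and keeps `=`, `:`, `/` and `,` unescaped so that the output mirrors the URL shown above; stricter encoding may work equally well, and the helper name is hypothetical.

```python
from urllib.parse import quote


def make_custom_track_url(db, position, track_line,
                          base="http://genome.ucsc.edu/cgi-bin/hgTracks"):
    """Build an hgTracks link that loads a custom track from a track line.

    Spaces become %20; "=", ":", "/" and "," are left as-is to match the
    style of the example URL in this document.
    """
    encoded = quote(track_line, safe="=:/,")
    return f"{base}?db={db}&position={position}&hgct_customText={encoded}"


track_line = (
    "track type=bigRmsk name=Example "
    "bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb "
    "visibility=full"
)
url = make_custom_track_url("hg38", "chr1:1-21,571", track_line)
print(url)
```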
+
+After this example bigRmsk is loaded in the Genome Browser, click into an item on the browser's
+track display. Note that the details page display lacks information about the individual alignments,
+as this example does not include the optional supporting alignment file.
++This example can also be loaded in a Track Hub with a stanza such as the following:
+
+track ExBigRmsk
+shortLabel Example bigRmsk
+longLabel This is an example Track Hub Stanza
+type bigRmsk
+visibility full
+bigDataUrl http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb
+
+NOTE: FOR WHEN REDOING PAGE, only Track Hubs now allow clicking into hgc.
+
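A stanza like the one above is a list of "setting value" lines, so it is straightforward to generate from a mapping. The Python sketch below uses a hypothetical helper, `make_hub_stanza`, and reproduces only the settings shown in the example stanza.

```python
def make_hub_stanza(settings):
    """Render a mapping of setting name to value as one trackDb stanza.

    Hypothetical helper: settings are written in insertion order, one per line.
    """
    return "\n".join(f"{key} {value}" for key, value in settings.items()) + "\n"


stanza = make_hub_stanza({
    "track": "ExBigRmsk",
    "shortLabel": "Example bigRmsk",
    "longLabel": "This is an example Track Hub Stanza",
    "type": "bigRmsk",
    "visibility": "full",
    "bigDataUrl": "http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb",
})
print(stanza, end="")
```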
+If you would like to share your bigRmsk data track with a colleague, learn how to create a URL by
+looking at Example 6 on this page.
+
+
+Because bigRmsk files are an extension of bigBed files, which are indexed binary files, it can
+be difficult to extract data from them. UCSC has developed the following programs to assist
+in working with bigBed formats, available from the
+binary utilities directory.
+bigBedToBed: converts a bigBed file to ASCII BED format.
+bigBedSummary: extracts summary information from a bigBed file.
+bigBedInfo: prints out information about a bigBed file.
+
+As with all UCSC Genome Browser programs, simply type the program name (with no parameters) at the
+command line to view the usage statement.
+ +
+If you encounter an error when you run the bedToBigBed
program, check your input
+file for data coordinates that extend past the end of the chromosome. If these are present, run
+the bedClip
program
+(available here) to remove the problematic
+row(s) in your input file before running the bedToBigBed
program.
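To find the offending rows before deciding whether to clip them, a small check against the chrom.sizes file can help. The Python sketch below only reports rows whose end coordinate exceeds the chromosome size or whose chromosome is missing from chrom.sizes; it is not a reimplementation of `bedClip`, and the function names and sample data are made up for illustration.

```python
def load_chrom_sizes(lines):
    """Parse chrom.sizes content (name and size per line) into a dict."""
    sizes = {}
    for line in lines:
        if line.strip():
            chrom, size = line.split()[:2]
            sizes[chrom] = int(size)
    return sizes


def find_out_of_range(bed_lines, sizes):
    """Return 1-based line numbers of rows that fall outside their chromosome."""
    bad = []
    for number, line in enumerate(bed_lines, start=1):
        fields = line.rstrip("\n").split("\t")
        chrom, end = fields[0], int(fields[2])
        if chrom not in sizes or end > sizes[chrom]:
            bad.append(number)
    return bad


# Made-up sizes and rows for illustration only.
sizes = load_chrom_sizes(["chr1\t1000\n"])
rows = ["chr1\t0\t500\tok", "chr1\t900\t1200\ttoo_long", "chrUn\t0\t10\tunknown"]
print(find_out_of_range(rows, sizes))
```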