b7d1897d8117b9a1e39f88710af76fff8172fed6
dschmelt
Wed Aug 3 15:34:31 2022 -0700
Proofreading the doc for #29356

diff --git src/hg/htdocs/goldenPath/help/bigRmsk.html src/hg/htdocs/goldenPath/help/bigRmsk.html
index 406467b..c644765 100755
--- src/hg/htdocs/goldenPath/help/bigRmsk.html
+++ src/hg/htdocs/goldenPath/help/bigRmsk.html
@@ -1,94 +1,93 @@
The bigRmsk format allows for the display of annotations of a genome generated by the
RepeatMasker
program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.
The output of RepeatMasker is a detailed annotation of the repeats that are present in
the "query" sequence as well as a modified version of this query sequence
in which all the annotated repeats have been masked; by default, the discovered
repeats are replaced by Ns. The bigRmsk format enables taking the annotation output
of RepeatMasker and converting it into a compressed and indexed version of a
-bigBed file, where the results when
-identified as type bigRmsk
in a Track Hub can be visualized as described
-below.
type bigRmsk
in a Track Hub and can be visualized as described
+below.
The bigRmsk files are created using the program bedToBigBed
. It must be run with the
-as
option to pull in a special autoSql (.as) file, bigRmskBed.as
that defines the fields
-of bigRmsk. Along side the bigRmsk file, an auxilary data bigBed can be made, with its own .as
+of bigRmsk. Along with the bigRmsk file, an auxiliary data bigBed can be made, with its own .as
definitions file (bigRmskAlignBed.as
) and referenced with a special xrefDataUrl
setting, whereas the bigRmsk file location is named with the standard bigDataUrl
setting.
The bigRmsk files are in an indexed binary format. The main advantage of this format is that only those portions of the file needed to display a particular region are transferred to the Genome Browser server. Because of this, bigRmsk files have considerably faster display performance than if they were stored in a text-based format. The bigRmsk file remains on your local web-accessible server (http, https or ftp), not on the UCSC server, and only the portion needed for the currently displayed chromosomal position is locally cached as a "sparse file". If you do not have access to a web-accessible server and need hosting space for your bigRmsk files, please see the Hosting section of the Track Hub Help documentation.
The following autoSql definition is used to specify the main bigRmsk files. This
definition, contained in the file bigRmsk.as, is
pulled in when the bedToBigBed
utility is run with the -as=bigRmsk.as
option.
table bigRmskBed
"Repetitive Element Annotation"
(
string chrom; "Reference sequence chromosome or scaffold"
uint chromStart; "Start position of visualization on chromosome"
- uint chromEnd; "End position of visualation on chromosome"
+ uint chromEnd; "End position of visualization on chromosome"
string name; "Name repeat, including the type/subtype suffix"
uint score; "Divergence score"
char[1] strand; "+ or - for strand"
uint thickStart; "Start position of aligned sequence on chromosome"
uint thickEnd; "End position of aligned sequence on chromosome"
uint reserved; "Reserved"
uint blockCount; "Count of sequence blocks"
lstring blockSizes; "A comma-separated list of the block sizes(+/-)"
lstring blockStarts; "A comma-separated list of the block starts(+/-)"
uint id; "A unique identifier for the joined annotations in this record"
lstring description; "A comma separated list of technical annotation descriptions"
)
An example: bedToBigBed -tab -as=bigRmsk.as -type=bed9+5 bigRmsk.txt hg38.chrom.sizes bigRmsk.bb
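Because the example above declares -type=bed9+5, each input row is expected to carry 14 tab-separated fields, one per field of the autoSql definition. A minimal pre-flight sketch is to count fields with awk before running bedToBigBed. The file name sample.tsv and the row values below are invented for illustration and are not taken from the UCSC example data:

```shell
# Hypothetical one-row input; the values only illustrate the 14-field layout.
printf 'chr1\t100\t200\tAluY#SINE/Alu\t10\t+\t100\t200\t0\t1\t100\t0\t1\tdemo\n' > sample.tsv

# Count rows whose number of tab-separated fields differs from 14 (bed9+5).
bad_rows=$(awk -F'\t' 'NF != 14 {n++} END {print n+0}' sample.tsv)
echo "rows with unexpected field count: $bad_rows"
```

A non-zero count suggests the input does not match the declared type and is worth inspecting before conversion.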
Alongside the bigRmsk file, a supporting bigBed can provide alignment data. The following autoSql
definition is used to create this supporting file, pointed to online with xrefDataUrl
,
rather than the standard bigDataUrl
used with bigRmsk. The file
bigRmskAlignBed.as is pulled in when
the bedToBigBed
utility is run with the -as=bigRmskAlignBed.as
option.
table bigRmskAlignBed
-"Repetitive Element Alignment Auxilary Data"
+"Repetitive Element Alignment Auxiliary Data"
(
string chrom; "Reference sequence chromosome or scaffold"
uint chromStart; "Start position of alignment on chromosome"
uint chromEnd; "End position of alignment on chromosome"
uint chromRemain; "Remaining bp in the chromosome or scaffold"
float score; "alignment score (sw, bits or evalue)"
float percSubst; "Base substitution percentage"
float percDel; "Base deletion percentage"
float percIns; "Bases insertion percentage"
char[1] strand; "Strand - either + or -"
string repName; "Name of repeat"
string repType; "Type of repeat"
string repSubtype; "Subtype of repeat"
uint repStart; "Start in repeat sequence"
uint repEnd; "End in repeat sequence"
@@ -96,47 +95,47 @@
uint id; "The ID of the hit. Used to link related fragments"
lstring calignData; "The alignment data stored as a single string"
)
An example: bedToBigBed -tab -as=bigRmskAlignBed.as -type=bed3+14 bigRmskAlign.tsv.sorted.txt hg38.chrom.sizes bigRmskAlign.bb
CHECK - ISSUE IS xrefDataUrl doesn't work on this data yet.
Note that the bedToBigBed
utility uses a substantial amount of memory: approximately
25% more RAM than the uncompressed BED input file.
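As a rough planning aid, the 25% figure above can be turned into a byte estimate from the size of the input file. This is only a sketch: the file name is a placeholder, the placeholder content exists only so the snippet runs, and the result is an approximation rather than a guarantee of actual memory use.

```shell
# Hypothetical input file; substitute your own uncompressed BED-like text file.
input=input.tsv.sorted.txt
printf 'chr1\t0\t10\n' > "$input"   # placeholder content so the sketch runs

# Size of the uncompressed input in bytes.
size_bytes=$(wc -c < "$input" | tr -d ' ')

# Add 25% on top of the input size, using integer arithmetic.
est_bytes=$(( size_bytes * 125 / 100 ))
echo "input: $size_bytes bytes; estimated RAM: $est_bytes bytes"
```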
To create a bigRmsk track and its supporting file, follow the steps below. All input
files for bedToBigBed
must be sorted on the coordinates of the first two columns,
-sort -k1,1 -k2,2n input.tsv.txt > input.tsv.sorted.txt
. To learn about a perl
+sort -k1,1 -k2,2n input.tsv.txt > input.tsv.sorted.txt
. To learn about a perl
program that can build the tab-separated values (tsv) input text files for bedToBigBed from the
RepeatMasker output files, contact Robert Hubley: https://github.com/rmhubley.
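The sort requirement above can be checked before conversion. The sketch below sorts a small made-up file with the same keys and then verifies the order with sort -c. The three-column rows are illustrative only, and LC_ALL=C is an assumption added here for byte-wise, reproducible ordering; it is not part of the command given above.

```shell
# Hypothetical unsorted input (three columns only, for illustration).
printf 'chr2\t50\t60\nchr1\t200\t300\nchr1\t5\t20\n' > input.tsv.txt

# Sort by chromosome name, then numerically by start position.
# LC_ALL=C is an assumption added for reproducible byte-wise ordering.
LC_ALL=C sort -k1,1 -k2,2n input.tsv.txt > input.tsv.sorted.txt

# sort -c exits non-zero if the file is not in the requested order.
if LC_ALL=C sort -c -k1,1 -k2,2n input.tsv.sorted.txt; then
  sorted_ok=yes
else
  sorted_ok=no
fi
echo "sorted: $sorted_ok"
```

Running the same check on a file that has not been sorted is a cheap way to catch ordering problems before bedToBigBed reports them.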
Step 1. If you already have an input file you would like to convert to a bigRmsk, skip to Step 3. Otherwise, download this example bigRmsk.txt file for the human GRCh38 (hg38) assembly.
Step 2.
-If you would like to include the optional auxilary alignment data bigRmskAlign.bb
file,
+If you would like to include the optional auxiliary alignment data bigRmskAlign.bb
file,
download the bigRmskAlign.txt file.
Step 3.
Download the autoSql file bigRmsk.as needed by
-bedToBigBed
. If you have opted to include the optional auxilary alignment data file,
+bedToBigBed
. If you have opted to include the optional auxiliary alignment data file,
bigRmskAlign.bb, with your bigRmsk file, you must also download the autoSql file
bigRmskAlignBed.as.
Here are wget commands to obtain the above files and the hg38.chrom.sizes file mentioned below:
wget https://genome.ucsc.edu/goldenPath/help/examples/
wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.txt
wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlign.txt
wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.as
wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlign.as
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
Step 4.
Download the bedToBigBed
program from the UCSC
binary utilities directory.
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1:1-21,571&hgct_customText=track%20type=bigRmsk%20name=Example%20bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb%20visibility=full
After this example bigRmsk is loaded in the Genome Browser, click into an item on the browser's track display. Note that the details page display lacks information about the individual alignments, as this example does not include the optional supporting alignment file.
This example can also be loaded in a Track Hub with a stanza such as the following:
track ExBigRmsk
shortLabel Example bigRmsk
longLabel This is an example Track Hub Stanza
type bigRmsk
visibility full
bigDataUrl http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb
-NOTE: FOR WHEN REDOING PAGE, only Track Hubs now allow clicking into hgc.
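If the optional alignment bigBed is also hosted, the stanza could in principle name it with the xrefDataUrl setting described earlier. The following sketch writes such a stanza to a local trackDb.txt. The xrefDataUrl value is a placeholder on example.org, not a real UCSC file, and the note earlier in this page indicates that xrefDataUrl may not yet work with this example data.

```shell
# Write a hypothetical stanza to trackDb.txt.
# The xrefDataUrl value is a placeholder, not an existing file.
cat > trackDb.txt <<'EOF'
track ExBigRmsk
shortLabel Example bigRmsk
longLabel This is an example Track Hub Stanza
type bigRmsk
visibility full
bigDataUrl http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb
xrefDataUrl https://example.org/hub/hg38/bigRmskAlign.bb
EOF
```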