b7d1897d8117b9a1e39f88710af76fff8172fed6
dschmelt
Wed Aug 3 15:34:31 2022 -0700
Proofreading the doc for #29356

diff --git src/hg/htdocs/goldenPath/help/bigRmsk.html src/hg/htdocs/goldenPath/help/bigRmsk.html
index 406467b..c644765 100755
--- src/hg/htdocs/goldenPath/help/bigRmsk.html
+++ src/hg/htdocs/goldenPath/help/bigRmsk.html
@@ -1,94 +1,93 @@

bigRmsk Track Format

This page is under development and is not ready for public use.

The bigRmsk format allows for the display of annotations of a genome generated by the RepeatMasker program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of RepeatMasker is a detailed annotation of the repeats that are present in the "query" sequence as well as a modified version of this query sequence in which all the annotated repeats have been masked, where the default replaces the discovered repeats by Ns. The bigRmsk format enables taking the annotation output of RepeatMasker and converting it into a compressed and indexed version of a -bigBed file, where the results when -identified as type bigRmsk in a Track Hub can be visualized as described -below.

+bigBed file, where the results can be +identified as type bigRmsk in a Track Hub and can be visualized as described +below.

The bigRmsk files are created using the program bedToBigBed. It must be run with the -as option to pull in a special autoSql (.as) file, bigRmskBed.as that defines the fields -of bigRmsk. Along side the bigRmsk file, an auxilary data bigBed can be made, with its own .as +of bigRmsk. Along with the bigRmsk file, an auxiliary data bigBed can be made, with its own .as definitions file (bigRmskAlignBed.as) and referenced with a special xrefDataUrl setting, whereas the bigRmsk file location is named with the standard bigDataUrl setting.

The bigRmsk files are in an indexed binary format. The main advantage of this format is that only those portions of the file needed to display a particular region are transferred to the Genome Browser server. Because of this, bigRmsk files have considerably faster display performance than if they were stored in a text-based format. The bigRmsk file remains on your local web-accessible server (http, https or ftp), not on the UCSC server, and only the portion needed for the currently displayed chromosomal position is locally cached as a "sparse file". If you do not have access to a web-accessible server and need hosting space for your bigRmsk files, please see the Hosting section of the Track Hub Help documentation.

bigRmsk file definitions

The following autoSql definition is used to specify the main bigRmsk files. This definition, contained in the file bigRmsk.as, is pulled in when the bedToBigBed utility is run with the -as=bigRmsk.as option.

bigRmsk.as

table bigRmskBed
 "Repetitive Element Annotation" 
     (
     string  chrom;        "Reference sequence chromosome or scaffold" 
     uint    chromStart;    "Start position of visualization on chromosome" 
-    uint    chromEnd;    "End position of visualation on chromosome" 
+    uint    chromEnd;    "End position of visualization on chromosome" 
     string  name;        "Name of repeat, including the type/subtype suffix" 
     uint    score;        "Divergence score" 
     char[1] strand;        "+ or - for strand" 
     uint    thickStart;    "Start position of aligned sequence on chromosome" 
     uint    thickEnd;    "End position of aligned sequence on chromosome" 
     uint      reserved;    "Reserved" 
     uint    blockCount;    "Count of sequence blocks" 
     lstring blockSizes;     "A comma-separated list of the block sizes(+/-)" 
     lstring blockStarts;    "A comma-separated list of the block starts(+/-)" 
     uint    id;             "A unique identifier for the joined annotations in this record" 
     lstring description;    "A comma separated list of technical annotation descriptions"
     )

An example: bedToBigBed -tab -as=bigRmsk.as -type=bed9+5 bigRmsk.txt hg38.chrom.sizes bigRmsk.bb.
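Because the example command declares -type=bed9+5, each input row is expected to carry 14 tab-separated fields, matching the 14 fields in the autoSql definition above. The following is a minimal sketch of a pre-flight check that could be run before bedToBigBed. The sample rows and their values are invented placeholders, not real RepeatMasker output, and the check itself is an illustration rather than part of the UCSC tooling.

```shell
# Hypothetical sample input: one 14-field row and one truncated 3-field row.
# All values are placeholders made up for this sketch.
workdir=$(mktemp -d)
printf 'chr1\t100\t400\tname1\t12\t+\t100\t400\t0\t1\t300\t0\t1\tplaceholder\n' > "$workdir/bigRmsk.txt"
printf 'chr1\t500\t900\n' >> "$workdir/bigRmsk.txt"

# Count rows that do not have exactly 14 tab-separated fields (bed9+5),
# or whose chromStart (column 2) is greater than chromEnd (column 3).
bad_rows=$(awk -F '\t' 'NF != 14 || $2 > $3 { n++ } END { print n + 0 }' "$workdir/bigRmsk.txt")
echo "rows failing the check: $bad_rows"
```

Here the truncated second row is the only one counted. The same idea would apply to the auxiliary file by swapping in its own field count.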

Supporting bigRmskAlign.bb auxilary data

Supporting bigRmskAlign.bb auxiliary data

Alongside the bigRmsk file, a supporting bigBed can provide alignment data. The following autoSql definition is used to create this supporting file, which is referenced with the xrefDataUrl setting rather than the standard bigDataUrl setting used for the bigRmsk file. The file bigRmskAlignBed.as is pulled in when the bedToBigBed utility is run with the -as=bigRmskAlignBed.as option.

bigRmskAlignBed.as

table bigRmskAlignBed
-"Repetitive Element Alignment Auxilary Data" 
+"Repetitive Element Alignment Auxiliary Data" 
     (
     string  chrom;        "Reference sequence chromosome or scaffold" 
     uint    chromStart;    "Start position of alignment on chromosome" 
     uint    chromEnd;    "End position of alignment on chromosome" 
     uint    chromRemain;    "Remaining bp in the chromosome or scaffold" 
     float   score;          "alignment score (sw, bits or evalue)" 
     float   percSubst;      "Base substitution percentage" 
     float   percDel;        "Base deletion percentage" 
     float   percIns;        "Base insertion percentage" 
     char[1] strand;         "Strand - either + or -" 
     string  repName;        "Name of repeat" 
     string  repType;        "Type of repeat" 
     string  repSubtype;     "Subtype of repeat" 
     uint    repStart;       "Start in repeat sequence" 
     uint    repEnd;         "End in repeat sequence" 
@@ -96,47 +95,47 @@
     uint    id;             "The ID of the hit. Used to link related fragments" 
     lstring calignData;     "The alignment data stored as a single string" 
     )

An example: bedToBigBed -tab -as=bigRmskAlignBed.as -type=bed3+14 bigRmskAlign.tsv.sorted.txt hg38.chrom.sizes bigRmskAlign.bb. CHECK: the issue is that xrefDataUrl doesn't work on this data yet.

Note that the bedToBigBed utility uses a substantial amount of memory: approximately 25% more RAM than the uncompressed BED input file.
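The memory note above can be turned into a rough estimate before running the conversion. This sketch only applies the stated 25% rule of thumb to the input file size. The 1000-byte stand-in file is made up for illustration, and actual memory use may differ.

```shell
# Create a stand-in input file of exactly 1000 bytes (placeholder content).
workdir=$(mktemp -d)
input="$workdir/bigRmsk.txt"
printf '%1000s' '' > "$input"

# Size of the uncompressed input in bytes (tr strips padding some wc versions add).
input_bytes=$(wc -c < "$input" | tr -d ' ')

# Apply the rule of thumb: about 25% more RAM than the input size.
est_bytes=$(( input_bytes * 125 / 100 ))
echo "input: $input_bytes bytes, estimated bedToBigBed RAM: about $est_bytes bytes"
```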

Creating a bigRmsk track

To create a bigRmsk track and its supporting file, follow the steps below. All input files to bedToBigBed must be sorted by the first two columns (chromosome name, then start position), -sort -k1,1 -k2,2n input.tsv.txt > input.tsv.sorted.txt. To learn about a perl +sort -k1,1 -k2,2n input.tsv.txt > input.tsv.sorted.txt. To learn about a perl program that can build the tab-separated values (tsv) input text files for bedToBigBed from the RepeatMasker output files, contact Robert Hubley: https://github.com/rmhubley.
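The sort requirement can be sketched on a tiny made-up file as follows. The three rows are hypothetical and truncated to three columns for brevity. Setting LC_ALL=C is an assumption added here so that chromosome names are compared bytewise regardless of the local locale; it is not part of the command given above.

```shell
# Hypothetical unsorted input (tab-separated); real rows have more columns.
workdir=$(mktemp -d)
printf 'chr2\t500\t900\n'    > "$workdir/input.tsv.txt"
printf 'chr1\t2000\t2500\n' >> "$workdir/input.tsv.txt"
printf 'chr1\t100\t400\n'   >> "$workdir/input.tsv.txt"

# Sort by chromosome name (column 1), then numerically by start (column 2).
LC_ALL=C sort -k1,1 -k2,2n "$workdir/input.tsv.txt" > "$workdir/input.tsv.sorted.txt"

# sort -c exits non-zero if the file is not already in the requested order.
if LC_ALL=C sort -c -k1,1 -k2,2n "$workdir/input.tsv.sorted.txt"; then
    sorted_status=ok
else
    sorted_status=unsorted
fi
first_line=$(head -n 1 "$workdir/input.tsv.sorted.txt")
echo "sort check: $sorted_status"
```

Running the sort -c check on its own is a quick way to confirm an existing file is ready for bedToBigBed without rewriting it.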

Step 1. If you already have an input file you would like to convert to a bigRmsk, skip to Step 3. Otherwise, download this example bigRmsk.txt file for the human GRCh38 (hg38) assembly.

Step 2. -If you would like to include the optional auxilary alignment data bigRmskAlign.bb file, +If you would like to include the optional auxiliary alignment data bigRmskAlign.bb file, download the bigRmskAlign.txt file.

Step 3. Download the autoSql file bigRmsk.as needed by -bedToBigBed. If you have opted to include the optional auxilary alignment data file, +bedToBigBed. If you have opted to include the optional auxiliary alignment data file, bigRmskAlign.bb, with your bigRmsk file, you must also download the autoSql file bigRmskAlignBed.as.

Here are wget commands to obtain the above files and the hg38.chrom.sizes file mentioned below:

wget https://genome.ucsc.edu/goldenPath/help/examples/
 wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.txt
 wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlign.txt
 wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.as
 wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlign.as
 wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes

Step 4. Download the bedToBigBed program from the UCSC binary utilities directory.

@@ -200,33 +199,33 @@

http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1:1-21,571&hgct_customText=track%20type=bigRmsk%20name=Example%20bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb%20visibility=full

After this example bigRmsk is loaded in the Genome Browser, click into an item on the browser's track display. Note that the details page display lacks information about the individual alignments, as this example does not include the optional supporting alignment file.

This example can also be loaded in a Track Hub with a stanza such as the following:

 track ExBigRmsk
 shortLabel Example bigRmsk
 longLabel This is an example Track Hub Stanza
 type bigRmsk
 visibility full
 bigDataUrl http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb
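For a hub that also carries the optional auxiliary alignment file, the stanza would presumably gain an xrefDataUrl line next to bigDataUrl, as described earlier on this page. The sketch below writes such a stanza to a trackDb.txt file. The track name, labels, and both example.org URLs are hypothetical placeholders, and this page also notes that xrefDataUrl may not yet work on this data.

```shell
# Write a hypothetical trackDb.txt stanza; the example.org URLs are placeholders.
workdir=$(mktemp -d)
cat > "$workdir/trackDb.txt" <<'EOF'
track ExBigRmskAlign
shortLabel Example bigRmsk
longLabel Example bigRmsk with auxiliary alignment data
type bigRmsk
visibility full
bigDataUrl https://example.org/hub/bigRmsk.bb
xrefDataUrl https://example.org/hub/bigRmskAlign.bb
EOF

# Confirm that both data URL settings are present in the stanza.
url_settings=$(grep -c -E '^(bigDataUrl|xrefDataUrl) ' "$workdir/trackDb.txt")
echo "data URL settings found: $url_settings"
```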

-NOTE: FOR WHEN REDOING PAGE, only Track Hubs now allow clicking into hgc.