src/hg/htdocs/goldenPath/help/bigRmsk.html d3de6531647d5797033474fa96e29cc88df3b37b

d3de6531647d5797033474fa96e29cc88df3b37b
brianlee
  Thu May 26 09:57:32 2022 -0700
Not for Code Review, see ticket #29356, check-in of temporary status of bigRmsk at request MarkD

diff --git src/hg/htdocs/goldenPath/help/bigRmsk.html src/hg/htdocs/goldenPath/help/bigRmsk.html
new file mode 100755
index 0000000..7e0361f
--- /dev/null
+++ src/hg/htdocs/goldenPath/help/bigRmsk.html
@@ -0,0 +1,282 @@
+<!DOCTYPE html>
+<!--#set var="TITLE" value="Genome Browser bigRmsk RepeatMasker Format" -->
+<!--#set var="ROOT" value="../.." -->
+
+<!-- Relative paths to support mirror sites with non-standard GB docs install -->
+<!--#include virtual="$ROOT/inc/gbPageStart.html" -->
+
+<h1>bigRmsk Track Format</h1>
+
+<h3>This page is under development and is not ready for public use.</h3>
+<p>
+The bigRmsk format allows for the display of annotations of a genome generated by the
+<a href="http://www.repeatmasker.org" target="_blank">RepeatMasker</a>
+program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.
+RepeatMasker outputs a detailed annotation of the repeats present in the &quot;query&quot;
+sequence, along with a modified version of that sequence in which all annotated repeats
+have been masked (by default, replaced with Ns). The bigRmsk format stores the annotation
+output of RepeatMasker in a compressed and indexed
+<a href="/goldenPath/help/bigBed.html">bigBed</a> file. When the file is identified as
+<code>type bigRmsk</code> in a Track Hub, the results can be visualized as described
+<a href="#linkToVisualizationSECTION_TOCOME">below</a>.</p>
+<p>
+The bigRmsk files are created using the program <code>bedToBigBed</code>. It must be run with the 
+<code>-as</code> option to pull in a special <a href="http://www.linuxjournal.com/article/5949" 
+target="_blank">autoSql</a> (<em>.as</em>) file, <code>bigRmsk.as</code>, that defines the fields
+of bigRmsk. Alongside the bigRmsk file, an auxiliary bigBed data file can be made with its own .as
+definition file (<code>bigRmskAlignBed.as</code>). The auxiliary file is referenced with a special
+<code>xrefDataUrl</code> setting, whereas the bigRmsk file location is given with the standard
+<code>bigDataUrl</code> setting.</p>
+<p>
+The bigRmsk files are in an indexed binary format. The main advantage of this format is that only 
+those portions of the file needed to display a particular region are transferred to the Genome 
+Browser server. Because of this, bigRmsk files have considerably faster display performance than
+if they were stored in a text-based format. The bigRmsk file remains on your local 
+web-accessible server (http, https or ftp), not on the UCSC server, and only the portion needed for 
+the currently displayed chromosomal position is locally cached as a &quot;sparse file&quot;. If you
+do not have access to a web-accessible server and need hosting space for your bigRmsk files, please
+see the <a href="hgTrackHubHelp.html#Hosting">Hosting</a> section of the Track Hub Help
+documentation.</p>
+
+<h2 id="bigRmsk">bigRmsk file definitions</h2>
+<p>
+The following autoSql definition is used to specify the main bigRmsk files. This
+definition, contained in the file <a href="examples/bigRmsk.as"><em>bigRmsk.as</em></a>, is 
+pulled in when the <code>bedToBigBed</code> utility is run with the <code>-as=bigRmsk.as</code> 
+option. </p>
+<h6>bigRmsk.as</h6>
+<pre><code>table bigRmskBed
+"Repetitive Element Annotation" 
+    (
+    string  chrom;        "Reference sequence chromosome or scaffold" 
+    uint    chromStart;    "Start position of visualization on chromosome" 
+    uint    chromEnd;    "End position of visualization on chromosome" 
+    string  name;        "Name repeat, including the type/subtype suffix" 
+    uint    score;        "Divergence score" 
+    char[1] strand;        "+ or - for strand" 
+    uint    thickStart;    "Start position of aligned sequence on chromosome" 
+    uint    thickEnd;    "End position of aligned sequence on chromosome" 
+    uint      reserved;    "Reserved" 
+    uint    blockCount;    "Count of sequence blocks" 
+    lstring blockSizes;     "A comma-separated list of the block sizes(+/-)" 
+    lstring blockStarts;    "A comma-separated list of the block starts(+/-)" 
+    uint    id;             "A unique identifier for the joined annotations in this record" 
+    lstring description;    "A comma separated list of technical annotation descriptions"
+    )</code></pre>
+<p>An example: <code>bedToBigBed -tab -as=bigRmsk.as -type=bed9+5 bigRmsk.txt
+hg38.chrom.sizes bigRmsk.bb</code>.</p>
+
+<h3 id="supporting">Supporting bigRmskAlign.bb auxiliary data</h3>
+<p>
+Alongside the bigRmsk file, a supporting bigBed can provide alignment data. The following autoSql
+definition is used to create this supporting file, pointed to online with <code>xrefDataUrl</code>,
+rather than the standard <code>bigDataUrl</code> used with bigRmsk. The file
+<a href="examples/bigRmskAlignBed.as"><em>bigRmskAlignBed.as</em></a>, is pulled in when
+the <code>bedToBigBed</code> utility is run with the <code>-as=bigRmskAlignBed.as</code>
+option.</p>
+<h6>bigRmskAlignBed.as</h6>
+<pre><code>table bigRmskAlignBed
+"Repetitive Element Alignment Auxiliary Data" 
+    (
+    string  chrom;        "Reference sequence chromosome or scaffold" 
+    uint    chromStart;    "Start position of alignment on chromosome" 
+    uint    chromEnd;    "End position of alignment on chromosome" 
+    uint    chromRemain;    "Remaining bp in the chromosome or scaffold" 
+    float   score;          "alignment score (sw, bits or evalue)" 
+    float   percSubst;      "Base substitution percentage" 
+    float   percDel;        "Base deletion percentage" 
+    float   percIns;        "Bases insertion percentage" 
+    char[1] strand;         "Strand - either + or -" 
+    string  repName;        "Name of repeat" 
+    string  repType;        "Type of repeat" 
+    string  repSubtype;     "Subtype of repeat" 
+    uint    repStart;       "Start in repeat sequence" 
+    uint    repEnd;         "End in repeat sequence" 
+    uint    repRemain;      "Remaining unaligned bp in the repeat sequence" 
+    uint    id;             "The ID of the hit. Used to link related fragments" 
+    lstring calignData;     "The alignment data stored as a single string" 
+    )</code></pre>
+<p>An example: <code>bedToBigBed -tab -as=bigRmskAlignBed.as -type=bed3+14
+bigRmskAlign.tsv.sorted.txt hg38.chrom.sizes bigRmskAlign.bb</code>.
+Note that <code>xrefDataUrl</code> does not yet work with this data.</p>
+<p>
+Note that the <code>bedToBigBed</code> utility uses a substantial amount of memory: approximately 
+25% more RAM than the uncompressed BED input file.</p>
+
+<h2 id="steps">Creating a bigRmsk track</h2>
+<p>
+To create a bigRmsk track and its supporting file, follow the steps below. All input
+files to <code>bedToBigBed</code> must be sorted by chromosome and then by start position
+(the first two columns), for example with
+<code>sort -k1,1 -k2,2n input.tsv.txt &gt; input.tsv.sorted.txt</code>. To learn about a Perl
+program that can build the tab-separated values (tsv) input text files for
+<code>bedToBigBed</code> from the RepeatMasker output files, contact Robert Hubley:
+<a href="https://github.com/rmhubley" target="_blank">https://github.com/rmhubley</a>.</p> 
+<p>
+<strong>Step 1.</strong> 
+If you already have an input file you would like to convert to a bigRmsk, skip to <em>Step 3</em>.
+Otherwise, download <a href="examples/bigRmsk.txt">this example bigRmsk.txt
+file</a> for the human GRCh38 (hg38) assembly.</p>
+<p>
+<strong>Step 2.</strong> 
+If you would like to include the optional auxiliary alignment data <code>bigRmskAlign.bb</code> file,
+download this <a href="examples/bigRmskAlign.txt">bigRmskAlign.txt file</a>.</p>
+<p>
+<strong>Step 3.</strong> 
+Download the autoSql file <em><a href="examples/bigRmsk.as">bigRmsk.as</a></em> needed by 
+<code>bedToBigBed</code>. If you have opted to include the optional auxiliary alignment data file,
+bigRmskAlign.bb, with your bigRmsk file, you must also download the autoSql file
+<a href="examples/bigRmskAlignBed.as">bigRmskAlignBed.as</a>.</p>
+<p>
+Here are wget commands to obtain the above files and the hg38.chrom.sizes file mentioned below:</p>
+<pre><code>wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.txt
+wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlign.txt
+wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.as
+wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlignBed.as
+wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
+</code></pre>
+<p>
+<strong>Step 4.</strong> 
+Download the <code>bedToBigBed</code> program from the UCSC
+<a href="http://hgdownload.soe.ucsc.edu/admin/exe/">binary utilities directory</a>.</p>
+<p>
+<strong>Step 5.</strong> 
+Download the  <em>chrom.sizes</em> file for any assembly hosted at UCSC from our 
+<a href="http://hgdownload.soe.ucsc.edu/downloads.html">downloads</a> page (click on &quot;Full 
+data set&quot; for any assembly). For example, the <em>hg38.chrom.sizes</em> file for the hg38 
+database is located at 
+<a href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes" 
+target="_blank">http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes</a>.</p>
+<p>
+With these files in place, run <code>bedToBigBed</code> to create the bigRmsk file:</p>
+<pre>
+<code>bedToBigBed -tab -as=bigRmsk.as -type=bed9+5 bigRmsk.txt hg38.chrom.sizes bigRmsk.bb</code></pre>
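+<p>
+If you are also building the optional alignment file, the same pattern applies. The following is a
+minimal sketch that assumes the input is the <em>bigRmskAlign.txt</em> file from <em>Step 2</em>;
+the intermediate file name <em>bigRmskAlign.sorted.txt</em> is only illustrative:</p>
+<pre><code># Sort by chromosome, then numerically by start position
+sort -k1,1 -k2,2n bigRmskAlign.txt &gt; bigRmskAlign.sorted.txt
+# Convert using the alignment autoSql definition (17 fields, hence bed3+14)
+bedToBigBed -tab -as=bigRmskAlignBed.as -type=bed3+14 bigRmskAlign.sorted.txt hg38.chrom.sizes bigRmskAlign.bb</code></pre>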
+<p>
+<strong>Step 6.</strong> 
+Move the newly created bigRmsk file (<em>bigRmsk.bb</em>) to a web-accessible http, https or ftp
+location. If you generated the optional <em>bigRmskAlign.bb</em> file, move it to a web-accessible
+location as well, most likely the same location as the <em>bigRmsk.bb</em> file.</p> 
+<p>
+<strong>Step 7.</strong> 
+Construct a <a href="hgTracksHelp.html#CustomTracks">custom track</a> using a single 
+<a href="hgTracksHelp.html#TRACK">track line</a>. Note that any of the track attributes listed 
+<a href="customTrack.html#TRACK">here</a> are applicable to tracks of type bigBed. The most basic
+version of the track line will look something like this:</p>
+<pre>track type=bigRmsk name="My bigRmsk" description="A RepeatMasker Track" bigDataUrl=http://myorg.edu/mylab/bigRmsk.bb</pre>
+<p>
+<strong>Step 8.</strong> 
+Paste the custom track line into the text box on the <a href="../../cgi-bin/hgCustom?db=hg38">custom
+track management page</a>. Navigate to chr1:1-21,571 to see the example data for this track.</p>
+<p>
+The <code>bedToBigBed</code> program can be run with several additional options. For a full
+list of the available options, type <code>bedToBigBed</code> (with no arguments) on the command line
+to display the usage message. </p>
+
+<h2 id="examples">Examples</h2>
+
+<h3 id="example1">Example #1</h3>
+<p>
+In this example, you will create a bigRmsk custom track using an existing bigRmsk file,
+<em>bigRmsk.bb</em>, located on the UCSC Genome Browser http server. This file contains data for 
+the hg38 assembly.</p>
+<p>
+To create a custom track using this bigRmsk file:</p>
+<ol>
+  <li>
+  Construct a track line that references the file:
+  <pre><code>track type=bigRmsk name=&quot;bigRmsk Example One&quot; description=&quot;A bigRmsk file&quot; visibility=full bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb</code></pre></li>
+  <li>
+  Paste the track line into the <a href="../../cgi-bin/hgCustom?db=hg38">custom track management 
+  page</a> for the human assembly hg38 (Dec. 2013).</li> 
+  <li>
+  Click the &quot;submit&quot; button.</li>
+  <li>
+  Navigate to <code>chr1:1-21,571</code> to see the track.</li>
+</ol>
+<p>
+Custom tracks can also be loaded via one URL line. 
+<a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1:1-21,571&hgct_customText=track%20type=bigRmsk%20name=Example%20bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb%20visibility=full"
+target="_blank">This link</a> loads the same <em>bigRmsk.bb</em> track and sets additional display 
+parameters in the URL:</p>
+<pre><code>http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1:1-21,571&hgct_customText=track%20type=bigRmsk%20name=Example%20bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb%20visibility=full</code></pre>
+<p>
+After this example bigRmsk is loaded in the Genome Browser, click into an item on the browser's 
+track display. Note that the details page display lacks information about the individual alignments, 
+as this example does not include the optional supporting alignment file.</p>
+<p>
+This example can also be loaded in a Track Hub with a stanza such as the following:</p>
+<pre>
+track ExBigRmsk
+shortLabel Example bigRmsk
+longLabel This is an example Track Hub Stanza
+type bigRmsk
+visibility full
+bigDataUrl http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb
+</pre>
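+<p>
+If the optional alignment file is hosted as well, the stanza would presumably gain an
+<code>xrefDataUrl</code> line. The sketch below is an assumption based on the setting described
+earlier on this page; the track name and the <em>myorg.edu</em> URLs are hypothetical, and
+<code>xrefDataUrl</code> may not yet work with this data:</p>
+<pre>
+track ExBigRmskAlign
+shortLabel Example bigRmsk
+longLabel Example bigRmsk stanza with auxiliary alignment data (hypothetical URLs)
+type bigRmsk
+visibility full
+bigDataUrl http://myorg.edu/mylab/bigRmsk.bb
+xrefDataUrl http://myorg.edu/mylab/bigRmskAlign.bb
+</pre>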
+<p>
+Note: at present, only Track Hubs allow clicking through to the item details page (hgc).</p>
+
+<!---
+NOTE: The below is inaccurate and just a placeholder for when <b>xrefDataUrl works</b> to give an example building it.
+<h3 id="example2">Example #2</h3>
+<p>
+In this example, you will create a bigRmsk file from an existing bigRmsk input file, 
+<em>bigRmsk.txt</em>, located on the UCSC Genome Browser http server.</p>
+<ol>
+  <li>
+  Save the bed3+1 example file, <a href="examples/bigRmsk.txt"><em>bigRmsk.txt</em></a>, to your 
+  computer (<em>Step 6</em>, above).</li>
+  <li>
+  Save the autoSql file <a href="examples/bigRmsk.as"><em>bigRmsk.as</em></a> to your computer 
+  (<em>Step 3</em>, above).</li>
+  <li>
+  Download the 
+  <a href="http://hgdownload.soe.ucsc.edu/admin/exe/"><code>bedToBigBed</code> utility</a> 
+ (<em>Step 4</em>, above).</li>
+  <li>
+  Save the <a href="hg38.chrom.sizes"><em>hg38.chrom.sizes</em> text file</a> to your computer. 
+  This file contains the chrom.sizes for the human (hg38) assembly (<em>Step 5</em>, above).</li>
+  <li>
+  Run the <code>bedToBigBed</code> utility to create a binary indexed MAF file (<em>Step 6</em>,
+  above):
+<pre><code>bedToBigBed -type=bed3+1 -tab -as=bigRmsk.as bigRmsk.txt hg38.chrom.sizes bigRmsk.bb</code></pre></li>
+  <li>
+  Move the newly created bigRmsk file (<em>bigRmsk.bb</em>) to a web-accessible location (<em>Step 
+  7</em>, above).</li>
+  <li>
+  Construct a track line that points to the bigRmsk file (<em>Step 8</em>, above).</li>
+  <li>
+  Create the custom track on the human assembly hg38 (Dec. 2013), and view it in the Genome Browser 
+  (<em>step 9</em>, above).</li>
+</ol>
+-->
+<h2 id="share">Sharing your data with others</h2>
+<p>
+If you would like to share your bigRmsk data track with a colleague, learn how to create a URL by 
+looking at Example 6 on <a href="customTrack.html#EXAMPLE6">this page</a>.</p>
+
+<h2 id="extract">Extracting data from the bigRmsk format</h2>
+<p>
+Because bigRmsk files are an extension of bigBed files, which are indexed binary files, it can 
+be difficult to extract data from them. UCSC has developed the following programs to assist
+in working with bigBed formats, available from the 
+<a href="http://hgdownload.soe.ucsc.edu/admin/exe/">binary utilities directory</a>.</p>
+<ul>
+  <li>
+  <code>bigBedToBed</code> &mdash; converts a bigBed file to ASCII BED format.</li>
+  <li>
+  <code>bigBedSummary</code> &mdash; extracts summary information from a bigBed file.</li>
+  <li>
+  <code>bigBedInfo</code> &mdash; prints out information about a bigBed file.</li>
+</ul>
+<p>
+As with all UCSC Genome Browser programs, simply type the program name (with no parameters) at the 
+command line to view the usage statement.</p>
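+<p>
+For instance, the commands below sketch how a local copy of <em>bigRmsk.bb</em> might be inspected.
+The output file name <em>bigRmsk.chr1.txt</em> is illustrative, and the region matches the example
+data used on this page:</p>
+<pre><code># Print the autoSql field definitions stored in the file
+bigBedInfo -as bigRmsk.bb
+# Write the records overlapping chr1:1-21,571 as tab-separated text
+bigBedToBed -chrom=chr1 -start=0 -end=21571 bigRmsk.bb bigRmsk.chr1.txt</code></pre>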
+
+<h2 id="trouble">Troubleshooting</h2>
+<p>
+If you encounter an error when you run the <code>bedToBigBed</code> program, check your input 
+file for data coordinates that extend past the end of the chromosome. If these are present, run 
+the <code>bedClip</code> program 
+(<a href="http://hgdownload.soe.ucsc.edu/admin/exe/">available here</a>) to remove the problematic
+row(s) in your input file before running the <code>bedToBigBed</code> program.</p> 
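+<p>
+A minimal sketch of that clean-up, assuming the example file names used on this page (the name
+<em>bigRmsk.clipped.txt</em> is illustrative):</p>
+<pre><code># Drop rows that refer to positions beyond the chromosome ends
+bedClip bigRmsk.txt hg38.chrom.sizes bigRmsk.clipped.txt
+# Rebuild the bigRmsk file from the clipped input
+bedToBigBed -tab -as=bigRmsk.as -type=bed9+5 bigRmsk.clipped.txt hg38.chrom.sizes bigRmsk.bb</code></pre>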
+
+<!--#include virtual="$ROOT/inc/gbPageEnd.html" -->