d3de6531647d5797033474fa96e29cc88df3b37b brianlee Thu May 26 09:57:32 2022 -0700 Not for Code Review, see ticket #29356, check-in of temporary status of bigRmsk at request MarkD diff --git src/hg/htdocs/goldenPath/help/bigRmsk.html src/hg/htdocs/goldenPath/help/bigRmsk.html new file mode 100755 index 0000000..7e0361f --- /dev/null +++ src/hg/htdocs/goldenPath/help/bigRmsk.html @@ -0,0 +1,282 @@ +<!DOCTYPE html> +<!--#set var="TITLE" value="Genome Browser bigRmsk RepeatMasker Format" --> +<!--#set var="ROOT" value="../.." --> + +<!-- Relative paths to support mirror sites with non-standard GB docs install --> +<!--#include virtual="$ROOT/inc/gbPageStart.html" --> + +<h1>bigRmsk Track Format</h1> + +<h3>This page is under development and is not ready for public use.</h3> +<p> +The bigRmsk format allows for the display of annotations of a genome generated by the +<a href="http://www.repeatmasker.org" target="_blank">RepeatMasker</a> +program, which screens DNA sequences for interspersed repeats and low-complexity DNA sequences. +The output of RepeatMasker is a detailed annotation of the repeats that are present in +the "query" sequence as well as a modified version of this query sequence +in which all the annotated repeats have been masked; by default, the discovered repeats +are replaced by Ns. With the bigRmsk format, the annotation output +of RepeatMasker can be converted into a compressed and indexed version of a +<a href="/goldenPath/help/bigBed.html">bigBed</a> file. When the results are +identified as <code>type bigRmsk</code> in a Track Hub, they can be visualized as described +<a href="#linkToVisualizationSECTION_TOCOME">below</a>.</p> +<p> +The bigRmsk files are created using the program <code>bedToBigBed</code>. It must be run with the +<code>-as</code> option to pull in a special <a href="http://www.linuxjournal.com/article/5949" +target="_blank">autoSql</a> (<em>.as</em>) file, <code>bigRmskBed.as</code>, that defines the fields +of bigRmsk. 
Alongside the bigRmsk file, an auxiliary data bigBed can be made with its own .as +definitions file (<code>bigRmskAlignBed.as</code>). It is referenced with a special <code>xrefDataUrl</code> +setting, whereas the bigRmsk file location is named with the standard <code>bigDataUrl</code> setting.</p> +<p> +The bigRmsk files are in an indexed binary format. The main advantage of this format is that only +those portions of the file needed to display a particular region are transferred to the Genome +Browser server. Because of this, bigRmsk files have considerably faster display performance than +if they were stored in a text-based format. The bigRmsk file remains on your local +web-accessible server (http, https or ftp), not on the UCSC server, and only the portion needed for +the currently displayed chromosomal position is locally cached as a "sparse file". If you +do not have access to a web-accessible server and need hosting space for your bigRmsk files, please +see the <a href="hgTrackHubHelp.html#Hosting">Hosting</a> section of the Track Hub Help +documentation.</p> + +<h2 id="bigRmsk">bigRmsk file definitions</h2> +<p> +The following autoSql definition is used to specify the main bigRmsk files. This +definition, contained in the file <a href="examples/bigRmsk.as"><em>bigRmsk.as</em></a>, is +pulled in when the <code>bedToBigBed</code> utility is run with the <code>-as=bigRmsk.as</code> +option. 
</p> +<h6>bigRmsk.as</h6> +<pre><code>table bigRmskBed +"Repetitive Element Annotation" + ( + string chrom; "Reference sequence chromosome or scaffold" + uint chromStart; "Start position of visualization on chromosome" + uint chromEnd; "End position of visualation on chromosome" + string name; "Name repeat, including the type/subtype suffix" + uint score; "Divergence score" + char[1] strand; "+ or - for strand" + uint thickStart; "Start position of aligned sequence on chromosome" + uint thickEnd; "End position of aligned sequence on chromosome" + uint reserved; "Reserved" + uint blockCount; "Count of sequence blocks" + lstring blockSizes; "A comma-separated list of the block sizes(+/-)" + lstring blockStarts; "A comma-separated list of the block starts(+/-)" + uint id; "A unique identifier for the joined annotations in this record" + lstring description; "A comma separated list of technical annotation descriptions" + )</code></pre> +<p>An example: <code>bedToBigBed -tab -as=bigRmsk.as -type=bed9+5 bigRmsk.txt +hg38.chrom.sizes bigRmsk.bb</code>.</p> + +<h3 id="supporting">Supporting bigRmskAlign.bb auxiliary data</h3> +<p> +Alongside the bigRmsk file, a supporting bigBed can provide alignment data. The following autoSql +definition is used to create this supporting file, pointed to online with <code>xrefDataUrl</code>, +rather than the standard <code>bigDataUrl</code> used with bigRmsk. 
The file +<a href="examples/bigRmskAlignBed.as"><em>bigRmskAlignBed.as</em></a> is pulled in when +the <code>bedToBigBed</code> utility is run with the <code>-as=bigRmskAlignBed.as</code> +option.</p> +<h6>bigRmskAlignBed.as</h6> +<pre><code>table bigRmskAlignBed +"Repetitive Element Alignment Auxilary Data" + ( + string chrom; "Reference sequence chromosome or scaffold" + uint chromStart; "Start position of alignment on chromosome" + uint chromEnd; "End position of alignment on chromosome" + uint chromRemain; "Remaining bp in the chromosome or scaffold" + float score; "alignment score (sw, bits or evalue)" + float percSubst; "Base substitution percentage" + float percDel; "Base deletion percentage" + float percIns; "Bases insertion percentage" + char[1] strand; "Strand - either + or -" + string repName; "Name of repeat" + string repType; "Type of repeat" + string repSubtype; "Subtype of repeat" + uint repStart; "Start in repeat sequence" + uint repEnd; "End in repeat sequence" + uint repRemain; "Remaining unaligned bp in the repeat sequence" + uint id; "The ID of the hit. Used to link related fragments" + lstring calignData; "The alignment data stored as a single string" + )</code></pre> +<p>An example: <code>bedToBigBed -tab -as=bigRmskAlignBed.as -type=bed3+14 +bigRmskAlign.tsv.sorted.txt hg38.chrom.sizes bigRmskAlign.bb +</code>.CHECK - ISSUE IS xrefDataUrl doesn't work on this data yet.</p> +</p> +<p> +Note that the <code>bedToBigBed</code> utility uses a substantial amount of memory: approximately +25% more RAM than the uncompressed BED input file.</p> + +<h2 id="steps">Creating a bigRmsk track</h2> +<p> +To create a bigRmsk track and its supporting file, follow the steps below. All input +files to <code>bedToBigBed</code> must be sorted by the first two columns (chromosome, then start), for example: +<code>sort -k1,1 -k2,2n input.tsv.txt > input.tsv.sorted.txt</code>. 
To learn about a Perl +program that can build the tab-separated values (tsv) text files used as input to bedToBigBed from the +RepeatMasker output files, contact Robert Hubley: <a href="https://github.com/rmhubley" +target="_blank">https://github.com/rmhubley</a>.</p> +<p> +<strong>Step 1.</strong> +If you already have an input file you would like to convert to a bigRmsk, skip to <em>Step 3</em>. +Otherwise, download <a href="examples/bigRmsk.txt">this example bigRmsk.txt +file</a> for the human GRCh38 (hg38) assembly.</p> +<p> +<strong>Step 2.</strong> +If you would like to include the optional auxiliary alignment data <code>bigRmskAlign.bb</code> file, +download this <a href="examples/bigRmskAlign.txt">bigRmskAlign.txt file</a>.</p> +<p> +<strong>Step 3.</strong> +Download the autoSql file <em><a href="examples/bigRmsk.as">bigRmsk.as</a></em> needed by +<code>bedToBigBed</code>. If you have opted to include the optional auxiliary alignment data file, +bigRmskAlign.bb, with your bigRmsk file, you must also download the autoSql file +<a href="examples/bigRmskAlignBed.as">bigRmskAlignBed.as</a>.</p> +<p> +Here are wget commands to obtain the above files and the hg38.chrom.sizes file mentioned below: +<pre><code>wget https://genome.ucsc.edu/goldenPath/help/examples/ +wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.txt +wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlign.txt +wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.as +wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlignBed.as +wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes +</code></pre> +<p> +<strong>Step 4.</strong> +Download the <code>bedToBigBed</code> program from the UCSC +<a href="http://hgdownload.soe.ucsc.edu/admin/exe/">binary utilities directory</a>.</p> +<p> +<strong>Step 5.</strong> +Download the <em>chrom.sizes</em> file for any assembly hosted at UCSC from our +<a 
href="http://hgdownload.soe.ucsc.edu/downloads.html">downloads</a> page (click on "Full +data set" for any assembly). For example, the <em>hg38.chrom.sizes</em> file for the hg38 +database is located at +<a href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes" +target="_blank">http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes</a>. Then run <code>bedToBigBed</code> to create the bigRmsk file:</p> +<pre> +<code>bedToBigBed -tab -as=bigRmsk.as -type=bed9+5 bigRmsk.txt hg38.chrom.sizes bigRmsk.bb</code></pre> +<p> +<strong>Step 6.</strong> +Move the newly created bigRmsk file (<em>bigRmsk.bb</em>) to a web-accessible http, https or ftp +location. If you generated the <em>bigRmskAlign.bb</em> file, move it to a web-accessible +location as well, most likely the same location as the <em>bigRmsk.bb</em> file.</p> +<p> +<strong>Step 7.</strong> +Construct a <a href="hgTracksHelp.html#CustomTracks?db=hg38">custom track</a> using a single +<a href="hgTracksHelp.html#TRACK">track line</a>. Note that any of the track attributes listed +<a href="customTrack.html#TRACK">here</a> are applicable to tracks of type bigBed. The most basic +version of the track line will look something like this:</p> +<pre>track type=bigRmsk name="My bigRmsk" description="A RepeatMasker Track" bigDataUrl=http://myorg.edu/mylab/bigRmsk.bb</pre> +<p> +<strong>Step 8.</strong> +Paste the custom track line into the text box on the <a href="../../cgi-bin/hgCustom?db=hg38">custom +track management page</a>. Navigate to chr1:1-21,571 to see the example data for this track.</p> +<p> +The <code>bedToBigBed</code> program can be run with several additional options. For a full +list of the available options, type <code>bedToBigBed</code> (with no arguments) on the command line +to display the usage message. 
</p> + +<h2 id="examples">Examples</h2> + +<h3 id="example1">Example #1</h3> +<p> +In this example, you will create a bigRmsk custom track using an existing bigRmsk file, +<em>bigRmsk.bb</em>, located on the UCSC Genome Browser http server. This file contains data for +the hg38 assembly.</p> +<p> +To create a custom track using this bigRmsk file:</p> +<ol> + <li> + Construct a track line that references the file: + <pre><code>track type=bigRmsk name="bigRmsk Example One" description="A bigRmsk file" visibility=full bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb</code></pre></li> + <li> + Paste the track line into the <a href="../../cgi-bin/hgCustom?db=hg38">custom track management + page</a> for the human assembly hg38 (Dec. 2013).</li> + <li> + Click the "submit" button.</li> + <li> + Navigate to <code>chr1:1-21,571</code> to see the track.</li> +</ol> +<p> +Custom tracks can also be loaded via one URL line. +<a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1:1-21,571&hgct_customText=track%20type=bigRmsk%20name=Example%20bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb%20visibility=full" +target="_blank">This link</a> loads the same <em>bigRmsk.bb</em> track and sets additional display +parameters in the URL:</p> +<pre><code>http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1:1-21,571&hgct_customText=track%20type=bigRmsk%20name=Example%20bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb%20visibility=full</code></pre> +<p> +After this example bigRmsk is loaded in the Genome Browser, click into an item on the browser's +track display. 
Note that the details page display lacks information about the individual alignments, +as this example does not include the optional supporting alignment file.</p> +<p> +This example can also be loaded in a Track Hub with a stanza such as the following:</p> +<pre> +track ExBigRmsk +shortLabel Example bigRmsk +longLabel This is an example Track Hub Stanza +type bigRmsk +visibility full +bigDataUrl http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb +</pre> +NOTE: FOR WHEN REDOING PAGE, only Track Hubs now allow clicking into hgc. + +<!--- +NOTE: The below is inaccurate and just a holder for when <b>xrefDataUrl works</b> to give an example building it. +<h3 id="example2">Example #2</h3> +<p> +In this example, you will create a bigRmsk file from an existing bigRmsk input file, +<em>bigRmsk.txt</em>, located on the UCSC Genome Browser http server.</p> +<ol> + <li> + Save the bed3+1 example file, <a href="examples/bigRmsk.txt"><em>bigRmsk.txt</em></a>, to your + computer (<em>Step 6</em>, above).</li> + <li> + Save the autoSql file <a href="examples/bigRmsk.as"><em>bigRmsk.as</em></a> to your computer + (<em>Step 3</em>, above).</li> + <li> + Download the + <a href="http://hgdownload.soe.ucsc.edu/admin/exe/"><code>bedToBigBed</code> utility</a> + (<em>Step 4</em>, above).</li> + <li> + Save the <a href="hg38.chrom.sizes"><em>hg38.chrom.sizes</em> text file</a> to your computer. 
+ This file contains the chrom.sizes for the human (hg38) assembly (<em>Step 5</em>, above).</li> + <li> + Run the <code>bedToBigBed</code> utility to create a binary indexed bigRmsk file (<em>Step 6</em>, + above): +<pre><code>bedToBigBed -type=bed3+1 -tab -as=bigRmsk.as bigRmsk.txt hg38.chrom.sizes bigRmsk.bb</code></pre></li> + <li> + Move the newly created bigRmsk file (<em>bigRmsk.bb</em>) to a web-accessible location (<em>Step + 7</em>, above).</li> + <li> + Construct a track line that points to the bigRmsk file (<em>Step 8</em>, above).</li> + <li> + Create the custom track on the human assembly hg38 (Dec. 2013), and view it in the Genome Browser + (<em>step 9</em>, above).</li> +</ol> +--> +<h2 id="share">Sharing your data with others</h2> +<p> +If you would like to share your bigRmsk data track with a colleague, learn how to create a URL by +looking at Example 6 on <a href="customTrack.html#EXAMPLE6">this page</a>.</p> + +<h2 id="extract">Extracting data from the bigRmsk format</h2> +<p> +Because bigRmsk files are an extension of bigBed files, which are indexed binary files, it can +be difficult to extract data from them. UCSC has developed the following programs to assist +in working with bigBed formats, available from the +<a href="http://hgdownload.soe.ucsc.edu/admin/exe/">binary utilities directory</a>.</p> +<ul> + <li> + <code>bigBedToBed</code> &mdash; converts a bigBed file to ASCII BED format.</li> + <li> + <code>bigBedSummary</code> &mdash; extracts summary information from a bigBed file.</li> + <li> + <code>bigBedInfo</code> &mdash; prints out information about a bigBed file.</li> +</ul> +<p> +As with all UCSC Genome Browser programs, simply type the program name (with no parameters) at the +command line to view the usage statement.</p> + +<h2 id="trouble">Troubleshooting</h2> +<p> +If you encounter an error when you run the <code>bedToBigBed</code> program, check your input +file for data coordinates that extend past the end of the chromosome. 
If these are present, run +the <code>bedClip</code> program +(<a href="http://hgdownload.soe.ucsc.edu/admin/exe/">available here</a>) to remove the problematic +row(s) in your input file before running the <code>bedToBigBed</code> program.</p> + +<!--#include virtual="$ROOT/inc/gbPageEnd.html" -->