b7d1897d8117b9a1e39f88710af76fff8172fed6 dschmelt Wed Aug 3 15:34:31 2022 -0700 Proofreading the doc for #29356 diff --git src/hg/htdocs/goldenPath/help/bigRmsk.html src/hg/htdocs/goldenPath/help/bigRmsk.html index 406467b..c644765 100755 --- src/hg/htdocs/goldenPath/help/bigRmsk.html +++ src/hg/htdocs/goldenPath/help/bigRmsk.html @@ -1,94 +1,93 @@ <!DOCTYPE html> <!--#set var="TITLE" value="Genome Browser bigRmsk RepeatMasker Format" --> <!--#set var="ROOT" value="../.." --> <!-- Relative paths to support mirror sites with non-standard GB docs install --> <!--#include virtual="$ROOT/inc/gbPageStart.html" --> <h1>bigRmsk Track Format</h1> -<h3>This page is under development and is not ready for public use.</h3> <p> The bigRmsk format allows for the display of annotations of a genome generated by the <a href="http://www.repeatmasker.org" "target=_blank">RepeatMasker</a> program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of RepeatMasker is a detailed annotation of the repeats that are present in the "query" sequence as well as a modified version of this query sequence in which all the annotated repeats have been masked, where the default replaces the discovered repeats by Ns. The bigRmsk format enables taking the annotation output of RepeatMasker and converting it into a compressed and indexed version of a -<a href="/goldenPath/help/bigBed.html">bigBed</a> file, where the results when -identified as <code>type bigRmsk</code> in a Track Hub can be visualized as described -<a href="#linkToVisualizationSECTION_TOCOME">below</a>.</p> +<a href="/goldenPath/help/bigBed.html">bigBed</a> file, where the results can be +identified as <code>type bigRmsk</code> in a Track Hub and can be visualized as described +below.</p> <p> The bigRmsk files are created using the program <code>bedToBigBed</code>. It must be run with the <code>-as</code> option to pull in a special <a href="http://www.linuxjournal.com/article/5949" target="_blank">autoSql</a> (<em>.as</em>) file, <code>bigRmskBed.as</code> that defines the fields -of bigRmsk. Along side the bigRmsk file, an auxilary data bigBed can be made, with its own .as +of bigRmsk. Along with the bigRmsk file, an auxiliary data bigBed can be made, with its own .as definitions file (<code>bigRmskAlignBed.as</code>) and referenced with a special <code>xrefDataUrl</code> setting, whereas the bigRmsk file location is named with the standard <code>bigDataUrl</code> setting.</p> <p> The bigRmsk files are in an indexed binary format. The main advantage of this format is that only those portions of the file needed to display a particular region are transferred to the Genome Browser server. Because of this, bigRmsk files have considerably faster display performance than if they were stored in a text-based format. The bigRmsk file remains on your local web-accessible server (http, https or ftp), not on the UCSC server, and only the portion needed for the currently displayed chromosomal position is locally cached as a "sparse file". If you do not have access to a web-accessible server and need hosting space for your bigRmsk files, please see the <a href="hgTrackHubHelp.html#Hosting">Hosting</a> section of the Track Hub Help documentation.</p> <h2 id="bigRmsk">bigRmsk file definitions</h2> <p> The following autoSql definition is used to specify the main bigRmsk files. This definition, contained in the file <a href="examples/bigRmsk.as"><em>bigRmsk.as</em></a>, is pulled in when the <code>bedToBigBed</code> utility is run with the <code>-as=bigRmsk.as</code> option. </p> <h6>bigRmsk.as</h6> <pre><code>table bigRmskBed "Repetitive Element Annotation" ( string chrom; "Reference sequence chromosome or scaffold" uint chromStart; "Start position of visualization on chromosome" - uint chromEnd; "End position of visualation on chromosome" + uint chromEnd; "End position of visualization on chromosome" string name; "Name repeat, including the type/subtype suffix" uint score; "Divergence score" char[1] strand; "+ or - for strand" uint thickStart; "Start position of aligned sequence on chromosome" uint thickEnd; "End position of aligned sequence on chromosome" uint reserved; "Reserved" uint blockCount; "Count of sequence blocks" lstring blockSizes; "A comma-separated list of the block sizes(+/-)" lstring blockStarts; "A comma-separated list of the block starts(+/-)" uint id; "A unique identifier for the joined annotations in this record" lstring description; "A comma separated list of technical annotation descriptions" )</code></pre> <p>An example: <code>bedToBigBed -tab -as=bigRmsk.as -type=bed9+5 bigRmsk.txt hg38.chrom.sizes bigRmsk.bb</code>.</p> -<h3 id="supporting">Supporting bigRmskAlign.bb auxilary data</h3> +<h3 id="supporting">Supporting bigRmskAlign.bb auxiliary data</h3> <p> Alongside the bigRmsk file, a supporting bigBed can provide alignment data. The following autoSql definition is used to create this supporting file, pointed to online with <code>xrefDataUrl</code>, rather than the standard <code>bigDataUrl</code> used with bigRmsk. The file <a href="examples/bigRmskAlignBed.as"><em>bigRmskAlignBed.as</em></a>, is pulled in when the <code>bedToBigBed</code> utility is run with the <code>-as=bigRmskAlignBed.as</code> option.</p> <h6>bigRmskAlignedBed.as</h6> <pre><code>table bigRmskAlignBed -"Repetitive Element Alignment Auxilary Data" +"Repetitive Element Alignment Auxiliary Data" ( string chrom; "Reference sequence chromosome or scaffold" uint chromStart; "Start position of alignment on chromosome" uint chromEnd; "End position of alignment on chromosome" uint chromRemain; "Remaining bp in the chromosome or scaffold" float score; "alignment score (sw, bits or evalue)" float percSubst; "Base substitution percentage" float percDel; "Base deletion percentage" float percIns; "Bases insertion percentage" char[1] strand; "Strand - either + or -" string repName; "Name of repeat" string repType; "Type of repeat" string repSubtype; "Subtype of repeat" uint repStart; "Start in repeat sequence" uint repEnd; "End in repeat sequence" @@ -96,47 +95,47 @@ uint id; "The ID of the hit. Used to link related fragments" lstring calignData; "The alignment data stored as a single string" )</code></pre> <p>An example: <code>bedToBigBed -tab -as=bigRmskAlignBed.as -type=bed3+14 bigRmskAlign.tsv.sorted.txt hg38.chrom.sizes bigRmskAlign.bb </code>.CHECK - ISSUE IS xrefDataUrl doesn't work on this data yet.</p> </p> <p> Note that the <code>bedToBigBed</code> utility uses a substantial amount of memory: approximately 25% more RAM than the uncompressed BED input file.</p> <h2 id="steps">Creating a bigRmsk track</h2> <p> To create a bigRmsk track, and its supporting file, follow the below steps. All input files into <code>bedToBigBed</code> must be sorted on the coordinates of the first two columns, -<code>sort -k1,1 -k2,2n input.tsv.txt > input.tsv.sorted.txt</code>. To learn about a perl +<code>sort -k1,1 -k2,2n input.tsv.txt > input.tsv.sorted.txt</code>. To learn about a perl program that can build the tab-separated values (tsv) input bedToBigBed text files from the RepeatMasker output files, contact Robert Hubley: <a href="https://github.com/rmhubley" target="_blank">https://github.com/rmhubley</a>.</p> <p> <strong>Step 1.</strong> If you already have an input file you would like to convert to a bigRmsk, skip to <em>Step 3</em>. Otherwise, download <a href="examples/bigRmsk.txt">this example bigRmsk.txt file</a> for the human GRCh38 (hg38) assembly.</p> <p> <strong>Step 2.</strong> -If you would like to include the optional auxilary alignment data <code>bigRmskAlign.bb</code> file, +If you would like to include the optional auxiliary alignment data <code>bigRmskAlign.bb</code> file, download the bigRmskAlign.txt file.</p> <p> <strong>Step 3.</strong> Download the autoSql file <em><a href="examples/bigRmsk.as">bigRmsk.as</a></em> needed by -<code>bedToBigBed</code>. If you have opted to include the optional auxilary alignment data file, +<code>bedToBigBed</code>. If you have opted to include the optional auxiliary alignment data file, bigRmskAlign.bb, with your bigRmsk file, you must also download the autoSql file <a href="examples/bigRmskAlignBed.as">bigRmskAlignBed.as</a>.</p> <p> Here are wget commands to obtain the above files and the hg38.chrom.sizes file mentioned below: <pre><code>wget https://genome.ucsc.edu/goldenPath/help/examples/ wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.txt wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlign.txt wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.as wget https://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlign.as wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes </code></pre> <p> <strong>Step 4.</strong> Download the <code>bedToBigBed</code> program from the UCSC <a href="http://hgdownload.soe.ucsc.edu/admin/exe/">binary utilities directory</a>.</p> @@ -200,33 +199,33 @@ <pre><code>http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1:1-21,571&hgct_customText=track%20type=bigRmsk%20name=Example%20bigDataUrl=http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb%20visibility=full</code></pre> <p> After this example bigRmsk is loaded in the Genome Browser, click into an item on the browser's track display. Note that the details page display lacks information about the individual alignments, as this example does not include the optional supporting alignment file.</p> <p> This example can also be loaded in a Track Hub with a stanza such as the following:</p> <pre> track ExBigRmsk shortLabel Example bigRmsk longLabel This is an example Track Hub Stanza type bigRmsk visibility full bigDataUrl http://genome.ucsc.edu/goldenPath/help/examples/bigRmsk.bb </pre> -NOTE: FOR WHEN REDOING PAGE, only Track Hubs now allow clicking into hgc. <!--- +NOTE: FOR WHEN REDOING PAGE, only Track Hubs now allow clicking into hgc. NOTE: The below is innaccurate and just a holder for when <b>xrefDataUrl works</b> to give an example building it. Adding potential input file (this is from RobertH T2T hub), both the align.bb and bigRmsk.bb for a region are stashed below (not for hg38 though). $ bigBedToBed -chrom=chr1 -start=4513 -end=7608 https://hgdownload.soe.ucsc.edu/hubs/GCA/009/914/755/GCA_009914755.4/bbi/GCA_009914755.4_T2T-CHM13v2.0.t2tRepeatMasker/chm13v2.0_rmsk.align.bb stdout chr14082453324838279596227.3212.421.00-LTR60BLTRERV126476503TA/GTTACT/CGGGG/AAGG/TGCT/GGA/GT/AG/ATCC+T+CA/GGTTCTT+A+GTT/CTA/TACTTGGA/GAGAAAGAT/ATTT/CC/GA/GCCAAGAGG/ACAG/ATA/TC/TAA/GA/CG/CATG/AG/AC/AAGAT/GAAC/TTT-C-ATTGAAA/GA/GG/AAAAC/TAC/GAGT/AGT/CAA/GAGAGC/TTTATT+TAAAGAGACA+GTA+CACTCT+GAAAA/GATA/GG/AGGA/CG/AGAGT/CGGGCTG+CTGAAAG+AGC/AGTGC/AA/GT/CT/CAA/G+C+AA/GCAGCCT/C+C+A/GAGAGTC/TCTGT/CT/GC/TA/TGGA/GA/GA/TTTTTATT/N+ATG+TG/CGGACTTC/TTTC+TTG+AC/AA/GTTCCT/CGCCTCTGTCTC/TAAG-T-CTCCA/GCCTG/TTTTTCTTTGTCTG/AG+T+TTTTC/TCT+TAA+GC/TT/CA/CCT/CGCCTT-AG-C/NTCCCCGA/CCT+AG+TG/TCCCC/GA/CCT/CT/CAGGCTTGTGGGACC+CT+T/CCCTC/TACTGTG/CG/AGTTGA/GG/TGT/CA/GCATGT/CG+CGGGCC+T/CGGTGA/TTC/GA/GATACGAATC/TCA/TC/AT/CCTG/AG/AC/TA/GC/GCA/NGC+GTTG+CTCC/ATTC/ACCGCCAT/CCCCAGGC/AAC/GGC/TT/CG+T+AC/TAGCGA/GTCAC/AG/ATT/CTGTACC/TTAC/TTGT/CGCCTGC+GTAT+CTCTTT/AT/GGAAT-G-TC/TCTT/CCTC/TTGCCCT chr14533466024838266850520.477.870.00-LTR60BLTRERV11783144514AATCTGTACTTATG/TGG/CGCCA/TG+C+GTT/ATCTCTTAA/GGAATG/TTCC/TCT/CTTTG+CCCTCTT+G/TCCTT/CCTTAC/TCAA/GCATGTAGCTAGCA/TAT/CATTCTGACAT/GT/GTTT/AAT/CTGCAGAGG/TGAA/GT/CGATTG/A+CT+GGGCA/GTCTTC/AAGA/GGGA/CGTTC chr146635139248382189130422.814.201.43+L1MC_orf2LINEL12804329225GGTGA/CG/TGGAAC-GATT-AAT/CTGGAA/CA+T+CCAT/CAA/TGA/CAAT/AG/AAT/AATGC/AATA/CTAGAT/CG/AA/CAA/GACT/CTTACAA/CCT/CC/TA/TCACAAC/AT/AAA/TTC/AACTCAAAAT+GGAT+CATC/A+GA+CT/CTAC/AAC/TT/GA/TAAAAT/CGCT/AAAACTATAC/AAAT/CTT/CCTAGAAGA+T+AACA-A-TAGA/GAGAAAAG/TCTAT/GG/ATGC/ACT/CTTGGGTTTGGT/CA/GATGAA/CTTTTA/TAC/GAA/TAT/CG/AAT/CACA/CAAAGGT/C+A+T/CGAT+CC+ATA/GC/AAC/AA/GAAAG/NAAA/TTGAC/TAT/AT/GG/CTGGT/AT/CTTC+A+TTAAT/AATTT/AAAAG/AT/CTTA/CTA/GCTCTG+CG+G/AAAGACAC-CT-TGTT/CAAGAGAAC/TA/GAAAAGACAAGCCACAT/GAT/CTGA/G+G+AGAAAATATTTGCAAAAT/GACAC/TATCTGAG/TAAAGA/GAT/CTT/GG/TTC/ATT/CCAAAATATAT/CAAAA/GAAA/CTA/CTTAAAACTA/CAACAATAAGT/A+AAA+T/CAAACAG/ACCCA/GAC/TT+N+AAAAATGC/GA/GCAC/AAC/A+G+AT/CCTGAACAGACACCTCACCAAAGAAGATC/ATACAGATGGCAAG/ATAAA/GCATAC/TA/GAAAAGATGCTCA/NACAT chr14997526324838206588715.4612.782.74+L1MC3_3endLINEL129322395TTGTC/ATT/CCAAAATATAT/CAAAA/GAAA/CTA/CTTAAAACTA/CAACAATAAGT/A+AAA+T/CAAACAG/ACCCAAC/TTAAAAA+A+TGC/GA/GCAC/AAC/A+G+ATCTGAACAGACACCTCACCAAAGAAGATC/ATACAGATGGCAAG/ATAAA/GCATAC/TA/GAAAAGATGCTCAACAT+CATTTGTC+AC/TTAGA/GGAAC/TTG+CAAATT+AAAACC/AACAATGAGATAG/CCAC-AGCTGG-TC/AT/CAT/CAT/CCTC/ATTAGAAC/TT/GGCTAAAC/ATCC-CT-AAAAAA+C+TGACA+ATACC+AAT/NTGCTG+GCGAGGAT+GA/CGGAA/GA/CAACAA/GGAACTCTT/C+A+TTCATTGCC/TGGTGGA/GA chr152745528248381800140314.681.970.78+MER34C_vLTRERV12633226AGA/NCCAA/GAATATGCCACCCCAAAATATA/GAT/CG/TGTAGGAA/GACCAGAATATGCCACCCCAAAATATGT/CCC/TCTTTGT/GCT/ATAAGA/GATTATTC/TC/TA/GAGCTGATTATTTTGAA/GAAAA/CTA/GA/CAT/GG/AC-TA-ACAA/GA/GG/AGAAGT/CTCTGAAAACAGAGTAGAAGTTACCCTTG/TTGTAAGGA/GAAATTTACATCTATAAAGGAAATCC/TCCATTTA/G+T+AAA/GGC/GTA/GC/TCT+CC+CTCTCTA/GC/TACCAA/GGAAGAGAAGGATA/GA+CT+CTAAATCACTAA/GAGAG/CTCTT chr155285686248381642354424.566.810.65+L1MC3_3endLINEL181296815645TAATA/GGTGG-G-ATAT/CC/ATGACACA/TAC/TGCATTTA/GTCAAG/AAT/CA/CCAC/TAGAAT/CTTTAT/CG/AGC+A+CAAA-T-GG/AGTA/GAAT/CCA/TA/TAT/ATC/GTATT/GCAAATTA/TA/TAC/AAAAATT/CAC/TTC/TAGGAT/GGT+C+GGC/GGT/GATCCCAGGAC/TA/GGAATGCAT/GC/AA/NTGTGA+C+AAAAG/NAATT/CTA-T-G/ACTA/GC/TAA/T-A-T chr156866131248381197244214.805.841.29+MSTA1LTRERVL-MaLR46507TA/GCTATGGTTTGGATGT-GGT-TTGTCCCCGCA/CAAAACTCATGTTGAAATTTGAC/TCCCCAATGTGGCAGTGTG/TGGG/A-C-GGTGGGGCCTAGTGGA/GT/AGGTGTTTGGGTCATGGGGA/GT/CGGATCCCTCATGAATAGATTAATGT/CCCTCC+CTCGNG+A/GTGGGG/NGTGAGTGAGTA/TCT+C+GCTCT+NN+CA/GT/CA/GGGAATGGATTAA/GTTCCT/CGCA/GG/AGAGT/CA/GGGTA/TA/GTTAAAAAGAGTCTGGC+GNC+TT/CCCTT/CG/CG/TCT+CTC+TCC/TCTT+GC+TTGCTTT/CCA/TCTT/CTT/CGCT/CATGTGATCTCTG-G-T/CG/ACACC/GCCT/C-T-GCTCCCCTTCC+NCTTC+GCTTTCCA/GCCATGAGG/TT/NGAAA/GA/CAGA/CCTGAA/GGCCC+T+CACCAGATGCAA/GCTGCCCA/GA/NT/ACT/CC/NG/TGA/CC/TA/TTTC+GNC+CAGCT/CACCAGT/AATT/CGTGAGCCAAATG/AAAT/CCTT/CTTTTA/CC/TTTATAAATTACCCAGCCTCAGGTATTCT/CGTTAC/TAGA/CAG/ACACAAG/AAT/CGGACTAAGACA chr161317132248380196354424.566.810.65+L1MC3_3endLINEL196920414915CAAATGTAG/TGT/AAAA/CAAC+C+TCACTGAAGGT/GGG+TG+A/GGGGAAAAT/AGGTGT/CTGACCTAAGTC/AACTTTGA/GAAATGAA/GTA/GGAA/GTCTG+T+G/AAGG/ACTG/AAAGGCAC/AA+A+T/GGAACTA/GTACT/ATC/AAT/GA/CAT/CTGG/TAT/CTA/CC/TAT/GTTT/GATAAAGTTA/GTTTCCA/CACA/GGA/GA/GGC/TAA/CC/GT/GGTG/TAACAATTG/CTA/GAA/NACCA/GCA/TG/ATG/AT/CC/ATGTAT+A+CTGGAG/ATA/TA/GAACAATG/TAC/AT/GTAC/AATA/GA/GG/ATC/GGCA/GGATGGTGGGAA/GCCAGC/GTTTCTCACTGTTGA/GAGTGGGAGG/NTTACAA/GATT/AAGCAAGA/GC/GGAGA/GAGGCTAGAATGATT/CCC/ATGTGA/GTAG/ATA/GGATC/TAGAGG/TTGGAGACATCAA/GC/TG/ATA/GAACTT/CATGC/TTTAGT/CTTAATATAGATACAC/GAC/TA/GGTTC/AT/CAC/TATAGAAAA/TC/ATTTATAA/GT/ATAG/TGTGTG/ATG/ATAG/CG/AT/CA/GGGTTAG+T+AC/TACACACATATAC/TTTCCTA/TGCA/TT/CTGC/TT/CAA/GT/C+TGA+GAGGGA/CCA/TAGAT/AA/GCAAT/C+GACACCCCAGTAGCAACGA+GT/CG/ACAT/CT/CC/TAGCA/GG/CCCAC/GATG/CTA/TA/GGTTT+C+TC/AC/AC/TACCATTC+TCCAA+TG/AAAAGGAAT/CCA+G+GGCTCT/CTTGA/GAGAAATGT/GCTGATA/TCTAGA/GACTGGGA/GCAGT/GAAATAT+ACAAG+AG/TGAGCCA/TGGAT/GA/CATCTG/TGA/TAGTA/GT/CCAGAAAGT+AAGG+AAGTA/GCT+C+AAAAAAAT/CT/CA/CAA/CA+ATGATGGGGG+TATA/GTCAAAC/GA/GA/GAA/CAT/CAA/GA/GAGCCAAT/CA/TA/GAAAC/GAGCTA/CCCG/AATGGCCAAC/AA/GCA/TGGAAG/CG/AAA/TTTGT/AGCAACAT/AAAT+A+G/AC/ATA+AAG+TAGTG/ATC/TGA/GATA/TATAA+C+CT/CAAAGC/TT/ATAAAG/ATAAT/ATATCT/CAG/TGT/AGTCT/CG/ATAT/CTT/GG/ATATAC/AC/ATAG/AG/ATG+ATTGAATA+AATAAG/AC/TAAATGGA/GGT/GT/AGC/AAT/GAG+AC+AAATCTCCT/CT/GTGCAA/GAAGAATTCCAAATAAC/TTG/TATGTAGAC/TACTCA/CGCCA/CTCAAGA/GAGGTGGAGC+A+C/TAACTCCT/CCACTCCG/TTAAGTGTGGGCTC/GT/CGCATAGTGACTTG/CCTC/TCA/CAAAGA-ACAC-A/GTG/ACAGTATGGAC/AAA/GGGA/GGGAAAAA+AGAG+TAACTTC/TACAGTGGAGAAAT/CCTGACAAACAG/CTAG/CCTCT/AGCCAA/GA/GTGATCC/AAA/GGTG/CAACAC/TCAAA/CG/AC/GTGAC/TAG/AT/GTCAC/TC/GTTGAG/TAA/GC/TATG chr171417533248379795121520.504.597.89+L1MC3_3endLINEL12150252935TGG/AGGGACATTCTACAAAAA/TT/ACCTGACCAA/GTC/ACTCCTCAG/AT/AG/ACTA/GTG/CAAGGTCATCAT/AG/AAG/A+C+AT/AGGAAAGC/TCTA/GAC/GAC/AACTGTCACAGCCAG/AGAA/GGAGCCTAT/AG+GAGACA+TGAT/CGT/ACTAC/AATGTC/AG/ATGC/TGGG/TATCCTGGATGGGATCCTGGG/AT/ACAGAG/AT/AAAGA/G+ACAT+TAG+GTAAA+AACTAAGGG/AAATCC/TA/GAATG/AAAA/GTATGA/GACTTTAGTTAATAAC/TAG/ATC/GTATCAG/ATATTGGTTCATTAAC/TTGTGG/ACAA-ATT-ATGTA-AGATATTAATAAG-CCAT-GTGAGACAC-ACTG/AATA/GG/TAAGATGTTAATAAG/TAGA/GGGAAACTA/GGGT+G+TG-C-GGC/GTAC/TATGGGAAA/CTCTCTG-CTTT-TT/AT/CTT/ATT/CTTG/CA/GCG/AATTTC/TTG/CTGTAAG/ATA/CA/TAAAAA/CA/TG/AA/TC/TG/CTAAAATAAAAC/A+G+TTTATTTA/TA/NAA