src/hg/htdocs/FAQ/FAQformat.html cb55bf7cf5fe6f9561a8d026075252ee380b008e

cb55bf7cf5fe6f9561a8d026075252ee380b008e
brianlee
  Sat Jun 11 07:34:50 2022 -0700
Adding to FAQ/FAQformat.html 2bit entry link to twoBit.html with extraction example of sequence  ref #29548

diff --git src/hg/htdocs/FAQ/FAQformat.html src/hg/htdocs/FAQ/FAQformat.html
index 6eff1c2..a67b3b3 100755
--- src/hg/htdocs/FAQ/FAQformat.html
+++ src/hg/htdocs/FAQ/FAQformat.html
@@ -1,1397 +1,1401 @@
 <!DOCTYPE html>
 <!--#set var="TITLE" value="Genome Browser FAQ" -->
 <!--#set var="ROOT" value=".." -->
 
 <!-- Relative paths to support mirror sites with non-standard GB docs install -->
 <!--#include virtual="$ROOT/inc/gbPageStart.html" -->
 
 <!--#include virtual="$ROOT/redmineWidget.html" -->
 
 <h1>Frequently Asked Questions: Data File Formats</h1>
 
 <h2>Topics</h2>
 <h6>General formats</h6>
 <div class="row">
   <!-- Left column -->
   <div class="col-md-6">
     <ul>
       <li><a href="../goldenPath/help/axt.html">Axt format</a></li>
       <li><a href="/goldenPath/help/bam.html">BAM format</a></li>
       <li><a href="#format1">BED format</a></li>
       <li><a href="#format1.7">BED detail format</a></li>
       <li><a href="/goldenPath/help/bedgraph.html">bedGraph format</a></li>
       <li><a href="/goldenPath/help/barChart.html">barChart and bigBarChart format</a></li>
       <li><a href="/goldenPath/help/bigBed.html">bigBed format</a></li>
       <li><a href="/goldenPath/help/bigGenePred.html">bigGenePred table format</a></li>
       <li><a href="/goldenPath/help/bigPsl.html">bigPsl table format</a></li>
       <li><a href="/goldenPath/help/bigMaf.html">bigMaf table format</a></li>
       <li><a href="/goldenPath/help/bigChain.html">bigChain table format</a></li>
       <li><a href="/goldenPath/help/bigNarrowPeak.html">bigNarrowPeak table format</a></li>
       <li><a href="/goldenPath/help/bigLolly.html">bigLolly table format</a></li>
       <li><a href="/goldenPath/help/bigWig.html">bigWig format</a></li>
       <li><a href="../goldenPath/help/chain.html">Chain format</a></li>
     </ul>
   </div>
   <!-- Right column -->
   <div class="col-md-6">
     <ul>
       <li><a href="/goldenPath/help/cram.html">CRAM format</a></li>
       <li><a href="#format9">GenePred table format</a></li>
       <li><a href="#format3">GFF format</a></li>
       <li><a href="#format4">GTF format</a></li>
       <li><a href="#format20">HAL format</a></li>
       <li><a href="/goldenPath/help/hic.html">Hic format</a></li>
       <li><a href="/goldenPath/help/interact.html">Interact and bigInteract format</a></li>
       <li><a href="#format24">Longrange longTabix format</a></li>
       <li><a href="#format5">MAF format</a></li>
       <li><a href="#format6.5">Microarray format</a></li>
       <li><a href="../goldenPath/help/net.html">Net format</a></li>
       <li><a href="#format10">Personal Genome SNP format</a></li>
       <li><a href="#format2">PSL format</a></li>
       <li><a href="/goldenPath/help/vcf.html">VCF format</a></li>
       <li><a href="/goldenPath/help/wiggle.html">WIG format</a></li>
     </ul>
   </div>
 </div>
 <a name="ENCODE"></a>
 <h6>ENCODE-specific formats</h6>
 <ul>
   <li><a href="#format13">ENCODE broadPeak format</a></li>
   <li><a href="#format14">ENCODE gappedPeak format</a></li>
   <li><a href="#format12">ENCODE narrowPeak format</a></li>
   <li><a href="#format16">ENCODE pairedTagAlign format</a></li>
   <li><a href="#format17">ENCODE peptideMapping format</a></li>
   <li><a href="#format11">ENCODE RNA elements format</a></li>
   <li><a href="#format15">ENCODE tagAlign format</a></li>
 </ul>
 <h6>Download-only formats</h6>
 <ul>
   <li><a href="#format7">.2bit format</a></li>
   <li><a href="http://genetics.bwh.harvard.edu/pph/FASTA.html" target="_blank">.fasta format</a></li>
   <li><a href="http://maq.sourceforge.net/fastq.shtml" target="_blank">.fastQ format</a></li>
   <li><a href="#format8">.nib format</a></li>
 </ul>
 <hr>
 <p> 
 <a href="index.html">Return to FAQ Table of Contents</a></p>
 
 <a name="format1"></a>
 <h2>BED format</h2>
 <p>
 BED (Browser Extensible Data) format provides a flexible way to define the data lines that are 
 displayed in an annotation track. BED lines have three required fields and nine additional optional 
 fields. The number of fields per line must be consistent throughout any single set of data in an 
 annotation track.  The order of the optional fields is binding: lower-numbered fields must always
 be populated if higher-numbered fields are used.</p>
 <p>
 Lines with different numbers of fields should not be mixed within one data set (for example, BED3 
 lines should not be mixed with BED4 lines). Instead, the additional columns must be filled in on 
 every line for consistency; in some circumstances a &quot;.&quot; can be used when the field 
 content is to be empty. BED fields in custom tracks can be 
 whitespace-delimited or tab-delimited. Only some variations of BED types, such as 
 <a href="../FAQ/FAQformat.html#format1.7">bedDetail</a>, require tab delimitation for 
 the detail columns.</p>
 <p>
 Please note that only in custom tracks can the first lines of the file consist of header lines, 
 which begin with the word &quot;browser&quot; or &quot;track&quot; to assist the browser in the 
 display and interpretation of the lines of BED data following the headers. Such annotation track 
 header lines are not permissible in downstream utilities such as bedToBigBed, which convert lines of
 BED text to indexed binary files. </p>
 <P>
 If your data set is BED-like, but it is very large (over 50MB) and you would like to keep it on your
 own server, you should use the <a href="../goldenPath/help/bigBed.html">bigBed</a> data format.</p>
 <p>
 The first three required BED fields are: </p>
 <ol>
   <li> 
   <strong>chrom</strong> - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold 
   (e.g.  scaffold10671).</li>
   <li>
   <strong>chromStart</strong> - The starting position of the feature in the chromosome or scaffold. 
   The first base in a chromosome is numbered 0.</li>
   <li>
   <strong>chromEnd</strong> - The ending position of the feature in the chromosome or scaffold. The 
   <em>chromEnd</em> base is not included in the display of the feature; however, the 
   <em>chromEnd</em> value does appear as the end coordinate in 
   <a href="FAQtracks#tracks1">position format</a>. For example,
   the first 100 bases of chromosome 1 are defined as <em>chrom=chr1, chromStart=0, chromEnd=100</em>.
   They span the bases numbered 0-99 in our software (not 0-100), and correspond to the
   position notation chr1:1-100. Read more
   <a href="http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/">here</a>. 
   <br><em>chromStart</em> and <em>chromEnd</em> can be identical, creating a feature of length 0, commonly
   used for insertions. For example, use <em>chromStart=0, chromEnd=0</em> to represent an insertion before the
   first nucleotide of a chromosome.</li>
 </ol>
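 <p>
 The relationship between BED coordinates and position notation can be sketched in a few lines of 
 Python. This is only an illustration of the arithmetic described above; the function names 
 <em>bed_to_position</em> and <em>bed_length</em> are hypothetical and are not part of any Genome 
 Browser utility.</p>

```python
def bed_to_position(chrom, chrom_start, chrom_end):
    """Convert 0-based, half-open BED coordinates to 1-based position notation."""
    if chrom_end < chrom_start:
        raise ValueError("chromEnd must not be smaller than chromStart")
    # Only the start shifts by one: the excluded chromEnd value already equals
    # the 1-based number of the last base that is included.
    return f"{chrom}:{chrom_start + 1}-{chrom_end}"


def bed_length(chrom_start, chrom_end):
    """Number of bases covered by a BED feature (0 for an insertion point)."""
    return chrom_end - chrom_start


print(bed_to_position("chr1", 0, 100))  # first 100 bases of chr1
print(bed_length(0, 100))
```

 <p>
 Because the end is excluded, the feature length is simply <em>chromEnd - chromStart</em>, with no 
 off-by-one correction.</p>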
 <p>The 9 additional optional BED fields are:</p>
 <ol start=4>
   <li>
   <strong>name</strong> - Defines the name of the BED line. This label is displayed to the left of 
   the BED line in the Genome Browser window when the track is open to full display mode or directly 
   to the left of the item in pack mode.</li>
   <li>
   <strong>score</strong> - A score between 0 and 1000. If the track line <em>useScore</em> attribute
   is set to 1 for this annotation data set, the <em>score</em> value will determine the level of 
   gray in which this feature is displayed (higher numbers = darker gray). This table shows the 
   Genome Browser's translation of BED score values into shades of gray:
   <table>
     <tr>
       <td>shade</td>
       <td style="background-color: #e2e2e2">&nbsp;</td>
       <td style="background-color: #c6c6c6">&nbsp;</td>
       <td style="background-color: #aaaaaa">&nbsp;</td>
       <td style="background-color: #8d8d8d">&nbsp;</td>
       <td style="background-color: #717171">&nbsp;</td>
       <td style="background-color: #555555">&nbsp;</td>
       <td style="background-color: #383838">&nbsp;</td>
       <td style="background-color: #1c1c1c">&nbsp;</td>
       <td style="background-color: #000000">&nbsp;</td>
     </tr>
     <tr>
       <td>score in range&nbsp;&nbsp;</td>
       <td>&le; 166</td>
       <td>167-277</td>
       <td>278-388</td>
       <td>389-499</td>
       <td>500-611</td>
       <td>612-722</td>
       <td>723-833</td>
       <td>834-944</td>
       <td>&ge; 945</td>
     </tr>
   </table></li>
   <li>
   <strong>strand</strong> - Defines the strand. Either "." (=no strand) or "+" or "-".</li>
   <li>
   <strong>thickStart</strong> - The starting position at which the feature is drawn thickly (for 
   example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd 
   are usually set to the chromStart position.</li>
   <li>
   <strong>thickEnd</strong> - The ending position at which the feature is drawn thickly (for example
   the stop codon in gene displays).</li> 
   <li>
   <strong>itemRgb</strong> - An RGB value of the form R,G,B (e.g. 255,0,0).  If the track line 
   <em>itemRgb</em> attribute is set to &quot;On&quot;, this RGB value will determine the display 
   color of the data contained in this BED line. NOTE: It is recommended that a simple color 
   scheme (eight colors or fewer) be used with this attribute to avoid overwhelming the color 
   resources of the Genome Browser and your Internet browser.</li>
   <li>
   <strong>blockCount</strong> - The number of blocks (exons) in the BED line.</li>
   <li>
   <strong>blockSizes</strong> - A comma-separated list of the block sizes. The number of items in 
   this list should correspond to <em>blockCount</em>.</li>
   <li>
   <strong>blockStarts</strong> - A comma-separated list of block starts. All of the 
   <em>blockStart</em> positions should be calculated relative to <em>chromStart</em>. The number of 
   items in this list should correspond to <em>blockCount</em>.</li>
 </ol>
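 <p>
 The score-to-gray table in the <em>score</em> field description can be expressed as a small lookup. 
 The sketch below copies the range boundaries and shade values from that table; the function name 
 <em>score_to_shade</em> is hypothetical, and this is not the Genome Browser's own code.</p>

```python
import bisect

# Inclusive upper bound of each score range except the last, from the table.
UPPER_BOUNDS = [166, 277, 388, 499, 611, 722, 833, 944]
# One shade per range; the final entry covers scores of 945 and above.
SHADES = ["#e2e2e2", "#c6c6c6", "#aaaaaa", "#8d8d8d", "#717171",
          "#555555", "#383838", "#1c1c1c", "#000000"]


def score_to_shade(score):
    """Return the gray shade listed in the table for a BED score (0-1000)."""
    if not 0 <= score <= 1000:
        raise ValueError("BED score must be between 0 and 1000")
    # bisect_left finds the first range whose upper bound is >= score.
    return SHADES[bisect.bisect_left(UPPER_BOUNDS, score)]


print(score_to_shade(960))
```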
 <p>
 In BED files with block definitions, the first <i>blockStart</i> value must be 0, so that the first 
 block begins at <em>chromStart</em>. Similarly, the final <em>blockStart</em> position plus the 
 final <em>blockSize</em> value must equal <em>chromEnd</em>. Blocks may not overlap.</p>
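 <p>
 These block rules are easy to check mechanically. The following sketch is one possible check, 
 not the validation performed by the Genome Browser or bedToBigBed. The function name 
 <em>check_bed12_blocks</em> is hypothetical, and the code assumes at least one block and that 
 blocks are listed in ascending order. The sample values are taken from the cloneA line of the 
 pairedReads example in this section.</p>

```python
def check_bed12_blocks(chrom_start, chrom_end, block_count, block_sizes, block_starts):
    """Return a list of problems found in the block fields (empty if none)."""
    # Lists may carry a trailing comma, e.g. "567,488,", so skip empty items.
    sizes = [int(x) for x in block_sizes.split(",") if x]
    starts = [int(x) for x in block_starts.split(",") if x]
    if block_count < 1:
        return ["blockCount must be at least 1 (assumption of this sketch)"]
    if len(sizes) != block_count or len(starts) != block_count:
        return ["blockSizes/blockStarts lengths do not match blockCount"]

    problems = []
    if starts[0] != 0:
        problems.append("first blockStart must be 0")
    # blockStarts are relative to chromStart.
    if chrom_start + starts[-1] + sizes[-1] != chrom_end:
        problems.append("last block must end at chromEnd")
    for i in range(1, block_count):
        if starts[i] < starts[i - 1] + sizes[i - 1]:
            problems.append(f"block {i} overlaps block {i - 1}")
    return problems


print(check_bed12_blocks(1000, 5000, 2, "567,488,", "0,3512"))
```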
 <p>
 <strong><em>Example:</em></strong><br>
 Here's an example of an annotation track, introduced by a 
 <a href="FAQcustom.html#custom11" target="_blank">header line</a>, that is followed by a complete 
 BED definition:</p>
 <pre><code>track name=pairedReads description=&quot;Clone Paired Reads&quot; useScore=1
 chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
 chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601</code></pre>
 <p>
 <strong><em>Example:</em></strong><br>
 This example shows an annotation track that uses the itemRgb attribute to individually color each 
 data line. In this track, the color scheme distinguishes between items named &quot;Pos*&quot; and 
 those named &quot;Neg*&quot;. See the usage note in the <em>itemRgb</em> description above for color
 palette restrictions. NOTE: The <a href="FAQcustom.html#custom11" target="_blank">track and data 
 lines</a> in this example have been reformatted for documentation purposes. This 
 <a href="../goldenPath/help/ItemRGBDemo.txt" target="_blank">example</a> can be pasted into the 
 browser without editing.</p>
 <pre><code>browser position chr7:127471196-127495720
 browser hide all
 track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"
 chr7    127471196  127472363  Pos1  0  +  127471196  127472363  255,0,0
 chr7    127472363  127473530  Pos2  0  +  127472363  127473530  255,0,0
 chr7    127473530  127474697  Pos3  0  +  127473530  127474697  255,0,0
 chr7    127474697  127475864  Pos4  0  +  127474697  127475864  255,0,0
 chr7    127475864  127477031  Neg1  0  -  127475864  127477031  0,0,255
 chr7    127477031  127478198  Neg2  0  -  127477031  127478198  0,0,255
 chr7    127478198  127479365  Neg3  0  -  127478198  127479365  0,0,255
 chr7    127479365  127480532  Pos5  0  +  127479365  127480532  255,0,0
 chr7    127480532  127481699  Neg4  0  -  127480532  127481699  0,0,255
 </code></pre>
 <p>
 Click <a class="insideLink" 
 href="../cgi-bin/hgTracks?org=human&amp;position=chr7&amp;hgt.customText=http://genome.ucsc.edu/goldenPath/help/ItemRGBDemo.txt"
 target="_blank">here</a> to display this track in the Genome Browser.</p>
 <p>
 <strong><em>Example:</em></strong><br>
 It is also possible to color items by strand in a BED track using the <em>colorByStrand</em> 
 attribute in the <a href="FAQcustom.html#custom11" target="_blank">track line</a> as shown below. 
 For BED tracks, this attribute functions only for custom tracks with 6 to 8 fields (i.e. BED6 
 through BED8). NOTE: The track and data lines in this example have been reformatted for 
 documentation purposes. This <a href="../goldenPath/help/ColorByStrandDemo.txt" 
 target="_blank">example</a> can be pasted into the browser without editing.</p>
 <pre><code>browser position chr7:127471196-127495720
 browser hide all
 track name="ColorByStrandDemo" description="Color by strand demonstration" visibility=2 colorByStrand="255,0,0 0,0,255"
 chr7    127471196  127472363  Pos1  0  +
 chr7    127472363  127473530  Pos2  0  +
 chr7    127473530  127474697  Pos3  0  +
 chr7    127474697  127475864  Pos4  0  +
 chr7    127475864  127477031  Neg1  0  -
 chr7    127477031  127478198  Neg2  0  -
 chr7    127478198  127479365  Neg3  0  -
 chr7    127479365  127480532  Pos5  0  +
 chr7    127480532  127481699  Neg4  0  -
 </code></pre>
 <p>
 Click <a class="insideLink" href="../cgi-bin/hgTracks?org=human&amp;position=chr7&amp;hgt.customText=http://genome.ucsc.edu/goldenPath/help/ColorByStrandDemo.txt"
 target="_blank">here</a> to display this track in the Genome Browser.</p>
 
 <a name="format1.7"></a>
 <h2>BED detail format</h2>
 <p>
 This is an extension of BED format. BED detail uses the first 4 to 12 columns of BED format, plus 
 2 additional fields that are used to enhance the track details pages. The first additional field 
 is an ID, which can be used in place of the name field for creating links from the details pages. 
 The second additional field is a description of the item, which can be a long description and can 
 consist of html, including tables and lists. </p>
 <p>
 <strong>Requirements</strong> for BED detail custom tracks are: fields must be tab-separated, 
 &quot;type=bedDetail&quot; must be included in the 
 <a href="../goldenPath/help/customTrack.html#TRACK" target="_blank">track line</a>, and the name and
 position fields should uniquely describe items so that the correct ID and description will be 
 displayed on the details pages.</p>
 <p>
 <strong><em>Example:</em></strong><br>
 This example uses the first 4 columns of BED format, but up to 12 may be used. Click 
 <a class="insideLink" href="../cgi-bin/hgTracks?db=hg19&amp;hgt.customText=http://genome.ucsc.edu/goldenPath/help/examples/bedDetailExample.txt" 
 target="_blank">here</a> to view this track in the Genome Browser.</p>
 <pre><code>track name=HbVar type=bedDetail description="HbVar custom track" db=hg19 visibility=3 url="http://globin.bx.psu.edu/cgi-bin/hbvar/query_vars3?display_format=page&amp;mode=output&amp;id=$$"
 chr11&#09;5246919&#09;5246920&#09;Hb_North_York&#09;2619&#09;Hemoglobin variant
 chr11&#09;5255660&#09;5255661&#09;HBD c.1 G>A&#09;2659&#09;delta0 thalassemia
 chr11&#09;5247945&#09;5247946&#09;Hb Sheffield&#09;2672&#09;Hemoglobin variant
 chr11&#09;5255415&#09;5255416&#09;Hb A2-Lyon&#09;2676&#09;Hemoglobin variant
 chr11&#09;5248234&#09;5248235&#09;Hb Aix-les-Bains&#09;2677&#09;Hemoglobin variant </code></pre>
 <p>
 To see an example of turning a bedDetail custom track into the <code>bigBed</code>
 format, see this <a href="https://genome-blog.soe.ucsc.edu/blog/2021/08/03/how-make-a-bigbed-file-part-1/"
 target="_blank">How to make a bigBed file</a> blog post.</p>
 
 <a name="format2"></a>
 <h2>PSL format</h2>
 <p>
 PSL lines represent alignments, and are typically taken from files generated by BLAT or psLayout. 
 See the <a href="../goldenPath/help/hgTracksHelp.html#BLATAlign" target="_blank">BLAT
 documentation</a> for more details. All of the following fields are required on each data line 
 within a PSL file:</p>
 <ol>
   <li>
   <strong>matches</strong> - Number of bases that match that aren't repeats</li>
   <li>
   <strong>misMatches</strong> - Number of bases that don't match</li>
   <li>
   <strong>repMatches</strong> - Number of bases that match but are part of repeats</li>
   <li>
   <strong>nCount</strong> - Number of &quot;N&quot; bases</li>
   <li>
   <strong>qNumInsert</strong> - Number of inserts in query</li>
   <li>
   <strong>qBaseInsert</strong> - Number of bases inserted in query</li>
   <li>
   <strong>tNumInsert</strong> - Number of inserts in target</li>
   <li>
   <strong>tBaseInsert</strong> - Number of bases inserted in target</li>
   <li>
   <strong>strand</strong> - &quot;+&quot; or &quot;-&quot; for query strand. For translated 
   alignments, second &quot;+&quot; or &quot;-&quot; is for target genomic strand.</li>
   <li>
   <strong>qName</strong> - Query sequence name</li>
   <li>
   <strong>qSize</strong> - Query sequence size.</li>
   <li>
   <strong>qStart</strong> - Alignment start position in query</li>
   <li>
   <strong>qEnd</strong> - Alignment end position in query</li>
   <li>
   <strong>tName</strong> - Target sequence name</li>
   <li>
   <strong>tSize</strong> - Target sequence size</li>
   <li>
   <strong>tStart</strong> - Alignment start position in target</li>
   <li>
   <strong>tEnd</strong> - Alignment end position in target</li>
   <li>
   <strong>blockCount</strong> - Number of blocks in the alignment (a block contains no gaps)</li>
   <li>
   <strong>blockSizes</strong> - Comma-separated list of sizes of each block. If the query is a protein and the target the genome, blockSizes are in amino acids. See below for more information on protein query PSLs.</li>
   <li>
   <strong>qStarts</strong> - Comma-separated list of starting positions of each block in query</li>
   <li>
   <strong>tStarts</strong> - Comma-separated list of starting positions of each block in target</li>
 </ol>
 <p>
 <strong><em>Example:</em></strong><br>
 Here is an example of an annotation track in PSL format.</p>
 <pre><code>browser position chr22:13073000-13074000
 browser hide all
 track name=fishBlats description=&quot;Fish BLAT&quot; visibility=2 useScore=1
 59 9 0 0 1 823 1 96 +- FS_CONTIG_48080_1 1955 171 1062 chr22 47748585 13073589 13073753 2 48,20,  171,1042,  34674832,34674976,
 59 7 0 0 1 55 1 55 +- FS_CONTIG_26780_1 2825 2456 2577 chr22 47748585 13073626 13073747 2 21,45,  2456,2532,  34674838,34674914,
 59 7 0 0 1 55 1 55 -+ FS_CONTIG_26780_1 2825 2455 2676 chr22 47748585 13073727 13073848 2 45,21,  249,349,  13073727,13073827, </code></pre>
 <p>
 Click <a class="insideLink" href="../cgi-bin/hgTracks?org=human&amp;position=chr7&amp;hgt.customText=http://genome.ucsc.edu/goldenPath/help/fishBlats.txt"
 target="_blank">here</a> to display this track in the Genome Browser.</p>
 <p>
 Be aware that the coordinates for a negative strand in a DNA query PSL line are handled in a special way. In 
 the <em>qStart</em> and <em>qEnd</em> fields, the coordinates indicate the position where the 
 query matches from the point of view of the forward strand, even when the match is on the reverse 
 strand. However, in the <em>qStarts</em> list, the coordinates are reversed.</p>
 <p>
 <strong><em>Example:</em></strong><br>
 Here is a 61-mer containing 2 blocks that align on the minus strand and 2 blocks that align on the 
 plus strand (this sometimes happens due to assembly errors): 
 <pre><code>0         1         2         3         4         5         6 tens position in query  
 0123456789012345678901234567890123456789012345678901234567890 ones position in query   
                       ++++++++++++++                    +++++ plus strand alignment on query   
     ------------------              --------------------      minus strand alignment on query   
 0987654321098765432109876543210987654321098765432109876543210 ones position in query negative strand coordinates
 6         5         4         3         2         1         0 tens position in query negative strand coordinates
 
 Plus strand:   
      qStart=22
      qEnd=61 
      blockSizes=14,5 
      qStarts=22,56 
                   
 Minus strand:   
      qStart=4 
      qEnd=56 
      blockSizes=20,18 
      qStarts=5,39 </code></pre>
 <p>
 Essentially, the minus strand <em>blockSizes</em> and <em>qStarts</em> are what you would get if 
 you reverse-complemented the query. However, the <em>qStart</em> and <em>qEnd</em> are not reversed.
 Use the following formulas to convert one to the other:</p>
 <pre><code>Negative-strand-coordinate-qStart = qSize - qEnd   = 61 - 56 =  5
 Negative-strand-coordinate-qEnd   = qSize - qStart = 61 -  4 = 57
 </code></pre>
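 <p>
 The same conversion can be applied to every block in <em>qStarts</em>. The sketch below uses the 
 minus-strand values from the 61-mer example above; the function names <em>flip_interval</em> and 
 <em>blocks_to_forward</em> are hypothetical helpers, not part of BLAT or the kent utilities.</p>

```python
def flip_interval(q_size, start, end):
    """Map a 0-based, half-open interval onto the opposite strand."""
    return q_size - end, q_size - start


def blocks_to_forward(q_size, block_sizes, q_starts):
    """Convert minus-strand qStarts blocks to forward-strand (start, end) pairs."""
    intervals = [
        flip_interval(q_size, start, start + size)
        for start, size in zip(q_starts, block_sizes)
    ]
    # Flipping reverses the block order, so sort back into ascending order.
    return sorted(intervals)


# Minus-strand alignment from the example: qSize=61, qStart=4, qEnd=56.
print(flip_interval(61, 4, 56))
print(blocks_to_forward(61, [20, 18], [5, 39]))
```

 <p>
 The resulting forward-strand intervals, 4-22 and 36-56 (end excluded), line up with the two runs of 
 minus signs in the diagram.</p>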
 <p>
 BLAT this actual sequence against hg19 for a real-world example:</p>
 <pre><code>CCCC
 GGGTAAAATGAGTTTTTT
 GGTCCAATCTTTTA
 ATCCACTCCCTACCCTCCTA
 GCAAG</code></pre>
 <p>
 Look for the alignment on the negative strand (-) of chr21, which conveniently aligns to the window 
 chr21:10,000,001-10,000,061.</p>
 <p>
 Browser window coordinates are 1-based [start,end] while PSL coordinates are 0-based [start,end), so
 a start of 10,000,001 in the browser corresponds to a start of 10,000,000 in the PSL. Subtracting 
 10,000,000 from the target (chromosome) position in PSL gives the query negative strand coordinate 
 above.</p>
 <p>
 The 4, 14, and 5 bases at beginning, middle, and end were chosen to not match with the genome at the
 corresponding position.</p>
 <p>
 <strong><em>Translated Queries:</em></strong><br>
 Translated queries translate both the query and target DNA into amino acids for greater sensitivity. 
 They are also used for protein search, although in that case the query does not need to be translated.
 For these search types, the strand field lists two values, the first for the query strand (qStrand) and the second for the target strand (tStrand).<br>
 The following rules apply, where x can be q or t:<br>
 If xStrand is negative, the xStarts list has negative-strand coordinates. <br>
 However, the xStart,xEnd values are always given in positive-strand coordinates, regardless of xStrand.</p>
 <p>
 <strong><em>Protein Query:</em></strong><br>
 A protein query consists of amino acids. To align amino acids against a database of nucleic acids,
 each target chromosome is first translated into amino acids for each of the six different reading 
 frames. The resulting protein PSL is a hybrid; the query fields are all in amino acid coordinates 
 and sizes, while the target database fields are in nucleic acid chromosome coordinates and sizes.
 The fields shared by query and target are blockCount and blockSizes. But blockSizes differ between 
 query (AA) and target (NA), so a single field cannot represent both. A choice was therefore made 
 to report the blockSizes field in amino acids since it is a protein query.</p>
 <p>
 To find the size of a target exon in nucleic acids, use the formula:</p>
 <pre><code>blockSizes[exonNumber]*3</code></pre>
 <p>
 Or, to find the end position of a target exon, use the formula:</p> 
 <pre><code>tStarts[exonNumber] + (blockSizes[exonNumber]*3)</code></pre>
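 <p>
 As a minimal sketch, the two formulas can be combined into one helper. The function name 
 <em>target_block_span</em> and the sample numbers are hypothetical. The result is expressed in 
 the same coordinates as <em>tStarts</em>, so for a negative target strand it is still in 
 negative-strand coordinates.</p>

```python
def target_block_span(t_starts, block_sizes, exon_number):
    """Return (start, end) of one block on the target, in nucleic acid units.

    block_sizes are in amino acids for a protein query PSL, so each block
    covers three times as many bases on the target.
    """
    t_start = t_starts[exon_number]
    size_in_bases = block_sizes[exon_number] * 3
    return t_start, t_start + size_in_bases


# Hypothetical protein PSL with two blocks of 10 and 25 amino acids.
print(target_block_span([1000, 2000], [10, 25], 1))
```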
 
 <a name="format3"></a>
 <h2>GFF format</h2>
 <p>
 GFF (General Feature Format) lines are based on the Sanger 
 <a href="http://www.sanger.ac.uk/resources/software/gff/spec.html" target="_blank">GFF2 
 specification</a>. GFF lines have nine required fields that <I>must</I> be
 tab-separated. If the fields are separated by spaces instead of tabs, the track will not display 
 correctly. For more information on GFF format, refer to Sanger's 
 <a href="http://www.sanger.ac.uk/resources/software/gff/" target="_blank">GFF page</a>.</p>
 <p>
 Note that there is also a GFF3 specification that is not currently supported by the Browser.
 All GFF tracks must be formatted according to Sanger's GFF2 specification.</p>
 <p>
 If you would like to obtain browser data in GFF (GTF) format, please refer to
 <a href="http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format" 
 target="_blank">Genes in gtf or gff format</a> on the Wiki.</p>
 <p>
 Here is a brief description of the GFF fields:</p>
 <ol>
   <li>
   <strong>seqname</strong> - The name of the sequence. Must be a chromosome or scaffold.</li>
   <li>
   <strong>source</strong> - The program that generated this feature.</li>
   <li>
   <strong>feature</strong> - The name of this type of feature. Some examples of standard feature 
   types are &quot;CDS&quot;, &quot;start_codon&quot;, &quot;stop_codon&quot;, and &quot;exon&quot;.</li>
   <li>
   <strong>start</strong> - The starting position of the feature in the sequence. The first base is 
   numbered 1.</li>
   <li>
   <strong>end</strong> - The ending position of the feature (inclusive).</li>
   <li>
   <strong>score</strong> - A score between 0 and 1000. If the track line <em>useScore</em> attribute
   is set to 1 for this annotation data set, the <em>score</em> value will determine the level of 
   gray in which this feature is displayed (higher numbers = darker gray). If there is no score 
   value, enter &quot;.&quot;.</li>
   <li>
   <strong>strand</strong> - Valid entries include &quot;+&quot;, &quot;-&quot;, or &quot;.&quot; 
   (for don't know/don't care).</li>        
   <li>
   <strong>frame</strong> - If the feature is a coding exon, <em>frame</em> should be a number 
   between 0-2 that represents the reading frame of the first base. If the feature is not a coding 
   exon, the value should be &quot;.&quot;.</li>        
   <li>
   <strong>group</strong> - All lines with the same group are linked together into a single 
   item.</li>
 </ol>
 <p>
 <strong><em>Example:</em></strong><br>
 Here's an example of a GFF-based track. This data format requires tabs, and some operating systems 
 convert tabs to spaces. If pasting doesn't work, this <a href="../goldenPath/help/regulatory.txt" 
 target="_blank">example's</a> contents or the URL itself can be pasted into the custom track text box.</p>
 <pre><code>browser position chr22:10000000-10025000
 browser hide all
 track name=regulatory description=&quot;TeleGene(tm) Regulatory Regions&quot; visibility=2
 chr22	TeleGene	enhancer	10000000	10001000	500	+	.	touch1
 chr22	TeleGene	promoter	10010000	10010100	900	+	.	touch1
 chr22	TeleGene	promoter	10020000	10025000	800	-	.	touch2
 </code></pre>
 <p>
 Click <a class="insideLink" href="../cgi-bin/hgTracks?org=human&amp;position=chr22&amp;hgt.customText=http://genome.ucsc.edu/goldenPath/help/regulatory.txt"
 target="_blank">here</a> to display this track in the Genome Browser.</p>
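 <p>
 Because GFF requires tabs and uses 1-based, inclusive coordinates, a small parser can help catch 
 problems before uploading. The sketch below is an illustration only; the names 
 <em>parse_gff_line</em> and <em>gff_to_bed_interval</em> are hypothetical. It splits strictly on 
 tabs, so a line whose tabs were converted to spaces is rejected.</p>

```python
GFF_FIELDS = ["seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "group"]


def parse_gff_line(line):
    """Split one GFF2 data line into its nine tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF_FIELDS):
        raise ValueError(f"expected 9 tab-separated fields, found {len(fields)}")
    record = dict(zip(GFF_FIELDS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def gff_to_bed_interval(record):
    """Convert GFF start/end (1-based, inclusive) to BED (0-based, end excluded)."""
    return record["seqname"], record["start"] - 1, record["end"]


line = "chr22\tTeleGene\tenhancer\t10000000\t10001000\t500\t+\t.\ttouch1"
print(gff_to_bed_interval(parse_gff_line(line)))
```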
 
 <a name="format4"></a>
 <h2>GTF format</h2>
 <p>
 <!--#include virtual="../goldenPath/help/GTF.html" -->
 </p>
 
 <a name="format20"></a>
 <h2>HAL format</h2>
 <p> 
 HAL is a graph-based structure to efficiently store and index multiple genome alignments and 
 ancestral reconstructions. HAL files are represented in <a href="http://www.hdfgroup.org/HDF5/" 
 target="_blank">HDF5 format</a>, an open standard for storing and indexing large, compressed 
 scientific data sets. Genomes within HAL are organized according to the phylogenetic tree that 
 relates them: each genome is segmented into pairwise DNA alignment blocks with respect to its parent 
 and children (if present) in the tree. Note that if the phylogeny is unknown, a star tree can be 
 used. The modularity provided by this tree-based decomposition allows for efficient querying of 
 sub-alignments, as well as the ability to add, remove and update genomes within the alignment with 
 only local modifications to the structure. Another important feature of HAL is reference 
 independence: alignments in this format can be queried with respect to the coordinates of any 
 genome they contain. </p> 
 <p> 
 HAL files can be created or read with a comprehensive C++ API (click 
 <a href="https://github.com/glennhickey/hal" target="_blank">here</a> for source code and manual). 
 A set of command line tools is included to perform basic operations, such as importing and exporting
 data, identifying mutations, coordinate mapping (liftOver), and comparative assembly hub 
 generation.</p> 
 <p> 
 HAL is the native output format of the Progressive Cactus alignment pipeline, and is included in 
 the <a href="https://github.com/glennhickey/progressiveCactus" 
 target="_blank">Progressive Cactus</a> installation package.</p>
 
 <a name="format24"></a>
 <h2>Longrange longTabix format</h2>
 <p>
 The longrange track is a BED-like file type. Each row contains columns
 that define the chromosome, the start position (0-based), the end position (not included),
 and an interaction target written in the form chr2:333-444,55. For examples,
 see the source of this format
 at <a href="https://epigenomegateway.readthedocs.io/en/latest/tracks.html#longrange"
 target="_blank">WashU Epigenome Browser</a>.</p>
 <p>
 Also, review the enhanced  <a href="../goldenPath/help/interact.html">interact</a> format
 for information on how to visualize pairwise interactions as arcs in the browser.
 </p>
 
 <a name="format5"></a>
 <h2>MAF format</h2> 
 <p> 
 The multiple alignment format stores a series of multiple alignments in a format that is easy to 
 parse and relatively easy to read. This format stores multiple alignments at the DNA level between 
 entire genomes. Previously used formats are suitable for multiple alignments of single proteins or 
 regions of DNA without rearrangements, but would require considerable extension to cope with genomic
 issues such as forward and reverse strand directions, multiple pieces to the alignment, and so 
 forth.</p> 
 <p> 
 <strong>General Structure</strong><br> 
 The <em>.maf</em> format is line-oriented. Each multiple alignment begins with the reference genome
 line and ends with a blank line. Each 
 sequence in an alignment is on a single line, which can get quite long, but there is no length 
 limit. Words in a line are delimited by any white space. Lines starting with # are considered to be 
 comments. Lines starting with ## can be ignored by most programs, but contain meta-data of one form 
 or another.</p> 
 <p> 
 The file is divided into paragraphs that terminate in a blank line.  Within a paragraph, the first 
 word of a line indicates its type. Each multiple alignment is in a separate paragraph that begins 
 with an &quot;a&quot; line and contains an &quot;s&quot; line for each sequence in the multiple 
 alignment. The first sequence must be the reference genome on which the rest of the sequences map. 
 Some MAF files may contain other optional line types: </p>
 <ul>
   <li>
   an &quot;i&quot; line containing information about what is in the aligned species DNA before and 
   after the immediately preceding &quot;s&quot; line</li>
   <li>
   an &quot;e&quot; line containing information about the size of the gap between the alignments 
   that span the current block</li>
   <li>
   a &quot;q&quot; line indicating the quality of each aligned base for the species</li>
 </ul>
 <p>
 Parsers may ignore any other types of paragraphs and other types of lines within an alignment 
 paragraph. </p> 
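 <p>
 The paragraph structure described above can be read with a short loop. The following is a 
 minimal sketch rather than a complete MAF parser: the function names are hypothetical, the sample 
 lines are made up for illustration, and the words on each line are kept as an uninterpreted 
 list.</p>

```python
def maf_paragraphs(text):
    """Yield each blank-line-terminated paragraph as a list of word lists."""
    paragraph = []
    for line in text.splitlines():
        if not line.strip():
            if paragraph:
                yield paragraph
                paragraph = []
        elif line.startswith("#"):
            continue  # comment or ##-style meta-data line
        else:
            paragraph.append(line.split())  # words are white-space delimited
    if paragraph:
        yield paragraph


def alignment_blocks(text):
    """Yield the "s" lines of every paragraph that begins with an "a" line."""
    for paragraph in maf_paragraphs(text):
        if paragraph[0][0] == "a":
            # Other line types ("i", "e", "q", unknown) are skipped here.
            yield [words for words in paragraph if words[0] == "s"]


sample = """##maf version=1
a score=100.0
s ref.chr1 10 3 + 1000 ACG
i other.chr2 N 0 C 0
s other.chr2 20 3 + 2000 ACG

"""
for block in alignment_blocks(sample):
    print([words[1] for words in block])
```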
 <p> 
 <strong>Custom Tracks</strong><br> 
 The first line of a custom MAF track must be a &quot;track&quot; line that contains a name=value 
 pair specifying the track name. Here is an example of a minimal track line: </p>
 <pre><code>track name=sample</code></pre> 
 <p> 
 The following variables can be specified in the track line of a custom MAF:</p>
 <ul>
   <li>
   <strong>name=sample</strong> - Required. Name of the track. </li>
   <li>
   <strong>description="Sample Track"</strong> - Optional. Assigns a long name for the track.</li>
   <li>
   <strong>frames=multiz28wayFrames</strong> - Optional. Tells the browser which table to grab the 
   gene frames from. This is usually associated with an N-way alignment where the name ends in the 
   string &quot;Frames&quot;.</li>
   <li>
   <strong>mafDot=on</strong> - Optional. Use dots instead of bases when bases are identical.</li> 
   <li>
   <strong>visibility=dense|pack|full</strong> - Optional. Sets the default visibility mode for this 
   track.</li>
   <li>
   <strong>speciesOrder="hg18 panTro2"</strong> - Optional. White-space separated list specifying the
   order in which the sequences in the maf should be displayed.</li>
 </ul>
 <p>
 The second line of a custom MAF track must be a header line as described below.</p> 
 <p>
 <strong>Header Line</strong></p>
 <p> 
 The first line of a <em>.maf</em> file begins with ##maf. This word is followed by 
 white-space-separated variable=value pairs. There should be <em>no</em> white space surrounding the 
 &quot;=&quot;.</p>  <pre><code>##maf version=1 scoring=tba.v8</code></pre> 
 <p>
 The currently defined variables are:</p> 
 <ul>
   <li>
   <strong>version</strong> - Required. Currently set to one.</li>
   <li>
   <strong>scoring</strong> - Optional. A name for the scoring scheme used for the alignments. The 
   current scoring schemes are:
   <ul>
     <li>
     <em>bit</em> - roughly corresponds to blast bit values (roughly 2 points per aligning base 
     minus penalties for mismatches and inserts).</li>
     <li>
     <em>blastz</em> - blastz scoring scheme -- roughly 100 points per aligning base.</li>
     <li>
     <em>probability</em> - some score normalized between 0 and 1.</li>
   </ul>
   <li>
   <strong>program</strong> - Optional. Name of the program generating the alignment.</li>
 </ul>
 <p> 
 Undefined variables are ignored by the parser.</p> 
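 <p>
 The header rules above can be sketched in a few lines of Python. The helper name
 <code>parse_maf_header</code> is an assumption made for illustration; it keeps every
 variable=value pair it finds and leaves it to the caller to ignore undefined ones.</p>

```python
def parse_maf_header(line):
    """Return the variable=value pairs of a ##maf header line as a dict."""
    words = line.split()
    if not words or words[0] != "##maf":
        raise ValueError("not a MAF header line")
    pairs = {}
    for word in words[1:]:
        name, separator, value = word.partition("=")
        if separator:  # skip words that are not name=value pairs
            pairs[name] = value
    if "version" not in pairs:
        raise ValueError("the version variable is required")
    return pairs
```

 <p>
 For the header shown above, this returns a dict with <code>version</code> set to
 &quot;1&quot; and <code>scoring</code> set to &quot;tba.v8&quot;.</p>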
 <p> 
 The header line is usually followed by a comment line (it begins with a #) that describes the 
 parameters that were used to run the alignment program: </p>
 <pre><code># tba.v8 (((human chimp) baboon) (mouse rat))</code></pre> 
 <p> 
 <strong>Alignment Block Lines</strong> (lines starting with &quot;a&quot; -- parameters for a new 
 alignment block)</p> 
 <pre><code>a score=23262.0</code></pre>
 <p>
 Each alignment begins with an &quot;a&quot; line that sets variables for the entire alignment block. 
 The &quot;a&quot; is followed by name=value pairs. There are no required name=value pairs. The 
 currently defined variables are: </p> 
 <ul> 
   <li> 
   <strong>score</strong> -- Optional. Floating point score. If this is present, it is good practice 
   to also define scoring in the first line.</li> 
   <li> 
   <strong>pass</strong> -- Optional. Positive integer value. For programs that do multiple pass 
   alignments such as blastz, this shows which pass this alignment came from. Typically, pass 1 will 
   find the strongest alignments genome-wide, and pass 2 will find weaker alignments between two 
   first-pass alignments.</li>
 </ul> 
 <p> 
 <strong>Lines starting with &quot;s&quot; -- a sequence within an alignment block</strong></p>
 <pre><code> s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
  s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
  s baboon         249182 13 +   4622798 gcagctgaaaaca
  s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA
 </code></pre>
 <p> 
 The &quot;s&quot; lines together with the &quot;a&quot; lines define a multiple alignment. 
 The first &quot;s&quot; line must be the reference genome, hg16 in the above example.
 The &quot;s&quot; lines have the following fields which are defined by position.</p> 
 <ul> 
   <li> 
   <strong>src</strong> -- The name of one of the source sequences for the alignment. For sequences 
   that are resident in a browser assembly, the form 'database.chromosome' allows automatic creation 
   of links to other assemblies. Non-browser sequences are typically referenced by the species name 
   alone.</li>
   <li> 
   <strong>start</strong> -- The start of the aligning region in the source sequence. This is a 
   zero-based number. If the strand field is &quot;-&quot; then this is the start relative to the 
   reverse-complemented source sequence (see 
   <a href="http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms" target="_blank">Coordinate 
   Transforms</a>).</li> 
   <li> 
   <strong>size</strong> -- The size of the aligning region in the source sequence. This number is 
   equal to the number of non-dash characters in the alignment text field below.</li>          
   <li> 
   <strong>strand</strong> -- Either &quot;+&quot; or &quot;-&quot;. If &quot;-&quot;, then the 
   alignment is to the reverse-complemented source.</li> 
   <li> 
   <strong>srcSize</strong> -- The size of the entire source sequence, not just the parts involved in
   the alignment.</li>
   <li> 
   <strong>text</strong> -- The nucleotides (or amino acids) in the alignment and any insertions 
   (dashes) as well.</li> 
 </ul>          
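 <p>
 The positional fields above might be read as in the following Python sketch. The names
 <code>SLine</code>, <code>parse_s_line</code>, and <code>forward_interval</code> are hypothetical.
 The conversion of a &quot;-&quot; strand start to forward-strand coordinates assumes the usual
 reverse-complement arithmetic (srcSize minus start minus size).</p>

```python
from collections import namedtuple

SLine = namedtuple("SLine", "src start size strand src_size text")

def parse_s_line(line):
    """Parse one "s" line and check that size matches the alignment text."""
    words = line.split()
    if len(words) != 7 or words[0] != "s":
        raise ValueError("expected an 's' line with six fields")
    src, start, size, strand, src_size, text = words[1:]
    record = SLine(src, int(start), int(size), strand, int(src_size), text)
    # size must equal the number of non-dash characters in the text field.
    if record.size != len(text) - text.count("-"):
        raise ValueError("size does not match the non-dash characters in text")
    return record

def forward_interval(record):
    """Return the zero-based, half-open (start, end) on the forward strand."""
    if record.strand == "+":
        return record.start, record.start + record.size
    # On "-", start is relative to the reverse-complemented source sequence.
    end = record.src_size - record.start
    return end - record.size, end
```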
 <p> 
 <strong>Lines starting with &quot;i&quot; -- information about what's happening before and after 
 this block in the aligning species</strong></p> 
 <pre><code> s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
  s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
  i panTro1.chr6 N 0 C 0
  s baboon         249182 13 +   4622798 gcagctgaaaaca
  i baboon       I 234 n 19 </code></pre> 
 <p> 
 The &quot;i&quot; lines contain information about the context of the sequence lines immediately 
 preceding them. The following fields are defined by position rather than name=value pairs:</p> 
 <ul> 
   <li>
   <strong>src</strong> -- The name of the source sequence for the alignment. Should be the same as 
   the &quot;s&quot; line immediately above this line.</li>             
   <li>
   <strong>leftStatus</strong> -- A character that specifies the relationship between the sequence in
   this block and the sequence that appears in the previous block.</li> 
   <li>
   <strong>leftCount</strong> -- Usually the number of bases in the aligning species between the 
   start of this alignment and the end of the previous one.</li>                          
   <li>
   <strong>rightStatus</strong> -- A character that specifies the relationship between the sequence 
   in this block and the sequence that appears in the subsequent block.</li>
   <li>
   <strong>rightCount</strong> -- Usually the number of bases in the aligning species between the end
   of this alignment and the start of the next one.</li> 
 </ul> 
 <p> 
 The status characters can be one of the following values: </p>
 <ul>                          
   <li>
   <strong>C</strong> -- the sequence before or after is contiguous with this block.</li> 
   <li>
   <strong>I</strong> -- there are bases between the bases in this block and the one before or after 
   it.</li> 
   <li>
   <strong>N</strong> -- this is the first sequence from this src chrom or scaffold.</li> 
   <li>
   <strong>n</strong> -- this is the first sequence from this src chrom or scaffold but it is bridged
   by another alignment from a different chrom or scaffold.</li>                          
   <li>
   <strong>M</strong> -- there is missing data before or after this block (Ns in the sequence).</li> 
   <li>
   <strong>T</strong> -- the sequence in this block has been used before in a previous block (likely 
   a tandem duplication)</li> 
 </ul>                        
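 <p>
 A minimal Python sketch of reading an &quot;i&quot; line follows. The helper name
 <code>parse_i_line</code> and the short descriptions in <code>I_STATUS</code> are assumptions for
 illustration; only the status characters themselves come from the list above.</p>

```python
# Status characters allowed in "i" lines, with abbreviated descriptions.
I_STATUS = {
    "C": "contiguous with this block",
    "I": "intervening bases",
    "N": "first sequence from this source",
    "n": "first sequence from this source, bridged by another alignment",
    "M": "missing data (Ns)",
    "T": "sequence used in a previous block",
}

def parse_i_line(line):
    """Parse one "i" line into a dict keyed by the field names above."""
    words = line.split()
    if len(words) != 6 or words[0] != "i":
        raise ValueError("expected an 'i' line with five fields")
    src, left_status, left_count, right_status, right_count = words[1:]
    for status in (left_status, right_status):
        if status not in I_STATUS:
            raise ValueError("unknown status character: " + status)
    return {
        "src": src,
        "leftStatus": left_status,
        "leftCount": int(left_count),
        "rightStatus": right_status,
        "rightCount": int(right_count),
    }
```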
 <p> 
 <strong>Lines starting with &quot;e&quot; -- information about empty parts of the alignment
 block</strong></p>
 <pre><code> s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
  e mm4.chr6     53310102 13 + 151104725 I </code></pre> 
 <p> 
 The &quot;e&quot; lines indicate that there isn't aligning DNA for a species but that the current 
 block is bridged by a chain that connects blocks before and after this block. The following fields 
 are defined by position rather than name=value pairs.</p> 
 <ul>                          
   <li>
   <strong>src</strong> -- The name of one of the source sequences for the alignment.</li>
   <li>
   <strong>start</strong> -- The start of the non-aligning region in the source sequence. This is a 
   zero-based number. If the strand field is &quot;-&quot; then this is the start relative to the 
   reverse-complemented source sequence (see 
   <a href="http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms" target="_blank">Coordinate 
   Transforms</a>).</li>
   <li>
   <strong>size</strong> -- The size in base pairs of the non-aligning region in the source 
   sequence.</li>
   <li>
   <strong>strand</strong> -- Either &quot;+&quot; or &quot;-&quot;. If &quot;-&quot;, then the 
   alignment is to the reverse-complemented source.</li> 
   <li>
   <strong>srcSize</strong> -- The size of the entire source sequence, not just the parts involved in
   the alignment.</li> 
   <li>
   <strong>status</strong> -- A character that specifies the relationship between the non-aligning 
   sequence in this block and the sequence that appears in the previous and subsequent blocks.</li> 
 </ul>
 <p>
 The status character can be one of the following values:</p> 
 <ul>                          
   <li>
   <strong>C</strong> -- the sequence before and after is contiguous implying that this region was 
   either deleted in the source or inserted in the reference sequence. The browser draws a single 
   line or a &quot;-&quot; in base mode in these blocks.</li>  
   <li>
   <strong>I</strong> -- there are non-aligning bases in the source species between chained alignment
   blocks before and after this block. The browser shows a double line or &quot;=&quot; in base 
   mode.</li>  
   <li>
   <strong>M</strong> -- there are non-aligning bases in the source and more than 90% of them are Ns 
   in the source. The browser shows a pale yellow bar.</li>                        
   <li>
   <strong>n</strong> -- there are non-aligning bases in the source and the next aligning block 
   starts in a new chromosome or scaffold that is bridged by a chain between still other blocks. The 
   browser shows either a single line or a double line based on how many bases are in the gap between
   the bridging alignments.</li>
 </ul>
 <p>
 <strong>Lines starting with &quot;q&quot; -- information about the quality of each aligned base for 
 the species</strong> </p>
 <pre><code> s hg18.chr1                  32741 26 + 247249719 TTTTTGAAAAACAAACAACAAGTTGG
  s panTro2.chrUn            9697231 26 +  58616431 TTTTTGAAAAACAAACAACAAGTTGG
  q panTro2.chrUn                                   99999999999999999999999999
  s dasNov1.scaffold_179265     1474  7 +      4584 TT----------AAGCA---------
  q dasNov1.scaffold_179265                         99----------32239--------- </code></pre> 
 <p> 
 The &quot;q&quot; lines contain a compressed version of the actual raw quality data, representing 
 the quality of each aligned base for the species with a single character of 0-9 or F. The following 
 fields are defined by position rather than name=value pairs:</p> 
 <ul>
   <li>
   <strong>src</strong> -- The name of the source sequence for the alignment. Should be the same as 
   the &quot;s&quot; line immediately preceding this line.</li>             
   <li>
   <strong>value</strong> -- A MAF quality value corresponding to the aligning nucleotide in the
   preceding &quot;s&quot; line. Insertions (dashes) in the preceding &quot;s&quot; line are 
   represented by dashes in the &quot;q&quot; line as well. The quality value can be &quot;F&quot; 
   (finished sequence) or a number derived from the actual quality scores (which range from 0-97) or 
   the manually assigned score of 98. These numeric values are calculated as:</p>
   <pre><code>MAF quality value = min( floor(actual quality value/5), 9 )</code></pre>
   <p> 
   This results in the following mapping:</p> 
   <p> 
   <table> 
     <tr> 
       <th nowrap><strong>MAF quality value</strong></th> 
       <th nowrap><strong>Raw quality score range</strong></th> 
       <th nowrap><strong>Quality level</strong></th></tr> 
     <tr> 
       <td align="center">0-8</td>
       <td align="center">0-44</td> 
       <td align="center">Low</td> 
     </tr> 
     <tr> 
       <td align="center">9</td>
       <td align="center">45-97</td>
       <td align="center">High</td>
     </tr>
     <tr>
       <td align="center">0</td>
       <td align="center">98</td>
       <td align="center">Manually assigned</td>
     </tr>
     <tr>
       <td align="center">F</td>
       <td align="center">99</td>
       <td align="center">Finished</td>
     </tr>
   </table>
 </ul> 
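 <p>
 The mapping in the table above can be written as a small Python function. The name
 <code>maf_quality_char</code> is hypothetical; the special cases for 98 and 99 follow the table
 rather than the general formula.</p>

```python
def maf_quality_char(raw_quality):
    """Map a raw quality score (0-99) to its single-character MAF quality value."""
    if raw_quality == 99:
        return "F"  # finished sequence
    if raw_quality == 98:
        return "0"  # manually assigned score
    if not 0 <= raw_quality <= 97:
        raise ValueError("raw quality must be between 0 and 99")
    # MAF quality value = min(floor(actual quality value / 5), 9)
    return str(min(raw_quality // 5, 9))
```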
 <p> 
 <strong>A Simple Example</strong></p>                     
 <p>
 Here is a simple example of three alignment blocks derived from five starting sequences. The 
 first <strong>track</strong> line is necessary for custom tracks, but should be removed otherwise. Repeats are
 shown as lowercase, and each block may have a subset of the input sequences. All sequence columns 
 and rows must contain at least one nucleotide (no columns or rows that contain only insertions).</p>
 <pre><code>track name=euArc visibility=pack
 ##maf version=1 scoring=tba.v8 
 # tba.v8 (((human chimp) baboon) (mouse rat)) 
                    
 a score=23262.0     
 s hg18.chr7    27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
 s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
 s baboon         116834 38 +   4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
 s mm4.chr6     53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
 s rn3.chr4     81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG
                    
 a score=5062.0                    
 s hg18.chr7    27699739 6 + 158545518 TAAAGA
 s panTro1.chr6 28862317 6 + 161576975 TAAAGA
 s baboon         241163 6 +   4622798 TAAAGA 
 s mm4.chr6     53303881 6 + 151104725 TAAAGA
 s rn3.chr4     81444246 6 + 187371129 taagga
 
 a score=6636.0
 s hg18.chr7    27707221 13 + 158545518 gcagctgaaaaca
 s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
 s baboon         249182 13 +   4622798 gcagctgaaaaca
 s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA </code></pre>
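 <p>
 The rule that every row and column must contain at least one nucleotide could be checked with a
 sketch like the one below. The function name <code>check_alignment_text</code> is hypothetical,
 and the sketch additionally assumes that all text fields in a block have the same length.</p>

```python
def check_alignment_text(rows):
    """Return a list of problems found in the text fields of one alignment block."""
    problems = []
    if len({len(row) for row in rows}) > 1:
        # Columns are undefined when rows differ in length, so stop here.
        problems.append("rows differ in length")
        return problems
    for number, row in enumerate(rows, start=1):
        if not row.replace("-", ""):
            problems.append("row %d contains only dashes" % number)
    for number, column in enumerate(zip(*rows), start=1):
        if all(symbol == "-" for symbol in column):
            problems.append("column %d contains only dashes" % number)
    return problems
```

 <p>
 An empty list means the block passed both checks.</p>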
 
 <a name="format6.5"></a>
 <h2>Microarray format</h2>
 <p>
 The datasets for the built-in microarray tracks in the Genome Browser are stored in BED15 format, 
 an extension of <a href="#format1">BED</a> format that includes three additional fields: expCount, 
 expIds, and expScores. To display correctly in the Genome Browser, microarray tracks require the 
 setting of several attributes in the trackDb file associated with the track's genome assembly. 
 Each microarray track set must also have an associated microarrayGroups.ra configuration file that 
 contains additional information about the data in each of the arrays.</p>
 <p>
 User-created microarray custom tracks are similar in format to BED custom tracks with the addition 
 of three required track line parameters in the header--expNames, expScale, and expStep--that mimic 
 the trackDb and microarrayGroups.ra settings of built-in microarray tracks. </p>
 <p>
 For a complete description of the microarray track format and an explanation of how to construct a 
 microarray custom track, see the <a href="http://genomewiki.ucsc.edu/index.php/Microarray_track"
 target="_blank">Genome Browser Wiki</a>. </p>
 
 <a name="format7"></a>
 <h2>.2bit format</h2>
 <p>
 A .2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact randomly-accessible 
 format. The file contains masking information as well as the DNA itself.</p>
 <p>
 The file begins with a 16-byte header containing the following fields:</p>
 <ul>
   <li>
   <strong>signature</strong> - the number 0x1A412743 in the architecture of the machine that created
   the file</li>
   <li>
   <strong>version</strong> - zero for now. Readers should abort if they see a version number higher 
   than 0</li>
   <li>
   <strong>sequenceCount</strong> - the number of sequences in the file</li>  
   <li>
   <strong>reserved</strong> - always zero for now</li>
 </ul>
 <p>
 All fields are 32 bits unless noted. If the signature value is not as given, the reader program 
 should byte-swap the signature and check if the swapped version matches. If so, all multiple-byte 
 entities in the file will have to be byte-swapped. This enables these binary files to be used 
 unchanged on different architectures.</p>
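 <p>
 The header check described above might look like the following Python sketch. The function name
 <code>read_twobit_header</code> is an assumption. Instead of byte-swapping the signature
 explicitly, the sketch decodes the header in both byte orders and keeps the one that matches,
 which has the same effect.</p>

```python
import struct

TWOBIT_SIGNATURE = 0x1A412743

def read_twobit_header(data):
    """Return (byte_order, sequence_count) from the first 16 bytes of a .2bit file."""
    if len(data) < 16:
        raise ValueError("header is shorter than 16 bytes")
    for byte_order in ("<", ">"):  # little-endian first, then big-endian
        signature, version, sequence_count, reserved = struct.unpack(
            byte_order + "4I", data[:16])
        if signature == TWOBIT_SIGNATURE:
            break
    else:
        raise ValueError("signature does not match in either byte order")
    if version != 0:
        raise ValueError("unsupported version: %d" % version)
    return byte_order, sequence_count
```

 <p>
 The returned byte order can then be reused when decoding every other multiple-byte field in the
 file.</p>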
 <p>
 The header is followed by a file index, which contains one entry for each sequence. Each index entry
 contains three fields:</p>
 <ul>
   <li>
   <strong>nameSize</strong> - a byte containing the length of the name field</li>
   <li>
   <strong>name</strong> - the sequence name itself (in ASCII-compatible byte string), of variable 
   length depending on nameSize</li>
   <li>
   <strong>offset</strong> - the 32-bit offset of the sequence data relative to the start of the 
   file, not aligned to any 4-byte padding boundary</li>
 </ul>
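 <p>
 Reading the index entries can be sketched as follows. The function name
 <code>read_twobit_index</code> and its parameters are hypothetical; the sketch assumes the file
 contents are already in memory as <code>bytes</code> and that the byte order has been determined
 from the signature.</p>

```python
import struct

def read_twobit_index(data, sequence_count, byte_order="<"):
    """Return a dict mapping each sequence name to its offset in the file."""
    position = 16  # the index starts right after the 16-byte header
    index = {}
    for _ in range(sequence_count):
        name_size = data[position]  # a single byte holding the name length
        position += 1
        name = data[position:position + name_size].decode("ascii")
        position += name_size
        # The 32-bit offset follows the name with no padding.
        (offset,) = struct.unpack_from(byte_order + "I", data, position)
        position += 4
        index[name] = offset
    return index
```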
 <p>
 The index is followed by the sequence records, which contain nine fields:</p>
 <ul>
   <li>
   <strong>dnaSize</strong> - number of bases of DNA in the sequence</li>
   <li>
   <strong>nBlockCount</strong> - the number of blocks of Ns in the sequence (representing unknown 
   sequence)</li>
   <li>
   <strong>nBlockStarts</strong> - an array of length nBlockCount of 32 bit integers indicating the 
   (0-based) starting position of a block of Ns</li>
   <li>
   <strong>nBlockSizes</strong> - an array of length nBlockCount of 32 bit integers indicating the 
   length of a block of Ns</li>
   <li>
   <strong>maskBlockCount</strong> - the number of masked (lower-case) blocks</li>
   <li>
   <strong>maskBlockStarts</strong> - an array of length maskBlockCount of 32 bit integers indicating
   the (0-based) starting position of a masked block</li>
   <li>
   <strong>maskBlockSizes</strong> - an array of length maskBlockCount of 32 bit integers indicating 
   the length of a masked block</li>
   <li>
   <strong>reserved</strong> - always zero for now</li>
   <li>
   <strong>packedDna</strong> - the DNA packed to two bits per base, represented as so: T - 00, 
   C - 01, A - 10, G - 11. Within each byte, the first base is in the most significant 2 bits and 
   the last base is in the least significant 2 bits. For example, the sequence TCAG is represented 
   as 00011011.</li>
 </ul>
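 <p>
 The two-bit packing described for <code>packedDna</code> can be illustrated with a short Python
 sketch. The function name <code>unpack_dna</code> is hypothetical, and the sketch ignores the N
 and mask blocks, which would have to be applied afterwards.</p>

```python
BASES = "TCAG"  # T - 00, C - 01, A - 10, G - 11

def unpack_dna(packed, dna_size):
    """Decode dna_size bases from packed bytes, four bases per byte."""
    bases = []
    for i in range(dna_size):
        byte = packed[i // 4]
        # The first base of each byte sits in the two most significant bits.
        shift = 6 - 2 * (i % 4)
        bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases)
```

 <p>
 For example, the single byte 00011011 with a <code>dnaSize</code> of 4 decodes to TCAG.</p>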
 <p>
 For a complete definition of all fields in the twoBit format, see 
 <a href="http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/inc/twoBit.h">this</a> 
-description in the source code.</p>
+description in the source code. Click these links to see examples of using the
+<a href="../../goldenPath/help/twoBit.html" target="_blank"><code>faToTwoBit</code>,
+<code>twoBitInfo</code>, and <code>twoBitToFa</code></a> commands, and how to
+<a href="../../goldenPath/help/twoBit.html#extract" target="_blank">extract DNA</a> from 2bit
+files, including with our <a href="../../goldenPath/help/api.html" target="_blank">API</a>.</p>
 
 <a name="format8"></a>
 <h2>.nib format</h2>
 <p>
 The .nib format pre-dates the .2bit format and is less compact. It describes a DNA sequence by 
 packing two bases into each byte. Each .nib file contains only a single sequence. The file begins 
 with a 32-bit signature that is 0x6BE93D3A in the architecture of the machine that created the file 
 (or possibly a byte-swapped version of the same number on another machine). This is followed by a 
 32-bit number in the same format that describes the number of bases in the file. Next, the bases 
 themselves are listed, packed two bases to the byte. The first base is packed in the high-order 4 
 bits (nibble); the second base is packed in the low-order four bits:</p>
 <pre><code>byte = (base1<<4) + base2
 </code></pre>
 <p>
 The numerical representations for the bases are:</p>
 <pre><code>0 - T
 1 - C
 2 - A
 3 - G
 4 - N (unknown)</code></pre>
 <p>
 The most significant bit in a nibble is set if the base is masked. </p>
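 <p>
 Decoding one byte of a .nib file might look like the sketch below. The function name
 <code>decode_nib_byte</code> is hypothetical, and showing masked bases in lower case is an
 assumption borrowed from the .2bit description. When the file holds an odd number of bases, the
 caller would need to discard the final low-order nibble.</p>

```python
NIB_BASES = "TCAGN"  # 0 - T, 1 - C, 2 - A, 3 - G, 4 - N

def decode_nib_byte(byte):
    """Decode the two bases packed into one .nib byte."""
    decoded = []
    for nibble in (byte >> 4, byte & 0x0F):  # high-order nibble first
        masked = bool(nibble & 0x8)  # the most significant bit marks masking
        code = nibble & 0x7
        if code > 4:
            raise ValueError("invalid base code: %d" % code)
        base = NIB_BASES[code]
        decoded.append(base.lower() if masked else base)
    return "".join(decoded)
```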
 
 <a name="format9"></a>
 <h2>GenePred table format</h2>
 <p>
 genePred is a table format commonly used for gene prediction tracks in the Genome Browser. 
 Variations of the genePred format are listed below.</p> 
 <p>
 If you would like to obtain browser data in GFF (GTF) format, please refer to 
 <a href="http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format" 
 target="_blank">Genes in gtf or gff format</a> on the Wiki. There is also a
 format of genePred called <a href="../goldenPath/help/bigGenePred.html"
 target="_blank">bigGenePred</a>, a version of <a href="../goldenPath/help/bigBed.html"
 target="_blank">bigBed</a>, which enables custom tracks to display codon numbers
 and amino acids when zoomed in to the base level.</p>
 
 <a name=GenePredictions></a>
 <p><strong>Gene Predictions</strong></p>
 <p> 
 The following definition is used for gene prediction tables. In alternative-splicing situations, each
 transcript has a row in this table.</p>
 <pre><code>table genePred
 "A gene prediction."
     (
     string  name;               "Name of gene"
     string  chrom;              "Chromosome name"
     char[1] strand;             "+ or - for strand"
     uint    txStart;            "Transcription start position"
     uint    txEnd;              "Transcription end position"
     uint    cdsStart;           "Coding region start"
     uint    cdsEnd;             "Coding region end"
     uint    exonCount;          "Number of exons"
     uint[exonCount] exonStarts; "Exon start positions"
     uint[exonCount] exonEnds;   "Exon end positions"
     )
 </code></pre>
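 <p>
 A row of a genePred table could be read as in the following Python sketch. It assumes a
 tab-separated row whose <code>exonStarts</code> and <code>exonEnds</code> columns are
 comma-separated lists, possibly with a trailing comma. The function name
 <code>parse_genepred_row</code> and the sample gene in the usage note are made up for
 illustration.</p>

```python
def parse_genepred_row(line):
    """Parse the ten genePred columns of one tab-separated row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("expected at least 10 tab-separated columns")
    exon_count = int(fields[7])
    # Skip the empty string left behind by a trailing comma.
    exon_starts = [int(x) for x in fields[8].split(",") if x]
    exon_ends = [int(x) for x in fields[9].split(",") if x]
    if not (len(exon_starts) == len(exon_ends) == exon_count):
        raise ValueError("exonCount does not match the exon lists")
    return {
        "name": fields[0],
        "chrom": fields[1],
        "strand": fields[2],
        "txStart": int(fields[3]),
        "txEnd": int(fields[4]),
        "cdsStart": int(fields[5]),
        "cdsEnd": int(fields[6]),
        "exons": list(zip(exon_starts, exon_ends)),
    }
```

 <p>
 With a made-up row for a two-exon gene whose list columns are <code>100,300,</code> and
 <code>200,500,</code>, the <code>exons</code> entry holds the pairs (100, 200) and (300, 500).</p>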
 
 <a name=GenePredExt></a>
 <p><strong>Gene Predictions (Extended)</strong></p>
 <p>
 The following definition is used for extended gene prediction tables. In alternative-splicing 
 situations, each transcript has a row in this table. The refGene table is an example of the 
 genePredExt format.</p>
 <pre><code>table genePredExt
 "A gene prediction with some additional info."
     (
     string name;        	"Name of gene (usually transcript_id from GTF)"
     string chrom;       	"Chromosome name"
     char[1] strand;     	"+ or - for strand"
     uint txStart;       	"Transcription start position"
     uint txEnd;         	"Transcription end position"
     uint cdsStart;      	"Coding region start"
     uint cdsEnd;        	"Coding region end"
     uint exonCount;     	"Number of exons"
     uint[exonCount] exonStarts; "Exon start positions"
     uint[exonCount] exonEnds;   "Exon end positions"
     int score;            	"Score"
     string name2;       	"Alternate name (e.g. gene_id from GTF)"
     string cdsStartStat; 	"Status of CDS start annotation (none, unknown, incomplete, or complete)"
     string cdsEndStat;   	"Status of CDS end annotation (none, unknown, incomplete, or complete)"
     lstring exonFrames; 	"Exon frame offsets {0,1,2}"
     )
 </code></pre>
 <p>The fields cdsStartStat and cdsEndStat can have the values ('none','unk','incmpl','cmpl'). However,
 the values are not used for our display and cannot be used to subset for coding or non-coding
 genes. For most purposes, to get more information about a transcript, other tables will need to
 be used e.g. in the case of hg38, the tables named wgEncodeGencodeAttrsVxx, where xx is the
 Gencode Version number. See this <a href="../../FAQ/FAQgenes.html#coding"
 target="_blank">coding/non-coding genes FAQ</a> for more information.</p>
 
 <a name=RefFlat></a>
 <p><strong>Gene Predictions and RefSeq Genes with Gene Names</strong></p>
 <p>
 A version of genePred that associates the gene name with the gene prediction information. In 
 alternative-splicing situations, each transcript has a row in this table.</p>
 <pre><code>table refFlat
 "A gene prediction with additional geneName field."
     (
     string  geneName;           "Name of gene as it appears in Genome Browser."
     string  name;               "Name of gene"
     string  chrom;              "Chromosome name"
     char[1] strand;             "+ or - for strand"
     uint    txStart;            "Transcription start position"
     uint    txEnd;              "Transcription end position"
     uint    cdsStart;           "Coding region start"
     uint    cdsEnd;             "Coding region end"
     uint    exonCount;          "Number of exons"
     uint[exonCount] exonStarts; "Exon start positions"
     uint[exonCount] exonEnds;   "Exon end positions"
     )
 </code></pre>
 
 <a name="format10"></a>
 <h2>Personal Genome SNP format</h2>
 <p>
 This format is for displaying SNPs from personal genomes. It is the same as is used for the Genome 
 Variants and Population Variants tracks.</p>
 <ol>
   <li>
   <strong>chrom</strong> - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold 
   (e.g.  scaffold10671).</li> 
   <li>
   <strong>chromStart</strong> - The starting position of the feature in the chromosome or scaffold. 
   The first base in a chromosome is numbered 0.</li>
   <li>
   <strong>chromEnd</strong> - The ending position of the feature in the chromosome or scaffold. The 
   <em>chromEnd</em> base is not included in the display of the feature. For example, the first 100 
   bases of a chromosome are defined as <em>chromStart=0, chromEnd=100</em>, and span the bases 
   numbered 0-99.</li>
   <li>
   <strong>name</strong> - The allele or alleles, consisting of one or more A, C, T, or G, optionally
   followed by one or more &quot;/&quot; and another allele (there can be more than 2 alleles). A 
   &quot;-&quot; can be used in place of a base to denote an insertion or deletion; if the position 
   given is zero bases wide, it is an insertion. The alleles are expected to be for the plus 
   strand.</li>
   <li>
   <strong>alleleCount</strong> - The number of alleles listed in the name field.</li>
   <li>
   <strong>alleleFreq</strong> - A comma-separated list of the frequency of each allele, given in the
   same order as the name field. If unknown, a list of zeroes (matching the alleleCount) should be 
   used.</li>
   <li>
   <strong>alleleScores</strong> - A comma-separated list of the quality score of each allele, given 
   in the same order as the name field. If unknown, a list of zeroes (matching the alleleCount) 
   should be used.</li>
 </ol>
 <p>
 In the Genome Browser, when viewing the forward strand of the reference genome (the normal case), 
 the displayed alleles are relative to the forward strand. When viewing the reverse strand of the 
 reference genome (via the &quot;<--&quot; or &quot;reverse&quot; button), the displayed alleles are 
 reverse-complemented to match the reverse strand. If the allele frequencies are given, the coloring 
 of the box will reflect the frequency for each allele.</p>
 <p>
 The details pages for this track type will automatically compute amino acid changes for coding SNPs 
 as well as give a chart of amino acid properties if there is a non-synonymous change. (The Sift and 
 PolyPhen predictions that are in some of the Genome Variants subtracks are not available.)</p>
 <p>
 <strong><em>Example:</em></strong><br>
 Here is an example of an annotation track in Personal Genome SNP format. The first SNP using a 
 &quot;-&quot; is an insertion; the second is a deletion. The last 4 SNPs are in a coding region.</p>
 <pre><code>track type=pgSnp visibility=3 db=hg19 name="pgSnp" description="Personal Genome SNP example"
 browser position chr21:31811924-31812937
 chr21	31812007	31812008	T/G	2	21,70	90,70
 chr21	31812031	31812032	T/G/A	3	9,60,7	80,80,30
 chr21	31812035	31812035	-/CGG	2	20,80	0,0
 chr21	31812088	31812093	-/CTCGG	2	30,70	0,0
 chr21	31812277	31812278	T	1	15	90
 chr21	31812771	31812772	A	1	36	80
 chr21	31812827	31812828	A/T	2	15,5	0,0
 chr21	31812879	31812880	C	1	0	0
 chr21	31812915	31812916	-	1	0	0
 </code></pre>
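 <p>
 The seven fields of a data line in this format can be checked with a sketch like the following.
 The function name <code>parse_pgsnp_line</code> is hypothetical, and the sketch splits on any
 white space rather than strictly on tabs.</p>

```python
def parse_pgsnp_line(line):
    """Parse one Personal Genome SNP data line and check its allele lists."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 fields")
    chrom, start, end, name, count, freqs, scores = fields
    alleles = name.split("/")
    allele_count = int(count)
    allele_freqs = [float(x) for x in freqs.split(",")]
    allele_scores = [float(x) for x in scores.split(",")]
    # alleleFreq and alleleScores must list one value per allele.
    if not (len(alleles) == len(allele_freqs) == len(allele_scores) == allele_count):
        raise ValueError("alleleCount does not match the allele lists")
    chrom_start, chrom_end = int(start), int(end)
    return {
        "chrom": chrom,
        "chromStart": chrom_start,
        "chromEnd": chrom_end,
        "alleles": alleles,
        "alleleFreq": allele_freqs,
        "alleleScores": allele_scores,
        # A "-" allele at a zero-width position denotes an insertion.
        "isInsertion": "-" in alleles and chrom_start == chrom_end,
    }
```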
 
 <a name="format11"></a>
 <h2>ENCODE RNA elements: BED6 + 3 scores format</h2>
 <p>
 <ol> 
   <li>
   <strong>chrom</strong> - Name of the chromosome (or contig, scaffold, etc.).</li>
   <li>
   <strong>chromStart</strong> - The starting position of the feature in the chromosome or scaffold. 
   The first base in a chromosome is numbered 0.</li>
   <li>
   <strong>chromEnd</strong> - The ending position of the feature in the chromosome or scaffold. The
   <em>chromEnd</em> base is not included in the display of the feature. For example, the first 100 
   bases of a chromosome are defined as <em>chromStart=0, chromEnd=100</em>, and span the bases 
   numbered 0-99.</li>
   <li>
   <strong>name</strong> - Name given to a region (preferably unique). Use &quot;.&quot; if no name 
   is assigned.</li>
   <li>
   <strong>score</strong> - Indicates how dark the peak will be displayed in the browser (0-1000). If
   all scores were &quot;0&quot; when the data were submitted to the DCC, the DCC assigned scores 
   1-1000 based on signal value. Ideally the average signalValue per base spread is between 
   100-1000.</li>
   <li>
   <strong>strand</strong> - +/- to denote strand or orientation (whenever applicable). Use 
   &quot;.&quot; if no orientation is assigned.</li>
   <li>
   <strong>level</strong> - Expression level, e.g. RPKM or FPKM.</li>
   <li>
   <strong>signif</strong> - Statistical significance, e.g. IDR.</li>
   <li>
   <strong>score2</strong> - Additional measurement/count, e.g. number of reads.</li>
 </ol>
 
 <a name="format12"></a>
 <h2>ENCODE narrowPeak: Narrow (or Point-Source) Peaks format</h2>
 <p>
 This format is used to provide called peaks of signal enrichment based on pooled, normalized 
 (interpreted) data. It is a BED6+4 format.</p>
 <ol>
   <li>
   <strong>chrom</strong> - Name of the chromosome (or contig, scaffold, etc.).</li>
   <li>
   <strong>chromStart</strong> - The starting position of the feature in the chromosome or scaffold. 
   The first base in a chromosome is numbered 0.</li>
   <li>
   <strong>chromEnd</strong> - The ending position of the feature in the chromosome or scaffold. The
   <em>chromEnd</em> base is not included in the display of the feature. For example, the first 100 
   bases of a chromosome are defined as <em>chromStart=0, chromEnd=100</em>, and span the bases 
   numbered 0-99.</li>
   <li>
   <strong>name</strong> - Name given to a region (preferably unique).  Use &quot;.&quot; if no name 
   is assigned.</li>
   <li>
   <strong>score</strong> - Indicates how dark the peak will be displayed in the browser (0-1000). If
   all scores were &quot;0&quot; when the data were submitted to the DCC, the DCC assigned scores 
   1-1000 based on signal value. Ideally the average signalValue per base spread is between 
   100-1000.</li>
   <li>
   <strong>strand</strong> - +/- to denote strand or orientation (whenever applicable). Use 
   &quot;.&quot; if no orientation is assigned.</li>
   <li>
   <strong>signalValue</strong> - Measurement of overall (usually, average) enrichment for the 
   region.</li>
   <li>
   <strong>pValue</strong> - Measurement of statistical significance (-log10). Use -1 if no pValue is
   assigned.</li>
   <li>
   <strong>qValue</strong> - Measurement of statistical significance using false discovery rate 
   (-log10).  Use -1 if no qValue is assigned.</li>
   <li>
   <strong>peak</strong> - Point-source called for this peak; 0-based offset from chromStart. Use -1 
   if no point-source called.</li>
 </ol>
 <p>
 Here is an example of narrowPeak format:</p>
 <pre><code>track type=narrowPeak visibility=3 db=hg19 name="nPk" description="ENCODE narrowPeak Example"
 browser position chr1:9356000-9365000
 chr1    9356548 9356648 .       0       .       182     5.0945  -1  50
 chr1    9358722 9358822 .       0       .       91      4.6052  -1  40
 chr1    9361082 9361182 .       0       .       182     9.2103  -1  75
 </code></pre>
 <p>There is also a format of narrowPeak called <a href="../goldenPath/help/bigNarrowPeak.html"
 target="_blank">bigNarrowPeak</a>, a version of <a href="../goldenPath/help/bigBed.html"
 target="_blank">bigBed</a>, which enables using this point-source display in Track Hubs.</p>
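 <p>
 As a minimal sketch of how the ten fields can be read, the Python snippet below parses one data 
 line from the narrowPeak example and derives the genomic position of the point-source as 
 <em>chromStart</em> + <em>peak</em>. The names <code>NarrowPeak</code>, 
 <code>parse_narrow_peak</code> and <code>summit_position</code> are hypothetical and are not part 
 of any Genome Browser tool; whitespace-separated input is assumed.</p>

```python
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    """One narrowPeak record; field order follows the field list above."""
    chrom: str
    chrom_start: int
    chrom_end: int
    name: str
    score: int
    strand: str
    signal_value: float
    p_value: float  # -log10 scale, -1 if unassigned
    q_value: float  # -log10 scale, -1 if unassigned
    peak: int       # 0-based offset from chrom_start, -1 if not called


def parse_narrow_peak(line):
    """Split a whitespace-separated narrowPeak line into typed fields."""
    f = line.split()
    if len(f) != 10:
        raise ValueError("narrowPeak lines have exactly 10 fields")
    return NarrowPeak(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                      float(f[6]), float(f[7]), float(f[8]), int(f[9]))


def summit_position(rec):
    """Return the 0-based genomic coordinate of the point-source, or None."""
    if rec.peak == -1:
        return None
    return rec.chrom_start + rec.peak


rec = parse_narrow_peak("chr1 9356548 9356648 . 0 . 182 5.0945 -1 50")
print(summit_position(rec))  # 9356598
```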
 
 <a name="format13"></a>
 <h2>ENCODE broadPeak: Broad Peaks (or Regions) format</h2> 
 <p>
 This format is used to provide called regions of signal enrichment based on pooled, normalized 
 (interpreted) data. It is a BED 6+3 format.</p>
 <ol>
   <li>
   <strong>chrom</strong> - Name of the chromosome (or contig, scaffold, etc.).</li>
   <li>
   <strong>chromStart</strong> - The starting position of the feature in the chromosome or scaffold. 
   The first base in a chromosome is numbered 0.</li>
   <li>
   <strong>chromEnd</strong> - The ending position of the feature in the chromosome or scaffold. The
   <em>chromEnd</em> base is not included in the display of the feature. For example, the first 100 
   bases of a chromosome are defined as <em>chromStart=0, chromEnd=100</em>, and span the bases 
   numbered 0-99.</li>
   <li>
   <strong>name</strong> - Name given to a region (preferably unique). Use &quot;.&quot; if no name 
   is assigned.</li>
   <li>
   <strong>score</strong> - Indicates how dark the peak will be displayed in the browser (0-1000). If
   all scores were &quot;0&quot; when the data were submitted to the DCC, the DCC assigned scores 
   1-1000 based on signal value. Ideally the average signalValue per base spread is between 
   100-1000.</li>
   <li>
   <strong>strand</strong> - +/- to denote strand or orientation (whenever applicable). Use
   &quot;.&quot; if no orientation is assigned.</li>
   <li>
   <strong>signalValue</strong> - Measurement of overall (usually, average) enrichment for the 
   region.</li>
   <li>
   <strong>pValue</strong> - Measurement of statistical significance (-log10). Use -1 if no pValue is
   assigned.</li>
   <li>
   <strong>qValue</strong> - Measurement of statistical significance using false discovery rate 
   (-log10).  Use -1 if no qValue is assigned.</li>
 </ol>
 <p>
 Here is an example of broadPeak format:</p>
 <pre><code>track type=broadPeak visibility=3 db=hg19 name="bPk" description="ENCODE broadPeak Example"
 browser position chr1:798200-800700
 chr1     798256 798454 .       116      .       4.89716 3.70716 -1
 chr1     799435 799507 .       103      .       2.46426 1.54117 -1
 chr1     800141 800596 .       107      .       3.22803 2.12614 -1
 </code></pre>
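 <p>
 Because <em>pValue</em> and <em>qValue</em> are stored on a -log10 scale, with -1 meaning 
 &quot;not assigned&quot;, a reader may want to convert them back to probabilities. The Python 
 sketch below does this for the first data line of the broadPeak example. The helper name 
 <code>neg_log10_to_probability</code> is hypothetical.</p>

```python
def neg_log10_to_probability(value):
    """Convert a -log10 pValue or qValue field to a probability.

    Returns None for -1, which the format uses to mean "not assigned".
    """
    if value == -1:
        return None
    return 10 ** (-value)


line = "chr1 798256 798454 . 116 . 4.89716 3.70716 -1"
f = line.split()
chrom_start, chrom_end = int(f[1]), int(f[2])
p_value = neg_log10_to_probability(float(f[7]))
q_value = neg_log10_to_probability(float(f[8]))

print(chrom_end - chrom_start)  # region length in bases: 198
print(p_value)                  # a probability close to 2e-4
print(q_value)                  # None, because the field is -1
```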
 
 <a name="format14"></a>
 <h2>ENCODE gappedPeak: Gapped Peaks (or Regions) format</h2>
 <p>
 This format is used to provide called regions of signal enrichment based on pooled, normalized 
 (interpreted) data where the regions may be spliced or incorporate gaps in the genomic sequence. 
 It is a BED12+3 format.</p>
 <ol>
   <li>
   <strong>chrom</strong> - Name of the chromosome (or contig, scaffold, etc.).</li>
   <li>
   <strong>chromStart</strong> - The starting position of the feature in the chromosome or scaffold. 
   The first base in a chromosome is numbered 0.</li>
   <li>
   <strong>chromEnd</strong> - The ending position of the feature in the chromosome or scaffold. The
   <em>chromEnd</em> base is not included in the display of the feature. For example, the first 100 
   bases of a chromosome are defined as <em>chromStart=0, chromEnd=100</em>, and span the bases 
   numbered 0-99.</li>
   <li>
   <strong>name</strong> - Name given to a region (preferably unique). Use &quot;.&quot; if no name 
   is assigned.</li>
   <li>
   <strong>score</strong> - Indicates how dark the peak will be displayed in the browser (0-1000). If
   all scores were &quot;0&quot; when the data were submitted to the DCC, the DCC assigned scores 
   1-1000 based on signal value. Ideally the average signalValue per base spread is between 
   100-1000.</li>
   <li>
   <strong>strand</strong> - +/- to denote strand or orientation (whenever applicable). Use 
   &quot;.&quot; if no orientation is assigned.</li>
   <li>
   <strong>thickStart</strong> - The starting position at which the feature is drawn thickly. Not 
   used in gappedPeak type, set to 0.</li>
   <li>
   <strong>thickEnd</strong> - The ending position at which the feature is drawn thickly. Not used in
   gappedPeak type, set to 0.</li>
   <li>
   <strong>itemRgb</strong> - An RGB value of the form R,G,B (e.g. 255,0,0). Not used in gappedPeak 
   type, set to 0.</li>
   <li>
   <strong>blockCount</strong> - The number of blocks (exons) in the BED line.</li>
   <li>
   <strong>blockSizes</strong> - A comma-separated list of the block sizes. The number of items in 
   this list should correspond to <em>blockCount</em>.</li>
   <li>
   <strong>blockStarts</strong> - A comma-separated list of block starts. The first value must be 0 
   and all of the <em>blockStart</em> positions should be calculated relative to <em>chromStart</em>.
   The number of items in this list should correspond to <em>blockCount</em>.</li>
   <li>
   <strong>signalValue</strong> - Measurement of overall (usually, average) enrichment for the 
   region.</li>
   <li>
   <strong>pValue</strong> - Measurement of statistical significance (-log10). Use -1 if no pValue is
   assigned.</li>
   <li>
   <strong>qValue</strong> - Measurement of statistical significance using false discovery rate 
   (-log10).  Use -1 if no qValue is assigned.</li>
 </ol>
 <p>
 Here is an example of gappedPeak format:</p>
 <pre><code>track name=gappedPeakExample type=gappedPeak
 chr1 171000 171600 Anon_peak_1 55 . 0 0 0 2 400,100 0,500 4.04761 7.53255 5.52807
 </code></pre>
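 <p>
 The relationship between <em>chromStart</em>, <em>blockSizes</em> and <em>blockStarts</em> can be 
 illustrated with a short Python sketch that turns the gappedPeak example line into absolute, 
 end-exclusive block intervals. The function name <code>block_intervals</code> is hypothetical, and 
 the checks shown cover only the rules stated in the field list above.</p>

```python
def block_intervals(chrom_start, block_count, block_sizes, block_starts):
    """Return a list of (start, end) genomic intervals, one per block.

    block_sizes and block_starts are the comma-separated strings from the
    file; block starts are offsets relative to chrom_start.
    """
    sizes = [int(s) for s in block_sizes.rstrip(",").split(",")]
    starts = [int(s) for s in block_starts.rstrip(",").split(",")]
    if len(sizes) != block_count or len(starts) != block_count:
        raise ValueError("list lengths must match blockCount")
    if starts[0] != 0:
        raise ValueError("the first blockStart must be 0")
    return [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]


line = "chr1 171000 171600 Anon_peak_1 55 . 0 0 0 2 400,100 0,500 4.04761 7.53255 5.52807"
f = line.split()
blocks = block_intervals(int(f[1]), int(f[9]), f[10], f[11])
print(blocks)  # [(171000, 171400), (171500, 171600)]
```

 <p>
 In this example the last block ends at 171600, which equals <em>chromEnd</em>, and the 100 bases 
 between 171400 and 171500 form the gap.</p>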
 
 <a name="format15"></a>
 <h2>ENCODE tagAlign: BED3+3 format (historical)</h2>
 <p>tagAlign was used in hg18, but not in subsequent assemblies. Tag Alignment provided 
 genomic mapping of short sequence tags. It is a BED3+3 format.</p>
 <ol>
   <li>
   <strong>chrom</strong> - Name of the chromosome.</li>
   <li>
   <strong>chromStart</strong> - The starting position of the feature in the chromosome. The first 
   base in a chromosome is numbered 0.</li>
   <li>
   <strong>chromEnd</strong> - The ending position of the feature in the chromosome or scaffold. The 
   chromEnd base is not included in the display of the feature. For example, the first 100 bases of a
   chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.</li>
   <li>
   <strong>sequence</strong> - Sequence of this read.</li>
   <li>
   <strong>score</strong> - Indicates uniqueness or quality (preferably 1000/alignmentCount).</li>
   <li>
   <strong>strand</strong> - Orientation of this read (+ or -).</li>
 </ol>
 <p>
 Here is an example of tagAlign format: </p>
 <pre><code>chrX 8823384 8823409 AGAAGGAAAATGATGTGAAGACATA 1000 +
 chrX 8823387 8823412 TCTTATGTCTTCACATCATTTTCCT 500  -
 </code></pre>
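 <p>
 As a rough sketch of how such a line could be assembled, the Python snippet below builds a tagAlign 
 record from a read and its alignment count. It assumes an ungapped alignment, so that 
 <em>chromEnd</em> is <em>chromStart</em> plus the read length, and it rounds 1000/alignmentCount 
 down to an integer; both choices are assumptions rather than requirements of the format. The 
 function names are hypothetical.</p>

```python
def tag_align_score(alignment_count):
    """Score suggested by the format: 1000 divided by the alignment count."""
    if alignment_count < 1:
        raise ValueError("alignment_count must be at least 1")
    return 1000 // alignment_count  # assumption: round down to an integer


def format_tag_align(chrom, chrom_start, sequence, strand, alignment_count):
    """Build one tab-separated tagAlign line for an ungapped read."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be + or -")
    # Assumption: ungapped alignment, so the end-exclusive chromEnd is
    # chromStart plus the read length.
    chrom_end = chrom_start + len(sequence)
    fields = [chrom, chrom_start, chrom_end, sequence,
              tag_align_score(alignment_count), strand]
    return "\t".join(str(x) for x in fields)


print(format_tag_align("chrX", 8823384, "AGAAGGAAAATGATGTGAAGACATA", "+", 1))
print(format_tag_align("chrX", 8823387, "TCTTATGTCTTCACATCATTTTCCT", "-", 2))
```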
 
 <a name="format16"></a>
 <h2>ENCODE pairedTagAlign: BED6+2 format (historical)</h2> 
 <p>
 pairedTagAlign was used in hg18, but not in subsequent assemblies. Tag Alignment Format 
 for Paired Reads was used to provide genomic mapping of paired-read short sequence tags. It is a 
 BED6+2 format. </p>
 <ol>
   <li>
   <strong>chrom</strong> - Name of the chromosome.</li>
   <li>
   <strong>chromStart</strong> - The starting position of the feature in the chromosome. The first 
   base in a chromosome is numbered 0.</li>
   <li>
   <strong>chromEnd</strong> - The ending position of the feature in the chromosome or scaffold. The 
   chromEnd base is not included in the display of the feature. For example, the first 100 bases of a
   chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.</li>
   <li>
   <strong>name</strong> - Identifier of paired-read.</li>
   <li>
   <strong>score</strong> - Indicates uniqueness or quality (preferably 1000/alignment-count).</li>
   <li>
   <strong>strand</strong> - Orientation of this read (+ or -).</li>
   <li>
   <strong>seq1</strong> - Sequence of first read.</li>
   <li>
   <strong>seq2</strong> - Sequence of second read.</li>
 </ol>
 
 <a name="format17"></a>
 <h2>ENCODE peptideMapping: BED6+4 format</h2> 
 <p>
 The peptide mapping format was used to provide proteogenomic mappings of peptides to the genome, 
 with information that is appropriate for assessing the confidence of the mapping. It is a BED6+4 
 format.</p>
 <ol>
   <li>
   <strong>chrom</strong> - Name of the chromosome.</li>
   <li>
   <strong>chromStart</strong> - The starting position of the feature in the chromosome. The first 
   base in a chromosome is numbered 0.</li>
   <li>
   <strong>chromEnd</strong> - The ending position of the feature in the chromosome or scaffold. The 
   chromEnd base is not included in the display of the feature. For example, the first 100 bases of a
   chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.</li>
   <li>
   <strong>name</strong> - The peptide sequence.</li>
   <li>
   <strong>score</strong> - Indicates uniqueness or quality (preferably 1000/alignment-count).</li>
   <li>
   <strong>strand</strong> - Orientation of this read (+ or -).</li>
   <li>
   <strong>rawScore</strong> - Raw score for this hit, as estimated through HMM analysis.</li>
   <li>
   <strong>spectrumId</strong> - Non-unique identifier for the spectrum file.</li>
   <li>
   <strong>peptideRank</strong> - Rank of this hit, for peptides with multiple genomic hits.</li>
   <li>
   <strong>peptideRepeatCount</strong> - Indicates how many times this same hit was observed.</li>
 </ol>
 
 <!--#include virtual="$ROOT/inc/gbPageEnd.html" -->