src/hg/htdocs/FAQ/FAQblat.html 62553262014b67dd654402b70b90fc76bb3f3aef

62553262014b67dd654402b70b90fc76bb3f3aef
dschmelt
  Tue Jul 30 11:11:51 2019 -0700
Adding note about U nucleotides on stand-alone blat, fixing tags #23890

diff --git src/hg/htdocs/FAQ/FAQblat.html src/hg/htdocs/FAQ/FAQblat.html
index 94fad92..8971b7b 100755
--- src/hg/htdocs/FAQ/FAQblat.html
+++ src/hg/htdocs/FAQ/FAQblat.html
@@ -1,374 +1,381 @@
 <!DOCTYPE html>
 <!--#set var="TITLE" value="Genome Browser FAQ" -->
 <!--#set var="ROOT" value=".." -->
 
 <!-- Relative paths to support mirror sites with non-standard GB docs install -->
 <!--#include virtual="$ROOT/inc/gbPageStart.html" -->
-
+<body>
 <h1>Frequently Asked Questions: BLAT</h1>
 
 <h2>Topics</h2>
 
 <ul>
 <li><a href="#blat1">BLAT vs. BLAST</a></li>
 <li><a href="#blat1b">Blat cannot find a sequence at all or not all expected matches</a></li>
 <li><a href="#blat2">BLAT use restrictions</a></li>
 <li><a href="#blat3">Downloading Blat source and documentation</a></li>
 <li><a href="#blat5">Replicating web-based Blat parameters in command-line version</a></li>
-<li><a href="#blat6">Using the <em>-ooc</em> flag</strong></a></li>
+<li><a href="#blat6">Using the <em>-ooc</em> flag</a></li>
 <li><a href="#blat4">Replicating web-based Blat percent identity and score calculations</a></li>
 <li><a href="#blat7">Replicating web-based Blat &quot;I'm feeling lucky&quot; search 
 results</a></li>
 <li><a href="#blat8">Using Blat for short sequences with maximum sensitivity</a></li>
 <li><a href="#blat9">Blat ALL genomes</a></li>
 <li><a href="#blat10">Blat ALL genomes: No matches found</a></li>
 
 </ul>
 <hr>
 <p>
 <a href="index.html">Return to FAQ Table of Contents</a></p>
 
 <a name="blat1"></a>
 <h2>BLAT vs. BLAST</h2>
 <h6>What are the differences between BLAT and BLAST?</h6>
 <p>
 BLAT is an alignment tool like BLAST, but it is structured differently. On DNA, BLAT works by 
 keeping an index of an entire genome in memory. Thus, the target database of BLAT is not a set of 
 GenBank sequences, but instead an index derived from the assembly of the entire genome. By default,
 the index consists of all non-overlapping 11-mers except for those heavily involved in repeats, and 
 it uses less than a gigabyte of RAM. This smaller size means that BLAT is far more easily 
 <a href="../goldenPath/help/mirror.html">mirrored</a> than BLAST. Blat of DNA is designed to quickly find 
 sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or 
 shorter sequence alignments. (The default settings and expected behavior of standalone Blat are 
 slightly different from those on the <a href="../cgi-bin/hgBlat">graphical version of BLAT</a>.)</p>
 <p>
 On proteins, BLAT uses 4-mers rather than 11-mers, finding protein sequences of 80% and greater 
 similarity to the query of length 20+ amino acids. The protein index requires slightly more than 
 2 gigabytes of RAM. In practice -- due to sequence divergence rates over evolutionary time -- DNA 
 Blat works well within humans and primates, while protein Blat continues to find good matches within
 terrestrial vertebrates and even earlier organisms for conserved proteins. Within humans, protein 
 Blat gives a much better picture of gene families (paralogs) than DNA Blat. However, BLAST and 
 psi-BLAST at NCBI can find much more remote matches.</p>
 <p>
 From a practical standpoint, BLAT has several advantages over BLAST:</p> 
 <ul>
   <li>
   speed (no queues, response in seconds) at the price of lesser homology depth</li>
   <li>
   the ability to submit a long list of simultaneous queries in fasta format</li>
   <li>
   five convenient output sort options</li>
   <li>
   a direct link into the UCSC browser</li>
   <li>
   alignment block details in natural genomic order</li>
   <li>
   an option to launch the alignment later as part of a custom track</li>
 </ul>
 <p>
 BLAT is commonly used to look up the location of a sequence in the genome or determine the exon 
 structure of an mRNA, but expert users can run large batch jobs and make internal parameter 
 sensitivity changes by installing command line Blat on their own Linux server.</p>
 
 <a name="blat1b"></a>
 <h2>Blat can't find a sequence or not all expected matches</h2>
 <h6>I can't find a sequence with Blat although I'm sure it is in the genome. Am I doing 
 something wrong?</h6>
 <p>
 First, check if you are using the correct version of the genome. For example, two versions of the 
 human genome are currently in wide use (hg19 and hg38) and your sequence may be only in one of
-them. Many published articles do not specify the assembly version so trying a few may be necessary.</p>
+them. Many published articles do not specify the assembly version so trying both may be necessary.</p>
 <p>
 Very short sequences that go over a splice site in a cDNA sequence can't be found, as they are not 
 in the genome. qPCR primers are a typical example. For these cases, try using 
 <a href="../cgi-bin/hgPcr">In-Silico PCR</a> and selecting a gene set as the target. In general, 
 the In-Silico PCR tool is more sensitive and should be preferred for pairs of primers.</p>
 <p>
 If you have verified that you are using the correct genome and that the sequence is indeed there, 
-for example by using the "Short match" track, the problem may be a result of BLAT's query-masking. 
+for example by using the <a href="../cgi-bin/hgTrackUi?db=hg38&g=oligoMatch">"Short match" track
+</a>, the problem may be a result of BLAT's query-masking. 
 This happens if your input sequence is part of a repeat and present thousands of times in the genome.
 The online version of Blat masks 11mers from the query that occur more than 1024 times in the 
 genome. This is done to improve speed, but may result in missed hits when you are searching for 
 sequences in repeats.</p>
 <p>
 If your input sequence is not one of the very repetitive sequences, but still
 present a few dozen times on a chromosome, note that Blat results are limited
 to 16 results per chromosome strand. This means that at most 32 locations
 per chromosome are returned.
 </p>
 <p>
 To find all matches for repetitive sequences with the online version of Blat, you can add more flanking sequence to your 
 query. If this is not possible, the only alternative is to download the executables of Blat and the 
 .2bit file of a genome to your own machine and use BLAT on the command line. See 
 <a href="#blat3">Downloading BLAT source and documentation</a> for more information.</p>
 
 <a name="blat2"></a>
 <h2>BLAT usage restrictions</h2>
 <h6>I received a warning from your Blat server informing me that I had exceeded the server use 
 limitations. Can you give me information on the UCSC Blat server use parameters?</h6>
 <p>
 Due to the high demand on our Blat servers, we restrict service for users who programmatically
 query the BLAT tool or do large batch queries. Program-driven use of BLAT is limited to a maximum
 of one hit every 15 seconds and no more than 5,000 hits per day. Please limit batch queries to 25 
 sequences or less.</p> 
 <p>
 For users with high-volume Blat demands, we recommend downloading the BLAT tool for local use. For more 
 information, see <a href="#blat3">Downloading BLAT source and documentation</a>.</p>
 
 <a name="blat3"></a>
 <h2>Downloading Blat source and documentation</h2>
 <h6>Is the Blat source available for download? Is documentation available?</h6>
 <p>
 Blat source and executables are freely available for academic, nonprofit and personal use. 
 Commercial licensing information is available on the <a href="http://www.kentinformatics.com" 
 target="_blank">Kent Informatics website</a>.</p>
 <p>
 Blat source may be downloaded from 
 <a href="http://hgdownload.soe.ucsc.edu/admin/">http://hgdownload.soe.ucsc.edu/admin/</a> (located 
 at /kent/src/blat within the most recent jksrci*.zip source tree). For Blat executables, go to 
 <a href="http://hgdownload.soe.ucsc.edu/admin/exe/">http://hgdownload.soe.ucsc.edu/admin/exe/</a> 
 and choose your machine type.</p>
 <p>
 Documentation on Blat program specifications is available 
-<a href="../goldenPath/help/blatSpec.html">here</a>.</p>
+<a href="../goldenPath/help/blatSpec.html">here</a>. Note that the command-line BLAT 
+does not return matches to U nucleotides in the query sequence.</p>
 
 <a name="blat5"></a>
 <h2>Replicating web-based Blat parameters in command-line version</h2>
 <h6>I'm setting up my own Blat server and would like to use the same parameter values that the 
 UCSC web-based Blat server uses.</h6> 
 <p>
-We almost always expect there to be some small differences between the hgBlat/gfServer and the 
-stand-alone command-line Blat. The best matches can be found using pslReps and pslCDnaFilter 
+We almost always expect small differences between the hgBlat/gfServer and the 
+stand-alone, command-line Blat. The best matches can be found using pslReps and pslCDnaFilter 
 utilities. The web-based Blat is tuned permissively with a minimum cut-off score of 20, which will 
-display most of the alignments. Other than to confirm that your command-line Blat is working, there 
-is little use in perfectly replicating the web-based Blat results. We advise deciding which 
+display most of the alignments. We advise deciding which 
 filtering parameters make the most sense for the experiment or analysis. Often these settings will 
 be different and more stringent than those of the web-based Blat. With that in mind, use the 
 following settings to replicate the search results of the web-based Blat:</p>
 <p>
+<em>standalone Blat</em>:</p>
+<ul>
+  <li>Blat search:<br>
+  &nbsp;&nbsp;&nbsp;<code>blat -stepSize=5 -repMatch=2253 -minScore=20 -minIdentity=0
+  database.2bit query.fa output.psl </code><br></li>
+  <li><strong>Note:</strong> To replicate web results, PSL output should be used. BLAT handles
+  alternative output formats (such as blast8) slightly differently, and this can lead to minor
+  differences in results; particularly for short alignments. Furthermore, the query sequence
+  should have all U nucleotides converted to T nucleotides or have the "-q=rna" flag used
+  to match the web-Blat.</li>
+</ul>
+<p>
 <em>faToTwoBit</em>:</p>
 <ul>
-  <li>Use soft masking.</li>
+  <li>Uses soft masking to converst Fasta format to the 2bit format for Blat input.</li>
 </ul>
 <p>
 <em>gfServer</em> (this is how the UCSC web-based Blat servers are configured):
 <ul>
   <li>
   Blat server (capable of PCR):<br>
   &nbsp;&nbsp;&nbsp;<code>gfServer start blatMachine portX -stepSize=5 -log=untrans.log 
   database.2bit</code></li>
   <li>
   translated Blat server:<br>
   &nbsp;&nbsp;&nbsp;<code>gfServer start blatMachine portY -trans -mask -log=trans.log 
   database.2bit</code> </li>
 </ul>
 <p>For enabling DNA/DNA and DNA/RNA matches, only the host, port and twoBit files are needed. The 
 same port is used for both untranslated BLAT (gfClient) and PCR (webPcr). You'll need a separate 
 Blat server on a separate port to enable translated Blat (protein searches or translated searches 
 in protein-space).</p>
 <p>
 <em>gfClient</em>:</p>
 <ul>
-  <Li>Set <em>-minScore=0</em> and <em>-minIdentity=0</em>. This will result in some low-scoring, 
+  <li>Set <em>-minScore=0</em> and <em>-minIdentity=0</em>. This will result in some low-scoring, 
   generally spurious hits, but for interactive use it's sufficiently easy to ignore them (because 
   results are sorted by score) and sometimes the low-scoring hits come in handy.</li>
 </ul>
 <p>
-<em>standalone Blat</em>:</p>
-<ul>
-  <li>Blat search:<br>
-  &nbsp;&nbsp;&nbsp;<code>blat -stepSize=5 -repMatch=2253 -minScore=0 -minIdentity=0 
-  database.2bit query.fa output.psl </code><br>
-  <li><strong>Note:</strong> To replicate web results, PSL output should be used. BLAT handles
-  alternative output formats (such as blast8) slightly differently, and this can lead to minor 
-  differences in results; particularly for short alignments.
-</ul>
-<p>
 Notes on repMatch:</p>
 <ul>
   <li>
   The default setting for gfServer dna matches is: repMatch = 1024 * (tileSize/stepSize).</li>
   <li>
   The default setting for Blat dna matches is: repMatch = 1024 (if tileSize=11).</li>
-  <Li>To get command-line results that are equivalent to web-based results, repMatch must be 
+  <li>To get command-line results that are equivalent to web-based results, repMatch must be 
   specified when using BLAT.</li>
 </ul>
 <p>
 For more information about how to replicate the score and percent identity matches displayed by our 
 web-based Blat, please see this <a href="../FAQ/FAQblat.html#blat4">BLAT FAQ</a>.</p>
 <p>
 For more information on the parameters available for BLAT, gfServer, and gfClient, see the 
 <a href="../goldenPath/help/blatSpec.html">BLAT specifications</a>.</p>
 
 <a name="blat6"></a>
 <h2>Using the <em>-ooc</em> flag</h2>
 <h6>What does the <em>-ooc</em> flag do?</h6>
 <p>
 Using any <em>-ooc</em> option in Blat, such as <em>-ooc=11.ooc</em>, speeds up searches similar to 
 repeat-masking sequence. The <em>11.ooc</em> file contains sequences determined to be 
 over-represented in the genome sequence. To improve search speed, these sequences are not used when 
 seeding an alignment against the genome. For reasonably sized sequences, this will not create a 
 problem and will significantly reduce processing time.</p>
 <p>
 By not using the <em>11.ooc</em> file, you will increase alignment time, but will also slightly 
 increase sensitivity. This may be important if you are aligning shorter sequences or sequences of 
 poor quality. For example, if a particular sequence consists primarily of sequences in the 
 <em>11.ooc</em> file, it will never be seeded correctly for an alignment if the <em>-ooc</em> flag 
 is used.</p>
 <p>
 In summary, if you are not finding certain sequences and can afford the extra processing time, you 
 may want to run Blat without the <em>11.ooc</em> file if your particular situation warrants its 
 use.</p>
 
 <a name="blat4"></a>
 <h2>Replicating web-based Blat percent identity and score calculations</h2>
 <h6>Using my own command-line Blat server, how can I replicate the percent identity and score 
 calculations produced by web-based Blat?</h6>
 <p>
 There is no option to command-line Blat that gives you the percent ID and the score. However, we 
 have created scripts that include the calculations: </p>
 <ul>
   <li>
   View the perl script from the source tree: 
-  <a href="http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/utils/pslScore/pslScore.pl"><code>pslScore.pl</code></a><li>
+  <a href="http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/utils/pslScore/pslScore.pl">
+  <code>pslScore.pl</code></a></li>
   <li>
   View the corresponding C program:
-  <a href="http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/utils/pslScore/pslScore.c"><code>pslScore.c</code></a>
+  <a href="http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/utils/pslScore/pslScore.c">
+  <code>pslScore.c</code></a>
   and associated library functions <code>pslScore</code> and <code>pslCalcMilliBad</code> in   
-  <a href="http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/lib/psl.c"><code>psl.c</code></a></li>
+  <a href="http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/lib/psl.c">
+  <code>psl.c</code></a></li>
 </ul>
 <p>
 See our <a href="FAQlicense.html">FAQ</a> on source code licensing and downloads for information on 
 obtaining the source.</p>
 
 <a name="blat7"></a>
 <h2>Replicating web-based Blat &quot;I'm feeling lucky&quot; search results</h2>
 <h6>How do I generate the same search results as web-based Blat's &quot;I'm feeling lucky&quot; 
-option using command-line Blat?</strong> </h6>
+option using command-line Blat?</h6>
 <p>
 The code for the &quot;I'm feeling lucky&quot; Blat search orders the results based on the sort 
 output option that you selected on the query page. It then returns the highest-scoring alignment of 
 the first query sequence.</p>
 <p>
 If you are sorting results by &quot;query, start&quot; or &quot;chrom, start&quot;, generating the 
 &quot;I'm feeling lucky&quot; result is straightforward: sort the output file by these columns, then
 select the top result.</p>
 <p>
 To replicate any of the sort options involving score, you first must calculate the score for each 
 result in your PSL output file, then sort the results by score or other combination (<em>e.g.</em> 
 &quot;query, score&quot; and &quot;chrom, score&quot;). See the section on 
 <a href="#blat4">Replicating web-based Blat percent identity and score calculations</a> for 
 information on calculating the score.</p>
 <p>
 Alternatively, you can try filtering your Blat PSL output using either the
 <code>pslReps</code> or 
 <code>pslCDnaFilter</code> program available in the Genome Browser source code. For information on 
 obtaining the source code, see our <a href="FAQlicense.html">FAQ</a> on source code licensing and 
 downloads.</p>
 
 <a name="blat8"></a>
 <h2>Using Blat for short sequences with maximum sensitivity</h2>
 <h6>How do I configure Blat for short sequences with maximum sensitivity?</h6> 
 <p>
 Here are some guidelines for configuring standalone Blat and gfServer/gfClient for these 
 conditions:</p>
 <ul>
   <li>
   The formula to find the shortest query size that will guarantee a match (if matching tiles are not
   marked as overused) is: 2 * <em>stepSize</em> + <em>tileSize</em> - 1 <br>
   For example, with <em>stepSize</em> set to 5 and <em>tileSize</em> set to 11, matches of query 
   size 2 * 5 + 11 - 1 = 20 bp will be found if the query matches the target exactly.<br>
   The <em>stepSize</em> parameter can range from 1 to <em>tileSize</em>.<br>
   The <em>tileSize</em> parameter can range from 6 to 15. For protein, the range starts lower.<br>
   For <em>minMatch</em>=1 (e.g., protein), the minimum guaranteed match length is: 1 * 
   <em>stepSize</em> + <em>tileSize</em> - 1<br>
   Note: There is also a &quot;minimum lucky size&quot; for hits. This is the smallest possible hit 
   that Blat can find. This minimum lucky size can be calculated using the formula: <em>stepSize</em>
   + <em>tileSize</em>. For example, if we use a <em>tileSize</em> of 11 and <em>stepSize</em> of 5, 
   hits smaller than 16 bases won't be reported.</li>
   <li>
   Try using <em>-fine</em>.</li>
   <li>
   Use a large value for <em>repMatch</em> (e.g. <em>-repMatch</em> = 1000000) to reduce the chance 
   of a tile being marked as over-used.</li>
   <li>
   Do not use an <em>.ooc</em> file.</li>
   <li>
   Do not use <em>-fastMap</em>.</li>
   <li>
   Do not use masking command-line options.</li>
 </ul>
 <p>
 The above changes will make Blat more sensitive, but will also slow the speed and increase the 
 memory usage. It may be necessary to process one chromosome at a time to reduce the memory 
 requirements.</p>
 <p>
 A note on filtering output: increasing the <em>-minScore</em> parameter value beyond one-half of 
 the query size has no further effect.  Therefore, use either the <code>pslReps</code> or 
 <code>pslCDnaFilter</code> program available in the Genome Browser source code to filter for the size, 
 score, coverage, or quality desired. For information on obtaining the source code, see our 
 <a href="FAQlicense.html">FAQ</a> on source code licensing and downloads.</p>
 
 <a name="blat9"></a>
 <h2>Blat ALL genomes</h2>
 <h6>How do I Blat queries for the default genome assemblies of all organisms?</h6>
 <p>
 BLAT is designed to quickly find sequence similarity between query and target sequences. 
 Generally, Blat is used to find locations of sequence homology in a single target genome or determine
  the exon structure of an mRNA. Blat also allows users to compare the query sequence against all of 
 the default assemblies for organisms hosted on the UCSC Genome Browser. The <em>Search ALL</em> feature
 may be useful if you have an ambiguous query sequence and are trying to determine what organism it may
 belong to.
 </p>
 <p>
 Selecting the "Search ALL" 
 checkbox above the Genome drop-down list allows you to search the genomes
 of the default assemblies for all of our organisms. It also searches any attached hubs' 
 Blat servers, meaning you can search your user-generated assembly hubs. The results page displays an ordered list 
 of all our organisms and their homology with your query sequence. The results are ordered 
 so that the organism with the best alignment score is at the top, indicating which region(s) 
 of that organism has the greatest homology with your query sequence.
 The entire alignment, including mismatches and gaps, must <a href="../FAQ/FAQblat.html#blat4">score</a>  
 20 or higher in order to appear in the Blat output. By clicking into a link in the <em>Assembly list</em> 
 you will be taken to a new page displaying various locations and scores of sequence homology in the assembly of interest.
 </p>
 
 <a name="blat10"></a>
 <h2>Blat ALL genomes: No matches found</h2>
 <h6>My Blat ALL results display assemblies with hits, but clicking into them reports 
 no matches</h6>
 
 <p>
 In the Blat All results page, the "Hits" column does not represent alignments, instead it reports 
 tile hits. Tile hits are 11 base kmer matches found in the target, which do not necessarily 
 represent successful alignments. When one clicks the 'Assembly' link a full BLAT alignment for 
 that genome will occur and any alignment scores representing less than a 20 bp result will 
 come back as no matches found.</p>
 
 <p>
 When you BLAT a sequence, the server reads the target (genome) and builds an index in memory of 
 all the 11-mer locations. These 11-mers &quot;tile&quot; the sequence as such:
 
 <pre>
 ACTGACTGACT
  CTGACTGACTT
   TGACTGACTTA
 </pre></p>
 
 <p>
 After the index is built, the first step of alignment is to read the query (search) sequence, 
 extract all the 11-mers, and look those up in the genome 11-mer index currently in memory. 
 Matches found there represent the initial &quot;hits&quot; you see in the Blat All results page. 
 The next step is to look for hits that overlap or fall within a certain distance of each other, 
 and attempt to align the sequences between the hit locations in target and query.</p>
 
 <p>
 For example, if two 11-base tile hits align perfectly, it would result in a score of 22. This is 
 above the minimum required score of 20 (see <a href="#blat9">BLAT ALL genomes</a>), and would be 
 reported as an alignment. However, there are penalties for gaps and mismatches, as well as potential 
 overlap (see stepsize in <a href="../goldenPath/help/blatSpec.html">BLAT specifications</a>), all 
 of which could bring the score below 20. In that case, BLAT All would report 2 &quot;hits&quot;, 
 but clicking into the assembly would report no matches. This most often occurs when there are 
 only a few (1-3) hits reported by BLAT All.</p>
 
 <!--#include virtual="$ROOT/inc/gbPageEnd.html" -->
+</body>