src/hg/htdocs/FAQ/FAQgenes.html 06d7be056190c14b85e71bc12523f18ea6815b5e

06d7be056190c14b85e71bc12523f18ea6815b5e
markd
  Mon Dec 7 00:50:29 2020 -0800
BLAT mmap index support merge with master

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index dea3073..e924873 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -298,56 +298,64 @@
 A more comprehensive definition can also be found in the <a target=_blank 
 href="https://uswest.ensembl.org/info/genome/genebuild/transcript_quality_tags.html#basic">
 Ensembl FAQ</a>. By default, the track displays only the &quot;basic&quot; set. In order to 
 display the complete 
 &quot;comprehensive&quot; set, the box can be ticked at the top of the <a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE track description page</a>.</p>
 
 <p class='text-center'>
   <img class='text-center' src="../images/ComprehensiveSet.png" 
 alt="Turning on comprehensive gene set" width="750">
 
 <a name="ncbiRefseq"></a>
 <h6>What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</h6>
 <p>
 RefSeq gene transcripts, unlike GENCODE/Ensembl/UCSC Genes, are sequences that can differ from 
-the genome. They need to be aligned to the genome to create transcript models.
-Traditionally, UCSC has aligned RefSeq with BLAT (UCSC RefSeq sub-track) and NCBI has aligned with 
-splign. The advantages of the UCSC alignments are that
-they are updated more frequently and are available for older assemblies (like
-GRCh37/hg19), but they are not placed manually to a chromosome location and are not the official alignments.
-Therefore, we recommend working with the NCBI annotations.
-When an assembly has an &quot;NCBI RefSeq&quot; track, we show it by default and hide the
+the genome. They need to be aligned to the genome to create annotations and UCSC
+and NCBI create alignments with different software (BLAT and splign, respectively).
+The advantages of the UCSC alignments are that
+they are updated constantly even for older assemblies, like GRCh37/hg19.
+The advantages of the NCBI alignments are that they are placed manually 
+to a chromosome location and are the official alignments, e.g. for databases and manuscripts.
+Therefore, we recommend working with the NCBI annotations. When an assembly has an &quot;NCBI RefSeq&quot; track, we show it by default and hide the
 &quot;UCSC RefSeq&quot; track.
 </p>
 
-<p>
-NCBI transcripts are manually tied to a chromosome band or location. The advantage is that
-when there are two almost-identical transcripts in RefSeq, each one will be
-placed at the official reference location in the NCBI annotations. For example, 
+<p>The UCSC alignments can differ from the NCBI alignments for two reasons:</p>
+
+<p><b>Very similar transcripts:</b> 
+Let's take the case of two almost-identical transcript sequences in RefSeq,
+with two genes in the genome where they could be placed.
+NCBI has a rule to place every transcript only once, and transcripts
+are manually tied to a chromosome band or location by NCBI, so each gene will get one
+and only one transcript of the two. NCBI RefSeq will have two genes with one transcript each.
+UCSC RefSeq, though, places all
+transcripts where they align at very high identity, so both genes will get
+annotated with both transcripts. For example, 
 the transcript NM_001012276 has three almost-identical possible
-placements to the genome in the UCSC RefSeq track (as it is entirely alignment-based),
+placements to the genome in the UCSC RefSeq track, as it is entirely alignment-based without any manual filtering,
 but NM_001012276.3 is shown at a single location in the NCBI RefSeq track, as the NCBI
-software will only retain the splign alignment at the manually annotated location. It
+software will only retain the alignment at the manually annotated location. It
 may be good to know about almost-identical alignments when doing genomic
 analysis or manual inspection of NGS read alignments, but for clinical
 reporting purposes or other automated analyses, we strongly recommend using
 the NCBI RefSeq track.
 </p>
 
 <p>
-In some rare cases, the NCBI and UCSC exon boundaries differ.
+<b>Unclear exon boundaries:</b> In some rare cases, the NCBI and UCSC exon boundaries differ.
+This happens especially when sequence deletions in the genome make the placement very difficult.
 Activating both RefSeq and UCSC RefSeq tracks helps you investigate the differences.
 Activating the RefSeq Alignments track shows NCBI's splign alignments in more detail,
 including double lines where both transcript and genomic sequence are skipped in the alignment.
 When available, the RefSeq Diffs subtrack may be helpful too. The upcoming <a target=_blank 
 href=https://ncbiinsights.ncbi.nlm.nih.gov/2018/10/11/matched-annotation-by-ncbi-and-embl-ebi-mane-a-new-joint-venture-to-define-a-set-of-representative-transcripts-for-human-protein-coding-genes/>MANE gene set</a> 
 will contain a set of high-quality transcripts that are 100%
 alignable to the genome and are part of both RefSeq and Ensembl/GENCODE but
 at the time of writing this project is at an early stage.
 </p>
 
 <p>
 An anecdotal and rare example is SHANK2 and SHANK3 in hg19. It is impossible
 for either NCBI or BLAT to get the correct alignment and gene model because the genome sequence is
 missing for part of the gene.  NCBI and BLAT find slightly different exon
 boundaries at the edge of the problematic region. NCBI's aligner tries very hard
@@ -441,34 +449,33 @@
 <h2>How can I show a single transcript per gene?</h2>
 
 <p> 
 For the tracks &quot;<a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a>&quot; 
 (hg19) or &quot;<a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE Genes</a>&quot; 
 (hg38), click on their title and on the configuration page, uncheck the 
 box &quot;Show splice variants&quot;. Only a single transcript will be shown. The method for how this
 transcript is selected is described in the next section below and in the track documentation. </p>
 
 <p class='text-center'>
   <img class='text-center' src="../images/SpliceVariants.png" 
 alt="Changing splice variants" width="750">
 
-<p>For the track <a target=_blank 
-href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite">
-NCBI RefSeq</a> (hg38), you can activate the subtrack &quot;RefSeq HGMD&quot;.
-It contains only the transcripts that are part of the Human Gene Mutation Database.
+<p>
+For the various single-transcript options of &quot;NCBI RefSeq&quot;, please
+see the discussion of "single transcript" tracks in the next section. 
 </p>
 
 <a name="singledownload"></a>
 <h2>How can I download a file with a single transcript per gene?</h2>
 <p>
 This is a common request, but very often this is not necessary when designing
 an analysis.  You will have to make a choice of this single transcript using
 some mechanism, and this choice will affect your pipeline results. It may be
 easier to keep all transcripts. For example, instead of annotating enhancers
 with the closest &quot;best-transcript&quot;, you can annotate them with the closest exon
 of any transcript. When mapping variants to transcripts, you can map to all
 transcripts and show the transcript with the worst impact first.  When
 segmenting the chromosomes into gene loci, you can use the union of all
 transcripts of a gene, adding some predefined distance, rather than selecting a
 single "best" transcript.</p>