src/hg/htdocs/FAQ/FAQgenes.html 8e20653bae737846c5025e987e37777f2a4f0f29

8e20653bae737846c5025e987e37777f2a4f0f29
max
  Tue Jul 2 10:52:21 2019 +0200
tweaking Lou's excellent transcript download section a little in genes FAQ, refs #23748

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index 418d046..92f6ed4 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -11,33 +11,32 @@
 
 <ul>
 <li><a href="#gene">What is a gene?</a></li>
 <li><a href="#genestrans">What is a transcript and how is it related to a gene?</a></li>
 <li><a href="#genename">What is a gene name?</a></li>
 <li><a href="#mostCommon">What are the most common gene transcript tracks?</a></li>
 <li><a href="#ens">What are Ensembl and GENCODE and is there a difference?</a></li>
 <li><a href="#ensRefseq">What are the differences among GENCODE, Ensembl and RefSeq?</a></li>
 <li><a href="#hg19">For the human assembly hg19/GRCh37: What is the difference between "UCSC 
                     Genes" track, the "GENCODE" track and the "Ensembl Genes" track?</a></li>
 <li><a href="#hg38">For the human assembly hg38/GRCh38: What are the differences between the 
 		    "GENCODE" and "All GENCODE" tracks?</a></li>
 <li><a href="#gencode">What is the difference between GENCODE comprehensive and basic?</a></li>
 <li><a href="#ncbiRefseq">What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</a></li>
 <li><a href="#ccds">What is CCDS?</a></li>
-<li><a href="#justsingle">How can I just show a single transcript per gene?</a></li>
-<li><a href="#singledownload">I just want to download a gene set with a single entry per gene.
-			      Where can I find this?</a></li>
+<li><a href="#justsingle">How can I show a single transcript per gene?</a></li>
+<li><a href="#singledownload">How can I download a file with a single transcript per gene?</a></li>
 <li><a href="#whatdo">This is rather complicated. Can you tell me which gene transcript track
                       I should use?</a></li>
 </ul>
 <hr>
 <p>
 <a href="index.html">Return to FAQ Table of Contents</a></p>
 
 <a name="gene"></a>
 <h2>The basics</h2>
 
 The genome browser contains many gene annotation tracks. Our users 
 often wonder what these contain and where the information that we present comes
 from.
 
 <h6>What is a gene?</h6>
@@ -342,90 +341,105 @@
 often change from one version to the next.
 </p>
 
 <a name="ccds"></a>
 <h6>What is CCDS?</h6>
 <p>
 The <a target=_blank href="https://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi">
 Consensus Coding Sequence Project</a> is a list of transcript coding sequence (CDS) genomic regions
 that are identically annotated by RefSeq and Ensembl/GENCODE. CCDS undergoes extensive manual
 review and you can consider it a subset of either gene track, filtered for high quality.
 The CCDS identifiers are very stable and allow you to link easily between the different databases.
 As the name implies, it does not cover UTR regions or non-coding transcripts.
 </p>
 
 <a name="justsingle"></a>
-<h6>How can I just show a single transcript per gene?</h6>
+<h6>How can I show a single transcript per gene?</h6>
 
 <p> 
 For the tracks &quot;<a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a>&quot; 
 (hg19) or &quot;<a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE Genes</a>&quot; 
 (hg38), click on their title and on the configuration page, uncheck the 
 box &quot;Show splice variants&quot;. Only a single transcript will be shown. The method for how this
 transcript is selected is described in the track documentation below the 
 configuration settings.</p>
 
 <p class='text-center'>
   <img class='text-center' src="../images/SpliceVariants.png" 
 alt="Changing splice variants" width="750">
 
 <p>For the track <a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite">
 NCBI RefSeq</a> (hg38), you can activate the subtrack &quot;RefSeq HGMD&quot;.
 It contains only the transcripts that are part of the Human Gene Mutation Database.
 </p>
 
 <a name="singledownload"></a>
-<h6>I just want to download a gene set with a single entry per gene. Where can I find this?</h6>
+<h6>How can I download a file with a single transcript per gene?</h6>
 <p>
-We have data tables named knownCanonical available for different assemblies comprised of a single 
-transcript/isoform per gene.</p>
+This is a common request, but a single transcript per gene is often not
+necessary when designing an analysis. You have to choose that single transcript
+by some rule, and the choice will affect your pipeline results. It may be
+easier to keep all transcripts. For example, instead of annotating enhancers
+with the closest "best" transcript, you can annotate them with the closest exon
+of any transcript. When mapping variants to transcripts, you can map to all
+transcripts, sort these by accession ID, and show mainly the first. When
+segmenting the chromosomes into gene loci, you can use the union of all
+transcripts of a gene rather than selecting a single "best" transcript.
+</p>
+
+<p>
+That said, data tables called "knownCanonical" are available for
+many assemblies. They select a single transcript/isoform per gene
+where possible.</p>
 
 <p>
 For hg19, the knownCanonical table is a subset of the <a target="_blank" 
 href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a> track. It was generated by 
 identifying a canonical isoform for each cluster ID, or gene. Generally, this is the longest
 isoform. It can be downloaded directly from the <a target="_blank" 
 href="http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/">hg19 downloads database</a> 
 or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p>
 
 <p>
 For hg38, the knownCanonical table is a subset of the <a target="_blank" 
 href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE v29</a> track. Unlike the hg19 
 equivalent, which generally uses the longest isoform, this table is defined 
 as follows:</p>
 <p>
 <i>knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL 
 gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal 
 transcript when available. If no APPRIS tag exists for any transcript associated with the 
 cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then 
 the longest isoform is used.</i></p>
 <p>
 It can be downloaded directly from the <a target="_blank" 
 href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/">hg38 downloads database</a>
 or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p>
 
 <a name="whatdo"></a>
 <h6>This is rather complicated. Can you tell me which gene transcript track I should use?</h6>
 <p> 
 For automated analysis, if you are doing NGS analysis and you need to capture
-all possible transcripts, GENCODE provides a comprehensive gene set.  For human 
+all possible transcripts, GENCODE provides one of the most comprehensive gene sets.  For human 
 genetics or variant annotation, a more restricted transcript set is usually sufficient and &quot;NCBI
 RefSeq&quot; is the standard. If you are only interested in protein-coding
 annotations, CCDS or UniProt may be an option, but this is rather unusual.
+If you are interested in the best splice site coverage, AceView is worth a
+look.
 </p>
 
 <p>
 For manual inspection of exon boundaries of a single gene, and especially if it
 is a transcript that is repetitive or hard to align (e.g. very small exons),
 look at the UCSC RefSeq track and watch for differences between the NCBI
 and UCSC exon placement. You can also BLAT the transcript sequence. 
 Manually look at ESTs, mRNAs, TransMap and possibly Augustus, Genscan, SIB, SGP
 or GeneId in obscure cases where you are looking for hints on what an
 alternative splice variant could look like.</p>
 <p>
 You may also find the <a target="_blank" 
 href="http://genome.ucsc.edu/s/view/GeneSupport">Gene Support</a> public session
 helpful. This session is a collection of tracks centered around supporting evidence
 for genes.</p>
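
The hg38 knownCanonical rule quoted in the patch above (APPRIS principal transcript first, then a BASIC transcript, then the longest isoform) can be illustrated with a short Python sketch. This is not the code UCSC uses to build the table. The `Transcript` record, the tag strings `appris_principal...` and `basic`, and the "longest wins" tie-break within a tier are all assumptions made for illustration; the FAQ text does not say how ties are resolved.

```python
from collections import defaultdict
from typing import NamedTuple


class Transcript(NamedTuple):
    # Hypothetical record layout, not the knownCanonical table schema.
    gene_id: str
    transcript_id: str
    length: int
    tags: frozenset  # assumed tag strings, e.g. "basic", "appris_principal_1"


def tier(tx):
    """Lower is better: 0 = APPRIS principal, 1 = BASIC, 2 = everything else."""
    if any(tag.startswith("appris_principal") for tag in tx.tags):
        return 0
    if "basic" in tx.tags:
        return 1
    return 2


def pick_canonical(transcripts):
    """Return {gene_id: transcript_id} with one transcript per gene."""
    by_gene = defaultdict(list)
    for tx in transcripts:
        by_gene[tx.gene_id].append(tx)
    # Best tier first; within a tier, prefer the longest transcript
    # (the tie-break is an assumption, see the note above).
    return {
        gene: min(txs, key=lambda tx: (tier(tx), -tx.length)).transcript_id
        for gene, txs in by_gene.items()
    }


# Made-up example data: one gene per tier of the rule.
example = [
    Transcript("geneA", "txA1", 3000, frozenset({"basic"})),
    Transcript("geneA", "txA2", 2000, frozenset({"basic", "appris_principal_1"})),
    Transcript("geneB", "txB1", 1500, frozenset()),
    Transcript("geneB", "txB2", 1800, frozenset({"basic"})),
    Transcript("geneC", "txC1", 900, frozenset()),
    Transcript("geneC", "txC2", 1200, frozenset()),
]
canonical = pick_canonical(example)
print(canonical)
```

In the example data, geneA has an APPRIS principal transcript that is shorter than its other BASIC transcript, geneB has only a BASIC transcript to prefer, and geneC falls through to the longest isoform. For real analyses, downloading the knownCanonical table itself is the safer route, since the sketch only mirrors the prose description.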