src/hg/htdocs/FAQ/FAQgenes.html 772bf0ee794706c69818babdd436402a633c6090

772bf0ee794706c69818babdd436402a633c6090
max
  Fri Jan 17 11:49:37 2020 +0100
adding a section about NCBI RefSeq and knownCanonical alternatives there, refs #24780

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index 8ade084..82dfb94 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -388,89 +388,105 @@
 review and you can consider these a subset of either gene track, filtered for high quality.
 The CCDS identifiers  are very stable and allow you to link easily between the different databases.
 As  the name implies, it does not cover UTR regions or non-coding transcripts.
 </p>
 
 <a name="justsingle"></a>
 <h6>How can I show a single transcript per gene?</h6>
 
 <p> 
 For the tracks &quot;<a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a>&quot; 
 (hg19) or &quot;<a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE Genes</a>&quot; 
 (hg38), click on their title and on the configuration page, uncheck the 
 box &quot;Show splice variants&quot;. Only a single transcript will be shown. The method for how this
-transcript is selected is described in the track documentation below the 
-configuration settings.</p>
+transcript is selected is described in the next section and in the track documentation.</p>
 
 <p class='text-center'>
   <img class='text-center' src="../images/SpliceVariants.png" 
 alt="Changing splice variants" width="750">
 
 <p>For the track <a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite">
 NCBI RefSeq</a> (hg38), you can activate the subtrack &quot;RefSeq HGMD&quot;.
 It contains only the transcripts that are part of the Human Gene Mutation Database.
 </p>
 
 <a name="singledownload"></a>
 <h6>How can I download a file with a single transcript per gene?</h6>
 <p>
 This is a common request, but very often this is not necessary when designing
 an analysis.  You will have to make a choice of this single transcript using
 some mechanism, and this choice will affect your pipeline results. It may be
 easier to keep all transcripts. For example, instead of annotating enhancers
 with the closest &quot;best-transcript&quot;, you can annotate them with the closest exon
 of any transcript. When mapping variants to transcripts, you can map to all
-transcripts and sort these by accession ID, showing mainly the first.  When
+transcripts and show the transcript with the worst impact first.  When
 segmenting the chromosomes into gene loci, you can use the union of all
-transcripts of a gene rather than select a single "best" transcript.
-</p>
+transcripts of a gene, adding some predefined distance, rather than selecting a
+single "best" transcript.</p>
 
 <p>
-That being said, data tables called &quot;knownCanonical&quot; are available for
+That being said, the main gene tracks have tables that try to guess the most interesting 
+transcript per gene. For the knownGene tracks (UCSC Genes on hg19, GENCODE on hg38 and mm10), 
+data tables called &quot;knownCanonical&quot; are available for
 many assemblies. They try to select only a single transcript/isoform per gene, 
-if possible.</p>
+if possible. Similar options exist for RefSeq.</p>
 
 <p>
-For hg19, the knownCanonical table is a subset of the <a target="_blank" 
+<b>UCSC Genes on hg19</b>: The knownCanonical table is a subset of the <a target="_blank" 
 href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a> track. It was generated by 
 identifying a canonical isoform for each cluster ID, or gene. Generally, this is the longest
 isoform. It can be downloaded directly from the <a target="_blank" 
 href="http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/">hg19 downloads database</a> 
 or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p>
 
 <p>
-For hg38, the knownCanonical table is a subset of the <a target="_blank" 
+<b>GENCODE on hg38/mm10</b>: For hg38, the knownCanonical table is a subset of the <a target="_blank" 
 href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE v29</a> track. As opposed to the hg19 
 knownCanonical table, which used computationally generated gene clusters and generally chose the 
 longest isoform as the canonical isoform, the hg38 table uses ENSEMBL gene IDs to define clusters 
 (that is to say, one canonical isoform per ENSEMBL gene ID), and the method of choosing the 
 isoform is described as such:</p>
 
-<p>
+<p style="margin-left: 10em">
 <i>knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL 
 gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal 
 transcript when available. If no APPRIS tag exists for any transcript associated with the 
 cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then 
 the longest isoform is used.</i></p>
 <p>
 It can be downloaded directly from the <a target="_blank" 
 href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/">hg38 downloads database</a>
 or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p>
 
+<p>
+<b>NCBI RefSeq (hg19/hg38)</b>: NCBI manually selects a few transcripts per gene,
+usually one, called "RefSeq Select", based on <a target=_blank
+href="https://www.ncbi.nlm.nih.gov/refseq/refseq_select/">various criteria</a>.
+Example use cases are comparative genomics and variant reporting. This subset
+is available in the RefSeq Select track under NCBI RefSeq.  RefSeq and the EBI
+also select one transcript for every protein-coding gene, a project called <a
+href="https://ncbiinsights.ncbi.nlm.nih.gov/2019/03/12/mane-select-v0-5/"
+target=_blank>"MANE Select"</a>, which is another subtrack of NCBI RefSeq. 
+For the special case of clinical diagnostics,
+where an even more reduced number of transcripts simplifies visual inspection,
+we also provide "RefSeq HGMD", another subtrack of NCBI RefSeq. It contains
+(usually) a single transcript, only for genes known to cause human genetic diseases, and
+the transcript is the one to which reported clinical variants can be mapped.</p>
+
 <a name="whatdo"></a>
 <h6>This is rather complicated. Can you tell me which gene transcript track I should use?</h6>
 <p> 
 For automated analysis, if you are doing NGS analysis and you need to capture
 all possible transcripts, GENCODE provides one of the most comprehensive gene sets.  For human 
 genetics or variant annotation, a more restricted transcript set is usually sufficient and &quot;NCBI
 RefSeq&quot; is the standard. If you are only interested in protein-coding
 annotations, CCDS or UniProt may be an option, but this is rather unusual.
 If you are interested in the best splice site coverage, AceView is worth a
 look.
 </p>
 
 <p>
 For manual inspection of exon boundaries of a single gene, and especially if it
 is a transcript that is repetitive or hard to align (e.g. very small exons),