src/hg/htdocs/FAQ/FAQgenes.html 0d0a7789414e9f3571c66ffb0977eadebde86992

0d0a7789414e9f3571c66ffb0977eadebde86992
max
  Thu Sep 9 03:32:48 2021 -0700
updating faqgenes, refs #28132

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index ecfe592..bb0a1ab 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -481,82 +481,92 @@
 <a name="singledownload"></a>
 <h2>How can I download a file with a single transcript per gene?</h2>
 <p>
 This is a common request, but very often this is not necessary when designing
 an analysis.  You will have to make a choice of this single transcript using
 some mechanism, and this choice will affect your pipeline results. It may be
 easier to keep all transcripts. For example, instead of annotating enhancers
 with the closest &quot;best-transcript&quot;, you can annotate them with the closest exon
 of any transcript. When mapping variants to transcripts, you can map to all
 transcripts and show the transcript with the worst impact first.  When
 segmenting the chromosomes into gene loci, you can use the union of all
 transcripts of a gene, adding some predefined distance, rather than selecting a
 single "best" transcript.</p>
 
 <p>
-That being said, the main gene tracks have tables that try to guess the most interesting 
-transcript per gene. For the knownGene tracks (UCSC genes on hg19, Gencode on hg38 and mm10), 
-data tables called &quot;knownCanonical&quot; are available for
-many assemblies. They try to select only a single transcript/isoform per gene, 
-if possible and for RefSeq similar options exist.</p>
+That said, the main gene tracks have tables that try to show the "best"
+transcript per gene. There are several choices, depending on the assembly and the
+gene track, and every selection method has a different aim. For the
+knownGene tracks (UCSC genes on hg19, Gencode on hg38 and mm10), data tables
+called &quot;knownCanonical&quot; were built at UCSC.
+For both Gencode/Ensembl and RefSeq, the NCBI/EBI project MANE selects
+the most relevant transcript for each gene, provided that it is annotated identically in
+Gencode and RefSeq. For NCBI RefSeq, the RefSeq Select track also selects the most relevant
+transcript(s) for each gene and is not limited to transcripts that are identical between
+RefSeq and Ensembl. The following gene tracks therefore offer "best-transcript" subsets:
+</p>
 
 <p>
 <b>UCSC Genes on hg19</b>: For hg19, the knownCanonical table is a subset of the <a target="_blank" 
-href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a> track. It was generated by 
+href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a> track. It was generated at UCSC by 
 identifying a canonical isoform for each cluster ID, or gene. Generally, this is the longest
 isoform. It can be downloaded directly from the <a target="_blank" 
 href="http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/">hg19 downloads database</a> 
 or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p>
 
 <p>
-<b>Gencode on hg38/mm10</b>: For hg38, the knownCanonical table is a subset of the <a target="_blank" 
-href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE v29</a> track. As opposed to the hg19 
+<b>Gencode on hg38/mm10 - knownCanonical</b>: For hg38, the knownCanonical table is a subset of the <a target="_blank" 
+href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE v29</a> track. It was generated at UCSC. As opposed to the hg19 
 knownCanonical table, which used computationally generated gene clusters and generally chose the 
 longest isoform as the canonical isoform, the hg38 table uses ENSEMBL gene IDs to define clusters 
 (that is to say, one canonical isoform per ENSEMBL gene ID), and the method of choosing the 
 isoform is described as such:</p>
 
 <p style="margin-left: 10em">
 <i>knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL 
 gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal 
 transcript when available. If no APPRIS tag exists for any transcript associated with the 
 cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then 
 the longest isoform is used.</i></p>
 <p>
 It can be downloaded directly from the <a target="_blank" 
 href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/">hg38 downloads database</a>
 or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p>
 
 <p>
-<b>NCBI RefSeq (hg19/hg38)</b>: On this track, there are three subtracks with slightly 
-different aims, all of which show only a single transcript (or less) for each gene.
+<b>NCBI RefSeq (hg19/hg38)</b>: This track collection contains three subtracks that select the 
+most relevant transcript for all or a subset of genes, with slightly different aims:
 <ul>
     <li> RefSeq Select: NCBI manually selects few, usually one,
 transcript per gene called "RefSeq Select", based on <a target=_blank
-href="https://www.ncbi.nlm.nih.gov/refseq/refseq_select/">various criteria</a>.
+href="https://www.ncbi.nlm.nih.gov/refseq/refseq_select/">a number of criteria</a>.
+These include manual curation, whether a transcript appears in LRG sequences,
+whether it is well conserved, and others.
 Example use cases are comparative genomics and variant reporting. This subset
 is available in the RefSeq Select track under NCBI RefSeq.
 <li>MANE: RefSeq and the EBI
 also select one transcript for every protein coding gene that is annotated exactly 
 the same in both Gencode and RefSeq, a project called <a
 href="https://ncbiinsights.ncbi.nlm.nih.gov/2019/03/12/mane-select-v0-5/"
-target=_blank>"MANE select"</a>, which is another subtrack of NCBI RefSeq. 
+target=_blank>"MANE Select"</a>, which is another subtrack of NCBI RefSeq. "MANE Select"
+can be considered a subset of RefSeq Select.
 <li>HGMD: For the special case of clinical diagnostics
 where an even more reduced number of transcripts simplifies visual inspection,
 we provide another subtrack, "RefSeq HGMD". It contains
 (usually) a single transcript only for genes known to cause human genetic diseases and
 the transcript is the one to which all reported HGMD clinical variants can be mapped.
+This transcript set is also a good choice for variant reporting.
 </ul>
 
 <a name="whatdo"></a>
 <h2>This is rather complicated. Can you tell me which gene transcript track I should use?</h2>
 <p> 
 For automated analysis, if you are doing NGS analysis and you need to capture
 all possible transcripts, GENCODE provides one of the most comprehensive gene sets.  For human 
 genetics or variant annotation, a more restricted transcript set is usually sufficient and &quot;NCBI
 RefSeq&quot; is the standard. If you are only interested in protein-coding
 annotations, CCDS or UniProt may be an option, but this is rather unusual.
 If you are interested in the best splice site coverage, AceView is worth a
 look.
 </p>
 
 <p>