src/hg/htdocs/FAQ/FAQgenes.html d328b0a2a09c271ddaee177d6405fac096371dc4

d328b0a2a09c271ddaee177d6405fac096371dc4
lrnassar
  Thu Jul 4 09:17:53 2019 -0700
Updating FAQ on how we generate knownCanonical for hg38 #23748

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index 92f6ed4..2c741e3 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -370,55 +370,58 @@
 alt="Changing splice variants" width="750">
 
 <p>For the track <a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite">
 NCBI RefSeq</a> (hg38), you can activate the subtrack &quot;RefSeq HGMD&quot;.
 It contains only the transcripts that are part of the Human Gene Mutation Database.
 </p>
 
 <a name="singledownload"></a>
 <h6>How can I download a file with a single transcript per gene?</h6>
 <p>
 This is a common request, but very often this is not necessary when designing
 an analysis.  You will have to make a choice of this single transcript using
 some mechanism, and this choice will affect your pipeline results. It may be
 easier to keep all transcripts. For example, instead of annotating enhancers
-with the closest "best-transcript", you can annotate them with the closest exon
+with the closest &quot;best-transcript&quot;, you can annotate them with the closest exon
 of any transcript. When mapping variants to transcripts, you can map to all
 transcripts and sort these by accession ID, showing mainly the first.  When
 segmenting the chromosomes into gene loci, you can use the union of all
 transcripts of a gene rather than select a single "best" transcript.
 </p>
 
 <p>
-That being said, data tables called "knownCanonical" are available for
+That being said, data tables called &quot;knownCanonical&quot; are available for
 many assemblies. They try to select only a single transcript/isoform per gene, 
 if possible.</p>
 
 <p>
 For hg19, the knownCanonical table is a subset of the <a target="_blank" 
 href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a> track. It was generated by 
 identifying a canonical isoform for each cluster ID, or gene. Generally, this is the longest
 isoform. It can be downloaded directly from the <a target="_blank" 
 href="http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/">hg19 downloads database</a> 
 or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p>
 
 <p>
 For hg38, the knownCanonical table is a subset of the <a target="_blank" 
 href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE v29</a> track. As opposed to the hg19 
-equivalent which generally used the longest isoform for indentification, this table is defined 
-as follows:</p>
+knownCanonical table, which used computationally generated gene clusters and generally chose the 
+longest isoform as the canonical isoform, the hg38 table uses ENSEMBL gene IDs to define clusters 
+(that is, one canonical isoform per ENSEMBL gene ID), and the method of choosing the 
+isoform is described as follows:</p>
+
 <p>
 <i>knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL 
 gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal 
 transcript when available. If no APPRIS tag exists for any transcript associated with the 
 cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then 
 the longest isoform is used.</i></p>
 <p>
 It can be downloaded directly from the <a target="_blank" 
 href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/">hg38 downloads database</a>
 or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p>
 
 <a name="whatdo"></a>
 <h6>This is rather complicated. Can you tell me which gene transcript track I should use?</h6>
 <p> 
 For automated analysis, if you are doing NGS analysis and you need to capture