src/hg/htdocs/FAQ/FAQgenes.html 317dbfc227692ade3bc42d0de919155f67e139e7

317dbfc227692ade3bc42d0de919155f67e139e7
lrnassar
  Wed Mar 20 11:16:25 2019 -0700
Minor modifications and new section to unreleased FAQgenes page ref#22696

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index 8a7151d..0de5271 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -1,41 +1,43 @@
 <!DOCTYPE html>
 <!--#set var="TITLE" value="Genome Browser FAQ" -->
 <!--#set var="ROOT" value=".." -->
 
 <!-- Relative paths to support mirror sites with non-standard GB docs install -->
 <!--#include virtual="$ROOT/inc/gbPageStart.html" -->
 
 <h1>Frequently Asked Questions: Gene tracks</h1>
 
 <h2>Topics</h2>
 
 <ul>
 <li><a href="#gene">What is a gene?</a></li>
 <li><a href="#genestrans">What is a transcript and how is it related to a gene?</a></li>
-<li><a href="#mostCommon">What are the most common gene transcript tracks?</a></li>
 <li><a href="#genename">What is a gene name?</a></li>
+<li><a href="#mostCommon">What are the most common gene transcript tracks?</a></li>
 <li><a href="#ens">What are Ensembl and GENCODE and is there a difference?</a></li>
 <li><a href="#ensRefseq">What are the differences among GENCODE, Ensembl and RefSeq?</a></li>
 <li><a href="#hg19">For the human assembly hg19/GRCh37: What is the difference between "UCSC 
                     Genes" track, the "GENCODE" track and the "Ensembl Genes" track?</a></li>
 <li><a href="#hg38">For the human assembly hg38/GRCh38: What are the differences between the 
 		    "GENCODE" and "All GENCODE" tracks?</a></li>
 <li><a href="#gencode">What is the difference between GENCODE comprehensive and basic?</a></li>
 <li><a href="#ncbiRefseq">What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</a></li>
 <li><a href="#ccds">What is CCDS?</a></li>
 <li><a href="#justsingle">How can I just show a single transcript per gene?</a></li>
+<li><a href="#singledownload">I just want to download a gene set with a single entry per gene.
+			      Where can I find this?</a></li>
 <li><a href="#whatdo">This is rather complicated. Can you tell me which gene transcript track
                       I should use?</a></li>
 </ul>
 <hr>
 <p>
 <a href="index.html">Return to FAQ Table of Contents</a></p>
 
 <a name="gene"></a>
 <h2>The basics</h2>
 
 The genome browser contains many gene annotation tracks. Our users 
 often wonder what these contain and where the information that we present comes
 from.
 
 <h6>What is a gene?</h6>
@@ -66,82 +68,83 @@
 Genes&quot;, but they really represent transcripts on chromosomes of a genome assembly.</p>
 <p>
 For example, using the databases by NCBI, the gene
 with the gene symbol <a target=_blank
 href="https://www.ncbi.nlm.nih.gov/gene/672#">BRCA1</a> has 5 protein-coding
 transcripts or isoforms. The first transcript has the NCBI accession number <a
 target=_blank
 href="https://www.ncbi.nlm.nih.gov/nuccore/NM_007294.3">NM_007294.3</a> which
 produces the protein with the accession<a target=_blank
 href="https://www.ncbi.nlm.nih.gov/protein/NP_009225.1"> NP_009225.1</a>. In
 the human genome, it is located on chromosome 17, where it is comprised of <a
 target=_blank href="https://www.ncbi.nlm.nih.gov/nuccore/U14680">23 exons</a>.
 On the version GRCh38 of the human genome, these exons cover the DNA
 nucleotides 43044295 to 43125483.</p>
 
+<a name="genename"></a>
+<h6>What is a gene or transcript accession? </h6>
+
+<p>
+Gene symbols like BRCA1 are easy to remember but sometimes change and are not
+specific to an organism.  Therefore most databases internally use unique
+identifiers to refer to sequences and some journals require authors to use
+these in manuscripts.</p>
+
+<p>
+The most common accession numbers encountered by users are either from Ensembl,
+GENCODE or RefSeq.  Human Ensembl/GENCODE gene accession numbers start with
+ENSG, e.g. &quot;ENSG00000012048&quot; for BRCA1.  Every ENSG-gene has at least
+one transcript assigned to it. The transcript identifiers start with ENST
+followed by a number, e.g.  &quot;ENST00000619216.1&quot;. NCBI refers to genes
+with plain numbers, e.g.  672 for BRCA1. Manually curated RefSeq transcript
+identifiers start with NM_ (coding) or NR_ (non-coding), followed by a number and version
+number separated by a dot, e.g. &quot;NR_046018.2&quot;.  If the transcript was
+predicted by the NCBI Gnomon software, the prefix is XM_ but these are rare in human.
+A table of these and other RefSeq prefixes can be
+found on the <a target=_blank
+href="https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly">
+NCBI website</a>.
+</p>
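+
+<p>
+The prefix rules above can be sketched as a small lookup. This is a minimal illustration only,
+assuming the human prefixes named in this section; the function name and the category labels
+are hypothetical and do not come from any UCSC, Ensembl or NCBI tool.</p>

```python
import re

# Prefix patterns taken from the paragraph above. The labels are
# illustrative names chosen for this sketch, not official terminology.
ACCESSION_PATTERNS = [
    (re.compile(r"ENSG\d+(\.\d+)?"), "human Ensembl/GENCODE gene"),
    (re.compile(r"ENST\d+(\.\d+)?"), "human Ensembl/GENCODE transcript"),
    (re.compile(r"NM_\d+(\.\d+)?"), "RefSeq curated coding transcript"),
    (re.compile(r"NR_\d+(\.\d+)?"), "RefSeq curated non-coding transcript"),
    (re.compile(r"XM_\d+(\.\d+)?"), "RefSeq predicted (Gnomon) transcript"),
    (re.compile(r"\d+"), "NCBI gene ID"),
]


def classify_accession(accession):
    """Return a label for the accession, or "unknown" if no prefix matches."""
    acc = accession.strip()
    for pattern, label in ACCESSION_PATTERNS:
        # fullmatch requires the whole string to fit the pattern,
        # so a gene symbol such as "BRCA1" is not mistaken for a number.
        if pattern.fullmatch(acc):
            return label
    return "unknown"


if __name__ == "__main__":
    for example in ["ENSG00000012048", "ENST00000619216.1", "672", "NR_046018.2", "BRCA1"]:
        print(example, "->", classify_accession(example))
```

+<p>
+Other RefSeq prefixes from the NCBI table linked above are not covered by this sketch and
+would be reported as &quot;unknown&quot;.</p>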
+
 <a name="mostCommon"></a>
 <h6>What are the most common gene transcript tracks?</h6>
 <p>
 Researchers sequence cDNA sequences and send these to NCBI Genbank. The
 Genome Browser shows these sequences in the Genbank or the <a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg38&g=est">EST track</a> (if the cDNA is just
 a single read from the 5' or 3' end). From the alignment of the cDNAs and ESTs, 
 the NCBI RefSeq group manually creates a smaller set of representative transcripts 
 which we display as the <a target=_blank 
 href=../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite>RefSeq Curated</a> track.
 Automated programs like UCSC's or Ensembl's gene build software do the same, just
 in software, which is more systematic but also more error-prone.
 With the arrival of GENCODE, Ensembl added a manual curation to their
 human and mouse transcripts. NCBI has added an automated prediction software (Gnomon)
 which we show in the &quot;<a target=_blank 
 href=../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite>RefSeq Predicted</a>&quot; track.</p>
 
 <p>There are many other tracks in the group &quot;Genes and Gene Predictions&quot;.
 <a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=genscan">Genscan</a> and <a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg19&g=nscanGene">N-Scan</a> are older 
 transcript predictor algorithms that are based on the genome sequence alone. 
 <a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=augustusGene">Augustus</a> and <a 
 target=_blank href="../cgi-bin/hgTrackUi?db=hg19&g=acembly">AceView</a> are automated 
 gene-predictors that use cDNA and EST data. These and similar gene
 tracks are only relevant when you are working on a particular locus where you
 think that the manually curated gene models (Ensembl and RefSeq) have
 errors.</p>
 
-<a name="genename"></a>
-<h6>What is a gene or transcript accession? </h6>
-
-<p>
-Gene symbols like BRCA1 are easy to remember but sometimes change and are not
-specific to an organism.  Therefore most databases internally use unique
-identifiers to refer to sequences and some journals require authors to use
-these in manuscripts.<br>
-
-The most common accession numbers encountered by users are either from Ensembl,
-GENCODE or RefSeq.  Human Ensembl/GENCODE gene accession numbers start with
-ENSG, e.g. &quot;ENSG00000012048&quot for BRCA1.  Every ENSG-gene has at least
-one transcript assigned to it. The transcript identifiers start with with ENST
-followed by a number, e.g.  &quot;ENST00000619216.1&quot;. NCBI refers to genes
-with plain numbers, e.g.  672 for BRCA1. Manually curated RefSeq transcript
-identifiers start with NM_ (coding) or NR_ (non-coding), followed by a number and version
-number separated by a dot, e.g. &quot;NR_046018.2&quot;.  If the transcript was
-predicted by the NCBI Gnomon software, the prefix is XM_ but these are rare in human. 
-A table of these and other RefSeq prefixes can be
-found on the <a target=_blank
-href="https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly">
-NCBI website</a>.
-</p>
-
 <a name="ens"></a>
 <h2>The differences</h2>
 
 Some of our gene tracks look similar and contain very similar information which can be confusing.
 
 <h6>What are Ensembl and GENCODE and is there a difference?</h6>
 
 <p> 
 Officially, the Ensembl and GENCODE gene models are the same. On the latest human and mouse genome 
 assemblies (hg38 and mm10), the identifiers, transcript sequences, and exon coordinates are almost
 identical between equivalent Ensembl and GENCODE versions (excluding <a target=_blank 
 href="FAQdownloads.html#downloadAlt">alternative sequences</a> or <a target=_blank 
 href="FAQdownloads.html#downloadFix">fix sequences</a>).</p>
 
 <p>GENCODE uses the UCSC convention of prefixing chromosome names with &quot;chr&quot;, e.g. 
@@ -311,30 +314,60 @@
 (hg38), click on their title and on the configuration page, uncheck the 
 box &quot;Show splice variants&quot;. Only a single transcript will be shown. The method for how this
 transcript is selected is described in the track documentation below the 
 configuration settings.</p>
 
 <p class='text-center'>
   <img class='text-center' src="../images/SpliceVariants.png" 
 alt="Changing splice variants" width="750">
 
 <p>For the track <a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite">
 NCBI RefSeq</a> (hg38), you can activate the subtrack &quot;RefSeq HGMD&quot;.
 It contains only the transcripts that are part of the Human Gene Mutation Database.
 </p>
 
+<a name="singledownload"></a>
+<h6>I just want to download a gene set with a single entry per gene. Where can I find this?</h6>
+<p>
+For several assemblies, we provide a data table named knownCanonical that contains a single 
+transcript/isoform per gene.</p>
+
+<p>
+For hg19, the knownCanonical table is a subset of the <a target="_blank" 
+href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a> track. It was generated by 
+identifying a canonical isoform for each cluster ID, or gene. Generally, this is the longest
+isoform. It can be downloaded directly from the <a target="_blank" 
+href="http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/">hg19 downloads database</a> 
+or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p>
+
+<p>
+For hg38, the knownCanonical table is a subset of the <a target="_blank" 
+href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE v29</a> track. Unlike the hg19 
+equivalent, which generally uses the longest isoform, the canonical transcript in this table 
+is chosen as follows:</p>
+<p>
+<i>knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL 
+gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal 
+transcript when available. If no APPRIS tag exists for any transcript associated with the 
+cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then 
+the longest isoform is used.</i></p>
+<p>
+It can be downloaded directly from the <a target="_blank" 
+href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/">hg38 downloads database</a>
+or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p>
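+
+<p>
+The hg38 selection order quoted above (APPRIS principal, then BASIC, then longest isoform) can
+be sketched as follows. This is an illustration under stated assumptions, not the code used to
+build the table: the Transcript class and its field names are hypothetical, and breaking ties
+within a group by length is an assumption that the description above does not specify.</p>

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    # Hypothetical record layout for this sketch only.
    transcript_id: str
    gene_id: str            # the Ensembl gene ID that defines the cluster
    length: int
    appris_principal: bool = False
    basic: bool = False


def pick_canonical(transcripts):
    """Pick one transcript from a single gene cluster."""
    appris = [t for t in transcripts if t.appris_principal]
    basic = [t for t in transcripts if t.basic]
    # Fall through the three groups in priority order.
    pool = appris or basic or list(transcripts)
    # Assumption: ties inside the chosen group go to the longest isoform.
    return max(pool, key=lambda t: t.length)


def canonical_per_gene(transcripts):
    """Group transcripts by gene ID and pick one canonical transcript each."""
    by_gene = {}
    for t in transcripts:
        by_gene.setdefault(t.gene_id, []).append(t)
    return {gene: pick_canonical(group) for gene, group in by_gene.items()}


if __name__ == "__main__":
    # Made-up identifiers, for demonstration only.
    demo = [
        Transcript("tx1", "geneA", 5000),
        Transcript("tx2", "geneA", 3000, appris_principal=True, basic=True),
        Transcript("tx3", "geneB", 1200, basic=True),
        Transcript("tx4", "geneB", 4000),
    ]
    for gene, tx in canonical_per_gene(demo).items():
        print(gene, tx.transcript_id)
```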
+
 <a name="whatdo"></a>
 <h6>This is rather complicated. Can you tell me which gene transcript track I should use?</h6>
 <p> 
 For automated analysis, if you are doing NGS analysis and you need to capture
 all possible transcripts, GENCODE provides a comprehensive gene set.  For human 
 genetics or variant annotation, a more restricted transcript set is usually sufficient and &quot;NCBI
 RefSeq&quot; is the standard. If you are only interested in protein-coding
 annotations, CCDS or UniProt may be an option, but this is rather unusual.
 </p>
 
 <p>
 For manual inspection of exon boundaries of a single gene, and especially if it
 is a transcript that is repetitive or hard to align (e.g. very small exons),
 look at the UCSC RefSeq track and watch for differences between the NCBI
 and UCSC exon placement. You can also BLAT the transcript sequence.