5980d946ead37a1db3066b34ce972ba78ca62d0f
brianlee
  Thu Mar 3 16:36:08 2022 -0800
Adding a gene FAQ entry about non-coding and coding gene selection refs #29030

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index c684d08..77ad926 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -19,30 +19,32 @@
 <li><a href="#ensRefseq">What are the differences among GENCODE, Ensembl and RefSeq?</a></li>
 <li><a href="#hg19">For the human assembly hg19/GRCh37: What is the difference between "UCSC 
                     Genes" track, the "GENCODE" track and the "Ensembl Genes" track?</a></li>
 <li><a href="#hg38">For the human assembly hg38/GRCh38: What are the differences between the 
 		    "GENCODE" and "All GENCODE" tracks?</a></li>
 <li><a href="#gencode">What is the difference between GENCODE comprehensive and basic?</a></li>
 <li><a href="#ncbiRefseq">What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</a></li>
 <li><a href="#mito">What is the best gene track for mitochondrial gene annotations?</a></li>
 <li><a href="#report">How shall I report a gene transcript in a manuscript?</a></li>
 <li><a href="#ccds">What is CCDS?</a></li>
 <li><a href="#justsingle">How can I show a single transcript per gene?</a></li>
 <li><a href="#singledownload">How can I download a file with a single transcript per gene?</a></li>
 <li><a href="#whatdo">This is rather complicated. Can you tell me which gene transcript track
                       I should use?</a></li>
 <li><a href="#gtfDownload">Does UCSC provide GTF/GFF files for gene models?</a></li>
+<li><a href="#coding">What is the best way to get only coding genes (or only non-coding genes)
+                      out of GENCODE (or other gene) tables?</a></li>
 </ul>
 <hr>
 <p>
 <a href="index.html">Return to FAQ Table of Contents</a></p>
 
 <a name="gene"></a>
 <h2>The basics</h2>
 
 The genome browser contains many gene annotation tracks. Our users 
 often wonder what these contain and where the information that we present comes
 from.
 
 <h6>What is a gene?</h6>
 <p>
 The exact definition of &quot;gene&quot; depends on the context. In the context of 
@@ -598,17 +600,73 @@
 information on GTF format can be found <a target="_blank" href="FAQformat.html#format4">
 in our FAQ</a>.</p>
 <p>
 These files are generated for four gene model tables: ncbiRefSeq, refGene, ensGene, knownGene. 
 Certain assemblies, such as hg19, will have all four files while smaller assemblies may only have
 one or two of these. Which file a user should use depends on their analysis, as they contain 
 different data and metadata.</p>
 <p>
 These files are generated using the <code>genePredToGtf</code> method described in our 
 <a target="_blank" href="https://genome.ucsc.edu/FAQ/FAQdownloads.html#download37">
 downloads FAQ</a> using the <code>-utr</code> flag. They can be found on the download server 
 address <i>http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/genes/</i> where
 <i>$db</i> is the assembly of interest. For example, the <a target="_blank" 
 href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/">hg38 GTF files</a>.</p>
 
+<a name="coding"></a>
+<h2>What is the best way to get only coding genes (or only non-coding genes)
+out of GENCODE (or other gene) tables?</h2>
+<h3>Coding genes</h3>
+<p>
+One option for GENCODE is to use the Public MySQL server and the following query:
+<pre>
+ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -Ne 'select * from wgEncodeGencodeBasicV39 where cdsStartStat = "cmpl" and cdsEndStat = "cmpl";' hg38 
+</pre></p>
+<p>
+This query accesses the hg38 database and selects, from the wgEncodeGencodeBasicV39 table,
+only those entries where both the cdsStartStat and cdsEndStat fields have the value cmpl,
+meaning &quot;CDS is complete&quot; at the start and end. The selected entries are therefore
+protein-coding, and non-coding RNA genes are excluded.</p>
+<p>
+For the manually curated RefSeq gene set, transcript identifiers start with NM_ for coding
+or NR_ for non-coding, followed by a number and version number separated by a dot, e.g.
+&quot;NR_046018.2&quot; for an RNA pseudogene. For RefSeq, one can select coding genes by
+filtering for NM_ identifiers.</p>
+<p>If using the UCSC knownGene table, one can filter for entries where the coding start
+and coding end fields of the table are not equal, e.g.
+<code>knownGene.cdsStart != knownGene.cdsEnd</code>, which ensures the selected
+entries are coding genes.</p>
+<p>
+You can also <a href="https://groups.google.com/u/1/a/soe.ucsc.edu/g/genome/search?q=only%20coding%20genes"
+target="_blank">search our mailing-list archives</a> to read further details about
+obtaining only coding genes from the UCSC Genome Browser.</p>
+<a name="nonCoding"></a>
+<h3>Non-coding genes</h3>
+<p>
+The steps for selecting non-coding genes are essentially the opposite of the steps to select
+only coding genes. One option for GENCODE is to use the Public MySQL server
+and the following query:</p>
+<p>
+<pre>
+ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -Ne 'select * from wgEncodeGencodeBasicV39 where cdsStartStat != "cmpl" and cdsEndStat != "cmpl";' hg38
+</pre></p>
+<p>
+This query accesses the hg38 database and selects, from the wgEncodeGencodeBasicV39 table,
+only those entries where neither the cdsStartStat nor the cdsEndStat field has the value cmpl
+(&quot;CDS is complete&quot;). These entries lack a complete CDS at the start and end, so they
+are not protein-coding entries, thereby including only non-coding RNA genes.</p>
+<p>
+For the manually curated RefSeq gene set, transcript identifiers start with NM_ for coding
+or NR_ for non-coding, followed by a number and version number separated by a dot, e.g.
+&quot;NR_046018.2&quot; for an RNA pseudogene. For RefSeq, one can select non-coding genes by
+filtering for NR_ identifiers.</p>
+<p>If using the UCSC knownGene table, one can filter for entries where the coding start
+and coding end fields of the table are equal, e.g.
+<code>knownGene.cdsStart = knownGene.cdsEnd</code>, which ensures the selected
+entries are non-coding genes.</p>
+<p>
+You can also <a href="https://groups.google.com/u/1/a/soe.ucsc.edu/g/genome/search?q=only%20non-coding%20genes"
+target="_blank">search our mailing-list archives</a> to read further details about
+obtaining only non-coding genes from the UCSC Genome Browser.</p>
+
 <!--#include virtual="$ROOT/inc/gbPageEnd.html" -->