5980d946ead37a1db3066b34ce972ba78ca62d0f brianlee Thu Mar 3 16:36:08 2022 -0800 Adding a gene FAQ entry about non-coding and coding gene selection refs #29030 diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html index c684d08..77ad926 100755 --- src/hg/htdocs/FAQ/FAQgenes.html +++ src/hg/htdocs/FAQ/FAQgenes.html @@ -19,30 +19,32 @@ <li><a href="#ensRefseq">What are the differences among GENCODE, Ensembl and RefSeq?</a></li> <li><a href="#hg19">For the human assembly hg19/GRCh37: What is the difference between "UCSC Genes" track, the "GENCODE" track and the "Ensembl Genes" track?</a></li> <li><a href="#hg38">For the human assembly hg38/GRCh38: What are the differences between the "GENCODE" and "All GENCODE" tracks?</a></li> <li><a href="#gencode">What is the difference between GENCODE comprehensive and basic?</a></li> <li><a href="#ncbiRefseq">What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</a></li> <li><a href="#mito">What is the best gene track for mitochondrial gene annotations?</a></li> <li><a href="#report">How shall I report a gene transcript in a manuscript?</a></li> <li><a href="#ccds">What is CCDS?</a></li> <li><a href="#justsingle">How can I show a single transcript per gene?</a></li> <li><a href="#singledownload">How can I download a file with a single transcript per gene?</a></li> <li><a href="#whatdo">This is rather complicated. Can you tell me which gene transcript track I should use?</a></li> <li><a href="#gtfDownload">Does UCSC provide GTF/GFF files for gene models?</a></li> +<li><a href="#coding">What is the best way to get only coding genes (or only non-coding genes) + out of GENCODE (or other gene) tables?</a></li> </ul> <hr> <p> <a href="index.html">Return to FAQ Table of Contents</a></p> <a name="gene"></a> <h2>The basics</h2> The genome browser contains many gene annotation tracks. Our users often wonder what these contain and where the information that we present comes from. 
<h6>What is a gene?</h6> <p> The exact definition of "gene" depends on the context. In the context of @@ -598,17 +600,73 @@ information on GTF format can be found <a target="_blank" href="FAQformat.html#format4"> in our FAQ</a>.</p> <p> These files are generated for four gene model tables: ncbiRefSeq, refGene, ensGene, knownGene. Certain assemblies, such as hg19, will have all four files while smaller assemblies may only have one or two of these. Which file a user should use depends on their analysis, as they contain different data and metadata.</p> <p> These files are generated using the <code>genePredToGtf</code> method described in our <a target="_blank" href="https://genome.ucsc.edu/FAQ/FAQdownloads.html#download37"> downloads FAQ</a> using the <code>-utr</code> flag. They can be found on the download server address <i>http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/genes/</i> where <i>$db</i> is the assembly of interest. For example, the <a target="_blank" href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/">hg38 GTF files</a>.</p> +<a name="coding"></a> +<h2>What is the best way to get only coding genes (or only non-coding genes) +out of GENCODE (or other gene) tables?</h2> +<h3>Coding genes</h3> +<p> +One option for GENCODE is to use the Public MySQL server and the following query: +<pre> + mysql --user=genome --host=genome-mysql.soe.ucsc.edu -Ne 'select * from wgEncodeGencodeBasicV39 where cdsStartStat = "cmpl" and cdsEndStat = "cmpl";' hg38 +</pre></p> +<p> +What this query does is access the hg38 database, and then from the wgEncodeGencodeBasicV39 table, +looks to the fields of cdsStartStat and cdsEndStat for only those entries with the value cmpl, +showing "CDS is complete" at the start and end, so that these are genes that are +protein-coding entries, thereby excluding non-coding RNA genes.</p> +<p> +For the manually curated RefSeq gene set transcript identifiers start with NM_ for coding +or NR_ for non-coding, followed by a 
number and version number separated by a dot, e.g. +"NR_046018.2" for an RNA pseudogene. For RefSeq, one can select coding genes by +filtering for NM identifiers.</p> +<p>If using the UCSC knownGene table one can filter for where the coding start +and coding end fields of the table are not equivalent, e.g. +<code>knownGene.cdsStart != knownGene.cdsEnd</code>, which would ensure the selected +entries are coding genes.</p> +<p> +You can also <a href="https://groups.google.com/u/1/a/soe.ucsc.edu/g/genome/search?q=only%20coding%20genes" +target="_blank">search our mailing-list archives</a> to read further details about only +obtaining coding genes from the UCSC Genome Browser.</p> +<a name="nonCoding"></a> +<h3>Non-coding genes</h3> +<p> +The steps for selecting non-coding genes are essentially the opposite of the steps to select +only coding genes. One option for GENCODE is to use the Public MySQL server +and the following query:</p> +<p> +<pre> + mysql --user=genome --host=genome-mysql.soe.ucsc.edu -Ne 'select * from wgEncodeGencodeBasicV39 where cdsStartStat != "cmpl" and cdsEndStat != "cmpl";' hg38 +</pre></p> +<p> +What this query does is access the hg38 database, and then from the wgEncodeGencodeBasicV39 table, +looks to the fields of cdsStartStat and cdsEndStat for only those entries without the value cmpl +(which indicates "CDS is complete") at the start and end, so that these are entries that are not +protein-coding, thereby including only non-coding RNA genes.</p> +<p> +For the manually curated RefSeq gene set, transcript identifiers start with NM_ for coding +or NR_ for non-coding, followed by a number and version number separated by a dot, e.g. +"NR_046018.2" for an RNA pseudogene. For RefSeq, one can select non-coding genes by +filtering for NR identifiers.</p> +<p>If using the UCSC knownGene table one can filter for where the coding start +and coding end fields of the table are equivalent, e.g. 
+<code>knownGene.cdsStart = knownGene.cdsEnd</code>, which would ensure the selected +entries are non-coding genes.</p> +<p> +You can also <a href="https://groups.google.com/u/1/a/soe.ucsc.edu/g/genome/search?q=only%20non-coding%20genes" +target="_blank">search our mailing-list archives</a> to read further details about only +obtaining non-coding genes from the UCSC Genome Browser.</p> + <!--#include virtual="$ROOT/inc/gbPageEnd.html" -->
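The knownGene rule added in this entry (coding when cdsStart and cdsEnd differ, non-coding when they are equal) can be illustrated with a minimal Python sketch. It is not part of the commit. The function name split_by_coding and the sample rows are hypothetical, and the rows are assumed to be dictionaries keyed by genePred column names, for example after loading a table dump.

```python
def split_by_coding(rows):
    """Partition genePred-style rows into (coding, non_coding) lists.

    Assumption: each row is a dict with integer-like 'cdsStart' and
    'cdsEnd' values, as in the knownGene table described in the FAQ.
    """
    coding, non_coding = [], []
    for row in rows:
        # A transcript with an empty CDS span (start == end) is non-coding.
        if int(row["cdsStart"]) != int(row["cdsEnd"]):
            coding.append(row)
        else:
            non_coding.append(row)
    return coding, non_coding


# Hypothetical sample rows, not real annotations.
rows = [
    {"name": "tx_coding", "cdsStart": 1000, "cdsEnd": 2000},
    {"name": "tx_noncoding", "cdsStart": 5000, "cdsEnd": 5000},
]
coding, non_coding = split_by_coding(rows)
print([r["name"] for r in coding])
print([r["name"] for r in non_coding])
```

The same comparison could be applied to rows returned by the public MySQL server. The sketch only shows the filtering step and makes no claim about how the rows are fetched.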