fc450d4031fd9494b3a5bd8ee0fbeba83905a61a
brianlee
Thu Mar 3 16:48:59 2022 -0800
Edits to new coding/non-coding gene FAQ refs #29030

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index 77ad926..a30ec1e 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -617,56 +617,56 @@ out of GENCODE (or other gene) tables?</h2>
 <h3>Coding genes</h3>
 <p>
 One option for GENCODE is to use the Public MySQL server and the following query:
 <pre>
 mysql --user=genome --host=genome-mysql.soe.ucsc.edu -Ne 'select * from wgEncodeGencodeBasicV39 where cdsStartStat = "cmpl" and cdsEndStat = "cmpl";' hg38
 </pre></p>
 <p>
 What this query does is access the hg38 database, and then from the wgEncodeGencodeBasicV39 table,
 looks to the fields of cdsStartStat and cdsEndStat for only those entries with the value cmpl,
 showing "CDS is complete" at the start and end, so that these are genes that are
 protein-coding entries, thereby excluding non-coding RNA genes.</p>
 <p>
 For the manually curated RefSeq gene set transcript identifiers start with NM_ for coding or NR_
 for non-coding, followed by a number and version number separated by a dot, e.g.
-"NR_046018.2" for a RNA pseudogene. For RefSeq one can select coding genes by
+"NR_046018.2" for an RNA pseudogene. For RefSeq one can select coding genes by
 filtering for NM identifiers.</p>
 <p>If using the UCSC knownGene table one can filter for where the coding start and coding
 end fields of the table are not equivalent, e.g.
 <code>knownGene.cdsStart != knownGene.cdsEnd</code>, which would ensure the selected
 entries are coding genes.</p>
 <p>
 You can also <a href="https://groups.google.com/u/1/a/soe.ucsc.edu/g/genome/search?q=only%20coding%20genes" target="_blank">search our
 mailing-list archives</a> to read further details about only obtaining coding genes from the
 UCSC Genome Browser.</p>
 <a name="nonCoding"></a>
 <h3>Non-coding genes</h3>
 <p>
 The steps for selecting non-coding genes are essentially the opposite of the steps to select
 only coding genes. One option for GENCODE is to use the Public MySQL server and the
 following query:</p>
 <p>
 <pre>
 mysql --user=genome --host=genome-mysql.soe.ucsc.edu -Ne 'select * from wgEncodeGencodeBasicV39 where cdsStartStat != "cmpl" and cdsEndStat != "cmpl";' hg38
 </pre></p>
 <p>
 What this query does is access the hg38 database, and then from the wgEncodeGencodeBasicV39 table,
 looks to the fields of cdsStartStat and cdsEndStat for only those entries without the value cmpl,
-showing "CDS is complete" at the start and end, so that these are genes that are
-protein-coding entries, thereby including only non-coding RNA genes.</p>
+showing "CDS is complete" at the start and end, so that this removes genes that are
+protein-coding entries, thereby selecting only non-coding RNA genes.</p>
 <p>
 For the manually curated RefSeq gene set transcript identifiers start with NM_ for coding or NR_
 for non-coding, followed by a number and version number separated by a dot, e.g.
-"NR_046018.2" for an RNA pseudogene. For RefSeq, one can select coding genes by
+"NR_046018.2" for an RNA pseudogene. For RefSeq, one can select non-coding genes by
 filtering for NR identifiers.</p>
 <p>If using the UCSC knownGene table one can filter for where the coding start and coding
 end fields of the table are equivalent, e.g.
 <code>knownGene.cdsStart = knownGene.cdsEnd</code>, which would ensure the selected
 entries are non-coding genes.</p>
 <p>
 You can also <a href="https://groups.google.com/u/1/a/soe.ucsc.edu/g/genome/search?q=only%20non-coding%20genes" target="_blank">search our
 mailing-list archives</a> to read further details about only obtaining non-coding genes from the
 UCSC Genome Browser.</p>
 <!--#include virtual="$ROOT/inc/gbPageEnd.html" -->
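
Note on the rules this FAQ text describes: the two non-MySQL filters (the RefSeq NM_/NR_ prefix and the knownGene cdsStart/cdsEnd comparison) can be sketched in a few lines of Python for rows that have already been downloaded. This is only an illustrative sketch, not part of the commit. The `Transcript` type, the helper names, the coordinates and the "NM_000000.1" accession are hypothetical; only "NR_046018.2" and the filtering rules come from the FAQ text.

```python
from typing import List, NamedTuple, Optional


class Transcript(NamedTuple):
    """Hypothetical minimal row: a name plus coding start/end coordinates."""
    name: str
    cds_start: int
    cds_end: int


def is_coding_by_cds(t: Transcript) -> bool:
    # knownGene rule from the FAQ: cdsStart != cdsEnd means coding,
    # cdsStart = cdsEnd means non-coding.
    return t.cds_start != t.cds_end


def is_coding_by_refseq_prefix(name: str) -> Optional[bool]:
    # RefSeq rule from the FAQ: NM_ is coding, NR_ is non-coding.
    # Any other prefix is left undecided (None) rather than guessed.
    if name.startswith("NM_"):
        return True
    if name.startswith("NR_"):
        return False
    return None


def select_coding(rows: List[Transcript]) -> List[Transcript]:
    return [t for t in rows if is_coding_by_cds(t)]


def select_non_coding(rows: List[Transcript]) -> List[Transcript]:
    return [t for t in rows if not is_coding_by_cds(t)]


# Illustrative rows only; the coordinates are made up.
rows = [
    Transcript("NM_000000.1", 100, 200),  # hypothetical coding entry
    Transcript("NR_046018.2", 300, 300),  # accession taken from the FAQ text
]
coding_names = [t.name for t in select_coding(rows)]
non_coding_names = [t.name for t in select_non_coding(rows)]
```

The two selectors are deliberate complements of each other, mirroring the FAQ's statement that the non-coding steps are the opposite of the coding steps.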