3d6c26a28372e86f6477032ec6210a4583dc55b1 brianlee Wed Mar 9 12:56:13 2022 -0800
Adding requested commas in code review refs #29059
diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index 8b12cd0..7ab2748 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -631,41 +631,41 @@ all types that have "protein_coding" in this transcriptType field:</p>
 <p>
 <pre>
 hgsql hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g, wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType = "protein_coding");'
 </pre></p>
 <p>
 What this query does is access the hg38 database, and then from the wgEncodeGencodeBasicV39 table, it takes the name field (g.name) and looks in the related wgEncodeGencodeAttrsV39 table for a matching transcriptId field (g.name = a.transcriptId), and then screens for only entries in wgEncodeGencodeAttrsV39 that are equal to protein-coding (a.transcriptType = "protein_coding"). In this way selecting all the entries which are annotated as protein-coding. Please note this selection will return some of the unusual protein-coding cases that one would not consider, for instance, it will return genes one may not want (or want), such as Immunoglobulin and T-cell receptor components.</p>
 <p>
-For the manually curated RefSeq gene set transcript identifiers start with NM_ for coding
+For the manually curated RefSeq gene set, transcript identifiers start with NM_ for coding
 or NR_ for non-coding, followed by a number and version number separated by a dot, e.g. "NR_046018.2" for an RNA pseudogene. For RefSeq one can select coding genes by filtering for NM identifiers. On the concept of genes, it may be worth noting that the NR_046018.2 example is a transcribed pseudogene of an mRNA. So it is considered an RNA, and by many a lncRNA (long non-coding RNA), where the whole idea of transcribed pseudogenes is not an unambiguous concept to a lot of biologists.
 For some, another example, "NR_106918.1" represents a miRNA (microRNA), which are short (20-24 nt) non-coding RNAs, which may provide a more familiar idea of the kind of non-coding elements desired to be removed from a gene set. </p>
-<p>If using the UCSC knownGene table one can filter for where the coding start
+<p>If using the UCSC knownGene table, one can filter for where the coding start
 and coding end fields of the table are not equivalent, e.g. <code>knownGene.cdsStart != knownGene.cdsEnd</code>, which would ensure the selected entries are coding genes.</p>
 <p>
 You can also <a href="https://groups.google.com/u/1/a/soe.ucsc.edu/g/genome/search?q=only%20coding%20genes" target="_blank">search our mailing-list archives</a> to read further details about only obtaining coding genes from the UCSC Genome Browser.</p>
 <a name="nonCoding"></a>
 <h3>Non-coding genes</h3>
 <p>
 The steps for selecting non-coding genes are not exactly the opposite of the steps to select only coding genes. The above discussion introduced the idea of lncRNA (long non-coding RNA) and miRNA (microRNA), hinting at the abundant types of RNA molecules.</p>
 <p>
 Since there are many different kinds of non-coding elements in GENCODE, a better step for non-coding
@@ -676,35 +676,35 @@ target="_blank">transcriptType</a> field.
 These terms are also more fully described on the GENCODE <a href=" https://www.gencodegenes.org/pages/biotypes.html" target="_blank">biotypes page</a>.</p>
 <p>
 Here is an introductory example using the Public MySQL server to access the wgEncodeGencodeBasicV39 table of all genes and the wgEncodeGencodeAttrsV39 related table to find the transcriptType for each entry and to select just lncRNA entries.</p>
 <p>
 <pre>
 hgsql hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g, wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType = "lncRNA");'
 </pre></p>
 <p>
 What this query does is access the hg38 database, and then from the wgEncodeGencodeBasicV39 table, it takes the name field (g.name) and looks in the related wgEncodeGencodeAttrsV39 table for a matching transcriptId field (g.name = a.transcriptId), and then screens for only entries in wgEncodeGencodeAttrsV39 that are equal to lncRNA (a.transcriptType = "lncRNA"). In this way selecting all of these types,
-which again, may not be the only subset desired. By modifying the above query it is possible to add
+which again, may not be the only subset desired. By modifying the above query, it is possible to add
 further qualifiers and generate a subset of different non-coding elements meeting specific research needs.</p>
 <p>
-For the manually curated RefSeq gene set transcript identifiers start with NM_ for coding
+For the manually curated RefSeq gene set, transcript identifiers start with NM_ for coding
 or NR_ for non-coding, followed by a number and version number separated by a dot, e.g. "NR_046018.2" for an RNA pseudogene. For RefSeq, one can select non-coding genes by filtering for NR identifiers.
 Note that a pseudogene of mRNA is not an unambiguous concept, and there may be a desire to look further to select certain subset types as mentioned above.</p>
 <p>
-If using the UCSC knownGene table one can filter for where the coding start
+If using the UCSC knownGene table, one can filter for where the coding start
 and coding end fields of the table are equivalent, e.g. <code>knownGene.cdsStart = knownGene.cdsEnd</code>, which would ensure the selected entries are non-coding genes.</p>
 <p>
 You can also <a href="https://groups.google.com/u/1/a/soe.ucsc.edu/g/genome/search?q=only%20non-coding%20genes" target="_blank">search our mailing-list archives</a> to read further details about only obtaining non-coding genes from the UCSC Genome Browser.</p>
 <!--#include virtual="$ROOT/inc/gbPageEnd.html" -->
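
The two knownGene filters and the RefSeq prefix rule described in the FAQ text touched by this diff can be tried out on a small scale. The following is only a minimal sketch under stated assumptions and is not part of the commit: it uses an in-memory SQLite table as a stand-in for knownGene (keeping just the name, cdsStart, and cdsEnd columns; the real table has more), the rows and transcript names are invented, the NM_ accession in the usage lines is a placeholder, and the helper names `build_toy_db`, `select_names`, and `refseq_is_coding` are hypothetical.

```python
import sqlite3

# Invented rows for illustration only; these are not real UCSC annotations.
TOY_ROWS = [
    ("tx_coding_1", 1000, 2000),     # cdsStart != cdsEnd, treated as coding
    ("tx_noncoding_1", 5000, 5000),  # cdsStart == cdsEnd, treated as non-coding
    ("tx_coding_2", 300, 900),       # cdsStart != cdsEnd, treated as coding
]


def build_toy_db():
    """Create an in-memory table with a reduced, assumed knownGene layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE knownGene (name TEXT, cdsStart INTEGER, cdsEnd INTEGER)"
    )
    conn.executemany("INSERT INTO knownGene VALUES (?, ?, ?)", TOY_ROWS)
    return conn


def select_names(conn, coding):
    """Return transcript names, sorted, using the cdsStart/cdsEnd comparison.

    coding=True  applies knownGene.cdsStart != knownGene.cdsEnd
    coding=False applies knownGene.cdsStart =  knownGene.cdsEnd
    """
    op = "!=" if coding else "="
    query = f"SELECT name FROM knownGene WHERE cdsStart {op} cdsEnd ORDER BY name"
    return [row[0] for row in conn.execute(query)]


def refseq_is_coding(accession):
    """Classify a curated RefSeq accession by prefix.

    Returns True for NM_ (coding), False for NR_ (non-coding), and None for
    any other prefix, which this sketch does not try to interpret.
    """
    if accession.startswith("NM_"):
        return True
    if accession.startswith("NR_"):
        return False
    return None


if __name__ == "__main__":
    conn = build_toy_db()
    print(select_names(conn, coding=True))
    print(select_names(conn, coding=False))
    print(refseq_is_coding("NR_046018.2"))  # accession quoted in the FAQ text
    print(refseq_is_coding("NM_000000.1"))  # placeholder accession
```

Against the real data, the same comparison would go in the WHERE clause of a query on the hg38 knownGene table, in the style of the hgsql examples shown in the diff context.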