60ecb412e9bba9ed0a0665a45051fa63952c1035 brianlee Fri Mar 4 15:37:56 2022 -0800 More MarkD input on non-coding GENCODE FAQ (just show lncRNAs) refs #29030 diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html index 886c96d..8b12cd0 100755 --- src/hg/htdocs/FAQ/FAQgenes.html +++ src/hg/htdocs/FAQ/FAQgenes.html @@ -605,87 +605,103 @@ one or two of these. Which file a user should use depends on their analysis, as they contain different data and metadata.</p> <p> These files are generated using the <code>genePredToGtf</code> method described in our <a target="_blank" href="https://genome.ucsc.edu/FAQ/FAQdownloads.html#download37"> downloads FAQ</a> using the <code>-utr</code> flag. They can be found on the download server address <i>http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/genes/</i> where <i>$db</i> is the assembly of interest. For example, the <a target="_blank" href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/">hg38 GTF files</a>.</p> <a name="coding"></a> <h2>What is the best way to get only coding genes (or only non-coding genes) out of GENCODE (or other gene) tables?</h2> <h3>Coding genes</h3> <p> -One option for GENCODE is to use the Public MySQL server and the following query: +The best approach to get protein-coding genes out of GENCODE is to join data with a +related attributes table, and specifically name the desired biotype(s).</p> +<p> +Here is an introductory example using the Public MySQL server to access the wgEncodeGencodeBasicV39 +table of all genes and the wgEncodeGencodeAttrsV39 related table to find the transcriptType for each +entry and to select those that are annotated as protein-coding genes. There are a number of +biotypes that can be accessed by looking at the table scheme and clicking the values link for the +<a href="http://genome.ucsc.edu/cgi-bin/hgTables?hgta_database=hg38&hgta_histoTable=wgEncodeGencodeAttrsV39&hgta_doValueHistogram=transcriptType" +target="_blank">transcriptType</a> field. These terms are also more fully described on the GENCODE +<a href=" https://www.gencodegenes.org/pages/biotypes.html" target="_blank">biotypes page</a>. +The below example will attempt to make a simple example to select +all types that have "protein_coding" in this transcriptType field:</p> +<p> <pre> - mysql --user=genome --host=genome-mysql.soe.ucsc.edu -Ne 'select * from wgEncodeGencodeBasicV39 where cdsStartStat = "cmpl" and cdsEndStat = "cmpl";' hg38 +hgsql hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g, wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType = "protein_coding");' </pre></p> <p> What this query does is access the hg38 database, and then from the wgEncodeGencodeBasicV39 table, -looks to the fields of cdsStartStat and cdsEndStat for only those entries with the value cmpl, -showing "CDS is complete" at the start and end, so that these are genes that are -protein-coding entries, thereby excluding non-coding RNA genes. Please note that this query -will return genes one may not want (or want), such as Immunoglobulin and T-cell receptor components.</p> +it takes the name field (g.name) and looks in the related wgEncodeGencodeAttrsV39 table for a matching +transcriptId field (g.name = a.transcriptId), and then screens for only entries in wgEncodeGencodeAttrsV39 +that are equal to protein-coding (a.transcriptType = "protein_coding"). +In this way selecting all the entries which are annotated as protein-coding. +Please note this selection will return some of the unusual protein-coding cases +that one would not consider, for instance, it will return genes one may not want +(or want), such as Immunoglobulin and T-cell receptor components.</p> <p> For the manually curated RefSeq gene set transcript identifiers start with NM_ for coding or NR_ for non-coding, followed by a number and version number separated by a dot, e.g. "NR_046018.2" for an RNA pseudogene. For RefSeq one can select coding genes by filtering for NM identifiers. On the concept of genes, it may be worth noting that the NR_046018.2 example is a transcribed pseudogene of an mRNA. So it is considered an RNA, and by many a lncRNA (long non-coding RNA), where the whole idea of transcribed pseudogenes is not an unambiguous concept to a lot of biologists. For some, another example, "NR_106918.1" -represents a miRNA (microRNA), which are short (20-24 nt) non-coding RNAs, may provide a -more familiar idea of the kind of non-coding elements desired to be removed. +represents a miRNA (microRNA), which are short (20-24 nt) non-coding RNAs, which may provide a +more familiar idea of the kind of non-coding elements desired to be removed from a gene set. </p> <p>If using the UCSC knownGene table one can filter for where the coding start and coding end fields of the table are not equivalent, e.g. <code>knownGene.cdsStart != knownGene.cdsEnd</code>, which would ensure the selected entries are coding genes.</p> <p> You can also <a href="https://groups.google.com/u/1/a/soe.ucsc.edu/g/genome/search?q=only%20coding%20genes" target="_blank">search our mailing-list archives</a> to read further details about only obtaining coding genes from the UCSC Genome Browser.</p> <a name="nonCoding"></a> <h3>Non-coding genes</h3> <p> The steps for selecting non-coding genes are not exactly the opposite of the steps to select only coding genes. The above discussion introduced the idea of lncRNA (long non-coding RNA) and miRNA (microRNA), hinting at the abundant types of RNA molecules.</p> <p> Since there are many different kinds of non-coding elements in GENCODE, a better step for non-coding -selection is to join data with a related attributes table, and specifically name a desired biotype or biotypes.</p> -<p> -Here is an introductory example using the Public MySQL server to access the wgEncodeGencodeBasicV39 table of all genes -and the wgEncodeGencodeAttrsV39 related table to find the transcriptType for each entry and simply to select not -protein-coding genes. There are a number of biotypes that can be accessed by looking at the table scheme -and clicking the values link for the +selection is to join data with a related attributes table, and specifically name a specific +desired biotype or biotypes, such as only lncRNAs. There are a number of biotypes that can be +accessed by looking at the table scheme and clicking the values link for the <a href="http://genome.ucsc.edu/cgi-bin/hgTables?hgta_database=hg38&hgta_histoTable=wgEncodeGencodeAttrsV39&hgta_doValueHistogram=transcriptType" -target="_blank">transcriptType</a> field. The below example will attempt to make a simple example to select -all types that do not have "protein_coding" in this field:</p> +target="_blank">transcriptType</a> field. These terms are also more fully described on the GENCODE +<a href=" https://www.gencodegenes.org/pages/biotypes.html" target="_blank">biotypes page</a>.</p> +<p> +Here is an introductory example using the Public MySQL server to access the wgEncodeGencodeBasicV39 table +of all genes and the wgEncodeGencodeAttrsV39 related table to find the transcriptType for each entry and +to select just lncRNA entries.</p> <p> <pre> -hgsql hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g, wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType != "protein_coding") order by g.name;' +hgsql hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g, wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType = "lncRNA");' </pre></p> <p> What this query does is access the hg38 database, and then from the wgEncodeGencodeBasicV39 table, it takes the name field (g.name) and looks in the related wgEncodeGencodeAttrsV39 table for a matching transcriptId field (g.name = a.transcriptId), and then screens for only entries in wgEncodeGencodeAttrsV39 -that are not equal to protein-coding (a.transcriptType != "protein_coding"). In this way selecting all the -many types, which again may not be the entire subset desired. By modifying the above query it is possible -to add further qualifiers and refine the subset of non-coding elements desired.</p> +that are equal to lncRNA (a.transcriptType = "lncRNA"). In this way selecting all of these types, +which again, may not be the only subset desired. By modifying the above query it is possible to add +further qualifiers and generate a subset of different non-coding elements meeting specific research needs.</p> <p> For the manually curated RefSeq gene set transcript identifiers start with NM_ for coding or NR_ for non-coding, followed by a number and version number separated by a dot, e.g. "NR_046018.2" for an RNA pseudogene. For RefSeq, one can select non-coding genes by filtering for NR identifiers. Note that a pseudogene of mRNA is not an unambiguous concept, and there may be a desire to look further to select certain subset types as mentioned above.</p> <p> If using the UCSC knownGene table one can filter for where the coding start and coding end fields of the table are equivalent, e.g. <code>knownGene.cdsStart = knownGene.cdsEnd</code>, which would ensure the selected entries are non-coding genes.</p> <p> You can also <a href="https://groups.google.com/u/1/a/soe.ucsc.edu/g/genome/search?q=only%20non-coding%20genes" target="_blank">search our mailing-list archives</a> to read further details about only obtaining non-coding genes from the UCSC Genome Browser.</p>