60ecb412e9bba9ed0a0665a45051fa63952c1035
brianlee
  Fri Mar 4 15:37:56 2022 -0800
More MarkD input on non-coding GENCODE FAQ (just show lncRNAs) refs #29030

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index 886c96d..8b12cd0 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -605,87 +605,103 @@
 one or two of these. Which file a user should use depends on their analysis, as they contain 
 different data and metadata.</p>
 <p>
 These files are generated using the <code>genePredToGtf</code> method described in our 
 <a target="_blank" href="https://genome.ucsc.edu/FAQ/FAQdownloads.html#download37">
 downloads FAQ</a> using the <code>-utr</code> flag. They can be found on the download server 
 address <i>http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/genes/</i> where
 <i>$db</i> is the assembly of interest. For example, the <a target="_blank" 
 href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/">hg38 GTF files</a>.</p>
 
 <a name="coding"></a>
 <h2>What is the best way to get only coding genes (or only non-coding genes)
 out of GENCODE (or other gene) tables?</h2>
 <h3>Coding genes</h3>
 <p>
-One option for GENCODE is to use the Public MySQL server and the following query:
+The best approach to get protein-coding genes out of GENCODE is to join data with a
+related attributes table, and specifically name the desired biotype(s).</p>
+<p>
+Here is an introductory example using the Public MySQL server to access the wgEncodeGencodeBasicV39
+table of all genes and the wgEncodeGencodeAttrsV39 related table to find the transcriptType for each
+entry and to select those that are annotated as protein-coding genes. There are a number of
+biotypes that can be accessed by looking at the table scheme and clicking the values link for the
+<a href="http://genome.ucsc.edu/cgi-bin/hgTables?hgta_database=hg38&hgta_histoTable=wgEncodeGencodeAttrsV39&hgta_doValueHistogram=transcriptType"
+target="_blank">transcriptType</a> field. These terms are also more fully described on the GENCODE
+<a href=" https://www.gencodegenes.org/pages/biotypes.html" target="_blank">biotypes page</a>.
+The below example will attempt to make a simple example to select
+all types that have &quot;protein_coding&quot; in this transcriptType field:</p>
+<p>
 <pre>
- mysql --user=genome --host=genome-mysql.soe.ucsc.edu -Ne 'select * from wgEncodeGencodeBasicV39 where cdsStartStat = "cmpl" and cdsEndStat = "cmpl";' hg38 
+hgsql hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g, wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType = "protein_coding");'
 </pre></p>
 <p>
 What this query does is access the hg38 database, and then from the wgEncodeGencodeBasicV39 table,
-looks to the fields of cdsStartStat and cdsEndStat for only those entries with the value cmpl,
-showing &quot;CDS is complete&quot; at the start and end, so that these are genes that are
-protein-coding entries, thereby excluding non-coding RNA genes. Please note that this query
-will return genes one may not want (or want), such as Immunoglobulin and T-cell receptor components.</p>
+it takes the name field (g.name) and looks in the related wgEncodeGencodeAttrsV39 table for a matching
+transcriptId field (g.name = a.transcriptId), and then screens for only entries in wgEncodeGencodeAttrsV39
+that are equal to protein-coding (a.transcriptType = &quot;protein_coding&quot;).
+In this way selecting all the entries which are annotated as protein-coding.
+Please note this selection will return some of the unusual protein-coding cases
+that one would not consider, for instance, it will return genes one may not want
+(or want), such as Immunoglobulin and T-cell receptor components.</p>
 <p>
 For the manually curated RefSeq gene set transcript identifiers start with NM_ for coding
 or NR_ for non-coding, followed by a number and version number separated by a dot, e.g.
 &quot;NR_046018.2&quot; for an RNA pseudogene. For RefSeq one can select coding genes by
 filtering for NM identifiers. On the concept of genes, it may be worth noting that the
 NR_046018.2 example is a transcribed pseudogene of an mRNA. So it is considered an RNA,
 and by many a lncRNA (long non-coding RNA), where the whole idea of transcribed pseudogenes
 is not an unambiguous concept to a lot of biologists. For some, another example, &quot;NR_106918.1&quot;
-represents a miRNA (microRNA), which are short (20-24 nt) non-coding RNAs, may provide a
-more familiar idea of the kind of non-coding elements desired to be removed.
+represents a miRNA (microRNA), which are short (20-24 nt) non-coding RNAs, which may provide a
+more familiar idea of the kind of non-coding elements desired to be removed from a gene set.
 </p>
 <p>If using the UCSC knownGene table one can filter for where the coding start
 and coding end fields of the table are not equivalent, e.g.
 <code>knownGene.cdsStart != knownGene.cdsEnd</code>, which would ensure the selected
 entries are coding genes.</p>
 <p>
 You can also <a href="https://groups.google.com/u/1/a/soe.ucsc.edu/g/genome/search?q=only%20coding%20genes"
 target="_blank">search our mailing-list archives</a> to read further details about only
 obtaining coding genes from the UCSC Genome Browser.</p>
 <a name="nonCoding"></a>
 <h3>Non-coding genes</h3>
 <p>
 The steps for selecting non-coding genes are not exactly the opposite of the steps to select
 only coding genes. The above discussion introduced the idea of lncRNA (long non-coding RNA)
 and miRNA (microRNA), hinting at the abundant types of RNA molecules.</p>
 <p>
 Since there are many different kinds of non-coding elements in GENCODE, a better step for non-coding
-selection is to join data with a related attributes table, and specifically name a desired biotype or biotypes.</p>
-<p>
-Here is an introductory example using the Public MySQL server to access the wgEncodeGencodeBasicV39 table of all genes
-and the wgEncodeGencodeAttrsV39 related table to find the transcriptType for each entry and simply to select not
-protein-coding genes. There are a number of biotypes that can be accessed by looking at the table scheme
-and clicking the values link for the
+selection is to join data with a related attributes table, and specifically name a specific
+desired biotype or biotypes, such as only lncRNAs. There are a number of biotypes that can be
+accessed by looking at the table scheme and clicking the values link for the
 <a href="http://genome.ucsc.edu/cgi-bin/hgTables?hgta_database=hg38&hgta_histoTable=wgEncodeGencodeAttrsV39&hgta_doValueHistogram=transcriptType"
-target="_blank">transcriptType</a> field. The below example will attempt to make a simple example to select
-all types that do not have &quot;protein_coding&quot; in this field:</p>
+target="_blank">transcriptType</a> field. These terms are also more fully described on the GENCODE
+<a href=" https://www.gencodegenes.org/pages/biotypes.html" target="_blank">biotypes page</a>.</p>
+<p>
+Here is an introductory example using the Public MySQL server to access the wgEncodeGencodeBasicV39 table
+of all genes and the wgEncodeGencodeAttrsV39 related table to find the transcriptType for each entry and
+to select just lncRNA entries.</p>
 <p>
 <pre>
-hgsql hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g, wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType != "protein_coding") order by g.name;' 
+hgsql hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g, wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType = "lncRNA");' 
 </pre></p>
 <p>
 What this query does is access the hg38 database, and then from the wgEncodeGencodeBasicV39 table,
 it takes the name field (g.name) and looks in the related wgEncodeGencodeAttrsV39 table for a matching
 transcriptId field (g.name = a.transcriptId), and then screens for only entries in wgEncodeGencodeAttrsV39
-that are not equal to protein-coding (a.transcriptType != "protein_coding").  In this way selecting all the
-many types, which again may not be the entire subset desired. By modifying the above query it is possible
-to add further qualifiers and refine the subset of non-coding elements desired.</p>
+that are equal to lncRNA (a.transcriptType = &quot;lncRNA&quot;).  In this way selecting all of these types,
+which again, may not be the only subset desired. By modifying the above query it is possible to add
+further qualifiers and generate a subset of different non-coding elements meeting specific research needs.</p>
 <p>
 For the manually curated RefSeq gene set transcript identifiers start with NM_ for coding
 or NR_ for non-coding, followed by a number and version number separated by a dot, e.g.
 &quot;NR_046018.2&quot; for an RNA pseudogene. For RefSeq, one can select non-coding genes by
 filtering for NR identifiers. Note that a pseudogene of mRNA is not an unambiguous concept,
 and there may be a desire to look further to select certain subset types as mentioned above.</p>
 <p>
 If using the UCSC knownGene table one can filter for where the coding start
 and coding end fields of the table are equivalent, e.g.
 <code>knownGene.cdsStart = knownGene.cdsEnd</code>, which would ensure the selected
 entries are non-coding genes.</p>
 <p>
 You can also <a href="https://groups.google.com/u/1/a/soe.ucsc.edu/g/genome/search?q=only%20non-coding%20genes"
 target="_blank">search our mailing-list archives</a> to read further details about only
 obtaining non-coding genes from the UCSC Genome Browser.</p>