90206b0ea2036dba8693fa886347d9368dce8efb
jnavarr5
  Tue Apr 22 14:38:39 2025 -0700
Documenting a frequently asked question, how to filter by biotype when using GENCODE, NCBI RefSeq, or Ensembl, refs #24243

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index 9be33d1c7bc..c9ba8302d5b 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -612,30 +612,132 @@
 is available in the RefSeq Select track under NCBI RefSeq.
 <li>MANE: RefSeq and the EBI
 also select one transcript for every protein coding gene that is annotated exactly 
 the same in both Gencode and RefSeq, a project called <a
 href="https://www.ncbi.nlm.nih.gov/refseq/MANE/"
 target=_blank>"MANE select"</a>, which is another subtrack of NCBI RefSeq. "MANE select"
 can be considered a subset of RefSeq Select.
 <li>HGMD: For the special case of clinical diagnostics
 where an even more reduced number of transcripts simplifies visual inspection,
 we provide another subtrack, "RefSeq HGMD". It contains
 (usually) a single transcript only for genes known to cause human genetic diseases and
 the transcript is the one to which all reported HGMD clinical variants can be mapped to.
 This transcript set is also a good choice for variant reporting.
 </ul>
 
+<a name="bioTypeFilter"></a>
+<h2>How can I filter by bioType from GENCODE/RefSeq/Ensembl?</h2>
+<p>
+A common request is to filter the GENCODE, NCBI RefSeq, or Ensembl gene tracks by bioType
+and download the resulting data.
+The Table Browser is a powerful tool that can be used to filter for these bioTypes. If you are
+unfamiliar with creating filters using the Table Browser, please refer to the following
+<a href="/goldenPath/help/hgTablesHelp.html#Filter" target="_blank">help page</a>. 
+Examples of the bioType categories available in the UCSC Genome Browser include
+<em>protein-coding</em>, <em>non-coding</em>, <em>lncRNA</em>, <em>miRNA</em>, <em>piRNA</em>,
+and <em>pseudogene</em>.
+</p>
+<p>
+When using the <b>GENCODE</b> track, you can query the
+<a href="/cgi-bin/hgTables?hgta_doSchemaDb=hg38&hgta_doSchemaTable=knownAttrs"
+target="_blank">knownAttrs</a> table to get bioType
+information. Specifically, the fields <code>geneType</code> and <code>transcriptType</code> contain
+the biotype information for the GENCODE track. The following output is from a
+query of the hg38.knownAttrs table, selecting the <code>kgID</code>, <code>geneType</code>, and
+<code>transcriptType</code> fields:
+</p>
+<pre>
+#hg38.knownAttrs.kgID	hg38.knownAttrs.geneType	hg38.knownAttrs.transcriptType
+ENST00000622482.1	miRNA				miRNA
+ENST00000612139.1	misc_RNA			misc_RNA
+ENST00000625598.1	lncRNA				lncRNA
+ENST00000613359.1	rRNA				rRNA
+ENST00000780342.1	lncRNA				lncRNA
+ENST00000650962.1	TEC				TEC
+ENST00000625020.1	processed_pseudogene		processed_pseudogene
+ENST00000559466.1	transcribed_unprocessed_pseudogene	transcribed_unprocessed_pseudogene
+ENST00000495055.1	protein_coding			protein_coding_CDS_not_defined
+ENST00000461088.1	protein_coding			retained_intron
+ENST00000467409.7	lncRNA				retained_intron
+ENST00000362728.1	snRNA				snRNA
+</pre>
+<p>
+The <b>NCBI RefSeq</b> track uses the
+<a href="/cgi-bin/hgTables?hgta_doSchemaDb=hg38&hgta_doSchemaTable=ncbiRefSeqLink"
+target="_blank">ncbiRefSeqLink</a> table to get the biotype for its entries. The field
+<code>gene_biotype</code> contains biotype information. The following output is from the
+hg38.ncbiRefSeqLink table, selecting the <code>mrnaAcc</code> and <code>gene_biotype</code> fields:
+</p>
+<pre>
+#hg38.ncbiRefSeqLink.mrnaAcc	hg38.ncbiRefSeqLink.gene_biotype
+NM_145005.7			protein_coding
+NR_030618.1			miRNA
+XR_929535.2			lncRNA
+XR_007061920.1			snoRNA
+NR_023917.1			transcribed_pseudogene
+NR_003051.4			RNase_MRP_RNA
+XR_007061908.1			snRNA
+XR_007061868.1			misc_RNA
+</pre>
+<p>
+For both the <b>GENCODE</b> and <b>NCBI RefSeq</b> tracks, you can query the
+<a href="/cgi-bin/hgTables?hgta_doSchemaDb=hg38&hgta_doSchemaTable=kgXref"
+target="_blank">kgXref</a>
+table to extract bioType information. The <code>description</code> field in the kgXref table can be
+used to filter entries from the GENCODE and NCBI RefSeq tables. However, the biotype information
+may be embedded in a free-text description, which might require extra parsing or scripting to
+produce a tab-separated file.
+</p>
+<p>
+<em>Since the primary table is from GENCODE, there may not be an NCBI RefSeq identifier listed in
+the output.</em> The following output is from the hg38.kgXref table, selecting the
+<code>kgID</code>, <code>refseq</code>, and <code>description</code> fields:
+</p>
+<pre>
+#hg38.kgXref.kgID	hg38.kgXref.refseq	hg38.kgXref.description
+ENST00000622482.1	NR_128716		microRNA 6724-3 (from RefSeq NR_128716.1)
+ENST00000612139.1				ENSG00000274868 (from geneSymbol)
+ENST00000625598.1				ENSG00000280614 (from geneSymbol)
+ENST00000613359.1	NR_146153		RNA, 5.8S ribosomal RNA N3 (from RefSeq NR_146153.1)
+ENST00000780342.1				ENSG00000280441 (from geneSymbol)
+ENST00000650962.1				ENSG00000286267 (from geneSymbol)
+ENST00000625020.1				ribosomal protein SA pseudogene 68 (from HGNC RPSAP68)
+ENST00000559466.1				tektin 4 pseudogene 2 (from HGNC TEKT4P2)
+ENST00000495055.1	NR_135309		RNA binding motif protein 11, transcript variant 4 (from RefSeq NR_135309.2)
+ENST00000461088.1				RNA binding motif protein 11 (from HGNC RBM11)
+ENST00000467409.7	NR_003087		ATP binding cassette subfamily C member 13 (pseudogene), transcript variant A (from RefSeq NR_003087.1)
+ENST00000362728.1				RNA, U6 small nuclear 859, pseudogene (from HGNC RNU6-859P)</pre>
+<p>
+When using the <b>Ensembl</b> track, you can use the <code>source</code> field from the
+<a href="/cgi-bin/hgTables?hgta_doSchemaDb=hg19&hgta_doSchemaTable=ensemblSource"
+target="_blank">ensemblSource</a> table. The following is output from the hg19.ensemblSource table,
+selecting the <code>name</code> and <code>source</code> fields:
+</p>
+<pre>
+#hg19.ensemblSource.name	hg19.ensemblSource.source
+ENST00000596669			retained_intron
+ENST00000516163			snRNA
+ENST00000448850			protein_coding
+ENST00000463070			processed_transcript
+ENST00000455275			antisense
+ENST00000608591			lincRNA
+ENST00000384075			snRNA
+ENST00000460212			nonsense_mediated_decay
+ENST00000450472			processed_pseudogene
+ENST00000581654			miRNA
+</pre>
+
 <a name="whatdo"></a>
 <h2>This is rather complicated. Can you tell me which gene transcript track I should use?</h2>
 <p> 
 For automated analysis, if you are doing NGS analysis and you need to capture
 all possible transcripts, GENCODE provides one of the most comprehensive gene sets.  For human 
 genetics or variant annotation, a more restricted transcript set is usually sufficient and &quot;NCBI
 RefSeq&quot; is the standard. If you are only interested in protein-coding
 annotations, CCDS or UniProt may be an option, but this is rather unusual.
 If you are interested in the best splice site coverage, AceView is worth a
 look.
 </p>
 
 <p>
 For manual inspection of exon boundaries of a single gene, and especially if it
 is a transcript that is repetitive or hard to align (e.g. very small exons),