90206b0ea2036dba8693fa886347d9368dce8efb jnavarr5 Tue Apr 22 14:38:39 2025 -0700
Documenting a frequently asked question, how to filter by biotype when using GENCODE, NCBI RefSeq, or Ensembl, refs #24243

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index 9be33d1c7bc..c9ba8302d5b 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -612,30 +612,132 @@
 is available in the RefSeq Select track under NCBI RefSeq.
+A common request is to filter by the bioType information in the GENCODE, NCBI
+RefSeq, or Ensembl gene tracks and then download the data.
+The Table Browser is a powerful tool that can be used to filter for these bioTypes. If you are
+unfamiliar with creating filters using the Table Browser, please refer to the following
+help page.
+A few examples of bioType information available in the UCSC Genome Browser include:
+protein-coding, non-coding, lncRNA, miRNA, piRNA,
+pseudogene, etc.
+
+
+When using the GENCODE track, you can query the
+knownAttrs table to get bioType
+information. Specifically, the geneType and transcriptType fields contain
+the bioType information for the GENCODE track. The following output is from a
+query of the hg38.knownAttrs table, selecting the kgID, geneType, and
+transcriptType fields:
+
+#hg38.knownAttrs.kgID  hg38.knownAttrs.geneType            hg38.knownAttrs.transcriptType
+ENST00000622482.1      miRNA                               miRNA
+ENST00000612139.1      misc_RNA                            misc_RNA
+ENST00000625598.1      lncRNA                              lncRNA
+ENST00000613359.1      rRNA                                rRNA
+ENST00000780342.1      lncRNA                              lncRNA
+ENST00000650962.1      TEC                                 TEC
+ENST00000625020.1      processed_pseudogene                processed_pseudogene
+ENST00000559466.1      transcribed_unprocessed_pseudogene  transcribed_unprocessed_pseudogene
+ENST00000495055.1      protein_coding                      protein_coding_CDS_not_defined
+ENST00000461088.1      protein_coding                      retained_intron
+ENST00000467409.7      lncRNA                              retained_intron
+ENST00000362728.1      snRNA                               snRNA
+
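If you save output like this from the Table Browser as tab-separated text, a short script can apply the bioType filter offline. The following is a minimal sketch, not an official UCSC tool: the `SAMPLE` string copies a few rows from the output above, and the helper names `read_table` and `filter_rows` are hypothetical.

```python
import csv
import io

# A few rows copied from the knownAttrs output above, tab-separated as the
# Table Browser writes them. In practice you would read a saved file instead.
SAMPLE = (
    "#hg38.knownAttrs.kgID\thg38.knownAttrs.geneType\thg38.knownAttrs.transcriptType\n"
    "ENST00000622482.1\tmiRNA\tmiRNA\n"
    "ENST00000625598.1\tlncRNA\tlncRNA\n"
    "ENST00000495055.1\tprotein_coding\tprotein_coding_CDS_not_defined\n"
    "ENST00000461088.1\tprotein_coding\tretained_intron\n"
    "ENST00000467409.7\tlncRNA\tretained_intron\n"
)


def read_table(text):
    """Parse tab-separated output into a list of dicts keyed by short field name."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # "#hg38.knownAttrs.kgID" -> "kgID": drop the leading '#' and the db.table prefix.
    fields = [name.lstrip("#").rsplit(".", 1)[-1] for name in header]
    return [dict(zip(fields, row)) for row in reader if row]


def filter_rows(rows, field, wanted):
    """Keep only the rows whose value in `field` is in the `wanted` set."""
    return [row for row in rows if row[field] in wanted]


rows = read_table(SAMPLE)
retained = filter_rows(rows, "transcriptType", {"retained_intron"})
print([row["kgID"] for row in retained])
```

Filtering on `geneType` instead of `transcriptType` gives a different result for the same rows, because a protein_coding gene can have transcripts of several transcript types.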
+For the NCBI RefSeq track, the
+ncbiRefSeqLink table provides the bioType for each entry. The
+gene_biotype field contains the bioType information. The following output is from the
+hg38.ncbiRefSeqLink table, selecting the mrnaAcc and gene_biotype fields:
+
+#hg38.ncbiRefSeqLink.mrnaAcc  hg38.ncbiRefSeqLink.gene_biotype
+NM_145005.7                   protein_coding
+NR_030618.1                   miRNA
+XR_929535.2                   lncRNA
+XR_007061920.1                snoRNA
+NR_023917.1                   transcribed_pseudogene
+NR_003051.4                   RNase_MRP_RNA
+XR_007061908.1                snRNA
+XR_007061868.1                misc_RNA
+
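Before writing a filter, it can help to see which gene_biotype values actually occur in the table, since the exact spelling (for example protein_coding rather than protein-coding) must match. The sketch below tallies the values in a saved tab-separated download; `SAMPLE` copies the rows shown above and `count_biotypes` is a hypothetical helper name.

```python
from collections import Counter

# Rows copied from the ncbiRefSeqLink output above, tab-separated.
SAMPLE = (
    "#hg38.ncbiRefSeqLink.mrnaAcc\thg38.ncbiRefSeqLink.gene_biotype\n"
    "NM_145005.7\tprotein_coding\n"
    "NR_030618.1\tmiRNA\n"
    "XR_929535.2\tlncRNA\n"
    "XR_007061920.1\tsnoRNA\n"
    "NR_023917.1\ttranscribed_pseudogene\n"
    "NR_003051.4\tRNase_MRP_RNA\n"
    "XR_007061908.1\tsnRNA\n"
    "XR_007061868.1\tmisc_RNA\n"
)


def count_biotypes(text):
    """Count how many accessions carry each gene_biotype value."""
    counts = Counter()
    for line in text.splitlines():
        # Skip blank lines and the '#'-prefixed header line.
        if not line or line.startswith("#"):
            continue
        _accession, biotype = line.split("\t")
        counts[biotype] += 1
    return counts


counts = count_biotypes(SAMPLE)
for biotype, n in sorted(counts.items()):
    print(biotype, n, sep="\t")
```

On a full download the counts show at a glance which bioTypes dominate the table and which exact strings to use in a Table Browser filter.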
+For both the GENCODE and NCBI RefSeq tracks, you can also query the kgXref
+table to extract bioType information. The description field in the kgXref table can be
+used to filter entries when using the GENCODE and NCBI RefSeq tables. However, the bioType
+information may be embedded in free-text descriptions, which might require extra parsing or scripting to
+produce a tab-separated file.
+
+Since the primary table is from GENCODE, there may not be an NCBI RefSeq identifier listed in
+the output. The following output is from the hg38.kgXref table, selecting the
+kgID, refseq, and description fields:
+
+#hg38.kgXref.kgID  hg38.kgXref.refseq  hg38.kgXref.description
+ENST00000622482.1  NR_128716           microRNA 6724-3 (from RefSeq NR_128716.1)
+ENST00000612139.1                      ENSG00000274868 (from geneSymbol)
+ENST00000625598.1                      ENSG00000280614 (from geneSymbol)
+ENST00000613359.1  NR_146153           RNA, 5.8S ribosomal RNA N3 (from RefSeq NR_146153.1)
+ENST00000780342.1                      ENSG00000280441 (from geneSymbol)
+ENST00000650962.1                      ENSG00000286267 (from geneSymbol)
+ENST00000625020.1                      ribosomal protein SA pseudogene 68 (from HGNC RPSAP68)
+ENST00000559466.1                      tektin 4 pseudogene 2 (from HGNC TEKT4P2)
+ENST00000495055.1  NR_135309           RNA binding motif protein 11, transcript variant 4 (from RefSeq NR_135309.2)
+ENST00000461088.1                      RNA binding motif protein 11 (from HGNC RBM11)
+ENST00000467409.7  NR_003087           ATP binding cassette subfamily C member 13 (pseudogene), transcript variant A (from RefSeq NR_003087.1)
+ENST00000362728.1                      RNA, U6 small nuclear 859, pseudogene (from HGNC RNU6-859P)
+
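The extra parsing mentioned above could look like the following sketch. It assumes every description ends with a "(from SOURCE ID)" suffix, as in the rows shown, and splits that suffix into separate tab-separated columns. The regular expression and the helper names `split_description` and `to_tsv_row` are illustrative assumptions, and the keyword match on "pseudogene" is only a heuristic; the knownAttrs table remains the more reliable source of bioTypes.

```python
import re

# Rows copied from the kgXref output above; the refseq column may be empty.
SAMPLE_ROWS = [
    "ENST00000622482.1\tNR_128716\tmicroRNA 6724-3 (from RefSeq NR_128716.1)",
    "ENST00000612139.1\t\tENSG00000274868 (from geneSymbol)",
    "ENST00000625020.1\t\tribosomal protein SA pseudogene 68 (from HGNC RPSAP68)",
    "ENST00000467409.7\tNR_003087\tATP binding cassette subfamily C member 13 "
    "(pseudogene), transcript variant A (from RefSeq NR_003087.1)",
]

# Matches a trailing "(from SOURCE)" or "(from SOURCE ID)" suffix.
FROM_SUFFIX = re.compile(
    r"^(?P<name>.*)\(from (?P<source>[^\s)]+)(?: (?P<acc>[^)]+))?\)\s*$"
)


def split_description(description):
    """Return (name, source, source_id); source fields are '' if no suffix is found."""
    match = FROM_SUFFIX.match(description)
    if match is None:
        return description.strip(), "", ""
    return match.group("name").strip(), match.group("source"), match.group("acc") or ""


def to_tsv_row(line):
    """Turn one kgXref output line into a 5-column tuple."""
    kg_id, refseq, description = line.split("\t")
    return (kg_id, refseq) + split_description(description)


parsed = [to_tsv_row(line) for line in SAMPLE_ROWS]
# Heuristic filter: keep rows whose free-text name mentions "pseudogene".
pseudogenes = [row for row in parsed if "pseudogene" in row[2].lower()]
for row in pseudogenes:
    print("\t".join(row))
```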
+When using the Ensembl track, you can use the source field from the
+ensemblSource table. The following output is from the hg19.ensemblSource table,
+selecting the name and source fields:
+
+#hg19.ensemblSource.name  hg19.ensemblSource.source
+ENST00000596669           retained_intron
+ENST00000516163           snRNA
+ENST00000448850           protein_coding
+ENST00000463070           processed_transcript
+ENST00000455275           antisense
+ENST00000608591           lincRNA
+ENST00000384075           snRNA
+ENST00000460212           nonsense_mediated_decay
+ENST00000450472           processed_pseudogene
+ENST00000581654           miRNA
+
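To go from a bioType to downloadable data, one possible approach is to group the transcript names by their source value and use the resulting identifier list to restrict a second Table Browser query on the gene track itself. The sketch below only builds the identifier lists; `SAMPLE` copies rows from the output above, `group_by_source` is a hypothetical helper, and whether the names match the identifiers of the table you query next is an assumption you should verify.

```python
from collections import defaultdict

# Rows copied from the ensemblSource output above, tab-separated.
SAMPLE = (
    "#hg19.ensemblSource.name\thg19.ensemblSource.source\n"
    "ENST00000596669\tretained_intron\n"
    "ENST00000516163\tsnRNA\n"
    "ENST00000448850\tprotein_coding\n"
    "ENST00000384075\tsnRNA\n"
    "ENST00000581654\tmiRNA\n"
)


def group_by_source(text):
    """Map each source (bioType) value to the list of transcript names that carry it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        # Skip blank lines and the '#'-prefixed header line.
        if not line or line.startswith("#"):
            continue
        name, source = line.split("\t")
        groups[source].append(name)
    return dict(groups)


groups = group_by_source(SAMPLE)
# One identifier per line, ready to save to a file or paste into a form.
id_list = "\n".join(groups["snRNA"])
print(id_list)
```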
For automated analysis, if you are doing NGS analysis and you need to capture all possible transcripts, GENCODE provides one of the most comprehensive gene sets. For human genetics or variant annotation, a more restricted transcript set is usually sufficient and "NCBI RefSeq" is the standard. If you are only interested in protein-coding annotations, CCDS or UniProt may be an option, but this is rather unusual. If you are interested in the best splice site coverage, AceView is worth a look.
For manual inspection of exon boundaries of a single gene, and especially if it is a transcript that is repetitive or hard to align (e.g. very small exons),