317dbfc227692ade3bc42d0de919155f67e139e7
lrnassar
Wed Mar 20 11:16:25 2019 -0700
Minor modifications and new section to unreleased FAQgenes page ref#22696

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index 8a7151d..0de5271 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -1,41 +1,43 @@

Frequently Asked Questions: Gene tracks

Topics


Return to FAQ Table of Contents

The basics

The genome browser contains many gene annotation tracks. Our users often wonder what these contain and where the information that we present comes from.
What is a gene?
@@ -66,82 +68,83 @@
Genes", but they really represent transcripts on chromosomes of a genome assembly.

For example, using the NCBI databases, the gene with the gene symbol BRCA1 has 5 protein-coding transcripts or isoforms. The first transcript has the NCBI accession number NM_007294.3, which produces the protein with the accession NP_009225.1. In the human genome, it is located on chromosome 17, where it comprises 23 exons. On the version GRCh38 of the human genome, these exons cover the DNA nucleotides 43044295 to 43125483.

+ +
What is a gene or transcript accession?
+ +

+Gene symbols like BRCA1 are easy to remember but sometimes change and are not +specific to an organism. Therefore most databases internally use unique +identifiers to refer to sequences and some journals require authors to use +these in manuscripts.

+ +

+The most common accession numbers encountered by users are either from Ensembl, +GENCODE or RefSeq. Human Ensembl/GENCODE gene accession numbers start with +ENSG, e.g. "ENSG00000012048" for BRCA1. Every ENSG-gene has at least +one transcript assigned to it. The transcript identifiers start with ENST +followed by a number, e.g. "ENST00000619216.1". NCBI refers to genes +with plain numbers, e.g. 672 for BRCA1. Manually curated RefSeq transcript +identifiers start with NM_ (coding) or NR_ (non-coding), followed by a number and version +number separated by a dot, e.g. "NR_046018.2". If the transcript was +predicted by the NCBI Gnomon software, the prefix is XM_ but these are rare in human. +A table of these and other RefSeq prefixes can be +found on the +NCBI website. +
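
The prefix conventions described above can be turned into a small lookup. The following Python sketch is only an illustration: the function name `classify_accession` and the label strings are made up for this example, it covers only the prefixes mentioned in this answer, and it is not how the Genome Browser itself parses identifiers.

```python
# Prefix rules taken from the paragraph above; the labels are illustrative only.
PREFIX_RULES = [
    ("ENSG", "Ensembl/GENCODE gene"),
    ("ENST", "Ensembl/GENCODE transcript"),
    ("NM_", "RefSeq curated transcript (coding)"),
    ("NR_", "RefSeq curated transcript (non-coding)"),
    ("XM_", "RefSeq predicted transcript (Gnomon)"),
]


def classify_accession(accession):
    """Return (kind, base_id, version) for an accession string.

    Hypothetical helper: the version is the integer after the dot,
    or None when the accession carries no version suffix.
    """
    base, _, version = accession.partition(".")
    ver = int(version) if version.isdigit() else None
    for prefix, kind in PREFIX_RULES:
        if base.startswith(prefix):
            return kind, base, ver
    # NCBI gene identifiers are plain numbers, e.g. 672 for BRCA1.
    if base.isdigit():
        return "NCBI gene ID", base, None
    return "unknown", base, ver


print(classify_accession("NR_046018.2"))
print(classify_accession("ENSG00000012048"))
print(classify_accession("672"))
```

Stripping the version suffix this way is often useful when joining tables that store accessions with and without versions.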

+
What are the most common gene transcript tracks?

Researchers sequence cDNA sequences and send these to NCBI Genbank. The Genome Browser shows these sequences in the Genbank or the EST track (if the cDNA is just a single read from the 5' or 3' end). From the alignment of the cDNAs and ESTs, the NCBI RefSeq group manually creates a smaller set of representative transcripts which we display as the RefSeq Curated track. Automated programs like UCSC's or Ensembl's gene build software do the same, just in software, which is more systematic but also more error-prone. With the arrival of GENCODE, Ensembl added a manual curation to their human and mouse transcripts. NCBI has added an automated prediction software (Gnomon) which we show in the "RefSeq Predicted" track.

There are many other tracks in the group "Genes and Gene Predictions". Genscan and N-Scan are older transcript predictor algorithms that are based on the genome sequence alone. Augustus and AceView are automated gene-predictors that use cDNA and EST data. These and similar gene tracks are only relevant when you are working on a particular locus where you think that the manually curated gene models (Ensembl and RefSeq) have errors.

- -
What is a gene or transcript accession?
- -

-Gene symbols like BRCA1 are easy to remember but sometimes change and are not -specific to an organism. Therefore most databases internally use unique -identifiers to refer to sequences and some journals require authors to use -these in manuscripts.
- -The most common accession numbers encountered by users are either from Ensembl, -GENCODE or RefSeq. Human Ensembl/GENCODE gene accession numbers start with -ENSG, e.g. "ENSG00000012048" for BRCA1. Every ENSG-gene has at least -one transcript assigned to it. The transcript identifiers start with with ENST -followed by a number, e.g. "ENST00000619216.1". NCBI refers to genes -with plain numbers, e.g. 672 for BRCA1. Manually curated RefSeq transcript -identifiers start with NM_ (coding) or NR_ (non-coding), followed by a number and version -number separated by a dot, e.g. "NR_046018.2". If the transcript was -predicted by the NCBI Gnomon software, the prefix is XM_ but these are rare in human. -A table of these and other RefSeq prefixes can be -found on the -NCBI website. -

-

The differences

Some of our gene tracks look similar and contain very similar information which can be confusing.
What are Ensembl and GENCODE and is there a difference?

Officially, the Ensembl and GENCODE gene models are the same. On the latest human and mouse genome assemblies (hg38 and mm10), the identifiers, transcript sequences, and exon coordinates are almost identical between equivalent Ensembl and GENCODE versions (excluding alternative sequences or fix sequences).

GENCODE uses the UCSC convention of prefixing chromosome names with "chr", e.g.
@@ -311,30 +314,60 @@
(hg38), click on their title and on the configuration page, uncheck the box "Show splice variants". Only a single transcript will be shown. The method for how this transcript is selected is described in the track documentation below the configuration settings.

Changing splice variants

For the track NCBI RefSeq (hg38), you can activate the subtrack "RefSeq HGMD". It contains only the transcripts that are part of the Human Gene Mutation Database.

+ +
I just want to download a gene set with a single entry per gene. Where can I find this?
+

+We provide data tables named knownCanonical for different assemblies. Each contains a single +transcript/isoform per gene.

+ +

+For hg19, the knownCanonical table is a subset of the UCSC Genes track. It was generated by +identifying a canonical isoform for each cluster ID, or gene. Generally, this is the longest +isoform. It can be downloaded directly from the hg19 downloads database +or by using the Table Browser.

+ +

+For hg38, the knownCanonical table is a subset of the GENCODE v29 track. As opposed to the hg19 +equivalent which generally used the longest isoform for identification, this table is defined +as follows:

+

+knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL +gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal +transcript when available. If no APPRIS tag exists for any transcript associated with the +cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then +the longest isoform is used.

+

+It can be downloaded directly from the hg38 downloads database +or by using the Table Browser.
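
The hg38 selection order quoted above (APPRIS principal, then BASIC, then longest isoform) can be sketched in a few lines of Python. This is a rough illustration, not the UCSC build code: the record layout (`id`, `length`, `tags`) and the tag strings `appris_principal` and `basic` are assumptions made for this example, and picking the longest transcript within the APPRIS or BASIC pool is also an assumption, since the definition does not say how ties are resolved.

```python
def pick_canonical(transcripts):
    """Pick one transcript for a gene cluster.

    `transcripts` is a list of dicts with hypothetical keys:
    'id' (str), 'length' (int) and 'tags' (set of str).
    """
    if not transcripts:
        raise ValueError("cluster has no transcripts")

    # Step 1: prefer transcripts carrying an APPRIS principal tag (assumed tag name).
    appris = [t for t in transcripts
              if any(tag.startswith("appris_principal") for tag in t["tags"])]
    # Step 2: otherwise fall back to the BASIC set (assumed tag name).
    basic = [t for t in transcripts if "basic" in t["tags"]]
    # Step 3: otherwise consider every isoform in the cluster.
    pool = appris or basic or transcripts

    # Assumption: within the chosen pool, take the longest transcript.
    return max(pool, key=lambda t: t["length"])


cluster = [
    {"id": "tx_a", "length": 100, "tags": {"basic"}},
    {"id": "tx_b", "length": 50, "tags": {"appris_principal", "basic"}},
    {"id": "tx_c", "length": 500, "tags": set()},
]
print(pick_canonical(cluster)["id"])
```

In this made-up cluster the APPRIS-tagged transcript is chosen even though it is the shortest, which is the main difference from the hg19 rule of generally taking the longest isoform.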

+
This is rather complicated. Can you tell me which gene transcript track I should use?

For automated analysis, if you are doing NGS analysis and you need to capture all possible transcripts, GENCODE provides a comprehensive gene set. For human genetics or variant annotation, a more restricted transcript set is usually sufficient and "NCBI RefSeq" is the standard. If you are only interested in protein-coding annotations, CCDS or UniProt may be an option, but this is rather unusual.

For manual inspection of exon boundaries of a single gene, and especially if it is a transcript that is repetitive or hard to align (e.g. very small exons), look at the UCSC RefSeq track and watch for differences between the NCBI and UCSC exon placement. You can also BLAT the transcript sequence.