8e20653bae737846c5025e987e37777f2a4f0f29 max Tue Jul 2 10:52:21 2019 +0200 tweaking Lou's excellent transcript download section a little in genes FAQ, refs #23748 diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html index 418d046..92f6ed4 100755 --- src/hg/htdocs/FAQ/FAQgenes.html +++ src/hg/htdocs/FAQ/FAQgenes.html @@ -11,33 +11,32 @@ <ul> <li><a href="#gene">What is a gene?</a></li> <li><a href="#genestrans">What is a transcript and how is it related to a gene?</a></li> <li><a href="#genename">What is a gene name?</a></li> <li><a href="#mostCommon">What are the most common gene transcript tracks?</a></li> <li><a href="#ens">What are Ensembl and GENCODE and is there a difference?</a></li> <li><a href="#ensRefseq">What are the differences among GENCODE, Ensembl and RefSeq?</a></li> <li><a href="#hg19">For the human assembly hg19/GRCh37: What is the difference between "UCSC Genes" track, the "GENCODE" track and the "Ensembl Genes" track?</a></li> <li><a href="#hg38">For the human assembly hg38/GRCh38: What are the differences between the "GENCODE" and "All GENCODE" tracks?</a></li> <li><a href="#gencode">What is the difference between GENCODE comprehensive and basic?</a></li> <li><a href="#ncbiRefseq">What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</a></li> <li><a href="#ccds">What is CCDS?</a></li> -<li><a href="#justsingle">How can I just show a single transcript per gene?</a></li> -<li><a href="#singledownload">I just want to download a gene set with a single entry per gene. - Where can I find this?</a></li> +<li><a href="#justsingle">How can I show a single transcript per gene?</a></li> +<li><a href="#singledownload">How can I download a file with a single transcript per gene?</a></li> <li><a href="#whatdo">This is rather complicated. 
Can you tell me which gene transcript track I should use?</a></li> </ul> <hr> <p> <a href="index.html">Return to FAQ Table of Contents</a></p> <a name="gene"></a> <h2>The basics</h2> The genome browser contains many gene annotation tracks. Our users often wonder what these contain and where the information that we present comes from. <h6>What is a gene?</h6> @@ -342,90 +341,105 @@ often change from one version to the next. </p> <a name="ccds"></a> <h6>What is CCDS?</h6> <p> The <a target=_blank href="https://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi"> Consensus Coding Sequence Project</a> is a list of transcript coding sequence (CDS) genomic regions that are identically annotated by RefSeq and Ensembl/GENCODE. CCDS undergoes extensive manual review and you can consider these a subset of either gene track, filtered for high quality. The CCDS identifiers are very stable and allow you to link easily between the different databases. As the name implies, it does not cover UTR regions or non-coding transcripts. </p> <a name="justsingle"></a> -<h6>How can I just show a single transcript per gene?</h6> +<h6>How can I show a single transcript per gene?</h6> <p> For the tracks "<a target=_blank href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a>" (hg19) or "<a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE Genes</a>" (hg38), click on their title and on the configuration page, uncheck the box "Show splice variants". Only a single transcript will be shown. The method for how this transcript is selected is described in the track documentation below the configuration settings.</p> <p class='text-center'> <img class='text-center' src="../images/SpliceVariants.png" alt="Changing splice variants" width="750"> <p>For the track <a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite"> NCBI RefSeq</a> (hg38), you can activate the subtrack "RefSeq HGMD". 
It contains only the transcripts that are part of the Human Gene Mutation Database. </p> <a name="singledownload"></a> -<h6>I just want to download a gene set with a single entry per gene. Where can I find this?</h6> +<h6>How can I download a file with a single transcript per gene?</h6> <p> -We have data tables named knownCanonical available for different assemblies comprised of a single -transcript/isoform per gene.</p> +This is a common request, but very often this is not necessary when designing +an analysis. You will have to make a choice of this single transcript using +some mechanism, and this choice will affect your pipeline results. It may be +easier to keep all transcripts. For example, instead of annotating enhancers +with the closest "best-transcript", you can annotate them with the closest exon +of any transcript. When mapping variants to transcripts, you can map to all +transcripts and sort these by accession ID, showing mainly the first. When +segmenting the chromosomes into gene loci, you can use the union of all +transcripts of a gene rather than select a single "best" transcript. +</p> + +<p> +That being said, data tables called "knownCanonical" are available for +many assemblies. They try to select only a single transcript/isoform per gene, +if possible.</p> <p> For hg19, the knownCanonical table is a subset of the <a target="_blank" href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a> track. It was generated by identifying a canonical isoform for each cluster ID, or gene. Generally, this is the longest isoform. It can be downloaded directly from the <a target="_blank" href="http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/">hg19 downloads database</a> or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p> <p> For hg38, the knownCanonical table is a subset of the <a target="_blank" href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE v29</a> track. 
As opposed to the hg19 equivalent which generally used the longest isoform for identification, this table is defined as follows:</p> <p> <i>knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.</i></p> <p> It can be downloaded directly from the <a target="_blank" href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/">hg38 downloads database</a> or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p> <a name="whatdo"></a> <h6>This is rather complicated. Can you tell me which gene transcript track I should use?</h6> <p> For automated analysis, if you are doing NGS analysis and you need to capture -all possible transcripts, GENCODE provides a comprehensive gene set. For human +all possible transcripts, GENCODE provides one of the most comprehensive gene sets. For human genetics or variant annotation, a more restricted transcript set is usually sufficient and "NCBI RefSeq" is the standard. If you are only interested in protein-coding annotations, CCDS or UniProt may be an option, but this is rather unusual. +If you are interested in the best splice site coverage, AceView is worth a +look. </p> <p> For manual inspection of exon boundaries of a single gene, and especially if it is a transcript that is repetitive or hard to align (e.g. very small exons), look at the UCSC RefSeq track and watch for differences between the NCBI and UCSC exon placement. You can also BLAT the transcript sequence. 
Manually look at ESTs, mRNAs, TransMap and possibly Augustus, Genscan, SIB, SGP or GeneId in obscure cases where you are looking for hints on what an alternative splicing could look like.</p> <p> You may also find the <a target="_blank" href="http://genome.ucsc.edu/s/view/GeneSupport">Gene Support</a> public session helpful. This session is a collection of tracks centered around supporting evidence for genes.</p>
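As a reading aid for the hg38 knownCanonical rule quoted in the FAQ text (APPRIS principal transcript first, then a BASIC transcript, then the longest isoform), here is a minimal Python sketch of that priority order. It is not the UCSC pipeline. The `Transcript` class, the `pick_canonical` function, and the tag strings `"appris_principal"` and `"basic"` are hypothetical names chosen for illustration. The FAQ does not say how ties within the APPRIS or BASIC pool are resolved, so picking the longest transcript within the pool is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    # Hypothetical record; field names are illustrative, not a UCSC schema.
    transcript_id: str
    gene_id: str
    length: int
    tags: set = field(default_factory=set)


def pick_canonical(transcripts):
    """Return {gene_id: Transcript}, one transcript per gene.

    Priority follows the rule quoted in the FAQ:
    APPRIS principal > BASIC > longest isoform.
    Assumption: within the chosen pool, the longest transcript wins.
    """
    # Group transcripts by gene.
    by_gene = {}
    for tx in transcripts:
        by_gene.setdefault(tx.gene_id, []).append(tx)

    canonical = {}
    for gene_id, txs in by_gene.items():
        appris = [t for t in txs if "appris_principal" in t.tags]
        basic = [t for t in txs if "basic" in t.tags]
        # Fall through the tiers: first non-empty pool is used.
        pool = appris or basic or txs
        canonical[gene_id] = max(pool, key=lambda t: t.length)
    return canonical
```

A caller would build `Transcript` records from whatever annotation file is at hand and pass the list to `pick_canonical`. As the FAQ notes, choosing a single transcript affects downstream results, so keeping all transcripts is often the simpler option.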