772bf0ee794706c69818babdd436402a633c6090 max Fri Jan 17 11:49:37 2020 +0100 adding a section about NCBI RefSeq and knownCanonical alternatives there, refs #24780 diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html index 8ade084..82dfb94 100755 --- src/hg/htdocs/FAQ/FAQgenes.html +++ src/hg/htdocs/FAQ/FAQgenes.html @@ -388,89 +388,105 @@ review and you can consider these a subset of either gene track, filtered for high quality. The CCDS identifiers are very stable and allow you to link easily between the different databases. As the name implies, it does not cover UTR regions or non-coding transcripts. </p> <a name="justsingle"></a> <h6>How can I show a single transcript per gene?</h6> <p> For the tracks "<a target=_blank href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a>" (hg19) or "<a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE Genes</a>" (hg38), click on their title and on the configuration page, uncheck the box "Show splice variants". Only a single transcript will be shown. The method for how this -transcript is selected is described in the track documentation below the -configuration settings.</p> +transcript is selected is described in the next section below and in the track documentation. </p> <p class='text-center'> <img class='text-center' src="../images/SpliceVariants.png" alt="Changing splice variants" width="750"> <p>For the track <a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite"> NCBI RefSeq</a> (hg38), you can activate the subtrack "RefSeq HGMD". It contains only the transcripts that are part of the Human Gene Mutation Database. </p> <a name="singledownload"></a> <h6>How can I download a file with a single transcript per gene?</h6> <p> This is a common request, but very often this is not necessary when designing an analysis. You will have to make a choice of this single transcript using some mechanism, and this choice will affect your pipeline results. 
It may be easier to keep all transcripts. For example, instead of annotating enhancers with the closest "best-transcript", you can annotate them with the closest exon of any transcript. When mapping variants to transcripts, you can map to all -transcripts and sort these by accession ID, showing mainly the first. When +transcripts and show the transcript with the worst impact first. When segmenting the chromosomes into gene loci, you can use the union of all -transcripts of a gene rather than select a single "best" transcript. -</p> +transcripts of a gene, adding some predefined distance, rather than selecting a +single "best" transcript.</p> <p> -That being said, data tables called "knownCanonical" are available for +That being said, the main gene tracks have tables that try to guess the most interesting +transcript per gene. For the knownGene tracks (UCSC Genes on hg19, GENCODE on hg38 and mm10), +data tables called "knownCanonical" are available for many assemblies. They try to select only a single transcript/isoform per gene, -if possible.</p> +if possible, and similar options exist for RefSeq.</p> <p> -For hg19, the knownCanonical table is a subset of the <a target="_blank" +<b>UCSC Genes on hg19</b>: For hg19, the knownCanonical table is a subset of the <a target="_blank" href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a> track. It was generated by identifying a canonical isoform for each cluster ID, or gene. Generally, this is the longest isoform. It can be downloaded directly from the <a target="_blank" href="http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/">hg19 downloads database</a> or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p> <p> -For hg38, the knownCanonical table is a subset of the <a target="_blank" +<b>GENCODE on hg38/mm10</b>: For hg38, the knownCanonical table is a subset of the <a target="_blank" href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE v29</a> track. 
As opposed to the hg19 knownCanonical table, which used computationally generated gene clusters and generally chose the longest isoform as the canonical isoform, the hg38 table uses ENSEMBL gene IDs to define clusters (that is to say, one canonical isoform per ENSEMBL gene ID), and the method of choosing the isoform is described as such:</p> -<p> +<p style="margin-left: 10em"> <i>knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.</i></p> <p> It can be downloaded directly from the <a target="_blank" href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/">hg38 downloads database</a> or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p> +<p> +<b>NCBI RefSeq (hg19/hg38)</b>: NCBI manually selects a few transcripts per gene, +usually one, called "RefSeq Select", based on <a target=_blank +href="https://www.ncbi.nlm.nih.gov/refseq/refseq_select/">various criteria</a>. +Example use cases are comparative genomics and variant reporting. This subset +is available in the RefSeq Select track under NCBI RefSeq. RefSeq and the EBI +also select one transcript for every protein-coding gene in a project called <a +href="https://ncbiinsights.ncbi.nlm.nih.gov/2019/03/12/mane-select-v0-5/" +target=_blank>"MANE Select"</a>, which is another subtrack of NCBI RefSeq. +For the special case of clinical diagnostics, +where an even more reduced number of transcripts simplifies visual inspection, +we also provide "RefSeq HGMD", another subtrack of NCBI RefSeq. 
It contains +(usually) a single transcript per gene, and only for genes known to cause human genetic diseases; +the transcript is the one to which reported clinical variants can be mapped.</p> + <a name="whatdo"></a> <h6>This is rather complicated. Can you tell me which gene transcript track I should use?</h6> <p> For automated analysis, if you are doing NGS analysis and you need to capture all possible transcripts, GENCODE provides one of the most comprehensive gene sets. For human genetics or variant annotation, a more restricted transcript set is usually sufficient and "NCBI RefSeq" is the standard. If you are only interested in protein-coding annotations, CCDS or UniProt may be an option, but this is rather unusual. If you are interested in the best splice site coverage, AceView is worth a look. </p> <p> For manual inspection of exon boundaries of a single gene, and especially if it is a transcript that is repetitive or hard to align (e.g. very small exons),