f67479340a8ea3cec099d85b085d6be8cbaa2715 lrnassar Wed Feb 5 10:13:36 2020 -0800 Added entry about new GTF files to genes FAQ refs #20867 diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html index 47da356..0b11753 100755 --- src/hg/htdocs/FAQ/FAQgenes.html +++ src/hg/htdocs/FAQ/FAQgenes.html @@ -17,30 +17,31 @@ <li><a href="#wrong">I think this transcript looks strange, what shall I do?</a></li> <li><a href="#ens">What are Ensembl and GENCODE and is there a difference?</a></li> <li><a href="#ensRefseq">What are the differences among GENCODE, Ensembl and RefSeq?</a></li> <li><a href="#hg19">For the human assembly hg19/GRCh37: What is the difference between "UCSC Genes" track, the "GENCODE" track and the "Ensembl Genes" track?</a></li> <li><a href="#hg38">For the human assembly hg38/GRCh38: What are the differences between the "GENCODE" and "All GENCODE" tracks?</a></li> <li><a href="#gencode">What is the difference between GENCODE comprehensive and basic?</a></li> <li><a href="#ncbiRefseq">What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</a></li> <li><a href="#report">How shall I report a gene transcript in a manuscript?</a></li> <li><a href="#ccds">What is CCDS?</a></li> <li><a href="#justsingle">How can I show a single transcript per gene?</a></li> <li><a href="#singledownload">How can I download a file with a single transcript per gene?</a></li> <li><a href="#whatdo">This is rather complicated. Can you tell me which gene transcript track I should use?</a></li> +<li><a href="#gtfDownload">Does UCSC provide GTF/GFF files for gene models?</a></li> </ul> <hr> <p> <a href="index.html">Return to FAQ Table of Contents</a></p> <a name="gene"></a> <h2>The basics</h2> The genome browser contains many gene annotation tracks. Our users often wonder what these contain and where the information that we present comes from. <h6>What is a gene?</h6> <p> The exact definition of "gene" depends on the context. 
In the context of @@ -333,31 +334,31 @@ <p> An anecdotal and rare example is SHANK2 and SHANK3 in hg19. It is impossible for either NCBI or BLAT to get the correct alignment and gene model because the genome sequence is missing for part of the gene. NCBI and BLAT find slightly different exon boundaries at the edge of the problematic region. NCBI's aligner tries very hard to find exons that align to any transcript sequence, so it calls a few small dubious "exons" in the affected genomic region. GENCODE V19 also used an aligner that tried very hard to find exons, but it found small dubious "exons" in different places than NCBI. The <a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite">RefSeq Alignments</a> subtrack makes the problematic region very clear with double lines indicating unalignable transcript sequence. </p> <a name="report"></a> -<h6>How shall I report a gene transcript in a manuscript?</h6> +<h2>How shall I report a gene transcript in a manuscript?</h2> <p> When reporting on GENCODE/Ensembl transcripts, please specify the ENST identifier. It is often helpful to also specify the Ensembl release, which is shown on the details page, when you click onto a transcript. </p> <p> When reporting RefSeq transcripts, e.g. in HGVS, prefer the "NCBI RefSeq" track over the "UCSC RefSeq track". Please specify the RefSeq transcript ID and also the RefSeq annotation release. </p> <ul> <li>The RefSeq transcript ID is the sequence of the transcript, the NM_xxxxx.y @@ -368,64 +369,64 @@ is helpful for readers, e.g. report NM_012309.4, not NM_012309. <li>The RefSeq annotation release captures the mapping of all transcript sequences to the genome. It is shown on our transcript details page, when you click a transcript. It looks like "Annotation Release 105 (2017-04-01)". The most important part is the "Annotation Release" number, e.g. "105". The date is NCBI's release date. 
Shown below this line is the date when UCSC imported the data, which is not relevant for manuscripts. Note that an "Annotation release" is not a "RefSeq release"; a "RefSeq release" is only about sequences, not their mapping to the genome. NCBI provides a list of <a href="https://www.ncbi.nlm.nih.gov/genome/annotation_euk/all/" target=_blank>all current annotation releases</a>. The first annotation release for every genome is usually "100". </ul> <a name="ccds"></a> -<h6>What is CCDS?</h6> +<h2>What is CCDS?</h2> <p> The <a target=_blank href="https://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi"> Consensus Coding Sequence Project</a> is a list of transcript coding sequence (CDS) genomic regions that are identically annotated by RefSeq and Ensembl/GENCODE. CCDS undergoes extensive manual review and you can consider these a subset of either gene track, filtered for high quality. The CCDS identifiers are very stable and allow you to link easily between the different databases. As the name implies, it does not cover UTR regions or non-coding transcripts. </p> <a name="justsingle"></a> -<h6>How can I show a single transcript per gene?</h6> +<h2>How can I show a single transcript per gene?</h2> <p> For the tracks "<a target=_blank href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a>" (hg19) or "<a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE Genes</a>" (hg38), click on their title and on the configuration page, uncheck the box "Show splice variants". Only a single transcript will be shown. How this transcript is selected is described in the next section and in the track documentation. </p> <p class='text-center'> <img class='text-center' src="../images/SpliceVariants.png" alt="Changing splice variants" width="750"> <p>For the track <a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite"> NCBI RefSeq</a> (hg38), you can activate the subtrack "RefSeq HGMD". 
It contains only the transcripts that are part of the Human Gene Mutation Database. </p> <a name="singledownload"></a> -<h6>How can I download a file with a single transcript per gene?</h6> +<h2>How can I download a file with a single transcript per gene?</h2> <p> This is a common request, but very often this is not necessary when designing an analysis. You will have to make a choice of this single transcript using some mechanism, and this choice will affect your pipeline results. It may be easier to keep all transcripts. For example, instead of annotating enhancers with the closest "best-transcript", you can annotate them with the closest exon of any transcript. When mapping variants to transcripts, you can map to all transcripts and show the transcript with the worst impact first. When segmenting the chromosomes into gene loci, you can use the union of all transcripts of a gene, adding some predefined distance, rather than selecting a single "best" transcript.</p> <p> That being said, the main gene tracks have tables that try to guess the most interesting transcript per gene. For the knownGene tracks (UCSC genes on hg19, Gencode on hg38 and mm10), @@ -465,43 +466,62 @@ transcript per gene called "RefSeq Select", based on <a target=_blank href="https://www.ncbi.nlm.nih.gov/refseq/refseq_select/">various criteria</a>. Example use cases are comparative genomics and variant reporting. This subset is available in the RefSeq Select track under NCBI RefSeq. RefSeq and the EBI also select one transcript for every protein-coding gene that is annotated exactly the same in both Gencode and RefSeq, a project called <a href="https://ncbiinsights.ncbi.nlm.nih.gov/2019/03/12/mane-select-v0-5/" target=_blank>"MANE select"</a>, which is another subtrack of NCBI RefSeq. For the special case of clinical diagnostics where an even more reduced number of transcripts simplifies visual inspection, we also provide another subtrack, "RefSeq HGMD". 
It contains (usually) a single transcript, and only for genes known to cause human genetic diseases; this transcript is the one to which all reported clinical variants can be mapped. <a name="whatdo"></a> -<h6>This is rather complicated. Can you tell me which gene transcript track I should use?</h6> +<h2>This is rather complicated. Can you tell me which gene transcript track I should use?</h2> <p> For automated analysis, if you are doing NGS analysis and you need to capture all possible transcripts, GENCODE provides one of the most comprehensive gene sets. For human genetics or variant annotation, a more restricted transcript set is usually sufficient and "NCBI RefSeq" is the standard. If you are only interested in protein-coding annotations, CCDS or UniProt may be an option, but this is rather unusual. If you are interested in the best splice site coverage, AceView is worth a look. </p> <p> For manual inspection of exon boundaries of a single gene, and especially if it is a transcript that is repetitive or hard to align (e.g. very small exons), look at the UCSC RefSeq track and watch for differences between the NCBI and UCSC exon placement. You can also BLAT the transcript sequence. Manually look at ESTs, mRNAs, TransMap and possibly Augustus, Genscan, SIB, SGP or GeneId in obscure cases where you are looking for hints on what alternative splicing could look like.</p> <p> You may also find the <a target="_blank" href="http://genome.ucsc.edu/s/view/GeneSupport">Gene Support</a> public session helpful. This session is a collection of tracks centered around supporting evidence for genes.</p> </p> +<a name="gtfDownload"></a> +<h2>Does UCSC provide GTF/GFF files for gene models?</h2> +<p> +We provide files in GTF format, which is an extension to GFF2, for most assemblies. 
More +information on the GTF format can be found <a target="_blank" href="FAQformat.html#format4"> +in our FAQ</a>.</p> +<p> +These files are generated for four gene model tables: ncbiRefSeq, refGene, ensGene, and knownGene. +Some assemblies, such as hg19, have all four files, while smaller assemblies may have only +one or two of them. Which file to use depends on your analysis, as the files contain +different data and metadata.</p> +<p> +These files are generated with the <code>genePredToGtf</code> method, run with the <code>-utr</code> flag, as described in our +<a target="_blank" href="https://genome.ucsc.edu/FAQ/FAQdownloads.html#download37"> +downloads FAQ</a>. They can be found on the download server at +<i>http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/genes/</i>, where +<i>$db</i> is the assembly of interest. For example, see the <a target="_blank" +href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/">hg38 GTF files</a>.</p> + <!--#include virtual="$ROOT/inc/gbPageEnd.html" -->