src/hg/htdocs/FAQ/FAQgenes.html f67479340a8ea3cec099d85b085d6be8cbaa2715

f67479340a8ea3cec099d85b085d6be8cbaa2715
lrnassar
  Wed Feb 5 10:13:36 2020 -0800
Added entry about new GTF files to genes FAQ refs #20867

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index 47da356..0b11753 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -17,30 +17,31 @@
 <li><a href="#wrong">I think this transcript looks strange, what shall I do?</a></li>
 <li><a href="#ens">What are Ensembl and GENCODE and is there a difference?</a></li>
 <li><a href="#ensRefseq">What are the differences among GENCODE, Ensembl and RefSeq?</a></li>
 <li><a href="#hg19">For the human assembly hg19/GRCh37: What is the difference between "UCSC 
                     Genes" track, the "GENCODE" track and the "Ensembl Genes" track?</a></li>
 <li><a href="#hg38">For the human assembly hg38/GRCh38: What are the differences between the 
 		    "GENCODE" and "All GENCODE" tracks?</a></li>
 <li><a href="#gencode">What is the difference between GENCODE comprehensive and basic?</a></li>
 <li><a href="#ncbiRefseq">What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</a></li>
 <li><a href="#report">How shall I report a gene transcript in a manuscript?</a></li>
 <li><a href="#ccds">What is CCDS?</a></li>
 <li><a href="#justsingle">How can I show a single transcript per gene?</a></li>
 <li><a href="#singledownload">How can I download a file with a single transcript per gene?</a></li>
 <li><a href="#whatdo">This is rather complicated. Can you tell me which gene transcript track
                       I should use?</a></li>
+<li><a href="#gtfDownload">Does UCSC provide GTF/GFF files for gene models?</a></li>
 </ul>
 <hr>
 <p>
 <a href="index.html">Return to FAQ Table of Contents</a></p>
 
 <a name="gene"></a>
 <h2>The basics</h2>
 
 The genome browser contains many gene annotation tracks. Our users 
 often wonder what these contain and where the information that we present comes
 from.
 
 <h6>What is a gene?</h6>
 <p>
 The exact definition of &quot;gene&quot; depends on the context. In the context of 
@@ -333,31 +334,31 @@
 <p>
 An anecdotal and rare example is SHANK2 and SHANK3 in hg19. It is impossible
 for either NCBI or BLAT to get the correct alignment and gene model because the genome sequence is
 missing for part of the gene.  NCBI and BLAT find slightly different exon
 boundaries at the edge of the problematic region. NCBI's aligner tries very hard
 to find exons that align to any transcript sequence,
 so it calls a few small dubious &quot;exons&quot; in the affected genomic region.
 GENCODE V19 also used an aligner that tried very hard to find exons, but it
 found small dubious &quot;exons&quot; in different places than NCBI.
 The <a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite">RefSeq Alignments</a> 
 subtrack makes the problematic region very clear with double lines
 indicating unalignable transcript sequence.
 </p>
 
 <a name="report"></a>
-<h6>How shall I report a gene transcript in a manuscript?</h6>
+<h2>How shall I report a gene transcript in a manuscript?</h2>
 
 <p>
 When reporting on GENCODE/Ensembl transcripts, please specify the ENST
 identifier. It is often helpful to also specify the Ensembl release, 
 which is shown on the details page, when you click onto a transcript.
 </p>
 
 <p>
 When reporting RefSeq transcripts, e.g. in HGVS, prefer the "NCBI RefSeq" track
 over the "UCSC RefSeq" track.  Please specify the RefSeq transcript ID and
 also the RefSeq annotation release.
 </p>
 
 <ul>
 <li>The RefSeq transcript ID is the sequence of the transcript, the NM_xxxxx.y
@@ -368,64 +369,64 @@
 is helpful for readers, e.g. report NM_012309.4, not NM_012309.
 <li>The RefSeq annotation release captures the mapping of all transcript
 sequences to the genome.  It is shown on our transcript details page, when you
 click a transcript. It looks like "Annotation Release 105 (2017-04-01)".  The
 most important part is the "Annotation Release" number, e.g. "105". The date is
 NCBI's release date. Shown below this line is the date when UCSC imported the
 data, which is not relevant for manuscripts. Note that an "Annotation release"
 is not a "RefSeq release" , a "RefSeq release" is only about sequences, not
 their mapping to the genome. NCBI provides a list of 
 <a href="https://www.ncbi.nlm.nih.gov/genome/annotation_euk/all/"
     target=_blank>all current annotation releases</a>. The first annotation
     release for every genome is usually "100".
 </ul>
 
 <a name="ccds"></a>
-<h6>What is CCDS?</h6>
+<h2>What is CCDS?</h2>
 <p>
 The <a target=_blank href="https://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi">
 Consensus Coding Sequence Project</a> is a list of transcript coding sequence (CDS) genomic regions
 that are identically annotated by RefSeq and Ensembl/GENCODE.   CCDS undergoes extensive manual
 review and you can consider these a subset of either gene track, filtered for high quality.
 The CCDS identifiers  are very stable and allow you to link easily between the different databases.
 As  the name implies, it does not cover UTR regions or non-coding transcripts.
 </p>
 
 <a name="justsingle"></a>
-<h6>How can I show a single transcript per gene?</h6>
+<h2>How can I show a single transcript per gene?</h2>
 
 <p> 
 For the tracks &quot;<a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a>&quot; 
 (hg19) or &quot;<a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE Genes</a>&quot; 
 (hg38), click on their title and on the configuration page, uncheck the 
 box &quot;Show splice variants&quot;. Only a single transcript will be shown. The method for how this
 transcript is selected is described in the next section below and in the track documentation. </p>
 
 <p class='text-center'>
   <img class='text-center' src="../images/SpliceVariants.png" 
 alt="Changing splice variants" width="750">
 
 <p>For the track <a target=_blank 
 href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite">
 NCBI RefSeq</a> (hg38), you can activate the subtrack &quot;RefSeq HGMD&quot;.
 It contains only the transcripts that are part of the Human Gene Mutation Database.
 </p>
 
 <a name="singledownload"></a>
-<h6>How can I download a file with a single transcript per gene?</h6>
+<h2>How can I download a file with a single transcript per gene?</h2>
 <p>
 This is a common request, but very often this is not necessary when designing
 an analysis.  You will have to make a choice of this single transcript using
 some mechanism, and this choice will affect your pipeline results. It may be
 easier to keep all transcripts. For example, instead of annotating enhancers
 with the closest &quot;best-transcript&quot;, you can annotate them with the closest exon
 of any transcript. When mapping variants to transcripts, you can map to all
 transcripts and show the transcript with the worst impact first.  When
 segmenting the chromosomes into gene loci, you can use the union of all
 transcripts of a gene, adding some predefined distance, rather than selecting a
 single "best" transcript.</p>
 
 <p>
 That being said, the main gene tracks have tables that try to guess the most interesting 
 transcript per gene. For the knownGene tracks (UCSC genes on hg19, Gencode on hg38 and mm10), 
@@ -465,43 +466,62 @@
 transcript per gene called "RefSeq Select", based on <a target=_blank
 href="https://www.ncbi.nlm.nih.gov/refseq/refseq_select/">various criteria</a>.
 Example use cases are comparative genomics and variant reporting. This subset
 is available in the RefSeq Select track under NCBI RefSeq.  RefSeq and the EBI
 also select one transcript for every protein coding gene that is annotated exactly 
 the same in both Gencode and RefSeq, a project called <a
 href="https://ncbiinsights.ncbi.nlm.nih.gov/2019/03/12/mane-select-v0-5/"
 target=_blank>"MANE select"</a>, which is another subtrack of NCBI RefSeq. 
 For the special case of clinical diagnostics
 where an even more reduced number of transcripts simplifies visual inspection,
 we also provide another subtrack, "RefSeq HGMD". It contains
 (usually) a single transcript only for genes known to cause human genetic diseases and
 the transcript is the one to which all reported clinical variants can be mapped to.
 
 <a name="whatdo"></a>
-<h6>This is rather complicated. Can you tell me which gene transcript track I should use?</h6>
+<h2>This is rather complicated. Can you tell me which gene transcript track I should use?</h2>
 <p> 
 For automated analysis, if you are doing NGS analysis and you need to capture
 all possible transcripts, GENCODE provides one of the most comprehensive gene sets.  For human 
 genetics or variant annotation, a more restricted transcript set is usually sufficient and &quot;NCBI
 RefSeq&quot; is the standard. If you are only interested in protein-coding
 annotations, CCDS or UniProt may be an option, but this is rather unusual.
 If you are interested in the best splice site coverage, AceView is worth a
 look.
 </p>
 
 <p>
 For manual inspection of exon boundaries of a single gene, and especially if it
 is a transcript that is repetitive or hard to align (e.g. very small exons),
 look at the UCSC RefSeq track and watch for differences between the NCBI
 and UCSC exon placement. You can also BLAT the transcript sequence. 
 Manually look at ESTs, mRNAs, TransMap and possibly Augustus, Genscan, SIB, SGP
 or GeneId in obscure cases where you are looking for hints on what an
 alternative splicing could look like.</p>
 <p>
 You may also find the <a target="_blank" 
 href="http://genome.ucsc.edu/s/view/GeneSupport">Gene Support</a> public session
 helpful. This session is a collection of tracks centered around supporting evidence
 for genes.</p>
 </p>
 
+<a name="gtfDownload"></a>
+<h2>Does UCSC provide GTF/GFF files for gene models?</h2>
+<p>
+We provide files in GTF format, which is an extension to GFF2, for most assemblies. More 
+information on GTF format can be found <a target="_blank" href="FAQformat.html#format4">
+in our FAQ</a>.</p>
+<p>
+These files are generated for four gene model tables: ncbiRefSeq, refGene, ensGene and knownGene. 
+Some assemblies, such as hg19, have all four files, while smaller assemblies may have only
+one or two of them. Which file to use depends on your analysis, as the files contain 
+different data and metadata.</p>
+<p>
+These files are generated with the <code>genePredToGtf</code> method described in our 
+<a target="_blank" href="https://genome.ucsc.edu/FAQ/FAQdownloads.html#download37">
+downloads FAQ</a>, run with the <code>-utr</code> flag. They are available on the download server 
+at <i>http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/genes/</i>, where
+<i>$db</i> is the assembly of interest. For example, see the <a target="_blank" 
+href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/">hg38 GTF files</a>.</p>
+
 <!--#include virtual="$ROOT/inc/gbPageEnd.html" -->
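
Illustration for readers of the new FAQ entry (not part of the commit): a minimal Python sketch of how the `$db` placeholder in the download address expands, and how one line of a GTF file can be split into its nine tab-separated columns. The function names `genes_dir_url` and `parse_gtf_line` are hypothetical, and the sample record in the usage note is made up for illustration rather than copied from a UCSC file.

```python
# Directory pattern quoted in the FAQ entry; {db} stands in for $db.
GENES_DIR = "http://hgdownload.soe.ucsc.edu/goldenPath/{db}/bigZips/genes/"


def genes_dir_url(db):
    """Return the GTF download directory for an assembly name such as 'hg38'."""
    return GENES_DIR.format(db=db)


def parse_gtf_line(line):
    """Parse one GTF data line into a dict (hypothetical helper).

    GTF has nine tab-separated columns; the ninth holds attributes written
    as 'key "value";' pairs. Coordinates are 1-based and inclusive.
    This simple split assumes attribute values contain no semicolons.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqname, source, feature, start, end, score, strand, frame, attr_text = cols

    attributes = {}
    for part in attr_text.split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece after the final semicolon
        key, _, value = part.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }
```

For example, `genes_dir_url("hg38")` gives the hg38 directory linked in the entry, and `parse_gtf_line('chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1.1";')["attributes"]["transcript_id"]` evaluates to `"T1.1"`. The file names inside each directory are not assumed here; browse the directory listing to see which of the four tables are available for a given assembly.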