src/hg/makeDb/trackDb/wgEncodeGencodeDisplay1.shared.html cef070ed6c72e0247012855fa88e3e82144a7a9b

cef070ed6c72e0247012855fa88e3e82144a7a9b
markd
  Fri Nov 4 13:45:47 2022 -0700
added all gencode transcript ranks to the track descriptions

diff --git src/hg/makeDb/trackDb/wgEncodeGencodeDisplay1.shared.html src/hg/makeDb/trackDb/wgEncodeGencodeDisplay1.shared.html
index d1999be..7cf2a6a 100644
--- src/hg/makeDb/trackDb/wgEncodeGencodeDisplay1.shared.html
+++ src/hg/makeDb/trackDb/wgEncodeGencodeDisplay1.shared.html
@@ -1,240 +1,281 @@
 <h2>Display Conventions and Configuration</h2>
 <P>
 This track is a multi-view composite track that contains differing data sets
 (<EM>views</EM>).  Instructions for configuring multi-view tracks are
 <A HREF="/goldenPath/help/multiView.html" TARGET=_BLANK>here</A>.
 To show only selected subtracks, uncheck the boxes next to the tracks that
 you wish to hide.</P>
 <b>Views</b> available on this track are:
 <dl>
     <dt><i>Genes</i></dt>
     <dd> The gene annotations in this view are divided into three subtracks:</dd>
 </dl>
 <ul>
   <li><em>GENCODE Basic set</em> is a subset of the <em>Comprehensive set</em>. 
     The selection criteria are described in the <a href="#basicSetSelection">methods section</a>.</li>
   <li><em>GENCODE Comprehensive set</em> contains all GENCODE coding and non-coding transcript annotations,
     including polymorphic pseudogenes.  This includes both manual and
     automatic annotations.  This is a super-set of the <em>Basic set</em>.</li>
   <li><em>GENCODE Pseudogenes</em> include all annotations except polymorphic pseudogenes.</li>
 </ul>
     
 <dl>
     <dt><i>2-way</i></dt> 
 </dl>
 <ul>
     <li><em>GENCODE 2-way Pseudogenes</em> contains pseudogenes predicted by both the 
         <a href="https://academic.oup.com/bioinformatics/article-abstract/22/12/1437/207326">Yale
         PseudoPipe</a> and
         <a href="https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-9-466">
         UCSC RetroFinder</a> pipelines. The set was derived by looking for 50 base pairs
         of overlap between pseudogenes derived from both sets based on their 
         chromosomal coordinates.  When multiple PseudoPipe
         predictions map to a single RetroFinder prediction, only one match is kept
         for the 2-way consensus set.
     </li>
 </ul>
 
 <dl>
     <dt><i>PolyA</i></dt>
 </dl>
 <ul>
 <li><em>GENCODE PolyA</em> contains polyA signals and sites manually annotated on
     the genome based on transcribed evidence (ESTs and cDNAs) of the 3' end of
     transcripts containing at least 3 A's not matching the genome.</li>
 </ul>
 
+<p>
+<b>Maximum number of transcripts to display</b>
+is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks.
+Starting with the GENCODE human V42 and mouse VM31 releases, 
+transcripts are assigned a rank within their gene. The ranks may be used to filter the number of transcripts
+displayed in a principled manner.  Transcript ranking is not available in the <em>lift37</em> releases.
+See <a href="#Methods">Methods</a> for details of rank assignment.
+</p>
 
 <p><b>Filtering</b> is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks
 using the following criteria:</p>
 <ul>
   <li> Transcript class: filter by the basic biological function of a transcript
     annotation
     <ul>
      <li> All - don't filter by transcript class</li>
      <li> coding - display protein coding transcripts, including polymorphic pseudogenes</li>
      <li> nonCoding - display non-protein coding transcripts</li>
      <li> pseudo - display pseudogene transcript annotations</li>
      <li> problem - display problem transcripts (Biotypes of <em>retained_intron</em>, <em>TEC</em>, or <em>disrupted_domain</em>)</li>
    </ul>
   </li>
 
   <li> Transcript Annotation Method: filter by the method used to create the annotation
    <ul>
      <li> All - don't filter by annotation method</li>
      <li> manual - display manually created annotations, including those that are 
        also created automatically</li>
      <li> automatic - display automatically created annotations, including those that are 
        also created manually</li>
      <li> manual_only - display manually created annotations that were
        not annotated by the automatic method</li>
      <li> automatic_only - display automatically created annotations that were
        not annotated by the manual method</li>
    </ul>
    </li>
   <li> Transcript Biotype: filter transcripts by
        <a href="https://www.gencodegenes.org/pages/biotypes.html" target="_blank">Biotype</a></li>
   <li> Support Level: filter transcripts by <a href="#tsl">transcription support level</a></li>
 </ul>
 
 <p><b>Coloring</b> for the gene annotations is based on the annotation type: </p>
 <ul>
   <li><font color="#0c0c78"><b>coding</b></font> 
   <li><font color="#006400"><b>non-coding</b></font> 
   <li><font color="#ff33ff"><b>pseudogene</b></font> 
   <li><font color="#fe0000"><b>problem</b></font>
   <li><font color="#ff33ff"><b>all 2-way pseudogenes</b></font>
   <li><font color="#000000"><b>all polyA annotations</b></font>
 </ul>
 
-<h2>Methods</h2>
+<h2 id="Methods">Methods</h2>
 
 <p>
 The GENCODE project aims to annotate all evidence-based gene features on the 
 human and mouse reference sequence with high accuracy by integrating 
 computational approaches (including comparative methods), manual
 annotation and targeted experimental verification. This goal includes identifying 
 all protein-coding loci with associated alternative variants, non-coding
 loci which have transcript evidence, and pseudogenes. 
 For a detailed description of the methods and references used, see
 Harrow <em>et al.</em> (2006).
 </p>
 
 <p>
 <b><a name="basicSetSelection">GENCODE <em>Basic Set</em> selection:</a></b>
 The GENCODE <em>Basic Set</em> is intended to provide a simplified subset of
 the GENCODE transcript annotations that will be useful to the majority of
 users. The goal was to have a high-quality basic set that also covered all loci.  
 Selection of GENCODE annotations for inclusion in the <em>basic set</em>
 was determined independently for the coding and non-coding transcripts at each
 gene locus.
 </p>
 <ul>
   <li> Criteria for selection of coding transcripts (including polymorphic pseudogenes) at a given
        locus:
     <ul>
       <li> All full-length coding transcripts (except problem transcripts or transcripts that are
            nonsense-mediated decay) were included in the basic set.</li>
       <li> If there were no transcripts meeting the above criteria, then the partial coding
            transcript with the largest CDS was included in the basic set (excluding problem transcripts).</li>
     </ul>
   </li>
   <li> Criteria for selection of non-coding transcripts at a given locus:
     <ul>
       <li> All full-length non-coding transcripts (except problem transcripts)
            with a well characterized Biotype (see below) were included in the
            basic set.</li>
       <li> If there were no transcripts meeting the above criteria, then the largest non-coding
            transcript was included in the basic set (excluding problem transcripts).</li>
     </ul>
   </li>
   <li> If no transcripts were included by either of the above criteria, the longest
     problem transcript is included.
   </li>
 </ul>
 
 <P>
 <b>Non-coding transcript categorization:</b> 
 Non-coding transcripts are categorized using
 their <a href="https://www.gencodegenes.org/gencode_biotypes.html" target="_blank">Biotype</a>
 and the following criteria:
 </p>
 <ul>
   <li> well characterized: <em>antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA</em></li>
   <li> poorly characterized: <em>3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping</em></li>
 </ul>
 
+<p><b>Transcript ranking:</b>
+Within each gene, transcripts have been ranked according to the 
+following criteria.  The ranking approach is preliminary and will
+change in future releases.
+</p>
+
+<ul>
+  <li> Protein_coding genes
+    <ol>
+      <li> MANE or Ensembl canonical<br>
+        -1st: MANE Select / Ensembl canonical<br>
+        -2nd: MANE Plus Clinical<br>
+      <li>Coding biotypes<br>
+        -1st: protein_coding and protein_coding_LoF<br>
+        -2nd: NMDs and NSDs<br>
+        -3rd: retained intron and protein_coding_CDS_not_defined<br>
+      <li>Completeness<br>
+        -1st: full length<br>
+        -2nd: CDS start/end not found<br>
+      <li> CARS score (only for coding transcripts)<br>
+      <li> Transcript genomic span and length (only for non-coding transcripts)<br>
+    </ol>
+<li> Non-coding genes
+  <ol>
+    <li> Transcript biotype<br>
+      -1st: transcript biotype identical to gene biotype
+    <li> Ensembl canonical
+    <li> GENCODE basic
+    <li> Transcript genomic span
+    <li> Transcript length
+  </ol>
+</ul>
+
 <p>
-<b><a name="tsl">Transcription Support Level (TSL):</a></b>
+<a name="tsl"><b>Transcription Support Level (TSL):</b></a>
 It is important that users understand how to assess transcript annotations
 that they see in GENCODE. While some transcript models have a high level of
 support through the full length of their exon structure, there are also
 transcripts that are poorly supported and that should be considered
 speculative. The Transcription Support Level (TSL) is a method to highlight the
 well-supported and poorly-supported transcript models for users. The method
 relies on the primary data that can support full-length transcript
 structure: mRNA and EST alignments supplied by UCSC and Ensembl.</p>
 
 <p>The mRNA and EST alignments are compared to the GENCODE transcripts and the
 transcripts are scored according to how well the alignment matches over its
 full length. 
 The GENCODE TSL provides a consistent method of evaluating the
 level of support that a GENCODE transcript annotation is
 actually expressed in mouse.  Mouse transcript sequences from the 
 <a href="https://www.insdc.org/" target="_blank">International Nucleotide
 Sequence Database Collaboration</a> (GenBank, ENA, and DDBJ) are used as
 the evidence for this analysis.
 <a href="https://www.ncbi.nlm.nih.gov/pubmed/15713233" target="_blank">
 Exonerate</a> RNA alignments from Ensembl,
 BLAT RNA and EST alignments from the UCSC Genome Browser Database are used in
 the analysis. Erroneous transcripts and libraries identified in lists
 maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as
 suspect.  GENCODE annotations for protein-coding and non-protein-coding
 transcripts are compared with the evidence alignments.</p>
 
 <p>Annotations in the MHC region and other immunological genes are not
 evaluated, as automatic alignments tend to be very problematic. 
 Methods for evaluating single-exon genes are still being developed and 
 they are not included
 in the current analysis.  Multi-exon GENCODE annotations are evaluated using
 the criteria that all introns are supported by an evidence alignment and the
 evidence alignment does not indicate that there are unannotated exons. Small
 insertions and deletions in evidence alignments are assumed to be due to
 polymorphisms and not considered as differing from the annotations. All
 intron boundaries must match exactly. The transcript start and end locations
 are allowed to differ.</p>
 
 <p>The following categories are assigned to each of the evaluated annotations:</p>
 
 <ul>
   <li> <a name="tsl1"><b>tsl1</b></a> - all splice junctions of the transcript are supported by
     at least one non-suspect mRNA</li>
   <li> <a name="tsl2"><b>tsl2</b></a> - the best supporting mRNA is flagged as suspect or the support is from multiple ESTs</li>
   <li> <a name="tsl3"><b>tsl3</b></a> - the only support is from a single EST</li>
   <li> <a name="tsl4"><b>tsl4</b></a> - the best supporting EST is flagged as suspect</li>
   <li> <a name="tsl5"><b>tsl5</b></a> - no single transcript supports the model structure</li>
   <li> <a name="tslNA"><b>tslNA</b></a> - the transcript was not analyzed for one of the following reasons:
     <ul>
       <li> pseudogene annotation, including transcribed pseudogenes
       <li> immunoglobulin gene transcript
       <li> T-cell receptor transcript
       <li> single-exon transcript (will be included in a future version)
     </ul>
   </li>
 </ul>
 
 <p><b><a name="appris" href="https://appris.bioinfo.cnio.es/#/" target="_blank">APPRIS</a></b>
 is a system to annotate alternatively spliced transcripts based on a range of computational
 methods. It provides value to the annotations of the human, mouse, zebrafish, rat, and pig genomes.
 APPRIS has selected a single CDS variant for each gene as the 'PRINCIPAL' isoform. Principal
 isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable.</p>
 <ul>
   <li>PRINCIPAL:1 - Transcript(s) expected to code for the main functional
     isoform based solely on the core modules in APPRIS.
   <li>PRINCIPAL:2 - Where the APPRIS core modules are unable to choose a clear
     principal variant (approximately 25% of human protein coding genes), the
     database chooses two or more of the CDS variants as "candidates" to be the
     principal variant.
   <li>PRINCIPAL:3 - Where the APPRIS core modules are unable to choose a clear
     principal variant and more than one of the variants have distinct
     CCDS identifiers, APPRIS selects the variant with lowest CCDS identifier
     as the principal variant. The lower the CCDS identifier, the earlier it
     was annotated.
   <li>PRINCIPAL:4 - Where the APPRIS core modules are unable to choose a clear
     principal CDS and there is more than one variant with distinct (but
     consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as
     the principal variant.
   <li>PRINCIPAL:5 - Where the APPRIS core modules are unable to choose a clear
     principal variant and none of the candidate variants are annotated by CCDS,
     APPRIS selects the longest of the candidate isoforms as the principal variant.
 </ul>
 <p>For genes in which the APPRIS core modules are unable to choose a clear
 principal variant (approximately 25% of human protein coding genes), the
 "candidate" variants not chosen as principal are labeled in the following way:</p>
 <ul>
   <li>ALTERNATIVE:1 - Candidate transcript(s) models that are conserved in at
     least three tested species.
   <li>ALTERNATIVE:2 - Candidate transcript(s) models that appear to be
     conserved in fewer than three tested species.  Non-candidate transcripts are
     not tagged and are considered as "Minor" transcripts. Further information and
     additional web services can be found at the APPRIS website.
 </ul>
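
As an illustration of how the per-gene transcript ranks added in this change could drive the "Maximum number of transcripts to display" setting, here is a minimal Python sketch. It is not the Genome Browser implementation. The record layout (`gene`, `transcript`, `rank` keys), the function name `limit_transcripts_per_gene`, and the assumption that a lower rank number means higher priority are all hypothetical.

```python
from collections import defaultdict


def limit_transcripts_per_gene(transcripts, max_per_gene):
    """Keep at most max_per_gene transcripts for each gene.

    Assumption (not stated in the track description): rank 1 is the
    highest-priority transcript, so lower rank values are kept first.
    """
    by_gene = defaultdict(list)
    for record in transcripts:
        by_gene[record["gene"]].append(record)

    kept = []
    for gene_records in by_gene.values():
        # sorted() is stable, so transcripts with equal rank keep input order.
        kept.extend(sorted(gene_records, key=lambda r: r["rank"])[:max_per_gene])
    return kept


# Hypothetical records; the identifiers are placeholders, not real annotations.
example = [
    {"gene": "geneA", "transcript": "txA3", "rank": 3},
    {"gene": "geneA", "transcript": "txA1", "rank": 1},
    {"gene": "geneA", "transcript": "txA2", "rank": 2},
    {"gene": "geneB", "transcript": "txB1", "rank": 1},
]
top_two = limit_transcripts_per_gene(example, 2)
```

With `max_per_gene=2`, the placeholder gene `geneA` keeps its two best-ranked transcripts and `geneB` keeps its only one.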