cef070ed6c72e0247012855fa88e3e82144a7a9b markd Fri Nov 4 13:45:47 2022 -0700 added all gencode transcript ranks to the track descriptions diff --git src/hg/makeDb/trackDb/wgEncodeGencodeDisplay1.shared.html src/hg/makeDb/trackDb/wgEncodeGencodeDisplay1.shared.html index d1999be..7cf2a6a 100644 --- src/hg/makeDb/trackDb/wgEncodeGencodeDisplay1.shared.html +++ src/hg/makeDb/trackDb/wgEncodeGencodeDisplay1.shared.html @@ -32,30 +32,38 @@ chromosomal coordinates. When multiple PseudoPipe predictions map to a single RetroFinder prediction, only one match is kept for the 2-way consensus set. </li> </ul> <dl> <dt><i>PolyA</i></dt> </dl> <ul> <li><em>GENCODE PolyA</em> contains polyA signals and sites manually annotated on the genome based on transcribed evidence (ESTs and cDNAs) of 3' end of transcripts containing at least 3 A's not matching the genome.</li> </ul> +<p> +<b>Maximum number of transcripts to display</b> +is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks. +Starting with the GENCODE human V42 and mouse VM31 releases, +transcripts are assigned a rank within the gene. The ranks may be used to filter the number of transcripts +displayed in a principled manner. Transcript ranking is not available in the <em>lift37</em> releases. +See <a href="#Methods">Methods</a> for details of rank assignment. 
+</p> <p><b>Filtering</b> is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks using the following criteria:</p> <ul> <li> Transcript class: filter by the basic biological function of a transcript annotation <ul> <li> All - don't filter by transcript class</li> <li> coding - display protein coding transcripts, including polymorphic pseudogenes</li> <li> nonCoding - display non-protein coding transcripts</li> <li> pseudo - display pseudogene transcript annotations</li> <li> problem - display problem transcripts (Biotypes of <em>retained_intron</em>, <em>TEC</em>, or <em>disrupted_domain</em>) </ul> </li> @@ -75,31 +83,31 @@ <li> Transcript Biotype: filter transcripts by <a href="https://www.gencodegenes.org/pages/biotypes.html" target="_blank">Biotype</a></li> <li> Support Level: filter transcripts by <a href="#tsl">transcription support level</a></li> </ul> <p><b>Coloring</b> for the gene annotations is based on the annotation type: </p> <ul> <li><font color="#0c0c78"><b>coding</b></font> <li><font color="#006400"><b>non-coding</b></font> <li><font color="#ff33ff"><b>pseudogene</b></font> <li><font color="#fe0000"><b>problem</b></font> <li><font color="#ff33ff"><b>all 2-way pseudogenes</b></font> <li><font color="#000000"><b>all polyA annotations</b></font> </ul> -<h2>Methods</h2> +<h2 id="Methods">Methods</h2> <p> The GENCODE project aims to annotate all evidence-based gene features on the human and mouse reference sequence with high accuracy by integrating computational approaches (including comparative methods), manual annotation and targeted experimental verification. This goal includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. For a detailed description of the methods and references used, see Harrow <em>et al.</em> (2006). 
</p> <p> <b><a name="basicSetSelection">GENCODE <em>Basic Set</em> selection:</a></b> The GENCODE <em>Basic Set</em> is intended to provide a simplified subset of @@ -132,32 +140,65 @@ problem transcript is included. </li> </ul> <P> <b>Non-coding transcript categorization:</b> Non-coding transcripts are categorized using their <a href="https://www.gencodegenes.org/gencode_biotypes.html" target="_blank">Biotype</a> and the following criteria: </p> <ul> <li> well characterized: <em>antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA</em></li> <li> poorly characterized: <em>3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping</em></li> </ul> +<p><b>Transcript ranking:</b> +Within each gene, transcripts have been ranked according to the +following criteria. The ranking approach is preliminary and will +change in future releases. +</p> + +<ul> + <li> Protein_coding genes + <ol> + <li> MANE or Ensembl canonical<br> + -1st: MANE Select / Ensembl canonical<br> + -2nd: MANE Plus Clinical<br> + <li>Coding biotypes<br> + -1st: protein_coding and protein_coding_LoF<br> + -2nd: NMDs and NSDs<br> + -3rd: retained intron and protein_coding_CDS_not_defined<br> + <li>Completeness<br> + -1st: full length<br> + -2nd: CDS start/end not found<br> + <li> CARS score (only for coding transcripts)<br> + <li> Transcript genomic span and length (only for non-coding transcripts)<br> + </ol> +<li> Non-coding genes + <ol> + <li> Transcript biotype<br> + 1st: transcript biotype identical to gene biotype + <li> Ensembl canonical + <li> GENCODE basic + <li> Transcript genomic span + <li> Transcript length + </ol> +</ul> + <p> -<b><a name="tsl">Transcription Support Level (TSL):</a></b> +<a name="tsl"><b>Transcription Support Level (TSL):</b></a> It is important that users understand how to assess transcript annotations that they see in GENCODE. 
While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative. The Transcription Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl.</p> <p>The mRNA and EST alignments are compared to the GENCODE transcripts and the transcripts are scored according to how well the alignment matches over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support that a GENCODE transcript annotation is actually expressed in mouse. Mouse transcript sequences from the