src/hg/makeDb/trackDb/TOGAv2.html 6120d02a6aa7faf052299967d8bace2d2d706ad3

6120d02a6aa7faf052299967d8bace2d2d706ad3
hiram
  Mon Dec 15 13:06:28 2025 -0800
one additional color code definition for paralogous annotations refs #35776

diff --git src/hg/makeDb/trackDb/TOGAv2.html src/hg/makeDb/trackDb/TOGAv2.html
index 995cce59d39..40a83b03d08 100644
--- src/hg/makeDb/trackDb/TOGAv2.html
+++ src/hg/makeDb/trackDb/TOGAv2.html
@@ -1,134 +1,139 @@
 <h2>Description</h2>
 <p>
 <b>TOGA2</b>
 (<b>T</b>ool to infer <b>O</b>rthologs from <b>G</b>enome <b>A</b>lignments <b>2</b>) [1]
 is the next-generation version of the original TOGA method [2].<br>
 <b>TOGA2</b> is a homology-based method that integrates gene annotation with orthology
 inference and the classification of genes as intact or lost.
 </p>
 
 <h2>Methods</h2>
 <p>
 Like TOGA, <b>TOGA2</b> takes as input the gene annotation of a well-annotated reference species and
 a pairwise whole genome alignment (alignment chains) between the reference and query genome.
 Orthologous genomic loci are inferred primarily by alignments of intronic
 and intergenic regions using machine learning to accurately distinguish
 orthologous from paralogous or processed pseudogene loci.
 </p>
 <p>
 To annotate genes, CESAR 2.0 [3] is used to determine the positions and boundaries of coding exons of a
 reference transcript in the orthologous genomic locus in the query species.
 </p>
 
 <p>
 <b>TOGA2</b> differs from TOGA1 in the following major aspects.
 <ol>
 <li> It introduces an exon-wise annotation procedure that leverages exon-level orthology. This increases
 annotation accuracy, especially for very short exons, and reduces memory usage and runtime.</li>
 <li> It leverages pre-computed deep learning-based splice site predictions generated by SpliceAI [4]
 to achieve a higher precision in identifying the correct exon boundaries. These splice site predictions
 also enable <b>TOGA2</b> to handle evolutionary changes in exon–intron structure, including splice site
 shifts, intron deletions, and “exonization of introns”. </li>
 <li>A new gene tree–based reconciliation step refines orthology inference and identifies additional
 1:1 orthologs.</li>
 <li>It identifies not only coding exons but also predicts untranslated exons and exonic regions.</li>
 </ol>
 
 <h2>Reference species used by TOGA2</h2>
 <p>For placental mammals, <b>TOGA2</b> uses as references
 <ul>
 <li>human (hg38 assembly)</li>
 <li>mouse (mm10 assembly)</li>
 <li>cow (HLbosTau10=GCF_002263795.3 assembly)</li>
 <li>elephant (HLeleMaxInd3A=GCF_024166365.1 assembly)</li>
 </ul>
 For birds, <b>TOGA2</b> uses as references
 <ul>
 <li>chicken (HLgalGal7=GCF_016699485.2 assembly)</li>
 <li>crow (HLcorHaw3=GCF_020740725.1 assembly)</li>
 <li>zebrafinch (HLtaeGut5=GCF_003957565.2 assembly)</li>
 <li>kittiwake (HLrisTri2=GCF_028500815.1 assembly)</li>
 <li>emu (HLdroNov3=GCF_036370855.1 assembly)</li>
 </ul>
 </p>
 
 <h2>Display Conventions and Configuration</h2>
 <p>
 Each annotated transcript is named after the reference transcript, gene symbol and the chain identifier: transcriptID#geneID#chainID. <br>
 Transcripts ending with #retro are retrogene candidates (processed pseudogenes retaining an intact reading frame). <br>
 Transcripts ending with #paralog are classified as paralogous by TOGA2’s machine learning classifier; they are only annotated if the respective query locus does not have an orthologous projection.<br>
 <br>
 Each annotated transcript is shown in a color-coded classification as
 <ul>
 <li><span style='display:inline-block; width:40px; height:15px; background-color:#000064;'>&nbsp;</span>
     <span style='color:#000064'>"fully intact"</span>: This status is new in TOGA2 and indicates that the
 projection has a completely intact reading frame, without any inactivating mutations. These transcripts likely
  encode functional proteins.</li>
 <li><span style='display:inline-block; width:40px; height:15px; background-color:#0000C8;'>&nbsp;</span>
     <span style='color:#0000C8'>"intact"</span>: middle 80% of the CDS
     (coding sequence) is present and exhibits no gene-inactivating mutation. However, mutations can be
 present in the N- or C-terminal 10% of the reading frame, and potentially indicate alterations in the
 protein's termini. These transcripts likely encode functional proteins.</li>
 <li><span style='display:inline-block; width:40px; height:15px; background-color:#00C8FF;'>&nbsp;</span>
     <span style='color:#00C8FF'>"partially intact"</span>: &gt;50% of the CDS
      is present in the query genome and the middle 80% of the CDS exhibits no
      inactivating mutation. These transcripts may also encode functional
      proteins, but the evidence is weaker as parts of the CDS are missing,
      often due to assembly gaps.</li>
 <li><span style='display:inline-block; width:40px; height:15px; background-color:#828282;'>&nbsp;</span>
     <span style='color:#828282'>"missing"</span>: &lt;50% of the CDS is present
      in the query and the middle 80% of the CDS exhibits no inactivating
      mutation. There is currently no evidence for transcript loss; however, the uncertainty is higher
 as more than half of the CDS is missing. Note that "missing" transcripts can also arise if no genome alignment
 chain spans the transcript.</li>
 <li><span style='display:inline-block; width:40px; height:15px; background-color:#FFA078;'>&nbsp;</span>
     <span style='color:#FFA078'>"uncertain loss"</span>: there is at least one
      inactivating mutation in the middle 80% of the CDS, but evidence is not
      strong enough to classify the transcript as lost. These transcripts may
      or may not encode a functional protein.</li>
 <li><span style='display:inline-block; width:40px; height:15px; background-color:#FF3232;'>&nbsp;</span>
     <span style='color:#FF3232'>"lost"</span>: typically several inactivating
      mutations are present, thus there is strong evidence that the transcript
      is unlikely to encode a functional protein.</li>
+<li><span style='display:inline-block; width:40px; height:15px; background-color:#9F8170;'>&nbsp;</span>
+    <span style='color:#9F8170'>"paralogous"</span>: Special category. The transcript is classified as paralogous
+by TOGA2’s machine learning classifier; such transcripts are only retained if the respective query locus does not have
+an orthologous projection. Transcripts in this color have enough inactivating mutations or missing sequence
+    that their loss status is "missing" or "deleted".</li>
 </ul>
 </p>
 <p>
 Clicking on a transcript provides additional information about the orthology
 classification, inactivating mutations, the query's nucleotide/protein sequence, and protein/exon
 alignments.
 </p>
 
 <h2>Credits</h2>
 <p>
 This data was prepared by the <a href="https://www.senckenberg.de/en/research/institutes-overview/sf/ffm-dept-comparative-genomics/"
 target="_blank">Michael Hiller Lab</a>.
 </p>
 
 <h2>References</h2>
 <p>
 The <b>TOGA2</b> software is available from
 <a href="https://github.com/hillerlab/TOGA2"
 target="_blank">github.com/hillerlab/TOGA2</a>
 </p>
 
 <p>
 [1] Malovichko Y, Bein B, Hilgers L, Stephens A, Yi X, Stadager T, Hoppach L, Koch L, Maschiner M, Hiller M. TOGA2 improves speed and accuracy of comparative gene annotation and orthology inference. In preparation
 </p>
 <p>
 [2] Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales AE, Ahmed AW, Kontopoulos DG, Hilgers L, Lindblad-Toh K, Karlsson EK, Zoonomia Consortium, Hiller M.
 <a href="https://www.science.org/doi/abs/10.1126/science.abn3107?url_ver=Z39.88-2003&amp;rfr_id=ori:
 rid:crossref.org&amp;rfr_dat=cr_pub%20%200pubmed" target="_blank">
 Integrating gene annotation with orthology inference at scale</a>.
 <em>Science</em>. 2023 Apr 28;380(6643):eabn3107.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/37104600" target="_blank">37104600</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10193443/" target="_blank">PMC10193443</a>
 </p>
 <p>
 [3]
 Sharma V, Schwede P, Hiller M. <a href="https://doi.org/10.1093/bioinformatics/btx527" target="_blank">CESAR 2.0 substantially improves speed and accuracy of comparative gene annotation</a>. <em>Bioinformatics</em>. 2017 Dec 15;33(24):3985-3987. PMID: <a href="https://pubmed.ncbi.nlm.nih.gov/28961744/" target="_blank">28961744</a> </p>
 <p>
 [4]
 Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, Kosmicki JA, Arbelaez J, Cui W, Schwartz GB, Chow ED, Kanterakis E, Gao H, Kia A, Batzoglou S, Sanders SJ, Farh KK-H. <a href="https://doi.org/10.1016/j.cell.2018.12.015" target="_blank">Predicting splicing from primary sequence with deep learning</a>. <em>Cell</em>. 2019 Jan 24;176(3):535-548.e24. PMID: <a href="https://pubmed.ncbi.nlm.nih.gov/30661751" target="_blank">30661751</a>
 </p>
 </div>
 <BR>
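
For readers who post-process this track, the naming convention and color legend described in the page above can be sketched in Python as follows. This is only an illustration, not part of TOGA2 or the browser code: the function and type names are hypothetical, the example names in the usage note are made up, and it assumes that "#retro" or "#paralog" is appended as a fourth "#"-separated field and that the other fields contain no "#". The hex colors are copied from the legend.

```python
from typing import NamedTuple, Optional, Tuple

# Hex colors copied from the legend in the track description.
CLASS_COLORS = {
    "fully intact": "#000064",
    "intact": "#0000C8",
    "partially intact": "#00C8FF",
    "missing": "#828282",
    "uncertain loss": "#FFA078",
    "lost": "#FF3232",
    "paralogous": "#9F8170",
}


class ProjectionName(NamedTuple):
    """Hypothetical container for the parts of a projection name."""
    transcript_id: str
    gene_id: str
    chain_id: str
    tag: Optional[str]  # "retro", "paralog", or None


def parse_projection_name(name: str) -> ProjectionName:
    """Split transcriptID#geneID#chainID, with an optional trailing tag.

    Assumption: the tag, when present, is a fourth '#'-separated field.
    """
    fields = name.split("#")
    tag = None
    if fields[-1] in ("retro", "paralog"):
        tag = fields.pop()
    if len(fields) != 3:
        raise ValueError(f"unexpected projection name: {name!r}")
    return ProjectionName(fields[0], fields[1], fields[2], tag)


def hex_to_rgb(color: str) -> Tuple[int, int, int]:
    """Convert '#RRGGBB' to an (r, g, b) tuple of integers."""
    value = color.lstrip("#")
    return (int(value[0:2], 16), int(value[2:4], 16), int(value[4:6], 16))
```

With made-up identifiers, parse_projection_name("ENST00000000001#GENEX#42#paralog") yields a record whose tag is "paralog", and hex_to_rgb(CLASS_COLORS["paralogous"]) gives the legend color as an integer triple.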