6120d02a6aa7faf052299967d8bace2d2d706ad3
hiram
Mon Dec 15 13:06:28 2025 -0800
one additional color code definition for paralogous annotations refs #35776
diff --git src/hg/makeDb/trackDb/TOGAv2.html src/hg/makeDb/trackDb/TOGAv2.html
index 995cce59d39..40a83b03d08 100644
--- src/hg/makeDb/trackDb/TOGAv2.html
+++ src/hg/makeDb/trackDb/TOGAv2.html
@@ -1,134 +1,139 @@
 <h2>Description</h2> <p> <b>TOGA2</b> (<b>T</b>ool to infer <b>O</b>rthologs from <b>G</b>enome <b>A</b>lignments <b>2</b>) [1] is the next-generation version of the original TOGA method [2].<br> <b>TOGA2</b> is a homology-based method that integrates gene annotation with orthology inference and classifies genes as intact or lost. </p> <h2>Methods</h2> <p> Like TOGA, <b>TOGA2</b> uses as input the gene annotation of a well-annotated reference species and a pairwise whole genome alignment (alignment chains) between the reference and query genome. Orthologous genomic loci are inferred primarily by alignments of intronic and intergenic regions using machine learning to accurately distinguish orthologous from paralogous or processed pseudogene loci. </p> <p> To annotate genes, CESAR 2.0 [3] is used to determine the positions and boundaries of coding exons of a reference transcript in the orthologous genomic locus in the query species. </p> <p> <b>TOGA2</b> differs from TOGA1 in the following major aspects. <ol> <li> It introduces an exon-wise annotation procedure that leverages exon-level orthology. This increases annotation accuracy, especially for very short exons, and reduces memory usage and runtime.</li> <li> It leverages pre-computed deep learning-based splice site predictions generated by SpliceAI [4] to achieve a higher precision in identifying the correct exon boundaries. These splice site predictions also enable <b>TOGA2</b> to handle evolutionary changes in exon–intron structure, including splice site shifts, intron deletions, and “exonization of introns”. 
</li> <li>A new gene tree–based reconciliation step refines orthology inference and identifies additional 1:1 orthologs.</li> <li>It identifies not only coding exons but also predicts untranslated exons and exonic regions.</li> </ol> <h2>Reference species used by TOGA2</h2> <p>For placental mammals, <b>TOGA2</b> uses as references <ul> <li>human (hg38 assembly)</li> <li>mouse (mm10 assembly)</li> <li>cow (HLbosTau10=GCF_002263795.3 assembly)</li> <li>elephant (HLeleMaxInd3A=GCF_024166365.1 assembly)</li> </ul> For birds, <b>TOGA2</b> uses as references <ul> <li>chicken (HLgalGal7=GCF_016699485.2 assembly)</li> <li>crow (HLcorHaw3=GCF_020740725.1 assembly)</li> <li>zebrafinch (HLtaeGut5=GCF_003957565.2 assembly)</li> <li>kittiwake (HLrisTri2=GCF_028500815.1 assembly)</li> <li>emu (HLdroNov3=GCF_036370855.1 assembly)</li> </ul> </p> <h2>Display Conventions and Configuration</h2> <p> Each annotated transcript is named after the reference transcript, gene symbol and the chain identifier: transcriptID#geneID#chainID. <br> Transcripts ending with #retro are retrogene candidates (processed pseudogenes retaining an intact reading frame). <br> Transcripts ending with #paralog are classified as paralogous by TOGA2’s machine learning classifier; they are only annotated if the respective query locus does not have an orthologous projection.<br> <br> Each annotated transcript is shown in a color-coded classification as <ul> <li><span style='display:inline-block; width:40px; height:15px; background-color:#000064;'> </span> <span style='color:#000064'>"fully intact"</span>: This status is new in TOGA2 and indicates that the projection has a completely intact reading frame, without any inactivating mutations. 
These transcripts likely encode functional proteins.</li> <li><span style='display:inline-block; width:40px; height:15px; background-color:#0000C8;'> </span> <span style='color:#0000C8'>"intact"</span>: middle 80% of the CDS (coding sequence) is present and exhibits no gene-inactivating mutation. However, mutations can be present in the N- or C-terminal 10% of the reading frame, and potentially indicate alterations in the protein's termini. These transcripts likely encode functional proteins.</li> <li><span style='display:inline-block; width:40px; height:15px; background-color:#00C8FF;'> </span> <span style='color:#00C8FF'>"partially intact"</span>: >50% of the CDS is present in the query genome and the middle 80% of the CDS exhibits no inactivating mutation. These transcripts may also encode functional proteins, but the evidence is weaker as parts of the CDS are missing, often due to assembly gaps.</li> <li><span style='display:inline-block; width:40px; height:15px; background-color:#828282;'> </span> <span style='color:#828282'>"missing"</span>: <50% of the CDS is present in the query and the middle 80% of the CDS exhibits no inactivating mutation. There is currently no evidence for transcript loss; however, the uncertainty is higher as more than half of the CDS is missing. Note that Missing transcripts can also arise if no genome alignment chain spans the transcript.</li> <li><span style='display:inline-block; width:40px; height:15px; background-color:#FFA078;'> </span> <span style='color:#FFA078'>"uncertain loss"</span>: there is at least one inactivating mutation in the middle 80% of the CDS, but evidence is not strong enough to classify the transcript as lost. 
These transcripts may or may not encode a functional protein.</li> <li><span style='display:inline-block; width:40px; height:15px; background-color:#FF3232;'> </span> <span style='color:#FF3232'>"lost"</span>: typically several inactivating mutations are present, thus there is strong evidence that the transcript is unlikely to encode a functional protein.</li>
+<li><span style='display:inline-block; width:40px; height:15px; background-color:#9F8170;'> </span>
+ <span style='color:#9F8170'>"paralogous"</span>: Special category. The transcript is classified as paralogous
+by TOGA2’s machine learning classifier and is only retained if the respective query locus does not have
+an orthologous projection. Transcripts in this color have enough inactivating mutations or missing sequence
+ such that their loss status is "missing" or "deleted".</li>
 </ul> </p> <p> Clicking on a transcript provides additional information about the orthology classification, inactivating mutations, the query's nucleotide/protein sequence, and protein/exon alignments. </p> <h2>Credits</h2> <p> This data was prepared by <a href="https://www.senckenberg.de/en/research/institutes-overview/sf/ffm-dept-comparative-genomics/" target="_blank">Michael Hiller's Lab</a> </p> <h2>References</h2> <p> The <b>TOGA2</b> software is available from <a href="https://github.com/hillerlab/TOGA2" target="_blank">github.com/hillerlab/TOGA2</a> </p> <p> [1] Malovichko Y, Bein B, Hilgers L, Stephens A, Yi X, Stadager T, Hoppach L, Koch L, Maschiner M, Hiller M. TOGA2 improves speed and accuracy of comparative gene annotation and orthology inference. In preparation </p> <p> [2] Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales AE, Ahmed AW, Kontopoulos DG, Hilgers L, Lindblad-Toh K, Karlsson EK, Zoonomia Consortium, Hiller M. 
<a href="https://www.science.org/doi/abs/10.1126/science.abn3107?url_ver=Z39.88-2003&rfr_id=ori: rid:crossref.org&rfr_dat=cr_pub%20%200pubmed" target="_blank"> Integrating gene annotation with orthology inference at scale</a>. <em>Science</em>. 2023 Apr 28;380(6643):eabn3107. PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/37104600" target="_blank">37104600</a>; PMC: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10193443/" target="_blank">PMC10193443</a> </p> <p> [3] Sharma V, Schwede P, Hiller M. <a href="https://doi.org/10.1093/bioinformatics/btx527" target="_blank">CESAR 2.0 substantially improves speed and accuracy of comparative gene annotation</a>. <em>Bioinformatics</em>. 2017 Dec 15;33(24):3985-3987. PMID: <a href="https://pubmed.ncbi.nlm.nih.gov/28961744/" target="_blank">28961744</a> </p> <p> [4] Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, Kosmicki JA, Arbelaez J, Cui W, Schwartz GB, Chow ED, Kanterakis E, Gao H, Kia A, Batzoglou S, Sanders SJ, Farh KK-H. <a href="https://doi.org/10.1016/j.cell.2018.12.015" target="_blank">Predicting splicing from primary sequence with deep learning</a>. <em>Cell</em>. 2019 Jan 24;176(3):535-548.e24. PMID: <a href="https://pubmed.ncbi.nlm.nih.gov/30661751" target="_blank">30661751</a> </p> </div> <BR>