ad4e4b66fe4aeec66d13d5ce01bbaf03f148c9f8 hiram Fri Dec 12 17:46:52 2025 -0800 updated text for version 2 refs #35776 diff --git src/hg/makeDb/trackDb/TOGAv2.html src/hg/makeDb/trackDb/TOGAv2.html index 88e73527fc1..995cce59d39 100644 --- src/hg/makeDb/trackDb/TOGAv2.html +++ src/hg/makeDb/trackDb/TOGAv2.html @@ -1,85 +1,134 @@ <h2>Description</h2> <p> -<b>TOGA version 2.0</b> -(<b>T</b>ool to infer <b>O</b>rthologs from <b>G</b>enome <b>A</b>lignments) -is a homology-based method that integrates gene annotation, inferring +<b>TOGA2</b> +(<b>T</b>ool to infer <b>O</b>rthologs from <b>G</b>enome <b>A</b>lignments <b>2</b>) [1] +is the next-generation version of the original TOGA method [2].<br> +<b>TOGA2</b> is a homology-based method that integrates gene annotation, inferring orthologs and classifying genes as intact or lost. </p> <h2>Methods</h2> <p> -As input, <b>TOGA</b> uses a gene annotation of a reference species -(human/hg38 for mammals, chicken/galGal6 for birds) and -a whole genome alignment between the reference and query genome. +Like TOGA, <b>TOGA2</b> takes as input the gene annotation of a well-annotated reference species and +a pairwise whole genome alignment (alignment chains) between the reference and query genome. +Orthologous genomic loci are inferred primarily from alignments of intronic +and intergenic regions, using machine learning to accurately distinguish +orthologous from paralogous or processed pseudogene loci. </p> <p> -<b>TOGA</b> implements a novel paradigm that relies on alignments of intronic -and intergenic regions and uses machine learning to accurately distinguish -orthologs from paralogs or processed pseudogenes. +To annotate genes, CESAR 2.0 [3] is used to determine the positions and boundaries of coding exons of a +reference transcript in the orthologous genomic locus in the query species. 
</p> + <p> -To annotate genes, -<a href="https://academic.oup.com/bioinformatics/article/33/24/3985/4095639" -target="blank">CESAR 2.0</a> -is used to determine the positions and boundaries of coding exons of a -reference transcript in the orthologous genomic locus in the query species. +<b>TOGA2</b> differs from TOGA1 in the following major aspects. +<ol> +<li> It introduces an exon-wise annotation procedure that leverages exon-level orthology. This increases +annotation accuracy, especially for very short exons, and reduces memory usage and runtime.</li> +<li> It leverages pre-computed deep learning-based splice site predictions generated by SpliceAI [4] +to achieve a higher precision in identifying the correct exon boundaries. These splice site predictions +also enable <b>TOGA2</b> to handle evolutionary changes in exon–intron structure, including splice site +shifts, intron deletions, and “exonization of introns”. </li> +<li>A new gene tree–based reconciliation step refines orthology inference and identifies additional +1:1 orthologs.</li> +<li>It not only identifies coding exons but also predicts untranslated exons and exonic regions.</li> +</ol> + +<h2>Reference species used by TOGA2</h2> +<p>For placental mammals, <b>TOGA2</b> uses as references +<ul> +<li>human (hg38 assembly)</li> +<li>mouse (mm10 assembly)</li> +<li>cow (HLbosTau10=GCF_002263795.3 assembly)</li> +<li>elephant (HLeleMaxInd3A=GCF_024166365.1 assembly)</li> +</ul> +For birds, <b>TOGA2</b> uses as references +<ul> +<li>chicken (HLgalGal7=GCF_016699485.2 assembly)</li> +<li>crow (HLcorHaw3=GCF_020740725.1 assembly)</li> +<li>zebra finch (HLtaeGut5=GCF_003957565.2 assembly)</li> +<li>kittiwake (HLrisTri2=GCF_028500815.1 assembly)</li> +<li>emu (HLdroNov3=GCF_036370855.1 assembly)</li> +</ul> </p> <h2>Display Conventions and Configuration</h2> <p> +Each annotated transcript is named after the reference transcript, gene symbol and the chain identifier: transcriptID#geneID#chainID. 
<br> +Transcripts ending with #retro are retrogene candidates (processed pseudogenes retaining an intact reading frame). <br> +Transcripts ending with #paralog are classified as paralogous by TOGA2’s machine learning classifier; they are only annotated if the respective query locus does not have an orthologous projection.<br> +<br> Each annotated transcript is shown in a color-coded classification as <ul> -<li><span style='display:inline-block; width:40px; height:15px; background-color:blue;'> </span> - <span style='color:blue'>"intact"</span>: middle 80% of the CDS - (coding sequence) is present and exhibits no gene-inactivating mutation. - These transcripts likely encode functional proteins.</li> -<li><span style='display:inline-block; width:40px; height:15px; background-color:lightblue;'> </span> - <span style='color:#7193a0'>"partially intact"</span>: 50% of the CDS - is present in the query and the middle 80% of the CDS exhibits no +<li><span style='display:inline-block; width:40px; height:15px; background-color:#000064;'> </span> + <span style='color:#000064'>"fully intact"</span>: This status is new in TOGA2 and indicates that the +projection has a completely intact reading frame, without any inactivating mutations. These transcripts likely + encode functional proteins.</li> +<li><span style='display:inline-block; width:40px; height:15px; background-color:#0000C8;'> </span> + <span style='color:#0000C8'>"intact"</span>: middle 80% of the CDS + (coding sequence) is present and exhibits no gene-inactivating mutation. However, mutations can be +present in the N- or C-terminal 10% of the reading frame, and potentially indicate alterations in the +protein's termini. 
These transcripts likely encode functional proteins.</li> +<li><span style='display:inline-block; width:40px; height:15px; background-color:#00C8FF;'> </span> + <span style='color:#00C8FF'>"partially intact"</span>: >50% of the CDS + is present in the query genome and the middle 80% of the CDS exhibits no inactivating mutation. These transcripts may also encode functional proteins, but the evidence is weaker as parts of the CDS are missing, often due to assembly gaps.</li> -<li><span style='display:inline-block; width:40px; height:15px; background-color:grey;'> </span> - <span style='color:grey'>"missing"</span>: <50% of the CDS is present +<li><span style='display:inline-block; width:40px; height:15px; background-color:#828282;'> </span> + <span style='color:#828282'>"missing"</span>: <50% of the CDS is present in the query and the middle 80% of the CDS exhibits no inactivating - mutation.</li> -<li><span style='display:inline-block; width:40px; height:15px; background-color:orange;'> </span> - <span style='color:orange'>"uncertain loss"</span>: there is 1 + mutation. There is currently no evidence for transcript loss; however, the uncertainty is higher +as more than half of the CDS is missing. Note that Missing transcripts can also arise if no genome alignment +chain spans the transcript.</li> +<li><span style='display:inline-block; width:40px; height:15px; background-color:#FFA078;'> </span> + <span style='color:#FFA078'>"uncertain loss"</span>: there is at least one inactivating mutation in the middle 80% of the CDS, but evidence is not strong enough to classify the transcript as lost. 
These transcripts may or may not encode a functional protein.</li> -<li><span style='display:inline-block; width:40px; height:15px; background-color:red;'> </span> - <span style='color:red'>"lost"</span>: typically several inactivating +<li><span style='display:inline-block; width:40px; height:15px; background-color:#FF3232;'> </span> + <span style='color:#FF3232'>"lost"</span>: typically several inactivating mutations are present, thus there is strong evidence that the transcript is unlikely to encode a functional protein.</li> </ul> </p> <p> Clicking on a transcript provides additional information about the orthology -classification, inactivating mutations, the protein sequence and protein/exon +classification, inactivating mutations, the query's nucleotide/protein sequence, and protein/exon alignments. </p> <h2>Credits</h2> <p> -This data was prepared by the <a href="https://tbg.senckenberg.de/hillerlab/" -target="_blank">Michael Hiller Lab</a> +This data was prepared by the <a href="https://www.senckenberg.de/en/research/institutes-overview/sf/ffm-dept-comparative-genomics/" +target="_blank">Michael Hiller Lab</a> </p> <h2>References</h2> <p> -The <b>TOGA</b> software is available from -<a href="https://github.com/hillerlab/TOGA" -target="_blank">github.com/hillerlab/TOGA</a> +The <b>TOGA2</b> software is available from +<a href="https://github.com/hillerlab/TOGA2" +target="_blank">github.com/hillerlab/TOGA2</a> </p> <p> -Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales AE, Ahmed AW, Kontopoulos -DG, Hilgers L <em>et al</em>. +[1] Malovichko Y, Bein B, Hilgers L, Stephens A, Yi X, Stadager T, Hoppach L, Koch L, Maschiner M, Hiller M. TOGA2 improves speed and accuracy of comparative gene annotation and orthology inference. In preparation. +</p> +<p> +[2] Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales AE, Ahmed AW, Kontopoulos DG, Hilgers L, Lindblad-Toh K, Karlsson EK, Zoonomia Consortium, Hiller M. 
<a href="https://www.science.org/doi/abs/10.1126/science.abn3107?url_ver=Z39.88-2003&rfr_id=ori: rid:crossref.org&rfr_dat=cr_pub%20%200pubmed" target="_blank"> Integrating gene annotation with orthology inference at scale</a>. <em>Science</em>. 2023 Apr 28;380(6643):eabn3107. PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/37104600" target="_blank">37104600</a>; PMC: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10193443/" target="_blank">PMC10193443</a> </p> +<p> +[3] +Sharma V, Schwede P, Hiller M. <a href="https://doi.org/10.1093/bioinformatics/btx527" target="_blank">CESAR 2.0 substantially improves speed and accuracy of comparative gene annotation</a>. <em>Bioinformatics</em>. 2017 Dec 15;33(24):3985-3987. PMID: <a href="https://pubmed.ncbi.nlm.nih.gov/28961744/" target="_blank">28961744</a> </p> +<p> +[4] +Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, Kosmicki JA, Arbelaez J, Cui W, Schwartz GB, Chow ED, Kanterakis E, Gao H, Kia A, Batzoglou S, Sanders SJ, Farh KK-H. <a href="https://doi.org/10.1016/j.cell.2018.12.015" target="_blank">Predicting splicing from primary sequence with deep learning</a>. <em>Cell</em>. 2019 Jan 24;176(3):535-548.e24. PMID: <a href="https://pubmed.ncbi.nlm.nih.gov/30661751" target="_blank">30661751</a> +</p> +</div> +<BR>
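
Reviewer note: the naming convention (transcriptID#geneID#chainID, with optional #retro and #paralog endings) and the six status colors described in this diff can be sketched in a few lines of Python. This is only an illustration, not part of TOGA2 or the Genome Browser. The function and class names are hypothetical, and the code assumes that the #retro or #paralog tag is the last "#"-separated field and that it may either follow or stand in for the chain identifier, which the page text does not specify. The hex colors are copied from the inline styles in the diff.

```python
from typing import NamedTuple, Optional, Tuple

# Status colors copied from the inline styles on the description page.
STATUS_COLORS = {
    "fully intact": "#000064",
    "intact": "#0000C8",
    "partially intact": "#00C8FF",
    "missing": "#828282",
    "uncertain loss": "#FFA078",
    "lost": "#FF3232",
}

# Endings that the page describes for special projections.
SPECIAL_TAGS = ("retro", "paralog")


class Projection(NamedTuple):
    """Hypothetical container for the fields of a projection name."""
    transcript_id: str
    gene_id: str
    chain_id: Optional[str]
    tag: Optional[str]  # "retro", "paralog", or None


def parse_projection_name(name: str) -> Projection:
    """Split a name of the form transcriptID#geneID#chainID[#tag].

    Assumption: a trailing "retro" or "paralog" field is a tag; whether the
    chain identifier is still present in that case is not stated on the page,
    so both layouts are accepted.
    """
    parts = name.split("#")
    tag = None
    if parts[-1] in SPECIAL_TAGS:
        tag = parts.pop()
    if len(parts) == 3:
        return Projection(parts[0], parts[1], parts[2], tag)
    if len(parts) == 2 and tag is not None:
        return Projection(parts[0], parts[1], None, tag)
    raise ValueError(f"unexpected projection name: {name!r}")


def hex_to_rgb(color: str) -> Tuple[int, int, int]:
    """Convert "#RRGGBB" to an (r, g, b) tuple of integers."""
    return tuple(int(color[i:i + 2], 16) for i in (1, 3, 5))
```

For example, `parse_projection_name("T1#G1#7#retro")` yields a record whose `tag` is `"retro"`, and `hex_to_rgb(STATUS_COLORS["lost"])` gives the RGB triple for the "lost" color.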