d2ac08231c388dcae68ce7031d5a7dddb91beb97
hiram
Mon Sep 12 10:55:19 2022 -0700
expanded documentation from Michael refs #29982

diff --git src/hg/makeDb/trackDb/TOGAannotation.html src/hg/makeDb/trackDb/TOGAannotation.html
index 839dafa..c201476 100644
--- src/hg/makeDb/trackDb/TOGAannotation.html
+++ src/hg/makeDb/trackDb/TOGAannotation.html
@@ -1,28 +1,76 @@
 <h2>Description</h2>
 <p>
-<b>T</b>ool to infer <b>O</b>rthologs from <b>G</b>enome <b>A</b>lignments
+<b>TOGA</b>
+(<b>T</b>ool to infer <b>O</b>rthologs from <b>G</b>enome <b>A</b>lignments)
+is a homology-based method that integrates gene annotation, inferring
+orthologs and classifying genes as intact or lost.
 </p>
+
+<h2>Methods</h2>
 <p>
-<b>TOGA</b> is a new method that integrates gene annotation, inferring orthologs
-and classifying genes as intact or lost.
+As input, <b>TOGA</b> uses a gene annotation of a reference species
+(human/hg38 for mammals, chicken/galGal6 for birds) and
+a whole genome alignment between the reference and query genome.
 </p>
 <p>
-<b>TOGA</b> implements a novel machine learning based paradigm to infer
-orthologous genes between related species and to accurately distinguish
+<b>TOGA</b> implements a novel paradigm that relies on alignments of intronic
+and intergenic regions and uses machine learning to accurately distinguish
 orthologs from paralogs or processed pseudogenes.
 </p>
+<p>
+To annotate genes,
+<a href='https://academic.oup.com/bioinformatics/article/33/24/3985/4095639'
+target=blank>CESAR 2.0</a>
+is used to determine the positions and boundaries of coding exons of a
+reference transcript in the orthologous genomic locus in the query species.
+</p>
+
+<h2>Display Conventions and Configuration</h2>
+<p>
+Each annotated transcript is shown in a color-coded classification as
+<ul>
+<li><span style='display:inline-block; width:40px; height:15px; background-color:blue;'> </span>
+ <span style='color:blue'>"intact"</span>: middle 80% of the CDS
+ (coding sequence) is present and exhibits no gene-inactivating mutation.
+ These transcripts likely encode functional proteins.</li>
+<li><span style='display:inline-block; width:40px; height:15px; background-color:lightblue;'> </span>
+ <span style='color:#7193a0'>"partially intact"</span>: 50% of the CDS
+ is present in the query and the middle 80% of the CDS exhibits no
+ inactivating mutation. These transcripts may also encode functional
+ proteins, but the evidence is weaker as parts of the CDS are missing,
+ often due to assembly gaps.</li>
+<li><span style='display:inline-block; width:40px; height:15px; background-color:grey;'> </span>
+ <span style='color:grey'>"missing"</span>: <50% of the CDS is present
+ in the query and the middle 80% of the CDS exhibits no inactivating
+ mutation.</li>
+<li><span style='display:inline-block; width:40px; height:15px; background-color:orange;'> </span>
+ <span style='color:orange'>"uncertain loss"</span>: there is 1
+ inactivating mutation in the middle 80% of the CDS, but evidence is not
+ strong enough to classify the transcript as lost. These transcripts may
+ or may not encode a functional protein.</li>
+<li><span style='display:inline-block; width:40px; height:15px; background-color:red;'> </span>
+ <span style='color:red'>"lost"</span>: typically several inactivating
+ mutations are present, thus there is strong evidence that the transcript
+ is unlikely to encode a functional protein.</li>
+</ul>
+</p>
+<p>
+Clicking on a transcript provides additional information about the orthology
+classification, inactivating mutations, the protein sequence and protein/exon
+alignments.
+</p>
 <h2>Credits</h2>
 <p>
 This data was prepared by the
 <a href='https://tbg.senckenberg.de/hillerlab/'
 target=_blank>Michael Hiller Lab</a>
 </p>
 <h2>References</h2>
 <p>
 The <b>TOGA</b> software is available from
 <a href='https://github.com/hillerlab/TOGA'
 target=_blank>github.com/hillerlab/TOGA</a>
 </p>
 <p>