--------------------------------------------------------------- ochPri3.trackDb.html : Differences exist between hgwbeta and hgw2 (RR fields taken from public MySql server, not individual machine) 972,983d971 < cpgIslandExt |

< cpgIslandExt | The calculation of the track data is performed by the following command sequence: < cpgIslandExt |

< cpgIslandExt | twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \\
< cpgIslandExt |   | cpg_lh /dev/stdin 2> cpg_lh.err \\
< cpgIslandExt |     |  awk '{$2 = $2 - 1; width = $3 - $2;  printf("%s\\t%d\\t%s\\t%s %s\\t%s\\t%s\\t%0.0f\\t%0.1f\\t%s\\t%s\\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \\
< cpgIslandExt |      | sort -k1,1 -k2,2n > cpgIsland.bed
< cpgIslandExt |

< cpgIslandExt | The unmasked track data is constructed from < cpgIslandExt | twoBitToFa -noMask output for the twoBitToFa command. < cpgIslandExt |
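Note: the field arithmetic of the awk step in the command sequence above can be mirrored in a few lines of Python. This is a minimal sketch and not part of the UCSC pipeline. The function name `cpg_lh_to_bed` and the sample input line are hypothetical, and the comment about 0-based starts is an inference from the awk expression, not taken from cpg_lh documentation.

```python
def cpg_lh_to_bed(line):
    """Convert one whitespace-separated cpg_lh output line to a cpgIsland.bed row.

    Mirrors the awk printf from the command sequence; awk fields $1..$9
    correspond to f[0]..f[8] here.
    """
    f = line.split()
    start = int(f[1]) - 1          # $2 = $2 - 1 (presumably shifts the start to 0-based)
    width = int(f[2]) - start      # width = $3 - $2
    cpg = f[5]                     # $6, printed twice by the awk
    return "\t".join([
        f[0],                                        # $1
        "%d" % start,                                # $2
        f[2],                                        # $3
        "%s %s" % (f[4], cpg),                       # "$5 $6"
        "%s" % width,                                # width
        cpg,                                         # $6
        "%0.0f" % (width * float(f[6]) * 0.01),      # width*$7*0.01
        "%0.1f" % (100.0 * 2 * float(cpg) / width),  # 100.0*2*$6/width
        f[6],                                        # $7
        f[8],                                        # $9
    ])


# Made-up input line, for illustration only
row = cpg_lh_to_bed("chr1 100 300 x CpG: 20 60.0 y 0.85")
```

Sorting the resulting rows by chromosome and then numerically by start would correspond to the final `sort -k1,1 -k2,2n` step.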

995,999d982 < cpgIslandExt |

< cpgIslandExt | The source for the cpg_lh program can be obtained from < cpgIslandExt | src/utils/cpgIslandExt/. < cpgIslandExt | The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file") < cpgIslandExt |

1077,1088d1059 < cpgIslandExtUnmasked |

< cpgIslandExtUnmasked | The calculation of the track data is performed by the following command sequence: < cpgIslandExtUnmasked |

< cpgIslandExtUnmasked | twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \\
< cpgIslandExtUnmasked |   | cpg_lh /dev/stdin 2> cpg_lh.err \\
< cpgIslandExtUnmasked |     |  awk '{$2 = $2 - 1; width = $3 - $2;  printf("%s\\t%d\\t%s\\t%s %s\\t%s\\t%s\\t%0.0f\\t%0.1f\\t%s\\t%s\\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \\
< cpgIslandExtUnmasked |      | sort -k1,1 -k2,2n > cpgIsland.bed
< cpgIslandExtUnmasked |

< cpgIslandExtUnmasked | The unmasked track data is constructed from < cpgIslandExtUnmasked | twoBitToFa -noMask output for the twoBitToFa command. < cpgIslandExtUnmasked |

1100,1104d1070 < cpgIslandExtUnmasked |

< cpgIslandExtUnmasked | The source for the cpg_lh program can be obtained from < cpgIslandExtUnmasked | src/utils/cpgIslandExt/. < cpgIslandExtUnmasked | The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file") < cpgIslandExtUnmasked |

1182,1193d1147 < cpgIslandSuper |

< cpgIslandSuper | The calculation of the track data is performed by the following command sequence: < cpgIslandSuper |

< cpgIslandSuper | twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \\
< cpgIslandSuper |   | cpg_lh /dev/stdin 2> cpg_lh.err \\
< cpgIslandSuper |     |  awk '{$2 = $2 - 1; width = $3 - $2;  printf("%s\\t%d\\t%s\\t%s %s\\t%s\\t%s\\t%0.0f\\t%0.1f\\t%s\\t%s\\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \\
< cpgIslandSuper |      | sort -k1,1 -k2,2n > cpgIsland.bed
< cpgIslandSuper |

< cpgIslandSuper | The unmasked track data is constructed from < cpgIslandSuper | twoBitToFa -noMask output for the twoBitToFa command. < cpgIslandSuper |

1205,1209d1158 < cpgIslandSuper |

< cpgIslandSuper | The source for the cpg_lh program can be obtained from < cpgIslandSuper | src/utils/cpgIslandExt/. < cpgIslandSuper | The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file") < cpgIslandSuper |

1385,1468d1333 < HLTOGAannotvHg38v1 | html < HLTOGAannotvHg38v1 |

Description

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | TOGA < HLTOGAannotvHg38v1 | (Tool to infer Orthologs from Genome Alignments) < HLTOGAannotvHg38v1 | is a homology-based method that integrates gene annotation, inferring < HLTOGAannotvHg38v1 | orthologs and classifying genes as intact or lost. < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | < HLTOGAannotvHg38v1 |

Methods

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | As input, TOGA uses a gene annotation of a reference species < HLTOGAannotvHg38v1 | (human/hg38 for mammals, chicken/galGal6 for birds) and < HLTOGAannotvHg38v1 | a whole genome alignment between the reference and query genome. < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | TOGA implements a novel paradigm that relies on alignments of intronic < HLTOGAannotvHg38v1 | and intergenic regions and uses machine learning to accurately distinguish < HLTOGAannotvHg38v1 | orthologs from paralogs or processed pseudogenes. < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | To annotate genes, < HLTOGAannotvHg38v1 | CESAR 2.0 < HLTOGAannotvHg38v1 | is used to determine the positions and boundaries of coding exons of a < HLTOGAannotvHg38v1 | reference transcript in the orthologous genomic locus in the query species. < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | < HLTOGAannotvHg38v1 |

Display Conventions and Configuration

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 < HLTOGAannotvHg38v1 |

"intact": middle 80% of the CDS (coding sequence) is present and exhibits no gene-inactivating mutation. These transcripts likely encode functional proteins.

"partially intact": ≥50% of the CDS is present in the query and the middle 80% of the CDS exhibits no inactivating mutation. These transcripts may also encode functional proteins, but the evidence is weaker as parts of the CDS are missing, often due to assembly gaps.

"missing": <50% of the CDS is present in the query and the middle 80% of the CDS exhibits no inactivating mutation.

"uncertain loss": there is 1 inactivating mutation in the middle 80% of the CDS, but evidence is not strong enough to classify the transcript as lost. These transcripts may or may not encode a functional protein.

"lost": typically several inactivating mutations are present, thus there is strong evidence that the transcript is unlikely to encode a functional protein.

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | Clicking on a transcript provides additional information about the orthology < HLTOGAannotvHg38v1 | classification, inactivating mutations, the protein sequence and protein/exon < HLTOGAannotvHg38v1 | alignments. < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | < HLTOGAannotvHg38v1 |

Credits

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | This data was prepared by the Michael Hiller Lab < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | < HLTOGAannotvHg38v1 |

References

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | The TOGA software is available from < HLTOGAannotvHg38v1 | github.com/hillerlab/TOGA < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales A, < HLTOGAannotvHg38v1 | Ahmed AW, Kontopoulos DG, Hilgers L, Zoonomia Consortium, Hiller M. < HLTOGAannotvHg38v1 | TOGA integrates gene annotation with orthology inference < HLTOGAannotvHg38v1 | at scale. bioRxiv preprint September 2022 < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | 2177a2043,2061 > refSeqComposite |

> refSeqComposite | RefSeq Select+MANE (subset) – Subset of RefSeq Curated, transcripts marked as > refSeqComposite | RefSeq Select or MANE Select. > refSeqComposite | A single Select transcript is chosen as representative for each protein-coding gene. > refSeqComposite | This track includes transcripts categorized as MANE, which are further agreed upon as > refSeqComposite | representative by both NCBI RefSeq and Ensembl/GENCODE, and have a 100% identical match > refSeqComposite | to a transcript in the Ensembl annotation. See > refSeqComposite | NCBI RefSeq Select (https://www.ncbi.nlm.nih.gov/refseq/refseq_select/). > refSeqComposite | Note that we provide a separate track, > refSeqComposite | MANE (hg38) (hgTrackUi?g=mane&db=hg38&c=chr22), > refSeqComposite | which contains only the MANE transcripts. > refSeqComposite |

> refSeqComposite |

> refSeqComposite | RefSeq HGMD (subset) – Subset of RefSeq Curated, transcripts annotated by the Human > refSeqComposite | Gene Mutation Database. This track is only available on the human genomes hg19 and hg38. > refSeqComposite | It is the most restricted RefSeq subset, targeting clinical diagnostics. > refSeqComposite |

> refSeqComposite | > refSeqComposite | 2180,2181c2064,2065 < refSeqComposite | The RefSeq All, RefSeq Curated, RefSeq Predicted, and < refSeqComposite | UCSC RefSeq tracks follow the display conventions for --- > refSeqComposite | The RefSeq All, RefSeq Curated, RefSeq Predicted, RefSeq HGMD, > refSeqComposite | RefSeq Select/MANE and UCSC RefSeq tracks follow the display conventions for 2291,2292c2175 < refSeqComposite | using the REST API, < refSeqComposite | Table Browser or --- > refSeqComposite | using the Table Browser or 2313a2197,2198 > refSeqComposite |

RefSeq HGMD - ncbiRefSeqHgmd

> refSeqComposite |

RefSeq Select+MANE - ncbiRefSeqSelect

2415,2419c2300 < rmsk | Repbase Update is described in Jurka (2000) in the References section below. < rmsk | Some newer assemblies have been made with Dfam, not Repbase. You can < rmsk | find the details for how we make our database data here in our "makeDb/doc/" < rmsk | directory.

--- > rmsk | Repbase Update is described in Jurka (2000) in the References section below.

3322c3203 < ucscToINSDC | names from the ucscToINSDC | names from the Protein sequences from SwissProt mapped to the genome. All other --- > uniprot | Protein sequences from SwissProt mapped onto the genome. All other 3393,3397c3274,3275 < uniprot | through this alignment. Even protein sequences without a single curated < uniprot | annotation (splice isoforms) are visible in this track. Each UniProt protein < uniprot | has one main isoform, which is colored in dark. Alternative isoforms are < uniprot | sequences that do not have annotations on them and are colored in light-blue. < uniprot | They can be hidden with the TrEMBL/Isoform filter (see below). --- > uniprot | using this track. Protein sequences without a single curated > uniprot | annotation were not added to this track. 3400c3278 < uniprot | Protein sequences from TrEMBL mapped to the genome. All other tracks --- > uniprot | Protein sequences from TrEMBL mapped onto the genome. All other tracks 3403c3281,3282 < uniprot | checkbox on the track configuration page. --- > uniprot | checkbox on the track configuration page. Protein sequences without a single > uniprot | predicted annotation on them were not added to this track. 3462,3463c3341 < uniprot | For consistency and convenience for users of mutation-related tracks, < uniprot | the subtrack "UniProt/SwissProt Variants" is a copy of the track --- > uniprot | For consistency, the subtrack "UniProt/SwissProt Variants" is a copy of the track 3500c3378 < uniprot | annotations mapped through different protein sequence alignments but with the same genome --- > uniprot | annotations mapped through different transcripts but with the same genome 3503,3508c3381 < uniprot |

On the configuration page of this track, you can choose to hide any TrEMBL annotations. < uniprot | This filter will also hide the UniProt alternative isoform protein sequences because < uniprot | both types of information are less relevant to most users. Please contact us if you < uniprot | want more detailed filtering features.

< uniprot | < uniprot |

Note that for the human hg38 assembly and SwissProt annotations, there --- > uniprot |

Note that only for the human hg38 assembly and SwissProt annotations, there 3510c3383 < uniprot | href="hgTracks?db=hg38&hubUrl=https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/genome_annotation_tracks/UP000005640_9606_hub/hub.txt" target=_blank>public --- > uniprot | href="hgTracks?db=hg38&hubUrl=ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/genome_annotation_tracks/UP000005640_9606_hub/hub.txt">public 3514,3516c3387 < uniprot | for a given protein. For proteins that differ from the genome, UniProt's mapping method < uniprot | will, in most cases, map a protein and its annotations to an unexpected location < uniprot | (see below for details on UCSC's mapping method).

--- > uniprot | for a given protein.

3521,3579c3392,3394 < uniprot | Briefly, UniProt protein sequences were aligned to the transcripts associated < uniprot | with the protein, the top-scoring alignments were retained, and the result was < uniprot | projected to the genome through a transcript-to-genome alignment. < uniprot | Depending on the genome, the transcript-genome alignments were either < uniprot | provided by the source database (NCBI RefSeq), created at UCSC (UCSC RefSeq) or < uniprot | derived from the transcripts (Ensembl/Augustus). The transcript set is NCBI < uniprot | RefSeq for hg38, UCSC RefSeq for hg19 (due to alt/fix haplotype misplacements < uniprot | in the NCBI RefSeq set on hg19). For other genomes, RefSeq, Ensembl and Augustus < uniprot | are tried, in this order. The resulting protein-genome alignments of this process < uniprot | are available in the file formats for liftOver or pslMap from our data archive < uniprot | (see "Data Access" section below). < uniprot |

< uniprot | < uniprot |

An important step of the mapping process is filtering the alignment from < uniprot | protein to transcript. Due to differences between the UniProt proteins and the < uniprot | transcripts and the genome, the best matching transcript is not always the < uniprot | correct transcript. Therefore, only for organisms that have a RefSeq transcript track, < uniprot | proteins are aligned only to the RefSeq transcripts that are annotated < uniprot | by UniProt for this protein. If no transcripts are annotated on the protein, or < uniprot | the annotated ones do not exist anymore, but an NCBI Gene ID is annotated, < uniprot | the RefSeq transcripts for the gene are used. If no NCBI Gene ID is annotated, < uniprot | then the best matching alignment is used. Only a handful of edge cases < uniprot | (pseudogenes, very recently added proteins) on hg38 remain where the < uniprot | global transcriptome-wide matches have to be used. The details page of the < uniprot | protein alignments shows the transcripts used for the mapping and how < uniprot | these transcripts were found. There can be multiple transcripts for one < uniprot | protein, as their coding sequences can be identical or several of them do < uniprot | not differ by more than 1% in alignment score. < uniprot |

< uniprot | < uniprot |

In other words, when an NCBI or UCSC RefSeq track is used for the mapping and to align a < uniprot | protein sequence to the correct transcript, we use a three stage process: < uniprot |

(1) If UniProt has annotated a given RefSeq transcript for a given protein < uniprot | sequence, the protein is aligned to this transcript. Any difference in the < uniprot | version suffix is tolerated in this comparison. < uniprot |
(2) If no transcript is annotated or the transcript cannot be found in the < uniprot | NCBI/UCSC RefSeq track, the UniProt-annotated NCBI Gene ID is resolved to a < uniprot | set of NCBI RefSeq transcript IDs via the most current version of NCBI < uniprot | genes tables. Only the top match of the resulting alignments and all < uniprot | others within 1% of its score are used for the mapping. < uniprot |
(3) If no transcript can be found after step (2), the protein is aligned to all transcripts, < uniprot | the top match, and all others within 1% of its score are used. < uniprot |
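
Note: the three-stage transcript selection described above can be sketched as follows. This is an illustrative reading of the text, not the UCSC build script. All function and parameter names are hypothetical, alignment scores are passed in as a plain dictionary instead of being computed, and "within 1% of its score" is assumed to mean a score of at least 99% of the top score.

```python
def strip_version(acc):
    """Drop a version suffix, e.g. 'NM_000546.6' -> 'NM_000546'."""
    return acc.split(".", 1)[0]


def keep_near_top(scores, tolerance=0.01):
    """Keep the top-scoring transcript and all others within `tolerance` of its score."""
    if not scores:
        return []
    top = max(scores.values())
    return sorted(t for t, s in scores.items() if s >= top * (1 - tolerance))


def select_transcripts(annotated, gene_transcripts, track_scores):
    """Return (stage, transcripts) for one protein.

    annotated:        RefSeq accessions UniProt lists for the protein
    gene_transcripts: accessions resolved from the annotated NCBI Gene ID
    track_scores:     {accession in the RefSeq track: protein-vs-transcript score}
    """
    by_base = {strip_version(t): t for t in track_scores}

    # Stage 1: UniProt-annotated transcripts found in the track, version suffix ignored
    hits = sorted({by_base[strip_version(a)] for a in annotated
                   if strip_version(a) in by_base})
    if hits:
        return 1, hits

    # Stage 2: transcripts of the annotated gene, top match plus near-ties
    gene_scores = {}
    for t in gene_transcripts:
        acc = by_base.get(strip_version(t))
        if acc is not None:
            gene_scores[acc] = track_scores[acc]
    if gene_scores:
        return 2, keep_near_top(gene_scores)

    # Stage 3: all transcripts, top match plus near-ties
    return 3, keep_near_top(track_scores)


# Made-up accessions and scores, for illustration only
scores = {"NM_1.2": 1000, "NM_2.1": 995, "NM_3.1": 900}
stage1 = select_transcripts(["NM_1.1"], [], scores)
stage2 = select_transcripts(["NM_9.1"], ["NM_2.3", "NM_3.1"], scores)
stage3 = select_transcripts([], [], scores)
```

Returning the stage number alongside the transcripts mirrors the statement that the details page shows how the transcripts were found.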

< uniprot | < uniprot |

This system was designed to resolve the problem of incorrect mappings of < uniprot | proteins, mostly on hg38, due to differences between the SwissProt < uniprot | sequences and the genome reference sequence, which has changed since the < uniprot | proteins were defined. The problem is most pronounced for gene families < uniprot | composed of either very repetitive or very similar proteins. To make sure that < uniprot | the alignments always go to the best chromosome location, all _alt and _fix < uniprot | reference patch sequences are ignored for the alignment, so the patches are < uniprot | entirely free of UniProt annotations. Please contact us if you have feedback on < uniprot | this process or example edge cases. We are not aware of a way to evaluate the < uniprot | results completely and in an automated manner.

< uniprot |

< uniprot | Proteins were aligned to transcripts with TBLASTN, converted to PSL, filtered < uniprot | with pslReps (93% query coverage, keep alignments within top 1% score), lifted to genome < uniprot | positions with pslMap and filtered again with pslReps. UniProt annotations were --- > uniprot | UniProt sequences were aligned to one of UCSC, Gencode, Ensembl or Augustus transcript sequences, first with > uniprot | BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted > uniprot | to genome positions with pslMap and filtered again. UniProt annotations were 3581c3396 < uniprot | genome through the alignment described above using the pslMap program. This approach --- > uniprot | genome through the alignment using the pslMap program. This mapping approach 3583,3586c3398,3402 < uniprot | TARGET="_BLANK">LS-SNP pipeline by Mark Diekhans. < uniprot | Like all Genome Browser source code, the main script used to build this track < uniprot | can be found on Github. --- > uniprot | TARGET="_BLANK">LS-SNP pipeline by Mark Diekhans. For human and mouse, the > uniprot | alignments were filtered by retaining only proteins annotated with > uniprot | a given transcript in the Genome Browser table kgXref. Like all Genome Browser > uniprot | source code, the main script used to build this track can be found on > uniprot | github. 3589,3612d3404 < uniprot |

Automated data updates and release history

< uniprot |

< uniprot | This track is automatically updated on an ongoing basis, every 2-3 months. < uniprot | The current version is always shown on the track details page, it includes the < uniprot | release of UniProt, the version of the transcript set and a unique MD5 that is < uniprot | based on the protein sequences, the transcript sequences, the mapping file < uniprot | between both and the transcript-genome alignment. The exact transcript < uniprot | that was used for the alignment is shown when clicking a protein alignment < uniprot | in one of the two alignment tracks. < uniprot |

< uniprot | < uniprot |

< uniprot | For reproducibility of older analysis results, previous versions of this track < uniprot | are available for browsing in the form of the UCSC UniProt Archive Track Hub. The underlying data of < uniprot | all releases of this track (past and current) can be obtained from our downloads server, including the UniProt < uniprot | protein-to-genome alignment. The files are provided in the formats used by the < uniprot | command line programs liftOver and pslMap, which can be used to map < uniprot | coordinates on protein sequences to genome coordinates. The filenames are < uniprot | unipToGenome.over.chain.gz (liftOver) and unipToGenomeLift.psl.gz (pslMap).

< uniprot | 3616c3408 < uniprot | The raw data of the current track can be explored interactively with the --- > uniprot | The raw data can be explored interactively with the 3623c3415 < uniprot | track configuration file. --- > uniprot | track configuration file. 3629c3421 < uniprot |

--- > uniprot |
3631c3423,3428 < uniprot |

--- > uniprot |
> uniprot | This track is updated every month. The MySQL table hgFixed.trackVersion > uniprot | contains the name of the currently available data on the website. Older > uniprot | versions of the data files can be downloaded from the uniprot | href="http://hgdownload.soe.ucsc.edu/goldenPath/ochPri3/archive/">archive > uniprot | folder of our downloads server.
3639,3640d3435 < uniprot |

< uniprot | 3644,3646c3439,3441 < uniprot | This track was created by Maximilian Haeussler at UCSC, with a lot of input from Chris < uniprot | Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff, Alejo < uniprot | Mujica, Regeneron Pharmaceuticals and Pia Riestra, GeneDx. Thanks to UniProt for making all data --- > uniprot | This track was created by Maximilian Haeussler at UCSC, with help from Chris > uniprot | Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Alejo > uniprot | Mujica, Regeneron Pharmaceuticals. Thanks to UniProt for making all data