--------------------------------------------------------------- ochPri3.trackDb.html : Differences exist between hgwbeta and hgw2 (RR fields taken from public MySql server, not individual machine) 972,983d971 < cpgIslandExt |

< cpgIslandExt | The calculation of the track data is performed by the following command sequence: < cpgIslandExt |

< cpgIslandExt | twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \
< cpgIslandExt |   | cpg_lh /dev/stdin 2> cpg_lh.err \
< cpgIslandExt |     |  awk '{$2 = $2 - 1; width = $3 - $2;  printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \
< cpgIslandExt |      | sort -k1,1 -k2,2n > cpgIsland.bed
< cpgIslandExt | 
< cpgIslandExt | The unmasked track data is constructed from < cpgIslandExt | twoBitToFa -noMask output for the twoBitToFa command. < cpgIslandExt |
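
As an illustration only, the awk step of the command sequence above can be written out in Python. This is a minimal sketch, assuming that cpg_lh emits whitespace-separated lines with at least nine columns; the function name `cpg_lh_to_bed` and the sample input line are hypothetical and are not part of the track documentation.

```python
def cpg_lh_to_bed(line: str) -> str:
    """Mirror the awk step: make the start 0-based and derive the output columns."""
    f = line.split()                      # f[0] is awk's $1, f[1] is $2, and so on
    start = int(f[1]) - 1                 # awk: $2 = $2 - 1
    width = int(f[2]) - start             # awk: width = $3 - $2
    count = float(f[5])                   # awk: $6
    percent = float(f[6])                 # awk: $7
    fields = [
        f[0],                             # $1
        "%d" % start,                     # $2 after the shift
        f[2],                             # $3
        "%s %s" % (f[4], f[5]),           # "$5 $6"
        "%d" % width,                     # width
        f[5],                             # $6
        "%0.0f" % (width * percent * 0.01),        # width*$7*0.01
        "%0.1f" % (100.0 * 2 * count / width),     # 100.0*2*$6/width
        f[6],                             # $7
        f[8],                             # $9
    ]
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical input line, used only to exercise the arithmetic.
    print(cpg_lh_to_bed("chr1 100 300 x CpG: 20 60.0 y 0.85"))
```

The sketch handles one line at a time; the sort step of the original pipeline (by name, then numerically by start) would still be applied to the collected output.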

995,999d982 < cpgIslandExt |

< cpgIslandExt | The source for the cpg_lh program can be obtained from < cpgIslandExt | src/utils/cpgIslandExt/. < cpgIslandExt | The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file") < cpgIslandExt |

1077,1088d1059 < cpgIslandExtUnmasked |

< cpgIslandExtUnmasked | The calculation of the track data is performed by the following command sequence: < cpgIslandExtUnmasked |

< cpgIslandExtUnmasked | twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \
< cpgIslandExtUnmasked |   | cpg_lh /dev/stdin 2> cpg_lh.err \
< cpgIslandExtUnmasked |     |  awk '{$2 = $2 - 1; width = $3 - $2;  printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \
< cpgIslandExtUnmasked |      | sort -k1,1 -k2,2n > cpgIsland.bed
< cpgIslandExtUnmasked | 
< cpgIslandExtUnmasked | The unmasked track data is constructed from < cpgIslandExtUnmasked | twoBitToFa -noMask output for the twoBitToFa command. < cpgIslandExtUnmasked |

1100,1104d1070 < cpgIslandExtUnmasked |

< cpgIslandExtUnmasked | The source for the cpg_lh program can be obtained from < cpgIslandExtUnmasked | src/utils/cpgIslandExt/. < cpgIslandExtUnmasked | The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file") < cpgIslandExtUnmasked |

1182,1193d1147 < cpgIslandSuper |

< cpgIslandSuper | The calculation of the track data is performed by the following command sequence: < cpgIslandSuper |

< cpgIslandSuper | twoBitToFa assembly.2bit stdout | maskOutFa stdin hard stdout \
< cpgIslandSuper |   | cpg_lh /dev/stdin 2> cpg_lh.err \
< cpgIslandSuper |     |  awk '{$2 = $2 - 1; width = $3 - $2;  printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", $1, $2, $3, $5, $6, width, $6, width*$7*0.01, 100.0*2*$6/width, $7, $9);}' \
< cpgIslandSuper |      | sort -k1,1 -k2,2n > cpgIsland.bed
< cpgIslandSuper | 
< cpgIslandSuper | The unmasked track data is constructed from < cpgIslandSuper | twoBitToFa -noMask output for the twoBitToFa command. < cpgIslandSuper |

1205,1209d1158 < cpgIslandSuper |

< cpgIslandSuper | The source for the cpg_lh program can be obtained from < cpgIslandSuper | src/utils/cpgIslandExt/. < cpgIslandSuper | The cpg_lh program binary can be obtained from: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh (choose "save file") < cpgIslandSuper |

1385,1468d1333 < HLTOGAannotvHg38v1 | html < HLTOGAannotvHg38v1 |

Description

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | TOGA < HLTOGAannotvHg38v1 | (Tool to infer Orthologs from Genome Alignments) < HLTOGAannotvHg38v1 | is a homology-based method that integrates gene annotation, inferring < HLTOGAannotvHg38v1 | orthologs and classifying genes as intact or lost. < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | < HLTOGAannotvHg38v1 |

Methods

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | As input, TOGA uses a gene annotation of a reference species < HLTOGAannotvHg38v1 | (human/hg38 for mammals, chicken/galGal6 for birds) and < HLTOGAannotvHg38v1 | a whole genome alignment between the reference and query genome. < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | TOGA implements a novel paradigm that relies on alignments of intronic < HLTOGAannotvHg38v1 | and intergenic regions and uses machine learning to accurately distinguish < HLTOGAannotvHg38v1 | orthologs from paralogs or processed pseudogenes. < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | To annotate genes, < HLTOGAannotvHg38v1 | CESAR 2.0 < HLTOGAannotvHg38v1 | is used to determine the positions and boundaries of coding exons of a < HLTOGAannotvHg38v1 | reference transcript in the orthologous genomic locus in the query species. < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | < HLTOGAannotvHg38v1 |

Display Conventions and Configuration

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | Each annotated transcript is shown in a color-coded classification as < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | Clicking on a transcript provides additional information about the orthology < HLTOGAannotvHg38v1 | classification, inactivating mutations, the protein sequence and protein/exon < HLTOGAannotvHg38v1 | alignments. < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | < HLTOGAannotvHg38v1 |

Credits

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | This data was prepared by the Michael Hiller Lab < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | < HLTOGAannotvHg38v1 |

References

< HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | The TOGA software is available from < HLTOGAannotvHg38v1 | github.com/hillerlab/TOGA < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales A, < HLTOGAannotvHg38v1 | Ahmed AW, Kontopoulos DG, Hilgers L, Zoonomia Consortium, Hiller M. < HLTOGAannotvHg38v1 | TOGA integrates gene annotation with orthology inference < HLTOGAannotvHg38v1 | at scale. bioRxiv preprint September 2022 < HLTOGAannotvHg38v1 |

< HLTOGAannotvHg38v1 | 2177a2043,2061 > refSeqComposite |
  • > refSeqComposite | RefSeq Select+MANE (subset) – Subset of RefSeq Curated, transcripts marked as > refSeqComposite | RefSeq Select or MANE Select. > refSeqComposite | A single Select transcript is chosen as representative for each protein-coding gene. > refSeqComposite | This track includes transcripts categorized as MANE, which are further agreed upon as > refSeqComposite | representative by both NCBI RefSeq and Ensembl/GENCODE, and have a 100% identical match > refSeqComposite | to a transcript in the Ensembl annotation. See > refSeqComposite | href="https://www.ncbi.nlm.nih.gov/refseq/refseq_select/">NCBI RefSeq Select. > refSeqComposite | Note that we provide a separate track, > refSeqComposite | target=_blank href="hgTrackUi?g=mane&db=hg38&c=chr22">MANE (hg38), > refSeqComposite | which contains only the MANE transcripts. > refSeqComposite |
  • > refSeqComposite |
  • > refSeqComposite | RefSeq HGMD (subset) – Subset of RefSeq Curated, transcripts annotated by the Human > refSeqComposite | Gene Mutation Database. This track is only available on the human genomes hg19 and hg38. > refSeqComposite | It is the most restricted RefSeq subset, targeting clinical diagnostics. > refSeqComposite |
  • > refSeqComposite | > refSeqComposite | 2180,2181c2064,2065 < refSeqComposite | The RefSeq All, RefSeq Curated, RefSeq Predicted, and < refSeqComposite | UCSC RefSeq tracks follow the display conventions for --- > refSeqComposite | The RefSeq All, RefSeq Curated, RefSeq Predicted, RefSeq HGMD, > refSeqComposite | RefSeq Select/MANE and UCSC RefSeq tracks follow the display conventions for 2291,2292c2175 < refSeqComposite | using the REST API, < refSeqComposite | Table Browser or --- > refSeqComposite | using the Table Browser or 2313a2197,2198 > refSeqComposite |
  • RefSeq HGMD - ncbiRefSeqHgmd
  • > refSeqComposite |
  • RefSeq Select+MANE - ncbiRefSeqSelect
  • 2415,2419c2300 < rmsk | Repbase Update is described in Jurka (2000) in the References section below. < rmsk | Some newer assemblies have been made with Dfam, not Repbase. You can < rmsk | find the details for how we make our database data here in our "makeDb/doc/" < rmsk | directory.

    --- > rmsk | Repbase Update is described in Jurka (2000) in the References section below.

    3322c3203 < ucscToINSDC | names from the ucscToINSDC | names from the Protein sequences from SwissProt mapped to the genome. All other --- > uniprot | Protein sequences from SwissProt mapped onto the genome. All other 3393,3397c3274,3275 < uniprot | through this alignment. Even protein sequences without a single curated < uniprot | annotation (splice isoforms) are visible in this track. Each UniProt protein < uniprot | has one main isoform, which is colored in dark. Alternative isoforms are < uniprot | sequences that do not have annotations on them and are colored in light-blue. < uniprot | They can be hidden with the TrEMBL/Isoform filter (see below). --- > uniprot | using this track. Protein sequences without a single curated > uniprot | annotation were not added to this track. 3400c3278 < uniprot | Protein sequences from TrEMBL mapped to the genome. All other tracks --- > uniprot | Protein sequences from TrEMBL mapped onto the genome. All other tracks 3403c3281,3282 < uniprot | checkbox on the track configuration page. --- > uniprot | checkbox on the track configuration page. Protein sequences without a single > uniprot | predicted annotation on them were not added to this track. 3462,3463c3341 < uniprot | For consistency and convenience for users of mutation-related tracks, < uniprot | the subtrack "UniProt/SwissProt Variants" is a copy of the track --- > uniprot | For consistency, the subtrack "UniProt/SwissProt Variants" is a copy of the track 3500c3378 < uniprot | annotations mapped through different protein sequence alignments but with the same genome --- > uniprot | annotations mapped through different transcripts but with the same genome 3503,3508c3381 < uniprot |

    On the configuration page of this track, you can choose to hide any TrEMBL annotations. < uniprot | This filter will also hide the UniProt alternative isoform protein sequences because < uniprot | both types of information are less relevant to most users. Please contact us if you < uniprot | want more detailed filtering features.

    < uniprot | < uniprot |

    Note that for the human hg38 assembly and SwissProt annotations, there --- > uniprot |

    Note that only for the human hg38 assembly and SwissProt annotations, there 3510c3383 < uniprot | href="hgTracks?db=hg38&hubUrl=https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/genome_annotation_tracks/UP000005640_9606_hub/hub.txt" target=_blank>public --- > uniprot | href="hgTracks?db=hg38&hubUrl=ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/genome_annotation_tracks/UP000005640_9606_hub/hub.txt">public 3514,3516c3387 < uniprot | for a given protein. For proteins that differ from the genome, UniProt's mapping method < uniprot | will, in most cases, map a protein and its annotations to an unexpected location < uniprot | (see below for details on UCSC's mapping method).

    --- > uniprot | for a given protein.

    3521,3579c3392,3394 < uniprot | Briefly, UniProt protein sequences were aligned to the transcripts associated < uniprot | with the protein, the top-scoring alignments were retained, and the result was < uniprot | projected to the genome through a transcript-to-genome alignment. < uniprot | Depending on the genome, the transcript-genome alignment was either < uniprot | provided by the source database (NCBI RefSeq), created at UCSC (UCSC RefSeq) or < uniprot | derived from the transcripts (Ensembl/Augustus). The transcript set is NCBI < uniprot | RefSeq for hg38 and UCSC RefSeq for hg19 (due to alt/fix haplotype misplacements < uniprot | in the NCBI RefSeq set on hg19). For other genomes, RefSeq, Ensembl and Augustus < uniprot | are tried, in this order. The resulting protein-genome alignments of this process < uniprot | are available in the file formats for liftOver or pslMap from our data archive < uniprot | (see "Data Access" section below). < uniprot |

    < uniprot | < uniprot |

    An important step of the mapping process is filtering the alignment from < uniprot | protein to transcript. Due to differences between the UniProt proteins and the < uniprot | transcripts and the genome, the best matching transcript is not always the < uniprot | correct transcript. Therefore, for organisms that have a RefSeq transcript track, < uniprot | proteins are aligned only to the RefSeq transcripts that are annotated < uniprot | by UniProt for this protein. If no transcripts are annotated on the protein, or < uniprot | the annotated ones do not exist anymore, but an NCBI Gene ID is annotated, < uniprot | the RefSeq transcripts for the gene are used. If no NCBI Gene ID is annotated, < uniprot | then the best matching alignment is used. Only a handful of edge cases < uniprot | (pseudogenes, very recently added proteins) on hg38 remain where the < uniprot | global transcriptome-wide matches have to be used. The details page of the < uniprot | protein alignments shows the transcripts used for the mapping and how < uniprot | these transcripts were found. There can be multiple transcripts for one < uniprot | protein, as their coding sequences can be identical or several of them do < uniprot | not differ by more than 1% in alignment score. < uniprot |

    < uniprot | < uniprot |

    In other words, when an NCBI or UCSC RefSeq track is used for the mapping and to align a < uniprot | protein sequence to the correct transcript, we use a three-stage process: < uniprot |

      < uniprot |
    1. If UniProt has annotated a given RefSeq transcript for a given protein < uniprot | sequence, the protein is aligned to this transcript. Any difference in the < uniprot | version suffix is tolerated in this comparison. < uniprot |
    2. If no transcript is annotated or the transcript cannot be found in the < uniprot | NCBI/UCSC RefSeq track, the UniProt-annotated NCBI Gene ID is resolved to a < uniprot | set of NCBI RefSeq transcript IDs via the most current version of NCBI < uniprot | genes tables. Only the top match of the resulting alignments and all < uniprot | others within 1% of its score are used for the mapping. < uniprot |
    3. If no transcript can be found after step (2), the protein is aligned to all transcripts; < uniprot | the top match and all others within 1% of its score are used. < uniprot |
    < uniprot | < uniprot |
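
    The three stages listed above can be sketched in Python as follows. This is a simplified illustration, not the script used to build the track: the function names, the data layout (a list of track transcript IDs plus a dictionary of alignment scores) and the reading of "within 1% of its score" as `score >= 0.99 * best` are all assumptions made for the example.

```python
from typing import Dict, Iterable, List


def strip_version(accession: str) -> str:
    """Drop a trailing version suffix such as '.3' from an accession."""
    return accession.split(".", 1)[0]


def near_top(scores: Dict[str, float], tolerance: float = 0.01) -> List[str]:
    """Keep the top-scoring transcript and all others within 1% of its score."""
    if not scores:
        return []
    best = max(scores.values())
    return sorted(t for t, s in scores.items() if s >= best * (1.0 - tolerance))


def choose_transcripts(
    annotated: Iterable[str],         # transcripts UniProt annotates for the protein
    gene_transcripts: Iterable[str],  # transcripts resolved from the NCBI Gene ID
    track: Iterable[str],             # transcript IDs present in the RefSeq track
    scores: Dict[str, float],         # alignment score of the protein per transcript
) -> List[str]:
    track_by_base = {strip_version(t): t for t in track}

    # Stage 1: use the UniProt-annotated transcripts, ignoring version suffixes.
    hits = [track_by_base[strip_version(a)] for a in annotated
            if strip_version(a) in track_by_base]
    if hits:
        return sorted(hits)

    # Stage 2: fall back to the transcripts of the annotated gene,
    # keeping the best alignment and those within 1% of it.
    gene_bases = {strip_version(g) for g in gene_transcripts}
    candidates = {t: scores[t] for base, t in track_by_base.items()
                  if base in gene_bases and t in scores}
    if candidates:
        return near_top(candidates)

    # Stage 3: consider every aligned transcript with the same 1% rule.
    return near_top(scores)
```

    With hypothetical scores of 1000, 995 and 900 for three transcripts, stage 3 would keep the first two, since 995 is within 1% of the top score while 900 is not.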

    This system was designed to resolve the problem of incorrect mappings of < uniprot | proteins, mostly on hg38, due to differences between the SwissProt < uniprot | sequences and the genome reference sequence, which has changed since the < uniprot | proteins were defined. The problem is most pronounced for gene families < uniprot | composed of either very repetitive or very similar proteins. To make sure that < uniprot | the alignments always go to the best chromosome location, all _alt and _fix < uniprot | reference patch sequences are ignored for the alignment, so the patches are < uniprot | entirely free of UniProt annotations. Please contact us if you have feedback on < uniprot | this process or example edge cases. We are not aware of a way to evaluate the < uniprot | results completely and in an automated manner.

    < uniprot |

    < uniprot | Proteins were aligned to transcripts with TBLASTN, converted to PSL, filtered < uniprot | with pslReps (93% query coverage, keep alignments within top 1% score), lifted to genome < uniprot | positions with pslMap and filtered again with pslReps. UniProt annotations were --- > uniprot | UniProt sequences were aligned to one of UCSC, Gencode, Ensembl or Augustus transcript sequences, first with > uniprot | BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted > uniprot | to genome positions with pslMap and filtered again. UniProt annotations were 3581c3396 < uniprot | genome through the alignment described above using the pslMap program. This approach --- > uniprot | genome through the alignment using the pslMap program. This mapping approach 3583,3586c3398,3402 < uniprot | TARGET="_BLANK">LS-SNP pipeline by Mark Diekhans. < uniprot | Like all Genome Browser source code, the main script used to build this track < uniprot | can be found on Github. --- > uniprot | TARGET="_BLANK">LS-SNP pipeline by Mark Diekhans. For human and mouse, the > uniprot | alignments were filtered by retaining only proteins annotated with > uniprot | a given transcript in the Genome Browser table kgXref. Like all Genome Browser > uniprot | source code, the main script used to build this track can be found on > uniprot | github. 3589,3612d3404 < uniprot |

    Automated data updates and release history

    < uniprot |

    < uniprot | This track is automatically updated on an ongoing basis, every 2-3 months. < uniprot | The current version is always shown on the track details page; it includes the < uniprot | release of UniProt, the version of the transcript set and a unique MD5 that is < uniprot | based on the protein sequences, the transcript sequences, the mapping file < uniprot | between both and the transcript-genome alignment. The exact transcript < uniprot | that was used for the alignment is shown when clicking a protein alignment < uniprot | in one of the two alignment tracks. < uniprot |
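
    One way to derive a single MD5 from several inputs, as described above, is sketched below. How the track's MD5 is actually computed is not stated here; the function name `combined_md5` and the length-prefix scheme are assumptions made for illustration.

```python
import hashlib
from typing import Iterable


def combined_md5(parts: Iterable[bytes]) -> str:
    """Hash several inputs into one MD5 hex digest.

    Each part is prefixed with its length so that moving bytes from one
    part to the next changes the digest.
    """
    digest = hashlib.md5()
    for part in parts:
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


if __name__ == "__main__":
    # Hypothetical stand-ins for the four inputs named in the text.
    inputs = [b"protein sequences", b"transcript sequences",
              b"mapping file", b"transcript-genome alignment"]
    print(combined_md5(inputs))
```

    Any change to one of the inputs yields a different digest, which is what makes such a value usable as a version identifier.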

    < uniprot | < uniprot |

    < uniprot | For reproducibility of older analysis results, previous versions of this track < uniprot | are available for browsing in the form of the UCSC UniProt Archive Track Hub. The underlying data of < uniprot | all releases of this track (past and current) can be obtained from our downloads server, including the UniProt < uniprot | protein-to-genome alignment. The files are available in the formats of the < uniprot | command line programs liftOver and pslMap, which can be used to map < uniprot | coordinates on protein sequences to genome coordinates. The filenames are < uniprot | unipToGenome.over.chain.gz (liftOver) and unipToGenomeLift.psl.gz (pslMap).

    < uniprot | 3616c3408 < uniprot | The raw data of the current track can be explored interactively with the --- > uniprot | The raw data can be explored interactively with the 3623c3415 < uniprot | track configuration file. --- > uniprot | track configuration file. 3629c3421 < uniprot |

    --- > uniprot |
    3631c3423,3428 < uniprot |

    --- > uniprot |
    > uniprot | This track is updated every month. The MySQL table hgFixed.trackVersion > uniprot | contains the name of the currently available data on the website. Older > uniprot | versions of the data files can be downloaded from the uniprot | href="http://hgdownload.soe.ucsc.edu/goldenPath/ochPri3/archive/">archive > uniprot | folder of our downloads server.
    3639,3640d3435 < uniprot |

    < uniprot | 3644,3646c3439,3441 < uniprot | This track was created by Maximilian Haeussler at UCSC, with a lot of input from Chris < uniprot | Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff, Alejo < uniprot | Mujica, Regeneron Pharmaceuticals and Pia Riestra, GeneDx. Thanks to UniProt for making all data --- > uniprot | This track was created by Maximilian Haeussler at UCSC, with help from Chris > uniprot | Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Alejo > uniprot | Mujica, Regeneron Pharmaceuticals. Thanks to UniProt for making all data