--------------------------------------------------------------- cb3.trackDb.html : Differences exist between hgwbeta and hgw2 (RR fields taken from public MySql server, not individual machine) 1768c1768 < uniprot | Protein sequences from SwissProt mapped to the genome. All other --- > uniprot | Protein sequences from SwissProt mapped onto the genome. All other 1770,1774c1770,1771 < uniprot | through this alignment. Even protein sequences without a single curated < uniprot | annotation (splice isoforms) are visible in this track. Each UniProt protein < uniprot | has one main isoform, which is colored in dark. Alternative isoforms are < uniprot | sequences that do not have annotations on them and are colored in light-blue. < uniprot | They can be hidden with the TrEMBL/Isoform filter (see below). --- > uniprot | using this track. Protein sequences without a single curated > uniprot | annotation were not added to this track. 1777c1774 < uniprot | Protein sequences from TrEMBL mapped to the genome. All other tracks --- > uniprot | Protein sequences from TrEMBL mapped onto the genome. All other tracks 1780c1777,1778 < uniprot | checkbox on the track configuration page. --- > uniprot | checkbox on the track configuration page. Protein sequences without a single > uniprot | predicted annotation on them were not added to this track. 1839,1840c1837 < uniprot | For consistency and convenience for users of mutation-related tracks, < uniprot | the subtrack "UniProt/SwissProt Variants" is a copy of the track --- > uniprot | For consistency, the subtrack "UniProt/SwissProt Variants" is a copy of the track 1877c1874 < uniprot | annotations mapped through different protein sequence alignments but with the same genome --- > uniprot | annotations mapped through different transcripts but with the same genome 1880,1885c1877 < uniprot |

On the configuration page of this track, you can choose to hide any TrEMBL annotations. < uniprot | This filter will also hide the UniProt alternative isoform protein sequences because < uniprot | both types of information are less relevant to most users. Please contact us if you < uniprot | want more detailed filtering features.

< uniprot | < uniprot |

Note that for the human hg38 assembly and SwissProt annotations, there --- > uniprot |

Note that only for the human hg38 assembly and SwissProt annotations, there 1887c1879 < uniprot | href="hgTracks?db=hg38&hubUrl=https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/genome_annotation_tracks/UP000005640_9606_hub/hub.txt" target=_blank>public --- > uniprot | href="hgTracks?db=hg38&hubUrl=ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/genome_annotation_tracks/UP000005640_9606_hub/hub.txt">public 1891,1893c1883 < uniprot | for a given protein. For proteins that differ from the genome, UniProt's mapping method < uniprot | will, in most cases, map a protein and its annotations to an unexpected location < uniprot | (see below for details on UCSC's mapping method).

--- > uniprot | for a given protein.

1898,1956c1888,1890 < uniprot | Briefly, UniProt protein sequences were aligned to the transcripts associated < uniprot | with the protein, the top-scoring alignments were retained, and the result was < uniprot | projected to the genome through a transcript-to-genome alignment. < uniprot | Depending on the genome, the transcript-genome alignment was either < uniprot | provided by the source database (NCBI RefSeq), created at UCSC (UCSC RefSeq) or < uniprot | derived from the transcripts (Ensembl/Augustus). The transcript set is NCBI < uniprot | RefSeq for hg38, UCSC RefSeq for hg19 (due to alt/fix haplotype misplacements < uniprot | in the NCBI RefSeq set on hg19). For other genomes, RefSeq, Ensembl and Augustus < uniprot | are tried, in this order. The resulting protein-genome alignments of this process < uniprot | are available in the file formats for liftOver or pslMap from our data archive < uniprot | (see "Data Access" section below). < uniprot |

< uniprot | < uniprot |

An important step of the mapping process is filtering the alignment from < uniprot | protein to transcript. Due to differences between the UniProt proteins and the < uniprot | transcripts and the genome, the best matching transcript is not always the < uniprot | correct transcript. Therefore, for organisms that have a RefSeq transcript track, < uniprot | proteins are aligned only to the RefSeq transcripts that are annotated < uniprot | by UniProt for this protein. If no transcripts are annotated on the protein, or < uniprot | the annotated ones do not exist anymore, but an NCBI Gene ID is annotated, < uniprot | the RefSeq transcripts for the gene are used. If no NCBI Gene ID is annotated, < uniprot | then the best matching alignment is used. Only a handful of edge cases < uniprot | (pseudogenes, very recently added proteins) on hg38 remain where the < uniprot | global transcriptome-wide matches have to be used. The details page of the < uniprot | protein alignments shows the transcripts used for the mapping and how < uniprot | these transcripts were found. There can be multiple transcripts for one < uniprot | protein, as their coding sequences can be identical or several of them do < uniprot | not differ by more than 1% in alignment score. < uniprot |

< uniprot | < uniprot |

In other words, when an NCBI or UCSC RefSeq track is used for the mapping and to align a < uniprot | protein sequence to the correct transcript, we use a three stage process: < uniprot |

    < uniprot |
  1. If UniProt has annotated a given RefSeq transcript for a given protein < uniprot | sequence, the protein is aligned to this transcript. Any difference in the < uniprot | version suffix is tolerated in this comparison. < uniprot |
  2. If no transcript is annotated or the transcript cannot be found in the < uniprot | NCBI/UCSC RefSeq track, the UniProt-annotated NCBI Gene ID is resolved to a < uniprot | set of NCBI RefSeq transcript IDs via the most current version of NCBI < uniprot | genes tables. Only the top match of the resulting alignments and all < uniprot | others within 1% of its score are used for the mapping. < uniprot |
  3. If no transcript can be found after step (2), the protein is aligned to all transcripts, < uniprot | the top match, and all others within 1% of its score are used. < uniprot |
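The three stages listed above can be sketched in Python. This is an illustrative sketch only, not the script used to build the track. The function names, the input shapes (lists of accessions and a dict of alignment scores) and the reading of "within 1% of its score" as `score >= 0.99 * best` are all assumptions made for the example.

```python
def strip_version(acc):
    """Drop a version suffix so 'NM_1.1' and 'NM_1.2' compare equal."""
    return acc.split(".", 1)[0]


def within_top_fraction(scores, fraction=0.01):
    """Keep the top-scoring transcript and all others within `fraction` of it.

    Assumption: 'within 1%' is interpreted as score >= best * (1 - fraction).
    """
    if not scores:
        return []
    cutoff = max(scores.values()) * (1.0 - fraction)
    return sorted(t for t, s in scores.items() if s >= cutoff)


def choose_transcripts(annotated, gene_transcripts, track_transcripts, scores):
    """Return (stage, transcripts) for one protein (hypothetical helper).

    annotated         -- RefSeq accessions annotated by UniProt for the protein
    gene_transcripts  -- accessions resolved from the annotated NCBI Gene ID
    track_transcripts -- accessions present in the RefSeq track
    scores            -- alignment score of the protein against each transcript
    """
    by_base = {strip_version(t): t for t in track_transcripts}

    # Stage 1: use annotated transcripts, tolerating version differences.
    hits = {by_base[strip_version(a)] for a in annotated
            if strip_version(a) in by_base}
    if hits:
        return "annotated", sorted(hits)

    # Stage 2: fall back to the transcripts of the annotated gene,
    # keeping the top match and those within 1% of its score.
    candidates = {by_base[strip_version(t)] for t in gene_transcripts
                  if strip_version(t) in by_base}
    kept = within_top_fraction({t: scores[t] for t in candidates if t in scores})
    if kept:
        return "gene", kept

    # Stage 3: global search over all transcripts with the same 1% rule.
    return "global", within_top_fraction(scores)


# Made-up accessions and scores, for illustration only.
track = ["NM_1.2", "NM_2.1", "NM_3.1"]
scores = {"NM_1.2": 1000, "NM_2.1": 995, "NM_3.1": 900}
print(choose_transcripts(["NM_1.1"], [], track, scores))
print(choose_transcripts([], ["NM_2", "NM_3"], track, scores))
print(choose_transcripts([], [], track, scores))
```

Returning the stage name alongside the transcripts mirrors the statement that the details page shows how the transcripts were found.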
< uniprot | < uniprot |

This system was designed to resolve the problem of incorrect mappings of < uniprot | proteins, mostly on hg38, due to differences between the SwissProt < uniprot | sequences and the genome reference sequence, which has changed since the < uniprot | proteins were defined. The problem is most pronounced for gene families < uniprot | composed of either very repetitive or very similar proteins. To make sure that < uniprot | the alignments always go to the best chromosome location, all _alt and _fix < uniprot | reference patch sequences are ignored for the alignment, so the patches are < uniprot | entirely free of UniProt annotations. Please contact us if you have feedback on < uniprot | this process or example edge cases. We are not aware of a way to evaluate the < uniprot | results completely and in an automated manner.

< uniprot |

< uniprot | Proteins were aligned to transcripts with TBLASTN, converted to PSL, filtered < uniprot | with pslReps (93% query coverage, keep alignments within top 1% score), lifted to genome < uniprot | positions with pslMap and filtered again with pslReps. UniProt annotations were --- > uniprot | UniProt sequences were aligned to one of UCSC, Gencode, Ensembl or Augustus transcript sequences, first with > uniprot | BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted > uniprot | to genome positions with pslMap and filtered again. UniProt annotations were 1958c1892 < uniprot | genome through the alignment described above using the pslMap program. This approach --- > uniprot | genome through the alignment using the pslMap program. This mapping approach 1960,1963c1894,1898 < uniprot | TARGET="_BLANK">LS-SNP pipeline by Mark Diekhans. < uniprot | Like all Genome Browser source code, the main script used to build this track < uniprot | can be found on Github. --- > uniprot | TARGET="_BLANK">LS-SNP pipeline by Mark Diekhans. For human and mouse, the > uniprot | alignments were filtered by retaining only proteins annotated with > uniprot | a given transcript in the Genome Browser table kgXref. Like all Genome Browser > uniprot | source code, the main script used to build this track can be found on > uniprot | github. 1966,1989d1900 < uniprot |

Automated data updates and release history

< uniprot |

< uniprot | This track is automatically updated on an ongoing basis, every 2-3 months. < uniprot | The current version is always shown on the track details page, it includes the < uniprot | release of UniProt, the version of the transcript set and a unique MD5 that is < uniprot | based on the protein sequences, the transcript sequences, the mapping file < uniprot | between both and the transcript-genome alignment. The exact transcript < uniprot | that was used for the alignment is shown when clicking a protein alignment < uniprot | in one of the two alignment tracks. < uniprot |

< uniprot | < uniprot |

< uniprot | For reproducibility of older analysis results, previous versions of this track < uniprot | are available for browsing in the form of the UCSC UniProt Archive Track Hub. The underlying data of < uniprot | all releases of this track (past and current) can be obtained from our downloads server, including the UniProt < uniprot | protein-to-genome alignment. The file formats available are in the < uniprot | command line programs liftOver or pslMap, which can be used to map < uniprot | coordinates on protein sequences to genome coordinates. The filenames are < uniprot | unipToGenome.over.chain.gz (liftOver) and unipToGenomeLift.psl.gz (pslMap).

< uniprot | 1993c1904 < uniprot | The raw data of the current track can be explored interactively with the --- > uniprot | The raw data can be explored interactively with the 2000c1911 < uniprot | track configuration file. --- > uniprot | track configuration file. 2006c1917 < uniprot |

--- > uniprot |
2008c1919,1924 < uniprot |

--- > uniprot |
> uniprot | This track is updated every month. The MySQL table hgFixed.trackVersion > uniprot | contains the name of the currently available data on the website. Older > uniprot | versions of the data files can be downloaded from the uniprot | href="http://hgdownload.soe.ucsc.edu/goldenPath/cb3/archive/">archive > uniprot | folder of our downloads server.
2016,2017d1931 < uniprot |

< uniprot | 2021,2023c1935,1937 < uniprot | This track was created by Maximilian Haeussler at UCSC, with a lot of input from Chris < uniprot | Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff, Alejo < uniprot | Mujica, Regeneron Pharmaceuticals and Pia Riestra, GeneDx. Thanks to UniProt for making all data --- > uniprot | This track was created by Maximilian Haeussler at UCSC, with help from Chris > uniprot | Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff and Alejo > uniprot | Mujica, Regeneron Pharmaceuticals. Thanks to UniProt for making all data