8c4eff5f7832be1b022735416a2bb1d4c7ad42be gperez2 Wed Sep 14 13:17:02 2022 -0700 Releasing the UniProt bugs update to the RR, refs #28560 diff --git src/hg/makeDb/trackDb/uniprotAlpha.html src/hg/makeDb/trackDb/uniprotAlpha.html deleted file mode 100644 index 7ca655c..0000000 --- src/hg/makeDb/trackDb/uniprotAlpha.html +++ /dev/null @@ -1,302 +0,0 @@ -
-This track shows protein sequences and annotations on them from the UniProt/SwissProt database, -mapped to genomic coordinates. -
--UniProt/SwissProt data has been curated from scientific publications by the UniProt staff; -UniProt/TrEMBL data has been predicted by various computational algorithms. -The annotations are divided into multiple subtracks, based on their "feature type" in UniProt. -The first two subtracks below - one for SwissProt, one for TrEMBL - show the -alignments of protein sequences to the genome; all other tracks below are the protein annotations -mapped through these alignments to the genome. -
- -Track Name | -Description | -
---|---|
UCSC Alignment, SwissProt = curated protein sequences | -Protein sequences from SwissProt mapped to the genome. All other - tracks are (start,end) SwissProt annotations on these sequences mapped - through this alignment. Even protein sequences without a single curated - annotation (splice isoforms) are visible in this track. Each UniProt protein - has one main isoform, which is colored in dark. Alternative isoforms are - sequences that do not have annotations on them and are colored in light-blue. - They can be hidden with the TrEMBL/Isoform filter (see below). |
UCSC Alignment, TrEMBL = predicted protein sequences | -Protein sequences from TrEMBL mapped to the genome. All other tracks - below are (start,end) TrEMBL annotations mapped to the genome using - this track. This track is hidden by default. To show it, click its - checkbox on the track configuration page. |
UniProt Signal Peptides | -Regions found in proteins destined to be secreted, generally cleaved from mature protein. | -
UniProt Extracellular Domains | -Protein domains with the comment "Extracellular". | -
UniProt Transmembrane Domains | -Protein domains of the type "Transmembrane". | -
UniProt Cytoplasmic Domains | -Protein domains with the comment "Cytoplasmic". | -
UniProt Polypeptide Chains | -Polypeptide chain in mature protein after post-processing. | -
UniProt Regions of Interest | -Regions that have been experimentally defined, such as the role of a region in mediating protein-protein interactions or some other biological process. | -
UniProt Domains | -Protein domains, zinc finger regions and topological domains. | -
UniProt Disulfide Bonds | -Disulfide bonds. | -
UniProt Amino Acid Modifications | -Glycosylation sites, modified residues and lipid moiety-binding regions. | -
UniProt Amino Acid Mutations | -Mutagenesis sites and sequence variants. | -
UniProt Protein Primary/Secondary Structure Annotations | -Beta strands, helices, coiled-coil regions and turns. | -
UniProt Sequence Conflicts | -Differences between Genbank sequences and the UniProt sequence. | -
UniProt Repeats | -Regions of repeated sequence motifs or repeated domains. | -
UniProt Other Annotations | -All other annotations, e.g. compositional bias | -
-For consistency and convenience for users of mutation-related tracks, -the subtrack "UniProt/SwissProt Variants" is a copy of the track -"UniProt Variants" in the track group "Phenotype and Literature", or -"Variation and Repeats", depending on the assembly. -
- --Genomic locations of UniProt/SwissProt annotations are labeled with a short name for -the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" -etc.). Clicking one shows the full annotation and provides a link to the UniProt/SwissProt -record for more details. TrEMBL annotations are always shown in -light blue, except in the Signal Peptides, -Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks.
- --Mouse over a feature to see the full UniProt annotation comment. For variants, the mouse over will -show the full name of the UniProt disease acronym. -
- --The subtracks for domains related to subcellular location are sorted from outside to inside of -the cell: Signal peptide, -extracellular, -transmembrane, and cytoplasmic. -
- --In the "UniProt Modifications" track, lipidation sites are highlighted in -dark blue, glycosylation sites in -dark green, and phosphorylation sites in -light green.
- --Duplicate annotations are removed as far as possible: if a TrEMBL annotation -has the same genome position and same feature type, comment, disease and -mutated amino acids as a SwissProt annotation, it is not shown again. Two -annotations mapped through different protein sequence alignments but with the same genome -coordinates are only shown once.
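The duplicate-removal rule above can be illustrated with a minimal Python sketch. It is not the code used to build the track; the `Annotation` record and its field names are hypothetical and chosen only to mirror the criteria named in the text (genome position, feature type, comment, disease and mutated amino acids).

```python
from typing import List, NamedTuple, Optional


class Annotation(NamedTuple):
    """Hypothetical record for one annotation already mapped to the genome."""
    source: str                 # "SwissProt" or "TrEMBL"
    chrom: str
    start: int
    end: int
    feature_type: str
    comment: str
    disease: Optional[str]
    mutated_aa: Optional[str]
    protein_acc: str            # not part of the key, so the alignment used does not matter


def dedup_key(a: Annotation) -> tuple:
    """Fields that must all match for two annotations to count as duplicates."""
    return (a.chrom, a.start, a.end, a.feature_type,
            a.comment, a.disease, a.mutated_aa)


def remove_duplicates(annotations: List[Annotation]) -> List[Annotation]:
    """Keep one annotation per key, preferring SwissProt over TrEMBL."""
    # sorted() is stable: SwissProt entries move to the front, input order is otherwise kept.
    ordered = sorted(annotations, key=lambda a: a.source != "SwissProt")
    seen = set()
    kept = []
    for a in ordered:
        key = dedup_key(a)
        if key not in seen:
            seen.add(key)
            kept.append(a)
    return kept
```

Because the protein accession is left out of the key, two annotations that reach the same genome coordinates through different protein alignments collapse into one entry in this sketch.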
- -On the configuration page of this track, you can choose to hide any TrEMBL annotations. -This filter will also hide the UniProt alternative isoform protein sequences because -both types of information are less relevant to most users. Please contact us if you -want more detailed filtering features.
- -Note that for the human hg38 assembly and SwissProt annotations, there -also is a public -track hub prepared by UniProt itself, with -genome annotations maintained by UniProt using their own mapping -method based on those Gencode/Ensembl gene models that are annotated in UniProt -for a given protein. For proteins that differ from the genome, UniProt's mapping method -will, in most cases, map a protein and its annotations to an unexpected location -(see below for details on UCSC's mapping method).
- --Briefly, UniProt protein sequences were aligned to the transcripts associated -with the protein, the top-scoring alignments were retained, and the result was -projected to the genome through a transcript-to-genome alignment. -Depending on the genome, the transcript-genome alignment was either -provided by the source database (NCBI RefSeq), created at UCSC (UCSC RefSeq) or -derived from the transcripts (Ensembl/Augustus). The transcript set is NCBI -RefSeq for hg38 and UCSC RefSeq for hg19 (due to alt/fix haplotype misplacements -in the NCBI RefSeq set on hg19). For other genomes, RefSeq, Ensembl and Augustus -are tried, in this order. The resulting protein-genome alignments of this process -are available in the file formats for liftOver or pslMap from our data archive -(see "Data Access" section below). -
- -An important step of the mapping process is filtering the alignment from -protein to transcript. Due to differences between the UniProt proteins and the -transcripts and the genome, the best matching transcript is not always the -correct transcript. Therefore, for organisms that have a RefSeq transcript track, -proteins are aligned only to the RefSeq transcripts that are annotated -by UniProt for this protein. If no transcripts are annotated on the protein, or -the annotated ones no longer exist, but an NCBI Gene ID is annotated, -the RefSeq transcripts for the gene are used. If no NCBI Gene is annotated, -then the best matching alignment is used. Only a handful of edge cases -(pseudogenes, very recently added proteins) on hg38 remain where the -global transcriptome-wide matches have to be used. The details page of the -protein alignments shows the transcripts used for the mapping and how -these transcripts were found. There can be multiple transcripts for one -protein, as their coding sequences can be identical or several of them do -not differ by more than 1% in alignment score. -
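- -In other words, when an NCBI or UCSC RefSeq track is used for the mapping and to align a -protein sequence to the correct transcript, we use a three-stage process: -(1) the protein is aligned to the RefSeq transcripts that UniProt annotates for it; -(2) if no transcripts are annotated, or the annotated ones no longer exist, the RefSeq -transcripts of the annotated NCBI Gene ID are used; -(3) if no NCBI Gene is annotated, the best matching alignment among all transcripts is used. -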
This system was designed to resolve the problem of incorrect mappings of -proteins, mostly on hg38, due to differences between the SwissProt -sequences and the genome reference sequence, which has changed since the -proteins were defined. The problem is most pronounced for gene families -composed of either very repetitive or very similar proteins. To make sure that -the alignments always go to the best chromosome location, all _alt and _fix -reference patch sequences are ignored for the alignment, so the patches are -entirely free of UniProt annotations. Please contact us if you have feedback on -this process or example edge cases. We are not aware of a way to evaluate the -results completely and in an automated manner.
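The transcript-selection fallback described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the track's build script: the function name, its arguments and the stage labels are hypothetical, and falling through to all transcripts when an annotated gene yields no usable transcript is an assumption of this sketch.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def select_transcripts(
    annotated_transcripts: Iterable[str],
    gene_id: Optional[str],
    refseq_ids: Set[str],
    gene_to_transcripts: Dict[str, List[str]],
) -> Tuple[str, List[str]]:
    """Return (stage label, transcript IDs the protein should be aligned to)."""
    # Stage 1: transcripts annotated by UniProt that still exist in the RefSeq track.
    present = [t for t in annotated_transcripts if t in refseq_ids]
    if present:
        return "annotated", present

    # Stage 2: RefSeq transcripts of the annotated NCBI Gene ID.
    if gene_id is not None:
        via_gene = [t for t in gene_to_transcripts.get(gene_id, []) if t in refseq_ids]
        if via_gene:
            return "gene", via_gene

    # Stage 3: no usable annotation, so every transcript is a candidate and
    # the best matching alignment decides afterwards.
    return "all", sorted(refseq_ids)
```

Returning the stage label alongside the candidates mirrors the statement that the details page reports how the transcripts were found.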
--Proteins were aligned to transcripts with TBLASTN, converted to PSL, filtered -with pslReps (93% query coverage, keep alignments within top 1% score), lifted to genome -positions with pslMap and filtered again with pslReps. UniProt annotations were -obtained from the UniProt XML file. The UniProt annotations were then mapped to the -genome through the alignment described above using the pslMap program. This approach -draws heavily on the LS-SNP pipeline by Mark Diekhans. -Like all Genome Browser source code, the main script used to build this track -can be found on GitHub. -
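The two filtering criteria quoted above (93% query coverage, keep alignments within 1% of the top score) can be illustrated with a short Python sketch. It is not a reimplementation of pslReps; the dictionary keys `query`, `score`, `q_aligned` and `q_size` are hypothetical stand-ins for values that would come from the PSL records.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def filter_alignments(
    alignments: Iterable[Dict],
    min_coverage: float = 0.93,
    near_best: float = 0.01,
) -> List[Dict]:
    """Keep well-covered alignments whose score is within near_best of the best per query."""
    by_query = defaultdict(list)
    for aln in alignments:
        # Coverage: fraction of the query (protein) that is aligned.
        if aln["q_aligned"] / aln["q_size"] >= min_coverage:
            by_query[aln["query"]].append(aln)

    kept = []
    for alns in by_query.values():
        best = max(a["score"] for a in alns)
        # Several transcripts can survive when their scores are nearly identical.
        kept.extend(a for a in alns if a["score"] >= best * (1 - near_best))
    return kept
```

This also shows why one protein can end up with more than one transcript: every alignment close enough to the best score is retained.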
- --This track is automatically updated on an ongoing basis, every 2-3 months. -The current version is always shown on the track details page; it includes the -release of UniProt, the version of the transcript set and a unique MD5 that is -based on the protein sequences, the transcript sequences, the mapping file -between both and the transcript-genome alignment. The exact transcript -that was used for the alignment is shown when clicking a protein alignment -in one of the two alignment tracks. -
- --For reproducibility of older analysis results, previous versions of this track -are available for browsing in the form of the UCSC UniProt Archive Track Hub. The underlying data of -all releases of this track (past and current) can be obtained from our downloads server, including the UniProt -protein-to-genome alignment. The alignment is provided in the file formats of the -command-line programs liftOver and pslMap, which can be used to map -coordinates on protein sequences to genome coordinates. The filenames are -unipToGenome.over.chain.gz (liftOver) and unipToGenomeLift.psl.gz (pslMap).
- --The raw data of the current track can be explored interactively with the -Table Browser, or the -Data Integrator. -For automated analysis, the genome annotation is stored in a bigBed file that -can be downloaded from the -download server. -The exact filenames can be found in the -track configuration file. -Annotations can be converted to ASCII text by our tool bigBedToBed -which can be compiled from the source code or downloaded as a precompiled -binary for your system. Instructions for downloading source code and binaries can be found -here. -The tool can also be used to obtain only features within a given range, for example: -
-bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/$db/uniprot/unipStruct.bb -chrom=chr6 -start=0 -end=1000000 stdout -
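For scripted access, the command above can be wrapped in a small Python helper. This is a sketch under assumptions: the function names are hypothetical, bigBedToBed is assumed to be installed and on the PATH, and only the options that appear in the command above are used.

```python
import subprocess
from typing import List


def bigbed_region_command(db: str, chrom: str, start: int, end: int,
                          bb_name: str = "unipStruct.bb") -> List[str]:
    """Build the bigBedToBed argument list for one genomic range."""
    url = f"http://hgdownload.soe.ucsc.edu/gbdb/{db}/uniprot/{bb_name}"
    return ["bigBedToBed", url,
            f"-chrom={chrom}", f"-start={start}", f"-end={end}", "stdout"]


def fetch_features(db: str, chrom: str, start: int, end: int) -> List[List[str]]:
    """Run bigBedToBed and return the BED rows as lists of tab-separated fields."""
    result = subprocess.run(
        bigbed_region_command(db, chrom, start, end),
        check=True, capture_output=True, text=True,
    )
    return [line.split("\t") for line in result.stdout.splitlines()]
```

For example, `fetch_features("hg38", "chr6", 0, 1000000)` would run the same query as the command above with `$db` set to hg38.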
-Please refer to our -mailing list archives -for questions, or our -Data Access FAQ -for more information. - - -- -
-This track was created by Maximilian Haeussler at UCSC, with a lot of input from Chris -Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff, Alejo -Mujica, Regeneron Pharmaceuticals and Pia Riestra, GeneDx. Thanks to UniProt for making all data -available for download. -
- --UniProt Consortium. - -Reorganizing the protein space at the Universal Protein Resource (UniProt). -Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. -PMID: 22102590; PMC: PMC3245120 -
- --Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. - -The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure -information on human protein variants. -Hum Mutat. 2004 May;23(5):464-70. -PMID: 15108278 -