1a36012f9f2eb9d087e503b1fd2c37f3a9adedfd max Fri Nov 26 08:11:01 2021 -0800 A major update for the UniProt otto job, refs #28560 diff --git src/hg/makeDb/trackDb/uniprotAlpha.html src/hg/makeDb/trackDb/uniprotAlpha.html new file mode 100644 index 0000000..6b77f0c --- /dev/null +++ src/hg/makeDb/trackDb/uniprotAlpha.html @@ -0,0 +1,264 @@ +
+This track shows protein sequences and annotations on them from the UniProt/SwissProt database, +mapped to genomic coordinates. +
++UniProt/SwissProt data has been curated from scientific publications by the UniProt staff, +whereas UniProt/TrEMBL data has been predicted by various computational algorithms. +The annotations are divided into multiple subtracks, based on their "feature type" in UniProt. +The first two subtracks below (one for SwissProt, one for TrEMBL) show the +alignments of protein sequences to the genome; all other tracks below are the protein annotations +mapped through these alignments to the genome. +
Track Name | Description |
---|---|
UCSC Alignment, SwissProt = curated protein sequences | Protein sequences from SwissProt mapped onto the genome. All other tracks are (start,end) SwissProt annotations on these sequences mapped using this track. Protein sequences without a single curated annotation were not added to this track. |
UCSC Alignment, TrEMBL = predicted protein sequences | Protein sequences from TrEMBL mapped onto the genome. All other tracks below are (start,end) TrEMBL annotations mapped to the genome using this track. This track is hidden by default. To show it, click its checkbox on the track configuration page. Protein sequences without a single predicted annotation on them were not added to this track. |
UniProt Signal Peptides | Regions found in proteins destined to be secreted, generally cleaved from mature protein. |
UniProt Extracellular Domains | Protein domains with the comment "Extracellular". |
UniProt Transmembrane Domains | Protein domains of the type "Transmembrane". |
UniProt Cytoplasmic Domains | Protein domains with the comment "Cytoplasmic". |
UniProt Polypeptide Chains | Polypeptide chain in mature protein after post-processing. |
UniProt Regions of Interest | Regions that have been experimentally defined, such as the role of a region in mediating protein-protein interactions or some other biological process. |
UniProt Domains | Protein domains, zinc finger regions and topological domains. |
UniProt Disulfide Bonds | Disulfide bonds. |
UniProt Amino Acid Modifications | Glycosylation sites, modified residues and lipid moiety-binding regions. |
UniProt Amino Acid Mutations | Mutagenesis sites and sequence variants. |
UniProt Protein Primary/Secondary Structure Annotations | Beta strands, helices, coiled-coil regions and turns. |
UniProt Sequence Conflicts | Differences between Genbank sequences and the UniProt sequence. |
UniProt Repeats | Regions of repeated sequence motifs or repeated domains. |
UniProt Other Annotations | All other annotations, e.g. compositional bias |
+For consistency, the subtrack "UniProt/SwissProt Variants" is a copy of the track +"UniProt Variants" in the track group "Phenotype and Literature", or +"Variation and Repeats", depending on the assembly. +
+ ++Genomic locations of UniProt/SwissProt annotations are labeled with a short name for +the type of annotation (e.g. "glyco", "disulf bond", "Signal peptide" +etc.). A click on them shows the full annotation and provides a link to the UniProt/SwissProt +record for more details. TrEMBL annotations are always shown in +light blue, except in the Signal Peptides, +Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks.
+ ++Mouse over a feature to see the full UniProt annotation comment. For variants, the mouse over will +show the full name of the UniProt disease acronym. +
+ ++The subtracks for domains related to subcellular location are sorted from outside to inside of +the cell: Signal peptide, +extracellular, +transmembrane, and cytoplasmic. +
+ ++In the "UniProt Modifications" track, lipidation sites are highlighted in +dark blue, glycosylation sites in +dark green, and phosphorylation sites in +light green.
+ ++Duplicate annotations are removed as far as possible: if a TrEMBL annotation +has the same genome position and same feature type, comment, disease and +mutated amino acids as a SwissProt annotation, it is not shown again. Two +annotations mapped through different transcripts but with the same genome +coordinates are only shown once.
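The duplicate-removal rule above can be sketched as follows. This is a minimal illustration only, not the track's build script: the `Annotation` record, its field names and the `remove_duplicates` helper are hypothetical, and the sketch assumes that "same annotation" means equality on genome position, feature type, comment, disease and mutated amino acids.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    # Hypothetical record layout, chosen for this illustration only.
    chrom: str
    start: int
    end: int
    feature_type: str
    comment: str
    disease: str
    mutation: str    # mutated amino acids; empty for non-variant features
    source: str      # "SwissProt" or "TrEMBL"
    transcript: str  # transcript the annotation was mapped through


def dedup_key(a: Annotation) -> tuple:
    # Source database and transcript are deliberately left out of the key,
    # so copies that differ only in those fields collapse into one.
    return (a.chrom, a.start, a.end, a.feature_type,
            a.comment, a.disease, a.mutation)


def remove_duplicates(annotations: List[Annotation]) -> List[Annotation]:
    # Stable sort puts SwissProt first, so the curated copy is the one kept.
    ordered = sorted(annotations, key=lambda a: a.source != "SwissProt")
    seen = set()
    kept = []
    for a in ordered:
        key = dedup_key(a)
        if key in seen:
            continue
        seen.add(key)
        kept.append(a)
    return kept
```

Because the key ignores the transcript, two annotations mapped through different transcripts to the same genome coordinates are also reduced to a single entry.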
+ +Note that for the human hg38 assembly and SwissProt annotations only, there +is also a public +track hub prepared by UniProt itself, with +genome annotations maintained by UniProt using their own mapping +method, which is based on the Gencode/Ensembl gene models that are annotated in UniProt +for a given protein.
+ ++Briefly, UniProt protein sequences were aligned to the transcripts associated +with the protein, the top-scoring alignments were retained, and the result was +projected to the genome through a transcript-to-genome alignment. +Depending on the genome, the transcript-genome alignments were either +provided by the source database (NCBI RefSeq), created at UCSC (UCSC RefSeq) or +derived from the transcripts (Ensembl/Augustus). The transcript set is NCBI +RefSeq for hg38 and UCSC RefSeq for hg19 (due to alt/fix haplotype misplacements +in the NCBI RefSeq set on hg19). For other genomes, RefSeq, Ensembl and Augustus +are tried, in this order. The resulting protein-genome alignments of this process +are available in the file formats for liftOver or pslMap from our data archive +(see "Data Access" section below). +
+ +An important step of the mapping process is filtering the alignment from +protein to transcript. Due to differences between the UniProt proteins and the +transcripts and the genome, the best matching transcript is not always the +correct transcript. Therefore, at least when the transcript model is RefSeq, +proteins are only aligned to the RefSeq transcripts that are annotated by +UniProt for this protein (RefSeq version suffixes are skipped). If no +transcripts are annotated on the protein, or the annotated ones are no longer +current, but an NCBI Gene ID is annotated, all RefSeq transcripts annotated to +this NCBI Gene ID are used. If no NCBI Gene ID is annotated, then the best +matching alignment is used. On hg38, only a handful of edge cases (pseudogenes, +very recently added proteins) remain where the best matches have to be used. +The details page of the protein alignments shows which transcripts were used +for the mapping and how these transcripts were found. There can be multiple +transcripts for one protein, as their coding sequences can be identical. +
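The fallback order described above can be sketched as a small function. This is an illustration under stated assumptions, not the code used to build the track: the function name, its arguments and the returned labels are hypothetical, and the sketch assumes that a Gene ID with no current transcripts also falls through to the best-match case, which the text does not spell out.

```python
from typing import Dict, Iterable, List, Tuple


def strip_version(accession: str) -> str:
    # "NM_000001.3" -> "NM_000001"; version suffixes are ignored when matching.
    return accession.split(".", 1)[0]


def candidate_transcripts(
    annotated_refseqs: Iterable[str],
    gene_ids: Iterable[str],
    current_transcripts: Iterable[str],
    gene_to_transcripts: Dict[str, List[str]],
) -> Tuple[List[str], str]:
    """Return (transcripts to align against, how they were found)."""
    current = {strip_version(t) for t in current_transcripts}

    # 1. RefSeq transcripts annotated by UniProt that are still current.
    direct = sorted({strip_version(t) for t in annotated_refseqs} & current)
    if direct:
        return direct, "annotated RefSeq transcripts"

    # 2. All current transcripts of the annotated NCBI Gene ID(s).
    via_gene = sorted(
        {strip_version(t)
         for g in gene_ids
         for t in gene_to_transcripts.get(g, [])} & current
    )
    if via_gene:
        return via_gene, "NCBI Gene ID"

    # 3. No restriction: the caller keeps the best matching alignment.
    return [], "best match"
```

The second element of the result corresponds to the "how these transcripts were found" information that the details page reports.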
+ ++Proteins were aligned to transcripts with TBLASTN, converted to PSL, filtered +with pslReps (93% query coverage, within top 1% score), lifted to genome +positions with pslMap and filtered again. UniProt annotations were +obtained from the UniProt XML file. The UniProt annotations were then mapped to the +genome through the alignment using the pslMap program. This mapping approach +draws heavily on the LS-SNP pipeline by Mark Diekhans. For human and mouse, the +alignments were filtered by retaining only proteins annotated with +a given transcript in the Genome Browser table kgXref. Like all Genome Browser +source code, the main script used to build this track can be found on +GitHub. +
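To make the two filter thresholds concrete, the sketch below applies a 93% query-coverage cutoff and a "within 1% of the best score" cutoff to a list of alignments. It is a simplified stand-in and not the pslReps algorithm itself, which evaluates alignments differently. The dictionary keys (`query`, `qSize`, `aligned`, `score`) are hypothetical, and the sketch assumes positive scores and takes the best score per protein over all of its alignments.

```python
from typing import Dict, List


def filter_alignments(
    alignments: List[Dict],
    min_cover: float = 0.93,
    near_top: float = 0.01,
) -> List[Dict]:
    # Best score observed for each query protein.
    best: Dict[str, float] = {}
    for aln in alignments:
        q = aln["query"]
        best[q] = max(best.get(q, float("-inf")), aln["score"])

    kept = []
    for aln in alignments:
        # Fraction of the query sequence covered by the alignment.
        cover = aln["aligned"] / aln["qSize"]
        # Score must lie within near_top (1%) of the query's best score.
        threshold = best[aln["query"]] * (1.0 - near_top)
        if cover >= min_cover and aln["score"] >= threshold:
            kept.append(aln)
    return kept
```

Several alignments can survive for one protein, which matches the observation that multiple transcripts may share an identical coding sequence.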
+ ++This track is automatically updated on an ongoing basis, every 3-6 months. +The current version is always shown on the track details page. It includes the +release of UniProt, the version of the transcript set and a unique MD5 that is +based on the protein sequences, the transcript sequences, the mapping file +between both and the transcript-genome alignment. +
+ ++Previous versions of this track are available for browsing in the form of the + +UCSC UniProt Archive Track Hub. The underlying data of all releases of this track (past +and current) can be obtained from our +Downloads Server, in the data archive directory. +The UniProt protein-to-genome alignment is also available from there, in file +formats for our command line programs liftOver or pslMap, which can be used to +map coordinates on protein sequences to genome coordinates. The filenames are +unipToGenome.over.chain.gz and unipToGenomeLift.psl.gz. +
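For real work, liftOver or pslMap with the files named above is the way to map protein positions to the genome. The sketch below only illustrates the idea of projecting an interval through gapless alignment blocks. Everything in it is an assumption made for illustration: the block tuple layout, plus-strand-only handling, 0-based half-open coordinates, and the conversion of amino acid positions to nucleotides by multiplying by 3. It does not describe the coordinate system actually used in the distributed files.

```python
from typing import List, Tuple

# (query_start_nt, genome_start, size_nt): one gapless block, plus strand only.
Block = Tuple[int, int, int]


def project_interval(start: int, end: int, blocks: List[Block]) -> List[Tuple[int, int]]:
    """Project the half-open interval [start, end) through the blocks."""
    pieces = []
    for q_start, t_start, size in blocks:
        lo = max(start, q_start)
        hi = min(end, q_start + size)
        if lo < hi:
            # Within a gapless block, query and genome differ by a fixed offset.
            offset = t_start - q_start
            pieces.append((lo + offset, hi + offset))
    return pieces


def protein_to_genome(aa_start: int, aa_end: int, blocks: List[Block]) -> List[Tuple[int, int]]:
    # One residue corresponds to three coding nucleotides.
    return project_interval(aa_start * 3, aa_end * 3, blocks)
```

A feature that spans an exon boundary comes back as more than one genomic piece, one per block it overlaps.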
+ +
+The raw data of the current track can be explored interactively with the
+Table Browser, or the
+Data Integrator.
+For automated analysis, the genome annotation is stored in a bigBed file that
+can be downloaded from the
+download server.
+The exact filenames can be found in the
+track configuration file.
+Annotations can be converted to ASCII text by our tool bigBedToBed
+which can be compiled from the source code or downloaded as a precompiled
+binary for your system. Instructions for downloading source code and binaries can be found
+here.
+The tool can also be used to obtain only features within a given range, for example:
+
+bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/$db/uniprot/unipStruct.bb -chrom=chr6 -start=0 -end=1000000 stdout
+
+Please refer to our
+mailing list archives
+for questions, or our
+Data Access FAQ
+for more information.
+
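For scripted access, the bigBedToBed call shown above can be assembled from Python. This is a minimal sketch: the helper names are hypothetical, only the options that appear in the example command are used, and the file name `unipStruct.bb` is taken from that example (other file names should be looked up in the track configuration file). The sketch assumes bigBedToBed is installed and on the PATH when the command is actually executed.

```python
from typing import Dict, List


def bigbed_command(db: str, chrom: str, start: int, end: int,
                   file_name: str = "unipStruct.bb") -> List[str]:
    # Same URL pattern as the example command, with $db filled in.
    url = f"http://hgdownload.soe.ucsc.edu/gbdb/{db}/uniprot/{file_name}"
    return ["bigBedToBed", url,
            f"-chrom={chrom}", f"-start={start}", f"-end={end}", "stdout"]


def parse_bed_line(line: str) -> Dict:
    # The first four BED columns are chrom, start, end and name;
    # any further columns are kept as an unparsed list.
    fields = line.rstrip("\n").split("\t")
    return {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "name": fields[3] if len(fields) > 3 else "",
        "extra": fields[4:],
    }
```

The list returned by `bigbed_command` can be passed to `subprocess.run(..., capture_output=True, text=True)`, and each line of the captured standard output can then be handed to `parse_bed_line`.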
+ +
+This track was created by Maximilian Haeussler at UCSC, with help from Chris +Lee, Mark Diekhans and Brian Raney, feedback from the UniProt staff, Alejo +Mujica, Regeneron Pharmaceuticals and Pia Riestra, GeneDx. Thanks to UniProt for making all data +available for download. +
+ ++UniProt Consortium. + +Reorganizing the protein space at the Universal Protein Resource (UniProt). +Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. +PMID: 22102590; PMC: PMC3245120 +
+ ++Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A. + +The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure +information on human protein variants. +Hum Mutat. 2004 May;23(5):464-70. +PMID: 15108278 +