2593718ac23f10aabf08572b67c20c3db436ef03
max
Thu Apr 1 02:09:59 2021 -0700
clarifying the uniprot docs a bit, refs #27308

diff --git src/hg/makeDb/trackDb/uniprot.html src/hg/makeDb/trackDb/uniprot.html
index 69006ce..5d47dbf 100644
--- src/hg/makeDb/trackDb/uniprot.html
+++ src/hg/makeDb/trackDb/uniprot.html
@@ -1,41 +1,49 @@
 <h2>Description</h2>
 <p>
-This track shows protein sequence annotations from the <a
+This track shows protein sequences and annotations on them from the <a
 href="https://www.uniprot.org/" target="_blank">UniProt/SwissProt</A> database,
-mapped to genomic coordinates. It also shows how the protein sequences in this database
-map to the genome.
-The data has been curated from scientific publications by the UniProt/SwissProt staff.
-The annotations are divided into multiple subtracks, based on their "feature type" in UniProt:
+mapped to genomic coordinates.
+</p>
+<p>
+UniProt/SwissProt data has been curated from scientific publications by the UniProt staff,
+UniProt/TrEMBL data has been predicted by various computational algorithms.
+The annotations are divided into multiple subtracks, based on their "feature type" in UniProt.
+The first two subtracks below - one for SwissProt, one for TrEMBL - show the
+alignments of protein sequences to the genome, all other tracks below are the protein annotations
+mapped through these alignments to the genome.
 </p>
 <table class="stdTbl">
   <tr>
     <th>Track Name</th>
     <th>Description</th>
   </tr>
   <tr>
-    <td>UCSC Alignment, SwissProt</td>
+    <td>UCSC Alignment, SwissProt = curated protein sequences</td>
     <td>Protein sequences from SwissProt mapped onto the genome. All other
-        tracks are (start,end) annotations mapped using this track.</td> </tr>
+        tracks are (start,end) SwissProt annotations on these sequences mapped
+        using this track. Protein sequences without a single curated
+        annotation were not added to this track.</td> </tr>
   <tr>
-    <td>UCSC Alignment, TrEMBL</td>
+    <td>UCSC Alignment, TrEMBL = predicted protein sequences</td>
     <td>Protein sequences from TrEMBL mapped onto the genome. All other tracks
-        are (start,end) annotations mapped using this track. This track is
-hidden by default. To show it, click its checkbox on the track description
-page.</td> </tr>
+        below are (start,end) TrEMBL annotations mapped to the genome using
+        this track. This track is hidden by default. To show it, click its
+        checkbox on the track configuration page. Protein sequences without a single
+        predicted annotation on them were not added to this track.</td></tr>
   <tr>
     <td>UniProt Signal Peptides</td>
     <td>Regions found in proteins destined to be secreted, generally cleaved from mature protein.</td>
   </tr>
   <tr>
     <td>UniProt Extracellular Domains</td>
     <td>Protein domains with the comment "Extracellular".</td>
   </tr>
   <tr>
     <td>UniProt Transmembrane Domains</td>
     <td>Protein domains of the type "Transmembrane".</td>
   </tr>
   <tr>
     <td>UniProt Cytoplasmic Domains</td>
     <td>Protein domains with the comment "Cytoplasmic".</td>
@@ -121,34 +129,34 @@
 mutated amino acids as a SwissProt annotation, it is not shown again. Two
 annotations mapped through different transcripts but with the same genome
 coordinates are only shown once.
 </p>
 <p>Note that only for the human hg38 assembly and SwissProt annotations, there
 also is a <a
 href="hgTracks?db=hg38&hubUrl=ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/genome_annotation_tracks/UP000005640_9606_hub/hub.txt">public
 track hub</a> prepared by UniProt itself, with genome annotations maintained by UniProt
 using their own mapping method based on those Gencode/Ensembl gene models that
 are annotated in UniProt for a given protein.</p>
 <h2>Methods</h2>
 <p>
-UniProt sequences were aligned to UCSC/Gencode transcript sequences first with
+UniProt sequences were aligned to one of UCSC, Gencode, Ensembl or Augustus transcript sequences, first with
 BLAT, filtered with pslReps (93% query coverage, within top 1% score), lifted to
 genome positions with pslMap and filtered again. UniProt annotations were
-obtained from the UniProt XML file. The annotations were then mapped to the
+obtained from the UniProt XML file. The UniProt annotations were then mapped to the
 genome through the alignment using the pslMap program. This mapping approach
 draws heavily on the <A HREF="http://modbase.compbio.ucsf.edu/LS-SNP/"
 TARGET="_BLANK">LS-SNP</A> pipeline by Mark Diekhans.
 For human and mouse, the alignments were filtered by retaining only proteins
 annotated with a given transcript in the Genome Browser table kgXref.
 Like all Genome Browser source code, the main script used to build this track
 can be found on <a
 href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/utils/otto/uniprot/doUniprot">github</a>.
 </p>
 <h2>Data Access</h2>
 <p>
 The raw data can be explored interactively with the
 <a href="../cgi-bin/hgTables">Table Browser</a>, or the
 <a href="../cgi-bin/hgIntegrator">Data Integrator</a>.