src/hg/makeDb/trackDb/uniprotAlpha.html 8c4eff5f7832be1b022735416a2bb1d4c7ad42be

8c4eff5f7832be1b022735416a2bb1d4c7ad42be
gperez2
  Wed Sep 14 13:17:02 2022 -0700
Releasing the UniProt bugs update to the RR, refs #28560

diff --git src/hg/makeDb/trackDb/uniprotAlpha.html src/hg/makeDb/trackDb/uniprotAlpha.html
deleted file mode 100644
index 7ca655c..0000000
--- src/hg/makeDb/trackDb/uniprotAlpha.html
+++ /dev/null
@@ -1,302 +0,0 @@
-<h2>Description</h2>
-
-<p>
-This track shows protein sequences and annotations on them from the <a
-href="https://www.uniprot.org/" target="_blank">UniProt/SwissProt</A> database,
-mapped to genomic coordinates. 
-</p>
-<p>
-UniProt/SwissProt data has been curated from scientific publications by the UniProt staff,
-whereas UniProt/TrEMBL data has been predicted by various computational algorithms.
-The annotations are divided into multiple subtracks, based on their &quot;feature type&quot; in UniProt.
-The first two subtracks below - one for SwissProt, one for TrEMBL - show the
-alignments of protein sequences to the genome; all other tracks below are the protein annotations
-mapped through these alignments to the genome.
-</p> 
-
-<table class="stdTbl">
-  <tr>
-    <th>Track Name</th>
-    <th>Description</th>
-  </tr>
-  <tr>
-    <td>UCSC Alignment, SwissProt = curated protein sequences</td>
-    <td>Protein sequences from SwissProt mapped to the genome. All other
-        tracks are (start,end) SwissProt annotations on these sequences mapped
-        through this alignment. Even protein sequences without a single curated 
-    annotation (splice isoforms) are visible in this track. Each UniProt protein 
-    has one main isoform, which is shown in a dark color. Alternative isoforms are 
-    sequences that do not have annotations on them and are colored in light-blue. 
-    They can be hidden with the TrEMBL/Isoform filter (see below).</td> </tr>
-<tr>
-    <td>UCSC Alignment, TrEMBL = predicted protein sequences</td>
-    <td>Protein sequences from TrEMBL mapped to the genome. All other tracks
-        below are (start,end) TrEMBL annotations mapped to the genome using
-        this track. This track is hidden by default. To show it, click its
-        checkbox on the track configuration page. </td></tr>
-  <tr>
-    <td>UniProt Signal Peptides</td>
-    <td>Regions found in proteins destined to be secreted, generally cleaved from mature protein.</td>
-  </tr>
-  <tr>
-    <td>UniProt Extracellular Domains</td>
-    <td>Protein domains with the comment &quot;Extracellular&quot;.</td>
-  </tr>
-  <tr>
-    <td>UniProt Transmembrane Domains</td>
-    <td>Protein domains of the type &quot;Transmembrane&quot;.</td>
-  </tr>
-  <tr>
-    <td>UniProt Cytoplasmic Domains</td>
-    <td>Protein domains with the comment &quot;Cytoplasmic&quot;.</td>
-  </tr>
-  <tr>
-    <td>UniProt Polypeptide Chains</td>
-    <td>Polypeptide chain in mature protein after post-processing.</td>
-  </tr>
-  <tr>
-    <td>UniProt Regions of Interest</td>
-    <td>Regions that have been experimentally defined, such as the role of a region in mediating protein-protein interactions or some other biological process.</td>
-  </tr>
-  <tr>
-    <td>UniProt Domains</td>
-    <td>Protein domains, zinc finger regions and topological domains.</td>
-  </tr>
-  <tr>
-    <td>UniProt Disulfide Bonds</td>
-    <td>Disulfide bonds.</td>
-  </tr>
-  <tr>
-    <td>UniProt Amino Acid Modifications</td>
-    <td>Glycosylation sites, modified residues and lipid moiety-binding regions.</td>
-  </tr>
-  <tr>
-    <td>UniProt Amino Acid Mutations</td>
-    <td>Mutagenesis sites and sequence variants.</td>
-  </tr>
-  <tr>
-    <td>UniProt Protein Primary/Secondary Structure Annotations</td>
-    <td>Beta strands, helices, coiled-coil regions and turns.</td>
-  </tr>
-  <tr>
-    <td>UniProt Sequence Conflicts</td>
-    <td>Differences between GenBank sequences and the UniProt sequence.</td>
-  </tr>
-  <tr>
-    <td>UniProt Repeats</td>
-    <td>Regions of repeated sequence motifs or repeated domains.</td>
-  </tr>
-  <tr>
-    <td>UniProt Other Annotations</td>
-    <td>All other annotations, e.g. compositional bias.</td>
-  </tr>
-</table>
-<p>
-For consistency and convenience for users of mutation-related tracks,
-the subtrack &quot;UniProt/SwissProt Variants&quot; is a copy of the track
-&quot;UniProt Variants&quot; in the track group &quot;Phenotype and Literature&quot;, or 
-&quot;Variation and Repeats&quot;, depending on the assembly.
-</p>
-
-<h2>Display Conventions and Configuration</h2>
-
-<p>
-Genomic locations of UniProt/SwissProt annotations are labeled with a short name for
-the type of annotation (e.g. &quot;glyco&quot;, &quot;disulf bond&quot;, &quot;Signal peptide&quot;).
-Clicking a feature shows the full annotation and provides a link to the UniProt/SwissProt
-record for more details. TrEMBL annotations are always shown in 
-<span style="color: rgb(0,150,250)"><b>light blue</b></span>, except in the Signal Peptides,
-Extracellular Domains, Transmembrane Domains, and Cytoplasmic Domains subtracks.</p>
-
-<p>
-Mouse over a feature to see the full UniProt annotation comment. For variants, the mouseover
-shows the full name of the disease behind the UniProt disease acronym.
-</p>
-
-<p>
-The subtracks for domains related to subcellular location are sorted from outside to inside of 
-the cell: <span style="color: rgb(255,0,150)"><b>Signal peptide</b></span>, 
-<span style="color: rgb(0,150,255)"><b>extracellular</b></span>, <span style="color: rgb(0,150,0)">
-<b>transmembrane</b></span>, and <span style="color: rgb(255,150,0)"><b>cytoplasmic</b></span>.
-</p>
-
-<p>
-In the &quot;UniProt Modifications&quot; track, lipidation sites are highlighted in 
-<span style="color: rgb(12,12,120)"><b>dark blue</b></span>, glycosylation sites in 
-<span style="color: rgb(0,100,100)"><b>dark green</b></span>, and phosphorylation in 
-<span style="color: rgb(200,200,0)"><b>light green</b></span>.</p>
-
-<p>
-Duplicate annotations are removed as far as possible: if a TrEMBL annotation
-has the same genome position and same feature type, comment, disease and
-mutated amino acids as a SwissProt annotation, it is not shown again. Two
-annotations mapped through different protein sequence alignments but with the same genome
-coordinates are only shown once.  </p>
-
-<p>On the configuration page of this track, you can choose to hide any TrEMBL annotations.
-This filter will also hide the UniProt alternative isoform protein sequences because
-both types of information are less relevant to most users. Please contact us if you
-want more detailed filtering features.</p>
-
-<p>Note that for the human hg38 assembly and SwissProt annotations, there
-is also a <a
-href="hgTracks?db=hg38&hubUrl=https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/genome_annotation_tracks/UP000005640_9606_hub/hub.txt" target=_blank>public
-track hub</a> prepared by UniProt itself, with 
-genome annotations maintained by UniProt using their own mapping
-method based on those Gencode/Ensembl gene models that are annotated in UniProt
-for a given protein. For proteins that differ from the genome, UniProt's mapping method
-will, in most cases, map a protein and its annotations to an unexpected location
-(see below for details on UCSC's mapping method).</p>
-
-<h2>Methods</h2>
-
-<p>
-Briefly, UniProt protein sequences were aligned to the transcripts associated
-with the protein, the top-scoring alignments were retained, and the result was
-projected to the genome through a transcript-to-genome alignment.
-Depending on the genome, the transcript-genome alignment was either
-provided by the source database (NCBI RefSeq), created at UCSC (UCSC RefSeq) or
-derived from the transcripts (Ensembl/Augustus).  The transcript set is NCBI
-RefSeq for hg38 and UCSC RefSeq for hg19 (due to alt/fix haplotype misplacements 
-in the NCBI RefSeq set on hg19). For other genomes, RefSeq, Ensembl and Augustus 
-are tried, in this order. The resulting protein-genome alignments of this process 
-are available in the file formats for liftOver or pslMap from our data archive
-(see "Data Access" section below).
-</p>
-
-<p>An important step of the mapping process is filtering the alignment from
-protein to transcript. Due to differences between the UniProt proteins and the
-transcripts and the genome, the best matching transcript is not always the
-correct transcript.  Therefore, for organisms that have a RefSeq transcript track,
-proteins are aligned only to the RefSeq transcripts that are annotated
-by UniProt for this protein. If no transcripts are annotated on the protein, or
-the annotated ones do not exist anymore, but an NCBI Gene ID is annotated,
-the RefSeq transcripts for the gene are used.  If no NCBI Gene is annotated,
-then the best matching alignment is used. Only a handful of edge cases
-(pseudogenes, very recently added proteins) on hg38 remain where the
-global transcriptome-wide matches have to be used.  The details page of the
-protein alignments shows the transcripts used for the mapping and how
-these transcripts were found. There can be multiple transcripts for one
-protein, as their coding sequences can be identical or several of them do
-not differ by more than 1% in alignment score.
-</p>
-
-<p>In other words, when an NCBI or UCSC RefSeq track is used for the mapping, we align a
-protein sequence to the correct transcript using a three-stage process:
-<ol>
-    <li>If UniProt has annotated a given RefSeq transcript for a given protein
-    sequence, the protein is aligned to this transcript. Any difference in the
-    version suffix is tolerated in this comparison.  
-    <li>If no transcript is annotated or the transcript cannot be found in the
-    NCBI/UCSC RefSeq track, the UniProt-annotated NCBI Gene ID is resolved to a
-    set of NCBI RefSeq transcript IDs via the most current version of NCBI
-    genes tables. Only the top match of the resulting alignments and all
-    others within 1% of its score are used for the mapping.
-    <li>If no transcript can be found after step (2), the protein is aligned to all transcripts,
-    and the top match and all others within 1% of its score are used.
-</ol>
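-<p>
-The three stages above can be sketched in a few lines of Python. This is only an
-illustration and not the code used to build the track: the function names, the input
-dictionaries, and the reading of &quot;within 1% of its score&quot; as
-<tt>score &gt;= 0.99 * best</tt> are all assumptions made for this example.
-</p>

```python
from typing import Dict, List, Optional, Tuple


def strip_version(acc: str) -> str:
    """Drop a version suffix, e.g. 'NM_000001.3' -> 'NM_000001'."""
    return acc.split(".", 1)[0]


def within_one_percent(scored: Dict[str, float]) -> List[str]:
    """Keep the top-scoring transcript and all others within 1% of its score.

    Assumption: 'within 1%' is taken to mean score >= 0.99 * best score.
    """
    if not scored:
        return []
    best = max(scored.values())
    return sorted(tx for tx, score in scored.items() if score >= 0.99 * best)


def choose_transcripts(
    annotated: List[str],              # RefSeq IDs annotated by UniProt (hypothetical input)
    gene_id: Optional[str],            # NCBI Gene ID annotated by UniProt, if any
    gene_to_tx: Dict[str, List[str]],  # Gene ID -> RefSeq transcript IDs (hypothetical table)
    scores: Dict[str, float],          # transcript ID in the track -> alignment score
) -> Tuple[str, List[str]]:
    """Return (how the transcripts were found, list of transcripts used)."""
    # Index the track's transcripts by accession without version suffix,
    # so that differences in the version suffix are tolerated.
    by_base: Dict[str, List[str]] = {}
    for tx in scores:
        by_base.setdefault(strip_version(tx), []).append(tx)

    # Stage 1: transcripts annotated directly by UniProt.
    stage1 = sorted(
        tx for acc in annotated for tx in by_base.get(strip_version(acc), [])
    )
    if stage1:
        return "UniProt-annotated transcript", stage1

    # Stage 2: transcripts of the UniProt-annotated NCBI Gene ID.
    if gene_id is not None:
        candidates = {
            tx: scores[tx]
            for acc in gene_to_tx.get(gene_id, [])
            for tx in by_base.get(strip_version(acc), [])
        }
        if candidates:
            return "NCBI Gene ID", within_one_percent(candidates)

    # Stage 3: fall back to the best matches over all transcripts.
    return "best match over all transcripts", within_one_percent(scores)


# Made-up example data, for illustration only.
example_scores = {"NM_000001.3": 1000.0, "NM_000002.1": 995.0, "NM_000003.2": 900.0}
example_genes = {"42": ["NM_000002", "NM_000003"]}
print(choose_transcripts(["NM_000001.2"], None, example_genes, example_scores))
```

-<p>
-In this sketch, an annotated transcript that is missing from the track falls through to
-stage 2, and a protein without any usable annotation falls through to stage 3.
-</p>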
-
-<p>This system was designed to resolve the problem of incorrect mappings of
-proteins, mostly on hg38, due to differences between the SwissProt
-sequences and the genome reference sequence, which has changed since the
-proteins were defined. The problem is most pronounced for gene families
-composed of either very repetitive or very similar proteins. To make sure that
-the alignments always go to the best chromosome location, all _alt and _fix
-reference patch sequences are ignored for the alignment, so the patches are
-entirely free of UniProt annotations. Please contact us if you have feedback on
-this process or example edge cases. We are not aware of a way to evaluate the
-results completely and in an automated manner.</p>
-<p>
-Proteins were aligned to transcripts with TBLASTN, converted to PSL, filtered
-with pslReps (93% query coverage, keep alignments within top 1% score), lifted to genome
-positions with pslMap and filtered again with pslReps.  UniProt annotations were
-obtained from the UniProt XML file.  The UniProt annotations were then mapped to the
-genome through the alignment described above using the pslMap program.  This approach
-draws heavily on the <A HREF="https://modbase.compbio.ucsf.edu/LS-SNP/"
-TARGET="_BLANK">LS-SNP</A> pipeline by Mark Diekhans.
-Like all Genome Browser source code, the main script used to build this track
-can be found on <a
-href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/utils/otto/uniprot/doUniprot">GitHub</a>.
-</p>
-
-<h2>Automated data updates and release history</h2>
-<p>
-This track is automatically updated on an ongoing basis, every 2-3 months.
-The current version is always shown on the track details page; it includes the
-release of UniProt, the version of the transcript set and a unique MD5 that is
-based on the protein sequences, the transcript sequences, the mapping file
-between the two and the transcript-genome alignment. The exact transcript
-that was used for the alignment is shown when clicking a protein alignment
-in one of the two alignment tracks.
-</p>
-
-<p>
-For reproducibility of older analysis results, previous versions of this track
-are available for browsing in the form of the <a
-    href="hgTracks?db=$db&hubUrl=https://hgdownload.soe.ucsc.edu/goldenPath/archive/$db/uniprot/hub.txt"
-    target=_blank> UCSC UniProt Archive Track Hub</a>. The underlying data of
-    all releases of this track (past and current) can be obtained from our <a
-    href="https://hgdownload.soe.ucsc.edu/goldenPath/archive/$db/uniprot"
-target=_blank>downloads server</a>, including the UniProt
-protein-to-genome alignment. The alignment files are provided in the formats used by the
-command-line programs liftOver and pslMap, which can be used to map
-coordinates on protein sequences to genome coordinates. The filenames are
-unipToGenome.over.chain.gz (liftOver) and unipToGenomeLift.psl.gz (pslMap).  </p>
-
-<h2>Data Access</h2>
-
-<p>
-The raw data of the current track can be explored interactively with the
-<a href="../cgi-bin/hgTables">Table Browser</a>, or the
-<a href="../cgi-bin/hgIntegrator">Data Integrator</a>.
-For automated analysis, the genome annotation is stored in a bigBed file that 
-can be downloaded from the
-<a href="http://hgdownload.soe.ucsc.edu/gbdb/$db/uniprot/" target="_blank">download server</a>.
-The exact filenames can be found in the 
-<a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/trackDb/uniprot.ra">track configuration file</a>. 
-Annotations can be converted to ASCII text by our tool <tt>bigBedToBed</tt>
-which can be compiled from the source code or downloaded as a precompiled
-binary for your system. Instructions for downloading source code and binaries can be found
-<a href="http://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads">here</a>.
-The tool can also be used to obtain only features within a given range, for example:
-<p>
-<tt>bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/$db/uniprot/unipStruct.bb -chrom=chr6 -start=0 -end=1000000 stdout</tt> 
-</p>
-Please refer to our
-<a href="https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome">mailing list archives</a>
-for questions, or our
-<a href="../FAQ/FAQdownloads.html#download36">Data Access FAQ</a>
-for more information. 
-</p>
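-<p>
-For scripted access, the <tt>bigBedToBed</tt> call shown above can be assembled
-programmatically. The following Python sketch only builds the command line from the
-URL pattern and options given in this section; the helper name
-<tt>bigbed_command</tt> is hypothetical, and <tt>hg38</tt> is used here only as an
-example value for <tt>$db</tt>.
-</p>

```python
import shlex
from typing import List, Optional

# URL prefix of the download server, as given in the text above.
BASE_URL = "http://hgdownload.soe.ucsc.edu/gbdb"


def bigbed_command(
    db: str,
    bb_file: str,
    chrom: Optional[str] = None,
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> List[str]:
    """Build a bigBedToBed argument list; range options are added only if given."""
    cmd = ["bigBedToBed", f"{BASE_URL}/{db}/uniprot/{bb_file}"]
    if chrom is not None:
        cmd.append(f"-chrom={chrom}")
    if start is not None:
        cmd.append(f"-start={start}")
    if end is not None:
        cmd.append(f"-end={end}")
    cmd.append("stdout")  # write the BED lines to standard output
    return cmd


# Same region as in the example above, with hg38 standing in for $db.
command = bigbed_command("hg38", "unipStruct.bb", chrom="chr6", start=0, end=1000000)
print(shlex.join(command))
```

-<p>
-The returned list can be passed to <tt>subprocess.run</tt> on a system where
-<tt>bigBedToBed</tt> is installed.
-</p>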
-
-<p>
-
-<h2>Credits</h2>
-
-<p>
-This track was created by Maximilian Haeussler at UCSC, with a lot of input from Chris
-Lee, Mark Diekhans and Brian Raney, and feedback from the UniProt staff, Alejo
-Mujica (Regeneron Pharmaceuticals) and Pia Riestra (GeneDx). Thanks to UniProt for making all data
-available for download.
-</p>
-
-<h2>References</h2>
-
-<p>
-UniProt Consortium.
-<a href="https://academic.oup.com/nar/article/40/D1/D71/2903687/Reorganizing-the-protein-space-at-the-Universal" target="_blank">
-Reorganizing the protein space at the Universal Protein Resource (UniProt)</a>.
-<em>Nucleic Acids Res</em>. 2012 Jan;40(Database issue):D71-5.
-PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/22102590" target="_blank">22102590</a>; PMC: <a
-href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245120/" target="_blank">PMC3245120</a>
-</p>
-
-<p>
-Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A.
-<a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/humu.20021" target="_blank">
-The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure
-information on human protein variants</a>.
-<em>Hum Mutat</em>. 2004 May;23(5):464-70.
-PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/15108278" target="_blank">15108278</a>
-</p>