src/hg/makeDb/trackDb/human/encodePseudogene.html 8c2f7318d8d821de9b2a25750586a94ab5e8c1bb

8c2f7318d8d821de9b2a25750586a94ab5e8c1bb
lrnassar
  Fri Nov 15 18:50:19 2024 -0800
Giving the UI link cronjob some love by fixing all the 301 redirects. These are the bulk of the items listed on the cron. No RM.

diff --git src/hg/makeDb/trackDb/human/encodePseudogene.html src/hg/makeDb/trackDb/human/encodePseudogene.html
index 4b4c811..4fbf941 100644
--- src/hg/makeDb/trackDb/human/encodePseudogene.html
+++ src/hg/makeDb/trackDb/human/encodePseudogene.html
@@ -1,412 +1,412 @@
 <H2>Description</H2>
 <P>
 This track shows the pseudogenes located in ENCODE regions generated by
 five different methods&mdash;Yale Pipeline, GenCode manual annotation, two 
 different UCSC methods, and Gene Identification Signature (GIS)&mdash;as well 
 as a consensus pseudogenes subtrack based on the 
 pseudogenes from all five methods. Datasets are displayed in separate
 subtracks within the annotation and are individually described below.</P>
 <P>
 The annotations are colored as follows: </P>
 <P>
 <TABLE BORDER=1 BORDERCOLOR="#aaaaaa" CELLPADDING=4>
 <TR>
   <TH align=left>Type</TH>
   <TH align=left>Color</TH>
   <TH align=left>Description</TH>
 </TR>
 <TR>
   <TD>Processed_pseudogene</TD>
   <TD><FONT COLOR=#C85BBF><B>pink</B></FONT></TD>
   <TD>Pseudogenes arising via retrotransposition (exon structure of parent gene lost) </TD>
 </TR>
 <TR>
   <TD>Unprocessed_pseudogene</TD>
   <TD><FONT COLOR=#005BBF><B>blue</B></FONT></TD>
   <TD>Pseudogenes arising via gene duplication (exon structure of parent gene retained)</TD>
 </TR>
 <TR>
   <TD>Pseudogene_fragment</TD>
   <TD><FONT COLOR=#645BBF><B>light blue</B></FONT></TD>
   <TD>Pseudogene sequences that are single-exon and cannot be confidently 
 assigned to either the processed or the duplicated category</TD>
 </TR>
 <TR>
   <TD>Undefined</TD>
   <TD>gray</TD>
   <TD>&nbsp;</TD>
 </TR>
 </TABLE></P>
 <P></P>
 <HR>
 
 <H2>Consensus Pseudogenes</H2>
 <H3>Description</H3>
 <P>
 This subtrack shows pseudogenes derived from a consensus of the five 
 methods listed above. In the pseudogene.org data freeze dated 6 Jan. 2006, 
 201 consensus pseudogenes were found.
 Here, pseudogenes are defined as genomic sequences that are similar to known 
 genes but exhibit various inactivating disablements (<em>e.g.</em> premature 
 stop codons or frameshifts) in their putative protein-coding regions and are 
 flagged as either recently-processed or non-processed.</P>
 
 <H3>Methods</H3>
 <P>
 The pseudogene sets were processed as follows:
 <UL>
 <LI><B>Step I:</B> The four data sets were filtered to remove pseudogenes 
 that overlap with current Gencode coding exons/loci. Pseudogenes overlapping 
 with introns or noncoding genes were kept. Subsequent filtering of pseudogene
 sets, excluding the Havana set, removed pseudogenes overlapping with exons of 
 UCSC Known Genes.
 
 <LI><B>Step II:</B> A union of the pseudogenes from each filtered set was 
 created. If a pseudogenic region was annotated by more than one group, 
 the lowest starting coordinate and highest ending coordinate were used as the
 boundaries.
 
 <LI><B>Step III:</B> A parent protein for each pseudogene in the union was
 assigned using a protein set from UniProt. Pseudogenes without a matching 
 protein were excluded.
 
 <LI><B>Step IV:</B> Each pseudogene was realigned to its parent protein.
 
 <LI><B>Step V:</B> The consensus list of pseudogenes was updated with 
 boundaries derived from the alignment in Step IV.
 
 <LI><B>Step VI:</B> The consensus list of pseudogenes was updated with the
 assigned parent proteins and new classifications (processed or non-processed).
 </UL></P>
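 <P>
 The union in Step II can be illustrated with a minimal Python sketch. It is 
 not the code used to build this subtrack: the tuple layout, the function name 
 <code>union_pseudogene_regions</code>, and the overlap rule (closed 
 coordinates, regions merged when they share at least one base) are 
 assumptions made for illustration.</P>

```python
from typing import List, Tuple

# (chromosome, start, end); closed coordinates are an assumption of this sketch.
Interval = Tuple[str, int, int]


def union_pseudogene_regions(annotations: List[Interval]) -> List[Interval]:
    """Merge overlapping annotations from several groups.

    For each merged region, the lowest start and the highest end among
    the contributing annotations become the boundaries (Step II).
    """
    merged: List[Interval] = []
    # Sorting by chromosome, then start, puts overlapping regions side by side.
    for chrom, start, end in sorted(annotations):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            # prev_start is already the lowest start because of the sort.
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged
```

 <P>
 For example, two annotations at chr1:100-200 and chr1:150-300 from different 
 groups would be reported as one region, chr1:100-300.</P>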
      
 <H3>Verification of the Consensus Pseudogenes</H3>
 <P>
 All pseudogenes in the list have been extensively curated by Adam Frankish and
 Jennifer Harrow at the Wellcome Trust Sanger Institute.</P>
 
 <H3>References</H3>
 <P>
 More information about this data set is available from <A
 HREF=http://www.pseudogene.org/ENCODE/ TARGET=_blank>pseudogene.org/ENCODE</A>.
 </P>
 <P></P>
 <HR>
 
 <H2>Havana-Gencode Annotated Pseudogenes and Immunoglobulin Segments</H2>
 <H3>Description</H3>
 <P>
 This track shows pseudogenes annotated by the 
-<A HREF=http://www.sanger.ac.uk/HGP/havana/ TARGET=_blank>HAVANA</A> group 
+<A HREF=https://www.sanger.ac.uk/HGP/havana/ TARGET=_blank>HAVANA</A> group 
 at the Wellcome Trust Sanger Institute.  Pseudogenes have homology to protein
 sequences but generally have a disrupted CDS.  For all annotated
 pseudogenes, an active homologous gene (the parent) can be identified
 elsewhere in the genome.  Pseudogenes are classified as processed or
 unprocessed.</P>
 
 <H3>Methods</H3>
 <P>
 Prior to manual annotation, finished sequence is submitted to an
 automated analysis pipeline for similarity searches and ab initio gene
 predictions.  The searches are run on a computer farm and stored in an
 Ensembl MySQL database using the Ensembl analysis pipeline system
 (Searle <em>et al.</em>, 2004, Harrow <em>et al.</em>, 2006).</P>
 <P>
 A pseudogene is annotated
 where the total length of the protein homology to the genomic sequence
 is &gt;20% of the length of the parent protein or &gt;100 aa in length,
 whichever is shorter.  If a gene structure has an ORF but has lost
 the structure of the parent gene, a pseudogene is annotated provided there
 is no evidence of transcription from the pseudogene locus.  When an
 open but truncated reading frame is present, other evidence is used
 (for example, 3' genomic polyA tract) to allow classification as a
 pseudogene.  When a parent gene has only a single coding exon (e.g.
 olfactory receptors), a small 5' or 3' truncation to the CDS at the
 pseudogene locus (compared to other family members) is sufficient to
 confirm pseudogene status where the truncation is predicted to
 significantly affect secondary structure by the literature and/or
 expert community.  </P>
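 <P>
 The length criterion above can be written as a small check. This is a sketch 
 only: the function name and arguments are hypothetical, and the other 
 evidence described above (transcription, polyA tracts, expert judgement) is 
 not modeled.</P>

```python
def meets_homology_length_criterion(homology_aa: int, parent_aa: int) -> bool:
    """Return True if the protein homology is long enough to annotate.

    The cutoff is the shorter of 20% of the parent protein length and
    100 aa. Exceeding the shorter of two cutoffs is the same as exceeding
    either one, so the test is written as an 'or'. Integer arithmetic
    avoids floating-point rounding at the boundary.
    """
    exceeds_fraction = homology_aa * 5 > parent_aa   # more than 20% of parent
    exceeds_absolute = homology_aa > 100             # more than 100 aa
    return exceeds_fraction or exceeds_absolute
```

 <P>
 With a 300 aa parent the effective cutoff is 60 aa; with a 1000 aa parent it 
 is 100 aa.</P>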
 <P>
 Processed and unprocessed pseudogenes are
 distinguished on the basis of structure and genomic context. 
 Processed pseudogenes, which arise via retrotransposition, lose the
 intron-exon structure of the parent gene, often have an A-rich tract
 indicative of the insertion site at their 3' end, and are flanked by
 different genomic sequence to the parent gene.  Unprocessed
 pseudogenes, which arise via gene duplication, share both the
 intron-exon structure and flanking genomic sequence with the parent
 gene.  Transcribed pseudogenes are indicated by the annotation of a
 pseudogene and transcript variant alongside each other.</P>
 
 <H3>References</H3>
 <P>
 Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, 
 Gilbert JG, Storey R, Swarbreck D, <em>et al</em>.
 <A HREF="https://genomebiology.biomedcentral.com/articles/10.1186/gb-2006-7-s1-s4"
 TARGET=_blank>GENCODE: Producing a reference annotation for ENCODE</A>. 
 <em>Genome Biol</em>. 2006;7 Suppl 1:S4.1-9. </P>
 <P>
 Searle SM, Gilbert J, Iyer V, Clamp M.
 <A HREF="https://genome.cshlp.org/content/14/5/963.full"
 TARGET=_blank>The otter annotation system</A>.
 <em>Genome Res</em>. 2004 May;14(5):963-70.</P>
 <P></P>
 <HR>
 
 <H2>Yale Pseudogenes</H2>
 <H3>Description</H3>
 <P>
 This subtrack shows pseudogenes in the ENCODE regions identified by the Yale 
 Pseudogene Pipeline. In this analysis, pseudogenes are defined as genomic 
 sequences that are similar to known genes with various inactivating 
 disablements (e.g. premature stop codons or frameshifts) in their 
 putative protein-coding regions. Pseudogenes are flagged as 
 recently processed, recently duplicated, or of uncertain origin (either 
 ancient fragments or resulting from a single-exon parent).</P>
 
 <H3>Methods</H3>
 <P>
 <UL>
 <LI><B>Step I:</B> Repeat-masked human genome sequence was used as the target 
 for a six-frame TBLASTN where the query was the nonredundant human proteome 
 set (European Bioinformatics Institute). Only high-quality human protein 
 sequences from SWISS-PROT and TrEMBL were used, because this set included 
 processed or duplicated pseudogenes.
 <LI><B>Step II:</B> BLAST hits that had a significant overlap with annotated
 multiple-exon Ensembl genes were removed from consideration.
 <LI><B>Step III:</B> The set of BLAST hits was reduced by selecting hits in 
 decreasing significance level and removing matches that overlapped by more
 than 10 amino acids or 30 bp with a picked match.
 <LI><B>Step IV:</B> Adjacent matches on a chromosome were merged together if 
 they were thought to belong to the same pseudogene locus. Merged matches were
 extended on both sides to include the length of the query protein to which they 
 matched along with an extra 30 bp buffer on either side. 
 <LI><B>Step V:</B> The FASTA program was used to re-align these extended hits to
 the genome. Redundant hits were removed and hits with gaps greater than 60 bp
 were split into two alignments.
 <LI><B>Step VI:</B> Alignments with possible artifactual frameshifts or stop 
 codons introduced by the alignment process were closely inspected.
 <LI><B>Step VII:</B> False positives (E-value less than 10<SUP>-10</SUP> or 
 amino acid sequence of less than 40% identity) and sequences matching 
 protein queries containing repeats or low-complexity regions were removed. 
 Potential functional genes were also removed. These were defined as having no 
 frameshift disruptions, less than 95% sequence identity to the query protein, 
 and translatable to a protein sequence longer than 95% of the length of 
 the query protein.
 <LI><B>Step VIII:</B> The remaining putative pseudogene sequences were 
 classified based on several criteria. The intron-exon structure of the 
 functional gene was further used to infer whether a pseudogene was recently 
 duplicated or processed. A duplicated pseudogene retains the intron-exon 
 structure of its parent functional gene, whereas a processed pseudogene shows 
 evidence that this structure has been spliced out. Those sequences
 where the insertions were 50% or more repeats (as detected by RepeatMasker)
 are "Disrupted" processed pseudogenes. Small pseudogene sequences that 
 cannot be confidently assigned to either the processed or duplicated 
 category may be ancient fragments. Further details can be found in the 
 references below.
 </UL></P>
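 <P>
 The hit reduction in Step III is a greedy selection, which can be sketched as 
 follows. This is an illustration, not the Yale pipeline code: the 
 <code>Hit</code> record, the use of the E-value as the significance measure, 
 half-open genomic coordinates, and measuring overlap on the genome only 
 (30 bp, equivalent to 10 amino acids) are all assumptions.</P>

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    chrom: str
    start: int     # genomic start, 0-based half-open (assumed)
    end: int       # genomic end
    evalue: float  # smaller means more significant


MAX_OVERLAP_BP = 30  # 10 amino acids expressed in base pairs


def overlap_bp(a: Hit, b: Hit) -> int:
    """Number of genomic bases shared by two hits (0 if none)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def pick_hits(hits: List[Hit]) -> List[Hit]:
    """Keep hits in decreasing significance, dropping any hit that
    overlaps an already picked hit by more than MAX_OVERLAP_BP."""
    picked: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if all(overlap_bp(hit, p) <= MAX_OVERLAP_BP for p in picked):
            picked.append(hit)
    return picked
```

 <P>
 Because hits are visited from most to least significant, a weaker hit is 
 removed only when a stronger hit already covers the same sequence.</P>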
 
 <H3>Verification of Yale Pseudogenes</H3>
 <P>
 All pseudogenes in the list have been manually checked.</P>
 
 <H3>References</H3>
 <P>
 Zhang Z, Harrison PM, Liu Y, Gerstein M. 
 <A HREF="http://papers.gersteinlab.org/papers/human-pgenes-gr2003/"
 TARGET=_blank>Millions of years of evolution preserved: a comprehensive catalog 
 of the processed pseudogenes in the human genome</A>.
 <em>Genome Res</em>. 2003 Dec;13(12):2541-58. </P>
 <P>
 Zheng D, Zhang Z, Harrison PM, Karro J, Carriero N, Gerstein M. 
 <A HREF="http://archive.gersteinlab.org/papers/e-print/chr22-pgene-exp/preprint.pdf"
 TARGET=_blank>Integrated pseudogene annotation for human chromosome 22: evidence
 for transcription</A>.
 <em>J Mol Biol</em>. 2005 May 27;349(1):27-45.</P>
 <P></P>
 <HR>
 
 <H2>UCSC Retrogene Predictions</H2>
 <H3>Description</H3>
 <P>
 The Retrogene subtrack shows processed mRNAs that have been inserted back 
 into the genome since the mouse/human split. Retrogenes can be 
 functional genes that have acquired a promoter from a neighboring gene, 
 non-functional pseudogenes, or transcribed pseudogenes.</P>
 
 <H3>Methods</H3>
 <P>
 <UL>
 <LI><B>Step I:</B> All GenBank mRNAs for a particular species were aligned to 
 the genome using blastz. 
 <LI><B>Step II:</B> mRNAs that aligned twice in the genome (once with introns 
 and once without introns) were initially screened. 
 <LI><B>Step III:</B> A series of features were scored to determine candidates 
 for retrotransposition events. These features included position and length of the
 polyA tail, degree of synteny with mouse, coverage of repetitive elements, 
 number of exons that can still be aligned to the retrogene, and degree of 
 divergence from the parent gene. Retrogenes are classified using a threshold
 score function that is a linear combination of this set of features.
 Retrogenes in the final set have a score threshold greater than 425 based on a
 ROC plot against the Vega annotated pseudogenes.
 </UL></P>
 <P>
 The &quot;type&quot; field has four possible values: 
 <UL>
 <LI><B>singleExon: </B> the parent gene is a single exon gene
 <LI><B>mrna:</B> the parent gene is a spliced mRNA that 
 has no annotation in NCBI refSeq, UCSC knownGene or Mammalian Gene Collection 
 (MGC)
 <LI><B>annotated:</B> the parent gene has been annotated
 by one of refSeq, knownGene or MGC
 <LI><B>expressed:</B> an mRNA overlaps
 the retrogene, indicating probable transcription
 </UL></P>
 <P>
 These features can be downloaded from the table pseudoGeneLink in many 
 formats using the Table Browser option on the menubar. </P>
 
 <H3>References</H3>
 <P>
 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D.
 <A HREF="https://www.pnas.org/content/100/20/11484" 
 TARGET=_blank>Evolution's cauldron: 
 Duplication, deletion, and rearrangement in the mouse and human genomes</A>. 
 <I>Proc Natl Acad Sci USA</I>. 2003 Sep 30;100(20):11484-9.</P>
 <P>
 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison R, 
 Haussler D, Miller W.
 <A HREF="https://genome.cshlp.org/content/13/1/103.abstract" 
 TARGET=_blank>Human-mouse alignments with BLASTZ</A>. 
 <em>Genome Res</em>. 2003 Jan;13(1):103-7. </P>
 <P></P>
 <HR>
 
 <H2>UCSC Pseudogene Predictions</H2>
 <H3>Methods</H3>
 <P>
 <UL>
 <LI><B>Step I:</B> A set of pre-aligned human known genes was mapped across the 
 human genome through the human Blastz Self Alignment using HomoMap (homologous 
 mapping method). The fragments identified by HomoMap are homologs of genes 
 from the Known Genes set. 
 
 <LI><B>Step II:</B> Each homologous fragment was compared with its 
 known reference gene and a set of features was then collected. The features
 included sequence identity, Ka/Ks ratio (nonsynonymous substitution per codon vs. 
 synonymous substitution per codon), splicing sites, and the number of 
 premature stop codons. These homologous fragments are either genes or 
 pseudogenes. 
 
 <LI><B>Step III:</B> Homologous fragments that overlapped known reference 
 genes were labeled as positive samples; those overlapping known pseudogenes 
 were labeled as negative samples. 
 
 <LI><B>Step IV:</B> These positive and negative sets were used to train 
 support vector machines (SVMs) to separate coding fragments from pseudo
 fragments. The trained SVMs were used to classify all homologous fragments
 into potential coding elements or potential pseudo elements. 
 
 <LI><B>Step V:</B> Finally, a heuristic filter was used to correct some 
 misclassified fragments and to generate the final potential pseudogene set.
 </UL></P>
 <P></P>
 <HR>
 
 <H2>GIS-PET Pseudogene Predictions</H2>
 <H3>Description</H3>
 <P>
 This subtrack shows retrotransposed pseudogenes predicted by multiple mapped 
 GIS-PETs (gene identification signature-pair end ditags) collected from two 
 different cancer cell lines HCT116 and MCF7.  A total of 49 non-redundant 
 processed pseudogenes predicted in the ENCODE regions are presented in this 
 dataset. Each pseudogene is labeled with an ID of the format 
 <em>AAA-GISPgene-XX</em>, 
 where &quot;AAA&quot; indicates the parental gene name, &quot;GISPgene&quot; is the GIS pseudogene, and &quot;XX&quot; is the unique ID for each pseudogene.</P>
 
 <H3>Methods</H3>
 <P>
 PETs were generated from full-length transcripts and 
 computationally mapped onto the human genome to demarcate the transcript start 
 and end positions. The PETs that mapped to multiple genome locations were 
 grouped into PET-based gene families that include parent gene and 
 pseudogenes.  A representative member&mdash;the shortest PET as defined by 
 genomic coordinates&mdash;was selected from each family. This representative
 PET was aligned to the hg17 genome in order to identify all the 
 putative pseudogenes at the whole genome level.  All hits with an 
 identity &gt;=70% and coverage &gt;=50% within ENCODE regions were 
 reported. In this context, &quot;coverage&quot; refers to alignment coverage of 
 the query sequence, i.e. a measure of how complete the predicted pseudogene 
 is relative to the query sequence.</P>
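 <P>
 The reporting thresholds can be expressed as a short filter. This sketch 
 assumes that identity is the number of matching bases divided by the number 
 of aligned query bases, and that coverage is the number of aligned query 
 bases divided by the query length; the function and argument names are 
 hypothetical and the restriction to ENCODE regions is not modeled.</P>

```python
def passes_gis_pet_thresholds(matches: int,
                              aligned_query_bases: int,
                              query_length: int) -> bool:
    """Return True for hits with identity >= 70% and coverage >= 50%.

    Integer arithmetic is used so that hits exactly at a threshold
    are not lost to floating-point rounding.
    """
    if aligned_query_bases <= 0 or query_length <= 0:
        return False
    identity_ok = matches * 100 >= 70 * aligned_query_bases
    coverage_ok = aligned_query_bases * 100 >= 50 * query_length
    return identity_ok and coverage_ok
```

 <P>
 A 2000-base query with 1000 aligned bases, 700 of them matching, sits exactly 
 on both thresholds and would be reported.</P>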
 
 <H3>Verification of GIS-PET Pseudogene Predictions</H3>
 <P>
 Pseudogenes were verified by manual examination.</P>
 
 <H2>Credits</H2>
 <P>
 These data were generated by the ENCODE Pseudogene Annotation group:
 <A HREF="mailto:&#106;&#108;a&#49;&#64;&#115;a&#110;&#103;&#101;&#114;.
 &#97;&#99;.
 &#117;k">
 Jennifer Harrow</A>,
 <!-- above address is jla1 at sanger.ac.uk -->
 <A HREF="mailto:&#119;&#101;i&#99;&#108;&#64;&#103;&#105;&#115;.
 &#97;&#45;s&#116;&#97;&#114;.
 e&#100;&#117;.
 &#115;&#103;">
 Wei Chia-Lin</A>,
 <!-- above address is weicl at gis.a-star.edu.sg -->
 <A HREF="mailto:&#99;&#104;&#111;&#111;sw&#49;&#64;&#103;&#105;&#115;.
 a&#45;&#115;&#116;&#97;r.
 &#101;&#100;&#117;.
 &#115;g">
 Siew Woh Choo</A>,
 <!-- above address is choosw1 at gis.a-star.edu.sg -->
 <A HREF="mailto:&#97;&#102;2&#64;&#115;&#97;&#110;&#103;&#101;&#114;.
 &#97;&#99;.
 &#117;k">
 Adam Frankish</A>,
 <!-- above address is af2 at sanger.ac.uk -->
 <A HREF="mailto:&#98;&#97;&#101;&#114;t&#115;&#99;&#104;&#64;&#115;o&#101;.
 &#117;&#99;&#115;c.
 e&#100;&#117;">
 Robert Baertsch</A>,
 <!-- above address is baertsch at soe.ucsc.edu -->
 <A HREF="mailto:&#102;&#100;&#101;&#110;oe&#117;&#100;&#64;&#105;&#109;&#105;&#109;.
 e&#115;">
 France Denoeud</A>,
 <!-- above address is fdenoeud at imim.es -->
 <A HREF="mailto:&#122;&#104;&#101;ng&#100;&#121;&#64;&#99;s&#98;.
 &#121;&#97;&#108;&#101;.
 &#101;&#100;u">
 Deyou Zheng</A>,
 <!-- above address is zhengdy at csb.yale.edu -->
 <A HREF="mailto:&#121;t&#108;&#117;&#64;&#115;&#111;e.
 &#117;c&#115;c.
 &#101;&#100;&#117;">
 Yontao Lu</A>,
 <!-- above address is ytlu at soe.ucsc.edu -->
 <A HREF="mailto:&#97;&#108;&#101;&#120;&#97;&#110;&#100;&#114;&#101;.
 &#114;e&#121;m&#111;&#110;&#100;&#64;&#109;&#101;&#100;ec&#105;&#110;&#101;.
 &#117;&#110;&#105;&#103;e.
 &#99;&#104;">
 Alexandre Reymond</A>,
 <!-- above address is alexandre.reymond at medecine.unige.ch -->
 <A HREF="mailto:&#114;g&#117;&#105;g&#111;&#64;&#105;&#109;i&#109;.
 &#101;&#115;">
 Roderic Guigo Serra</A>,
 <!-- above address is rguigo at imim.es -->
 <A HREF="mailto:&#116;&#111;&#109;_gi&#110;&#103;&#101;&#114;&#97;&#115;&#64;&#97;&#102;&#102;&#121;&#109;e&#116;&#114;i&#120;.
 c&#111;&#109;">
 Tom Gingeras</A>,
 <!-- above address is tom_gingeras at affymetrix.com -->
 <A HREF="mailto:&#115;&#117;g&#97;&#110;t&#104;&#105;&#64;&#99;&#115;b.
 y&#97;&#108;&#101;.
 &#101;&#100;u">
 Suganthi Balasubramanian</A> and
 <!-- above address is suganthi at csb.yale.edu -->
 <A HREF="mailto:&#109;&#97;rk.
 &#103;&#101;&#114;&#115;&#116;&#101;&#105;&#110;&#64;&#121;&#97;&#108;&#101;.
 &#101;&#100;&#117;">
 Mark Gerstein</A>.
 <!-- above address is mark.gerstein at yale.edu -->