src/hg/htdocs/FAQ/FAQgenes.html bbabbd5d2566d47d923d51dbe350634783455999

bbabbd5d2566d47d923d51dbe350634783455999
mspeir
  Sun Oct 26 12:14:52 2025 -0700
change soe to gi, refs #35031

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index e43effd5f95..b6acdff0764 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -454,31 +454,31 @@
 <a target=_blank href="https://www.mitomap.org/MITOMAP">Mitomap.org</a>, which contains detailed
 documentation about the 
 <a href="https://www.mitomap.org/foswiki/bin/view/MITOMAP/MitoSeqs" 
 target=_blank>history of this sequence</a>. We also have a Mitomap track with
 gene annotations and variant information on both hg19 (chrMT) and hg38 (chrM).
 </p>
 
 <p>
 Why chrMT? The assembly hg19 has two mitochondrial genomes, chrM (old) and chrMT (current).
 The reason is that for hg19, no mitochondrial sequence was in the GRCh37 sequence file.
 The UCSC Genome Browser originally added a chrM sequence when making hg19 
 that was not the mitochondrial genome sequence later selected by NCBI for GRCh37. This
 is why <strong>the current hg19 version contains two mitochondrial sequences,
 the old one called &quot;chrM&quot; and the current GRCh37 reference, 
 called &quot;chrMT&quot;</strong>. The issue is described in detail in our 
-<a target=_blank href="https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/README.txt">
+<a target=_blank href="https://hgdownload.gi.ucsc.edu/goldenPath/hg19/bigZips/README.txt">
 hg19 sequence download instructions</a>. If you use hg19 today, chrMT should be
 considered the current mitochondrial sequence; chrM is only supported for backwards
 compatibility and legacy annotation files. Our hg19.fa.gz in the "current" download directory
 contains both sequences. The old hg19.fa.gz in the top-level download directory has only chrM,
 for backwards compatibility with old pipelines, and our analysisSet fasta file for aligners contains only chrMT.
 For most purposes when using hg19, we recommend using the analysis set fasta file.
 </p>
 
 <p>
 For hg38, there is no issue: it has only chrM, and all mitochondrial annotations are present on chrM.
 </p>
 
 <a name="report"></a>
 <h2>How shall I report a gene transcript in a manuscript?</h2>
 
@@ -566,50 +566,50 @@
 gene track and every selection method has a different aim. For the
 knownGene tracks (UCSC genes on hg19, Gencode on hg38 and mm10), data tables
 called &quot;knownCanonical&quot; were built at UCSC. 
 For both Gencode/Ensembl and RefSeq, the NCBI/EBI project MANE selects
 for each gene the most relevant transcript, as long as these are identical between
 Gencode and RefSeq. For NCBI RefSeq, the track RefSeqSelect also selects the most relevant
 transcript(s) for each gene and is not limited to transcripts that are identical between 
 RefSeq and Ensembl. Therefore, the following gene tracks have "best-transcripts" tracks:
 </p>
 
 <p>
 <b>UCSC Genes on hg19</b>: For hg19, the knownCanonical table is a subset of the <a target="_blank" 
 href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a> track. It was generated at UCSC by 
 identifying a canonical isoform for each cluster ID, or gene. Generally, this is the longest
 isoform. It can be downloaded directly from the <a target="_blank" 
-href="http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/">hg19 downloads database</a> 
+href="http://hgdownload.gi.ucsc.edu/goldenPath/hg19/database/">hg19 downloads database</a> 
 or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p>
 
 <p>
 <b>Gencode on hg38/mm10 - knownCanonical</b>: For hg38, the knownCanonical table is a subset of the <a target="_blank" 
 href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE v29</a> track. It was generated at UCSC. As opposed to the hg19 
 knownCanonical table, which used computationally generated gene clusters and generally chose the 
 longest isoform as the canonical isoform, the hg38 table uses ENSEMBL gene IDs to define clusters 
 (that is to say, one canonical isoform per ENSEMBL gene ID), and the method of choosing the 
 isoform is described as follows:</p>
 
 <p style="margin-left: 10em">
 <i>knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL 
 gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal 
 transcript when available. If no APPRIS tag exists for any transcript associated with the 
 cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then 
 the longest isoform is used.</i></p>
 <p>
 It can be downloaded directly from the <a target="_blank" 
-href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/">hg38 downloads database</a>
+href="http://hgdownload.gi.ucsc.edu/goldenPath/hg38/database/">hg38 downloads database</a>
 or by using the <a target="_blank" href="../cgi-bin/hgTables">Table Browser</a>.</p>
 
 <p>
 <b>NCBI RefSeq (hg19/hg38)</b>: This track collection contains three subtracks that select the 
 most relevant transcript for all or a subset of genes, with slightly different aims:
 <ul>
     <li> RefSeq Select: NCBI manually selects a few transcripts per gene, usually one,
 called "RefSeq Select", based on <a target=_blank
 href="https://www.ncbi.nlm.nih.gov/refseq/refseq_select/">a lot of criteria</a>.
 The criteria include manual curation, whether a transcript appears in LRG sequences,
 whether it is well conserved and many more. 
 Example use cases are comparative genomics and variant reporting. This subset
 is available in the RefSeq Select track under NCBI RefSeq.
 <li>MANE: RefSeq and the EBI
 also select one transcript for every protein coding gene that is annotated exactly 
@@ -757,54 +757,54 @@
 <a name="gtfDownload"></a>
 <h2>Does UCSC provide GTF/GFF files for gene models?</h2>
 <p>
 We provide files in GTF format, which is an extension to GFF2, for most assemblies. More 
 information on GTF format can be found <a target="_blank" href="FAQformat.html#format4">
 in our FAQ</a>.</p>
 <p>
 These files are generated for four gene model tables: ncbiRefSeq, refGene, ensGene, knownGene. 
 Certain assemblies, such as hg19, will have all four files while smaller assemblies may only have
 one or two of these. Which file a user should use depends on their analysis, as they contain 
 different data and metadata.</p>
 <p>
 These files are generated using the <code>genePredToGtf</code> method described in our 
 <a target="_blank" href="https://genome.ucsc.edu/FAQ/FAQdownloads.html#download37">
 downloads FAQ</a> using the <code>-utr</code> flag. They can be found on the download server 
-address <i>http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/genes/</i> where
+address <i>http://hgdownload.gi.ucsc.edu/goldenPath/$db/bigZips/genes/</i> where
 <i>$db</i> is the assembly of interest. For example, the <a target="_blank" 
-href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/">hg38 GTF files</a>.</p>
+href="http://hgdownload.gi.ucsc.edu/goldenPath/hg38/bigZips/genes/">hg38 GTF files</a>.</p>
 
 <a name="coding"></a>
 <h2>What is the best way to get only coding genes (or only non-coding genes)
 out of GENCODE (or other gene) tables?</h2>
 <h3>Coding genes</h3>
 <p>
 The best approach to get protein-coding genes out of GENCODE is to join data with a
 related attributes table, and specifically name the desired biotype(s).</p>
 <p>
 Here is an introductory example using the Public MySQL server to access the wgEncodeGencodeBasicV39
 table of all genes and the wgEncodeGencodeAttrsV39 related table to find the transcriptType for each
 entry and to select those that are annotated as protein-coding genes. There are a number of
 biotypes that can be accessed by looking at the table schema and clicking the values link for the
 <a href="http://genome.ucsc.edu/cgi-bin/hgTables?hgta_database=hg38&hgta_histoTable=wgEncodeGencodeAttrsV39&hgta_doValueHistogram=transcriptType"
 target="_blank">transcriptType</a> field. These terms are also more fully described on the GENCODE
 <a href="https://www.gencodegenes.org/pages/biotypes.html" target="_blank">biotypes page</a>.
 The example below is a simple query that selects
 all entries whose transcriptType field is &quot;protein_coding&quot;:</p>
 <p>
 <pre>
-mysql -u genome -h genome-mysql.soe.ucsc.edu hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g, wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType = "protein_coding");'
+mysql -u genome -h genome-mysql.gi.ucsc.edu hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g, wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType = "protein_coding");'
 </pre></p>
 <p>
 This query accesses the hg38 database and, from the wgEncodeGencodeBasicV39 table,
 takes the name field (g.name) and looks in the related wgEncodeGencodeAttrsV39 table for a matching
 transcriptId field (g.name = a.transcriptId). It then keeps only the entries in wgEncodeGencodeAttrsV39
 whose transcriptType equals protein_coding (a.transcriptType = &quot;protein_coding&quot;),
 thereby selecting all the entries that are annotated as protein-coding.
 Please note that this selection also returns some unusual protein-coding cases
 that one might not expect, and that may or may not be wanted,
 such as immunoglobulin and T-cell receptor components.</p>
 <p>
 For the manually curated RefSeq gene set, transcript identifiers start with NM_ for coding
 or NR_ for non-coding, followed by a number and version number separated by a dot, e.g.
 &quot;NR_046018.2&quot; for an RNA pseudogene. For RefSeq one can select coding genes by
 filtering for NM identifiers. On the concept of genes, it may be worth noting that the
@@ -830,31 +830,31 @@
 and miRNA (microRNA), hinting at the abundant types of RNA molecules.</p>
 <p>
 Since there are many different kinds of non-coding elements in GENCODE, a better approach for non-coding
 selection is to join data with a related attributes table and name the
 desired biotype or biotypes, such as only lncRNAs. There are a number of biotypes that can be
 accessed by looking at the table schema and clicking the values link for the
 <a href="http://genome.ucsc.edu/cgi-bin/hgTables?hgta_database=hg38&hgta_histoTable=wgEncodeGencodeAttrsV39&hgta_doValueHistogram=transcriptType"
 target="_blank">transcriptType</a> field. These terms are also more fully described on the GENCODE
 <a href="https://www.gencodegenes.org/pages/biotypes.html" target="_blank">biotypes page</a>.</p>
 <p>
 Here is an introductory example using the Public MySQL server to access the wgEncodeGencodeBasicV39 table
 of all genes and the wgEncodeGencodeAttrsV39 related table to find the transcriptType for each entry and
 to select just lncRNA entries.</p>
 <p>
 <pre>
-mysql -u genome -h genome-mysql.soe.ucsc.edu hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g, wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType = "lncRNA");'
+mysql -u genome -h genome-mysql.gi.ucsc.edu hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g, wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType = "lncRNA");'
 </pre></p>
 <p>
 This query accesses the hg38 database and, from the wgEncodeGencodeBasicV39 table,
 takes the name field (g.name) and looks in the related wgEncodeGencodeAttrsV39 table for a matching
 transcriptId field (g.name = a.transcriptId). It then keeps only the entries in wgEncodeGencodeAttrsV39
 whose transcriptType equals lncRNA (a.transcriptType = &quot;lncRNA&quot;). This selects all entries of that type,
 which, again, may not be the only subset desired. By modifying the above query, it is possible to add
 further qualifiers and generate a subset of different non-coding elements meeting specific research needs.</p>
 <p>
 For the manually curated RefSeq gene set, transcript identifiers start with NM_ for coding
 or NR_ for non-coding, followed by a number and version number separated by a dot, e.g.
 &quot;NR_046018.2&quot; for an RNA pseudogene. For RefSeq, one can select non-coding genes by
 filtering for NR identifiers. Note that a pseudogene of mRNA is not an unambiguous concept,
 and there may be a desire to look further to select certain subset types as mentioned above.</p>
 <p>