src/hg/htdocs/FAQ/FAQgenes.html 198c9b8daecc44fbda6a6494c566c723920f030a

198c9b8daecc44fbda6a6494c566c723920f030a
lrnassar
  Wed Mar 11 18:25:21 2026 -0700
Fixing a few hundred clear typos with the help of Claude. Some are less important in code comments, but majority of them are in user-facing places. I manually approved 60%+ of the changes and didn't see any that were an incorrect suggestion, at worst it was potentially unnecessary, like a code comment having cant instead of can't. No RM.

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index ed42cef9460..4c0f47dfda2 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -4,31 +4,31 @@
 
 <!-- Relative paths to support mirror sites with non-standard GB docs install -->
 <!--#include virtual="$ROOT/inc/gbPageStart.html" -->
 
 <h1>Frequently Asked Questions: Gene tracks</h1>
 
 <h2>Topics</h2>
 
 <ul>
 <li><a href="#gene">What is a gene?</a></li>
 <li><a href="#genestrans">What is a transcript and how is it related to a gene?</a></li>
 <li><a href="#genename">What is a gene name?</a></li>
 <li><a href="#mostCommon">What are the most common gene transcript tracks?</a></li>
 <li><a href="#wrong">I think this transcript looks strange, what shall I do?</a></li>
 <li><a href="#duplicates">Why does the UCSC RefSeq track ("refGene") include duplicates, and some transcripts map to two loci?</a></li>
-<li><a href="#duplicatesEns">Why does the Gencode/Ensembl tracks ("knownGene", "ensGene" or "wgEncodeGencodeVXX") include a few duplicates, and some transcripts map to two loci?</a></li>
+<li><a href="#duplicatesEns">Why do the Gencode/Ensembl tracks ("knownGene", "ensGene" or "wgEncodeGencodeVXX") include a few duplicates, and some transcripts map to two loci?</a></li>
 <li><a href="#ens">What are Ensembl and GENCODE and is there a difference?</a></li>
 <li><a href="#ensRefseq">What are the differences among GENCODE, Ensembl and RefSeq?</a></li>
 <li><a href="#hg19">For the human assembly hg19/GRCh37: What is the difference between "UCSC 
                     Genes" track, the "GENCODE" track and the "Ensembl Genes" track?</a></li>
 <li><a href="#hg38">For the human assembly hg38/GRCh38: What are the differences between the 
 		    "GENCODE" and "All GENCODE" tracks?</a></li>
 <li><a href="#gencode">What is the difference between GENCODE comprehensive and basic?</a></li>
 <li><a href="#ncbiRefseq">What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</a></li>
 <li><a href="#mito">What is the best gene track for mitochondrial gene annotations?</a></li>
 <li><a href="#report">How shall I report a gene transcript in a manuscript?</a></li>
 <li><a href="#ccds">What is CCDS?</a></li>
 <li><a href="#justsingle">How can I show a single transcript per gene?</a></li>
 <li><a href="#singledownload">How can I download a file with a single transcript per gene?</a></li>
 <li><a href="#bioTypeFilter">How can I filter by bioType from GENCODE/RefSeq/Ensembl?</a></li>
 <li><a href="#whatdo">This is rather complicated. Can you tell me which gene transcript track
@@ -43,31 +43,31 @@
 <a href="index.html">Return to FAQ Table of Contents</a></p>
 
 <a name="gene"></a>
 <h2>The basics</h2>
 
 The genome browser contains many gene annotation tracks. Our users 
 often wonder what these contain and where the information that we present comes
 from.
 
 <h6>What is a gene?</h6>
 <p>
 The exact definition of &quot;gene&quot; depends on the context. In the context of 
 genome annotation, a gene has at least a name and is defined by a collection of
 related RNA transcript sequences (&quot;isoforms&quot;). The naming of genes and the
 assignment of the most important transcript sequences is often done manually by
-a group of biological literature curators.  For human, genes names are created
+a group of biological literature curators.  For human, gene names are created
 by the <a target=_blank href="https://www.genenames.org/">Human Gene
 Nomenclature Committee (HGNC, formerly HUGO)</a>.  Non-human species have
 similar annotation groups, e.g. Mouse Genome Informatics, Wormbase, Flybase,
 etc.
 </p>
 
 <a name="genestrans"></a>
 <h6>What is a transcript and how is it related to a gene? </h6>
 <p>
 In the Genome Browser, transcript tracks often end with the word
 &quot;Genes&quot;, e.g. &quot;Ensembl Genes&quot;, &quot;NCBI RefSeq Genes&quot;, or &quot;UCSC 
 Genes&quot;. Despite the name, items in these tracks actually represent
 transcripts on chromosomes of a genome assembly.</p>
 <p>
 Transcripts are defined as RNA molecules that are made from a DNA template.
@@ -102,32 +102,32 @@
 nucleotides 43044295 to 43125483.</p>
 
 <a name="genename"></a>
 <h6>What is a gene or transcript accession? </h6>
 
 <p>
 Gene symbols such as BRCA1 are easy to remember but sometimes change and are not
 specific to an organism.  Therefore most databases internally use unique
 identifiers to refer to sequences and some journals require authors to use
 these in manuscripts.</p>
 
 <p>
 The most common accession numbers encountered by users are either from Ensembl,
 GENCODE or RefSeq.  Human Ensembl/GENCODE gene accession numbers start with
 ENSG followed by a number and version number separated by a dot, e.g. 
-&quot;ENSG00000012048.21&quot for latest BRCA1.  Every ENSG-gene has at least
-one transcript assigned to it. The transcript identifiers start with with ENST
+&quot;ENSG00000012048.21&quot; for latest BRCA1.  Every ENSG-gene has at least
+one transcript assigned to it. The transcript identifiers start with ENST
 and are likewise followed by a version number, e.g. 
 &quot;ENST00000619216.1&quot;. Additional details on Ensembl IDs can be found
 on the <a target="_blank" 
 href="https://www.ensembl.org/Help/Faq?id=488">Ensembl FAQ page</a>.</p>
 
 <p>
 NCBI refers to genes
 with plain numbers, e.g.  672 for BRCA1. Manually curated RefSeq transcript
 identifiers start with NM_ (coding) or NR_ (non-coding), followed by a number and version
 number separated by a dot, e.g. &quot;NR_046018.2&quot;.  If the transcript was
 predicted by the NCBI Gnomon software, the prefix is XM_ but these are rare in human.
 A table of these and other RefSeq prefixes can be
 found on the <a target=_blank
 href="https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly">
 NCBI website</a>.
@@ -196,62 +196,62 @@
     </tr> 
     <tr> 
       <td>CCDS</td>
       <td>32,506</td> 
     </tr> 
   </table>
 
 <a name="wrong"></a>
 <h6>I think this transcript looks strange, what shall I do?</h6>
 
 <p>The Genome Browser Group only displays transcripts provided by others. 
 But both RefSeq and Gencode have dedicated staff that look manually at each and every transcript and they 
 know everything there is to know about gene models.
 They are happy to answer your questions and they can change the transcript annotation. Submit your questions
 via the <a href="https://www.ncbi.nlm.nih.gov/projects/RefSeq/update.cgi" target=_blank>RefSeq contact form</a>
-or the <a href="https://www.gencodegenes.org/pages/contact.html" target=_blank>Gencode context form.</a>
+or the <a href="https://www.gencodegenes.org/pages/contact.html" target=_blank>Gencode contact form.</a>
 </p>
 
 <a name="duplicates"></a>
 <h6>Why does the UCSC RefSeq track ("refGene") include duplicates, and some transcripts map to two loci?</h6>
 
 <p>This is related to the question <a href="#ncbiRefseq">What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</a>
 below. Briefly, the UCSC refGene track aligns the RefSeq transcripts to the genome with BLAT, with no special filtering beyond a
 95% identity threshold. The NCBI RefSeq track is NCBI's mapping, and the NCBI alignments were filtered using manual annotations
 to make sure that a transcript is mapped only once, even if it is perfectly
 aligning twice (there is one exception, genes in the PAR regions, see the
 paragraph below). NCBI uses manual curation to decide on the best placement,
 for example, if a gene is annotated on chr4, any alignments, even 100%
 identical, from other chromosomes are removed. As a result, the UCSC RefSeq
 track contains duplicates if the transcripts align very well to both loci and
 alerts the user to this fact, whereas the NCBI alignments were filtered
 manually to make sure that every transcript maps only once.
 </p>
 <p>
 NCBI's transcript mapping, which we provide in our NCBI RefSeq track, does
 contain a few duplicates, but these have a biological explanation: they are
 transcripts in the <a target=_blank href='https://en.wikipedia.org/wiki/Pseudoautosomal_region'>pseudoautosomal regions</a>
 (PARs). Because they have identical sequences, NCBI rules assign them identical
 accessions. See the section below for how Ensembl/Gencode handle these cases.
 </p>
 <p>If you compare NCBI's RefSeq GFF files with the Genome Browser ones, note that the NCBI files
 contain non-gene annotations, without an accession, e.g. TCR or BCR locus names. We put these into the "NCBI Other" track,
 so "RefSeq curated" contains only transcripts.
 </p>
 
 <a name="duplicatesEns"></a>
-<h6>Why does the Gencode/Ensembl tracks ("knownGene", "ensGene" or "wgEncodeGencodeVXX") include a few duplicates, and some transcripts map to two loci?</h6>
+<h6>Why do the Gencode/Ensembl tracks ("knownGene", "ensGene" or "wgEncodeGencodeVXX") include a few duplicates, and some transcripts map to two loci?</h6>
 
 <p>The human genome has seven genes located in the <a target=_blank
 href='https://en.wikipedia.org/wiki/Pseudoautosomal_region'>pseudoautosomal regions</a> (PARs),
 which have identical sequences on both chrX and chrY. The Ensembl team previously assigned these genes
 identical accessions due to their identical sequences. Since Ensembl release 110 (identical to
 Gencode release 44), these genes now receive distinct accessions. If you encounter duplicates in
 Ensembl/Gencode files, they likely originate from file versions predating this update at the EBI.
 </p>
 
 
 <a name="ens"></a>
 <h2>The differences</h2>
 
 <p>
 Some of our gene tracks look similar and contain very similar information which can be confusing.
@@ -423,31 +423,31 @@
 The <a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite">RefSeq Alignments</a> 
 subtrack makes the problematic region very clear with double lines
 indicating unalignable transcript sequence.
 </p>
 <p>
 <b>Data format:</b>
 A small difference is the data format, which matters if you integrate our files into pipelines:
 The refGene table qName field stores the RefSeq accession but without the version number. The
 ncbiRefSeq tables show the full accession, with the version number. To add the version number 
 to the refGene table, use a MySQL command like this: <pre>
 SELECT matches,misMatches,repMatches,nCount,qNumInsert,qBaseInsert,tNumInsert,tBaseInsert,strand,concat(qName, '.', gbSeq.version),qSize,qStart,qEnd,tName,tSize,tStart,tEnd,blockCount,blockSizes,qStarts,tStarts from refSeqAli, hgFixed.gbSeq WHERE refSeqAli.qname=gbSeq.acc</pre>
 <p>To remove the transcripts on haplotypes, add this condition at the end:</p>
 <pre>AND tName NOT LIKE '%_hap%' AND tName NOT LIKE '%_alt%' AND tName NOT LIKE '%_fix%'</pre>
 
 <p>A word of caution on the NCBI RefSeq track on hg19: NCBI is not fully supporting hg19 anymore. As a result, 
-some genes are not located on the main chromosomes in anymore. An example is NM_001129826/CSAG3.
+some genes are not located on the main chromosomes anymore. An example is NM_001129826/CSAG3.
 For hg19, you may prefer UCSC RefSeq for now.</p>
 <a name="mito"></a>
 <h2>What is the best gene track for mitochondrial gene annotations?</h2>
 <p>
 The mitochondrial sequence included in assembly sequence files is 
 a special case and most of what has been explained on this page does not apply
 to the mitochondrial gene annotations. For most assemblies in the Genome
 Browser, the sequence name of the mitochondrial genome is "chrM".</p>
 
 <p>Both GENCODE and RefSeq databases
 import their mitochondrial gene annotation directly from the rCRS 
 RefSeq record 
 <a target=_blank href="https://www.ncbi.nlm.nih.gov/nuccore/251831106">NC_012920.1</a>. 
 RefSeq does not assign NM_ transcript accessions for mitochondrial genes, only NP_
 protein accessions, as there is no splicing.
@@ -544,31 +544,31 @@
 
 <p>
 For the various single-transcript options of &quot;NCBI RefSeq&quot;, please
 see the discussion of "single transcript" tracks in the next section. 
 </p>
 
 <a name="singledownload"></a>
 <h2>How can I download a file with a single transcript per gene?</h2>
 <p>
 This is a common request, but very often this is not necessary when designing
 an analysis.  You will have to make a choice of this single transcript using
 some mechanism, and this choice will affect your pipeline results. It may be
 easier to keep all transcripts. For example, instead of annotating enhancers
 with the closest &quot;best-transcript&quot;, you can annotate them with the closest exon
 of any transcript. When mapping variants to transcripts, you can map to all
-transcripts and and show the transcript with the worst impact first.  When
+transcripts and show the transcript with the worst impact first.  When
 segmenting the chromosomes into gene loci, you can use the union of all
 transcripts of a gene, adding some predefined distance, rather than selecting a
 single "best" transcript.</p>
 
 <p>
 That being said, the main gene tracks have tables that try to show the "best"
 transcript per gene. There are many choices, depending on the assembly and the 
 gene track and every selection method has a different aim. For the
 knownGene tracks (UCSC genes on hg19, Gencode on hg38 and mm10), data tables
 called &quot;knownCanonical&quot; were built at UCSC. 
 For both Gencode/Ensembl and RefSeq, the NCBI/EBI project MANE selects
 for each gene the most relevant transcript, as long as these are identical between
 Gencode and RefSeq. For NCBI RefSeq, the track RefSeqSelect also selects the most relevant
 transcript(s) for each gene and is not limited to transcripts that are identical between 
 RefSeq and Ensembl. Therefore, the following gene tracks have "best-transcripts" tracks: