198c9b8daecc44fbda6a6494c566c723920f030a lrnassar Wed Mar 11 18:25:21 2026 -0700
Fixing a few hundred clear typos with the help of Claude. Some are less important in code comments, but majority of them are in user-facing places. I manually approved 60%+ of the changes and didn't see any that were an incorrect suggestion, at worst it was potentially uncessesary, like a code comment having cant instead of can't. No RM.
diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index ed42cef9460..4c0f47dfda2 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -4,31 +4,31 @@
 <!-- Relative paths to support mirror sites with non-standard GB docs install -->
 <!--#include virtual="$ROOT/inc/gbPageStart.html" -->
 <h1>Frequently Asked Questions: Gene tracks</h1>
 <h2>Topics</h2>
 <ul>
 <li><a href="#gene">What is a gene?</a></li>
 <li><a href="#genestrans">What is a transcript and how is it related to a gene?</a></li>
 <li><a href="#genename">What is a gene name?</a></li>
 <li><a href="#mostCommon">What are the most common gene transcript tracks?</a></li>
 <li><a href="#wrong">I think this transcript looks strange, what shall I do?</a></li>
 <li><a href="#duplicates">Why does the UCSC RefSeq track ("refGene") include duplicates, and some transcripts map to two loci?</a></li>
-<li><a href="#duplicatesEns">Why does the Gencode/Ensembl tracks ("knownGene", "ensGene" or "wgEncodeGencodeVXX") include a few duplicates, and some transcripts map to two loci?</a></li>
+<li><a href="#duplicatesEns">Why do the Gencode/Ensembl tracks ("knownGene", "ensGene" or "wgEncodeGencodeVXX") include a few duplicates, and some transcripts map to two loci?</a></li>
 <li><a href="#ens">What are Ensembl and GENCODE and is there a difference?</a></li>
 <li><a href="#ensRefseq">What are the differences among GENCODE, Ensembl and RefSeq?</a></li>
 <li><a href="#hg19">For the human assembly hg19/GRCh37: What is the difference between "UCSC Genes" track, the "GENCODE"
 track and the "Ensembl Genes" track?</a></li>
 <li><a href="#hg38">For the human assembly hg38/GRCh38: What are the differences between the "GENCODE" and "All GENCODE" tracks?</a></li>
 <li><a href="#gencode">What is the difference between GENCODE comprehensive and basic?</a></li>
 <li><a href="#ncbiRefseq">What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</a></li>
 <li><a href="#mito">What is the best gene track for mitochondrial gene annotations?</a></li>
 <li><a href="#report">How shall I report a gene transcript in a manuscript?</a></li>
 <li><a href="#ccds">What is CCDS?</a></li>
 <li><a href="#justsingle">How can I show a single transcript per gene?</a></li>
 <li><a href="#singledownload">How can I download a file with a single transcript per gene?</a></li>
 <li><a href="#bioTypeFilter">How can I filter by bioType from GENCODE/RefSeq/Ensembl?</a></li>
 <li><a href="#whatdo">This is rather complicated. Can you tell me which gene transcript track
@@ -43,31 +43,31 @@
 <a href="index.html">Return to FAQ Table of Contents</a></p>
 <a name="gene"></a>
 <h2>The basics</h2>
 The genome browser contains many gene annotation tracks. Our users often wonder what these contain and where the information that we present comes from.
 <h6>What is a gene?</h6>
 <p>
 The exact definition of "gene" depends on the context. In the context of genome annotation, a gene has at least a name and is defined by a collection of related RNA transcript sequences ("isoforms"). The naming of genes and the assignment of the most important transcript sequences is often done manually by
-a group of biological literature curators. For human, genes names are created
+a group of biological literature curators. For human, gene names are created
 by the <a target=_blank href="https://www.genenames.org/">Human Gene Nomenclature Committee (HGNC, formerly HUGO)</a>. Non-human species have similar annotation groups, e.g. Mouse Genome Informatics, Wormbase, Flybase, etc.
 </p>
 <a name="genestrans"></a>
 <h6>What is a transcript and how is it related to a gene? </h6>
 <p>
 In the Genome Browser, transcript tracks often end with the word "Genes", e.g. "Ensembl Genes", "NCBI RefSeq Genes", or "UCSC Genes". Despite the name, items in these tracks actually represent transcripts on chromosomes of a genome assembly</p>
 <p>
 Transcripts are defined as RNA molecules that are made from a DNA template.
@@ -102,32 +102,32 @@
 nucleotides 43044295 to 43125483.</p>
 <a name="genename"></a>
 <h6>What is a gene or transcript accession? </h6>
 <p>
 Gene symbols such as BRCA1 are easy to remember but sometimes change and are not specific to an organism. Therefore most databases internally use unique identifiers to refer to sequences and some journals require authors to use these in manuscripts.</p>
 <p>
 The most common accession numbers encountered by users are either from Ensembl, GENCODE or RefSeq. Human Ensembl/GENCODE gene accession numbers start with ENSG followed by a number and version number separated by a dot, e.g.
-"ENSG00000012048.21" for latest BRCA1. Every ENSG-gene has at least
-one transcript assigned to it. The transcript identifiers start with with ENST
+"ENSG00000012048.21" for latest BRCA1. Every ENSG-gene has at least
+one transcript assigned to it. The transcript identifiers start with ENST
 and are likewise followed by a version number, e.g. "ENST00000619216.1". Additional details on Ensembl IDs can be found on the <a target="_blank" href="https://www.ensembl.org/Help/Faq?id=488">Ensembl FAQ page</a>.</p>
 <p>
 NCBI refers to genes with plain numbers, e.g. 672 for BRCA1. Manually curated RefSeq transcript identifiers start with NM_ (coding) or NR_ (non-coding), followed by a number and version number separated by a dot, e.g. "NR_046018.2". If the transcript was predicted by the NCBI Gnomon software, the prefix is XM_ but these are rare in human.
 A table of these and other RefSeq prefixes can be found on the <a target=_blank href="https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly"> NCBI website</a>.
@@ -196,62 +196,62 @@
 </tr>
 <tr>
 <td>CCDS</td>
 <td>32,506</td>
 </tr>
 </table>
 <a name="wrong"></a>
 <h6>I think this transcript looks strange, what shall I do?</h6>
 <p>The Genome Browser Group only displays transcripts provided by others. But both RefSeq and Gencode have dedicated staff that look manually at each and every transcript and they know everything there is to know about gene models. They are happy to answer your questions and they can change the transcript annotation. Submit your questions via the <a href="https://www.ncbi.nlm.nih.gov/projects/RefSeq/update.cgi" target=_blank>RefSeq contact form</a>
-or the <a href="https://www.gencodegenes.org/pages/contact.html" target=_blank>Gencode context form.</a>
+or the <a href="https://www.gencodegenes.org/pages/contact.html" target=_blank>Gencode contact form.</a>
 </p>
 <a name="duplicates"></a>
 <h6>Why does the UCSC RefSeq track ("refGene") include duplicates, and some transcripts map to two loci?</h6>
 <p>This is related to the question <a href="#ncbiRefSeq">What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</a> below. Briefly, the UCSC refGene track aligns the RefSeq transcripts to the genome with BLAT, with no special filtering but a 95% identity, the NCBI RefSeq track is NCBI's mapping and the NCBI alignments were filtered using manual annotations to make sure that a transcript is mapped only once, even if it is perfectly aligning twice (there is one exception, genes in the PAR regions, see the paragraph below). NCBI uses manual curation to decide on the best placement, for example, if a gene is annotated on chr4, any alignments, even 100% identical, from other chromosomes are removed.
 As a result, the UCSC RefSeq track contains duplicates if the transcripts align very well to both loci and alerts the user to this fact, where as the NCBI alignments were filtered manually to make sure that every transcript maps only once.
 </p>
 <p>
 NCBI's transcript mapping, which we provide in our NCBI RefSeq track, does contain a few duplicates, but these have a biological explanation: they are transcripts in the <a target=_blank href='https://en.wikipedia.org/wiki/Pseudoautosomal_region'>pseudoautosomal regions</a> (PARs). Because they have identical sequences, NCBI rules assign them identical accessions. See the section below for how Ensembl/Gencode handle these cases.
 </p>
 <p>If you compare NCBI's RefSeq GFF files with the Genome Browser ones, note that the NCBI files contain non-gene annotations, without an accession, e.g. TCR or BCR locus names. We put these into the "NCBI Other" track, so "RefSeq curated" contains only transcripts.
 </p>
 <a name="duplicatesEns"></a>
-<h6>Why does the Gencode/Ensembl tracks ("knownGene", "ensGene" or "wgEncodeGencodeVXX") include a few duplicates, and some transcripts map to two loci?</h6>
+<h6>Why do the Gencode/Ensembl tracks ("knownGene", "ensGene" or "wgEncodeGencodeVXX") include a few duplicates, and some transcripts map to two loci?</h6>
 <p>The human genome has seven genes located in the <a target=_blank href='https://en.wikipedia.org/wiki/Pseudoautosomal_region'>pseudoautosomal regions</a> (PARs), which have identical sequences on both chrX and chrY. The Ensembl team assigned these genes identical accessions due to their identical sequences. Since Ensembl release 110 (identical to Gencode release 44), these genes now receive distinct accessions. If you encounter duplicates in Ensembl/Gencode files, they likely originate from file versions predating this update at the EBI.
 </p>
 <a name="ens"></a>
 <h2>The differences</h2>
 <p>
 Some of our gene tracks look similar and contain very similar information which can be confusing.
@@ -423,31 +423,31 @@
 The <a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite">RefSeq Alignments</a> subtrack makes the problematic region very clear with double lines indicating unalignable transcript sequence.
 </p>
 <p>
 <b>Data format:</b> A small difference is the data format, which matters if you integrate our files into pipelines: The refGene table qName field stores the RefSeq accession but without the version number. The ncbiRefSeq tables show the full accession, with the version number. To add the version number to the refGene table, use a MySQL command like this:
 <pre> SELECT matches,misMatches,repMatches,nCount,qNumInsert,qBaseInsert,tNumInsert,tBaseInsert,strand,concat(qName, '.', gbSeq.version),qSize,qStart,qEnd,tName,tSize,tStart,tEnd,blockCount,blockSizes,qStarts,tStarts from refSeqAli, hgFixed.gbSeq WHERE refSeqAli.qname=gbSeq.acc</pre>
 <p>To remove the transcripts on haplotypes, add this condition at the end:</p>
 <pre>and tName NOT LIKE '%_hap%' AND tName not like '%_alt%' AND tNAME NOT LIKE '%_fix%'</pre>
 <p>A word of caution on the NCBI RefSeq track on hg19: NCBI is not fully supporting hg19 anymore. As a result,
-some genes are not located on the main chromosomes in anymore. An example is NM_001129826/CSAG3.
+some genes are not located on the main chromosomes anymore. An example is NM_001129826/CSAG3.
 For hg19, you may prefer UCSC RefSeq for now.</p>
 <a name="mito"></a>
 <h2>What is the best gene track for mitochondrial gene annotations</h2>
 <p>
 The mitochondrial sequence included in assembly sequence files is a special case and most of what has been explained on this page does not apply to the mitochondrial gene annotations.
 For most assemblies in the Genome Browser, the sequence name of the mitochondrial genome is "chrM".</p>
 <p>Both GENCODE and RefSeq databases import their mitochondrial gene annotation directly from the rCRS RefSeq record <a target=_blank href="https://www.ncbi.nlm.nih.gov/nuccore/251831106">NC_012920.1</a>. RefSeq does not assign NM_ transcript accessions for mitochondrial genes, only NP_ protein accessions, as there is no splicing.
@@ -544,31 +544,31 @@
 <p>
 For the various single-transcript options of "NCBI RefSeq", please see the discussion of "single transcript" tracks in the next section.
 </p>
 <a name="singledownload"></a>
 <h2>How can I download a file with a single transcript per gene?</h2>
 <p>
 This is a common request, but very often this is not necessary when designing an analysis. You will have to make a choice of this single transcript using some mechanism, and this choice will affect your pipeline results. It may be easier to keep all transcripts. For example, instead of annotating enhancers with the closest "best-transcript", you can annotate them with the closest exon of any transcript. When mapping variants to transcripts, you can map to all
-transcripts and and show the transcript with the worst impact first. When
+transcripts and show the transcript with the worst impact first. When
 segmenting the chromosomes into gene loci, you can use the union of all transcripts of a gene, adding some predefined distance, rather than selecting a single "best" transcript.</p>
 <p>
 That being said, the main gene tracks have tables that try to show the "best" transcript per gene. There are many choices, depending on the assembly and the gene track and every selection method has a different aim. For the knownGene tracks (UCSC genes on hg19, Gencode on hg38 and mm10), data tables called "knownCanonical" were built at UCSC.
For both Gencode/Ensembl and RefSeq, the NCBI/EBI project MANE selects for each gene the most relevant transcript, as long as these are identical between Gencode and RefSeq. For NCBI RefSeq, the track RefSeqSelect also selects the most relevant transcript(s) for each gene and is not limited to transcripts that are identical between RefSeq and Ensembl. Therefore, the following gene tracks have "best-transcripts" tracks: