bba92d23fe7b8355f40307869ccc0cfe5895b3b8 max Sun Mar 2 11:32:15 2025 -0800 adding notes on chrMT and refseq dupes to faq, refs #35299 diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html index 25b2597126d..f646cf0d008 100755 --- src/hg/htdocs/FAQ/FAQgenes.html +++ src/hg/htdocs/FAQ/FAQgenes.html @@ -219,30 +219,34 @@ aligning twice (there is one exception, genes in the PAR regions, see the paragraph below). NCBI uses manual curation to decide on the best placement, for example, if a gene is annotated on chr4, any alignments, even 100% identical, from other chromosomes are removed. As a result, the UCSC RefSeq track contains duplicates if the transcripts align very well to both loci and alerts the user to this fact, whereas the NCBI alignments were filtered manually to make sure that every transcript maps only once.

NCBI's transcript mapping, which we provide in our NCBI RefSeq track, does contain a few duplicates, but these have a biological explanation: they are transcripts in the pseudoautosomal regions (PARs). Because they have identical sequences, NCBI rules assign them identical accessions. See the section below for how Ensembl/Gencode handle these cases.

If you compare NCBI's RefSeq GFF files with the Genome Browser ones, note that the NCBI files +contain non-gene annotations, without an accession, e.g. TCR or BCR locus names. We put these into the "NCBI Other" track, +so "RefSeq curated" contains only transcripts. +

Why do the Gencode/Ensembl tracks ("knownGene", "ensGene" or "wgEncodeGencodeVXX") include a few duplicates, and why do some transcripts map to two loci?

The human genome has seven genes located in the pseudoautosomal regions (PARs), which have identical sequences on both chrX and chrY. The Ensembl team originally assigned these genes identical accessions because of their identical sequences. Since Ensembl release 110 (identical to Gencode release 44), these genes receive distinct accessions. If you encounter duplicates in Ensembl/Gencode files, they likely originate from file versions that predate this update at the EBI.
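To check whether a given annotation file still carries shared PAR accessions, one option is to group transcript placements by accession and report those placed on more than one chromosome. The following is a minimal Python sketch, not part of the Genome Browser tools: the function name `find_multi_locus` and the example identifiers are hypothetical, and the input is assumed to be (accession, chromosome) pairs that were already parsed from a file.

```python
from collections import defaultdict


def find_multi_locus(placements):
    """Return {accession: sorted chromosomes} for accessions on >1 chromosome.

    `placements` is an iterable of (accession, chromosome) pairs. Parsing
    them out of a GTF or genePred file is left out of this sketch.
    """
    chroms_by_acc = defaultdict(set)
    for accession, chrom in placements:
        chroms_by_acc[accession].add(chrom)
    return {
        acc: sorted(chroms)
        for acc, chroms in chroms_by_acc.items()
        if len(chroms) > 1
    }


# Made-up identifiers, for illustration only.
example = [
    ("ENST_EXAMPLE_1.1", "chrX"),
    ("ENST_EXAMPLE_1.1", "chrY"),
    ("ENST_EXAMPLE_2.3", "chr4"),
]
print(find_multi_locus(example))
```

This only detects accessions that appear on different chromosomes, which is the PAR case described here; two placements on the same chromosome would need a comparison of coordinates as well.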

The differences

@@ -416,59 +420,69 @@ subtrack makes the problematic region very clear with double lines indicating unalignable transcript sequence.

Data format:

A small difference is the data format, which matters if you integrate our files into pipelines: The refGene table qName field stores the RefSeq accession but without the version number. The ncbiRefSeq tables show the full accession, with the version number. To add the version number to the refGene table, use a MySQL command like this:

 SELECT matches,misMatches,repMatches,nCount,qNumInsert,qBaseInsert,tNumInsert,tBaseInsert,strand,concat(qName, '.', gbSeq.version),qSize,qStart,qEnd,tName,tSize,tStart,tEnd,blockCount,blockSizes,qStarts,tStarts from refSeqAli, hgFixed.gbSeq WHERE refSeqAli.qname=gbSeq.acc

To remove the transcripts on haplotypes, add this condition at the end:

AND tName NOT LIKE '%_hap%' AND tName NOT LIKE '%_alt%' AND tName NOT LIKE '%_fix%'
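When the tables are processed outside of MySQL, the same two steps can be sketched in Python: dropping the version suffix so that versioned accessions can be compared with unversioned ones, and skipping sequences on haplotype, alt or fix scaffolds. The helper names below are hypothetical, the accession is a placeholder, and the substring checks are a simplification of the LIKE patterns above (in SQL, `_` is itself a single-character wildcard).

```python
def strip_version(accession):
    """Drop a trailing '.<digits>' version; leave other accessions unchanged."""
    base, dot, version = accession.rpartition(".")
    return base if dot and version.isdigit() else accession


def is_haplotype_or_patch(seq_name):
    """True if the sequence name contains a _hap, _alt or _fix tag."""
    return any(tag in seq_name for tag in ("_hap", "_alt", "_fix"))


# Placeholder accession and sequence names in the style used by the browser.
rows = [
    ("NM_000000.1", "chr4"),
    ("NM_000000.1", "chr6_example_hap1"),
]
kept = [
    (strip_version(acc), seq)
    for acc, seq in rows
    if not is_haplotype_or_patch(seq)
]
print(kept)
```

As with the SQL condition, this leaves other non-chromosome sequences in place; only names carrying one of the three tags are dropped.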

A word of caution on the NCBI RefSeq track on hg19: NCBI no longer fully supports hg19. As a result, some genes are no longer located on the main chromosomes. An example is NM_001129826/CSAG3. For hg19, you may prefer UCSC RefSeq for now.

What is the best gene track for mitochondrial gene annotations?

-The mitochondrial sequence included in assembly sequence files is very -special and most of what has been explained on this page does not apply +The mitochondrial sequence included in assembly sequence files is +a special case and most of what has been explained on this page does not apply to the mitochondrial gene annotations. For most assemblies in the Genome -Browser, the sequence name of the mitochondrial genome is "chrM".

+Browser, the sequence name of the mitochondrial genome is "chrM".

+ +

Both GENCODE and RefSeq databases +import their mitochondrial gene annotation directly from the rCRS +RefSeq record +NC_012920.1. +RefSeq does not assign NM_ transcript accessions for mitochondrial genes, only NP_ +protein accessions, as there is no splicing. +The mitochondrial annotation for both databases was provided by +Mitomap.org, which contains detailed +documentation about +the history of this sequence. We also have a Mitomap track with +gene annotations and variant information on both hg19 (chrMT) and hg38 (chrM). +

-For hg19, no mitochondrial genome was originally published. -The UCSC Genome Browser added a chrM sequence +Why chrMT? The assembly hg19 has two mitochondrial genomes, chrM (old) and chrMT (current). +The reason is that for hg19, no mitochondrial sequence was in the GRCh37 sequence file. +The UCSC Genome Browser originally added a chrM sequence when making hg19 that was not the mitochondrial genome sequence later selected by NCBI for GRCh37. This is why the current hg19 version contains two mitochondrial sequences, the old one called "chrM" and the current GRCh37 reference, called "chrMT". The issue is described in detail in our hg19 sequence download instructions. If you use hg19 today, chrMT should be considered the current mitochondrial sequence, chrM is only supported for backwards -compatibility and legacy annotation files. -

-For hg38, the GENCODE and ClinVar variants annotations are all present on chrM. The RefSeq Other -Annotations data is not made by the RefSeq group and annotations are on chrM. For hg19, legacy -annotations such as the UCSC Genes are on chrM. All more recent annotations, the RefSeq Other -Annotations and NCBI ClinVar variants are on chrMT. Both GENCODE and RefSeq databases -import their mitochondrial gene annotation directly from the rCRS -RefSeq record -NC_012920.1. -The annotation was provided by -Mitomap.org, which provides detailed -documentation about the -the history of this sequence. +compatibility and legacy annotation files. Our hg19.fa.gz in the "current" download directory +contains both sequences, the old hg19.fa.gz in the top level download directory has only chrM, +for backwards compatibility for old pipelines and our analysisSet fasta file for aligners contains only chrMT. +For most purposes when using hg19, we recommend using the analysis set fasta file. +

+For hg38, there is no issue, it has only chrM, and all mitochondrial annotations are present on chrM. +
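For pipelines that handle both assemblies, the naming rule described above can be kept in one small lookup. This is a minimal Python sketch under the assumption that only hg19 and hg38 matter; the names `CURRENT_MITO_NAME`, `LEGACY_MITO_NAMES`, `current_mito_name` and `is_legacy_mito` are hypothetical and not part of any UCSC tool.

```python
# Mapping taken from the text above; only hg19 and hg38 are covered.
CURRENT_MITO_NAME = {"hg19": "chrMT", "hg38": "chrM"}

# hg19 keeps the old "chrM" sequence for backwards compatibility only.
LEGACY_MITO_NAMES = {"hg19": {"chrM"}}


def current_mito_name(assembly):
    """Return the name of the current mitochondrial sequence for an assembly."""
    try:
        return CURRENT_MITO_NAME[assembly]
    except KeyError:
        raise ValueError(f"assembly not covered by this sketch: {assembly}") from None


def is_legacy_mito(assembly, seq_name):
    """True if seq_name is a mitochondrial sequence kept only for old files."""
    return seq_name in LEGACY_MITO_NAMES.get(assembly, set())


print(current_mito_name("hg19"), is_legacy_mito("hg19", "chrM"))
```

A check like `is_legacy_mito` can be used to warn when an hg19 input file still places annotations on the old chrM sequence.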

How shall I report a gene transcript in a manuscript?

When reporting on GENCODE/Ensembl transcripts, please specify the ENST identifier. It is often helpful to also specify the Ensembl release, which is shown on the details page when you click on a transcript.

When reporting RefSeq transcripts, e.g. in HGVS, prefer the "NCBI RefSeq" track over the "UCSC RefSeq" track. Please specify the RefSeq transcript ID and also the RefSeq annotation release.