a1f4839d112890e7e176e5be7535f247c0ae01ca
max
Fri Nov 26 08:54:45 2021 -0800
documenting mysql query to get refGene with the version suffix, no redmine, email from/with Mark
diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index b270231..c684d08 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -317,33 +317,32 @@
 <p class='text-center'>
 <img class='text-center' src="../images/ComprehensiveSet.png" alt="Turning on comprehensive gene set" width="750">
 <a name="ncbiRefseq"></a>
 <h6>What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</h6>
 <p>
 RefSeq gene transcripts, unlike GENCODE/Ensembl/UCSC Genes, are sequences that can differ from the
 genome. They need to be aligned to the genome to create annotations and UCSC and NCBI create
 alignments with different software (BLAT and splign, respectively). The advantages of the UCSC
 alignments are that they are updated constantly even for older assemblies, such as GRCh37/hg19.
 The advantage of NCBI alignments are that they are placed manually to a chromosome location and
 are the official alignments, e.g. for databases and manuscripts. Therefore, we recommend working
 with the NCBI annotations and when an assembly has an "NCBI RefSeq" track, we show it by default
 and hide the
-"UCSC RefSeq" track.
+"UCSC RefSeq" track. The only exception may be hg19 (see the note at the end of this section).
 </p>
-
 <p>The UCSC alignments can differ from the NCBI alignments for two reasons:</p>
 <p><b>Very similar transcripts:</b> Let's take the case of two almost-identical transcripts
 sequences in RefSeq, with two genes in the genome where they could be placed. NCBI has a rule to
 place every transcript only once, and transcripts are manually tied to a chromosome band or
 location by NCBI, so each gene will get one and only one transcript of two. NCBI RefSeq will have
 two genes with one transcript each.
 UCSC RefSeq though places all transcripts where they align at very high identity, so both genes
 will get annotated with both transcripts. For example, the transcript NM_001012276 has three
 almost-identical possible placements to the genome in the UCSC RefSeq track, as it is entirely
 alignment-based without any manual filtering, but NM_001012276.3 is shown at a single location in
 the NCBI RefSeq track, as the NCBI software will only retain the alignment at the manually
 annotated location. It
@@ -367,31 +366,39 @@
 </p>
 <p>
 An anecdotal and rare example is SHANK2 and SHANK3 in hg19. It is impossible for either NCBI or
 BLAT to get the correct alignment and gene model because the genome sequence is missing for part
 of the gene. NCBI and BLAT find slightly different exon boundaries at the edge of the problematic
 region. NCBI's aligner tries very hard to find exons that align to any transcript sequence, so it
 calls a few small dubious "exons" in the affected genomic region. GENCODE V19 also used an aligner
 that tried very hard to find exons, but it found small dubious "exons" in different places than
 NCBI. The
 <a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite">RefSeq Alignments</a>
 subtrack makes the problematic region very clear with double lines indicating unalignable
 transcript sequence.
 </p>
-
+<b>Data format:</b> <p>A small difference is the data format, which matters if you integrate our files into pipelines:
+The refGene table qName field stores the RefSeq accession but without the version number. The
+ncbiRefSeq tables show the full accession, with the version number. To add the version number
+to the refGene table, use a MySQL command like this: <pre>
+SELECT matches,misMatches,repMatches,nCount,qNumInsert,qBaseInsert,tNumInsert,tBaseInsert,strand,concat(qName, '.', gbSeq.version),qSize,qStart,qEnd,tName,tSize,tStart,tEnd,blockCount,blockSizes,qStarts,tStarts from refSeqAli, hgFixed.gbSeq WHERE refSeqAli.qname=gbSeq.acc</pre>.
To remove the transcripts on haplotypes, add this condition at the end: <pre>and tName NOT LIKE '%_hap%' AND tName not like '%_alt%' AND tNAME NOT LIKE '%_fix%'</pre>.
+
+<p>A word of caution on the NCBI RefSeq track on hg19: NCBI is not fully supporting hg19 anymore. As a result,
+some genes are not located on the main chromosomes in anymore. An example is NM_001129826/CSAG3.
+For hg19, you may prefer UCSC RefSeq for now.</p>
 <a name="mito"></a>
 <h2>What is the best gene track for mitochondrial gene annotations</h2>
 <p>
 The mitochondrial sequence included in assembly sequence files is very special and most of what
 has been explained on this page does not apply to the mitochondrial gene annotations. For most
 assemblies in the Genome Browser, the sequence name of the mitochondrial genome is "chrM".<p>
 <p>
 For hg19, no mitochondrial genome was originally published. The UCSC Genome Browser added a chrM
 sequence that was not the mitochondrial genome sequence later selected by NCBI for GRCh37. This
 is why <strong>the current hg19 version contains two mitochondrial sequences, the old one called
 "chrM" and the current GRCh37 reference, called "chrMT"</strong>. The issue is described in
 detail in our
 <a target=_blank href="https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/README.txt">