a1f4839d112890e7e176e5be7535f247c0ae01ca
max
  Fri Nov 26 08:54:45 2021 -0800
documenting mysql query to get refGene with the version suffix, no redmine, email from/with Mark

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index b270231..c684d08 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -317,33 +317,32 @@
 <p class='text-center'>
   <img class='text-center' src="../images/ComprehensiveSet.png" 
 alt="Turning on comprehensive gene set" width="750">
 
 <a name="ncbiRefseq"></a>
 <h6>What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</h6>
 <p>
 RefSeq gene transcripts, unlike GENCODE/Ensembl/UCSC Genes, are sequences that can differ from 
 the genome. They need to be aligned to the genome to create annotations and UCSC
 and NCBI create alignments with different software (BLAT and splign, respectively).
 The advantages of the UCSC alignments are that
 they are updated constantly even for older assemblies, such as GRCh37/hg19.
 The advantage of NCBI alignments are that they are placed manually 
 to a chromosome location and are the official alignments, e.g. for databases and manuscripts.
 Therefore, we recommend working with the NCBI annotations and when an assembly has an &quot;NCBI RefSeq&quot; track, we show it by default and hide the
-&quot;UCSC RefSeq&quot; track.
+&quot;UCSC RefSeq&quot; track. The only exception may be hg19 (see the note at the end of this section).
 </p>
-
 <p>The UCSC alignments can differ from the NCBI alignments for two reasons:</p>
 
 <p><b>Very similar transcripts:</b> 
 Let's take the case of two almost-identical transcripts sequences in RefSeq,
 with two genes in the genome where they could be placed.
 NCBI has a rule to place every transcript only once, and transcripts
 are manually tied to a chromosome band or location by NCBI, so each gene will get one
 and only one transcript of two. NCBI RefSeq will have two genes with one transcript each.
 UCSC RefSeq though places all
 transcripts where they align at very high identity, so both genes will get
 annotated with both transcripts. For example, 
 the transcript NM_001012276 has three almost-identical possible
 placements to the genome in the UCSC RefSeq track, as it is entirely alignment-based without any manual filtering,
 but NM_001012276.3 is shown at a single location in the NCBI RefSeq track, as the NCBI
 software will only retain the alignment at the manually annotated location. It
@@ -367,31 +366,39 @@
 </p>
 
 <p>
 An anecdotal and rare example is SHANK2 and SHANK3 in hg19. It is impossible
 for either NCBI or BLAT to get the correct alignment and gene model because the genome sequence is
 missing for part of the gene.  NCBI and BLAT find slightly different exon
 boundaries at the edge of the problematic region. NCBI's aligner tries very hard
 to find exons that align to any transcript sequence,
 so it calls a few small dubious &quot;exons&quot; in the affected genomic region.
 GENCODE V19 also used an aligner that tried very hard to find exons, but it
 found small dubious &quot;exons&quot; in different places than NCBI.
 The <a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite">RefSeq Alignments</a> 
 subtrack makes the problematic region very clear with double lines
 indicating unalignable transcript sequence.
 </p>
-
+<p><b>Data format:</b> A small difference is the data format, which matters if you integrate our files into pipelines:
+the refGene table qName field stores the RefSeq accession without the version number, whereas the
+ncbiRefSeq tables show the full accession, including the version number. To add the version number
+to the refGene table, use a MySQL command like this:</p> <pre>
+SELECT matches,misMatches,repMatches,nCount,qNumInsert,qBaseInsert,tNumInsert,tBaseInsert,strand,CONCAT(qName, '.', gbSeq.version),qSize,qStart,qEnd,tName,tSize,tStart,tEnd,blockCount,blockSizes,qStarts,tStarts FROM refSeqAli, hgFixed.gbSeq WHERE refSeqAli.qName=gbSeq.acc</pre> <p>To remove the transcripts on haplotype, alternate and fix sequences, add this condition at the end:</p> <pre>AND tName NOT LIKE '%_hap%' AND tName NOT LIKE '%_alt%' AND tName NOT LIKE '%_fix%'</pre>
+
+<p>A word of caution on the NCBI RefSeq track on hg19: NCBI no longer fully supports hg19. As a result,
+some genes are no longer located on the main chromosomes. An example is NM_001129826/CSAG3.
+For hg19, you may prefer UCSC RefSeq for now.</p>
 <a name="mito"></a>
 <h2>What is the best gene track for mitochondrial gene annotations</h2>
 <p>
 The mitochondrial sequence included in assembly sequence files is very
 special and most of what has been explained on this page does not apply
 to the mitochondrial gene annotations. For most assemblies in the Genome
 Browser, the sequence name of the mitochondrial genome is "chrM".<p>
 <p>
 For hg19, no mitochondrial genome was originally published.
 The UCSC Genome Browser added a chrM sequence
 that was not the mitochondrial genome sequence later selected by NCBI for GRCh37. This
 is why <strong>the current hg19 version contains two mitochondrial sequences,
 the old one called &quot;chrM&quot; and the current GRCh37 reference, 
 called &quot;chrMT&quot;</strong>. The issue is described in detail in our 
 <a target=_blank href="https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/README.txt">