1586330213411f03154ec815ec9addb9d4656329 max Mon Nov 9 07:43:10 2020 -0800
improving readability of NCBI versus refGene section, refs #26496
diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index dea3073..772bd17 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -298,56 +298,64 @@
A more comprehensive definition can also be found in the Ensembl FAQ. By default, the track displays only the "basic" set. In order to display the complete "comprehensive" set, the box can be ticked at the top of the GENCODE track description page.

Turning on comprehensive gene set

What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?

RefSeq gene transcripts, unlike GENCODE/Ensembl/UCSC Genes, are sequences that can differ from
-the genome. They need to be aligned to the genome to create transcript models.
-Traditionally, UCSC has aligned RefSeq with BLAT (UCSC RefSeq sub-track) and NCBI has aligned with
-splign. The advantages of the UCSC alignments are that
-they are updated more frequently and are available for older assemblies (like
-GRCh37/hg19), but they are not placed manually to a chromosome location and are not the official alignments.
-Therefore, we recommend working with the NCBI annotations.
-When an assembly has an "NCBI RefSeq" track, we show it by default and hide the
+the genome. They need to be aligned to the genome to create annotations and UCSC
+and NCBI create alignments with different software (BLAT and splign, respectively).
+The advantages of the UCSC alignments are that
+they are updated constantly even for older assemblies, like GRCh37/hg19.
+The advantage of NCBI alignments are that they are placed manually
+to a chromosome location and are the official alignments, e.g. for databases and manuscripts.
+Therefore, we recommend working with the NCBI annotations and when an assembly has an "NCBI RefSeq" track, we show it by default and hide the
"UCSC RefSeq" track.

-NCBI transcripts are manually tied to a chromosome band or location. The advantage is that
-when there are two almost-identical transcripts in RefSeq, each one will be
-placed at the official reference location in the NCBI annotations. For example,
+

The UCSC alignments can differ from the NCBI alignments for two reasons:

+ +

Very similar transcripts:
+Let's take the case of two almost-identical transcripts sequences in RefSeq,
+with two genes in the genome where they could be placed.
+NCBI has a rule to place every transcript only once, and transcripts
+are manually tied to a chromosome band or location by NCBI, so each gene will get one
+and only one transcript of two. NCBI RefSeq will have two genes with one transcript each.
+UCSC RefSeq though places all
+transcripts where they align at very high identity, so both genes will get
+annotated with both transcripts. For example,
the transcript NM_001012276 has three almost-identical possible
-placements to the genome in the UCSC RefSeq track (as it is entirely alignment-based),
+placements to the genome in the UCSC RefSeq track, as it is entirely alignment-based without any manual filtering,
but NM_001012276.3 is shown at a single location in the NCBI RefSeq track, as the NCBI
-software will only retain the splign alignment at the manually annotated location. It
+software will only retain the alignment at the manually annotated location. It
may be good to know about almost-identical alignments when doing genomic analysis or manual inspection of NGS read alignments, but for clinical reporting purposes or other automated analyses, we strongly recommend to use the NCBI RefSeq track.
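The multiple-placement case described above can be checked mechanically once both tracks are exported as rows of (transcript name, chromosome, start, end). The following is only a minimal sketch: the row layout, the function names, the coordinates and the second accession (`NM_EXAMPLE`) are illustrative assumptions, not real track data or any UCSC tool. It lists transcripts that have more placements in an alignment-based set than in a curated set.

```python
from collections import defaultdict


def placements_by_transcript(rows):
    """Map each accession (version suffix stripped) to its set of placements."""
    placements = defaultdict(set)
    for name, chrom, start, end in rows:
        accession = name.split(".")[0]  # "NM_001012276.3" -> "NM_001012276"
        placements[accession].add((chrom, start, end))
    return placements


def multi_placed(alignment_rows, curated_rows):
    """Return {accession: (n_alignment, n_curated)} for transcripts that have
    more placements in the alignment-based rows than in the curated rows."""
    aln = placements_by_transcript(alignment_rows)
    cur = placements_by_transcript(curated_rows)
    return {
        acc: (len(loci), len(cur.get(acc, ())))
        for acc, loci in aln.items()
        if len(loci) > len(cur.get(acc, ()))
    }


# Hypothetical coordinates, for illustration only (not real annotations).
ucsc_refseq = [
    ("NM_001012276", "chr1", 1000, 5000),
    ("NM_001012276", "chr1", 9000, 13000),
    ("NM_001012276", "chr1", 17000, 21000),
    ("NM_EXAMPLE", "chr2", 100, 900),
]
ncbi_refseq = [
    ("NM_001012276.3", "chr1", 1000, 5000),
    ("NM_EXAMPLE.1", "chr2", 100, 900),
]

extra = multi_placed(ucsc_refseq, ncbi_refseq)
print(extra)
```

With real data, the rows would come from the two RefSeq tracks' table exports; the comparison itself stays the same.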

-In some rare cases, the NCBI and UCSC exon boundaries differ.
+Unclear exon boundaries: In some rare cases, the NCBI and UCSC exon boundaries differ.
+This happens especially when sequence deletions in the genome make the placement very difficult.
Activating both RefSeq and UCSC RefSeq tracks helps you investigate the differences. Activating the RefSeq Alignments track shows NCBI's splign alignments in more detail, including double lines where both transcript and genomic sequence are skipped in the alignment. When available, the RefSeq Diffs subtrack may be helpful too. The upcoming MANE gene set will contain a set of high-quality transcripts that are 100% alignable to the genome and are part of both RefSeq and Ensembl/GENCODE but at the time of writing this project is at an early stage.
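Besides the visual comparison in the browser, differing exon boundaries can also be found programmatically. The sketch below assumes each transcript model is available as a list of (start, end) exon intervals on the same chromosome and strand; the function name and the coordinates are hypothetical and only illustrate the comparison.

```python
def exon_boundary_diff(exons_a, exons_b):
    """Return (exons only in a, exons only in b), each sorted by coordinate."""
    set_a, set_b = set(exons_a), set(exons_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Hypothetical exon intervals for the same transcript in the two tracks.
ncbi_exons = [(100, 200), (300, 450), (600, 700)]
ucsc_exons = [(100, 200), (300, 440), (600, 700)]

only_ncbi, only_ucsc = exon_boundary_diff(ncbi_exons, ucsc_exons)
print("only in NCBI model:", only_ncbi)
print("only in UCSC model:", only_ucsc)
```

Exons reported on either side are the ones worth inspecting in the RefSeq Alignments track.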

An anecdotal and rare example is SHANK2 and SHANK3 in hg19. It is impossible for either NCBI or BLAT to get the correct alignment and gene model because the genome sequence is missing for part of the gene. NCBI and BLAT find slightly different exon boundaries at the edge of the problematic region. NCBI's aligner tries very hard