06d7be056190c14b85e71bc12523f18ea6815b5e
markd
Mon Dec 7 00:50:29 2020 -0800
BLAT mmap index support merge with master

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index dea3073..e924873 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -298,56 +298,64 @@
 A more comprehensive definition can also be found in the
 <a target=_blank href="https://uswest.ensembl.org/info/genome/genebuild/transcript_quality_tags.html#basic">
 Ensembl FAQ</a>. By default, the track displays only the "basic" set. In order to display the complete
 "comprehensive" set, the box can be ticked at the top of the
 <a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE track description page</a>.</p>
 <p class='text-center'>
 <img class='text-center' src="../images/ComprehensiveSet.png" alt="Turning on comprehensive gene set" width="750">
 <a name="ncbiRefseq"></a>
 <h6>What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</h6>
 <p>
 RefSeq gene transcripts, unlike GENCODE/Ensembl/UCSC Genes, are sequences that can differ from
-the genome. They need to be aligned to the genome to create transcript models.
-Traditionally, UCSC has aligned RefSeq with BLAT (UCSC RefSeq sub-track) and NCBI has aligned with
-splign. The advantages of the UCSC alignments are that
-they are updated more frequently and are available for older assemblies (like
-GRCh37/hg19), but they are not placed manually to a chromosome location and are not the official alignments.
-Therefore, we recommend working with the NCBI annotations.
-When an assembly has an "NCBI RefSeq" track, we show it by default and hide the
+the genome. They need to be aligned to the genome to create annotations and UCSC
+and NCBI create alignments with different software (BLAT and splign, respectively).
+The advantages of the UCSC alignments are that
+they are updated constantly even for older assemblies, like GRCh37/hg19.
+The advantages of the NCBI alignments are that they are placed manually
+to a chromosome location and are the official alignments, e.g. for databases and manuscripts.
+Therefore, we recommend working with the NCBI annotations, and when an assembly has an "NCBI RefSeq" track, we show it by default and hide the
 "UCSC RefSeq" track.
 </p>
-<p>
-NCBI transcripts are manually tied to a chromosome band or location. The advantage is that
-when there are two almost-identical transcripts in RefSeq, each one will be
-placed at the official reference location in the NCBI annotations. For example,
+<p>The UCSC alignments can differ from the NCBI alignments for two reasons:</p>
+
+<p><b>Very similar transcripts:</b>
+Let's take the case of two almost-identical transcript sequences in RefSeq,
+with two genes in the genome where they could be placed.
+NCBI has a rule to place every transcript only once, and transcripts
+are manually tied to a chromosome band or location by NCBI, so each gene will get one
+and only one of the two transcripts. NCBI RefSeq will have two genes with one transcript each.
+UCSC RefSeq though places all
+transcripts where they align at very high identity, so both genes will get
+annotated with both transcripts. For example,
 the transcript NM_001012276 has three almost-identical possible
-placements to the genome in the UCSC RefSeq track (as it is entirely alignment-based),
+placements to the genome in the UCSC RefSeq track, as it is entirely alignment-based without any manual filtering,
 but NM_001012276.3 is shown at a single location in the NCBI RefSeq track, as the NCBI
-software will only retain the splign alignment at the manually annotated location. It
+software will only retain the alignment at the manually annotated location.
 It may be good to know about almost-identical alignments when doing genomic analysis or
 manual inspection of NGS read alignments, but for clinical reporting purposes or other
 automated analyses, we strongly recommend using the NCBI RefSeq track.
 </p>
 <p>
-In some rare cases, the NCBI and UCSC exon boundaries differ.
+<b>Unclear exon boundaries:</b> In some rare cases, the NCBI and UCSC exon boundaries differ.
+This happens especially when sequence deletions in the genome make the placement very difficult.
 Activating both RefSeq and UCSC RefSeq tracks helps you investigate the differences.
 Activating the RefSeq Alignments track shows NCBI's splign alignments in more detail,
 including double lines where both transcript and genomic sequence are skipped in the alignment.
 When available, the RefSeq Diffs subtrack may be helpful too. The upcoming
 <a target=_blank href=https://ncbiinsights.ncbi.nlm.nih.gov/2018/10/11/matched-annotation-by-ncbi-and-embl-ebi-mane-a-new-joint-venture-to-define-a-set-of-representative-transcripts-for-human-protein-coding-genes/>MANE gene set</a>
 will contain a set of high-quality transcripts that are 100% alignable to the genome and are
 part of both RefSeq and Ensembl/GENCODE, but at the time of writing, this project is at an early stage.
 </p>
 <p>
 An anecdotal and rare example is SHANK2 and SHANK3 in hg19. It is impossible for either NCBI or BLAT
 to get the correct alignment and gene model because the genome sequence is missing for part of the gene.
 NCBI and BLAT find slightly different exon boundaries at the edge of the problematic region.
 NCBI's aligner tries very hard
@@ -441,34 +449,33 @@
 <h2>How can I show a single transcript per gene?</h2>
 <p>
 For the tracks "<a target=_blank href="../cgi-bin/hgTrackUi?db=hg19&g=knownGene">UCSC Genes</a>" (hg19) or
 "<a target=_blank href="../cgi-bin/hgTrackUi?db=hg38&g=knownGene">GENCODE Genes</a>" (hg38),
 click on their title and on the configuration page, uncheck the box "Show splice variants".
 Only a single transcript will be shown. The method for how this transcript is selected is described
 in the next section below and in the track documentation.
 </p>
 <p class='text-center'>
 <img class='text-center' src="../images/SpliceVariants.png" alt="Changing splice variants" width="750">
-<p>For the track <a target=_blank
-href="../cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite">
-NCBI RefSeq</a> (hg38), you can activate the subtrack "RefSeq HGMD".
-It contains only the transcripts that are part of the Human Gene Mutation Database.
+<p>
+For the various single-transcript options of "NCBI RefSeq", please
+see the discussion of "single transcript" tracks in the next section.
 </p>
 <a name="singledownload"></a>
 <h2>How can I download a file with a single transcript per gene?</h2>
 <p>
 This is a common request, but very often this is not necessary when designing an analysis.
 You will have to make a choice of this single transcript using some mechanism, and this choice
 will affect your pipeline results. It may be easier to keep all transcripts. For example,
 instead of annotating enhancers with the closest "best-transcript", you can annotate them
 with the closest exon of any transcript. When mapping variants to transcripts, you can map to
 all transcripts and show the transcript with the worst impact first. When segmenting the
 chromosomes into gene loci, you can use the union of all transcripts of a gene, adding some
 predefined distance, rather than selecting a single "best" transcript.</p>