c71281301ba035188bb402d91d6940aad6b0c12a gperez2 Mon May 2 15:21:44 2022 -0700 Updating the html for the non-human ncbiRefSeq tracks, refs #29127 diff --git src/hg/makeDb/trackDb/human/refSeqComposite.html src/hg/makeDb/trackDb/human/refSeqComposite.html new file mode 100644 index 0000000..bd71253 --- /dev/null +++ src/hg/makeDb/trackDb/human/refSeqComposite.html @@ -0,0 +1,295 @@ +<h2>Description</h2> +<p> +The NCBI RefSeq Genes composite track shows $organism protein-coding and non-protein-coding +genes taken from the NCBI RNA reference sequences collection (RefSeq). All subtracks use +coordinates provided by RefSeq, except for the <em>UCSC RefSeq</em> track, which UCSC produces by +realigning the RefSeq RNAs to the genome. This realignment may result in occasional differences +between the annotation coordinates provided by UCSC and NCBI. For RNA-seq analysis, we advise +using NCBI-aligned tables like RefSeq All or RefSeq Curated. See the +<a href="#methods">Methods</a> section for more details about how the different tracks were +created. </p> +<p> +Please visit NCBI's <a href="https://www.ncbi.nlm.nih.gov/projects/RefSeq/update.cgi" +target="_blank">Feedback for Gene and Reference Sequences (RefSeq)</a> page to make suggestions, +submit additions and corrections, or ask for help concerning RefSeq records. </p> + +<p> +For more information on the different gene tracks, see our <a target=_blank +href="/FAQ/FAQgenes.html">Genes FAQ</a>.</p> + +<h2>Display Conventions and Configuration</h2> +<p> +This track is a composite track that contains differing data sets. +To show only a selected set of subtracks, uncheck the boxes next to the tracks that you wish to +hide. <b>Note:</b> Not all subtracks are available on all assemblies. 
</p> + +The possible subtracks include: +<dl> + <dt><em><strong>RefSeq aligned annotations and UCSC alignment of RefSeq annotations + </strong></em></dt> + <ul> + <li> + <em>RefSeq All</em> – all curated and predicted annotations provided by + RefSeq.</li> + <li> + <em>RefSeq Curated</em> – subset of <em>RefSeq All</em> that includes only those + annotations whose accessions begin with NM, NR, NP or YP. <small>(NP and YP are used only for + protein-coding genes on the mitochondrion; YP is used for human only.)</small></li> + <li> + <em>RefSeq Predicted</em> – subset of RefSeq All that includes those annotations whose + accessions begin with XM or XR.</li> + <li> + <em>RefSeq Other</em> – all other annotations produced by the RefSeq group that + do not fit the requirements for inclusion in the <em>RefSeq Curated</em> or the + <em>RefSeq Predicted</em> tracks.</li> + <li> + <em>RefSeq Alignments</em> – alignments of RefSeq RNAs to the $organism genome provided + by the RefSeq group, following the display conventions for +<a href="../goldenPath/help/hgTracksHelp.html#PSLDisplay" target="_blank">PSL tracks</a>.</li> + <li> + <em>RefSeq Diffs</em> – alignment differences between the $organism reference genome(s) + and RefSeq transcripts. <small>(Track not currently available for every assembly.)</small> + </li> + <li> + <em>UCSC RefSeq</em> – annotations generated from UCSC's realignment of RNAs with NM + and NR accessions to the $organism genome. This track was previously known as the "RefSeq + Genes" track.</li> + <li> + <em>RefSeq Select+MANE (subset)</em> – Subset of RefSeq Curated, transcripts marked as + RefSeq Select or MANE Select. + A single <em>Select</em> transcript is chosen as representative for each protein-coding gene. + This track includes transcripts categorized as MANE, which are further agreed upon as + representative by both NCBI RefSeq and Ensembl/GENCODE, and have a 100% identical match + to a transcript in the Ensembl annotation. 
See <a target="_blank" + href="https://www.ncbi.nlm.nih.gov/refseq/refseq_select/">NCBI RefSeq Select</a>. + Note that we provide a separate track, <a + target=_blank href="hgTrackUi?g=mane&db=hg38&c=chr22">MANE (hg38)</a>, + which contains only the MANE transcripts. + </li> + <li> + <em>RefSeq HGMD (subset)</em> – Subset of RefSeq Curated, transcripts annotated by the Human + Gene Mutation Database. This track is only available on the human genomes hg19 and hg38. + It is the most restricted RefSeq subset, targeting clinical diagnostics. + </li> + </ul> +</dl> + +<p> +The <em>RefSeq All</em>, <em>RefSeq Curated</em>, <em>RefSeq Predicted</em>, <em>RefSeq HGMD</em>, +<em>RefSeq Select+MANE</em> and <em>UCSC RefSeq</em> tracks follow the display conventions for +<a href="../goldenPath/help/hgTracksHelp.html#GeneDisplay" +target="_blank">gene prediction tracks</a>. +The color shading indicates the level of review the RefSeq record has undergone: +predicted (light), provisional (medium), or reviewed (dark), as defined by <a target=_blank href="https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_status_codes/?report=objectonly">RefSeq</a>. </p> + +<p> +<table> + <thead> + <tr> + <th style="border-bottom: 2px solid #6678B1;">Color</th> + <th style="border-bottom: 2px solid #6678B1;">Level of review</th> + </tr> + </thead> + <tr> + <th bgcolor="#0C0C78"></th> + <th align="left">Reviewed: the RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes assessing available sequence data and the literature. Some RefSeq records may incorporate expanded sequence and annotation information.</th> + </tr> + <tr> + <th bgcolor="#5050A0"></th> + <th align="left">Provisional: the RefSeq record has not yet been subject to individual review. 
The initial sequence-to-gene association has been established by outside collaborators or NCBI staff.</th> + </tr> + <tr> + <th bgcolor="#8282D2"></th> + <th align="left">Predicted: the RefSeq record has not yet been subject to individual review, and some aspect of the RefSeq record is predicted.</th> + </tr> +</table> +</p> + +<p> +The item labels and codon display properties for features within this track can be configured +through the check-box controls at the top of the track description page. To adjust the settings +for an individual subtrack, click the wrench icon next to the track name in the subtrack list.</p> +<ul> + <li> + <strong>Label:</strong> By default, items are labeled by gene name. Click the appropriate Label + option to display the accession name or OMIM identifier instead of the gene name, show all or a + subset of these labels including the gene name, OMIM identifier and accession names, or turn off + the label completely.</li> + <li> + <strong>Codon coloring:</strong> This track has an optional codon coloring feature that + allows users to quickly validate and compare gene predictions. To display codon colors, select the + <em>genomic codons</em> option from the <em>Color track by codons</em> pull-down menu. For more + information about this feature, go to the <a href="../goldenPath/help/hgCodonColoring.html" + target="_blank">Coloring Gene Predictions and Annotations by Codon</a> page.</li> +</ul> + +<p>The <em>RefSeq Diffs</em> track contains five different types of inconsistency between the +reference genome sequence and the RefSeq transcript sequences. The five types of differences are +as follows: +<ul> + <li> + <em>mismatch</em> – aligned but mismatching bases, plus HGVS g. + to show the genomic change required to match the transcript and HGVS c./n. 
+ to show the transcript change required to match the genome.</li> + <li> + <em>short gap</em> – genomic gaps that are too small to be introns (arbitrary cutoff of + &lt; 45 bp), most likely insertion/deletion variants or errors, with HGVS g. and c./n. + showing differences.</li> + <li> + <em>shift gap</em> – shortGap items whose placement could be shifted left and/or right on + the genome due to repetitive sequence, with HGVS c./n. position range of ambiguous region + in transcript. Here, thin and thick lines are used: the thin line shows the span of the + repetitive sequence, and the thick line shows the rightmost shifted gap. + </li> + <li> + <em>double gap</em> – genomic gaps that are long enough to be introns but that skip over + transcript sequence (invisible in default setting), with HGVS c./n. deletion.</li> + <li> + <em>skipped</em> – sequence at the beginning or end of a transcript that is not aligned to + the genome + (invisible in default setting), with HGVS c./n. deletion.</li> + +</ul> + +<small><b>HGVS Terminology </b>(Human Genome Variation Society): + +g. = genomic sequence; c. = coding DNA sequence; n. = non-coding RNA reference sequence.</small> +</p> + +<p> +When reporting HGVS with RefSeq sequences, to make sure that results from +research articles can be mapped to the genome unambiguously, +please specify the RefSeq annotation release displayed on the transcript's +Genome Browser details page and also the RefSeq transcript ID with version +(e.g. NM_012309.4 not NM_012309). +</p> + + +<a name="methods"></a> +<h2>Methods</h2> +<p> +Tracks contained in the RefSeq annotation and RefSeq RNA alignment tracks were created at UCSC using +data from the NCBI RefSeq project. Data files were downloaded from RefSeq in GFF file format and +converted to the genePred and PSL table formats for display in the Genome Browser. 
Information about +the NCBI annotation pipeline can be found +<a href="https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/" target="_blank">here</a>.</p> + +<p>The RefSeq Diffs track is generated by UCSC using NCBI's RefSeq RNA alignments.</p> +<p> +The UCSC RefSeq Genes track is constructed using the same methods as previous RefSeq Genes tracks. +RefSeq RNAs were aligned against the $organism genome using BLAT. Those with an alignment of +less than 15% were discarded. When a single RNA aligned in multiple places, the alignment +having the highest base identity was identified. Only alignments having a base identity +level within 0.1% of the best and at least 96% base identity with the genomic sequence were +kept.</p> + +<h2>Data Access</h2> +<p> +The raw data for these tracks can be accessed in multiple ways. It can be explored interactively +using the <a href="/goldenPath/help/api.html" target="_blank">REST API</a>, +<a href="../cgi-bin/hgTables" target="_blank">Table Browser</a> or +<a href="../cgi-bin/hgIntegrator" +target="_blank">Data Integrator</a>. The tables can also be accessed programmatically through our +<a href="../../goldenPath/help/mysql.html" +target="_blank">public MySQL server</a> or downloaded from our +<a href="http://hgdownload.soe.ucsc.edu/goldenPath/$db/database/" +target="_blank">downloads server</a> for local processing. The previous track versions are available +in the <a href="https://hgdownload.soe.ucsc.edu/goldenPath/archive/$db/ncbiRefSeq/" +target="_blank">archives</a> of our downloads server. You can also access any RefSeq table +entries in JSON format through our <a href="http://genome.ucsc.edu/goldenPath/help/api.html"> +JSON API</a>.</p> +<p> +The data in the <em>RefSeq Other</em> and <em>RefSeq Diffs</em> tracks are organized in +<a href="../../FAQ/FAQformat.html#format1.5" target="_blank">bigBed</a> file format; more +information about accessing the information in this bigBed file can be found +below. 
The other subtracks are associated with database tables as follows:</p> +<dl> + <dt><a href="../../FAQ/FAQformat.html#format9" target="_blank">genePred</a> format:</dt> + <ul> + <li>RefSeq All - <tt>ncbiRefSeq</tt></li> + <li>RefSeq Curated - <tt>ncbiRefSeqCurated</tt></li> + <li>RefSeq Predicted - <tt>ncbiRefSeqPredicted</tt></li> + <li>RefSeq HGMD - <tt>ncbiRefSeqHgmd</tt></li> + <li>RefSeq Select+MANE - <tt>ncbiRefSeqSelect</tt></li> + <li>UCSC RefSeq - <tt>refGene</tt></li> + </ul> + <dt><a href="../../FAQ/FAQformat.html#format2" target="_blank">PSL</a> format:</dt> + <ul> + <li>RefSeq Alignments - <tt>ncbiRefSeqPsl</tt></li> + </ul> +</dl> +<p> +The first column of each of these tables is "bin". This column is designed +to speed up access for display in the Genome Browser, but can be safely ignored in downstream +analysis. You can read more about the bin indexing system +<a href="http://genomewiki.ucsc.edu/index.php/Bin_indexing_system" target="_blank">here</a>.</p> +<p> +The annotations in the <em>RefSeq Other</em> and <em>RefSeq Diffs</em> tracks are stored in bigBed +files, which can be obtained from our downloads server: +<a href="http://hgdownload.soe.ucsc.edu/gbdb/$db/ncbiRefSeq/ncbiRefSeqOther.bb" +target="_blank"><tt>ncbiRefSeqOther.bb</tt></a> and +<a href="http://hgdownload.soe.ucsc.edu/gbdb/$db/ncbiRefSeq/ncbiRefSeqGenomicDiff.bb" +target="_blank"><tt>ncbiRefSeqGenomicDiff.bb</tt></a>. +Individual regions or the whole set of genome-wide annotations can be obtained using our tool +<tt>bigBedToBed</tt>, which can be compiled from the source code or downloaded as a precompiled +binary for your system from the utilities directory linked below. 
For example, to extract only +annotations in a given region, you could use the following command:</p> +<p> +<tt>bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/$db/ncbiRefSeq/ncbiRefSeqOther.bb +-chrom=chr16 -start=34990190 -end=36727467 stdout</tt></p> +<p> +You can download a GTF format version of the RefSeq All table from the +<a href="http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/genes/">GTF downloads directory</a>. +The genePred format tracks can also be converted to GTF format using the +<tt>genePredToGtf</tt> utility, available from the +<a href="http://hgdownload.soe.ucsc.edu/admin/exe/" +target="_blank">utilities directory</a> on the UCSC downloads +server. The utility can be run from the command line like so:</p> +<tt>genePredToGtf $db ncbiRefSeqPredicted ncbiRefSeqPredicted.gtf</tt> +<p> +Note that using genePredToGtf in this manner accesses our public MySQL server, and you therefore +must set up your hg.conf as described on the MySQL page linked near the beginning of the Data Access +section.</p> +<p> +A file containing the RNA sequences in <a href="http://genetics.bwh.harvard.edu/pph/FASTA.html" +target="_blank">FASTA</a> format for all items in the <em>RefSeq All</em>, <em>RefSeq Curated</em>, +and <em>RefSeq Predicted</em> tracks can be found on our downloads server +<a href="http://hgdownload.soe.ucsc.edu/gbdb/$db/ncbiRefSeq/seqNcbiRefSeq.rna.fa" +target="_blank">here</a>.</p> +<p> +Please refer to our <a href="https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome" +target="_blank">mailing list archives</a> for questions.</p> + +<p> +Previous versions of the ncbiRefSeq set of tracks can be found on our <a href="http://hgdownload.soe.ucsc.edu/goldenPath/archive/$db/ncbiRefSeq">archive download server</a>. +</p> + +<h2>Credits</h2> +<p> +This track was produced at UCSC from data generated by scientists worldwide and curated by the +NCBI RefSeq project. </p> + +<h2>References</h2> +<p> +Kent WJ. 
+<a href="https://genome.cshlp.org/content/12/4/656.full" target="_blank">BLAT - the BLAST-like +alignment tool</a>. <em>Genome Res.</em> 2002 Apr;12(4):656-64. +PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/11932250" target="_blank">11932250</a>; PMC: <a +href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC187518/" target="_blank">PMC187518</a></p> +<p> +Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, +Landrum MJ, McGarvey KM <em>et al</em>. +<a href="https://academic.oup.com/nar/article/42/D1/D756/1051112/RefSeq-an-update-on-mammalian- +reference-sequences" target="_blank">RefSeq: an update on mammalian reference sequences</a>. +<em>Nucleic Acids Res</em>. 2014 Jan;42(Database issue):D756-63. +PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/24259432" target="_blank">24259432</a>; PMC: +<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965018/" target="_blank">PMC3965018</a></p> +<p> +Pruitt KD, Tatusova T, Maglott DR. +<a href="https://academic.oup.com/nar/article/33/suppl_1/D501/2505241/NCBI-Reference-Sequence- +RefSeq-a-curated-non" target="_blank"> +NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts +and proteins</a>. +<em>Nucleic Acids Res.</em> 2005 Jan 1;33(Database issue):D501-4. +PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/15608248" target="_blank">15608248</a>; PMC: <a +href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC539979/" target="_blank">PMC539979</a></p>