@@ -172,78 +187,92 @@
Tracks contained in the RefSeq annotation and RefSeq RNA alignment tracks were created at UCSC using
data from the NCBI RefSeq project. Data files were downloaded from RefSeq in GFF file format and
converted to the genePred and PSL table formats for display in the Genome Browser. Information about
the NCBI annotation pipeline can be found
here.
The RefSeq Diffs track is generated by UCSC using NCBI's RefSeq RNA alignments.
The UCSC RefSeq Genes track is constructed using the same methods as previous RefSeq Genes tracks.
RefSeq RNAs were aligned against the $organism genome using BLAT. Those with an alignment of
less than 15% were discarded. When a single RNA aligned in multiple places, the alignment
having the highest base identity was identified. Only alignments having a base identity
level within 0.1% of the best and at least 96% base identity with the genomic sequence were
kept.
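The identity filter described above can be sketched as follows. This is a minimal illustration, not the UCSC pipeline code: the function name, the tuple layout, and the reading of "within 0.1% of the best" as an absolute difference of 0.1 percentage points are all assumptions, and the 15% discard step is not modeled.

```python
def filter_alignments(alignments, min_identity=0.96, near_best=0.001):
    """Keep the near-best alignments for ONE RNA.

    alignments: list of (name, identity) tuples, identity as a fraction 0..1.
    Hypothetical layout; real BLAT output is PSL and identity must be
    computed from its match/mismatch columns first.
    """
    if not alignments:
        return []
    # Highest base identity among all placements of this RNA.
    best = max(identity for _, identity in alignments)
    # Keep placements close to the best that also clear the absolute floor.
    return [
        (name, identity)
        for name, identity in alignments
        if identity >= min_identity and best - identity <= near_best
    ]


hits = [("chr1:1000", 0.995), ("chr5:2000", 0.9945), ("chr9:3000", 0.98)]
kept = filter_alignments(hits)
```

With these inputs, the third placement is dropped because it is more than 0.1 percentage points below the best one, even though it clears the 96% floor.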
+
+The NCBI Orthologs track was generated using the latest
+NCBI files (gene2accession and
+gene_orthologs). NCBI chromosome identifiers were mapped to UCSC-compatible IDs using
+species-specific chromosome alias files, and genes were filtered to include only those located on
+valid NCBI chromosomes. A custom Python script processed the ortholog relationships and created BED files for
+each species. The BED files were then converted to bigBed format, with indexing for search
+functionality. The procedure is documented in the makeDoc in our GitHub repository.
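The chromosome mapping and filtering step might look roughly like the sketch below. It is not the script referenced in the makeDoc: the function names, the two-column alias layout (NCBI ID, tab, UCSC ID), and the gene tuple layout are assumptions made for illustration. Rows are sorted by chromosome and start because bedToBigBed expects sorted input.

```python
def load_alias_map(lines):
    """Parse alias lines of the assumed form 'ncbi_id<TAB>ucsc_id' into a dict."""
    alias = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        ncbi_id, ucsc_id = line.split("\t")[:2]
        alias[ncbi_id] = ucsc_id
    return alias


def to_ucsc_bed(genes, alias):
    """Convert (ncbi_chrom, start, end, name) tuples to sorted BED4 lines.

    Genes on chromosomes missing from the alias map are dropped.
    """
    rows = []
    for chrom, start, end, name in genes:
        ucsc_chrom = alias.get(chrom)
        if ucsc_chrom is None:
            continue  # not on a valid, mappable chromosome
        rows.append((ucsc_chrom, start, end, name))
    rows.sort(key=lambda r: (r[0], r[1]))
    return ["\t".join(str(field) for field in r) for r in rows]


alias = load_alias_map(["# ncbi\tucsc", "NC_000001.11\tchr1", "NC_000002.12\tchr2", ""])
genes = [
    ("NC_000002.12", 500, 900, "geneB"),
    ("NC_000001.11", 100, 200, "geneA"),
    ("NW_unplaced", 1, 2, "geneC"),
]
bed_lines = to_ucsc_bed(genes, alias)
```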
Data Access
The raw data for these tracks can be accessed in multiple ways. It can be explored interactively
using the REST API,
Table Browser or
Data Integrator. The tables can also be accessed programmatically through our
public MySQL server or downloaded from our
downloads server for local processing. The previous track versions are available
in the archives of our downloads server. You can also access any RefSeq table
entries in JSON format through our
JSON API.
-The data in the RefSeq Other and RefSeq Diffs tracks are organized in
+The data in the RefSeq Other, RefSeq Diffs, and NCBI Orthologs tracks are organized in
bigBed file format; more
information about accessing the information in this bigBed file can be found
below. The other subtracks are associated with database tables as follows:
- genePred format:
- RefSeq All - ncbiRefSeq
- RefSeq Curated - ncbiRefSeqCurated
- RefSeq Predicted - ncbiRefSeqPredicted
- RefSeq HGMD - ncbiRefSeqHgmd
- RefSeq Select+MANE - ncbiRefSeqSelect
- UCSC RefSeq - refGene
- PSL format:
- RefSeq Alignments - ncbiRefSeqPsl
The first column of each of these tables is "bin". This column is designed
to speed up access for display in the Genome Browser, but can be safely ignored in downstream
analysis. You can read more about the bin indexing system
here.
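For downstream analysis of a downloaded table dump, the bin column can simply be dropped. The helper below is a small sketch that assumes tab-separated rows with no header line and bin as the first field; the function name and the sample row are illustrative only.

```python
def drop_bin_column(lines):
    """Yield each tab-separated row with its first (bin) field removed."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[1:])


# Made-up, truncated genePred-style row: bin, name, chrom, strand, txStart, txEnd
sample = ["585\tNM_000000.0\tchr1\t+\t11873\t14409\n"]
cleaned = list(drop_bin_column(sample))
```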
-The annotations in the RefSeqOther and RefSeqDiffs tracks are stored in bigBed
+The annotations in the RefSeq Other, RefSeq Diffs, and NCBI Orthologs tracks are stored in bigBed
files, which can be obtained from our downloads server here,
-ncbiRefSeqOther.bb and
+ncbiRefSeqOther.bb,
-ncbiRefSeqDiffs.bb.
+ncbiRefSeqDiffs.bb, and
+ncbiOrtho.bb.
Individual regions or the whole set of genome-wide annotations can be obtained using our tool
bigBedToBed which can be compiled from the source code or downloaded as a precompiled
binary for your system from the utilities directory linked below. For example, to extract only
annotations in a given region, you could use the following command:
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/$db/ncbiRefSeq/ncbiRefSeqOther.bb
-chrom=chr16 -start=34990190 -end=36727467 stdout
You can download a GTF format version of the RefSeq All table from the
GTF downloads directory.
The genePred format tracks can also be converted to GTF format using the
genePredToGtf utility, available from the
utilities directory on the UCSC downloads
server. The utility can be run from the command line like so: