a919a3dbe5fd770f03b4c879612b123e04743a9c
gperez2
  Thu Dec 12 23:47:12 2024 -0800
Adding the Gene Orthologs track to the NCBI RefSeq composite track, refs #30262

diff --git src/hg/makeDb/trackDb/human/refSeqComposite.html src/hg/makeDb/trackDb/human/refSeqComposite.html
index 437ba58..95d6387 100644
--- src/hg/makeDb/trackDb/human/refSeqComposite.html
+++ src/hg/makeDb/trackDb/human/refSeqComposite.html
@@ -9,31 +9,31 @@
 <a href="#methods">Methods</a> section for more details about how the different tracks were 
 created. </p>
 <p>
 Please visit NCBI's <a href="https://www.ncbi.nlm.nih.gov/projects/RefSeq/update.cgi"
 target="_blank">Feedback for Gene and Reference Sequences (RefSeq)</a> page to make suggestions, 
 submit additions and corrections, or ask for help concerning RefSeq records. </p>
 
 <p>
 For more information on the different gene tracks, see our <a target=_blank 
 href="/FAQ/FAQgenes.html">Genes FAQ</a>.</p>
 
 <h2>Display Conventions and Configuration</h2>
 <p>
 This track is a composite track that contains differing data sets.
 To show only a selected set of subtracks, uncheck the boxes next to the tracks that you wish to 
-hide. <b>Note:</b> Not all subtracts are available on all assemblies. </p>
+hide. <b>Note:</b> Not all subtracks are available on all assemblies. </p>
 
 The possible subtracks include:
 <dl>
   <dt><em><strong>RefSeq aligned annotations and UCSC alignment of RefSeq annotations
           </strong></em></dt>
   <ul>
     <li>
     <em>RefSeq All</em> &ndash; all curated and predicted annotations provided by 
     RefSeq.</li>
     <li>
     <em>RefSeq Curated</em> &ndash; subset of <em>RefSeq All</em> that includes only those 
     annotations whose accessions begin with NM, NR,  NP or YP. <small>(NP and YP are used only for
     protein-coding genes on the mitochondrion; YP is used for human only.)</small></li>
     <li>
     <em>RefSeq Predicted</em> &ndash; subset of RefSeq All that includes those annotations whose 
@@ -61,30 +61,45 @@
    RefSeq Select or MANE Select. 
    A single <em>Select</em> transcript is chosen as representative for each protein-coding gene. 
    This track includes transcripts categorized as MANE, which are further agreed upon as 
    representative by both NCBI RefSeq and Ensembl/GENCODE, and have a 100% identical match 
    to a transcript in the Ensembl annotation. See <a target="_blank" 
    href="https://www.ncbi.nlm.nih.gov/refseq/refseq_select/">NCBI RefSeq Select</a>. 
    Note that we provide a separate track, <a 
    target=_blank href="hgTrackUi?g=mane&db=hg38&c=chr22">MANE (hg38)</a>, 
    which contains only the MANE transcripts.
    </li>
    <li>
    <em>RefSeq HGMD (subset)</em> &ndash; Subset of RefSeq Curated, transcripts annotated by the Human
    Gene Mutation Database. This track is only available on the human genomes hg19 and hg38.
    It is the most restricted RefSeq subset, targeting clinical diagnostics.
    </li>
+   <li>
+   <em>RefSeq Historical</em> &ndash; previous RefSeq transcript versions, including NM_ accessions
+   and HGVS searches. This track is only available on hg38.
+   </li>
+   <li>
+   <em>NCBI Orthologs</em> &ndash; Orthologous genes were identified by 
+   <a target="_blank" href="https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/process/">
+    NCBI's Eukaryotic Genome Annotation Pipeline</a>
+    for the NCBI Gene dataset using a combination of protein sequence similarity
+    and local synteny analysis. Orthology is determined between the genome being annotated and a
+    reference genome, such as human or zebrafish, and pairs of orthologs are grouped together.
+    Transitive relationships are inferred within each group (for example, zebrafish &lt;-&gt;
+    human &lt;-&gt; mouse). This track is available for the following assemblies: hg38, mm39,
+    danRer11, canFam6, and bosTau9.
+   </li>
   </ul>
 </dl>
 
 <p>
 The <em>RefSeq All</em>, <em>RefSeq Curated</em>, <em>RefSeq Predicted</em>, <em>RefSeq HGMD</em>,
 <em>RefSeq Select/MANE</em> and <em>UCSC RefSeq</em> tracks follow the display conventions for
 <a href="../goldenPath/help/hgTracksHelp.html#GeneDisplay"
 target="_blank">gene prediction tracks</a>.
 The color shading indicates the level of review the RefSeq record has undergone:
 predicted (light), provisional (medium), or reviewed (dark), as defined by <a target=_blank href="https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_status_codes/?report=objectonly">RefSeq</a>. </p>
 
 <p>
 <table>
   <thead>
   <tr>
@@ -172,78 +187,92 @@
 <p>
 Tracks contained in the RefSeq annotation and RefSeq RNA alignment tracks were created at UCSC using 
 data from the NCBI RefSeq project. Data files were downloaded from RefSeq in GFF file format and 
 converted to the genePred and PSL table formats for display in the Genome Browser. Information about
 the NCBI annotation pipeline can be found 
 <a href="https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/" target="_blank">here</a>.</p>
 
 <p>The RefSeq Diffs track is generated by UCSC using NCBI's RefSeq RNA alignments.</p>
 <p>
 The UCSC RefSeq Genes track is constructed using the same methods as previous RefSeq Genes tracks.
 RefSeq RNAs were aligned against the $organism genome using BLAT. Those with an alignment of
 less than 15% were discarded. When a single RNA aligned in multiple places, the alignment
 having the highest base identity was identified. Only alignments having a base identity
 level within 0.1% of the best and at least 96% base identity with the genomic sequence were
 kept.</p>
+<p>
+The NCBI Orthologs track was generated using the latest
+<a href="https://ftp.ncbi.nih.gov/gene/DATA/" target="_blank">NCBI files</a> (gene2accession and
+gene_orthologs). NCBI chromosome identifiers were mapped to UCSC-compatible IDs using
+species-specific chromosome alias files, and genes were filtered to include only those located on
+valid NCBI chromosomes. A custom <a
+href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/doc/ncbiRefSeq/ortho.py"
+target="_blank">Python script</a> processed the ortholog relationships and created BED files for
+each species. The BED files were then converted to bigBed format, with indexing for search
+functionality. The procedure is documented in the <a href=
+"https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/doc/ncbiRefSeq/ncbiOrtho.makedoc"
+target="_blank">makeDoc</a> in our GitHub repository.</p>
 
 <h2>Data Access</h2>
 <p>
 The raw data for these tracks can be accessed in multiple ways. It can be explored interactively 
 using the <a href="/goldenPath/help/api.html" target="_blank">REST API</a>,
 <a href="../cgi-bin/hgTables" target="_blank">Table Browser</a> or
 <a href="../cgi-bin/hgIntegrator"
 target="_blank">Data Integrator</a>. The tables can also be accessed programmatically through our
 <a href="../../goldenPath/help/mysql.html"
 target="_blank">public MySQL server</a> or downloaded from our
 <a href="http://hgdownload.soe.ucsc.edu/goldenPath/$db/database/"
 target="_blank">downloads server</a> for local processing. The previous track versions are available
 in the <a href="https://hgdownload.soe.ucsc.edu/goldenPath/archive/$db/ncbiRefSeq/"
 target="_blank">archives</a> of our downloads server. You can also access any RefSeq table
 entries in JSON format through our <a href="http://genome.ucsc.edu/goldenPath/help/api.html">
 JSON API</a>.</p>
 <p>
-The data in the <em>RefSeq Other</em> and <em>RefSeq Diffs</em> tracks are organized in 
+The data in the <em>RefSeq Other</em>, <em>RefSeq Diffs</em>, and <em>NCBI Orthologs</em> tracks are organized in
 <a href="../../FAQ/FAQformat.html#format1.5" target="_blank">bigBed</a> file format; more
 information about accessing the information in this bigBed file can be found
 below. The other subtracks are associated with database tables as follows:</p>
 <dl>
   <dt><a href="../../FAQ/FAQformat.html#format9" target="_blank">genePred</a> format:</dt>
   <ul>
     <li>RefSeq All - <tt>ncbiRefSeq</tt></li>
     <li>RefSeq Curated - <tt>ncbiRefSeqCurated</tt></li>
     <li>RefSeq Predicted - <tt>ncbiRefSeqPredicted</tt></li>
     <li>RefSeq HGMD - <tt>ncbiRefSeqHgmd</tt></li>
     <li>RefSeq Select+MANE - <tt>ncbiRefSeqSelect</tt></li>
     <li>UCSC RefSeq - <tt>refGene</tt></li>
   </ul>
   <dt><a href="../../FAQ/FAQformat.html#format2" target="_blank">PSL</a> format:</dt>
   <ul>	
     <li>RefSeq Alignments - <tt>ncbiRefSeqPsl</tt></li>
   </ul>
 </dl>
 <p>
 The first column of each of these tables is &quot;bin&quot;. This column is designed
 to speed up access for display in the Genome Browser, but can be safely ignored in downstream
 analysis. You can read more about the bin indexing system
 <a href="http://genomewiki.ucsc.edu/index.php/Bin_indexing_system" target="_blank">here</a>.</p>
 <p>
-The annotations in the <em>RefSeqOther</em> and <em>RefSeqDiffs</em> tracks are stored in bigBed 
+The annotations in the <em>RefSeq Other</em>, <em>RefSeq Diffs</em>, and <em>NCBI Orthologs</em> tracks are stored in bigBed
 files, which can be obtained from our downloads server here,
 <a href="http://hgdownload.soe.ucsc.edu/gbdb/$db/ncbiRefSeq/ncbiRefSeqOther.bb"
-target="_blank"><tt>ncbiRefSeqOther.bb</tt></a> and 
+target="_blank"><tt>ncbiRefSeqOther.bb</tt></a>,
 <a href="http://hgdownload.soe.ucsc.edu/gbdb/$db/ncbiRefSeq/ncbiRefSeqGenomicDiff.bb"
-target="_blank"><tt>ncbiRefSeqDiffs.bb</tt></a>.
+target="_blank"><tt>ncbiRefSeqGenomicDiff.bb</tt></a>, and
+<a href="http://hgdownload.soe.ucsc.edu/gbdb/$db/ncbiOrtho/ncbiOrtho.bb"
+target="_blank"><tt>ncbiOrtho.bb</tt></a>.
 Individual regions or the whole set of genome-wide annotations can be obtained using our tool
 <tt>bigBedToBed</tt> which can be compiled from the source code or downloaded as a precompiled
 binary for your system from the utilities directory linked below. For example, to extract only
 annotations in a given region, you could use the following command:</p>
 <p>
 <tt>bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/$db/ncbiRefSeq/ncbiRefSeqOther.bb
 -chrom=chr16 -start=34990190 -end=36727467 stdout</tt></p>
 <p>
 You can download a GTF format version of the RefSeq All table from the 
 <a href="http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/genes/">GTF downloads directory</a>.
 The genePred format tracks can also be converted to GTF format using the
 <tt>genePredToGtf</tt> utility, available from the
 <a href="http://hgdownload.soe.ucsc.edu/admin/exe/"
 target="_blank">utilities directory</a> on the UCSC downloads 
 server. The utility can be run from the command line like so:</p>