3491d223c8596113721f630745bd9ec0c55e1833
max
  Mon Feb 24 08:03:19 2025 -0800
adding note about duplicate IDs in ensemb/gencode, refs #35222

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index 0b0cbb1125f..fab6a0fefe0 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -4,30 +4,31 @@
 
 <!-- Relative paths to support mirror sites with non-standard GB docs install -->
 <!--#include virtual="$ROOT/inc/gbPageStart.html" -->
 
 <h1>Frequently Asked Questions: Gene tracks</h1>
 
 <h2>Topics</h2>
 
 <ul>
 <li><a href="#gene">What is a gene?</a></li>
 <li><a href="#genestrans">What is a transcript and how is it related to a gene?</a></li>
 <li><a href="#genename">What is a gene name?</a></li>
 <li><a href="#mostCommon">What are the most common gene transcript tracks?</a></li>
 <li><a href="#wrong">I think this transcript looks strange, what shall I do?</a></li>
 <li><a href="#duplicates">Why does the UCSC RefSeq track ("refGene") include duplicates, and some transcripts map to two loci?</a></li>
+<li><a href="#duplicatesEns">Why do the GENCODE/Ensembl tracks ("knownGene", "ensGene" or "wgEncodeGencodeVXX") include a few duplicates, and some transcripts map to two loci?</a></li>
 <li><a href="#ens">What are Ensembl and GENCODE and is there a difference?</a></li>
 <li><a href="#ensRefseq">What are the differences among GENCODE, Ensembl and RefSeq?</a></li>
 <li><a href="#hg19">For the human assembly hg19/GRCh37: What is the difference between "UCSC 
                     Genes" track, the "GENCODE" track and the "Ensembl Genes" track?</a></li>
 <li><a href="#hg38">For the human assembly hg38/GRCh38: What are the differences between the 
 		    "GENCODE" and "All GENCODE" tracks?</a></li>
 <li><a href="#gencode">What is the difference between GENCODE comprehensive and basic?</a></li>
 <li><a href="#ncbiRefseq">What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</a></li>
 <li><a href="#mito">What is the best gene track for mitochondrial gene annotations?</a></li>
 <li><a href="#report">How shall I report a gene transcript in a manuscript?</a></li>
 <li><a href="#ccds">What is CCDS?</a></li>
 <li><a href="#justsingle">How can I show a single transcript per gene?</a></li>
 <li><a href="#singledownload">How can I download a file with a single transcript per gene?</a></li>
 <li><a href="#whatdo">This is rather complicated. Can you tell me which gene transcript track
                       I should use?</a></li>
@@ -209,30 +210,39 @@
 </p>
 
 <a name="duplicates"></a>
 <h6>Why does the UCSC RefSeq track ("refGene") include duplicates, and some transcripts map to two loci?</h6>
 
 <p>This is related to the question <a href="#ncbiRefSeq">What is the difference between "NCBI RefSeq" and "UCSC RefSeq"?</a>
 below. Briefly, the UCSC refGene track aligns the RefSeq transcripts to the genome with BLAT, with no special filtering but a
 95% identity, the NCBI RefSeq track is NCBI's mapping and the NCBI alignments were filtered using manual annotations
 to make sure that a transcript is mapped only once, even if it is perfectly aligning twice. NCBI uses manual curation
 to decide on the best placement, for example, if a gene is annotated on chr4, any alignments, even 100% identical,
 from other chromosomes are removed. As a result, the UCSC RefSeq track contains duplicates if the transcripts align
 very well to both loci and alerts the user to this fact, where as the NCBI alignments were filtered manually
 to make sure that every transcript maps only once.
 </p>
 
+<a name="duplicatesEns"></a>
+<h6>Why do the GENCODE/Ensembl tracks ("knownGene", "ensGene" or "wgEncodeGencodeVXX") include a few duplicates, and some transcripts map to two loci?</h6>
+
+<p>There are seven genes in the <a target=_blank href='https://en.wikipedia.org/wiki/Pseudoautosomal_region'>pseudoautosomal regions (PARs)</a>
+of the human genome. These genes have identical sequences on chrX and chrY. Because the
+sequences are identical, the Ensembl team used to give both copies the same accession.
+Since Ensembl release 110 (equivalent to GENCODE release 44), the chrX and chrY copies have different
+accessions. If you see duplicates in Ensembl/GENCODE files, the files probably predate this change at the EBI.</p>
+
 
 <a name="ens"></a>
 <h2>The differences</h2>
 
 Some of our gene tracks look similar and contain very similar information which can be confusing.
 
 <h6>What are Ensembl and GENCODE and is there a difference?</h6>
 
 <p> 
 Officially, the Ensembl and GENCODE gene models are the same. On the latest human and mouse genome 
 assemblies (hg38 and mm10), the identifiers, transcript sequences, and exon coordinates are almost
 identical between equivalent Ensembl and GENCODE versions (excluding <a target=_blank 
 href="FAQdownloads.html#downloadAlt">alternative sequences</a> or <a target=_blank 
 href="FAQdownloads.html#downloadFix">fix sequences</a>).</p>