cac0d55cab41854d0d152a5dfbd1ced98228618d
gperez2
Tue Feb 25 09:57:55 2025 -0800
Updating Max entry of the 'Why does the UCSC RefSeq track (refGene) include duplicates' FAQ, refs #35222

diff --git src/hg/htdocs/FAQ/FAQgenes.html src/hg/htdocs/FAQ/FAQgenes.html
index 4a7c7f65c1c..25b2597126d 100755
--- src/hg/htdocs/FAQ/FAQgenes.html
+++ src/hg/htdocs/FAQ/FAQgenes.html
@@ -213,34 +213,34 @@
This is related to the question What is the difference between "NCBI RefSeq" and "UCSC RefSeq"? below. Briefly, the UCSC refGene track aligns the RefSeq transcripts to the genome with BLAT, with no special filtering other than a 95% identity threshold. The NCBI RefSeq track is NCBI's own mapping, and the NCBI alignments were filtered using manual annotations to make sure that a transcript is mapped only once, even if it aligns perfectly twice (there is one exception, genes in the PAR regions; see the paragraph below). NCBI uses manual curation to decide on the best placement: for example, if a gene is annotated on chr4, any alignments from other chromosomes, even 100% identical ones, are removed. As a result, the UCSC RefSeq track contains duplicates if the transcripts align very well to both loci, which alerts the user to this fact, whereas the NCBI alignments were filtered manually so that every transcript maps only once.
-NCBI's transcript mapping which we provide in our NCBI RefSeq track, does
-contain a few duplicates, but these have a biological explanation: They are
+NCBI's transcript mapping, which we provide in our NCBI RefSeq track, does
+contain a few duplicates, but these have a biological explanation: they are
 transcripts in the pseudoautosomal regions
-(PARs), so they have identical sequences and by NCBI rules this means identical
+(PARs). Because they have identical sequences, NCBI rules assign them identical
 accessions. See the section below for how Ensembl/Gencode handle these cases.
The human genome has seven genes located in the pseudoautosomal regions (PARs), which have identical sequences on both chrX and chrY. The Ensembl team previously assigned these genes identical accessions due to their identical sequences. Since Ensembl release 110 (identical to Gencode release 44), these genes receive distinct accessions. If you encounter duplicates in Ensembl/Gencode files, they likely originate from file versions predating this update at the EBI.
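
To check which accessions are placed more than once in a gene table, something like the following Python sketch may help. It is not part of the FAQ or of any UCSC tool. It assumes each row has already been reduced to a hypothetical (name, chrom, txStart, txEnd) tuple, for example from a download of the refGene table, and the accessions and coordinates shown are made up for illustration.

```python
from collections import defaultdict


def find_duplicate_accessions(rows):
    """Return {accession: [(chrom, start, end), ...]} for every
    accession that is placed at more than one distinct locus."""
    placements = defaultdict(set)
    for name, chrom, tx_start, tx_end in rows:
        placements[name].add((chrom, tx_start, tx_end))
    # Keep only accessions with two or more distinct placements.
    return {
        name: sorted(loci)
        for name, loci in placements.items()
        if len(loci) > 1
    }


# Hypothetical rows; the accessions and coordinates are invented.
example_rows = [
    ("NM_EXAMPLE1", "chrX", 1000, 5000),
    ("NM_EXAMPLE1", "chrY", 1000, 5000),   # second placement, PAR-like case
    ("NM_EXAMPLE2", "chr4", 20000, 26000),
]

duplicates = find_duplicate_accessions(example_rows)
for accession, loci in duplicates.items():
    print(accession, loci)
```

Run on the UCSC RefSeq table, a check like this would be expected to report any transcript that aligns well to two loci. Run on the NCBI RefSeq table, it should report only the PAR transcripts described above.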