1cb08c59e884b67fbfcbf79f9dc416574e5c6239
galt
Fri Jul 2 11:55:28 2021 -0700
Added a note about extra mouse strains appearing in the mm10 patch6

diff --git src/hg/makeDb/doc/mm10.patchUpdate.6.txt src/hg/makeDb/doc/mm10.patchUpdate.6.txt
index 3e3f35e..7a9beb9 100644
--- src/hg/makeDb/doc/mm10.patchUpdate.6.txt
+++ src/hg/makeDb/doc/mm10.patchUpdate.6.txt
@@ -1,19 +1,42 @@
 # for emacs: -*- mode: sh; -*-
 # This file describes how mm10 was extended with patch sequences and annotations from grcM38P6
+ALTS FROM OTHER MOUSE STRAINS IN NCBI RELEASE CONSIDERATIONS
+
+The original NCBI release grcM38 (which we used for the initial mm10 release)
+had dozens of alt-scaffolds from 14 other mouse strains. Whoever did the initial mm10 build manually removed
+the sequences from those other strains. When we built patch6, we went back to the NCBI source
+and ran our standard build tools. We did not realize that 99 out of 108 alt-scaffolds were
+from the 14 other strains. Only 9 alt-scaffolds are from the native strain C57BL/6J.
+We did not catch the issue until it was too late: QA had already pushed the release, and we then received a message from a researcher.
+
+Since it would be a lot of work to go back and redo all of patch6 without the extra mouse strain alts,
+we have decided to proceed. We have updated the README and the mm10 main html page
+to reflect these changes and to note the additional non-native strain sequences that appear in the patch 6 release.
+This can be justified since we have to deal with alt-scaffolds anyway
+in our increasingly complex world, and this makes our release more similar to NCBI's.
+Those alt-scaffolds were chosen because they can be useful, e.g. genes from other strains
+that are important for medical research.
+
+Currently we have no table to map the alts to their respective strains,
+but it is easy to tell the native from the non-native alts since
+all the IDs for the native C57BL/6J alts have the letter K in them.
+Our convention is that the ID follows the name of the chrom the alt is located on,
+so the native alts look like chrN_KK* or chrN_KZ*.
+
 ##############################################################################
 # Extend main database 2bit, chrom.sizes, chromInfo (DONE - 2021-04-08 - Galt)
 cd /hive/data/genomes/mm10
 # main 2bit
 time faToTwoBit <(twoBitToFa mm10.2bit stdout) \
     <(twoBitToFa /hive/data/genomes/grcM38P6/grcM38P6.2bit stdout) \
     mm10.p6.2bit
 #real 1m52.859s
 # unmasked 2bit
 time twoBitMask -type=.bed mm10.p6.2bit /dev/null mm10.p6.unmasked.2bit
 #real 0m3.104s