edc007b0a1062502c0a898866561a6e9fd870b99
hiram
  Wed Feb 22 11:25:12 2023 -0800
updated instructions for a build procedure summary no redmine

diff --git src/hg/makeDb/doc/asmHubs/README.txt src/hg/makeDb/doc/asmHubs/README.txt
index db8026f..5b2b399 100644
--- src/hg/makeDb/doc/asmHubs/README.txt
+++ src/hg/makeDb/doc/asmHubs/README.txt
@@ -1,17 +1,99 @@
 #############################################################################
-### Building the assembly hubs ###
+### Building the GenArk assembly hubs ###
+#############################################################################
+###
+### To build a single hub:
+#############################################################################
+
+0: Given an accession identifier, e.g. GCF_002776525.5
+a. find build command and designated clade
+b. run the build of the hub
+c. add lines to source tree files master.run.list and clade.orderList.tsv
+d. in the source tree run: time (make) > dbg 2>&1 # check for errors
+e. time (make verifyTestDownload) >> test.down.log 2>&1 # check for errors
+f. time (make sendDownload) >> send.down.log 2>&1 # check for errors
+g. time (make verifyDownload) >> verify.down.log 2>&1 # check for errors
+h. verify the browser functions: https://genome.ucsc.edu/h/GCF_002776525.5
+
+### Details of those steps:
+
+1. Given an accession identifier, e.g. GCF_002776525.5
+   Find the build command and designated clade:
+
+   grep GCF_002776525.5 \
+     /hive/data/outside/ncbi/genomes/reports/newAsm/{rs,gb}.todo.*.txt
+
+Answer: 'primates' clade and command:
+rs.todo.primates.txt:./runBuild GCF_002776525.5_ASM277652v5 primates Piliocolobus_tephrosceles 2019_12_12
+
+   If that grep finds nothing, that browser may already be built.
+   Can grep source tree file: ~/kent/src/hg/makeDb/doc/asmHubs/master.run.list
+   for your accession to see if it is already done
+
+2. Run the build of the browser in the directory:
+   cd /hive/data/genomes/asmHubs/allBuild
+time (./runBuild GCF_002776525.5_ASM277652v5 primates Piliocolobus_tephrosceles) > GCF_002776525.5.log 2>&1 &
+   That could take several days for a large genome, a few hours for a small one
+   When it is done, there will be an asmId.trackDb.txt file in the build
+   directory:
+/hive/data/genomes/asmHubs/refseqBuild/GCF/002/776/525/GCF_002776525.5_ASM277652v5/
+
+3. When the build is done, add that runBuild command to the source tree:
+   ~/kent/src/hg/makeDb/doc/asmHubs/master.run.list
+   maintain the sorted order of that file
+
+4. Add the full assembly ID and common name to the primates.orderList.txt
+   cd ~/kent/src/hg/makeDb/doc/primatesAsmHub
+   echo GCF_002776525.5_ASM277652v5 | ../asmHubs/commonNames.pl /dev/stdin
+   GCF_002776525.5_ASM277652v5 Ugandan red Colobus (RC106 2019)
+   Keep the list in order by the second column case insensitive.
+   These common names will be the pull-down menu list in the browser
+   to select a genome from this group. Make the common name unique so
+   there is something the user can see that they can identify as the
+   assembly they want to use.
+   Extra credit: if your new build is an updated version of
+   that genome assembly, move the old one out of this
+   orderList.txt into ../legacyAsmHub/legacy.orderList.txt
+   Same procedures there to push out that group.
+
+5. Prepare the build for the push. In this primatesAsmHub directory:
+   time (make) > dbg 2>&1
+   This could stop prematurely if errors are encountered, to verify
+   when done, check for errors: grep -i err dbg
+   should be nothing significant
+
+6. Verify the browser is correct on hgwdev:
+   time (make verifyTestDownload) >> test.down.log 2>&1
+   should finish with an all clear line, no failures:
+# checked 58 hubs, 58 success, 0 fail, total tracks: 1188, 2023-02-15 13:48:07
+
+7. Push the hub to hgdownload (and dynamic blat server):
+   time (make sendDownload) >> send.down.log 2>&1
+   should stop if there are errors. Can verify: grep -i error send.down.log
+
+8. Verify the hub is correctly on hgdownload:
+   time (make verifyDownload) >> verify.down.log 2>&1
+   should finish with an all clear line, no failures:
+# checked 58 hubs, 58 success, 0 fail, total tracks: 1188, 2023-02-15 13:58:02
+
+9. Verify the hub appears in the browser:
+   https://genome.ucsc.edu/h/GCF_002776525.5
+
+Extra historical discussion included below.
+
+#############################################################################
 #############################################################################
 ### see below for adding custom/local developed tracks to an existing GenArk hub
 #############################################################################

 The build of each assembly takes place in, for example:

 /hive/data/genomes/asmHubs/refseqBuild/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/

 (There is a corresponding hierarchy for 'genbank' GCA assemblies, i.e.:
 /hive/data/genomes/asmHubs/genbankBuild/GCA/902/686/455/GCA_902686455.1_mSciVul1.1
 )

 I have a 'goto' function in my shell, you can view at:
@@ -95,31 +177,31 @@
 #############################################################################
 ### To run up a build of an assembly ###
 #############################################################################

 The actual build is taking place with the help of the 'runBuild' script
 (copy here in ~/kent/src/hg/makeDb/doc/asmHubs/runBuild)

 The builds are operated from the directory:
   /hive/data/genomes/asmHubs/allBuild/
 (a location to accumulate log files, and run lists, thus work history)

 The 'runBuild' is operated, for example, a single assembly:

-  time (./runBuild GCF_000001405.39 GCF_000001405.39_GRCh38.p13 vertebrate_mammalian Homo_sapiens) >> GCF_000001405.39.log 2>&1 &
+  time (./runBuild GCF_000001405.39_GRCh38.p13 primates Homo_sapiens) >> GCF_000001405.39.log 2>&1 &

 Or, typically, there may be a whole list of such commands ( such as
 in the master.run.list here: ~/kent/src/hg/makeDb/doc/asmHubs/master.run.list )
 These are run, for example 5 at a time:

   time (kent/src/hg/utils/automation/perlPara.pl 5 master.run.list) \
     >> bigRun.log 2>&1

 The 'runBuild' script is usually set up to run all steps from 'download'
 to 'trackDb', and it is OK to use it like this even on a build that
 has already taken place (currently it is disabled to avoid trying
 to rebuild an assembly). There are cases, for example, where
 I want to update all the trackDb files since something has