src/hg/makeDb/doc/asmHubs/README.txt 87586a71c52ad59bc437786485e7e900cae367ee

87586a71c52ad59bc437786485e7e900cae367ee
hiram
  Mon Feb 24 07:19:52 2025 -0800
up to date error checking refs #35201

diff --git src/hg/makeDb/doc/asmHubs/README.txt src/hg/makeDb/doc/asmHubs/README.txt
index 19c4476d9e7..6a1984e5cec 100644
--- src/hg/makeDb/doc/asmHubs/README.txt
+++ src/hg/makeDb/doc/asmHubs/README.txt
@@ -1,328 +1,329 @@
 #############################################################################
 ### Building the GenArk assembly hubs ###
 #############################################################################
 ### Requests from the request system:
 
 When a user sends in a request with an accession ID, e.g.: GCF_002776525.5
 the assembly may already exist in some version.  To check if something
 already exists, use just the number part of the ID: 002776525 and
 check the existing listings in the source tree:
 
     grep 002776525 ~/kent/src/hg/makeDb/doc/*AsmHub/*.tsv
 
 They may be asking for a newer version, or they may be asking for
 a GenBank version when a RefSeq version already exists.  You can decide
 whether what we have is better than what they ask for.  Sometimes they
 may want a specific older version; negotiate that with the user by
 email to see if they would accept a newer version.
 
 When not there, check to see if it has been recognized as something
 to build.  Again, just the number, for RefSeq assemblies:
 
     grep 002776525 /hive/data/outside/ncbi/genomes/reports/newAsm/rs.todo.*
 for GenBank:
     grep 002776525 /hive/data/outside/ncbi/genomes/reports/newAsm/gb.todo.*
 Decide which one is best or most up to date; RefSeq is always first choice.
 Those files list ready-to-go build commands to be run in the allBuild
 directory:
 
     cd /hive/data/genomes/asmHubs/allBuild
     time (./runBuild GCA_002776525.5_ASM277652v5 primates Piliocolobus_tephrosceles) >> GCA_002776525.5.log 2>&1
 
 If it doesn't show up there, check the 'master' listings from NCBI:
 
    /hive/data/outside/ncbi/genomes/reports/assembly_summary*.txt
 assembly_summary_genbank.txt             assembly_summary_refseq.txt
 assembly_summary_genbank_historical.txt  assembly_summary_refseq_historical.txt
 
 These assembly_summary*.txt files can also be scanned for scientific names
 if that is all the user supplied.
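
 A minimal sketch of pulling out just the number part of an accession for
 the greps above.  The function name accNumber is made up for illustration,
 it is not a script in the kent tree:

```shell
# Print just the numeric part of an accession, e.g.
# GCF_002776525.5 or GCF_002776525.5_ASM277652v5 -> 002776525
# (accNumber is a hypothetical helper name)
accNumber() {
  n="${1#GC[AF]_}"            # drop the leading GCA_ or GCF_
  printf '%s\n' "${n%%.*}"    # drop the .version and anything after it
}

# usage, with a path from the text above:
#   grep "$(accNumber GCF_002776525.5)" ~/kent/src/hg/makeDb/doc/*AsmHub/*.tsv
```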
 
 #############################################################################
 ###
 ###  To build a single hub:
 #############################################################################
 
 0: Given an accession identifier, e.g. GCF_002776525.5
 a. find build command and designated clade (see also above discussion)
 b. run the build of the hub (see also above discussion)
 c. add lines to source tree files master.run.list and
                       doc/<clade>AsmHub/<clade>.orderList.tsv
 d. in the source tree doc/<clade>AsmHub/ directory, running commands:
 e. make symLinks      # prepares staging directory with symlinks to the build
 f. then: time (make) > dbg 2>&1 # check for errors: egrep "miss|err" dbg
 g. time (make verifyTestDownload) >> test.down.log 2>&1 # check for errors
                                                    # grep check test.down.log
 h. time (make sendDownload) >> send.down.log 2>&1 # check for errors
                                                   # grep error send.down.log
 i. time (make verifyDownload) >> verify.down.log 2>&1 # check for errors
                                                   # grep check verify.down.log
 j. verify the browser functions: https://genome.ucsc.edu/h/GCF_002776525.5
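
 For step c, the orderList file is kept in order by its second column
 (the common name), case insensitive.  A possible check of that order
 before running make, assuming the two columns are tab separated.  The
 function name orderListSorted is made up for illustration:

```shell
# Succeeds (exit 0) when the file is ordered by its second tab-separated
# column, ignoring case; otherwise sort reports the first out-of-order line.
# (orderListSorted is a hypothetical helper name)
orderListSorted() {
  LC_ALL=C sort -c -f -t "$(printf '\t')" -k2,2 "$1"
}

# usage, in the doc/<clade>AsmHub/ directory:
#   orderListSorted primates.orderList.tsv && echo "order OK"
```

 The exact collation used when the list was first sorted is not stated
 here, so treat a complaint from this check as a hint to look, not as proof
 of an error.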
 
 ### Details of those steps:
 
 1.  Given an accession identifier, e.g. GCF_002776525.5
     Find the build command and designated clade:
 
   grep GCF_002776525.5 \
      /hive/data/outside/ncbi/genomes/reports/newAsm/{rs,gb}.todo.*.txt
 
 Answer: 'primates' clade and command:
 rs.todo.primates.txt:./runBuild GCF_002776525.5_ASM277652v5 primates Piliocolobus_tephrosceles   2019_12_12
 
     If that grep finds nothing, that browser may already be built.
     Can grep source tree file: ~/kent/src/hg/makeDb/doc/asmHubs/master.run.list
     for your accession to see if it is already done.
 
 2.  Run the build of the browser in the directory:
     cd /hive/data/genomes/asmHubs/allBuild
 time (./runBuild GCF_002776525.5_ASM277652v5 primates Piliocolobus_tephrosceles) > GCF_002776525.5.log 2>&1 &
     That could take several days for a large genome, or a few hours for a
     small one.  When it is done, there will be an asmId.trackDb.txt file
     in the build directory:
 /hive/data/genomes/asmHubs/refseqBuild/GCF/002/776/525/GCF_002776525.5_ASM277652v5/
 
 3.  When the build is done, add that runBuild command to the source tree:
              ~/kent/src/hg/makeDb/doc/asmHubs/master.run.list
     maintain the sorted order of that file
 
 4.  Add the full assembly ID and common name to the primates.orderList.tsv
     cd ~/kent/src/hg/makeDb/doc/primatesAsmHub
     echo GCF_002776525.5_ASM277652v5 | ../asmHubs/commonNames.pl /dev/stdin
  GCF_002776525.5_ASM277652v5     Ugandan red Colobus (RC106 2019)
     Keep the list in order by the second column case insensitive.
     These common names will be the pull-down menu list in the browser
     to select a genome from this group.  Make the common name unique so
     there is something the user can see that they can identify as the
     assembly they want to use.
     Extra credit:  if your new build is an updated version of
                    that genome assembly, move the old one out of this
                    orderList.tsv into ../legacyAsmHub/legacy.orderList.tsv
                    Same procedures there to push out that group.
 
 5. Prepare the build for the push.  In this primatesAsmHub directory:
      time (make) > dbg 2>&1
      This could stop prematurely if errors are encountered, to verify
-     when done, check for errors: grep -i err dbg
+     when done, check for errors: egrep -i "error|fail|missing|cannot" dbg
      should be nothing significant
 
 6. Verify the browser is correct on hgwdev:
      time (make verifyTestDownload) >> test.down.log 2>&1
      should finish with an all clear line, no failures:
 # checked  58 hubs, 58 success, 0 fail, total tracks: 1188, 2023-02-15 13:48:07
 
 7. Push the hub to hgdownload (and dynamic blat server):
      time (make sendDownload) >> send.down.log 2>&1
-     should stop if there are errors.  Can verify: grep -i error send.down.log
+     should stop if there are errors.  Can verify:
+         egrep -i "error|fail|missing|cannot" send.down.log
 
 8. Verify the hub is correctly on hgdownload:
      time (make verifyDownload) >> verify.down.log 2>&1
      should finish with an all clear line, no failures:
 # checked 58 hubs, 58 success, 0 fail, total tracks: 1188, 2023-02-15 13:58:02
 
 9. Verify the hub appears in the browser:
         https://genome.ucsc.edu/h/GCF_002776525.5
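
 The all clear lines in steps 6 and 8 can also be tested by script rather
 than by eye.  A minimal sketch, assuming the '# checked ...' line format
 shown above.  The function name lastCheckOk is made up for illustration:

```shell
# Exit 0 only if the most recent '# checked ...' line in the log reports
# zero failures and as many successes as hubs.
# (lastCheckOk is a hypothetical helper name)
lastCheckOk() {
  grep '^# checked' "$1" | tail -n 1 |
    awk 'BEGIN { ok = 0 }
         { ok = ($7 == 0 && $3 == $5) }
         END { exit (ok ? 0 : 1) }'
}

# usage:
#   lastCheckOk test.down.log && echo "all clear"
#   lastCheckOk verify.down.log && echo "all clear"
```

 Only the last line is examined since these logs accumulate one such line
 per run.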
 
 Extra historical discussion included below.
 
 #############################################################################
 #############################################################################
 ### see below for adding custom/local developed tracks to an existing GenArk hub
 #############################################################################
 
 The build of each assembly takes place in, for example:
 
   /hive/data/genomes/asmHubs/refseqBuild/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/
 
 (There is a corresponding hierarchy for 'genbank' GCA assemblies, i.e.:
 
 /hive/data/genomes/asmHubs/genbankBuild/GCA/902/686/455/GCA_902686455.1_mSciVul1.1
 
 )
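
 The directory levels in both examples are the GCF or GCA prefix followed by
 the accession digits in groups of three.  A sketch of computing the build
 directory from a full assembly ID, assuming accession numbers always have
 nine digits.  The function name asmBuildDir is made up for illustration:

```shell
# Print the build directory for a full assembly ID such as
# GCF_000001405.39_GRCh38.p13 (asmBuildDir is a hypothetical helper name)
asmBuildDir() {
  asmId="$1"
  gcX="${asmId%%_*}"          # GCF or GCA
  num="${asmId#GC[AF]_}"      # 000001405.39_GRCh38.p13
  num="${num%%.*}"            # 000001405
  d0=$(echo "$num" | cut -c1-3)
  d1=$(echo "$num" | cut -c4-6)
  d2=$(echo "$num" | cut -c7-9)
  case "$gcX" in
    GCF) top=refseqBuild ;;
    GCA) top=genbankBuild ;;
    *) echo "unrecognized accession: $asmId" 1>&2; return 1 ;;
  esac
  printf '/hive/data/genomes/asmHubs/%s/%s/%s/%s/%s/%s\n' \
    "$top" "$gcX" "$d0" "$d1" "$d2" "$asmId"
}

# usage:
#   cd "$(asmBuildDir GCF_000001405.39_GRCh38.p13)"
```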
 
 I have a 'goto' function in my shell, you can view at:
   ~hiram/.bashrc.hiram
 which I use to move around in this spread out hierarchy.  For example:
 
   $ goto GCF_000001405
 
 will get you to that build directory (when there is only one GCF_000001405)
 
 You should construct any new files in this directory hierarchy, perhaps in
 a subdirectory if you have a whole category of files.  Note the
 subdirectories already here: bbi html ixIxx for example.
 
 (download, sequence, idKeys, trackData are directories for data construction
   during the build)
 
 To deliver files from this build to hgdownload, scripts in:
    ~/kent/src/hg/makeDb/doc/asmHubs/
 construct symlinks from the build directory into the delivery staging directory hierarchy, for example:
    /hive/data/genomes/asmHubs/GCF/000/001/405/GCF_000001405.39/
 
 Nothing but symlinks here and just the delivery files for hgdownload:
   ls -ogLd *
 -rw-rw-r-- 1  945974069 Sep 10  2019 GCF_000001405.39.2bit
 -rw-rw-r-- 1     888417 Sep 10  2019 GCF_000001405.39.agp.gz
 -rw-rw-r-- 1      18097 Sep 10  2019 GCF_000001405.39.chrom.sizes.txt
 -rw-rw-r-- 1      29424 Sep 25 11:58 GCF_000001405.39.chromAlias.txt
 -rw-rw-r-- 1 2915673072 Jul 16 15:16 GCF_000001405.39.trans.gfidx
 -rw-rw-r-- 1 2262217236 Jul 16 15:19 GCF_000001405.39.untrans.gfidx
 drwxrwxr-x 2       4096 Sep 23 15:33 bbi
 -rw-rw-r-- 1        354 Dec  1 12:16 genomes.txt
 -rw-rw-r-- 1        508 Dec  1 12:16 groups.txt
 drwxrwxr-x 2       4096 Dec  1 12:16 html
 -rw-rw-r-- 1        240 Dec  1 12:16 hub.txt
 drwxrwxr-x 2       4096 Sep 23 15:33 ixIxx
 -rw-rw-r-- 1       9910 Sep 23 15:33 trackDb.txt
 
 Note how the names become shorter here, losing the full assembly identifier.
 Don't need that.  There should be only one 'GCF_000001405.39' assembly.
 NCBI has made a couple of mistakes and these names became duplicated for
 a couple of assemblies.  Don't care about that.  Eliminated the garbage.
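
 Since the staging directory should hold nothing but symlinks, a stray real
 file there is worth noticing before a push.  A possible check.  The
 function name notSymlinks is made up for illustration, and directories
 are skipped in case bbi, html and ixIxx are real directories of symlinks:

```shell
# List anything under a staging directory that is neither a symlink nor a
# directory; no output means only symlinks are staged there.
# (notSymlinks is a hypothetical helper name)
notSymlinks() {
  find "$1" ! -type l ! -type d
}

# usage:
#   notSymlinks /hive/data/genomes/asmHubs/GCF/000/001/405/GCF_000001405.39/
```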
 
 So, to add the construction of the delivery symlinks for your new files,
 you would add something to: ~/kent/src/hg/makeDb/doc/asmHubs/mkSymLinks.pl
 This is assuming you do want to deliver these files to hgdownload.  I would
 guess you would since external users that want to copy this assembly can copy
 this directory from hgdownload to get everything they need to operate it
 independently from us.
 
 You don't operate the scripts in .../makeDb/doc/asmHubs/ by themselves.
 They are used from makefile rules in each assembly hub definition directory.
 For example, for the primates, in the directory:
    kent/src/hg/makeDb/doc/primatesAsmHub/
 you just type 'make' and it does everything to get these items ready for
 delivery.  This is what makes the symLinks and all other files to make
 the assembly hub function.
 
 (This is *not* the build of the files in the build hierarchy, see below)
 
 Other hub directories here in makeDb/doc/:
 
 primatesAsmHub
 mammalsAsmHub
 birdsAsmHub
 fishAsmHub
 vertebrateAsmHub
 legacyAsmHub
 plantsAsmHub
 bacteriaAsmHub
 vgpAsmHub
 
 Future work will create:
 
 fungiAsmHub
 invertebrateAsmHub
 viralAsmHub
 protozoaAsmHub
 bacteriaAsmHub
 archaeaAsmHub
 
 #############################################################################
 ### To run up a build of an assembly ###
 #############################################################################
 
 The actual build is taking place with the help of the 'runBuild'
 script (copy here in ~/kent/src/hg/makeDb/doc/asmHubs/runBuild)
 
 The builds are operated from the directory:
 
    /hive/data/genomes/asmHubs/allBuild/
    (a location to accumulate log files, and run lists, thus work history)
 
 The 'runBuild' script is operated, for example, for a single assembly:
 
   time (./runBuild GCF_000001405.39_GRCh38.p13 primates Homo_sapiens) >> GCF_000001405.39.log 2>&1 &
 
 Or, typically, there may be a whole list of such commands
 ( such as in the master.run.list here:
     ~/kent/src/hg/makeDb/doc/asmHubs/master.run.list
 )
 
 These are run, for example 5 at a time:
   time (kent/src/hg/utils/automation/perlPara.pl 5 master.run.list) \
      >> bigRun.log 2>&1
 
 The 'runBuild' script is usually set up to run all steps from
 'download' to 'trackDb', and it is OK to use it like this even on
 a build that has already taken place (currently it is disabled to
 avoid trying to rebuild an assembly).  There are cases, for example,
 where I want to update all the trackDb files since something has
 been improved for trackDb, in which case I adjust the
 stepStart and stepEnd to run just the trackDb step.  (would have
 to disable the rebuild prevention)
 
 #############################################################################
 ### adding custom/local developed tracks to a GenArk hub
 #############################################################################
 
 Work in the trackData/ directory of the assembly hub, in a directory
 named for the track.  Think of this as your usual
 /hive/data/genomes/<db>/bed/myTrack/ work directory, as if it were a
 database assembly.
 
 For example, the extra pcrAmplicon track on the Monkeypox browser
 GCF_000857045.1_ViralProj15142
 
 Is developed in:
 /hive/data/genomes/asmHubs/refseqBuild/GCF/014/621/545/GCF_014621545.1_ASM1462154v1/trackData/pcrAmplicon/
 
 When your data is ready, add your big* files, ixIxx and html page description
 files to the browser with symLinks in the bbi, ixIxx and html directories:
 
 /hive/data/genomes/asmHubs/refseqBuild/GCF/014/621/545/GCF_014621545.1_ASM1462154v1/bbi/
 and
 /hive/data/genomes/asmHubs/refseqBuild/GCF/014/621/545/GCF_014621545.1_ASM1462154v1/ixIxx/
 /hive/data/genomes/asmHubs/refseqBuild/GCF/014/621/545/GCF_014621545.1_ASM1462154v1/html/
 
 To get your track added to the GenArk hub, place your trackDb.txt
 definitions in the special named file: <asmId>.userTrackDb.txt
 in the top-level build directory:
 /hive/data/genomes/asmHubs/refseqBuild/GCF/014/621/545/GCF_014621545.1_ASM1462154v1/
 for example: GCF_014621545.1_ASM1462154v1.userTrackDb.txt
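
 As an illustration only, a stanza in that userTrackDb.txt file might look
 something like the following.  The labels, the bigBed type and the file
 names under bbi/ and html/ are placeholders, not copied from the actual
 pcrAmplicon track.  The relative bbi/ and html/ paths are an assumption
 based on the symLink directories mentioned above:

```text
# hypothetical stanza: labels, type and file names are placeholders
track pcrAmplicon
shortLabel PCR Amplicon
longLabel PCR amplicon regions (placeholder label)
visibility pack
type bigBed 9
bigDataUrl bbi/GCF_014621545.1_ASM1462154v1.pcrAmplicon.bb
html html/GCF_014621545.1_ASM1462154v1.pcrAmplicon
```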
 
 Your track will push out to hgdownload with this GenArk hub the next time
 the build is run for the clade this organism is packaged in.
 
 Typical 'build' sequence to do the release of a clade set:
 
   cd ~/kent/src/hg/makeDb/doc/viralAsmHub
   # builds symLinks for delivery staging directory, constructs index pages
   # for this clade set, makes everything available on genome-test
   time (make) >> dbg 2>&1
   # when finished, examine the dbg file to see if there are any errors reported
   # by the scripts.  Then, verify it is looking good in the staging
   # directory on genome-test:
   time (make verifyTestDownload) >> test.down.log 2>&1
   # this testing is performed by the API on hgwdev.
   # this test.down.log file accumulates each time a build is run, to make sure
   # it is sane and there are no errors, grep for 'checked' to see lines such as:
 
   grep checked test.down.log
 # checked 221 hubs, 221 success, 0 fail, total tracks: 4720, 2022-09-25 14:58:55
 # checked 222 hubs, 222 success, 0 fail, total tracks: 4740, 2022-10-04 11:55:28
 
   # if you wanted to view this clade set on genome-test to see what it
   # looks like, the URL is:  https://genome-test.gi.ucsc.edu/hubs/viral/
   # each clade has a different directory here:
   #  primates mammals birds fish vertebrate plants fungi viral bacteria archaea
 
   # if it looks good on genome-test and verifyTestDownload runs without errors,
   # the hub can push to hgdownload:
 
   time (make sendDownload) >> send.down.log 2>&1
 
   # there isn't much to see in this send.down.log, it is just for the record
   # then to verify it is correct on hgdownload:
 
   time (make verifyDownload) >> verify.down.log 2>&1 &
   # this testing runs via the API on hgwbeta so that the access
   # activity logs on the RR won't be disturbed by such testing.
 
   # to see if it is sane, grep for 'checked' in this log file:
   grep checked verify.down.log
 
 # checked 221 hubs, 221 success, 0 fail, total tracks: 4720, 2022-09-25 19:42:48
 # checked 222 hubs, 222 success, 0 fail, total tracks: 4740, 2022-10-04 12:23:35
 
 #############################################################################