src/parasol/version.doc 1.44

1.44 2009/08/05 23:19:38 galt
fix problem addJob with sickBatch leaving batch on curBatches instead of oldBatches
Index: src/parasol/version.doc
===================================================================
RCS file: /projects/compbio/cvsroot/kent/src/parasol/version.doc,v
retrieving revision 1.43
retrieving revision 1.44
diff -b -B -U 1000000 -r1.43 -r1.44
--- src/parasol/version.doc	11 Sep 2008 23:17:51 -0000	1.43
+++ src/parasol/version.doc	5 Aug 2009 23:19:38 -0000	1.44
@@ -1,291 +1,294 @@
  1 - Lost in the mists of time
  
  2 - Beyond the reach of our records
  
  3 - A version that spanned many recompiles.  Sometimes was stable.
  
  4 - Fixed problem where nodes would be marked as dead if they
      took longer than 10 minutes to run a job.
      
  5 - Added logic to cope more gracefully with nodes that are 
      running jobs but momentarily can't be reached by hub.
      
  6 - Changing paraHub from processes to threads, and from TCP/IP
      to UDP.  Separating out node port and hub port so that same
      machine can run node and hub daemons.  Fixing bug where
      it would crash when you did a 'parasol remove jobs user'.
      
  7 - Some changes on paraNode to hopefully make it harder for
      a node to lose track of the jobs it is running.  Using
      waitpid() rather than wait, and eliminating the clearZombies().
      
  8 - Made paraHub just warn rather than abort on failure to close
      a results file cleanly.  Also changed auto-close time to
      1 minute rather than 5.
      
  9 - Added good/bad count to machine struct. 
      Changed logging to use syslog.
      Corrected setting of user's groups for better security.
      
 10 - Added priorities to batches. Added para status command.
      Changed paraNode to only do SU when running as root,
      so that stand-alone single user testing works.
      Also changed happy dots so it only outputs them if going
      to a terminal, not a file or emacs shell buffer.
      Added host as penultimate column in paraJobStatus display.
 
      Fixed bug in paraNode err msg when fetching a file not found.
      Fixed bug in paraNode during hub startup which corrupted para.results.
      Fixed bug in paraHub causing crash when using priorities.
      Made paraHub into a proper daemon which is better for normal usage.
      Added -log=path/logfile to paraHub and paraNode so that users
      debugging without admin privileges can see output.
      Added -debug option to paraHub so it won't demonize for debugging.
 
 11 - Added estimated completion time to para. -eta for time function.
      Added -maxNode to limit number of nodes a batch can have.
      Added longest running job to para time output and renamed
      longest job to longest finished job. Fixed paraFormatIp to 
      return host instead of network order. Added reminder to 
      documentation that the machine names in .ms files should 
      match the $HOST variables on the nodes.  Fixed parasol remove machine 
      so you don't have to do it twice, and so it logs user and requires
      a comment for the log so admin knows why it was removed.
 
 12 - Added sick node and sick batch detection.  Each batch keeps a hash 
      of sick nodes and starts avoiding the ones that seem sick.  
      It currently counts a node as sick if it crashes 3 times in a row.
      A batch considers itself sick and automatically chills itself
      after 25 failures in a row.  These sickness thresholds could be made 
      configurable in the future.  There are two new para commands,
      one for showing sick node stats for the batch, and another for
      clearing those stats.  There is a new parasol command for showing
      machines among currently running users that are in universal
      agreement about them being sick.  This list could inform the use
      of the parasol remove node command.  A batch will quickly learn
      which nodes are probably bad and simply avoid them.  This is aimed
      at preventing the loss of a night's work on the cluster due to one
      or two bad nodes that pump through most of a large queue and 
      erroneously failing all the jobs. A batch that is itself sick 
      will not affect the stats on other batches.  There is an additional 
      sick nodes hash maintained by the user structure that simply reflects 
      the condition of all its batches for a node being sick, this is done 
      just for efficiency in the algorithm that doles out the machines 
      to waiting users' batches. Added a sick nodes summary line to para check.
 
      Added a minor improvement to pstat as pstat2 so that it optionally 
      passes the user/batch and the hub only returns the relevant
      queued jobs that are in the batch, and not just sending all batches' queued jobs.
      So if other batches have huge job queues, you won't suffer.
      pstat2 now has to return the number of total jobs in the system as an extra 
      output line because para needs to know when the hub's internal maximum 
      queue size has been reached.  This is also where we get a chance
      to add extra line saying that the batch is sick, so that para gets the message
      and stops trying to shove.
 
      Added an option -extended to parasol list jobs to include the batch dir in the output.
 
      Finished the implementation of -killTime and -warnTime, which had
      never been completed.  This creates the new status "hung" which 
      is like the other two statuses "ranOk" and "crash".  If a running
      job exceeds killTime, then the job is removed and hung is set to true.
      if job exceeds warnTime, then it appears in various reports as a
      slow running job.  The defaults are 3 days for warnTime and off 
      for killTime.  Originally killTime was to default to 2 weeks, but since
      it has not been working for several years, turning it on by default now
      was deemed unnecessary.  
 
      Added para command resetCounts which resets the done and crashed counts on the hub
      for the batch, visible in parasol list batches.
 
      Added para command freeBatch which if there is nothing running or queued, 
      will free the batch from all structures and then free itself, recovering memory.
 
      Added para.bookmark to mark how much of para.results has already been
      successfully processed, which means that the system can just read and
      process any new results without having to re-do the whole results file.
 
      Added -verbose=2 elapsed time information about the major steps such as
      reading the batch, fetching pstat info from hub about queued and running jobs,
      reading para.results and hashing, and writing the batch.
 
      Changed pstat2 to return packed data with multiple lines per packet.
      And for queue data, passing just the one field needed (jobid).
      This has a tremendous speedup on batch pstat for large batches, and moves pstat
      from being a big speed problem to being fast.  This also required a minor
      improvement to the multiple-line query processing in para and parasol
      to handle multiple lines per packet.
 
      Added batch->queuedCount to replace dlCount(batch->jobQueue) which was slow.
 
 12.01 - Added versioning in all para* executables at Jorge's request.
 
 12.02 - Changed machine->job to machine->jobs list, removed hack duplication of machines
         in the machine list and now nodes are like real nodes.  They know the machSpec
         now too.  This will pave the way to future enhancements enabling algorithms
         that can match user-requirements against available node-resources.
 
 12.03 - Remove machine now sends feedback if machine name not found.
         Fixed minor bug in new parasol add machine command.
 
 12.04 - Added the ability to specify job usage with para create and para make 
         commandline options such as -cpu=2 -ram=2000000000 which would generate 
         cpu and ram usage values in every row of the batch.
 
         Clauses like {use cpu 2} {use ram 2000000000} can also be put
         directly in the spec file.  If both are used, the commandline options take
         precedence over the spec file. RAM usage clauses containing t,g,m,k 
         are expanded to TeraByte, GB, MB, and KB bigint values respectively.
 
         The addJob2 is used by para to tell hub to expect extra parameters.
         The hub passes the extra 2 params for cpus and ram to the paraNode,
         and paraNode returns them back again during listJobs requests.
         Fields were added to runJobMessage and job structs.  
 
         The batch file structure is now INCOMPATIBLE with previous versions 
         of para because of the new fields. 
 
         The hub itself does not yet act on the job usage info.
         A version of hub that does use the job usage info is planned, perhaps just 
         basic support for hog jobs and supporting only uniform clusters
         with machines all of the same type.  
 
         Also fixed a bug in freeBatch related to resultQueues.
 
 12.05   Fixed bug in para times, para status, and para problems due to the 
         para.bookmark optimization.  These commands still used fields in
         the jr struct.  Now these have been shifted to use equivalent fields
         in the sub struct.  We had to add sub->errFile to the submission struct.
 
         Also removed the unused para time option -eta.
 
         Also there no need to return the resultsHash now.  It previously had a bug 
         in para shove causing a leak because resultsHash was not being freed.
         Now it is freed at the end of the routine, and it is not returned,
         so the bug is gone.
 
 	Also fixed results over-run bug.  pstat2 in hub now returns resultsSize
         immediately after the results flush, and para uses this to stop reading
         results that are newer than the current cycle.  This prevents subtle
         tracking errors otherwise introduced by bookmark.
 
         Also changed spec usage to override commandline usage parameters as requested.
 
 12.06   Fixed minor bug in hung endtime for para status.
 
         Fixed hash size to use digitsBaseTwo().
 
         Fixed listSickNodes to include only active users, as was originally intended.
 
 12.07   Made a new scheduler that can deal with different sized cpu and ram usage requests.
         It should be friendly within reason to mixed clusters with machines of varying 
         capability. There is an array with various numbers of cpus free, and each 
         has an array with various numbers of GB of ram free.  The system allocates
         the nodes much like it allocates them today, with finding lucky user/batch,
         finding the most modest machine which still meets the batch requirements,
         and then allocating that node, subtracting the resources used, re-categorizing
         the node.  There will be minimum usage units, e.g. 1 cpu and 0.5 GB ram for kk.
         There will be a default 1 cpu and perhaps node ram/cpus for jobs that don't
         specify.  This could be controlled by commandline parameter in future,
         currently sets defaults based on the machines spec list. 
  
         It can re-balance the entire cluster in a fraction of a second.
         It will create a plan for the batches that are supposed to be running.
         That will not actually kill any running jobs, but new jobs will 
         be started from the plan until eventually all batches in the planned mix
         will be running their jobs on the node.  No recalculation will be needed
         until a batch either runs out of jobs, or a new batch starts, or some
         other event disturbs the equilibrium. 
 
         To help prevent slow jobs from hogging the cluster during the shift from 
         one equilibrium plan to another, a choke prevents jobs from running until 
         the runningCount is less than the planCount.
         Also, for really slow jobs run with maxJob (aka maxNode) setting, 
         the system does a prescan allocating them to the exact same nodes
         so that they are handled more smoothly.
 
 	For the parasol list batches command, new columns showing cpu, ram, and 
         min (average job run length minutes) have been added.  A new plan command
         exists for parasol -- not only does it trigger an immediate planning event,
         but it gives useful output for users interested in the details.
 
 12.08   Added parasol multi-messaging protocol and other protections against 
         receiving duplicate messages.  This was proven a problem for paraHub restart,
         and is extremely likely to also have occurred with para problems, and 
         the hub pstat and list jobs, and other similar situations with 
         multi-packet responses.  
 
         This version depends on a newer version of rudp in kent/src/lib,
         and we also updated the paraMessage to support pmm or parasol multi-messages.
         Much of this work had been done earlier and was just pulled in from
         an old sandbox.  
 
         Also added to rudp checking that the ack for rudpSend is coming
         back from the right ip and port.  Dozens of bad ack cases like this were 
         occurring per test run of 100,000 jobs.
 
         Fixed old bug in nodeAlive when jobid found in queue still.  This caused
         occasional strangeness like negative user runningCount, and that user 
         would hog the cluster.  Simply added code to change the counts in the
         opposite direction from requeueJob.
 
         Added to rudp duplicate packet filtering.  Duplicate packets have 
         always been possible and happening at a low rate.  There was no
         protection from receiving duplicate packets/messages.  It was 
         particularly a problem during hub restart when scanning all the nodes
         for running jobs.  The duplicate packet filtering was realized by 
         sending the process-id and the rudp-connection id (auto-incremented
         in a thread-safe way) so that the receiver can maintain a hash-indexed
         list of packets seen.  Duplicates are easily found and eliminated.
         New info is added to end of list and inserted into the hash.
         Old info is removed from the start of the list and removed from the hash.
         It remembers packets for 8 seconds.  The code for removing duplicate 
         packets in the pmm layer is now probably redundant but harmless.
 
         This necessitated adding an option to the hash library to use regular 
         heap memory instead of local stack-based memory.
         Since we needed to automatically re-size the hash quickly,
         we added resizing as a new hash capability and added a boolean flag for
         automatic hash expansion.  Also added hashVal to hash element struct
         so that the hashString call is avoided on hash resizing.
 
 
         New command - added for checking dead nodes for resurrection ASAP.
         This was requested by the admins, after they have fixed a problem
         on a bunch of machines, and don't want to wait 20 minutes for 
         gravedigger to get around to checking.
           parasol check dead - Check machines marked dead ASAP, some have been fixed.
   
         New command - added a command for people NOT using the para client to clear
         the sick statistics (e.g. sick batch and sick nodes) on their batch.
         Kevin Karplus may start using parasol client to directly run jobs 
         without using para client.  These users should be able to clear sick stats.
           parasol [options] clear sick  - Clear sick stats on a batch.
             options: -results=file
 
         Added option to para command, -batch, to allow specifying
         a directory other then the current directory for storing
         the parasol batch data files.  This allows running multiple
         batches from the same directory.
 
         Added option to para command, -jobCwd, to specify the current
         working directory for jobs in a batch.
 
         Added actual cost per job instead of just assuming 1. 
         In allocating node resources to batches, it now charges more 
         for jobs that use more ram and/or cpu.
 
 
 12.09   Fixed typo in para usage message "in defaults" should be "it defaults"
         for MarkD's new feature.
 
         Added more time info to err message in para.c for "Strange time in ..."
 
         Added retry with sleep for pmSendString and sickBatch based on feedback
         from MarkD with 2 genbank para make failures, and Brian with flakey SAN.
 
 12.10   Made cpus busy/free counts more according to expectations. Previously it had
         counted all cpus on a machine as busy if even one job was running on it.
 
         Fixed minor param-checking bug in "parasol check dead"
 
+        Fixed minor problem with addJob fails due to sick batch, but findBatch side-effect
+        triggers batch back onto curBatches, which makes planner think it is still active.
+