src/parasol/version.doc 1.44
1.44 2009/08/05 23:19:38 galt
fix problem addJob with sickBatch leaving batch on curBatches instead of oldBatches
Index: src/parasol/version.doc
===================================================================
RCS file: /projects/compbio/cvsroot/kent/src/parasol/version.doc,v
retrieving revision 1.43
retrieving revision 1.44
diff -b -B -U 1000000 -r1.43 -r1.44
--- src/parasol/version.doc 11 Sep 2008 23:17:51 -0000 1.43
+++ src/parasol/version.doc 5 Aug 2009 23:19:38 -0000 1.44
@@ -1,291 +1,294 @@
1 - Lost in the mists of time
2 - Beyond the reach of our records
3 - A version that spanned many recompiles. Sometimes was stable.
4 - Fixed problem where nodes would be marked as dead if they
took longer than 10 minutes to run a job.
5 - Added logic to cope more gracefully with nodes that are
running jobs but momentarily can't be reached by hub.
6 - Changing paraHub from processes to threads, and from TCP/IP
to UDP. Separating out node port and hub port so that same
machine can run node and hub daemons. Fixing bug where
it would crash when you did a 'parasol remove jobs user'.
7 - Some changes on paraNode to hopefully make it harder for
a node to lose track of the jobs it is running. Using
waitpid() rather than wait, and eliminating the clearZombies().
8 - Made paraHub just warn rather than abort on failure to close
a results file cleanly. Also changed auto-close time to
1 minute rather than 5.
9 - Added good/bad count to machine struct.
Changed logging to use syslog.
Corrected setting of user's groups for better security.
10 - Added priorities to batches. Added para status command.
Changed paraNode to only do SU when running as root,
so that stand-alone single user testing works.
Also changed happy dots so it only outputs them if going
to a terminal, not a file or emacs shell buffer.
Added host as penultimate column in paraJobStatus display.
Fixed bug in paraNode err msg when fetching a file not found.
Fixed bug in paraNode during hub startup which corrupted para.results.
Fixed bug in paraHub causing crash when using priorities.
Made paraHub into a proper daemon which is better for normal usage.
Added -log=path/logfile to paraHub and paraNode so that users
debugging without admin privileges can see output.
Added -debug option to paraHub so it won't daemonize for debugging.
11 - Added estimated completion time to para. -eta for time function.
Added -maxNode to limit number of nodes a batch can have.
Added longest running job to para time output and renamed
longest job to longest finished job. Fixed paraFormatIp to
return host instead of network order. Added reminder to
documentation that the machine names in .ms files should
match the $HOST variables on the nodes. Fixed parasol remove machine
so you don't have to do it twice, and so it logs user and requires
a comment for the log so admin knows why it was removed.
12 - Added sick node and sick batch detection. Each batch keeps a hash
of sick nodes and starts avoiding the ones that seem sick.
It currently counts a node as sick if it crashes 3 times in a row.
A batch considers itself sick and automatically chills itself
after 25 failures in a row. These sickness thresholds could be made
configurable in the future. There are two new para commands,
one for showing sick node stats for the batch, and another for
clearing those stats. There is a new parasol command for showing
machines that all currently running users agree are
sick. This list could inform the use
of the parasol remove node command. A batch will quickly learn
which nodes are probably bad and simply avoid them. This is aimed
at preventing the loss of a night's work on the cluster due to one
or two bad nodes that pump through most of a large queue and
erroneously fail all the jobs. A batch that is itself sick
will not affect the stats on other batches. There is an additional
sick nodes hash maintained by the user structure that simply reflects
the condition of all its batches for a node being sick; this is done
just for efficiency in the algorithm that doles out the machines
to waiting users' batches. Added a sick nodes summary line to para check.
Added a minor improvement to pstat as pstat2 so that it optionally
passes the user/batch and the hub only returns the relevant
queued jobs that are in the batch, rather than sending all batches' queued jobs.
So if other batches have huge job queues, you won't suffer.
pstat2 now has to return the number of total jobs in the system as an extra
output line because para needs to know when the hub's internal maximum
queue size has been reached. This is also where we get a chance
to add an extra line saying that the batch is sick, so that para gets the message
and stops trying to shove.
Added an option -extended to parasol list jobs to include the batch dir in the output.
Finished the implementation of -killTime and -warnTime, which had
never been completed. This creates the new status "hung" which
is like the other two statuses "ranOk" and "crash". If a running
job exceeds killTime, then the job is removed and hung is set to true.
If a job exceeds warnTime, then it appears in various reports as a
slow running job. The defaults are 3 days for warnTime and off
for killTime. Originally killTime was to default to 2 weeks, but since
it has not been working for several years, turning it on by default now
was deemed unnecessary.
Added para command resetCounts which resets the done and crashed counts on the hub
for the batch, visible in parasol list batches.
Added para command freeBatch which if there is nothing running or queued,
will free the batch from all structures and then free itself, recovering memory.
Added para.bookmark to mark how much of para.results has already been
successfully processed, which means that the system can just read and
process any new results without having to re-do the whole results file.
Added -verbose=2 elapsed time information about the major steps such as
reading the batch, fetching pstat info from hub about queued and running jobs,
reading para.results and hashing, and writing the batch.
Changed pstat2 to return packed data with multiple lines per packet.
And for queue data, passing just the one field needed (jobid).
This has a tremendous speedup on batch pstat for large batches, and moves pstat
from being a big speed problem to being fast. This also required a minor
improvement to the multiple-line query processing in para and parasol
to handle multiple lines per packet.
Added batch->queuedCount to replace dlCount(batch->jobQueue) which was slow.
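The sick node and sick batch thresholds described in entry 12 can be
sketched as below. This is only an illustration in Python, not the paraHub
code: the class and method names are hypothetical, and it assumes that any
successful job resets both the per-node and the per-batch failure streak,
which the text implies ("in a row") but does not spell out.

```python
NODE_SICK_THRESHOLD = 3    # consecutive crashes on one node (from the text)
BATCH_SICK_THRESHOLD = 25  # consecutive failures in the batch (from the text)


class BatchHealth:
    """Hypothetical per-batch tracker; not the actual hub data structure."""

    def __init__(self):
        self.node_streak = {}   # node name -> consecutive failures on that node
        self.batch_streak = 0   # consecutive failures across the whole batch

    def record(self, node, ok):
        """Record one finished job on `node`; `ok` is True if it ran cleanly."""
        if ok:
            # Assumption: a success clears both streaks.
            self.node_streak[node] = 0
            self.batch_streak = 0
        else:
            self.node_streak[node] = self.node_streak.get(node, 0) + 1
            self.batch_streak += 1

    def node_is_sick(self, node):
        return self.node_streak.get(node, 0) >= NODE_SICK_THRESHOLD

    def batch_is_sick(self):
        return self.batch_streak >= BATCH_SICK_THRESHOLD
```

A scheduler using such a tracker would skip nodes for which node_is_sick()
is true and stop dispatching once batch_is_sick() is true.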
12.01 - Added versioning in all para* executables at Jorge's request.
12.02 - Changed machine->job to machine->jobs list, removed hack duplication of machines
in the machine list and now nodes are like real nodes. They know the machSpec
now too. This will pave the way to future enhancements enabling algorithms
that can match user-requirements against available node-resources.
12.03 - Remove machine now sends feedback if machine name not found.
Fixed minor bug in new parasol add machine command.
12.04 - Added the ability to specify job usage with para create and para make
commandline options such as -cpu=2 -ram=2000000000 which would generate
cpu and ram usage values in every row of the batch.
Clauses like {use cpu 2} {use ram 2000000000} can also be put
directly in the spec file. If both are used, the commandline options take
precedence over the spec file. RAM usage clauses containing t,g,m,k
are expanded to TeraByte, GB, MB, and KB bigint values respectively.
The addJob2 is used by para to tell hub to expect extra parameters.
The hub passes the extra 2 params for cpus and ram to the paraNode,
and paraNode returns them back again during listJobs requests.
Fields were added to runJobMessage and job structs.
The batch file structure is now INCOMPATIBLE with previous versions
of para because of the new fields.
The hub itself does not yet act on the job usage info.
A version of hub that does use the job usage info is planned, perhaps just
basic support for hog jobs and supporting only uniform clusters
with machines all of the same type.
Also fixed a bug in freeBatch related to resultQueues.
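The expansion of RAM clauses with t,g,m,k suffixes in entry 12.04 might
look roughly like the sketch below. The helper name parse_ram is
hypothetical, and the use of 1024-based multiples is an assumption; the
text does not say whether the suffixes are binary or decimal.

```python
# Assumption: binary multiples. The changelog does not state the base used.
RAM_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def parse_ram(text):
    """Expand a value such as '2g' into a byte count (hypothetical helper)."""
    text = text.strip().lower()
    if text and text[-1] in RAM_UNITS:
        return int(text[:-1]) * RAM_UNITS[text[-1]]
    return int(text)  # plain byte count, e.g. -ram=2000000000


print(parse_ram("2000000000"))
print(parse_ram("2g"))
```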
12.05 Fixed bug in para times, para status, and para problems due to the
para.bookmark optimization. These commands still used fields in
the jr struct. Now these have been shifted to use equivalent fields
in the sub struct. We had to add sub->errFile to the submission struct.
Also removed the unused para time option -eta.
Also there is no need to return the resultsHash now. It previously had a bug
in para shove causing a leak because resultsHash was not being freed.
Now it is freed at the end of the routine, and it is not returned,
so the bug is gone.
Also fixed results over-run bug. pstat2 in hub now returns resultsSize
immediately after the results flush, and para uses this to stop reading
results that are newer than the current cycle. This prevents subtle
tracking errors otherwise introduced by bookmark.
Also changed spec usage to override commandline usage parameters as requested.
12.06 Fixed minor bug in hung endtime for para status.
Fixed hash size to use digitsBaseTwo().
Fixed listSickNodes to include only active users, as was originally intended.
12.07 Made a new scheduler that can deal with different sized cpu and ram usage requests.
It should be friendly within reason to mixed clusters with machines of varying
capability. There is an array with various numbers of cpus free, and each
has an array with various numbers of GB of ram free. The system allocates
the nodes much like it allocates them today, by finding the lucky user/batch,
finding the most modest machine which still meets the batch requirements,
and then allocating that node, subtracting the resources used, re-categorizing
the node. There will be minimum usage units, e.g. 1 cpu and 0.5 GB ram for kk.
There will be a default 1 cpu and perhaps node ram/cpus for jobs that don't
specify. This could be controlled by commandline parameter in future,
currently sets defaults based on the machines spec list.
It can re-balance the entire cluster in a fraction of a second.
It will create a plan for the batches that are supposed to be running.
That will not actually kill any running jobs, but new jobs will
be started from the plan until eventually all batches in the planned mix
will be running their jobs on the node. No recalculation will be needed
until a batch either runs out of jobs, or a new batch starts, or some
other event disturbs the equilibrium.
To help prevent slow jobs from hogging the cluster during the shift from
one equilibrium plan to another, a choke prevents jobs from running until
the runningCount is less than the planCount.
Also, for really slow jobs run with maxJob (aka maxNode) setting,
the system does a prescan allocating them to the exact same nodes
so that they are handled more smoothly.
For the parasol list batches command, new columns showing cpu, ram, and
min (average job run length minutes) have been added. A new plan command
exists for parasol -- not only does it trigger an immediate planning event,
but it gives useful output for users interested in the details.
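The "most modest machine which still meets the batch requirements" step in
entry 12.07 can be illustrated with the sketch below. It is not the hub's
scheduler: the function name and the tuple layout are hypothetical, and
ordering candidates by free cpus first and free ram second is an assumption
suggested by the cpu-then-ram array layout described above.

```python
def pick_modest_machine(machines, need_cpus, need_ram_gb):
    """Return the name of the smallest machine that still fits the request.

    machines: list of (name, free_cpus, free_ram_gb) tuples (hypothetical layout).
    Returns None when no machine has enough free resources.
    """
    fits = [m for m in machines if m[1] >= need_cpus and m[2] >= need_ram_gb]
    if not fits:
        return None
    # Assumption: fewest free cpus wins, ties broken by least free ram.
    return min(fits, key=lambda m: (m[1], m[2]))[0]


machines = [("big", 8, 32), ("mid", 2, 4), ("small", 1, 1)]
print(pick_modest_machine(machines, 1, 2))
```

Choosing the smallest fitting machine leaves the larger machines free for
batches that actually need them.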
12.08 Added parasol multi-messaging protocol and other protections against
receiving duplicate messages. This was proven a problem for paraHub restart,
and is extremely likely to also have occurred with para problems, and
the hub pstat and list jobs, and other similar situations with
multi-packet responses.
This version depends on a newer version of rudp in kent/src/lib,
and we also updated the paraMessage to support pmm or parasol multi-messages.
Much of this work had been done earlier and was just pulled in from
an old sandbox.
Also added to rudp checking that the ack for rudpSend is coming
back from the right ip and port. Dozens of bad ack cases like this were
occurring per test run of 100,000 jobs.
Fixed old bug in nodeAlive when jobid is still found in queue. This caused
occasional strangeness like negative user runningCount, and that user
would hog the cluster. Simply added code to change the counts in the
opposite direction from requeueJob.
Added to rudp duplicate packet filtering. Duplicate packets have
always been possible and happening at a low rate. There was no
protection from receiving duplicate packets/messages. It was
particularly a problem during hub restart when scanning all the nodes
for running jobs. The duplicate packet filtering was realized by
sending the process-id and the rudp-connection id (auto-incremented
in a thread-safe way) so that the receiver can maintain a hash-indexed
list of packets seen. Duplicates are easily found and eliminated.
New info is added to end of list and inserted into the hash.
Old info is removed from the start of the list and removed from the hash.
It remembers packets for 8 seconds. The code for removing duplicate
packets in the pmm layer is now probably redundant but harmless.
This necessitated adding an option to the hash library to use regular
heap memory instead of local stack-based memory.
Since we needed to automatically re-size the hash quickly,
we added resizing as a new hash capability and added a boolean flag for
automatic hash expansion. Also added hashVal to hash element struct
so that the hashString call is avoided on hash resizing.
New command - added for checking dead nodes for resurrection ASAP.
This was requested by the admins, after they have fixed a problem
on a bunch of machines, and don't want to wait 20 minutes for
gravedigger to get around to checking.
parasol check dead - Check machines marked dead ASAP, some have been fixed.
New command - added a command for people NOT using the para client to clear
the sick statistics (e.g. sick batch and sick nodes) on their batch.
Kevin Karplus may start using parasol client to directly run jobs
without using para client. These users should be able to clear sick stats.
parasol [options] clear sick - Clear sick stats on a batch.
options: -results=file
Added option to para command, -batch, to allow specifying
a directory other than the current directory for storing
the parasol batch data files. This allows running multiple
batches from the same directory.
Added option to para command, -jobCwd, to specify the current
working directory for jobs in a batch.
Added actual cost per job instead of just assuming 1.
In allocating node resources to batches, it now charges more
for jobs that use more ram and/or cpu.
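The duplicate packet filtering described in entry 12.08 (a hash-indexed
list, new info appended at the end, old info dropped from the start after
8 seconds) can be sketched as below. This is an illustration, not the rudp
code: the class name is hypothetical, including the sender address in the
key is an assumption, and the caller is assumed to pass non-decreasing
timestamps.

```python
from collections import OrderedDict


class DuplicateFilter:
    """Hypothetical sketch of a time-limited 'packets seen' table."""

    def __init__(self, ttl=8.0):
        self.ttl = ttl             # seconds a packet is remembered (from the text)
        self.seen = OrderedDict()  # (source, pid, conn_id) -> arrival time

    def is_duplicate(self, source, pid, conn_id, now):
        # Drop expired entries from the front; insertion order is arrival order.
        while self.seen:
            oldest_key = next(iter(self.seen))
            if now - self.seen[oldest_key] < self.ttl:
                break
            self.seen.popitem(last=False)
        key = (source, pid, conn_id)  # assumption: source is part of the key
        if key in self.seen:
            return True
        self.seen[key] = now          # new info goes at the end of the list
        return False
```

An OrderedDict gives both the hash lookup and the ordered list in one
structure, so expiry only ever touches the oldest entries.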
12.09 Fixed typo in para usage message "in defaults" should be "it defaults"
for MarkD's new feature.
Added more time info to err message in para.c for "Strange time in ..."
Added retry with sleep for pmSendString and sickBatch based on feedback
from MarkD with 2 genbank para make failures, and Brian with flakey SAN.
12.10 Made cpus busy/free counts better match expectations. Previously it had
counted all cpus on a machine as busy if even one job was running on it.
Fixed minor param-checking bug in "parasol check dead"
+ Fixed minor problem where addJob fails due to sick batch, but a findBatch side-effect
+ puts the batch back onto curBatches, which makes the planner think it is still active.
+