ce163cf46702cd8fea306a972ce85ceaff19e41d
kent
  Mon Sep 30 19:09:53 2024 -0700
Updating CDW main README and adding install instructions and a script.

diff --git src/hg/cirm/cdw/README src/hg/cirm/cdw/README
index c209dca..7bb8373 100644
--- src/hg/cirm/cdw/README
+++ src/hg/cirm/cdw/README
@@ -1,24 +1,29 @@
 Here lies code to implement the CIRM Data Warehouse (CDW).  The CDW is designed to track 
 a moderate number (1,000,000's) of big (>1GB) files.  It consists of three parts:
+
 1) A Linux file system with a fairly large block size.
 2) A MySQL database with tables tracking
     a) Users (by their email name)
     b) Submits (by date/submitter/hub)
     c) Files (by hub file name and CDW license plate).
     d) Groups and file permissions
     e) Results of automatic quality assurance results on files.
+3) A bunch of command line programs, some of which are run by a daemon, that load
+   the database and run quality assurance on its contents.
+
+Please see the install/README file for how to set up the CDW.
 
 The schema for the database is in lib/*.as, with most of the information in cdw.as. 
 There is also a script lib/resetCdw that will delete the existing database and create 
 a new one on the test site (hgwdev).  This should be viewed as documentation rather than
 as a program to run at this point, since the test database has useful stuff. The programs
 that interact with the database directly are in C, and all start with the "cdw" prefix.  
 Arguably the most important program is cdwSubmit.  This program takes a tab separated manifest
 file that contains a line for each file in the submission, and a tag-storm format metadata
 file that describes the experiments the files came from, copies the files into the warehouse
 directory, and puts the file into the cdwFile table.  It also adds jobs to a table
 for the validation/QA script.
 
 The validation is done asynchronously.  It is driven by the cdwRunDaemon program, which looks
 for new rows in the cdwJob table,  and runs them, keeping a configurable number of jobs (currently
 14) running in parallel.  The validation is done by a simple linear shell script, cdwQaAgent, 
@@ -300,36 +305,35 @@
 usage:
    cdwVcfStats in.vcf out.ra
 options:
    -bed=out.bed - make simple bed3 here
 
 
 MANAGE AUTOMATICALLY RUN JOBS
 --------------------------------------------------------------------------------
 $ cdwRunDaemon
 
 cdwRunDaemon v3 - Run jobs on multiple processors in background.  This is done with
 a combination of infrequent polling of the database, and a unix fifo which can be
 sent a signal (anything ending with a newline actually) that tells it to go look
 at database now.
 usage:
-   cdwRunDaemon database table count fifo
+   cdwRunDaemon database table count
 where:
    database - mySQL database where cdwRun table lives
    table - table with same six fields as cdwRun table
    count - number of simultaneous jobs to run
-   fifo - named pipe used to notify daemon of new data
 options:
    -debug - don't fork, close stdout, and daemonize self
    -log=logFile - send error messages and warnings of daemon itself to logFile
         There are not many of these.  Error messages from jobs daemon runs end up
         in errorMessage fields of database.
    -logFacility - sends error messages and such to system log facility instead.
    -delay=N - delay this many seconds before starting a job, default 1
 --------------------------------------------------------------------------------
 $ cdwRetryJob
 
 cdwRetryJob - Add jobs that failed back to a cdwJob format queue.
 usage:
    cdwRetryJob database jobTable
 options:
    -dry - dry run, just print jobs that would rerun
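
The wake-up mechanism described in the cdwRunDaemon section above (a unix fifo where anything ending with a newline tells the daemon to check the database immediately, rather than waiting for its next infrequent poll) can be sketched with portable shell tools. Everything below is a hypothetical stand-in for illustration; the fifo path and messages are invented, and this is not the daemon's actual implementation.

```shell
# Illustrative sketch only: modeling the daemon's fifo wake-up with
# standard shell tools.  All names here are stand-ins, not cdw code.
fifo=$(mktemp -u /tmp/cdwDemo.XXXXXX)   # hypothetical fifo path
mkfifo "$fifo"

result=$(
    # Stand-in for the daemon's listener: block until any
    # newline-terminated write arrives, then "go look at the database".
    { read -r line < "$fifo" && echo "woke: re-polling cdwJob table"; } &
    # Stand-in for a submitter nudging the daemon after adding new rows:
    echo "new jobs" > "$fifo"
    wait
)
rm -f "$fifo"
echo "$result"
```

The appeal of this design is that the daemon stays responsive to new submissions without busy-polling MySQL: the blocking read on the fifo costs nothing while idle, and the periodic poll remains as a fallback in case a notification is missed.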