ce163cf46702cd8fea306a972ce85ceaff19e41d kent Mon Sep 30 19:09:53 2024 -0700
Updating CDW main README and adding install instructions and a script.

diff --git src/hg/cirm/cdw/README src/hg/cirm/cdw/README
index c209dca..7bb8373 100644
--- src/hg/cirm/cdw/README
+++ src/hg/cirm/cdw/README
@@ -1,24 +1,29 @@
 Here lies code to implement the CIRM Data Warehouse (CDW).  The CDW is designed to
 track a moderate number (1,000,000's) of big (>1GB) files.  It consists of three parts:
+
 1) A Linux file system with a fairly large block size.
 2) A MySQL database with tables tracking
    a) Users (by their email name)
    b) Submits (by date/submitter/hub)
    c) Files (by hub file name and CDW license plate).
    d) Groups and file permissions
    e) Results of automatic quality assurance on files.
+3) A bunch of command line programs, some of which are run by a daemon, that load
+   the database and run quality assurance on its contents.
+
+Please see the install/README file for how to set up the CDW.

 The schema for the database is in lib/*.as, with most of the information in cdw.as.
 There is also a script lib/resetCdw that will delete the existing database and create
 a new one on the test site (hgwdev).  This should be viewed as documentation rather
 than as a program to run at this point, since the test database has useful stuff.

 The programs that interact with the database directly are in C, and all start with
 the "cdw" prefix.  Arguably the most important program is cdwSubmit.  This program
 takes a tab-separated manifest file that contains a line for each file in the
 submission, and a tag-storm format metadata file that describes the experiments the
 files came from.  It copies the files into the warehouse directory and puts each
 file into the cdwFile table.  It also adds jobs to a table for the validation/QA
 script.  The validation is done asynchronously.
 It is driven by the cdwRunDaemon program, which looks for new rows in the cdwJob
 table and runs them, keeping a configurable number of jobs (currently 14) running
 in parallel.  The validation is done by a simple linear shell script, cdwQaAgent,
@@ -300,36 +305,35 @@
 usage:
    cdwVcfStats in.vcf out.ra
 options:
    -bed=out.bed - make simple bed3 here

 MANAGE AUTOMATICALLY RUN JOBS
 --------------------------------------------------------------------------------
 $ cdwRunDaemon
 cdwRunDaemon v3 - Run jobs on multiple processors in background.  This is done with
 a combination of infrequent polling of the database, and a unix fifo which can be
 sent a signal (anything ending with a newline actually) that tells it to go look at
 the database now.
 usage:
-   cdwRunDaemon database table count fifo
+   cdwRunDaemon database table count
 where:
    database - mySQL database where cdwRun table lives
    table - table with same six fields as cdwRun table
    count - number of simultaneous jobs to run
-   fifo - named pipe used to notify daemon of new data
 options:
    -debug - don't fork, close stdout, and daemonize self
    -log=logFile - send error messages and warnings of daemon itself to logFile.
      There are not many of these.  Error messages from jobs the daemon runs end up
      in errorMessage fields of the database.
    -logFacility - sends error messages and such to system log facility instead.
    -delay=N - delay this many seconds before starting a job, default 1
 --------------------------------------------------------------------------------
 $ cdwRetryJob
 cdwRetryJob - Add jobs that failed back to a cdwJob format queue.
 usage:
    cdwRetryJob database jobTable
 options:
    -dry - dry run, just print jobs that would rerun