db0adbfc752f5593bf2b2e33b52c0c3551fb65e1 brianlee Thu Nov 7 10:15:18 2024 -0800 Add a CDW file download dependency on Apache mod_xsendfile, requested by Max 11/7/24 diff --git src/hg/cirm/cdw/README src/hg/cirm/cdw/README index cdd6a7e..00d5703 100644 --- src/hg/cirm/cdw/README +++ src/hg/cirm/cdw/README @@ -1,393 +1,398 @@ Here lies code to implement the CIRM Data Warehouse (CDW). The CDW is designed to track a moderate number (1,000,000's) of big (>1GB) files. It consists of three parts: 1) A Linux file system with a fairly large block size. 2) A MySQL database with tables tracking a) Users (by their email name) b) Submits (by date/submitter/hub) c) Files (by hub file name and CDW license plate). d) Groups and file permissions e) Results of automatic quality assurance results on files. 3) A bunch of command line programs, some of which are run by a daemon, that load the database and run quality assurance on it's contents. Please seed the install/README file for how to set up the CDW The schema for the database is in lib/*.as, with most of the information in cdw.as. There is also a script lib/resetCdw that will delete the existing database and create a new one on the test site (hgwdev). This should be viewed as documentation rather than as a program to run at this point, since the test database has useful stuff. The programs that interact with the database directly are in C, and all start with the "cdw" prefix. Arguably the most important program is cdwSubmit. This program takes a tab separated manifest file that contains a line for each file in the submission, and a tag-storm format metadata file that describes the experiments the files came from, copies the files into the warehouse directory, and puts the file into the cdwFile table. It also adds jobs to a table for the validation/QA script. The validation is done asynchronously. It is driven by the cdwRunDaemon program, which looks for new rows in the cdwJob table, and runs them, keeping a configurable number of jobs (currently 14) running in parallel. The validation is done by a simple linear shell script, cdwQaAgent, which calls a number of C programs one after the other. The first and perhaps most important of these is cdwMakeValidFile, which insures that a file is of the format it claims to be in the manifest, gathers some preliminary statistics on the file, and adds the file to the cdwValidFile table. At this point the file will have a name that looks like a license plate, something like CDW123ABC. Once the asynchronous validation is complete, a wrangle will check for errors, deal with them if necessary, and then run a few commands to index and otherwise prepare the data for web display. The web display program is cdwWebBrowse. PROCESS DESCRIPTION - WHAT PROGRAMS RUN WHEN ------------------------------- 1) cdwSubmit parses manifest.txt and meta.txt inputs and makes sure that the meta tags in the manifest exist in the meta.txt. It also checks meta.txt to make sure there are only legal tags present. It then md5-sums each file. It creates a cdwSubmit table entry. Then it processes each file in the manifest sequentially, copying it to the warehouse, and puts a row for 'cdwQaAgent' in the cdwJob table. 2) cdwDaemon (another instance) notices cdwQaAgent job, and starts it, capturing start/end times and stderr output. This means the server side validation/automatic QA process typically starts as cdwSubmit is working on copying the next file. 3) cdwQaAgent is just a very small shell script that invokes the next QA steps sequentially. 4) cdwMakeValidFile checks the file format really is what it is supposed to be, and for many file formats will gather statistics such as how many items covering how many bases are in the file. This information goes into the cdwValidFile table, along with the CIRM license plate. For fastq files additional information gets stored in the cdwFastqFile table, and a sample fastq containing just 250,000 reads is made. These reads get aligned against the target genome to compute the mapRatio. For fastq files, and files such as BAM and bigBed, that include genomic coordinates, a sample of 250,000 items is turned into a simple (non-blocked) bed file for later use. If there is an error during the cdwMakeValidFile phase the cdwValidFile table entry will not be made, and there will be a message posted on the cdwFile errorMessage column. 5) cdwMakePairedEndQa checks for read concordance for faired end fastq files 6) cdwMakeEnrichments will take the sample bed files produced in step 10, and see where it lands relative to all the regions specified (as .bigBed files) in the cdwQaEnrichTarget table for the relevant assembly. It puts the results in the cdwQaEnrich table, which will have one row for each file/target combination. 7) cdwMakeReplicateQa looks for any files that are in the database already with the same experiment and output type, but a different replicate. For these it uses the sample bed files from step 10 to make an entry in the cdwQaPairSampleOverlap table (aka cross-splatter or cross-enrichment analysis). For bigWig files instead of the cross-enrichment, it does correlations, putting the results in the cdwQaPairCorrelation table. These correlations are done genome wide, and also restricted to the regions specified as the target in the manifest. 8) cdwMakeContaminationQa runs only on fastq files. It subsamples the 250,000 read sample down to 100,000 reads, and aligns these against all organisms specified in the cdwQaContamTarget table. The mapRatio is that results is stored in the cdwQaContam table, with one entry for each fastq file/target pair. 9) cdwMakeRepeatQa also runs only on fastq files. (Potentially this could be extended to other files though.) It aligns the 250,000 read sample against a RepeatMasker library for the organism, and stores the results in the cdwQaRepeat table. This will have one entry for each major repeat class (LINE, SINE, tRNA, rRNA, etc) that gets hit by the sample, and includes what proportion of reads align to that major repeat class. 10)cdwMakeTrackViz will prepare some file types for visualization in the genome browser, adding them to the appropriate table, and if necessary creating indexes. TROUBLE_SHOOTING TIPS (8/17 JK) -------------------------------- when troubleshooting the CDW, the first thing to do is use 'ps' to determine if cdwDaemons are running. There should be at least 1: kent 5729 1 0 Aug14 ? 00:00:00 cdwRunDaemon cdw cdwJob 12 /data/www/userdata/cdwQaAgent.fifo -log=cdwQaAgent. If not, go to: cd kent/src/hg/cirm/cdw/restartDaemons production: cirm01-restartDaemons test: hgwdev-restartDaemons (A copy of these scripts is also in this repo, in kent/src/hg/cirm/cdw/restartDaemons) and run ./restartDaemons At this point they'll be restarted. If the daemons are working, the next thing to do is look at the recent entries in the cdwSubmit, cdwSubmitJob and cdwJob tables. Look in the errorMessage and stderr fields. SETTING UP THINGS FOR WEB BROWSER APP --------------------------------------- There's some things that need to happen after the majority of the database is made by the validation daemon and cdwSubmit. o - Run cdwMakeFileTags now to create cdwFileTags table o - Run cdwTextForIndex in /gbdb/cdw to create full text for free index search o - Run ixIxx to make the full text index in /gbdb o - You *might* need to run cdwChangeAccess to update the accessibility to make things group readable, (or if the access all tag is intended but left off, world readable) TROUBLE_SHOOTING SETTING UP LOGIN --------------------------------- The site requires HTTP login to access, and the first step is to put the same type of basicAuth apache config statements for a new server as the last server. In the apache config on the previous server there is a user/password file. An easy approach is to copy the existing username/password file to the new server. A script called cdwWebsiteAcess, a wrapper around htpasswd, can be used with the user/password file. Or you can use these commands to add a user to the file /etc/httpd/passwords on the new machine. htpasswd passwordfile username htpasswd /etc/httpd/passwords -n bob New password: Re-type new password: A line like this will be added to /etc/httpd/passwords where that password is encrypted. bob:$apr1$Xd2b1SuY$T17QLQnFyaH1fQrcVjDjg0 Once the user exists the Apache settings also have to be enabled feeding the authorization back into the site. Request the following edit to httpd.conf: SetEnvIf Authorization .+ HTTP_AUTHORIZATION=$0 +There will be a File Download problem if the mod_xsendfile module is missing and not +activated in Apache. Once installed mod_xsendfile will enable logged-in users +to download files. + + COMMAND LINE USAGE SUBMITTING NEW DATA FILES -------------------------------------------------------------------------------- $ cdwSubmit cdwSubmit - Submit URL with validated.txt to warehouse. usage: cdwSubmit email /path/to/manifest.tab meta.tag options: -update If set, will update metadata on file it already has. The default behavior is to report an error if metadata doesn't match. -noRevalidate - if set don't run revalidator on update -md5=md5sum.txt Take list of file MD5s from output of md5sum command on list of files -------------------------------------------------------------------------------- $ cdwMakeFileTags cdwMakeFileTags - Create cdwFileTags table from tagStorm on same database. usage: cdwMakeFileTags now options: -table=tableName What table to store results in, default cdwFileTags -database=databaseName What database to store results in, default cdw -------------------------------------------------------------------------------- $ cdwSwapInSymLink cdwSwapInSymLink - Replace submission file with a symlink to file in warehouse. usage: cdwSwapInSymLink startFileId endFileId options: -xxx=XXX -------------------------------------------------------------------------------- $ cdwChangeAccess cdwChangeAccess - Change access to files. usage: cdwChangeAccess code ' boolean query in rql' where code is three letters encode who, direction, and access in the same fashion more modern versions of the unix chmod function does. The who can be either a - for all g - for group u - for user and direction is either + to add a permission or - to take it away, and access is either r - for read w - for write for instance a+r gives everyone read access options: -dry - if set just print what we _would_ change access to. -------------------------------------------------------------------------------- $ cdwTextForIndex cdwTextForIndex - Make text file used for building ixIxx indexes usage: cdwTextForIndex out.txt options: -xxx=XXX -------------------------------------------------------------------------------- $ ixIxx ixIxx - Create indices for simple line-oriented file of format usage: ixIxx in.text out.ix out.ixx Where out.ix is a word index, and out.ixx is an index into the index. options: -prefixSize=N Size of prefix to index on in ixx. Default is 5. -binSize=N Size of bins in ixx. Default is 64k. UPDATING DATABASE FOR NEW USERS, AND CHANGING USER GROUPS AND PERMISSIONS -------------------------------------------------------------------------------- $ cdwCreateUser cdwCreateUser - Create a new user from email combo. usage: cdwCreateUser email -------------------------------------------------------------------------------- $ cdwCreateGroup cdwCreateGroup - Create a new group for access control. usage: cdwCreateGroup name "Descriptive sentence or two." -------------------------------------------------------------------------------- $ cdwGroupFile cdwGroupFile - Associate a file with a group. usage: cdwGroupFile group 'boolean expression like after a SQL where' options: -remove - instead of adding this group permission, subtract it -dry - if set just print what we _would_ add group to. -------------------------------------------------------------------------------- $ cdwGroupUser cdwGroupUser - Change user group settings. usage: cdwGroupUser group user1@email.add ... userN@email.add options: -primary Make this the new primary group (where user's file's initially live). -remove Remove users from group rather than add ADDING NEW GENOMES, NEW ANALYSIS STEPS, NEW TYPES OF QA -------------------------------------------------------------------------------- $ cdwAddAssembly cdwAddAssembly - Add an assembly to database. usage: cdwAddAssembly taxon name ucscDb twoBitFile options: -symLink=MD5SUM - if set then make symlink rather than copy and use MD5SUM rather than calculating it. Just to speed up testing -------------------------------------------------------------------------------- $ cdwAddStep cdwAddStep - Add a step to pipeline usage: cdwAddStep name 'description in quotes' options: -xxx=XXX -------------------------------------------------------------------------------- $ cdwAddQaContamTarget cdwAddQaContamTarget - Add a new contamination target to warehouse. usage: cdwAddQaContamTarget assemblyName options: -xxx=XXX -------------------------------------------------------------------------------- $ cdwAddQaEnrichTarget cdwAddQaEnrichTarget - Add a new enrichment target to warehouse. usage: cdwAddQaEnrichTarget name db path where name is target name, db is a UCSC db name like 'hg19' or 'mm9' and path is absolute path to a simple non-blocked bed file with non-overlapping items. QUERY DATABASE FROM COMMAND LINE -------------------------------------------------------------------------------- $ cdwQuery cdwQuery - Get list of tagged files. usage: cdwQuery 'sql-like query' options: -out=output format where format is ra, tab, or tags VALIDATION AND QA (RUN AUTOMATICALLY) -------------------------------------------------------------------------------- $ cdwQaAgent Run all agents on a file echo usage: cdwQaAgent fileId usage: cdwQaAgent fileId # Note: this is in the ../ directory as it is a script. -------------------------------------------------------------------------------- $ cdwMakeContaminationQa cdwMakeContaminationQa - Screen for contaminants by aligning against contaminant genomes. usage: cdwMakeContaminationQa startId endId where startId and endId are id's in the cdwFile table options: -keepTemp -------------------------------------------------------------------------------- $ cdwMakeEnrichments cdwMakeEnrichments - Scan through database and make a makefile to calc. enrichments and store in database. Enrichments similar to 'featureBits -countGaps' enrichments. usage: cdwMakeEnrichments startFileId endFileId -------------------------------------------------------------------------------- $ cdwMakePairedEndQa cdwMakePairedEndQa - Do alignments of paired-end fastq files and calculate distrubution of insert size. usage: cdwMakePairedEndQa startId endId options: -maxInsert=N - maximum allowed insert size, default 1000 -------------------------------------------------------------------------------- $ cdwMakeRepeatQa cdwMakeRepeatQa - Figure out what proportion of things align to repeats. usage: cdwMakeRepeatQa startFileId endFileId -------------------------------------------------------------------------------- $ cdwMakeReplicateQa cdwMakeReplicateQa - Do qa level comparisons of replicates. usage: cdwMakeReplicateQa startId endId -------------------------------------------------------------------------------- $ cdwMakeTrackViz cdwMakeTrackViz - If possible make cdwTrackViz table entry for a file. usage: cdwMakeTrackViz startFileId endFileId options: -xxx=XXX -------------------------------------------------------------------------------- $ cdwMakeValidFile cdwMakeValidFile - Add range of ids to valid file table. usage: cdwMakeValidFile startId endId options: maxErrCount=N - maximum errors allowed before it stops, default 1 -redo - redo validation even if have it already -------------------------------------------------------------------------------- $ cdwVcfStats cdwVcfStats - Make a pass through vcf file gatherings some stats. usage: cdwVcfStats in.vcf out.ra options: -bed=out.bed - make simple bed3 here MANAGE AUTOMATICALLY RUN JOBS -------------------------------------------------------------------------------- $ cdwRunDaemon cdwRunDaemon v3 - Run jobs on multiple processers in background. This is done with a combination of infrequent polling of the database, and a unix fifo which can be sent a signal (anything ending with a newline actually) that tells it to go look at database now. usage: cdwRunDaemon database table count where: database - mySQL database where cdwRun table lives table - table with same six fields as cdwRun table count - number of simultanious jobs to run options: -debug - don't fork, close stdout, and daemonize self -log=logFile - send error messages and warnings of daemon itself to logFile There are not many of these. Error mesages from jobs daemon runs end up in errorMessage fields of database. -logFacility - sends error messages and such to system log facility instead. -delay=N - delay this many seconds before starting a job, default 1 -------------------------------------------------------------------------------- $ cdwRetryJob cdwRetryJob - Add jobs that failed back to a cdwJob format queue. usage: cdwRetryJob database jobTable options: -dry - dry run, just print jobs that would rerun -minId=minimum ID of job to retry, default 0 -maxTime=N.N - Maximum time (in days) for a job to succeed. Default 1 -------------------------------------------------------------------------------- $ cdwRunOnIds cdwRunOnIds - Run a cdw command line program (one that takes startId endId as it's two parameters) for a range of ids, putting it on cdwJob queue. usage: cdwRunOnIds program 'queryString' Where queryString is a SQL command that should return a list of fileIds. Example cdwRunOnIds cdwFixQualScore 'select fileId from cdwFastqFile where qualMean < 0'options: -runTable=cdwTempJob -job table to use -------------------------------------------------------------------------------- $ cdwJob cdwJob - Look at a cdwJob format table and report what's up, optionally do something about it. usage: cdwJob command Where commands are: status - report overall status count - just count table size failed - list failed jobs running - list running jobs fihished - list finished (but not failed) jobs options: -database= default cdw -table= default cdwJob --------------------------------------------------------------------------------