src/hg/makeDb/doc/monDom5.txt 1.14

1.14 2009/09/10 02:02:58 aamp
Added chain/net make quasi-instructions.
Index: src/hg/makeDb/doc/monDom5.txt
===================================================================
RCS file: /projects/compbio/cvsroot/kent/src/hg/makeDb/doc/monDom5.txt,v
retrieving revision 1.13
retrieving revision 1.14
diff -b -B -U 4 -r1.13 -r1.14
--- src/hg/makeDb/doc/monDom5.txt	21 Jul 2009 21:01:44 -0000	1.13
+++ src/hg/makeDb/doc/monDom5.txt	10 Sep 2009 02:02:58 -0000	1.14
@@ -864,5 +864,119 @@
 by a single Makefile. This is available from:
    svn+ssh://hgwdev.cse.ucsc.edu/projects/compbio/usr/markd/svn/projs/transMap/tags/vertebrate.2009-07-01
 
 see doc/builds.txt for specific details.
-############################################################################
+
+###########################################################################
+# ALIGNMENTS/CHAINS/NETS (DONE Dec 2008, Andy)
+# 
+# To be honest I didn't really concentrate on getting the whole enchilada of 
+# make-notes into record, because the whole process is so robotic.  
+# 
+# I'll start with the DEF files.  These have varying parameters based on 
+# the query species.  So here are the various DEF parameters:
+#
+     DB    H    Y     L    K   T   M   A_R*      Q  1CHUNK 1LAP 2CHUNK 2LAP 2LIMIT
+bosTau4    - 3400  6000 2200   -   -     -  HoxD55      1M  10K    20M    0    100
+canFam2 2000 3400 10000 2200   -  50     0  HoxD55      1M  10K    30M    0      -
+danRer5 2000 3400  6000 2200   -   -     -  HoxD55      5M  10K    10M    0      -
+galGal3 2000 3400 10000 2200   -   -     -  HoxD55      5M  10K    20M    0      -
+   hg18 2000 3400 10000 2200   -  50     0  HoxD55      5M  10K    10M    0      -
+macEug1    - 3400     -    -   2  10     -       -     10M 320K   500K    0    100
+   mm9     - 3400  6000 2200   -   -     -  HoxD55      5M  10K    20M    0      -
+ornAna1 2000 3400  6000 2200   -  50     -  HoxD55      5M  10K    20M    0    300
+panTro2 2000 3400 10000 2200   -  50     0  HoxD55      1M  10K    30M    0      -
+ponAbe2 2000 3400 10000 2200   -  50     0  HoxD55      1M  10K    30M    0      -
+rheMac2 2000 3400 10000 2200   -  50     0  HoxD55      1M  10K    30M    0      -
+    rn4    - 3400  6000 2200   -   -     -  HoxD55      2M  10K    10M    0      -
+    rn5    - 3400  6000 2200   -   -     -       -      5M  10K    20M    0      -
+xenTro2 2000 3400  8000 2200   -  50     -  HoxD55      5M  10K    20M    0    100
+
+* A_R = BLASTZ_ABRIDGE_REPEATS
+
+# In most of those cases I was looking to the DEF variables in the monDom4 version
+# of the alignment.  In the case of macEug1, the variable settings were given by
+# Webb Miller.
+# 
+# After all the DEF files are created, it's time to run doBlastzChainNet.pl
+# For monDom5, every run followed this pattern:
+
+cd /hive/data/genomes/monDom5/bed
+mkdir blastz.otherDb
+cd blastz.otherDb
+screen -S otherDb_monDom5
+doBlastzChainNet.pl -bigClusterHub swarm -stop cat DEF >& doUntilCat.log
+#[detach screen, come back when it's done]
+
+# The reason I'd stop after the cat step is that I was having better results 
+# using swarm for chaining instead of memk.  At the time, memk only had a dozen
+# nodes, and usually only 8 would be online.  Because of the large chromosomes,
+# the chaining step would be bottlenecked by the hippo jobs on the big chroms.
+# It was going much more quickly when the cluster could accommodate more
+# concurrent jobs. On memk the jobs were allocated 8GB of RAM and they normally
+# only get 2GB on swarm, so in case that mattered, I did this:
+
+#[re-attach screen]
+doBlastzChainNet.pl -bigClusterHub swarm -smallClusterHub swarm -minChainScore 3000 DEF >& doAfterCat.log
+#[detach screen]
+ssh swarm
+cd /hive/data/genomes/monDom5/bed/blastz.otherDb/axtChain/run
+para check
+# check to see the jobs have started, once they have:
+para stop; para resetCounts; para -ram=8g -cpu=4 create jobList; para push
+# check to see things are all fine, and log out of swarm
+
+# after that all should have completed fine. Stopping the cluster run manually
+# while the doBlastzChainNet.pl script is running doesn't typically confuse it.
+# I suppose it's possible that the script will check on the run in the small
+# period while it's stopped, but it didn't happen to me.
+#
+# In retrospect I think this workaround is possibly avoidable now that there are
+# more memk nodes.  However the opossum is a particularly difficult species to
+# chain, so I'm just mentioning it anyway.  Later when doing alignments on hg19,
+# memk was up to the task just fine.
+# 
+# Also of note is the inconsistency of DEF parameter settings.  When monDom6
+# happens, someone will probably look here to see what was done with monDom5.
+# If I have any advice it's to take an approach like the one with hg19 and
+# use consistent parameters as much as possible and set them according to
+# different tiers of evolutionary distance.  
+
+
+####################################################################
+# RELOAD CHAINS/NETS AS NON-SPLIT (2009-06-09, Andy)
+
+for d in blastz.*; do 
+   if [ $d != "blastz.bosTau4" ]; then 
+      db=${d#blastz.};
+      Db=`echo $db | sed 's/^./\u&/'`; 
+      echo Loading $db chains into monDom5...;
+      time nice -n +19 hgLoadChain -tIndex monDom5 chain$Db \
+         blastz.$db/axtChain/monDom5.$db.all.chain.gz;
+    fi; 
+done >& unsplit/chainReloads.log
+# problem with macEug1: the link table fills up on load, so redo it by hand
+
+cd blastz.macEug1/axtChain/
+time nice -n +19 hgLoadChain -tIndex monDom5 chainMacEug1 monDom5.macEug1.all.chain.gz
+#Loading 19668859 chains into monDom5.chainMacEug1
+#Can't start query:
+#load data local infile 'link.tab' into table chainMacEug1Link
+#
+#mySQL error 1114: The table 'chainMacEug1Link' is full
+#real    146m54.273s
+#
+wc -l link.tab chain.tab
+#200440062 link.tab
+#19668859 chain.tab
+randomLines link.tab 10000000 stdout | awk '{print length($0)}' | sort | uniq -c
+randomLines chain.tab 1000000 stdout | awk '{print length($0)}' | sort | uniq -c
+# typical row length in characters: 92 for chain.tab, 42 for link.tab
+sed "s/hgLoadChain.*/hgsqldump monDom5 chainRn4Link --no-data --skip-comments\n\
+ | sed \'s\/Rn4\/MacEug1\/; s\/TYPE=MyISAM\/ENGINE=MyISAM max_rows=201000000\n\
+ avg_row_length=42 pack_keys=1 CHARSET=latin1\/\' | hgsql monDom5 \n\
+/" loadUp.csh > manualLoadUp.csh
+
+hgsqldump monDom5 chainRn4 --no-data --skip-comments | sed \'s\/Rn4\/MacEug1\/; s\/TYPE=MyISAM\/ENGINE=MyISAM max_rows=20000000 avg_row_length=92 pack_keys=1 CHARSET=latin1\/\' | hgsql monDom5 \n\
+hgsql monDom5 -e \"load data local infile \'chain.tab\' into table chainMacEug1\"\n\
+hgsql monDom5 -e \"load data local infile \'link.tab\' into table chainMacEug1Link\"\n\
+hgsql monDom5 -e \"INSERT into history (ix, startId, endId, who, what, modTime, errata) VALUES(NULL,0,0,\'aamp\',\'Loaded 19668859 chains into macEug1 chain table manually\', NOW(), NULL)\"\