src/hg/makeDb/doc/hg19.txt 1.78

1.78 2010/02/04 19:25:27 hartera
Loaded corrected data for the Seg Dupes track.
Index: src/hg/makeDb/doc/hg19.txt
===================================================================
RCS file: /projects/compbio/cvsroot/kent/src/hg/makeDb/doc/hg19.txt,v
retrieving revision 1.77
retrieving revision 1.78
diff -b -B -U 4 -r1.77 -r1.78
--- src/hg/makeDb/doc/hg19.txt	2 Feb 2010 20:11:44 -0000	1.77
+++ src/hg/makeDb/doc/hg19.txt	4 Feb 2010 19:25:27 -0000	1.78
@@ -8170,9 +8170,9 @@
     ln -s `pwd`/$gp.$mz.exonAA.fa.gz $pd/$gp.exonAA.fa.gz
     ln -s `pwd`/$gp.$mz.exonNuc.fa.gz $pd/$gp.exonNuc.fa.gz
 
 ############################################################################
-# SEGMENTAL DUPLICATIONS (20010-02-02, hartera, in progress)
+# SEGMENTAL DUPLICATIONS (2010-02-02 - 2010-02-04, hartera, DONE)
 # File emailed from Tin Louie <tinlouie at u.washington.edu>
 # in Evan Eichler's lab on 01/28/10. This is a data update since it was
 # thought that the last data set was incorrect so the pipeline had to be
 # re-run.
@@ -8180,8 +8180,18 @@
 # column could be dropped. It is just the size of the otherChrom and it 
 # does not seem to be used for the track display or details page. It has the
 # correct description in the table schema so it is ok to keep it for now. 
 # In the future, this column could be dropped if it is not useful.
+# There are a number of columns that could be dropped as they are
+# meaningless, but they were kept because the code for the details page
+# expects them to be there.
+# 01/28/10 Received new data as previous run of the pipeline may have
+# produced incorrect results. 
+# 2010-02-02 Loader aborted on data since in some lines there was an empty
+# field so the loader read only 28 words instead of 29. E-mailed Tin to
+# ask for the data to be fixed. 
+# 2010-02-03 Received new data as the previous data had empty fields.
+# 2010-02-04 Loaded new data into hg19 database.
     mkdir /hive/data/genomes/hg19/bed/genomicSuperDups
     cd /hive/data/genomes/hg19/bed/genomicSuperDups
     # Remove old data
     rm *
@@ -8207,22 +8217,28 @@
     sed -e 's/\t_\t/\t-\t/' hg19genomicSuperDups \
     | awk '($3 - $2) >= 1000 && ($9 - $8) >= 1000 {print;}' \
     | hgLoadBed hg19 genomicSuperDups stdin \
       -sqlTable=$HOME/kent/src/hg/lib/genomicSuperDups.sql
-    # Loader says:
-Expecting 29 words line 29 of stdin got 28. Problem is that there are two tabs
-with a blank indelS field on this line so the loader, splitting on tabs, only
-reads 28 fields for this line. Same problem in other lines of the data.
-Contacted Tin to see if this can be fixed. 
-
-# Reading stdin
-# Loaded 63463 elements of size 29
+# Loaded 51549 elements of size 29
 # Sorted
 # Creating table definition for genomicSuperDups
 # Saving bed.tab
 # Loading hg19
+
+    # 2009-11-05: 
     # Updated details page with suggested text and an additional reference. 
     # src/hg/makeDb/trackDb/genomicSuperDups.html
+    # 2010-02-04: Updated the schema description as below in
+    # src/hg/lib/genomicSuperDups.sql. Kept score as it is used in older
+    # datasets, e.g. on hg18.
+    # Suggestions by Tin Louie for the schema description:
+# I suggest that the description of those meaningless columns (on the webpage
+# 'Schema for Segmental Dups') be changed to "for future use". The meaningless
+# columns are:  score, posBasesHit, testResult, verdict, chits, ccov
+# The descriptions of other columns should be changed for clarification:
+# otherSize -- equal to otherEnd minus otherStart
+# uid -- id shared by the query & subject of a hit 
+
 ############################################################################
 # ADD LINK TO GENENETWORK (DONE. 12/02/09 Fan).
 
 # Received geneNetwork ID list file, GN_human_RefSeq.txt, for hg19 from GeneNetwork, Zhou Xiaodong [xiaodong.zhou@gmail.com].
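
The 2010-02-02 entry in the Segmental Duplications section above records that the loader aborted because some lines contained an empty field. A check along the following lines could flag such lines before loading. This is only a sketch: the function name findEmptyFields is hypothetical and not part of the kent source tree, and it assumes the input is tab-separated, as the log entry describes.

```shell
#!/bin/sh
# Hypothetical helper (not part of the kent tree): report the first empty
# tab-separated field on each line of the file given as $1.
# Assumption: the input is tab-delimited, as described in the log entry.
findEmptyFields() {
    awk -F'\t' '{
        # Walk the fields; stop at the first empty one on this line.
        for (i = 1; i <= NF; i++) {
            if ($i == "") {
                printf "line %d: field %d is empty\n", NR, i
                break
            }
        }
    }' "$1"
}
```

For example, running `findEmptyFields hg19genomicSuperDups` before the hgLoadBed step would list any lines that need to be fixed upstream; no output means no empty fields were found.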