src/hg/utils/tdbQuery/tdbQuery.doc 1.3

1.3 2009/12/03 21:51:19 kent
Improving docs.
Index: src/hg/utils/tdbQuery/tdbQuery.doc
===================================================================
RCS file: /projects/compbio/cvsroot/kent/src/hg/utils/tdbQuery/tdbQuery.doc,v
retrieving revision 1.2
retrieving revision 1.3
diff -b -B -U 1000000 -r1.2 -r1.3
--- src/hg/utils/tdbQuery/tdbQuery.doc	3 Dec 2009 21:04:30 -0000	1.2
+++ src/hg/utils/tdbQuery/tdbQuery.doc	3 Dec 2009 21:51:19 -0000	1.3
@@ -1,68 +1,82 @@
 This file describes tdbQuery, in particular the RQL language, in more detail than the
 usage statement you get from running tdbQuery with no parameters.
 
-The trackDb system stores the track settings for the UCSC Genome Browser.  It consists of .ra
-files in three levels of directories - root, organism, and assembly.  The assembly level corresponds
-to a particular "genome database" like hg19, currently the latest human.  Records at the lower
-levels override and extend records at higher levels.  In addition to the inheritance 
-implied by the directory hierarchy, there is also inheritance defined by the "subTrack" field
-in a record, which if it exists specifies the parent track.   A third type of inheritance
-also exists, the "view" level.   The trackDb system is explained in more detail in the
-README file in the trackDb root directory.
+The trackDb system stores the track settings for the UCSC Genome Browser in a collection of
+.ra files.  The .ra files have a fairly simple syntax.  Each field in a record takes up a line.
+Records are separated by blank lines.  The first word in a field line is the field name, and
+the rest of the line is the field value.   Here's an example of a .ra file with two records:
+    track littleFeatures
+    type bed 6 +
+    shortLabel Little Features
+    longLabel A bunch of little features with strand and score information
+
+    track myGenes
+    type genePred
+    shortLabel Plank Genes
+    longLabel Output of the latest gene predictor from Max Plank
+The output of tdbQuery is also in .ra format.
+
+The trackDb .ra files exist in three levels of directories - root, organism, and assembly.  The 
+assembly level corresponds to a particular "genome database" like hg19, currently the latest 
+human.  Records at the lower levels override and extend records at higher levels.  In addition 
+to the inheritance implied by the directory hierarchy, there is also inheritance defined by the 
+"subTrack" field in a record, which if it exists specifies the parent track.   A third type of 
+inheritance also exists, the "view" level.   The trackDb system is explained in more detail 
+in the README file in the trackDb root directory at kent/src/hg/makeDb/trackDb.
 
 TdbQuery is a convenient way to see what's in this system, which has grown to the point where
 it is difficult to scan by eye.  It procedes in three steps.  First it parses the relevant
 .ra files,  second it applies all of the inheritances so that a record includes all the fields
 from it's parents as well as the fields defined in the record itself, third it applies the
 query language statement to each record, printing the relevant part of each matching record.
 
 The query language, RQL, is much like SQL.  A statement in this language is made up of multiple
 clauses:  select, from, where, and limit.  Some of the clauses are optional.  An example of
 a statement with all the clauses is:
     select track,shortLabel,type from hg18,hg19 where type = 'bedGraph 4' limit 10
 The select clause comes first, and specifies which fields should be printed.  It is possible to use the wildcard characters * and ? in the select clause, which work like they do in the Unix file 
 system.   The statement
     select * from hg18 where track='knownGene'
 selects all fields from the knownGene track.  Unlike SQL, you can used more nuanced wildcards in
 this clause, so that
     select *Label from hg18 where track='knownGene'
 will print all fields ending with Label.   As with SQL you can count the number of records with
     select count(*) from hg18
 and it is possible to add a where clause with count(*) as well.
 
 The from clause in RQL is a list of genome databases to query. This is quite different from SQL in
 that the different items in the where clause are just searched one after the other, rather than
 "joined" together.  Wildcards are permitted in the from clause as well.  The statement
     select count(*) from *
 will count all records in all databases.
 
 The where clause is optional, but it can be complex.  If present the "where" is followed by a 
 logical expression.  The select statement is only applied to records where the where clause 
 evaluates to true.  The items in the expression can be number, strings, and fields from the
 record.   Numbers evaluate to true if they are non-zero.  Strings evaluate to true if they are 
 non-empty.  Fields evaluate to TRUE if they exist in the record.  This last is particularly
 useful in practice.  The statement
      select * from hg18 where autoScale
 for instance prints out all records where there is an autoScale field.   
 
 Items in a where clause can be combined with comparison and logical operations.  The comparison
 operators are = != > < >= <= and can be used with numbers and strings.  For strings the greater
 than and less than operations apply to the alphabetical order (case sensitive) of the strings.
 String can also be compared to wildcards using the 'like' and 'not like' operations.   As in 
 SQL the wildcards are '%' for match any number of characters (the equivalent to '*' in 
 field and file wildcards) and '_' to match a single character (the equivalent to '?').  The
 statement
     select * from hg18 where shortLabel like "%Gene"
 returns all fields from all records where the shortLabel ends in Gene.   The logical operators are
 "or" and "and" and "not" which have the usual meanings.  The or operator has, as is traditional,
 lower precedence than the and operator.  Because of this the following two statements are
 equivalent
     select * from hg18 where a > 0 and a < 10 or b > 0 and b < 10
     select * from hg18 where (a > 0 and a < 10) or (b > 0 and b < 10)
 The where clause also supports array indexes.  These are used to fetch whitespace delimited
 words within a string.   A common use is to specify the first word in the type line:
     select * from hg18 where type[0] = "wig"
 
 The limit clause is very straightforward, it is just the word "limit" followed by a positive number
 which specifies the maximum number of records to return.