src/hg/utils/tdbQuery/tdbQuery.doc 1.2

1.2 2009/12/03 21:04:30 kent
Adding to documentation.
Index: src/hg/utils/tdbQuery/tdbQuery.doc
===================================================================
RCS file: /projects/compbio/cvsroot/kent/src/hg/utils/tdbQuery/tdbQuery.doc,v
retrieving revision 1.1
retrieving revision 1.2
diff -b -B -U 4 -r1.1 -r1.2
--- src/hg/utils/tdbQuery/tdbQuery.doc	3 Dec 2009 20:28:58 -0000	1.1
+++ src/hg/utils/tdbQuery/tdbQuery.doc	3 Dec 2009 21:04:30 -0000	1.2
@@ -9,4 +9,60 @@
 in a record, which if it exists specifies the parent track.   A third type of inheritance
 also exists, the "view" level.   The trackDb system is explained in more detail in the
 README file in the trackDb root directory.
 
+TdbQuery is a convenient way to see what's in this system, which has grown to the point where
+it is difficult to scan by eye.  It procedes in three steps.  First it parses the relevant
+.ra files,  second it applies all of the inheritances so that a record includes all the fields
+from it's parents as well as the fields defined in the record itself, third it applies the
+query language statement to each record, printing the relevant part of each matching record.
+
+The query language, RQL, is much like SQL.  A statement in this language is made up of multiple
+clauses:  select, from, where, and limit.  Some of the clauses are optional.  An example of
+a statement with all the clauses is:
+    select track,shortLabel,type from hg18,hg19 where type = 'bedGraph 4' limit 10
+The select clause comes first, and specifies which fields should be printed.  It is possible to use the wildcard characters * and ? in the select clause, which work like they do in the Unix file 
+system.   The statement
+    select * from hg18 where track='knownGene'
+selects all fields from the knownGene track.  Unlike SQL, you can used more nuanced wildcards in
+this clause, so that
+    select *Label from hg18 where track='knownGene'
+will print all fields ending with Label.   As with SQL you can count the number of records with
+    select count(*) from hg18
+and it is possible to add a where clause with count(*) as well.
+
+The from clause in RQL is a list of genome databases to query. This is quite different from SQL in
+that the different items in the where clause are just searched one after the other, rather than
+"joined" together.  Wildcards are permitted in the from clause as well.  The statement
+    select count(*) from *
+will count all records in all databases.
+
+The where clause is optional, but it can be complex.  If present the "where" is followed by a 
+logical expression.  The select statement is only applied to records where the where clause 
+evaluates to true.  The items in the expression can be number, strings, and fields from the
+record.   Numbers evaluate to true if they are non-zero.  Strings evaluate to true if they are 
+non-empty.  Fields evaluate to TRUE if they exist in the record.  This last is particularly
+useful in practice.  The statement
+     select * from hg18 where autoScale
+for instance prints out all records where there is an autoScale field.   
+
+Items in a where clause can be combined with comparison and logical operations.  The comparison
+operators are = != > < >= <= and can be used with numbers and strings.  For strings the greater
+than and less than operations apply to the alphabetical order (case sensitive) of the strings.
+String can also be compared to wildcards using the 'like' and 'not like' operations.   As in 
+SQL the wildcards are '%' for match any number of characters (the equivalent to '*' in 
+field and file wildcards) and '_' to match a single character (the equivalent to '?').  The
+statement
+    select * from hg18 where shortLabel like "%Gene"
+returns all fields from all records where the shortLabel ends in Gene.   The logical operators are
+"or" and "and" and "not" which have the usual meanings.  The or operator has, as is traditional,
+lower precedence than the and operator.  Because of this the following two statements are
+equivalent
+    select * from hg18 where a > 0 and a < 10 or b > 0 and b < 10
+    select * from hg18 where (a > 0 and a < 10) or (b > 0 and b < 10)
+The where clause also supports array indexes.  These are used to fetch whitespace delimited
+words within a string.   A common use is to specify the first word in the type line:
+    select * from hg18 where type[0] = "wig"
+
+The limit clause is very straightforward, it is just the word "limit" followed by a positive number
+which specifies the maximum number of records to return.
+