src/hg/autoXml/autoXml.doc d8feabb353b3c2650facea4afd08c86bb56e5549

d8feabb353b3c2650facea4afd08c86bb56e5549
kent
  Fri Apr 6 16:51:54 2012 -0700
Moving autoSql and autoDtd and autoXml back to just under hg.  A little autoSql -django fix.
diff --git src/hg/autoXml/autoXml.doc src/hg/autoXml/autoXml.doc
new file mode 100644
index 0000000..5d698b1
--- /dev/null
+++ src/hg/autoXml/autoXml.doc
@@ -0,0 +1,190 @@
+AUTOXML OVERVIEW
+
+AutoXML generates C code for an XML parser given
+an XML DTD file.   It will generate a structure
+for each 'element' in the DTD,  and populate the
+structure with fields for each attribute and subelement of that element.
+By default it will generate a parser that ignores
+elements and attributes not in the DTD, but otherwise
+is a 'validating' parser.  If you use the -picky
+flag it will be fully validating.   
+
+The AutoXML parser will load the entire file into
+memory.   If this is a problem you'll have to resort
+to the lower level 'xap' parser, which is much like
+the commonly used 'expat' parser, but a bit faster.
+
+A SHORT XML AND DTD TUTORIAL
+
+If you find yourself befuddled by all the acronyms
+so far you're probably new to XML.   Here's a brief
+description.   XML stands for eXtensible Markup Language.
+It's a tag-based format.   A simple example of
+an XML doc might be:
+   <POLYGON id="square">
+       <DESCRIPTION>This is soooo square man</DESCRIPTION>
+       <POINT x="0" y="0" />
+       <POINT x="0" y="1" />
+       <POINT x="1" y="1" />
+       <POINT x="1" y="0" />
+   </POLYGON>
+Everything in XML lives between <TAG></TAG> pairs.  
+A tag may have associated text, attributes and subtags.  
+In the example above POLYGON has the subtags DESCRIPTION 
+and POINT, the attribute id, and no text.  DESCRIPTION has 
+the text "This is soooo square man" and no subtags or
+attributes.   POINT has the attributes x and y.  POINT
+also illustrates a little XML shortcut - tags containing
+only attributes can be written <TAG att="something" />
+as a shortcut for <TAG att="something"></TAG>.  
+
+XML is much like HTML, but has significant differences as
+well.  All attribute values must be enclosed in quotes in XML,
+while quotes are optional in HTML.   Tags must strictly
+nest in XML, while HTML allows tags to be opened but not
+closed.  The tags in HTML are predefined.   In XML the
+definition of tags is up to you.
+
+Tags can be defined two ways in XML - by a DTD file or
+by an XML schema.  There are pros and cons of each
+method.  DTD files are relatively simple, and are recognized
+by a wide variety of parsers and XML browsers.  On the
+other hand DTD files can't express that a certain attribute
+has to be numerical.   XML schemas are more complex.  They
+are themselves written in a type of XML, which is nice in
+some ways.  They are not as widely supported yet.  Currently
+autoXml only works with DTD files with some modest extensions.
+
+Here is a DTD file which would describe the POLYGON format
+above:
+
+<!ELEMENT POLYGON (DESCRIPTION? POINT+)>
+<!ATTLIST POLYGON id CDATA #REQUIRED>
+
+<!ELEMENT DESCRIPTION (#PCDATA)>
+
+<!ELEMENT POINT>
+<!ATTLIST POINT x CDATA #REQUIRED>
+<!ATTLIST POINT y CDATA #REQUIRED>
+<!ATTLIST POINT z CDATA "0">
+
+The DTD has two major types of definitions - ELEMENTs and ATTLISTs
+(or attributes).  An element definition includes the name of
+the element and an optional parenthesized list of subelements.
+The subelements must be defined elsewhere in the DTD with the
+exception of the #PCDATA subelement, which is used to indicate
+that the element can have text between its tags.   Each subelement
+may be followed by one of the following characters:
+    ? - the subelement is optional
+    + - the subelement occurs at least once
+    * - the subelement occurs 0 or more times
+If there is no following character the subelement occurs exactly
+once.
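+
+The occurrence rules above can be sketched as a small C check.  This is
+only an illustration and not code taken from autoXml; the function name
+occurrenceOk is hypothetical.  It takes the suffix character (or 0 when
+there is no suffix) and the number of times the subelement was seen.

```c
#include <assert.h>
#include <stdbool.h>

/* Return true if 'count' occurrences of a subelement satisfy the
 * occurrence suffix: '?', '+', '*', or 0 for no suffix.
 * Hypothetical helper, not part of autoXml. */
bool occurrenceOk(char suffix, int count)
{
assert(count >= 0);
switch (suffix)
    {
    case '?': return count <= 1;   /* Optional: zero or one. */
    case '+': return count >= 1;   /* At least once. */
    case '*': return true;         /* Zero or more. */
    case 0:   return count == 1;   /* No suffix: exactly once. */
    default:  return false;        /* Unknown suffix. */
    }
}
```

+With the POLYGON definition above, a polygon with no DESCRIPTION passes
+the '?' check, while a polygon with no POINT fails the '+' check.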
+
+The ATTLIST defines an attribute and associates it with an element.
+It is good style to keep ATTLISTs together with their ELEMENT.  
+Here are the fields in an ATTLIST:
+    element - name of element this is associated with
+    name - name of this attribute
+    type - generally CDATA.  Can be a reference or date, but these
+           are not supported by autoXml.
+    default - this contains a default value to be used if the attribute
+           is not present.   The keyword #REQUIRED in this field means
+           that the attribute must be present.  The keyword #IMPLIED
+           means that it's ok for this attribute to be missing (in which
+           case it will have a NULL or zero value after it is read by 
+           autoXML).
+
+There's a third type of tag in a DTD file, the ENTITY.  This tag is
+used more or less as a macro definition.   An example of an ENTITY is
+     <!ENTITY % address "street,city,state,zip">
+After this ENTITY is defined you can type %address; with the same effect
+as typing street,city,state,zip.
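+
+The macro-like behavior of an ENTITY can be sketched in C as plain text
+substitution.  This is a simplified illustration that handles a single
+entity, not the way autoXml itself processes DTDs; the function name
+expandEntity and its parameters are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Copy 'in' to 'out', replacing each "%name;" with 'value'.
 * Returns 0 on success, -1 if 'out' (of size outSize) is too small.
 * Hypothetical helper, not part of autoXml. */
int expandEntity(const char *in, const char *name, const char *value,
        char *out, size_t outSize)
{
size_t nameLen = strlen(name), valLen = strlen(value), used = 0;
assert(outSize > 0);
while (*in != 0)
    {
    /* A reference is '%', then the entity name, then ';'. */
    if (in[0] == '%' && strncmp(in+1, name, nameLen) == 0
        && in[1+nameLen] == ';')
        {
        if (used + valLen >= outSize)
            return -1;
        memcpy(out + used, value, valLen);
        used += valLen;
        in += nameLen + 2;
        }
    else
        {
        if (used + 1 >= outSize)
            return -1;
        out[used++] = *in++;
        }
    }
out[used] = 0;
return 0;
}
```

+Expanding "(name,%address;)" with the address entity above gives
+"(name,street,city,state,zip)".  References to other entity names are
+copied through unchanged.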
+
+
+AUTOXML EXTENSIONS AND LIMITS
+
+One disadvantage of DTDs is that all types are strings.  This is not
+convenient for a language like C where numerical and string types are
+handled very differently, and indeed where numerical types can be handled
+much more efficiently than string types.  To get around this we make use
+of some predefined XML entities.   In an ATTLIST you can use the
+entities %INT; and %FLOAT; which will map to C int and double types.
+Instead of #PCDATA  you can use the entities %INTEGER; and %REAL;
+for the same effect.  The %INTEGER; and %REAL; entities are used by
+NCBI as well as UCSC.  As far as I can tell NCBI doesn't have definitions
+for numerical attributes.
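+
+To illustrate what mapping %INT; and %FLOAT; to C types involves, here
+is a minimal sketch of converting attribute text to int and double using
+the standard strtol() and strtod() functions.  The names attToInt and
+attToDouble are hypothetical, and the generated parser may well do its
+conversion differently.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Parse all of 'text' as a C int, as one might for an %INT; attribute.
 * Returns false if text is empty, has trailing junk, or is out of range.
 * Hypothetical helper, not part of autoXml. */
bool attToInt(const char *text, int *retVal)
{
char *end;
long val;
assert(text != NULL && retVal != NULL);
errno = 0;
val = strtol(text, &end, 10);
if (end == text || *end != 0 || errno == ERANGE
    || val < INT_MIN || val > INT_MAX)
    return false;
*retVal = (int)val;
return true;
}

/* Parse all of 'text' as a C double, as one might for a %FLOAT; attribute.
 * Hypothetical helper, not part of autoXml. */
bool attToDouble(const char *text, double *retVal)
{
char *end;
double val;
assert(text != NULL && retVal != NULL);
errno = 0;
val = strtod(text, &end);
if (end == text || *end != 0 || errno == ERANGE)
    return false;
*retVal = val;
return true;
}
```

+A value such as "4x" is rejected by both functions, which is the sort of
+check a plain CDATA attribute cannot express.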
+
+Currently AutoXML can't handle external DTDs or DTDs that reference
+other DTDs.
+
+
+AUTOXML CODE GENERATION
+
+The polygon.dtdx file here:
+
+<!ELEMENT POLYGON (DESCRIPTION? POINT+)>
+<!ATTLIST POLYGON id CDATA #REQUIRED>
+<!ELEMENT DESCRIPTION (#PCDATA)>
+<!ELEMENT POINT>
+<!ATTLIST POINT x %FLOAT;  #REQUIRED>
+<!ATTLIST POINT y %FLOAT;  #REQUIRED>
+<!ATTLIST POINT z %FLOAT;  "0">
+
+and the command line:
+   autoXml polygon.dtdx poly
+
+generates poly.h as follows:
+
+/* poly.h autoXml generated file */
+#ifndef POLY_H
+#define POLY_H
+
+struct polyPolygon
+    {
+    struct polyPolygon *next;
+    char *id;	/* Required */
+    struct polyDescription *polyDescription;	/** Optional (may be NULL). **/
+    struct polyPoint *polyPoint;	/** Non-empty list required. **/
+    };
+
+void polyPolygonSave(struct polyPolygon *obj, int indent, FILE *f);
+/* Save polyPolygon to file. */
+
+struct polyPolygon *polyPolygonLoad(char *fileName);
+/* Load polyPolygon from file. */
+
+struct polyDescription
+    {
+    struct polyDescription *next;
+    char *text;
+    };
+
+void polyDescriptionSave(struct polyDescription *obj, int indent, FILE *f);
+/* Save polyDescription to file. */
+
+struct polyDescription *polyDescriptionLoad(char *fileName);
+/* Load polyDescription from file. */
+
+struct polyPoint
+    {
+    struct polyPoint *next;
+    double x;	/* Required */
+    double y;	/* Required */
+    double z;	/* Defaults to 0 */
+    };
+
+void polyPointSave(struct polyPoint *obj, int indent, FILE *f);
+/* Save polyPoint to file. */
+
+struct polyPoint *polyPointLoad(char *fileName);
+/* Load polyPoint from file. */
+
+#endif /* POLY_H */
+
+It generates a corresponding .c file as well.  Each XML file has
+to have a root object.  In this case the root object is POLYGON
+(our DTD as is won't let us have more than one polygon per file).
+You can read an XML file that respects this DTD using the 
+polyPolygonLoad() function,  and save it back out using the
+polyPolygonSave() function.
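+
+Once a polygon is in memory, its subelements are ordinary singly-linked
+lists joined by the next field.  The sketch below repeats the structure
+definitions from poly.h so that it stands alone, and adds two helper
+functions.  The helpers polyPointCount and polyDescriptionText are
+hypothetical and are not generated by autoXml.

```c
#include <assert.h>
#include <stddef.h>

/* Structures as they appear in the generated poly.h. */
struct polyPoint
    {
    struct polyPoint *next;
    double x;	/* Required */
    double y;	/* Required */
    double z;	/* Defaults to 0 */
    };

struct polyDescription
    {
    struct polyDescription *next;
    char *text;
    };

struct polyPolygon
    {
    struct polyPolygon *next;
    char *id;	/* Required */
    struct polyDescription *polyDescription;	/* Optional (may be NULL). */
    struct polyPoint *polyPoint;	/* Non-empty list required. */
    };

/* Count the points in a polygon by following the next pointers.
 * Hypothetical helper, not generated by autoXml. */
int polyPointCount(const struct polyPolygon *poly)
{
int count = 0;
const struct polyPoint *pt;
assert(poly != NULL);
for (pt = poly->polyPoint; pt != NULL; pt = pt->next)
    ++count;
return count;
}

/* Return the description text, or "" if the optional DESCRIPTION
 * is absent.  Hypothetical helper, not generated by autoXml. */
const char *polyDescriptionText(const struct polyPolygon *poly)
{
assert(poly != NULL);
if (poly->polyDescription == NULL || poly->polyDescription->text == NULL)
    return "";
return poly->polyDescription->text;
}
```

+In a real program the structures would come from including poly.h, and
+the polygon passed to these helpers would be the value returned by
+polyPolygonLoad().  Because DESCRIPTION is optional, code that reads it
+should check for NULL as polyDescriptionText does.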