d8feabb353b3c2650facea4afd08c86bb56e5549
kent
Fri Apr 6 16:51:54 2012 -0700
Moving autoSql and autoDtd and autoXml back to just under hg. A little autoSql -django fix.
diff --git src/hg/autoXml/autoXml.doc src/hg/autoXml/autoXml.doc
new file mode 100644
index 0000000..5d698b1
--- /dev/null
+++ src/hg/autoXml/autoXml.doc
@@ -0,0 +1,190 @@
+AUTOXML OVERVIEW
+
+AutoXML generates C code for an XML parser given
+an XML DTD file. It will generate a structure
+for each 'element' in the DTD, and populate the
+structure with fields for each attribute of the element.
+By default it will generate a parser that ignores
+elements and attributes not in the DTD, but otherwise
+is a 'validating' parser. If you use the -picky
+flag it will be fully validating.
+
+The AutoXML parser will load the entire file into
+memory. If this is a problem you'll have to resort
+to the lower level 'xap' parser, which is much like
+the commonly used 'expat' parser, but a bit faster.
+
+A SHORT XML AND DTD TUTORIAL
+
+If you find yourself befuddled by all the acronyms
+so far you're probably new to XML. Here's a brief
+description. XML stands for eXtensible Markup Language.
+It's a tag-based format. A simple example of
+an XML doc might be:
+    <POLYGON id="square">
+        <DESCRIPTION>This is soooo square man</DESCRIPTION>
+        <POINT x="0" y="0" />
+        <POINT x="0" y="1" />
+        <POINT x="1" y="1" />
+        <POINT x="1" y="0" />
+    </POLYGON>
+Everything in XML lives between <TAG></TAG> pairs.
+A tag may have associated text, attributes and subtags.
+In the example above POLYGON has the subtags DESCRIPTION
+and POINT, the attribute id, and no text. DESCRIPTION has
+the text "This is soooo square man" and no subtags or
+attributes. POINT has the attributes x and y. POINT
+also illustrates a little XML shortcut - tags containing
+only attributes can be written <TAG att="something" />
+as a shortcut for <TAG att="something"></TAG>.
+
+XML is much like HTML, but has significant differences as
+well. All attribute values must be enclosed in quotes in XML,
+while quotes are optional in HTML. Tags must strictly
+nest in XML, while HTML allows tags to be opened but not
+closed. The tags in HTML are predefined. In XML the
+definition of tags is up to you.
+
+Tags can be defined two ways in XML - by a DTD file or
+by an XML schema. There are pros and cons of each
+method. DTD files are relatively simple, and are recognized
+by a wide variety of parsers and XML browsers. On the
+other hand DTD files can't express that a certain attribute
+has to be numerical. XML schemas are more complex. They
+are themselves written in a type of XML, which is nice in
+some ways. They are not as widely supported yet. Currently
+autoXml only works with DTD files with some modest extensions.
+
+Here is a DTD file which would describe the POLYGON format
+above:
+
+<!ELEMENT POLYGON (DESCRIPTION? POINT+)>
+<!ATTLIST POLYGON id CDATA #REQUIRED>
+
+<!ELEMENT DESCRIPTION (#PCDATA)>
+
+<!ELEMENT POINT>
+<!ATTLIST POINT x CDATA #REQUIRED>
+<!ATTLIST POINT y CDATA #REQUIRED>
+<!ATTLIST POINT z CDATA "0">
+
+The DTD has two major types of definitions - ELEMENTs and ATTLISTs
+(or attributes). An element definition includes the name of
+the element and an optional parenthesized list of subelements.
+The subelements must be defined elsewhere in the DTD with the
+exception of the #PCDATA subelement, which is used to indicate
+that the element can have text between its tags. Each subelement
+may be followed by one of the following characters:
+    ? - the subelement is optional
+    + - the subelement occurs at least once
+    * - the subelement occurs 0 or more times
+If there is no following character the subelement occurs exactly
+once.
+
+The ATTLIST defines an attribute and associates it with an element.
+It is good style to keep ATTLISTs together with their ELEMENT.
+Here are the fields in an ATTLIST:
+    element - name of element this is associated with
+    name - name of this attribute
+    type - generally CDATA. Can be a reference or date, but these
+           are not supported by autoXml.
+    default - this contains a default value to be used if the attribute
+           is not present. The keyword #REQUIRED in this field means
+           that the attribute must be present. The keyword #IMPLIED
+           means that it's ok for this attribute to be missing (in which
+           case it will have a NULL or zero value after it is read by
+           autoXML).
+
+There's a third type of tag in a DTD file, the ENTITY. This tag is
+used more or less as a macro definition. An example of an ENTITY is
+    <!ENTITY % address "street,city,state,zip">
+After this ENTITY is defined you can type %address; with the same effect
+as typing street,city,state,zip.
+
+
+AUTOXML EXTENSIONS AND LIMITS
+
+One disadvantage of DTDs is that all types are strings. This is not
+convenient for a language like C where numerical and string types are
+handled very differently, and indeed where numerical types can be handled
+much more efficiently than string types. To get around this we make use
+of some predefined XML entities. In an ATTLIST you can use the
+entities %INT; and %FLOAT; which will map to C int and double types.
+Instead of #PCDATA you can use the entities %INTEGER; and %REAL;
+for the same effect. The %INTEGER; and %REAL; entities are used by
+NCBI as well as UCSC. As far as I can tell NCBI doesn't have definitions
+for numerical attributes.
+
+Currently AutoXML can't handle external DTDs or DTDs that reference
+other DTDs.
+
+
+AUTOXML CODE GENERATION
+
+The polygon.dtd file here:
+
+<!ELEMENT POLYGON (DESCRIPTION? POINT+)>
+<!ATTLIST POLYGON id CDATA #REQUIRED>
+<!ELEMENT DESCRIPTION (#PCDATA)>
+<!ELEMENT POINT>
+<!ATTLIST POINT x %FLOAT; #REQUIRED>
+<!ATTLIST POINT y %FLOAT; #REQUIRED>
+<!ATTLIST POINT z %FLOAT; "0">
+
+and the command line:
+    autoXml polygon.dtdx poly
+
+generates poly.h as follows:
+
+/* poly.h autoXml generated file */
+#ifndef POLY_H
+#define POLY_H
+
+struct polyPolygon
+    {
+    struct polyPolygon *next;
+    char *id;	/* Required */
+    struct polyDescription *polyDescription;	/** Optional (may be NULL). **/
+    struct polyPoint *polyPoint;	/** Non-empty list required. **/
+    };
+
+void polyPolygonSave(struct polyPolygon *obj, int indent, FILE *f);
+/* Save polyPolygon to file. */
+
+struct polyPolygon *polyPolygonLoad(char *fileName);
+/* Load polyPolygon from file. */
+
+struct polyDescription
+    {
+    struct polyDescription *next;
+    char *text;
+    };
+
+void polyDescriptionSave(struct polyDescription *obj, int indent, FILE *f);
+/* Save polyDescription to file. */
+
+struct polyDescription *polyDescriptionLoad(char *fileName);
+/* Load polyDescription from file. */
+
+struct polyPoint
+    {
+    struct polyPoint *next;
+    double x;	/* Required */
+    double y;	/* Required */
+    double z;	/* Defaults to 0 */
+    };
+
+void polyPointSave(struct polyPoint *obj, int indent, FILE *f);
+/* Save polyPoint to file. */
+
+struct polyPoint *polyPointLoad(char *fileName);
+/* Load polyPoint from file. */
+
+#endif /* POLY_H */
+
+It generates a corresponding .c file as well. Each XML file has
+to have a root object. In this case the root object is POLYGON
+(our DTD as is won't let us have more than one polygon per file).
+You can read an XML file that respects this DTD using the
+polyPolygonLoad() function, and save it back out using the
+polyPolygonSave() function.
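
As a rough illustration of how the generated structures might be used
once a file is loaded, here is a minimal sketch. It repeats the three
struct declarations from poly.h above so that it compiles on its own;
a real program would include poly.h and get its polygon from
polyPolygonLoad() instead. The helpers polyPointCount() and
polyArea() are hypothetical names made up for this example and are
not generated by autoXml. The area is computed with the shoelace
formula in the x/y plane, which ignores z and is only an assumption
about what a caller might want.

```c
#include <assert.h>
#include <stddef.h>

/* Declarations repeated from the generated poly.h shown above. */
struct polyDescription
    {
    struct polyDescription *next;
    char *text;
    };

struct polyPoint
    {
    struct polyPoint *next;
    double x;   /* Required */
    double y;   /* Required */
    double z;   /* Defaults to 0 */
    };

struct polyPolygon
    {
    struct polyPolygon *next;
    char *id;   /* Required */
    struct polyDescription *polyDescription;    /* Optional (may be NULL). */
    struct polyPoint *polyPoint;                /* Non-empty list required. */
    };

int polyPointCount(struct polyPolygon *poly)
/* Hypothetical helper: count the POINT subelements of a polygon. */
{
int count = 0;
struct polyPoint *pt;
assert(poly != NULL);
for (pt = poly->polyPoint; pt != NULL; pt = pt->next)
    ++count;
return count;
}

double polyArea(struct polyPolygon *poly)
/* Hypothetical helper: unsigned area in the x/y plane (shoelace formula).
 * The last point is joined back to the first.  z is ignored. */
{
double sum = 0;
struct polyPoint *pt, *nextPt;
assert(poly != NULL);
for (pt = poly->polyPoint; pt != NULL; pt = pt->next)
    {
    nextPt = (pt->next != NULL ? pt->next : poly->polyPoint);
    sum += pt->x * nextPt->y - nextPt->x * pt->y;
    }
if (sum < 0)
    sum = -sum;
return 0.5 * sum;
}
```

For the square in the tutorial section, with its points kept in file
order, polyPointCount() should give 4 and polyArea() should give 1.
The list is walked through the next pointer, which is the first field
of every generated structure.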