Thursday, November 10, 2011

XPath, Math and a bit of History

More than a decade ago now I worked for an XML company named eBusiness Technologies.  Several years before that name (and before it had been purchased), it was an SGML company named Electronic Book Technologies.  Under either name, EBT was well known for its expertise in markup languages.  My product manager co-chaired the XSLT committee, the guy next door (Steve DeRose) wrote the XPath and XPointer, and a fellow two offices down was deeply involved in the XML Workgroup.

So, one of the perks of my job at the time was to go to XML Conferences.  And when the company closed its doors in 2001, and I moved into healthcare, I was still pretty involved in the XML scene. So in 2002 I submitted a paper to the Extreme Markup languages conference and it was accepted.  While I was there, they had organized a poster session.  It wasn't the typical poster session.  They had poster-board and magic-markers, so people could craft posters there on the spot.  I put together a poster that is now somewhere lost in my basement talking about the mathematical properties of XML and XSLT.  It was something that one of the inventors of XPath hadn't even realized when I showed it to him.

There are eight atomic XPath axes:
  1. child 
  2. descendant
  3. parent
  4. ancestor
  5. following
  6. preceding
  7. attribute
  8. self
Four additional axes are combinations of two axes or features (sibling is not an axis, so I call it a feature)
  1. following-sibling combines following and sibling
  2. preceding-sibling combines preceding and sibling
  3. descendant-or-self combines descendant and self
  4. ancestor-or-self combines ancestor and self
The last axes is namespace, and it is kind of special because it flows downward from the point of declaration.

The descendant axis is the transitive closure of the child axis.  That is to say that a child is a descendant, as is a child of a child, et cetera.  The same is true with parent and ancestor.

I add, for the purpose of the poster, the "left" and "right" axes.  The left axis contains the node immediately preceding the context node.  The right axis contains the node immediately following the context node.

Using these, I can define the preceding and following axis as the transitive closure over the left and right axis resp.

If you number the nodes consecutively in an XML document in the order in which they appear, you get some interesting features.  I've found it helpful build a table that shows the nods, along with a pointer to the first non-descendant node and the parent node.  Here is an example data set to work with where this has been done.

Node   #   Type XML FirstNonDesc  Parent
1 Element   <html> 27 0
2 Element  <head> 8 1
3 Element    <title> 5 2
4 Text      This is the title 5 3
5 Element    <link href='file.css' type='text/css' rel='stylesheet'/>  6 2
6 Element    <script type='text/javascript'> 8 2
7 Text      alert("hello world"); 8 6
8 Element  <body> 27 1
9 Element    <table cols='2'> 27 8
10 Element       <thead> 16 9
11 Element          <tr> 16 10
12 Element            <th> 14 11
13 Text              Axis 14 12
14 Element           <th> 16 11
15 Text             description 16 12
16 Element       <tbody> 27 9
17 Element         <tr> 5 16
18 Element           <td> 2 17
19 Text              Left 1 18
20 Element           <td> 2 17
21 Text              X - 1 1 20
22 Element         <tr> 5 16
23 Element           <td> 2 22
24 Text              Right 1 23
25 Element           <td> 2 22
26 Text              X + 1 1 25
27 EOF

Now, something cool happens.  Let's look at the eight atomic axes and the left/right ones I added (for completeness):

Simple AxesTransitive Axes
Axes
FunctionAxes Function
Child Y.parent = X Ancestor Y < X && Y.FirstNonDesc < X
Parent Y = X.parent Descendent Y > X && X.FirstNonDesc < Y
Left Y = X + 1 Preceding Y < X
Right Y = X - 1 Following Y > X
Sibling Y.parent = X.parent && X != Y
Self Y = X

The function given in the table shows what computation you need to perform on two nodes X and Y to determine if the node Y is related to the node X by the relationship specified by the axes.  These can be computed very quickly (in one clock-cycle on most CPUs today).  So, if each node in the XML document is represented by tuple as described above, you can very quickly compute the XPath relationships.  If you add other integers representing a string position and offset for strings stored in an array buffer you can handle name and namespace tests, and with another numeric identifier, you can also handle the namespace nodes and attributes (which is a longer discussion I won't go into).  So, in about 32 bytes, you can represent an XML node really well.

This in part is the basis for the Document Table Model found in Xalan, my favorite XSLT parser.  I don't know if they use the "First Non-Descendant" index as I have done above, but they surely store something to manage the document structure. The math here is very pretty, and is an interesting emergent property of XML and XPath that even the creators of those specifications hadn't fully understood when they built them (based on personal discussions with those very same people).

Why think about this now after all these years?  Well, I've been looking at mapping HQMF into XPath.  This ancient history could be the foundation for an XML based document repository that would enable searching collections of C32 document collections using XPath.  And it could be really fast.

0 comments:

Post a Comment