Extending XML::XPath
- The Players
- Let's do it
- First Contact
- Dealing with Model Mismatches
- First implementation
- Testing
- Adding the missing pieces
- Wrapping it up
- But Wait, There's More!
- Conclusion
Extending XML::XPath
I have always admired Matt Sergeant's XML::XPath
module. XML::XPath
is a pretty complete implementation of the W3C's XPath recommendation. You see, I am the author of XML::Twig
, which implements, in a way that can only be described as dirty, a subset of XPath. Matt's XPath engine is written properly: it uses the BNF (Backus-Naur Form, a syntax to formally describe the... syntax of languages) of the language as the specification, its regexp-based tokenizer feeds an analyzer that builds an internal form of the expression, which is then applied to an XML document (or fragment). This is way cleaner, and more powerful, than the simple regexp-based, all-in-one analyzer that I use in XML::Twig
. Incidently, this sad state of affair is mostly due the fact that I never set out to write a full XPath engine, but it just grew as users, including me, asked for more and more features of the language to be supported.
So when I decided to add a complete XPath engine to XML::Twig
, good software engineering principle told me that I should have a look at XML::XPath
, and try to re-use it as much as possible.
This article will show how I integrated the XML::XPath
engine into XML::Twig
to (at last!) give it full XPath support, then how I got carried away and added XPath support to XML::DOM
too. It will also give you a blueprint on how to add XPath support to modules that manage trees. Maybe more important, the method I used can be used in other cases, to wrestle re-use out of code that might not have been designed for it.
The Players
XPATH (http://www.w3c.org/XPATH/) is a W3C recommendation for the query of XML documents. XPath expressions can be as simple as /doc/title
, which selects all elements title
right within the root of the document (the doc
element), and as complex as //item/title[following-sibling::description[starts_with( string(), "2002")]]
, which selects all title
elements within item
elements, which are followed by a description
element that starts with the string ``2002
''.
is a module that builds a tree representing an XML document, using the DOM (Document Object Model, yet another W3C recommendation). It offers methods to select nodes in the tree using XPath, and methods to modify the tree and to output it (or parts of it) based on the DOM. It is a stable module, quite popular, although XML::LibXML
is probably a better choice if you can install it: it is more powerful, offering DOM, XPath, RelaxNG, XInclude... support, and more efficient, if often more unstable than XML::XPath
.
XML::Twig
is a Perl module that does ``XML, The Perl Way''. It offers a comprehensive API on XML documents, and includes a (very!) crude XPath engine.
An other module on CPAN, Class::XPath
, is designed as a general purpose XPath engine. Its documentation lists clearly the methods to implement to add XPath support to any code that manages trees. I chose not to use it though, because it's XPath engine is quite limited, even less powerful than XML::Twig
's.
Let's do it
First Contact
The first step was to have a look at XML::XPath
, figure out how it worked, and see how I could use it's XPath engine.
My main goal was to implement the findnodes
, findvalue
and findnodes_as_string
methods, which return the result of an XPath query on a document in different forms.
One option was to rip the relevant code from the module, modify it to work with XML::Twig
and include it. This is Open-Source remember, and the Artistic License allows this. I thought a better option, though was to use XML::XPath
and inherit from it. This would be much cleaner, allow me to use future versions of the module, and might be easier if XML::XPath
had been written to allow this. Additionaly it would mean that I could deflect some bug reports towards Matt instead of having to maintain my fork of the code...
So could this be done?
Looking at the code of the module, I quickly figured that it consisted in 4 big parts:
-
XML::XPath::PerlSAX
andXML::XPath::XMLParser
- These packages parse the XML and build a tree of nodes
-
XML::XPath::Node
andXML::XPath::Node::*
- These packages store the various types of nodes in the tree
-
XML::XPath::Parser
- This package parses an XPath expression and stores it in a form that can later be evaluated
- evaluation modules
- Those packages are used to evaluate the parsed XPath expression, on the tree (or on a sub-tree) built by
XML::XPath::XMLParser
.
Logically the parsing of the XPath expression does not depend on the tree, so this block could probably stay as-is. XML::Twig
provides the parsing block, and has its own tree model, which will replace the equivalent code in XML::XPath
. So the real challenge would be to allow the evaluation of the query to work on XML::Twig
nodes instead of on XML::XPath
nodes.
Encouragingly, the main call to evaluate
from XPath.pm looks like this:
return $parsed_path->evaluate($context);
$context
is an XML::Path::Node
. It looked like the context was just passed to the method, so it might be possible to pass a different one, an XML::Twig
one.
I then looked for the evaluate
method in the module: a quick grep
found it in the different packages:
% grep 'sub evaluate' *.pm XPath/*.pm XPath/Node/*.pm XPath/Expr.pm:sub evaluate { XPath/Function.pm:sub evaluate { XPath/Literal.pm:sub evaluate { XPath/LocationPath.pm:sub evaluate { XPath/Number.pm:sub evaluate { XPath/Root.pm:sub evaluate { XPath/Step.pm:sub evaluate { XPath/Step.pm:sub evaluate_node { XPath/Variable.pm:sub evaluate {
Interestingly enough Step.pm
seemed to be the only package one that deals with the nodes in the XML tree. The others either just loop through the steps (LocationPath.pm
) or work on objects that are generated by XML::XPath
during the evaluation of the query.
It looked like I had found the small part of the code that I needed to take a closer look at.
Trusting Matt not to change the name of the parameter gratuitously, I once again grep
-ed for information:
% grep 'context->' XPath/Step.pm # warn "Node: ", $context->[node_name], "\n"; my $parent = $context->getParentNode; $context = $context->getParentNode; foreach my $attrib (@{$context->getAttributes}) { foreach my $node (@{$context->getChildNodes}) { my @stack = $context->getChildNodes; my $parent = $context->getParentNode; while ($context = $context->getNextSibling) { while ($context = $context->getNextSibling) { return $results unless $context->isElementNode; foreach my $ns (@{$context->getNamespaces}) { my $parent = $context->getParentNode; my $parent = $context->getParentNode; while ($context = $context->getPreviousSibling) { while ($context = $context->getPreviousSibling) {
Bingo! This looked very much like the list of methods I would need to implement on my nodes to allow XML::XPath
to evaluate the queries.
A little post-processing later I had a nice list:
% grep '\$context->' XPath/Step.pm | perl -ln -e'print $1 if( m{$context->(\w+)}); ' | sort -u getAttributes getChildNodes getNamespaces getNextSibling getParentNode getPreviousSibling isElementNode
In addition methods were called on nodes themselves, which led to a second list of methods:
% grep 'node->' XPath/Step.pm | perl -ln -e'print $1 if( m{node->(\w+)}); ' | sort -u getChildNodes getExpanded getLocalName getName getNamespace getTarget getValue isCommentNode isElementNode isPINode isTextNode
Most of them were quite straightforward to write, as they existed under a different name in XML::Twig
:
getParentNode
, getChildNodes
, getNextSibling
, getPreviousSibling
, getName
, getValue
, isElementNode
, isCommentNode
, isPINode
, isTextNode
, and getTarget
could all be aliased to the equivalent XML::Twig
method.
getNamespaces
, getExpanded
and getLocalName
, which all deal with namespaces, would have to be written, as XML::Twig
's namespace support was not quite there. This indeed would be a nice by-product of the whole exercice, giving me a reason, and already-written tests, to improve XML::Twig
.
Finally getAttributes
showed the real problem: the difference between XML::XPath
's and XML::Twig
's tree models.
Dealing with Model Mismatches
XML::XPath
uses the DOM model to store XML: a base class of XML::XPath::Node
provides the basic functions on nodes, and is subclassed in XML::XPath::Node::Attribute
, XML::XPath::Node::Comment
, XML::XPath::Node::Element
, XML::XPath::Node::Namespace
, XML::XPath::Node::PI
and XML::XPath::Node::Text
. XML::Twig
on the other hand uses mostly 2 classes: the entire document, XML::Twig
, element and XML::Twig::Elt
. Elements can be regular elements, comments, PI's or text, depending on their tag name. Attributes are simply attached to the elements, they are not objects themselves. The problem was worse because attributes are actually stored in a hash, and despite the module providing methods to access them, most of the code I have seen uses that fact: users write $elt->{atts}->{foo}
instead of the cleaner $elt->att( 'foo')
. So even though the XML::Twig
API would have allowed me to change the implementation and promote attributes to full-fledged objects, doing so would have likely considerably annoyed most users.
A look at the way XML::XPath
uses attributes showed that getAttributes
returns an array (or a reference to an array) of attribute nodes, each attribute having a name and a value, accessible through the getName
and getValue
methods.
In short I could not create an attribute class, as it would break existing code.
Or couldn't I?
The fact that XML::Twig
does not use a separate class for attributes for its own processing does not preclude the use of one just for XML::XPath
benefit. So I could perfectly have the getAttributes
method create attribute objects and return them.
Namespaces could be dealt with the same way: although they would be handled through methods in XML::Twig::Elt
, the class that would interface with XML::XPath
would create XML::Twig::XPath::Namespace
objects and return them when needed.
First implementation
It was now time to try implementing this:
I created a file XPath.pm
that held 2 packages sub-classing XML::Twig
s', XML::Twig::XPath
and XML::Twig::XPath::Elt
, and the new packages XML::Twig::XPath::Attributes
and XML::Twig::XPath::Namespace
.
As a side note, I don't like the common Perl convention of having only one package per file, along with its documentation, as it makes navigating the docs harder than it should. I cannot count the number of times people have asked questions about XML::Parser
that were plainly stated in the documentation for XML::Parser::Expat
, that no one ever reads. So the documentation for XML::Twig::XPath
ended up in XML::Twig
.
XML::Twig::XPath
subclasses XML::Twig
. It mostly adds an XML::XPath
object to the base object, that will be used to perform the queries, and uses XML::Twig::XPath::Elt
to store the nodes of the tree. It also includes the methods I wanted to add to XML::Twig
: findnodes
, findnodes_as_string
, findvalue
, exists
, find
and matches
. Each of them is just a call to the same method on the twig's XML::XPath
object, with the path and the twig as arguments.
XML::Twig::XPath::Elt
is similar: it subclasses XML::Twig::Elt
and adds it the XML::XPath
methods. It also includes the two methods getAttributes
and getNamespace
, which create respectively an array of XML::Twig::XPath::Attribute
objects and an XML::Twig::XPath::Namespace
objects.
Those objects are initialized from the data in the XML::Twig::Elt
object, and are designed to mimic the corresponding objects in XML::XPath::Node::Attribute
and XML::XPath::Node::Namespace
Testing
This looked OK, it even passed perl -c
;--)
The next step was testing (even though I like testing, I am not an eXtreme Programer, I code first, then test).
The thing is, I am not an expert in XPath. For sure I have read the recommendation, and I know how to write the queries I need, but writing comprehensive tests requires quite a bit more knowledge, not to mention efforts, than that. Especially when you are too lazy to write the XPath engine yourself in the first place!
The solution was obvious: re-use the tests from XML::XPath
!
A quick look at the t/
directory showed that the tests on the query engine were neatly separated from the ones on the other functions of XML::XPath
. The tests themselves were written in a very consistent way, that lead itself to automatic processing. Ah the joys of working on a module written by a good coder!
This lead me to writing this simple script, that would take an XML::XPath
test and generate an XML::Twig::XPath
one.
#!/usr/bin/perl -p -w use strict; BEGIN { # add the shebang line, use strict and the CVS id line print qq{#!/usr/bin/perl -w\nuse strict;\n# \$Id\n}; } s{^use XML::XPath;}{use XML::Twig::XPath;}; # change the module to use s{^my \$xp = XML::XPath->new\(ioref => \*DATA\);} # change the object creation {my \$t= XML::Twig::XPath->new->parse( \\*DATA);}; s{^ok\(\$xp\);}{ok(\$t);}; # test the creation on $t, not $xp s{\$xp->}{\$t->}g; # method calls on $xp are now method calls on $t
Note that I changed the main object from $xp
to $t
, and added use strict
in order to get an error if $xp
was used in any other context than testing its creation (ok($xp)
) or to call methods on it.
Then of course the first make test
spewed errors all over the place...
Adding the missing pieces
Actually it wasn't too bad. Once I had fixed a couple of syntax errors in the tests and added the getRootNode
method that I had missed, the error message I got, for every single test, was:
Can't locate object method "get_global_pos" via package "XML::Twig::XPath::Elt" at /usr/lib/perl5/site_perl/5.8.1/XML/XPath/NodeSet.pm line 20.
What was this mysterious get_global_pos
method?
A little investigation in NodeSet.pm
revealed the ugly truth: a NodeSet
is an object that stores an array of results, or partial results, for a query. It is created deep in the bowels of XML::XPath
, in every place where results are generated.
XML::XPath
always return results in the document order. In my experience users are very often disapointed when the results of a query are not in document order, so really this makes sense. It is even needed for some methods (findnodes_as_string
and findvalue
). Plus of course the XPath recommendation seems to require it...
The way NodeSet
s are sorted in XML::XPath
is by using an internal property of the nodes, that gives their position in the document. That's what get_global_pos
returns. This position is updated, and the document renumbered every time a new element is added.
The problem here is that XML::Twig
does not have such a property. As it is oriented towards modifying XML documents, even huge ones, I feel that the renumbering phase would be too expensive to perform. It can sort elements though, using the cmp
method.
So the solution looked easy: I needed to subclass XML::XPath::NodeSet
and change its sort
methods.
But I couldn't...
XML::XPath::NodeSet
objects are created within XML::XPath
sub classes, not by user code. XML::XPath
is an object factory. Practically, this means that the module does not offer any way to subclasse XML::XPath::NodeSet
. Interestingly enough XML::Twig
used to have the same problem with XML::Twig::Elt
objects: as they were created by XML::Twig
during the parsing, it was impossible to use subclasses of XML::Twig::Elt
to add custom methods. I solved this by adding an option to the constructor of XML::Twig
, that lets the user specify an alternate class to use.
This could not be done in this case though, as the XML::XPath::NodeSet
constructor is called from places in the code where the initial XML::XPath
object is not even in scope. And anyway I did not want to change XML::XPath
at all.
So the solution was to use the power of Perl to cheat...
I redefined the sort
method directly in the XML::XPath::NodeSet
package, from XML::Twig::XPath
:
BEGIN { package XML::XPath::NodeSet; no warnings; # to avoid the "Subroutine sort redefined" message # replace the native sort routine by a Twig'd one sub sort { my $self = CORE::shift; # XML::XPath::NodeSet has a shift method @$self = CORE::sort { $a->node_cmp( $b) } @$self; return $self; } }
Note that this removes the original sort in XML::XPath::NodeSet
, so you can't use both XML::Twig::XPath
and XML::XPath
at the same time (you might not want to do this, but it would be convenient for me, to add more tests that would compare the results with the 2 modules). I also sent an email to Matt to tell him about that problem, and he agreed to add a way to set this method cleanly in the next version of XML::XPath
.
Now I just had to write the node_cmp
methods for the various types of nodes, XML::Twig::XPath
, XML::Twig::Elt
and XML::Twig::Attribute
, which turned out to be a little tedious because each comparison had to test the type of the other node before using the already existing cmp
.
Here is for example node_cmp
for the XML::Twig::XPath::Attribute
package:
sub node_cmp($$) { my( $a, $b)= @_; # $a is always an attribute, but $b can be an attribute, element or document if( UNIVERSAL::isa( $b, 'XML::Twig::XPath::Attribute')) { # 2 attributes: compare their elements, then their name if on the same element return ($a->{elt}->cmp( $b->{elt}) ) || ($a->{name} cmp $b->{name}); } elsif( UNIVERSAL::isa( $b, 'XML::Twig::XPath::Elt')) { # attribute <=> elt : compare the attribute->elt and the elt # if attribute->elt is the elt (cmp returns 0) then 1 (elt is before att) return ($a->{elt}->cmp( $b) ) || 1 ; } elsif( UNIVERSAL::isa( $b, 'XML::Twig::XPath')) { # attribute <=> document, document is before att return 1; } else { die "unknown node type ", ref( $b); } }
Once that was done, most tests started passing, with the exception of the namespace one, and the one processing xml:lang
attribute.
Writing the namespace processing code was a bit of a pain, due to the complete lack of support for them both in XML::Twig
and in my brain. It was very valuable nevertheless, as most of it ended up in the main XML::Twig
module, adding some much needed features to the module. Reading the code for XML::XPath
and the tests helped a lot here. XML::XPath
ended up providing me with the specification for the support of namespaces in XML::Twig
. It would have been much harder for me to figure out what to do on my own.
Wrapping it up
Then I added some more tests just to make sure that everything worked fine, this time re-using XML::Twig
's original tests. This showed me that I was still lacking a to_number
method for all of my new classes.
I proceeded to add some more tests of my own... these methods are really well tested now: HARNESS_PERL_SWITCHES=-MDevel::Cover make test
gives me a coverage of 100.0% subroutines, statements and branches, and 92.3% conditions, I suspect the missing condition is a bug in Devel::Cover
;--).
The documentation for the new module was quite easy to write: a quick synopsys, a list of methods, and a reference to XML::Twig
and to XML::XPath
for aditional details!
Then it was time to release XML::Twig 3.12
, which includes XML::Twig::XPath
.
But Wait, There's More!
XML::DOM::XPath
Once XML::Twig::XPath
was complete, I also realized that the old-but-still-fairly-popular XML::DOM
could well benefit from an injection of XPath. So I set out to write XML::DOM::XPath
, which adds exactly the same methods to XML::DOM
, and will help those poor souls who still have to maintain code written with XML::DOM
. This was easier in many ways than the previous task, as XML::DOM
and XML::XPath
use very similar models for the documents (both are DOM based), and even the methods names are mostly identical. It was a bit of a pain in places though, precisely for the same reason: XML::XPath
needed to use methods that already existed in XML::DOM
, but with a very different result (in fact usually a different result type). This had to be solved using dirty hacks to return different results depending on the caller. Once again the main problem came from the fact that lots of objects are created during the parsing, with no clean way to change the class of the objects being created (short of re-writing a whole lot of methods).
Overall this was quick and fun, I even got queries using namespaces to work (by grabbing code from XML::Twig
this time, sorry, I can't help it), even though XML::DOM
does not support them!
Conclusion
Re-using XML::XPath:
Re-using XML::XPath
for a different type of tree structure can be very simple. The XML model is a bit complex, but most trees only use attributes or content, not both, and don't have the concept of namespaces, not to mention processing instructions and comments.
For a simple tree of named elements, without any text, all you need is a node class that offers the following methods:
getName getValue getRootNode getParentNode getChildNodes getNextSibling getPreviousSibling isElementNode
If you need text leaves, a text class is in order, with the same methods (getValue
returns the text itself), except that isTextNode
replaces isElementNode
.
Easy hey? If you want attributes, then you will need a getAttributes
method for the node class, plus an attribute class that offers getName
, getValue
and isAttributeNode
.
In addition all of these classes need a to_number
method that creates a new XML::XPath::Number
from the text of the node.
And I pity the fool who wants namespace support on non-XML trees!
Note that you don't need to have separate classes, XML::Twig
itself uses a single one for elements and text (and comments and processing instructions), you just need isElementNode
or isText
to return the appropriate value.
Add to these a custom sort
method, et voila! Actually the sort
can be quite a pain to write, as we have seen above. If the tree is quasi-static (ie if it is not modified too much and re-numbering it when necessary is not prohibitive), using a global_pos
method on nodes and not subclassing XML::XPath::NodeSet
might make things easier.
Overall I spent a good day coding this, writing about 200 lines of code between the XML::Twig::XPath
module and the original XML::Twig
. This is certainly a lot less than I would have spend re-writing an XPath parser on my own.
Considerations on re-using an existing module
The whole exercice goes beyond allowing me to extend XML::Twig
for cheap. It also shows a way to re-use the code of a module with as little effort as possible, using grep
to isolate an API for the module that was not necessarily designed for re-use. It also shows how to reuse the tests, thus saving even more time.
I have to stress though that this works much better if the existing module is written cleanly, with good OO discipline, consistent naming of variables and good tests (thanks Matt!).
And of course reusing the existing tests does not mean that you do not need more tests.
Finally
As I was procrastinating^Hfinishing writing the docs for the new version of the module, I received an email asking me why $twig->findnodes( q{/Radio/Sender[Gueltigkeiten[Gueltigkeit[@Code='Y2003']] and Gebiete[Gebiet[@Code='GES']]]})
did not work...
I was quite pleased with the answer I could give!