XML::Twig Quick Reference
This is a quick list of the most useful features of XML::Twig. Some seldom-used methods and options have been omitted, so if you are using the module and cannot find a way to do something, refer to the complete documentation by running perldoc XML::Twig or going to xmltwig.com.
Conventions used in this document: arguments whose names start with opt_ are optional; (3.00) denotes methods added in version 3.00 of XML::Twig.
Twig Options
Options set when creating the twig can be written either in a Java-like style (UglyOptionName) or in a more Perl-ish style (a_cool_option_name); they are normalized before being used.
twig_handlers | $handlers | $handlers is a ref to a hash expression => sub_ref, where expression is an XPath-like expression that triggers a call to the subroutine referenced by sub_ref. The subroutine receives 2 arguments: the twig itself and the element ($_ is also set to the element). It is called once the element is completely parsed |
twig_roots | $handlers | $handlers is a ref to a hash expression => sub_ref or 1. A twig is built only for the elements for which the expression is true. If the value is a sub_ref then it is called with the twig and the element as arguments. Elements outside the twig roots are ignored (or printed if twig_print_outside_roots is set); elements inside the twig roots are included in the twig and can trigger twig_handlers |
twig_print_outside_roots | true or false value | can only be used together with twig_roots; if set to a true value, all parts of the document that are not inside the twig roots are printed |
start_tag_handlers | $handlers | $handlers is a ref to a hash expression => sub_ref, where expression is an XPath-like expression that triggers a call to the subroutine referenced by sub_ref. The subroutine receives 2 arguments: the twig itself and the element. It is called as soon as the start tag for the element has been parsed, so the element will only contain its attributes, but not its sub-elements |
keep_encoding | true or false value | keeps the original encoding of the document |
pretty_print | 'nsgmls', 'nice', 'indented', 'record' or 'record_c' | 'nsgmls' is kind of ugly but safe; 'nice' and 'indented' look better but can produce invalid, though still well-formed, XML (the document no longer conforms to its DTD but it is still XML); 'record' and 'record_c' are nice for record-oriented documents |
empty_tags | 'html' | by default empty tags are displayed as <empty/>, setting this option makes them display as <empty /> (an extra space is added before the /) so HTML browsers display them properly |
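As a rough illustration of the options above, the following sketch creates a twig with one twig_handlers entry and a pretty_print style. The sample document and the @titles variable are made up for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my @titles;    # hypothetical collector for the example

my $twig = XML::Twig->new(
    twig_handlers => {
        # called once each title element has been completely parsed;
        # the handler receives the twig and the element
        title => sub {
            my ($t, $elt) = @_;
            push @titles, $elt->text;
        },
    },
    pretty_print => 'indented',
);

# made-up sample document
$twig->parse('<doc><section><title>Intro</title><para>Some text</para></section></doc>');

print "$_\n" for @titles;
```

The same options could also be written as TwigHandlers and PrettyPrint, since option names are normalized.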
Twig Methods
Note that some of the methods only make sense when used in a handler: purge, finish, finish_print for example, while others should only be used once the document is completely parsed: print for example.
parse | $string or \*OPEN_FILEHANDLE | parse a document from a string or from an open filehandle |
parsefile | $filename | parse a document from a file |
print | $opt_filehandle | print the entire document (use only after the parse!), optionally to a filehandle |
sprint | | return the entire document as a string |
safe_parse | $string or \*OPEN_FILEHANDLE | this method is similar to parse except that it wraps the parsing in an eval block. It returns the twig on success and 0 on failure (the twig object also contains the parsed twig). $@ contains the error message on failure. |
safe_parsefile | $filename | same as safe_parse, but for a file name |
flush | $opt_filehandle | print the document so far and release the memory for all output elements. Don't forget to flush one last time after the parsing is done to output the end of the document |
purge | | same as flush except that the twig is not printed; releases as much memory as possible by purging all closed elements |
root | | return the root element of the twig |
first_elt | $opt_gi | return the first ($opt_gi) element in the twig |
get_xpath | $xpath, $opt_offset | return the list of elements selected by the $xpath expression; if $opt_offset is given, return just one element, the one at that offset in the list |
finish | | unset all handlers and finish parsing the document as fast as possible |
finish_print | | unset all handlers and finish parsing the document as fast as possible, printing the rest of the document as is |
dispose | | release the memory used by the twig; use it if your script creates a lot of twigs (3.00) |
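A minimal sketch of a few of the twig methods above, assuming two made-up documents (one well-formed, one broken). It uses safe_parse so that a parse error does not kill the script; the variable names are only for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# a well-formed (made-up) document parses normally
my $twig = XML::Twig->new;
my $ok   = $twig->safe_parse('<doc><para>first</para><para>second</para></doc>');

my ($root_gi, $first_text);
if ($ok) {
    $root_gi    = $twig->root->gi;                  # tag of the root element
    $first_text = $twig->first_elt('para')->text;   # text of the first para in the twig
    print "$root_gi: $first_text\n";
}

# a broken document makes safe_parse return a false value instead of dying
my $bad    = XML::Twig->new;
my $bad_ok = $bad->safe_parse('<doc><para>unclosed</doc>');
print "parse failed\n" unless $bad_ok;
```

Separate twig objects are used for the two parses here simply to keep the example easy to follow.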
Element Methods
Method | Arguments | Description |
Elements | ||
print | $opt_filehandle, $opt_pretty_print_style | print the element |
sprint | $opt_no_enclosing_tag | return the element as a string, including its tags; if $opt_no_enclosing_tag is true then the outside tags are omitted (equivalent to xml_string in 3.00). XML base entities are escaped |
gi | | return the gi (the tag) of the element. Equivalent to the tag method (3.00) |
set_gi | $gi | set the gi (the tag) for the element to $gi. Equivalent to the set_tag method (3.00) |
text | | return the text of the element (without any tags, the text is not XML-escaped) |
trimmed_text | | return the trimmed text of the element (without any tags, the text is not XML-escaped): leading and trailing whitespace is trimmed and all consecutive spaces are collapsed to a single one |
new | $opt_gi, $opt_atts, @opt_content | create a new element, $opt_atts is a ref to a hash of attributes (a-la CGI.pm), @opt_content is a list of strings and elements used as the children of the element. |
parse | $string, %args | create a new element from $string; %args is a hash with the arguments used to create the twig containing the element |
set_text | $text | set the text of the element |
set_content | $opt_atts, @content or $opt_atts, '#EMPTY' | set the content of the element, $opt_atts is a ref to a hash of attributes (a-la CGI.pm), @content is a list of elements and strings, '#EMPTY' creates an empty element |
Attributes | ||
att | $att | get the $att attribute or undef |
atts | | return a reference to a hash containing the attributes of the element |
set_att | $att, $value | set the value of attribute $att |
set_atts | $atts_ref | set the attributes of the element using the hash referenced by $atts_ref |
del_att | $att | delete the $att attribute |
del_atts | | delete all of the attributes of the element |
Cut'n Paste | ||
cut | | cut the element from the tree |
paste | $opt_position, $ref_elt | paste the element before, after, as first_child (default) or last_child of $ref_elt |
move | $opt_position, $ref_elt | same as paste but cut the element before pasting it |
replace | $ref | replace $ref by the element in the tree |
copy | | return a "deep" copy of the element |
delete | | cut the element from the tree and delete it |
cut_children | | cut all children of the element, returns the list of children |
Navigation | ||
first_child | $opt_gi | return the first ($opt_gi) child of the element |
last_child | $opt_gi | return the last ($opt_gi) child of the element |
prev_sibling | $opt_gi | return the ($opt_gi) previous sibling of the element |
next_sibling | $opt_gi | return the ($opt_gi) next sibling of the element |
parent | $opt_gi | return the ($opt_gi) parent of the element |
children | $opt_gi | return the list of ($opt_gi) children of the element |
descendants | $opt_gi | return the list of ($opt_gi) descendants of the element |
ancestors | $opt_gi | return the list of ($opt_gi) ancestors of the element |
get_xpath | $xpath, $opt_offset | return the list of elements selected by the $xpath expression; if $opt_offset is given, return just one element, the one at that offset in the list |
Note: starting at XML::Twig 3.00.10 $opt_gi can be either a gi, #ELT (in which case any "real" element is returned), #TEXT (in which case any "text", PCDATA or CDATA element is returned), a regexp, applied to the gi of elements, or a code reference, applied to the element. | ||
Twig Specials | ||
field | $opt_gi | return the text of the first child ($opt_gi) of the element |
prefix | $string | prefix the element with $string |
suffix | $string | suffix the element with $string |
insert | @gi | For each $gi in @gi insert an element $gi as the only child of the element; all original children of the element become children of the new element. Returns the innermost element: $table->insert( 'tr', 'td', 'p'); creates a single tr, a single td nested in it, a p nested in the td, and returns the p element |
wrap_in | @gi | Wrap the element in elements from @gi, return the outer element: $p->wrap_in( 'td', 'tr', 'table'); puts $p in a table with a single tr and a single td and returns the table element. |
erase | | cut the element and paste its children in its place, as if the tag had been erased from the document |
in | $parent | return true if the element is in the element $parent |
in_context | $gi, $opt_level | return true if the element is included in an element whose gi is $gi, optionally within $opt_level levels, the returned value is the innermost including element $gi |
inherit_att | $att, @opt_gi | return the value of an attribute inherited from parent tags. The first value found by looking at the element then in turn at each of its ancestors (in @opt_gi) is returned. |
level | $opt_gi | return the depth of the element in the twig (root is 0). If $opt_gi is given then only ancestors of the given type are counted. |
next_elt | $opt_root, $opt_gi | return the next ($opt_gi) element (the next element found in the document after the opening tag of the element); if $opt_root is used then undef is returned when the next element is not under the element $opt_root, so you can use my $elt= $subtree_root; while( $elt= $elt->next_elt( $subtree_root)) { my_process( $elt); } to loop through all elements in $subtree_root |
path | | return a string showing the path to the element, XPath style: /doc/section/title |
remove_cdata | | remove all CDATA markers in the element. Useful when you have HTML-in-a-CDATA-section in a document that you want to ignore during processing, but that you might want to output as markup when converting to HTML |
simplify | same arguments as XMLin in XML::Simple | (experimental in 3.10): generate a data structure similar to the one generated by XML::Simple's XMLin for an element |
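The element methods above can be combined roughly as follows. This is only a sketch on a made-up document: it sets an attribute, creates and pastes a new element, and erases an inline tag. The element names (para, b, note) are hypothetical.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<doc><para>Hello <b>world</b></para></doc>');   # made-up document

my $root = $twig->root;
my $para = $root->first_child('para');

# attributes
$para->set_att(type => 'greeting');

# create a new element and paste it as the last child of the root
my $note = XML::Twig::Elt->new('note', { id => 'n1' }, 'a new note');
$note->paste('last_child', $root);

# erase the b tags: their content stays in place, the tags disappear
$_->erase for $para->children('b');

print $para->text, "\n";        # text of the para, without any tags
print $twig->sprint, "\n";      # the whole modified document as a string
```

children returns a plain list, so erasing elements while looping over that list does not disturb the iteration.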
XPath-like Syntax
XPath-like syntax is used in 2 places: to trigger handlers and in the get_xpath method.
handler triggers
This table describes the various types of pseudo-XPath expressions that can trigger the various handlers. Expression types are listed from highest to lowest priority. If several expressions match then they will be stacked and the various handlers will be called until one of them returns a false value.
Getting only one handler to be triggered for each element is generally regarded as a good way to keep one's sanity...
Convention: literal parts are in bold, variable parts are in normal font.
syntax | Description |
_all_ | always triggers the handler (even if a previous handler returns a false value) |
*[@att] | triggers if the attribute att exists for the element |
*[@att='val'] | triggers if the attribute att exists for the element and is equal to val (a string comparison is performed, not a numeric one) |
gi[string()="foo"] | triggers the handler if the gi of the element is gi and its text is foo; the text is the result of the element's text method; cannot be used for twig_roots and start_tag_handlers |
gi[string(child_gi)="foo"] | triggers the handler if the gi of the element is gi and the text of one of its direct child_gi children is foo; cannot be used for twig_roots and start_tag_handlers |
gi[string()=~ /foo/] | triggers the handler if the gi of the element is gi and its text matches foo; the i, m, s and o modifiers can be used to modify the regexp; cannot be used for twig_roots and start_tag_handlers |
gi[string(child_gi)=~ /foo/] | triggers the handler if the gi of the element is gi and the text of one of its direct child_gi children matches foo; cannot be used for twig_roots and start_tag_handlers |
gi[@att] | triggers the handler for gi elements with an attribute att |
gi[@att="val"] | triggers the handler for gi elements with an attribute att whose value is val |
/root/elt/subelt | triggers the handler for elements matching this exact path, starting from the root |
elt/subelt | triggers the handler for elements matching this path |
elt | triggers the handler for all elt elements |
_default_ | triggers the handler if no other handler has been triggered |
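A small sketch of two of the trigger types listed above, an attribute condition and an exact path from the root. The document and the collecting arrays are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my (@warnings, @titles);    # hypothetical collectors

my $twig = XML::Twig->new(
    twig_handlers => {
        # gi[@att="val"]: only para elements whose type attribute is "warning"
        'para[@type="warning"]' => sub { push @warnings, $_->text; },
        # /root/elt/subelt: exact path, starting from the root
        '/doc/section/title'    => sub { push @titles, $_->text; },
    },
);

# made-up sample document
$twig->parse(
    '<doc><section><title>One</title>'
  . '<para type="warning">Careful</para><para>Plain</para>'
  . '</section></doc>'
);

print "warnings: @warnings\n";
print "titles: @titles\n";
```

Each element here matches at most one expression, which follows the advice above about keeping a single handler per element.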
get_xpath method
Summary of the syntax:
gi | selects gi elements |
gi[1] | selects the first gi element; any integer, positive or negative, can be used; negative integers count from the last element |
gi[last()] | selects the last gi element |
gi[@att] | selects the gi elements which have an attribute att |
gi[@att="val"] | selects the gi elements with an att attribute equal to val |
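gi[@att1="val1" and @att2="val2"] | selects the gi elements for which both attribute conditions are true |
gi[@att1="val1" or @att2="val2"] | selects the gi elements for which at least one of the attribute conditions is true |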
gi[string()="toto"] | selects gi elements whose text (as per the text method) is toto |
gi[string()=~/regexp/] | selects gi elements whose text matches regexp |
In addition:
- expressions can start with / (search starts at the document root)
- expressions can start with . (search starts at the current element)
- .. returns the parent of the current element
- // can be used to get all descendants instead of just direct children
- * matches any gi
Examples
para | selects the para element children of the current element |
* | selects all element children of the current element |
para[1] | selects the first para child of the current element |
para[last()] | selects the last para child of the current element |
*/para | selects all para grandchildren of the current element |
/doc/chapter[5]/section[2] | selects the second section of the fifth chapter of the doc |
chapter//para | selects the para element descendants of the chapter element children of the current element |
//para | selects all the para descendants of the document root and thus selects all para elements in the same document as the current element |
//olist/item | selects all the item elements in the same document as the current element that have an olist parent |
.//para | selects the para element descendants of the current element |
.. | selects the parent of the current element |
para[@type="warning"] | selects all para children of the current element that have a type attribute with value warning |
employee[@secretary and @assistant] | selects all the employee children of the current element that have both a secretary attribute and an assistant attribute |
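A few of the expressions above, run through get_xpath on a made-up document. This is a sketch only: the chapter and para element names and the variable names are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse(
    '<doc>'
  . '<chapter><para type="warning">w1</para><para>p2</para></chapter>'
  . '<chapter><para>p3</para></chapter>'
  . '</doc>'
);
my $root = $twig->root;

# all para elements in the document
my @all_paras = $root->get_xpath('//para');

# only the para elements with type="warning"
my @warning_paras = $root->get_xpath('//para[@type="warning"]');

# para children of the chapter children of the current element (the root)
my @chapter_paras = $root->get_xpath('chapter/para');

# with an offset, a single element is returned instead of a list
my $first_para = $root->get_xpath('//para', 0);

printf "%d paras, %d warnings, first is '%s'\n",
    scalar @all_paras, scalar @warning_paras, $first_para->text;
```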