XML, the Perl Way

Ways to Rome 2 - Kourallinen Dollareita

last updated Thu Aug 21 18:16:07 2003

Introduction

This article follows the original Ways to Rome article. It shows how to process XML data using some of the numerous XML modules on CPAN.

The task for this article is to process data-oriented XML. Data-oriented, as opposed to text-oriented XML, is XML that essentially contains data, often extracted from a data base, or resulting from the serialization of an object, or just configuration data for a piece of software. A lot of XMl modules are specifically oriented towards processing data-oriented XML, which is typicaly easier to process than text-oriented XML (it is usually more regular, doesn't include mixed-content...).

Specification of the problem

Finvoice

The data to process will be XML invoices. Those invoices follow the Finvoice specification. Finvoice is the electronic invoice of the Finnish Bankers' Association. It is one of the few publically available standard way of exchanging invoices between a seller and a buyer. Visa also has published the Visa XML Invoice Specification which is quite a bit more complex than the Finvoice one, so I stuck to the simpler one. SAP also has a specification for invoices, but it does not seem to be public. Note that none of my emails asking for information and support to Finvoice, Visa and SAP was ever answered, so don't expect much support in the XML invoice World ;--(

The Finvoice DTD is quite comprehensive and look quite usable for general invoicing.

I found that it includes a couple of content model that in my experience are not optimal for processing, where elements that belong together are not wrapped in a containing element:

  <! ELEMENT Finvoice (SellerPartyDetails, SellerOrganisationUnitNumber, SellerContactPersonName?,
             SellerCommunicationDetails?, SellerInformationDetails?, InvoiceRecipientPartyDetails?,
             BuyerPartyDetails, BuyerOrganisationUnitNumber?, BuyerContactPersonName?,
             BuyerCommunicationDetails?, DeliveryPartyDetails?, DeliveryDetails?, InvoiceDetails,
             PaymentStatusDetails?, PartialPaymentDetails*, VirtualBankBarcode?, InvoiceUrlText?,
             InvoiceRow+, SpecificationDetails?, EpiDetails)>

I would much rather have this as:

<! ELEMENT Finvoice (SellerInfo, BuyerInfo, DeliveryInfo?, InvoiceDetails, PaymentDetails, 
          InvoiceRows, SpecificationDetails?, EpiDetails)>

with the sub-elements containing related data. I especially do not like the fact that InvoiceRow is repeated in the "main" Finvoice element, which I don't think is good practice.

Overall the Finvoice DTD is quite simple, most elements have non-contextual names: names that identify them uniquely without needing to know the context. For example there is a SellerStreetName and a BuyerStreetName. While this makes the DTD harder to maintain (if you change the address model you need to change it in several places), it also makes processing quite easier. The only, pretty annoying, exceptions to this rule are the CountryCode and CountryName, which are used in the SellerPostalAddressDetails and BuyerPostalAddressDetails. This is quite a pain because it forces the code to deal with the context. Using non-contextual element names is a valid design option, but not being consistent about it gives you the worst of both worlds: the DTD is hard to maintain and the code has to deal with the context.

In the end Finvoice is quite a convenient DTD and understanding it and using it was not too difficult.

It should be noted that, as for much "standard" DTDs in my experience, it is quite flexible. It allows for exemple invoice rows to be either highly structured (with a lot of optional elements) or... just plain text. This of course increases the complexity for the developer (but makes the DTD usable in the Real World).

A few words about the invoices

These invoice are an interesting class of document: they are clearly data-oriented: thhey have no mixed content and they are very structured. But on the other hand the order of elements is quite significant, at least for display purposes, which is a feature usually found in document-oriented XML. So as long as the task consists solely in checking them and extracting information from them they can be considered as strictly data-oriented, but if we need to modify them, they they should be dealt with as documents.

Additional characteristics of the invoices: they are quite small, and I would hope a single company wouldn't receive several thousands of then every minute. This allows us to load each invoice in memory, and not to have to worry too much about performance (especially as the cost of processing the XML is likely to be small compared to the time spend talking to the data base). This explains the lack of benchmark in this article (at least the initial version).

The task

The task that the various examples will perform is a typical application of what should be done with such data: process invoices, coming in separate files, check them (check that they reference an existing purchase order, that they are addressed to the right company...) and if they are ok store some of the data in a data base (created with DBI and DBD::SQLite for easy set-up). For extra credit the examples can output documents that do not pass the checks, in the original form plus the error messages as new XML elements at the beginning of the document. Note that this is quite arbitrary, and is there mostly to show how to create new elements with each module. A "real" application would likely use a different method to report errors.

Potential problems not dealt with in these examples are:

The Examples

XML::Simple

XML::Simple is extremely convenient for loading XML data into a Perl data structure, using XMLin, and then forgetting that it was ever in XML.

The code was very easy to write. I used YAML during the debugging phase to dump the content of the $xml structure, so it was really easy to see where was the data I needed.

Note the 2 options for XMLin:

forcearray => [ qw(InvoiceRow)]
this will cause XMLin to load the InvoiceRow's in an array, even if there is only one in the document (without that option a single InvoiceRow would be turned into a hash value instead of an array, which would have forced me to test it).
forcecontent => 1
despite the documentation for XML::Simple stating that this option is rarely used I found it made it easier and safer to write the code: all text content is stored in the content field of a hash. This makes accessing the values a little uglier (you have to add an extra ->{content} in the expression), but in fact it saved me treating differently elements that had attributes and elements which didn't, plus it's an extra security if new attributes are added to some elements of the DTD, or for optional attributes, that would cause the generated data structure to be different depending on them being present or not (for example in OrderedQuantity the QuantityUnitCode attribute is optional).

If you use XML::Simple for data where the DTD is simpler and where attributes are always set then you might want not to use this option. In this case I found it convenient.

Finally note that XML::Simple cannot output a modified document, as it looses the order of elements in the document (once elements are assigne to a hash their order is lost), so $CAN_OUTPUT is set to 0 and the error messages are output to the console, rather than creating a new document.

This might or might not be a problem, depending on your specific case.

The complete example is in wtr2_simple

  1  #!/usr/bin/perl -w
  2  use strict;
  3  
  4  use XML::Simple;
  5  
  6  use FindBin qw($Bin);
  7  use lib $Bin;
  8  use wtr2_base;
  9  
 10  my $DEBUG=0;
 11  
 12  init_db();
 13  
 14  # XML::Simple cannot output properly the document
 15  # the order of the elements will be lost
 16  my $CAN_OUTPUT= 0; 
 17  
 18  my @files= @ARGV || (<$dir{invoices}/*.xml>);
 19  
 20  foreach my $file (@files)
 21    { my $xml= XMLin( $file, forcearray => [ qw(InvoiceRow)], forcecontent => 1);
 22      
 23      my $errors= check_invoice( $xml);
 24  
 25      if( !@$errors)
 26        { store_invoice( $xml); }
 27      else
 28        { print "ERROR in $file\n  ", join( "\n  ", @$errors), "\n";
 29          if( $CAN_OUTPUT) 
 30            { my $rejected_file= rejected( $file);
 31              print "adding errors in $rejected_file\n" if( $DEBUG);
 32              add_errors( $xml, $errors);
 33              output_doc_to_check(  $rejected_file, $xml);
 34            }
 35        }
 36    }
 37  
 38  exit;
 39  
 40  sub check_invoice
 41    { my( $xml)= @_;
 42      my $errors=[];  # array ref, holds the error messages
 43  
 44      check_buyer(  $xml->{BuyerPartyDetails}->{BuyerPartyIdentifier}->{content},
 45                    $xml->{BuyerPartyDetails}->{BuyerOrganisationName}->{content},
 46                    $errors
 47                 );
 48      check_po(     $xml->{InvoiceDetails}->{OrderIdentifier}->{content}, $errors);
 49  
 50      my @rows= @{$xml->{InvoiceRow}};
 51  
 52      reset_default_row_id();
 53      
 54      foreach my $row( @rows)
 55        { # this does not cope well with broken row numbers
 56  
 57          my $row_id= $row->{RowIdentifier}->{content} || default_row_id();
 58          
 59          print "checking row $row_id\n" if $DEBUG;
 60  
 61          check_qtty( $row_id,
 62                      $row->{DeliveredQuantity}->{content}, 
 63                      $row->{DeliveredQuantity}->{QuantityUnitCode}, 
 64                      $row->{OrderedQuantity}->{content}, 
 65                      $row->{OrderedQuantity}->{QuantityUnitCode}, 
 66                      $errors
 67                    );
 68        }
 69      
 70      return $errors;
 71    }
 72  
 73  sub store_invoice
 74    { my( $xml)= @_; 
 75      print "storing invoice $xml->{InvoiceDetails}->{InvoiceNumber}->{content}\n";
 76  
 77      # build the various data structures
 78      my $data;
 79  
 80      my $invoice = $xml->{InvoiceDetails};
 81      $data->{invoice} = { number         => $invoice->{InvoiceNumber}->{content},
 82                           date           => $invoice->{InvoiceDate}->{content},
 83                           po             => $invoice->{OrderIdentifier}->{content},
 84                           amount_no_tax  => $invoice->{InvoiceTotalVatExcludedAmount}->{content},
 85                           tax            => $invoice->{InvoiceTotalVatAmount}->{content},
 86                           amount         => $invoice->{InvoiceTotalVatIncludedAmount}->{content},
 87                           payment_status => $xml->{PaymentStatusDetails}->{PaymentStatusCode}->{content},
 88                         };
 89  
 90      my $seller = $xml->{SellerPartyDetails};
 91      $data->{seller}  = { identifier      => $seller->{SellerPartyIdentifier}->{content},
 92                           name            => $seller->{SellerOrganisationName}->{content},
 93                           tax_code        => $seller->{SellerOrganisationTaxCode}->{content},
 94                         };
 95  
 96      my $address = $xml->{SellerPartyDetails}->{SellerPostalAddressDetails};
 97      $data->{address} = { street         => $address->{SellerStreetName}->{content},
 98                           town           => $address->{SellerTownName}->{content},
 99                           zip            => $address->{SellerPostCodeIdentifier}->{content},
100                           country_code   => $address->{CountryCode}->{content},
101                           po_box         => $address->{SellerPostOfficeBoxIdentifier}->{content},
102                         };
103  
104      $data->{contact} = { name           => $xml->{SellerContactPersonName}->{content},
105                           phone          => $xml->{SellerCommunicationDetails}->{SellerPhoneNumberIdentifier}->{content},
106                           email          => $xml->{SellerCommunicationDetails}->{SellerEmailaddressIdentifier}->{content},
107                         };
108  
109      $data->{invoicerow} ||= [];
110      reset_default_row_id();
111  
112      foreach my $invoicerow (@{$xml->{InvoiceRow}})
113        { push @{$data->{invoicerow}},
114            { row_id        => $invoicerow->{RowIdentifier}->{content} || default_row_id(),
115              sku           => $invoicerow->{ArticleIdentifier}->{content},
116              name          => $invoicerow->{ArticleName}->{content},
117              qty           => $invoicerow->{DeliveredQuantity}->{content},
118              qty_unit      => $invoicerow->{DeliveredQuantity}->{QuantityUnitCode},
119              unit_price    => $invoicerow->{UnitPriceAmount}->{content},
120              amount_no_tax => $invoicerow->{RowVatExcludedAmount}->{content},
121              tax           => $invoicerow->{RowVatAmount}->{content},
122              amount        => $invoicerow->{RowAmount}->{content},
123            }
124        }
125  
126  
127      store_all( $data);
128    }

XML::Smart

I wrote this XML::SMART example after having written the XML::Simple one and I must say I was impressed by how compatible it was: it took me a good 10 minutes to modufy the wtr2_simple code to get this one: basically understanding that I needed to work with <$xml-{Finvoice} >> instead of directly the original $xml object, and removing all the extra ->{content} calls as XML::Smart lets me happily get the content of an element using $row->{DeliveredQuantity} and the value of an attribute with $row->{DeliveredQuantity}->{QuantityUnitCode}. Very neat.

In 10 more minutes I could add the errors, by just unshifting the newly created container in the document object.

One big annoyance with this module: I had to remove the -w switch as I kept getting warnings. This should be in the TODO list of the author (along with the unnecessary attacks against XML::Simple in the docs, the module can stand on its own and certainly does not need them).

The complete example is in wtr2_smart

  1  #!/usr/bin/perl 
  2  use strict;
  3  
  4  use XML::Smart;
  5  
  6  use FindBin qw($Bin);
  7  use lib $Bin;
  8  use wtr2_base;
  9  
 10  init_db();
 11  
 12  my $DEBUG=0;
 13  
 14  my $CAN_OUTPUT= 1; 
 15  
 16  my @files= @ARGV || (<$dir{invoices}/*.xml>);
 17  
 18  foreach my $file (@files)
 19    { my $doc= XML::Smart->new( $file);
 20      my $xml= $doc->{Finvoice};
 21      
 22      my $errors= check_invoice( $xml);
 23  
 24      if( !@$errors)
 25        { store_invoice( $xml); }
 26      else
 27        { print "ERROR in $file\n  ", join( "\n  ", @$errors), "\n";
 28          if( $CAN_OUTPUT) 
 29            { my $rejected_file= rejected( $file);
 30              print "adding errors in $rejected_file\n" if( $DEBUG);
 31              add_errors( $doc, $errors);
 32              output_doc_to_check(  $rejected_file, $doc);
 33            }
 34        }
 35    }
 36  
 37  exit;
 38  
 39  sub check_invoice
 40    { my( $xml)= @_;
 41      my $errors=[];  # array ref, holds the error messages
 42  
 43      check_buyer(  $xml->{BuyerPartyDetails}->{BuyerPartyIdentifier},
 44                    $xml->{BuyerPartyDetails}->{BuyerOrganisationName},
 45                    $errors
 46                 );
 47      check_po(     $xml->{InvoiceDetails}->{OrderIdentifier}, $errors);
 48  
 49      my @rows= @{$xml->{InvoiceRow}};
 50  
 51      reset_default_row_id();
 52      
 53      foreach my $row( @rows)
 54        { # this does not cope well with broken row numbers
 55  
 56          my $row_id= $row->{RowIdentifier} || default_row_id();
 57          
 58          print "checking row $row_id\n" if $DEBUG;
 59  
 60          check_qtty( $row_id,
 61                      $row->{DeliveredQuantity}, 
 62                      $row->{DeliveredQuantity}->{QuantityUnitCode}, 
 63                      $row->{OrderedQuantity}, 
 64                      $row->{OrderedQuantity}->{QuantityUnitCode}, 
 65                      $errors
 66                    );
 67        }
 68      
 69      return $errors;
 70    }
 71  
 72  sub store_invoice
 73    { my( $xml)= @_; 
 74      print "storing invoice $xml->{InvoiceDetails}->{InvoiceNumber}\n";
 75  
 76      # build the various data structures
 77      my $data;
 78  
 79      my $invoice = $xml->{InvoiceDetails};
 80      $data->{invoice} = { number         => $invoice->{InvoiceNumber},
 81                           date           => $invoice->{InvoiceDate},
 82                           po             => $invoice->{OrderIdentifier},
 83                           amount_no_tax  => $invoice->{InvoiceTotalVatExcludedAmount},
 84                           tax            => $invoice->{InvoiceTotalVatAmount},
 85                           amount         => $invoice->{InvoiceTotalVatIncludedAmount},
 86                           payment_status => $xml->{PaymentStatusDetails}->{PaymentStatusCode},
 87                         };
 88  
 89      my $seller = $xml->{SellerPartyDetails};
 90      $data->{seller}  = { identifier      => $seller->{SellerPartyIdentifier},
 91                           name            => $seller->{SellerOrganisationName},
 92                           tax_code        => $seller->{SellerOrganisationTaxCode},
 93                         };
 94  
 95      my $address = $xml->{SellerPartyDetails}->{SellerPostalAddressDetails};
 96      $data->{address} = { street         => $address->{SellerStreetName},
 97                           town           => $address->{SellerTownName},
 98                           zip            => $address->{SellerPostCodeIdentifier},
 99                           country_code   => $address->{CountryCode},
100                           po_box         => $address->{SellerPostOfficeBoxIdentifier},
101                         };
102  
103      $data->{contact} = { name           => $xml->{SellerContactPersonName},
104                           phone          => $xml->{SellerCommunicationDetails}->{SellerPhoneNumberIdentifier},
105                           email          => $xml->{SellerCommunicationDetails}->{SellerEmailaddressIdentifier},
106                         };
107  
108      $data->{invoicerow} ||= [];
109      reset_default_row_id();
110  
111      foreach my $invoicerow (@{$xml->{InvoiceRow}})
112        { push @{$data->{invoicerow}},
113            { row_id        => $invoicerow->{RowIdentifier} || default_row_id(),
114              sku           => $invoicerow->{ArticleIdentifier},
115              name          => $invoicerow->{ArticleName},
116              qty           => $invoicerow->{DeliveredQuantity},
117              qty_unit      => $invoicerow->{DeliveredQuantity}->{QuantityUnitCode},
118              unit_price    => $invoicerow->{UnitPriceAmount},
119              amount_no_tax => $invoicerow->{RowVatExcludedAmount},
120              tax           => $invoicerow->{RowVatAmount},
121              amount        => $invoicerow->{RowAmount},
122            }
123        }
124  
125  
126      store_all( $data);
127    }
128  
129  
130  sub add_errors
131    { my( $doc, $error_messages)= @_;
132      my $errors= { error => [@$error_messages]};
133      unshift @$doc, $errors;
134      return $doc;
135    }
136  
137  sub output_doc_to_check
138    { my( $file, $doc)= @_;
139      open( FILE, ">$file") or die "cannot create file to check $file: $!";
140      print FILE $doc->data;
141      close FILE;
142    }

XML::Twig

An XML::Twig version based on the XML::Simple code

This code uses XML::Twig to process the invoices

As this code was written after the wtr2_simple example, it was very, very easy to write. I blatantly cheated ;--): it uses XML::Twig simplify method, which generates the same data structure as XML::Simple for a document (or an element) (available only in XML::Twig 3.10). So once the XML document is parsed in memory a call to that method gives me a data structure that can be processed with exactly the same code as in wtr2_simple.

The original document is still in memory though, which makes it possible to update it with the error messages and output it.

Note that in this example the only modification of the original document is the addition of the error messages. If I had needed to change data within the document before outputing it (for example to link the error messages to the place where th error occurs), then I would have had to access it using XML::Twig methods, see wtr2_twig.

The complete example is in wtr2_twig_simple

Here are the parts that differ from wtr2_simple:

...
 20    { my $doc= XML::Twig->new( pretty_print => 'indented')->parsefile( $file);
 21      my $xml= $doc->simplify( forcearray => [ qw(InvoiceRow)], forcecontent => 1);
...

A "native" XML::Twig version (in full-tree mode)

This code uses XML::Twig to process the invoices. It uses the full-tree mode: load the entire XML document through the parsefile method and then process it.

It uses mostly navigation to access the information, the first_child and field methods. New elements are created using the insert_new_elt method.

This was easy to write, but you would expect so, as I wrote XML::Twig and I am obviously quite familiar with it ;--)

The complete example is in wtr2_twig

  1  #!/usr/bin/perl -w
  2  use strict;
  3  
  4  use XML::Twig;
  5  
  6  use FindBin qw($Bin);
  7  use lib $Bin;
  8  use wtr2_base;
  9  
 10  init_db();
 11  
 12  my $DEBUG=0;
 13  
 14  # XML::Twig can output the updated document, whith the error messages
 15  my $CAN_OUTPUT= 1;
 16  
 17  my @files= @ARGV || (<$dir{invoices}/*.xml>);
 18  
 19  foreach my $file (@files)
 20    { my $doc= XML::Twig->new( pretty_print => 'indented')->parsefile( $file);
 21      
 22      my $errors= check_invoice( $doc);
 23  
 24      if( !@$errors)
 25        { store_invoice( $doc); }
 26      else
 27        { print "ERROR in $file\n  ", join( "\n  ", @$errors), "\n";
 28          if( $CAN_OUTPUT) 
 29            {my $rejected_file= rejected( $file);
 30              print "adding errors in $rejected_file\n" if( $DEBUG);
 31              add_errors( $doc, $errors);
 32              output_doc_to_check(  $rejected_file, $doc);
 33            }
 34        };
 35    }
 36  
 37  exit;
 38  
 39  sub check_invoice
 40    { my( $doc)= @_;
 41      my $root= $doc->root;
 42      my $errors=[];  # array ref, holds the error messages
 43  
 44      check_buyer(  $root->first_child( 'BuyerPartyDetails')->field( 'BuyerPartyIdentifier'),
 45                    $root->first_child( 'BuyerPartyDetails')->field( 'BuyerOrganisationName'),
 46                    $errors
 47                 );
 48      check_po(     $root->first_child( 'InvoiceDetails')->field( 'OrderIdentifier'), $errors);
 49  
 50      my @rows= $root->children( 'InvoiceRow');
 51  
 52      reset_default_row_id();
 53      
 54      foreach my $row( @rows)
 55        { # this does not cope well with broken row numbers
 56  
 57          my $row_id= $row->field( 'RowIdentifier') || default_row_id();
 58          
 59          print "checking row $row_id\n" if $DEBUG;
 60  
 61      my $DeliveredQuantity = $row->first_child( 'DeliveredQuantity');
 62      my $OrderedQuantity   = $row->first_child( 'OrderedQuantity');
 63          my $delivered_qty  = $DeliveredQuantity ? $DeliveredQuantity->text : 0;
 64          my $delivered_unit = $DeliveredQuantity ? $DeliveredQuantity->att( 'QuantityUnitCode') : '';
 65          my $ordered_qty    = $OrderedQuantity ? $OrderedQuantity->text : 0;
 66          my $ordered_unit   = $OrderedQuantity ? $OrderedQuantity->att( 'QuantityUnitCode') : '';
 67  
 68          check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors);
 69        }
 70      
 71      return $errors;
 72    }
 73  
 74  sub store_invoice
 75    { my( $doc)= @_; 
 76      my $root= $doc->root;
 77      print "storing invoice " . $root->first_child( 'InvoiceDetails')->field('InvoiceNumber') . "\n";
 78  
 79      # build the various data structures
 80      my $data;
 81  
 82      my $invoice = $root->first_child( 'InvoiceDetails');
 83      $data->{invoice} = { number         => $invoice->field( 'InvoiceNumber'),
 84                           date           => $invoice->field( 'InvoiceDate'),
 85                           po             => $invoice->field( 'OrderIdentifier'),
 86                           amount_no_tax  => $invoice->field( 'InvoiceTotalVatExcludedAmount'),
 87                           tax            => $invoice->field( 'InvoiceTotalVatAmount'),
 88                           amount         => $invoice->field( 'InvoiceTotalVatIncludedAmount'),
 89                           payment_status => $root->first_child( 'PaymentStatusDetails')
 90                                                  ->field( 'PaymentStatusCode'),
 91                         };
 92  
 93      my $seller = $root->first_child( 'SellerPartyDetails');
 94      $data->{seller}  = { identifier      => $seller->field( 'SellerPartyIdentifier'),
 95                           name            => $seller->field( 'SellerOrganisationName'),
 96                           tax_code        => $seller->field( 'SellerOrganisationTaxCode'),
 97                         };
 98  
 99      my $address = $root->first_child( 'SellerPartyDetails')
100                         ->first_child( 'SellerPostalAddressDetails');
101      $data->{address} = { street         => $address->field( 'SellerStreetName'),
102                           town           => $address->field( 'SellerTownName'),
103                           zip            => $address->field( 'SellerPostCodeIdentifier'),
104                           country_code   => $address->field( 'CountryCode'),
105                           po_box         => $address->field( 'SellerPostOfficeBoxIdentifier'),
106                         };
107  
108      $data->{contact} = { name           => $root->field( 'SellerContactPersonName'),
109                           phone          => $root->first_child( 'SellerCommunicationDetails')
110                                      ->field( 'SellerPhoneNumberIdentifier'),
111                           email          => $root->first_child( 'SellerCommunicationDetails')
112                                      ->field( 'SellerEmailaddressIdentifier'),
113                         };
114  
115      $data->{invoicerow} ||= [];
116      reset_default_row_id();
117  
118      foreach my $invoicerow ($root->children( 'InvoiceRow'))
119        { my $DeliveredQuantity= $invoicerow->first_child( 'DeliveredQuantity');
120          my $qty_unit= $DeliveredQuantity ? $DeliveredQuantity->att( 'QuantityUnitCode') : '';
121  
122      push @{$data->{invoicerow}},
123            { row_id        => $invoicerow->field( 'RowIdentifier') || default_row_id(),
124              sku           => $invoicerow->field( 'ArticleIdentifier'),
125              name          => $invoicerow->field( 'ArticleName'),
126              qty           => $invoicerow->field( 'DeliveredQuantity'),
127              qty_unit      => $qty_unit,
128              unit_price    => $invoicerow->field( 'UnitPriceAmount'),
129              amount_no_tax => $invoicerow->field( 'RowVatExcludedAmount'),
130              tax           => $invoicerow->field( 'RowVatAmount'),
131              amount        => $invoicerow->field( 'RowAmount'),
132            }
133        }
134  
135  
136      store_all( $data);
137    }
138  
139  sub add_errors
140    { my( $doc, $error_messages)= @_;
141      my $errors= $doc->root->insert_new_elt( first_child => 'errors');
142      foreach my $message (@$error_messages)
143        { $errors->insert_new_elt( last_child => error => $message); }
144      return $doc;
145    }
146  
147  sub output_doc_to_check
148    { my( $file, $doc)= @_;
149      open( FILE, ">$file") or die "cannot create file to check $file: $!";
150      $doc->print( \*FILE);
151      close FILE;
152    }

XML::DOM

This code uses XML::DOM to process the invoices

I have never liked the DOM. The Object Model is good, very complete and solid, but the API is very Java oriented (camels are good as Perl mascottes, not as veryLongMethodNames), and at least at level 1 (XML::DOM is a level 1 DOM implementation) quite weak and indeed dangerous.

Having gotten this out of my system...

Writing the DOM example wasn't particulary hard. It was long and quite painful, but more boring than difficult.

A minor annoyance, that could actually be a blessing in a different context: every time I needed to access an attribute for an optional element (DeliveredQuantity for example) I had to check the existence of the element, or calling getAttribute on undef (the non existent element) would cause the script to die. XML::Simple and the likes, by contrast, would let me access non existent hash values in the Perl data structure without complaining. Having to check is a pain in a short script like this one, but could be very useful in a bigger project, as it is always better to have the code die with a bang than fail silently (and cause errors later).

I wrote a little layer on top of it, with the first_child, children and text functions, that basically ensure that when I get a child I get the proper one, and not an extra whitespace, comment or other, that would happen to be in the XML.

The complete example is in wtr2_dom

  1  #!/usr/bin/perl -w
  2  use strict;
  3  
  4  use XML::DOM;
  5  
  6  use FindBin qw($Bin);
  7  use lib $Bin;
  8  use wtr2_base;
  9  
 10  init_db();
 11  
 12  my $DEBUG=0;
 13  
 14  my $CAN_OUTPUT= 1;
 15  
 16  my @files= @ARGV || (<$dir{invoices}/*.xml>);
 17  
 18  foreach my $file (@files)
 19    { my $parser = new XML::DOM::Parser;
 20      my $doc = $parser->parsefile ( $file);
 21      
 22      my $errors= check_invoice( $doc);
 23  
 24      if( !@$errors)
 25        { store_invoice( $doc); }
 26      else
 27        { print "ERROR in $file\n  ", join( "\n  ", @$errors), "\n";
 28          if( $CAN_OUTPUT) 
 29            { my $rejected_file= rejected( $file);
 30              print "adding errors in $rejected_file\n" if( $DEBUG);
 31              add_errors( $doc, $errors);
 32              output_doc_to_check(  $rejected_file, $doc);
 33            }
 34        };
 35    }
 36  
 37  exit;
 38  
 39  sub check_invoice
 40    { my( $doc)= @_;
 41      my $root= $doc->getDocumentElement; # so we really work with the root element
 42      my $errors=[];  # array ref, holds the error messages
 43  
 44      my $BuyerPartyDetails     = first_child( $root, 'BuyerPartyDetails');
 45      my $BuyerPartyIdentifier  = first_child( $BuyerPartyDetails, 'BuyerPartyIdentifier');
 46      my $BuyerOrganisationName = first_child( $BuyerPartyDetails, 'BuyerOrganisationName');
 47      check_buyer(  text( $BuyerPartyIdentifier),
 48                    text( $BuyerOrganisationName),
 49                    $errors
 50                 );
 51      my $InvoiceDetails  = first_child( $root, 'InvoiceDetails');
 52      my $OrderIdentifier = first_child( $InvoiceDetails, 'OrderIdentifier');
 53      check_po( text( $OrderIdentifier), $errors);
 54  
 55      my @rows= children( $root, 'InvoiceRow');
 56  
 57      reset_default_row_id();
 58      
 59      foreach my $row ( @rows)
 60        { # this does not cope well with broken row numbers
 61  
 62          my $row_id= text( first_child( $row, 'RowIdentifier')) || default_row_id();
 63          
 64          print "checking row $row_id\n" if $DEBUG;
 65  
 66          my $DeliveredQuantity= first_child( $row, 'DeliveredQuantity');
 67          my $OrderedQuantity  = first_child( $row, 'OrderedQuantity');
 68  
 69          my $delivered_qty  = text( $DeliveredQuantity);
 70          my $delivered_unit = $DeliveredQuantity ? $DeliveredQuantity->getAttribute( 'QuantityUnitCode') : '';
 71          my $ordered_qty    = text( $OrderedQuantity);
 72          my $ordered_unit   = $OrderedQuantity ? $OrderedQuantity->getAttribute( 'QuantityUnitCode') : '';
 73  
 74          check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors);
 75        }
 76      
 77      return $errors;
 78    }
 79  
 80  sub store_invoice
 81    { my( $doc)= @_; 
 82      my $root= $doc->getDocumentElement; # so we really work with the root element
 83      
 84      print "storing invoice ", text( first_child( first_child( $root, 'InvoiceDetails'), 'InvoiceNumber')), "\n";
 85  
 86      # build the various data structures
 87      my $data;
 88  
 89      my $invoice = first_child( $root, 'InvoiceDetails');
 90      $data->{invoice} = { number         => text( first_child( $invoice, 'InvoiceNumber')),
 91                           date           => text( first_child( $invoice, 'InvoiceDate')),
 92                           po             => text( first_child( $invoice, 'OrderIdentifier')),
 93                           amount_no_tax  => text( first_child( $invoice, 'InvoiceTotalVatExcludedAmount')),
 94                           tax            => text( first_child( $invoice, 'InvoiceTotalVatAmount')),
 95                           amount         => text( first_child( $invoice, 'InvoiceTotalVatIncludedAmount')),
 96                           payment_status => text( first_child( first_child( $root, 'PaymentStatusDetails'), 'PaymentStatusCode')),
 97                         };
 98  
 99      my $seller = first_child( $root, 'SellerPartyDetails');
100      $data->{seller}  = { identifier      => text( first_child( $seller, 'SellerPartyIdentifier')),
101                           name            => text( first_child( $seller, 'SellerOrganisationName')),
102                           tax_code        => text( first_child( $seller, 'SellerOrganisationTaxCode')),
103                         };
104  
105      my $SellerPartyDetails = first_child( $root, 'SellerPartyDetails');
106      my $address            = first_child( $SellerPartyDetails, 'SellerPostalAddressDetails');
107      $data->{address} = { street         => text( first_child( $address, 'SellerStreetName')),
108                           town           => text( first_child( $address, 'SellerTownName')),
109                           zip            => text( first_child( $address, 'SellerPostCodeIdentifier')),
110                           country_code   => text( first_child( $address, 'CountryCode')),
111                           po_box         => text( first_child( $address, 'SellerPostOfficeBoxIdentifier')),
112                         };
113  
114      my $contact = first_child( $root, 'SellerCommunicationDetails');
115      $data->{contact} = { name           => text( first_child( $root, 'SellerContactPersonName')),
116                           phone          => text( first_child( $contact, 'SellerPhoneNumberIdentifier')),
117                           email          => text( first_child( $contact, 'SellerEmailaddressIdentifier')),
118                         };
119  
120      $data->{invoicerow} ||= [];
121      reset_default_row_id();
122  
123      foreach my $invoicerow ( children($root, 'InvoiceRow'))
124        { # need to check that the DeliveredQuantity element is present before getting its attribute
125          my $DeliveredQuantity= first_child( $invoicerow, 'DeliveredQuantity');
126          my $qty_unit= $DeliveredQuantity ? $DeliveredQuantity->getAttribute( 'QuantityUnitCode') : '';
127          
128          push @{$data->{invoicerow}},
129            { row_id        => text( first_child( $invoicerow, 'RowIdentifier')) || default_row_id(),
130              sku           => text( first_child( $invoicerow, 'ArticleIdentifier')),
131              name          => text( first_child( $invoicerow, 'ArticleName')),
132              qty           => text( $DeliveredQuantity),
133              qty_unit      => $qty_unit,
134              unit_price    => text( first_child( $invoicerow, 'UnitPriceAmount')),
135              amount_no_tax => text( first_child( $invoicerow, 'RowVatExcludedAmount')),
136              tax           => text( first_child( $invoicerow, 'RowVatAmount')),
137              amount        => text( first_child( $invoicerow, 'RowAmount')),
138            }
139        }
140  
141  
142      store_all( $data);
143    }
144  
145  sub add_errors
146    { my( $doc, $error_messages)= @_;
147      my $root= $doc->getDocumentElement;
148      my $errors= $doc->createElement( 'errors');
149      $root->insertBefore ( $errors, $root->getFirstChild);
150      foreach my $message (@$error_messages)
151        { my $error= $doc->createElement( 'error');
152          $errors->appendChild( $error); 
153          # those 2 lines could be replaced by $error->addText( $message)
154          # which is not in the DOM spec
155          my $text= $doc->createTextNode( $message); 
156          $error->appendChild( $text);
157        }
158      return $doc;
159    }
160  
161  sub output_doc_to_check
162    { my( $file, $doc)= @_;
163      open( FILE, ">$file") or die "cannot create file to check $file: $!";
164      print FILE $doc->toString;
165      close FILE;
166    }
167  
168  # this is very important,this ensures that we get the proper child
169  # never use XML::DOM's getFirstChild method directly!
170  sub first_child
171    { my( $node, $tag)= @_;
172      my $child= $node->getFirstChild or return undef;
173      while( $child && ($child->getNodeName ne $tag) )
174        { $child= $child->getNextSibling; }
175      return $child;
176    }
177  
178  sub children
179    { my( $node, $tag)= @_;
180      my @children;
181      my $child= first_child( $node, $tag) or return undef;
182      push @children, $child;
183      while( $child= $child->getNextSibling)
184        { push @children, $child if( $child->getNodeName eq $tag); }
185      return @children;
186    }
187    
188  # node must include only text (and comments)
189  sub text
190    { my( $node)= @_;
191      unless( $node) { return undef; }
192      my $text='';
193      foreach my $child ($node->getChildNodes)
194        { if( $child->getNodeName eq '#text')
195            { $text.= $child->getData; }
196        }
197      return $text;
198    }

XML::EasyOBJ

This code uses XML::EasyOBJ to process the invoices

XML::EasyOBJ is built on top of XML::DOM, and mostly allows for easy navigation by letting you write $invoice= $doc->InvoiceDetails->InvoiceNumber to get the invoice number element, and $invoice->getString to get its value.

The fact that tag names (from the document) are used as method names in the code feels a little weird, and it can make name collision possible (XML::EasyOBJ lets you rename methods though, so you can deal with this problem). It makes it really easy to work with these invoices though, the code was easy to write and

The bad news here is that XML::EasyOBJ mostly helps you accessing the data. If you want to create new elements you have to use the DOM methods.

Overall XML::EasyOBJ feels like a very convenient layer on top of the DOM, which lets you easily navigate and access the data in the DOM (making easy things easy), and lets you use native DOM methods for advanced (making harder things possible).

The complete example is in wtr2_easyobj

  1  #!/usr/bin/perl -w
  2  use strict;
  3  
  4  use XML::EasyOBJ;
  5  
  6  use FindBin qw($Bin);
  7  use lib $Bin;
  8  use wtr2_base;
  9  
 10  init_db();
 11  
 12  my $DEBUG=0;
 13  
 14  my $CAN_OUTPUT= 0;
 15  
 16  my @files= @ARGV || (<$dir{invoices}/*.xml>);
 17  
 18  foreach my $file (@files)
 19    { my $doc =  XML::EasyOBJ->new( $file);
 20      
 21      my $errors= check_invoice( $doc);
 22  
 23      if( !@$errors)
 24        { store_invoice( $doc); }
 25      else
 26        { print "ERROR in $file\n  ", join( "\n  ", @$errors), "\n";
 27          if( $CAN_OUTPUT) 
 28            { my $rejected_file= rejected( $file);
 29              print "adding errors in $rejected_file\n" if( $DEBUG);
 30          my $dom= $doc->getDomObj->getOwnerDocument; # we need to use the DOM for this
 31              add_errors( $dom, $errors);
 32              output_doc_to_check(  $rejected_file, $dom);
 33            }
 34        };
 35    }
 36  
 37  exit;
 38  
 39  sub check_invoice
 40    { my( $doc)= @_;
 41      my $errors=[];  # array ref, holds the error messages
 42  
 43      my $BuyerPartyDetails     = $doc->BuyerPartyDetails;
 44      my $BuyerPartyIdentifier  = $BuyerPartyDetails->BuyerPartyIdentifier;
 45      my $BuyerOrganisationName = $BuyerPartyDetails->BuyerOrganisationName;
 46      check_buyer(  $BuyerPartyIdentifier->getString,
 47                    $BuyerOrganisationName->getString,
 48                    $errors
 49                 );
 50      my $InvoiceDetails  = $doc->InvoiceDetails;
 51      my $OrderIdentifier = $InvoiceDetails->OrderIdentifier;
 52      check_po( $OrderIdentifier->getString, $errors);
 53  
 54      my @rows= $doc->InvoiceRow;
 55  
 56      reset_default_row_id();
 57      
 58      foreach my $row ( @rows)
 59        { # this does not cope well with broken row numbers
 60  
 61          my $row_id= $row->RowIdentifier->getString || default_row_id();
 62          
 63          print "checking row $row_id\n" if $DEBUG;
 64  
 65          my $DeliveredQuantity= $row->DeliveredQuantity;
 66          my $OrderedQuantity  = $row->OrderedQuantity;
 67  
 68      my( $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit)= ('','','',''); 
 69      if( $DeliveredQuantity)
 70            { $delivered_qty  = $DeliveredQuantity->getString;
 71              $delivered_unit = $DeliveredQuantity->getAttr( 'QuantityUnitCode');
 72        }
 73      if( $OrderedQuantity)
 74            { $ordered_qty    = $OrderedQuantity->getString;
 75              $ordered_unit   = $OrderedQuantity->getAttr( 'QuantityUnitCode');
 76            }
 77  
 78          check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors);
 79        }
 80      
 81      return $errors;
 82    }
 83  
 84  sub store_invoice
 85    { my( $doc)= @_; 
 86      
 87      print "storing invoice ", $doc->InvoiceDetails->InvoiceNumber->getString, "\n";
 88  
 89      # build the various data structures
 90      my $data;
 91  
 92      my $invoice = $doc->InvoiceDetails;
 93      $data->{invoice} = { number         => $invoice->InvoiceNumber->getString,
 94                           date           => $invoice->InvoiceDate->getString,
 95                           po             => $invoice->OrderIdentifier->getString,
 96                           amount_no_tax  => $invoice->InvoiceTotalVatExcludedAmount->getString,
 97                           tax            => $invoice->InvoiceTotalVatAmount->getString,
 98                           amount         => $invoice->InvoiceTotalVatIncludedAmount->getString,
 99                           payment_status => $doc->PaymentStatusDetails->PaymentStatusCode->getString,
100                         };
101  
102      my $seller = $doc->SellerPartyDetails;
103      $data->{seller}  = { identifier      => $seller->SellerPartyIdentifier->getString,
104                           name            => $seller->SellerOrganisationName->getString,
105                           tax_code        => $seller->SellerOrganisationTaxCode->getString,
106                         };
107  
108      my $address            = $doc->SellerPartyDetails->SellerPostalAddressDetails;
109      $data->{address} = { street         => $address->SellerStreetName->getString,
110                           town           => $address->SellerTownName->getString,
111                           zip            => $address->SellerPostCodeIdentifier->getString,
112                           country_code   => $address->CountryCode->getString,
113                           po_box         => $address->SellerPostOfficeBoxIdentifier->getString,
114                         };
115  
116      my $contact = $doc->SellerCommunicationDetails;
117      $data->{contact} = { name           => $doc->SellerContactPersonName->getString,
118                           phone          => $contact->SellerPhoneNumberIdentifier->getString,
119                           email          => $contact->SellerEmailaddressIdentifier->getString,
120                         };
121  
122      $data->{invoicerow} ||= [];
123      reset_default_row_id();
124  
125      foreach my $invoicerow ( $doc->InvoiceRow)
126        { # need to check that the DeliveredQuantity element is present before getting its attribute
127          my $DeliveredQuantity= $invoicerow->DeliveredQuantity;
128      my( $qty, $qty_unit)= ('','');
129      if( $DeliveredQuantity)
130            { $qty      = $DeliveredQuantity->getString;
131          $qty_unit = $DeliveredQuantity->getAttr( 'QuantityUnitCode');
132            }
133          
134          push @{$data->{invoicerow}},
135            { row_id        => $invoicerow->RowIdentifier->getString || default_row_id(),
136              sku           => $invoicerow->ArticleIdentifier->getString,
137              name          => $invoicerow->ArticleName->getString,
138              qty           => $qty,
139              qty_unit      => $qty_unit,
140              unit_price    => $invoicerow->UnitPriceAmount->getString,
141              amount_no_tax => $invoicerow->RowVatExcludedAmount->getString,
142              tax           => $invoicerow->RowVatAmount->getString,
143              amount        => $invoicerow->RowAmount->getString,
144            }
145        }
146  
147  
148      store_all( $data);
149    }
150  
151  sub add_errors
152    { my( $doc, $error_messages)= @_;
153      my $root= $doc->getDocumentElement;
154      my $errors= $doc->createElement( 'errors');
155      $root->insertBefore ( $errors, $root->getFirstChild);
156      foreach my $message (@$error_messages)
157        { my $error= $doc->createElement( 'error');
158          $errors->appendChild( $error); 
159          # those 2 lines could be replaced by $error->addText( $message)
160          # which is not in the DOM spec
161          my $text= $doc->createTextNode( $message); 
162          $error->appendChild( $text);
163        }
164      return $doc;
165    }
166  
167  sub output_doc_to_check
168    { my( $file, $doc)= @_;
169      open( FILE, ">$file") or die "cannot create file to check $file: $!";
170      print FILE $doc->toString;
171      close FILE;
172    }

XML::LibXML

An XML::LibXML version based on the XML::DOM code

This code uses XML::LibXML to process the invoices

It is directly derived from the XML::DOM code in wtr2_dom. It only differs where XML::DOM and XML::LibXML DOM methods or constants have different names:

  XML::DOM        XML::LibXML
  getNodeName     nodeName       method
  #text           text           constant returned by nodeName/getNodeName

This was very easy to write, but does not take advantage of XML::LibXML's best feature: its support for XPath. See wtr2_libxml for a slightly different version, made a lot safer by using XPath queries instead of navigation methods (like getFirstChild) to access the data.

This exemple is here just to show how easy it is to port code from XML::DOM (which IMHO should be deprecated) to XML::LibXML, and then take advantage of XML::LibXML more powerful features.

The complete example is in wtr2_libxml_dom

Here are the parts that differ from wtr2_dom:

...
 19    { my $parser = new XML::LibXML;
 20      my $doc = $parser->parse_file ( $file);
...
173      while( $child && ($child->nodeName ne $tag) )
...
184        { push @children, $child if( $child->nodeName eq $tag); }
...
194        { if( $child->nodeName eq 'text')
...

A "native" XML::LibXML version

This code uses XML::LibXML to process the invoices. It uses a lot of XML::LibXML specific methods instead of using the DOM. This makes for nicer and safer code. XML::LibXML is a Perl wrapper on top of the libxml2 library, an XML/XPath/DOM/RelaxNG/... library written by Daniel Veillard for the Gnome project.

Instead of using navigation methods (getFirstChild) it relies mostly on findnodes and findvalue, which use XPath to select nodes to access.

The error message is built as text and then parsed using parse_xml_chunk. While this might not be the best method I found it really nice to use.

Overall the code was quite easy to write, all the extra goodies provided by XML::LibXML compared to the DOM are really useful and make it much easier to write compact and safe code.

My main gripe with XML::LibXML is that it is often unstable, as the Perl module tries to keep up with the development of the librarie. The problem is that libxml2 is so widely used that I find that it is often upgraded by unrelated software, which can then cause trouble to code that uses XML::LibXML.

The complete example is in wtr2_libxml

  1  #!/usr/bin/perl -w
  2  use strict;
  3  
  4  use XML::LibXML;
  5  
  6  use FindBin qw($Bin);
  7  use lib $Bin;
  8  use wtr2_base;
  9  
 10  init_db();
 11  
 12  my $DEBUG=0;
 13  
 14  my $CAN_OUTPUT= 1;
 15  
 16  my @files= @ARGV || (<$dir{invoices}/*.xml>);
 17  
 18  foreach my $file (@files)
 19    { my $parser = new XML::LibXML;
 20      my $doc = $parser->parse_file ( $file);
 21      
 22      my $errors= check_invoice( $doc);
 23  
 24      if( !@$errors)
 25        { store_invoice( $doc); }
 26      else
 27        { print "ERROR in $file\n  ", join( "\n  ", @$errors), "\n";
 28          if( $CAN_OUTPUT) 
 29            { my $rejected_file= rejected( $file);
 30              print "adding errors in $rejected_file\n" if( $DEBUG);
 31              add_errors( $parser, $doc, $errors);         # the parser is passed so it can be re-used 
 32              output_doc_to_check(  $rejected_file, $doc);
 33            }
 34        };
 35    }
 36  
 37  exit;
 38  
 39  sub check_invoice
 40    { my( $doc)= @_;
 41      my $errors=[];  # array ref, holds the error messages
 42  
 43      check_buyer(  $doc->findvalue( '/Finvoice/BuyerPartyDetails/BuyerPartyIdentifier'),
 44                    $doc->findvalue( '/Finvoice/BuyerPartyDetails/BuyerOrganisationName'),
 45                    $errors
 46                 );
 47      check_po( $doc->findvalue( '/Finvoice/InvoiceDetails/OrderIdentifier'), $errors);
 48  
 49      my @rows= $doc->findnodes( '/Finvoice/InvoiceRow');
 50  
 51      reset_default_row_id();
 52      
 53      foreach my $row ( @rows)
 54        { # this does not cope well with broken row numbers
 55  
 56          my $row_id= $row->findvalue( 'RowIdentifier') || default_row_id();
 57          
 58          print "checking row $row_id\n" if $DEBUG;
 59  
 60      my( $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit)= ( '','','','');
 61      
 62      if( my $DeliveredQuantity= $row->findnodes( 'DeliveredQuantity')->[0])
 63            { $delivered_qty  = $DeliveredQuantity->textContent;
 64              $delivered_unit = $DeliveredQuantity->getAttribute( 'QuantityUnitCode');
 65        }
 66      if( my $OrderedQuantity= $row->findnodes( 'OrderedQuantity')->[0])
 67            { $ordered_qty    = $OrderedQuantity->textContent;
 68              $ordered_unit   = $OrderedQuantity->getAttribute( 'QuantityUnitCode');
 69        }
 70  
 71          check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors);
 72        }
 73      
 74      return $errors;
 75    }
 76  
 77  sub store_invoice
 78    { my( $doc)= @_; 
 79     
 80      my $invoice_number= $doc->findvalue( '/Finvoice/InvoiceDetails/InvoiceNumber');
 81      print "storing invoice $invoice_number\n";
 82  
 83      # build the various data structures
 84      my $data;
 85  
 86      my $invoice = $doc->findnodes( '/Finvoice/InvoiceDetails')->[0];
 87      $data->{invoice} = { number         => $invoice->findvalue( 'InvoiceNumber'),
 88                           date           => $invoice->findvalue( 'InvoiceDate'),
 89                           po             => $invoice->findvalue( 'OrderIdentifier'),
 90                           amount_no_tax  => $invoice->findvalue( 'InvoiceTotalVatExcludedAmount'),
 91                           tax            => $invoice->findvalue( 'InvoiceTotalVatAmount'),
 92                           amount         => $invoice->findvalue( 'InvoiceTotalVatIncludedAmount'),
 93                           payment_status => $doc->findvalue( '/Finvoice/PaymentStatusDetails/PaymentStatusCode'),
 94                         };
 95  
 96      my $seller = $doc->findnodes( '/Finvoice/SellerPartyDetails')->[0];
 97      $data->{seller}  = { identifier      => $seller->findvalue( 'SellerPartyIdentifier'),
 98                           name            => $seller->findvalue( 'SellerOrganisationName'),
 99                           tax_code        => $seller->findvalue( 'SellerOrganisationTaxCode'),
100                         };
101  
102      my $address            = $doc->findnodes( '/Finvoice/SellerPartyDetails/SellerPostalAddressDetails')->[0];
103      $data->{address} = { street         => $address->findvalue( 'SellerStreetName'),
104                           town           => $address->findvalue( 'SellerTownName'),
105                           zip            => $address->findvalue( 'SellerPostCodeIdentifier'),
106                           country_code   => $address->findvalue( 'CountryCode'),
107                           po_box         => $address->findvalue( 'SellerPostOfficeBoxIdentifier'),
108                         };
109  
110      my $contact = $doc->findnodes( '/Finvoice/SellerCommunicationDetails')->[0];
111      $data->{contact} = { name           => $doc->findvalue( '/Finvoice/SellerContactPersonName'),
112                           phone          => $contact->findvalue( 'SellerPhoneNumberIdentifier'),
113                           email          => $contact->findvalue( 'SellerEmailaddressIdentifier'),
114                         };
115  
116      $data->{invoicerow} ||= [];
117      reset_default_row_id();
118  
119      foreach my $invoicerow ( $doc->findnodes( '/Finvoice/InvoiceRow'))
120        { # need to check that the DeliveredQuantity element is present before getting its attribute
121          my $DeliveredQuantity= $invoicerow->getChildrenByTagName( 'DeliveredQuantity')->[0];
122          my $qty     = $DeliveredQuantity ? $DeliveredQuantity->textContent : '';
123          my $qty_unit= $DeliveredQuantity ? $DeliveredQuantity->getAttribute( 'QuantityUnitCode') : '';
124          
125          push @{$data->{invoicerow}},
126            { row_id        => $invoicerow->findvalue( 'RowIdentifier') || default_row_id(),
127              sku           => $invoicerow->findvalue( 'ArticleIdentifier'),
128              name          => $invoicerow->findvalue( 'ArticleName'),
129              qty           => $qty,
130              qty_unit      => $qty_unit,
131              unit_price    => $invoicerow->findvalue( 'UnitPriceAmount'),
132              amount_no_tax => $invoicerow->findvalue( 'RowVatExcludedAmount'),
133              tax           => $invoicerow->findvalue( 'RowVatAmount'),
134              amount        => $invoicerow->findvalue( 'RowAmount'),
135            }
136        }
137  
138      store_all( $data);
139    }
140  
141  sub add_errors
142    { my( $parser, $doc, $error_messages)= @_;
143      my $root= $doc->documentElement();
144      # here I chose to build the error messages as text and then parse them
145      my $chunk= "\n  <errors>\n    " . join( "\n    ", map { "<error>$_</error>" } @$error_messages) . "\n  </errors>";
146      my $errors= $parser->parse_xml_chunk( $chunk );
147      $root->insertBefore ( $errors, $root->getFirstChild);
148      return $doc;
149    }
150  
151  sub output_doc_to_check
152    { my( $file, $doc)= @_;
153      open( FILE, ">$file") or die "cannot create file to check $file: $!";
154      print FILE $doc->toString;
155      close FILE;
156    }

XML::XPath version

This code uses XML::XPath. Accessing the data, the check_invoice and store_invoice functions, is very similar to the code using XML::LibXML (not surprisingly as the 2 modules were both written by Matt Sergeant ;--) The only difference are the way to create the document object and the name of the method used to get the text of an element (string_value instead of textContent).

Creating the errors element proved a little more challenging.

The complete example is in wtr2_xpath

  1  #!/usr/bin/perl -w
  2  use strict;
  3  
  4  use XML::XPath;
  5  use XML::XPath::XMLParser;
  6  
  7  
  8  use FindBin qw($Bin);
  9  use lib $Bin;
 10  use wtr2_base;
 11  
 12  init_db();
 13  
 14  my $DEBUG=0;
 15  
 16  my $CAN_OUTPUT= 1;
 17  
 18  my @files= @ARGV || (<$dir{invoices}/*.xml>);
 19  
 20  my $doc; # note that putting the my _in_ the loop causes an Out of Memory
 21           # error after just a few documents
 22  
 23  foreach my $file (@files)
 24    { $doc= XML::XPath->new( filename => $file); 
 25  
 26      my $errors= check_invoice( $doc);
 27  
 28      if( !@$errors)
 29        { store_invoice( $doc); }
 30      else
 31        { print "ERROR in $file\n  ", join( "\n  ", @$errors), "\n";
 32          if( $CAN_OUTPUT) 
 33            { my $rejected_file= rejected( $file);
 34              print "adding errors in $rejected_file\n" if( $DEBUG);
 35              add_errors( $doc, $errors);
 36              output_doc_to_check(  $rejected_file, $doc);
 37            }
 38        };
 39    }
 40  
 41  exit;
 42  
 43  sub check_invoice
 44    { my( $doc)= @_;
 45      my $errors=[];  # array ref, holds the error messages
 46  
 47      check_buyer(  $doc->findvalue( '/Finvoice/BuyerPartyDetails/BuyerPartyIdentifier'),
 48                    $doc->findvalue( '/Finvoice/BuyerPartyDetails/BuyerOrganisationName'),
 49                    $errors
 50                 );
 51      check_po( $doc->findvalue( '/Finvoice/InvoiceDetails/OrderIdentifier'), $errors);
 52  
 53      my @rows= $doc->findnodes( '/Finvoice/InvoiceRow');
 54  
 55      reset_default_row_id();
 56      
 57      foreach my $row ( @rows)
 58        { # this does not cope well with broken row numbers
 59  
 60          my $row_id= $row->findvalue( 'RowIdentifier') || default_row_id();
 61          
 62          print "checking row $row_id\n" if $DEBUG;
 63  
 64      my( $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit)= ( '','','','');
 65      
 66      if( my $DeliveredQuantity= $row->findnodes( 'DeliveredQuantity')->[0])
 67            { $delivered_qty  = $DeliveredQuantity->string_value;
 68              $delivered_unit = $DeliveredQuantity->getAttribute( 'QuantityUnitCode');
 69        }
 70      if( my $OrderedQuantity= $row->findnodes( 'OrderedQuantity')->[0])
 71            { $ordered_qty    = $OrderedQuantity->string_value;
 72              $ordered_unit   = $OrderedQuantity->getAttribute( 'QuantityUnitCode');
 73        }
 74  
 75          check_qtty( $row_id, $delivered_qty, $delivered_unit, $ordered_qty, $ordered_unit, $errors);
 76        }
 77      
 78      return $errors;
 79    }
 80  
 81  sub store_invoice
 82    { my( $doc)= @_; 
 83     
 84      my $invoice_number= $doc->findvalue( '/Finvoice/InvoiceDetails/InvoiceNumber');
 85      print "storing invoice $invoice_number\n";
 86  
 87      # build the various data structures
 88      my $data;
 89  
 90      my $invoice = $doc->findnodes( '/Finvoice/InvoiceDetails')->[0];
 91      $data->{invoice} = { number         => $invoice->findvalue( 'InvoiceNumber'),
 92                           date           => $invoice->findvalue( 'InvoiceDate'),
 93                           po             => $invoice->findvalue( 'OrderIdentifier'),
 94                           amount_no_tax  => $invoice->findvalue( 'InvoiceTotalVatExcludedAmount'),
 95                           tax            => $invoice->findvalue( 'InvoiceTotalVatAmount'),
 96                           amount         => $invoice->findvalue( 'InvoiceTotalVatIncludedAmount'),
 97                           payment_status => $doc->findvalue( '/Finvoice/PaymentStatusDetails/PaymentStatusCode'),
 98                         };
 99  
100      my $seller = $doc->findnodes( '/Finvoice/SellerPartyDetails')->[0];
101      $data->{seller}  = { identifier      => $seller->findvalue( 'SellerPartyIdentifier'),
102                           name            => $seller->findvalue( 'SellerOrganisationName'),
103                           tax_code        => $seller->findvalue( 'SellerOrganisationTaxCode'),
104                         };
105  
106      my $address            = $doc->findnodes( '/Finvoice/SellerPartyDetails/SellerPostalAddressDetails')->[0];
107      $data->{address} = { street         => $address->findvalue( 'SellerStreetName'),
108                           town           => $address->findvalue( 'SellerTownName'),
109                           zip            => $address->findvalue( 'SellerPostCodeIdentifier'),
110                           country_code   => $address->findvalue( 'CountryCode'),
111                           po_box         => $address->findvalue( 'SellerPostOfficeBoxIdentifier'),
112                         };
113  
114      my $contact = $doc->findnodes( '/Finvoice/SellerCommunicationDetails')->[0];
115      $data->{contact} = { name           => $doc->findvalue( '/Finvoice/SellerContactPersonName'),
116                           phone          => $contact->findvalue( 'SellerPhoneNumberIdentifier'),
117                           email          => $contact->findvalue( 'SellerEmailaddressIdentifier'),
118                         };
119  
120      $data->{invoicerow} ||= [];
121      reset_default_row_id();
122  
123      foreach my $invoicerow ( $doc->findnodes( '/Finvoice/InvoiceRow'))
124        { # need to check that the DeliveredQuantity element is present before getting its attribute
125          my $DeliveredQuantity= $invoicerow->findnodes( 'DeliveredQuantity')->[0];
126          my $qty     = $DeliveredQuantity ? $DeliveredQuantity->string_value : '';
127          my $qty_unit= $DeliveredQuantity ? $DeliveredQuantity->getAttribute( 'QuantityUnitCode') : '';
128          
129          push @{$data->{invoicerow}},
130            { row_id        => $invoicerow->findvalue( 'RowIdentifier') || default_row_id(),
131              sku           => $invoicerow->findvalue( 'ArticleIdentifier'),
132              name          => $invoicerow->findvalue( 'ArticleName'),
133              qty           => $qty,
134              qty_unit      => $qty_unit,
135              unit_price    => $invoicerow->findvalue( 'UnitPriceAmount'),
136              amount_no_tax => $invoicerow->findvalue( 'RowVatExcludedAmount'),
137              tax           => $invoicerow->findvalue( 'RowVatAmount'),
138              amount        => $invoicerow->findvalue( 'RowAmount'),
139            }
140        }
141  
142      store_all( $data);
143    }
144  
145  sub add_errors
146    { my( $doc, $error_messages)= @_;
147      my $errors= $doc->createNode( '/Finvoice/errors');
148      foreach my $message (@$error_messages)
149        { my $error= XML::XPath::Node::Element->new( 'error');
150      $errors->appendChild( $error);
151          $doc->setNodeText( '/Finvoice/errors/error[last()]' => $message, );
152        }
153      return $doc;
154    }
155  
156  sub output_doc_to_check
157    { my( $file, $doc)= @_;
158      open( FILE, ">$file") or die "cannot create file to check $file: $!";
159      print FILE $doc->findnodes( '/')->[0]->toString;
160      close FILE;
161    }

XML::PYX

An XML::PYX version based on the XML::Simple code

This example uses XML::PYX in a kinda devious way: it reimplements a simplified version of XML::Simple's XMLin that will work for the class of documents we are processing. The code is then exactly the same as the one for wtr2_simple.

Note that this version of XMLin does not offer any option besides a limited version of forcearray. It will not for exemple use keyattr.

I guess this was pretty easy to write as it worked the first time I run it, much to my surprise I should say ;--)

Generally It illustrates the fact that it often pays to write a layer on top of generic modules to adapt them to your specific needs (please, just don't release it on CPAN! Chances are that it is really specific to your problem or to the way you like to code, and it will only add to the already considerable confusion in the XML namespace).

The complete example is in wtr2_pyx_simple

Here are the call to XMLin and the XMLin sub

...
 17    { # XMLin just re-implements a simplified version of XML::Simple's XMLin
 18      # that will work for this specific class of document 
 19      my $xml= XMLin( $file, forcearray => [ qw(InvoiceRow)]); 
...
126  sub XMLin
127    { my( $file, %options)=@_;
128  
129      # get the 'forcearray' elements in a convenient hash
130      my %forcearray= map { $_ => 1} @{$options{forcearray}};
131      
132      my @current; # the stack of hashes or array in the data structure
133                   # $current[0] is the overall structure (the root of the document),
134                   # $current[1] is the first_level child currently open
135                   # ...
136                   # $current[-1] is the current element
137      open( IN, "pyx $file | ") or die "cannot open pyx $file: $!";
138      while( <IN>)
139        { if( m{^\((.*)$})    # open element
140            { my $tag= $1;
141              my $current={}; # will be filled later;
142              if( @current) 
143                { # normal (non-root) element
144                  if( $forcearray{$1}) { $current[-1]->{$1} ||= []; push @{$current[-1]->{$1}}, $current; }
145                  else                 { $current[-1]->{$1} = $current;                                   }
146                }
147              push @current, $current;
148            }
149          elsif( m{^\)(.*)$}) # close element (except when only the root is left)
150            { pop @current unless( @current == 1); }
151          elsif( m{^-\\n$})   # empty line, skip
152            { }
153          elsif( m{^-(.*)$})  # content, assign to the current element content
154            { $current[-1]->{content}= $1; }
155          elsif( m{^A(\w*) (.*)$})  # attribute, assign to a field in the current element
156            { $current[-1]->{$1}= $2; }
157        }
158  
159      # note that in case there is an error during the parsing it will show when
160      # closing the file, so you need to check the result of close
161      close IN or die "error processing pyx $file: $!"; 
162  
163      return $current[0];
164    }
...

A more advanced XML::PYX version, still based on the XML::Simple code

The previous code might seem naive (and useless), but it can easily be improved and thus get extra credits:

This example uses XML::PYX and the XMLin layer on top of it, but stores the initial PYX flow so it can later update it with the error messages.

The code is very similar to wtr2_pyx_simple, but the PYX flow if first stored in $pyx, before being passed to XMLin. If errors are found, $pyx is used to add the errors (add_errors looks for the end of the root (Finvoice) start tag, builds the PYX flow for the errors element and insert it after the root. The pyx flow is then output using pyxw which writes it back as XML.

Creating the PYX for the error element is not really elegant, but it is not that much of a pain either. A simple layer a-la-XML::Writer could make it even easier, but wasn't deemed necessary here.

The complete example is in wtr2_pyx_simple_plus

Here are the interesing bits:

...
 19      my $pyx= pyx_in( $file);
 20      my $xml= XMLin( $pyx, forcearray => [ qw(InvoiceRow)]); 
...
 30              add_errors( $pyx, $errors);
 31              output_doc_to_check(  $rejected_file, $pyx);
...
127  sub add_errors
128    { my( $pyx, $error_messages)= @_;
129      # first get the root element
130      my $root_index=0;
131      while($_=$pyx->[$root_index])
132        { if( m{^\(Finvoice$}) 
133            { # found the root element
134              # now skip attributes
135          while( $pyx->[$root_index+1]=~ m{^A}) { $root_index++; }
136          last;
137        }
138      $root_index++;
139        }
140  
141      my $lf= "-\\n\n"; # a line feed in pyx
142      # build the error messages
143      my $messages=   $lf
144                    . "-  \n(errors\n"
145                    . join( "", map { "$lf-    \n(error\n-$_\n)error\n" } @$error_messages)
146            . $lf
147            . "-  \n)errors\n";
148  
149      # now insert the messages
150      splice( @$pyx, $root_index+1, 0, $messages);
151              
152      return $pyx;
153    }
154  
155  sub output_doc_to_check
156    { my( $file, $pyx)= @_;
157      open( FILE, "| pyxw > $file") or die "cannot create file to check $file: $!";
158      print FILE @$pyx;
159      close FILE;
160    }
161  
162  
163  sub pyx_in
164    { my( $file)= @_;
165      open( IN, "pyx $file | ") or die "cannot open pyx $file: $!";
166      my @pyx= <IN>;
167      # note that in case there is an error during the parsing it will show when
168      # closing the file, so you need to check the result of close
169      close IN or die "error processing pyx $file: $!"; 
170      return \@pyx;
171    }
172      
...

XML::SAX::Base

This code uses SAX to extract the data from the invoices. It parses the invoice and extract the relevant data into a Perl data structure that is then used to check the invoice and update the data base.

The first problem to solve when using SAX is that the content of elements can be broken in different calls to the characters handler. So I needed to buffer the content. Luckily enough, Robin Berjon's XML::Filter::BufferText does just that!

So I used a SAX machine (using SAX::Machines) to pipe the 2 handlers, first XML::Filter::BufferText, then my own handler: wtr2_handler. Note that SAX::Machines takes care of

wtr2_handler extracts all the information needed to check the invoice, then store it in the data base. The resulting data (returned by the end_document handler) is then used by check_invoice and store_invoice.

As this is something that is likely to be quite common and as there are few SAX modules that do this, I decided to go generic: I created a small language to describe how to extract the data and store it in my custom data structure.

The idea is to give an element name (no namespaces are used in this DTD, so there is no need to get fancy) and associate an action to it. Actions can be associated with the start of an element or with its content. At the start of an element it is possible to store attributes or to create new sub-records, for repeatable data in the document, such as InvoiceRow

The content of an element can be stored, either as top-level data, for non-repeatable data, or in a sub-record, for repeatable data.

The easiest way I found to parse these actions was to use Getopt::Long Overall this is slightly overkill for this problem, but could be re-used in other cases, so I thought it would be worth it to show it here.

In order to know in which element the parser is from the characters handler I used a stack of element names: the start_element handler pushes the current element name on the stack and the end_element handler pops it. This is the only way to get access to the parent name, needed for the --parent option.

Overall the code was quite a pain to write, especially as the default parser, XML::LibXML::SAX::Parser had a problem during my tests, as once again I had upgraded libxml2 but not the Perl module. The hardest part was designing a way to express what I wanted to extract from the XML document and how to store it, without resorting with one of those long lists of ifs that I find make code such a pain to maintain.

The complete example is in wtr2_sax_base

  1  #!/usr/bin/perl -w
  2  use strict;
  3  
  4  #use diagnostics;
  5  
  6  use XML::Filter::BufferText;     # to buffer all character events
  7  use XML::SAX::Machines qw(:all); # to pipe the 2 SAX handlers
  8  
  9  use FindBin qw($Bin);
 10  use lib $Bin;
 11  use wtr2_base;
 12  
 13  my $DEBUG=0;
 14  
 15  init_db();
 16  
 17  my $CAN_OUTPUT= 0;
 18  
 19  my @files= @ARGV || (<$dir{invoices}/*.xml>);
 20  
 21  foreach my $file (@files)
 22    { my $handler= wtr2_handler->new();
 23  
 24      my $pipeline = Pipeline( XML::Filter::BufferText->new(),
 25                               $handler,
 26                             );
 27      my $data = $pipeline->parse_uri ( $file);
 28      
 29      my $errors= check_invoice( $data);
 30  
 31      if( !@$errors)
 32        { store_invoice( $data); }
 33      else
 34        { print "ERROR in $file\n  ", join( "\n  ", @$errors), "\n";
 35          if( $CAN_OUTPUT) 
 36            { my $rejected_file= rejected( $file);
 37              print "adding errors in $rejected_file\n" if( $DEBUG);
 38              add_errors( $pipeline, $data, $errors);         # the parser is passed so it can be re-used 
 39              output_doc_to_check(  $rejected_file, $data);
 40            }
 41        };
 42    }
 43  
 44  sub check_invoice
 45    { my( $data)= @_;
 46      my $errors=[];  # array ref, holds the error messages
 47  
 48      check_buyer(  $data->{BuyerPartyIdentifier},
 49                    $data->{BuyerOrganisationName},
 50                    $errors
 51                 );
 52      check_po(     $data->{OrderIdentifier}, $errors);
 53  
 54      my @rows= @{$data->{InvoiceRow}};
 55  
 56      reset_default_row_id();
 57      
 58      foreach my $row( @rows)
 59        { # this does not cope well with broken row numbers
 60  
 61          my $row_id= $row->{RowIdentifier} || default_row_id();
 62          
 63          print "checking row $row_id\n" if $DEBUG;
 64  
 65          check_qtty( $row_id,
 66                      $row->{DeliveredQuantity}, 
 67                      $row->{DeliveredQuantityUC}, 
 68                      $row->{OrderedQuantity}, 
 69                      $row->{OrderedQuantityUC},
 70                      $errors
 71                    );
 72        }
 73      
 74      return $errors;
 75    }
 76  
 77  
 78  sub store_invoice
 79    { my( $xml_data)= @_; 
 80      print "storing invoice $xml_data->{InvoiceNumber}\n";
 81  
 82      # build the various data structures
 83      my $data;
 84  
 85      $data->{invoice} = { number         => $xml_data->{InvoiceNumber},
 86                           date           => $xml_data->{InvoiceDate},
 87                           po             => $xml_data->{OrderIdentifier},
 88                           amount_no_tax  => $xml_data->{InvoiceTotalVatExcludedAmount},
 89                           tax            => $xml_data->{InvoiceTotalVatAmount},
 90                           amount         => $xml_data->{InvoiceTotalVatIncludedAmount},
 91                           payment_status => $xml_data->{PaymentStatusCode},
 92                         };
 93  
 94      $data->{seller}  = { identifier      => $xml_data->{SellerPartyIdentifier},
 95                           name            => $xml_data->{SellerOrganisationName},
 96                           tax_code        => $xml_data->{SellerOrganisationTaxCode},
 97                         };
 98  
 99      $data->{address} = { street         => $xml_data->{SellerStreetName},
100                           town           => $xml_data->{SellerTownName},
101                           zip            => $xml_data->{SellerPostCodeIdentifier},
102                           country_code   => $xml_data->{CountryCode},
103                           po_box         => $xml_data->{SellerPostOfficeBoxIdentifier},
104                         };
105  
106      $data->{contact} = { name           => $xml_data->{SellerContactPersonName},
107                           phone          => $xml_data->{SellerPhoneNumberIdentifier},
108                           email          => $xml_data->{SellerEmailaddressIdentifier},
109                         };
110  
111      $data->{invoicerow} ||= [];
112      reset_default_row_id();
113  
114      foreach my $invoicerow (@{$xml_data->{InvoiceRow}})
115        { push @{$data->{invoicerow}},
116            { row_id        => $invoicerow->{RowIdentifier} || default_row_id(),
117              sku           => $invoicerow->{ArticleIdentifier},
118              name          => $invoicerow->{ArticleName},
119              qty           => $invoicerow->{DeliveredQuantity},
120              qty_unit      => $invoicerow->{DeliveredQuantityUC},
121              unit_price    => $invoicerow->{UnitPriceAmount},
122              amount_no_tax => $invoicerow->{RowVatExcludedAmount},
123              tax           => $invoicerow->{RowVatAmount},
124              amount        => $invoicerow->{RowAmount},
125            }
126        }
127  
128      store_all( $data);
129    }
130  
131  
132  
133  package wtr2_handler;
134  
135  use base qw(XML::SAX::Base);
136  
137  use Getopt::Long;                # to process actions associated with SAX events
138  
139  # all those could be stored as part of the parser object
140  # but IMHO this implies potential name collision
141  
142  my( $content, $start);
143  
144  BEGIN {
145  
146  # declare which element content and attributes we want to store
147  # note that this works fine because elements are not re-used, the element
148  # name always gives enough information to figure out what to do with it
149  
150  # note that the little languages used in the action part depends on the fact
151  # that - is not a valid characters at the start of an XML identifier
152  
153  # actions for start_element handler
154  # 
155  # actions format is:
156  #   actions  : action (';' action)* # only one used here
157  #   action   : command options+
158  #   command : ('--create' | '--store_att' <attribute name>)
159  #   options  : option+
160  #   option   : --in <sub_record>    # sub_record must be have been defined
161  #              --as <field_name>    # name of the field in the hash 
162  #                                   # (defaults to the attribute name)
163  #              --parent <name>      # trigger only when in proper parent
164  #              --create only supports the -as option (not used)
165  
166  $start =  { # elements that create a new sub_record for repeated content
167              InvoiceRow => '--create',
168              # attributes to be stored
169              # format is #store_att
170              DeliveredQuantity => '--store_att QuantityUnitCode --in InvoiceRow --as DeliveredQuantityUC',
171              OrderedQuantity   => '--store_att QuantityUnitCode --in InvoiceRow --as OrderedQuantityUC',
172            };
173            
174                        
175  # actions for characters handler
176  # actions format is: 
177  #   actions  : action (';' action)* # only one used here
178  #   action   : command options+
179  #   command : ('--store')
180  #   options  : option+
181  #   option   : --in <sub_record>    # sub_record must be have been defined
182  #              --as <field_name>    # not used here
183  #              --parent <name>      # trigger only when in proper parent
184  
185  $content= { # content that needs to be stored once per invoice
186              BuyerPartyIdentifier          => '--store',
187              BuyerOrganisationName         => '--store',
188              InvoiceNumber                 => '--store',
189              InvoiceDate                   => '--store',
190              OrderIdentifier               => '--store',
191              InvoiceTotalVatExcludedAmount => '--store',
192              InvoiceTotalVatAmount         => '--store',
193              InvoiceTotalVatIncludedAmount => '--store',
194              PaymentStatusCode             => '--store',
195              SellerPartyIdentifier         => '--store',
196              SellerOrganisationName        => '--store',
197              SellerOrganisationTaxCode     => '--store',
198              SellerStreetName              => '--store',
199              SellerTownName                => '--store',
200              SellerPostCodeIdentifier      => '--store',
201              CountryCode                   => '--store --parent SellerPostalAddressDetails',
202              SellerPostOfficeBoxIdentifier => '--store',
203              SellerContactPersonName       => '--store',
204              SellerPhoneNumberIdentifier   => '--store',
205              SellerEmailaddressIdentifier  => '--store',
206              # repeated content
207              RowIdentifier                 => '--store --in InvoiceRow',
208              ArticleIdentifier             => '--store --in InvoiceRow',
209              ArticleName                   => '--store --in InvoiceRow',
210              DeliveredQuantity             => '--store --in InvoiceRow',
211              OrderedQuantity               => '--store --in InvoiceRow',
212              UnitPriceAmount               => '--store --in InvoiceRow',
213              RowVatExcludedAmount          => '--store --in InvoiceRow',
214              RowVatAmount                  => '--store --in InvoiceRow',
215              RowAmount                     => '--store --in InvoiceRow', 
216            };
217  
218  } 
219  
220  # very complex new! it's a hash so we can add the state data needed
221  sub new
222    { my $class = shift;
223      my $self= bless {}, $class;
224    }
225  
226  # reset data for each invoice
227  sub start_document 
228    { $_[0]->{data}    = {}; # stored data
229      $_[0]->{context} = []; # element name stack
230    }
231  
232  # 
233  sub start_element
234    { my( $p, $elt)= @_;
235            
236      my $name= $elt->{Name};
237      
238      # store the context
239      push @{$p->{context}}, $name;
240      
241      # process actions for the relevant elements
242      if( my $actions= $start->{$name})
243        { my @actions= split /;/, $actions; # actions are ; separated (not used here)
244          foreach my $action (@actions)
245            { # use Getopt::Long to parse the action
246              local @ARGV= split /\s+/, $action;
247              my %options;
248              GetOptions( \%options, "create", "store_att=s", "in=s", "as=s", "parent=s");
249  
250              if( $options{parent} and ($p->{context}->[-2] ne $options{parent})) { next; }
251  
252              # process each type of action
253              if( $options{create})
254                { # create a new sub_record
255                  if( $options{store_att}) { die "can't use --store_att and --create in $name => '$action'\n"; }
256                  if( $options{in})        { die "can't use --in with --create in $name => '$action'\n"; }
257                  $options{as} ||= $name; # the sub_record name defaults to the element name
258                  $p->{data}->{$name} ||= [];
259                  push @{$p->{data}->{$name}}, {};
260                }
261              elsif( my $att= $options{store_att}) 
262                { # store an attribute
263                  my $att_clarkian= "{}".$att;  # attributes are indexed using the clarkian notation 
264                  my $value= $elt->{Attributes}->{$att_clarkian}->{Value}; 
265                  store( $p->{data}, $options{in}, $options{as} || $name, $value); 
266                }
267              else
268                { die "no valid start action found in $name => '$action'\n"; }
269            }
270        }
271    }
272                
273  
274  sub characters
275    { my( $p, $characters)= @_;
276      my $name= $p->{context}->[-1];
277      if( my $actions= $content->{$name})
278        { my @actions= split /;/, $actions;
279          foreach my $action (@actions)
280            { local @ARGV= split /\s+/, $action;
281              my %options;
282              GetOptions( \%options, "store", "in=s", "as=s", "parent=s");
283              if( $options{parent} and ($p->{context}->[-2] ne $options{parent})) { next; }
284              if( $options{store})
285                { store( $p->{data}, $options{in}, $options{as} || $name, $characters->{Data}); }
286              else
287                { die "no valid contentt action found in $name => '$action'\n"; }
288            }
289        }
290    }
291  
292  sub end_element { pop @{$_[0]->{context}}; }
293  
294  sub end_document { return $_[0]->{data}; }
295  
296  sub store
297    { my( $data, $in, $as, $value)= @_;
298      if( my $sub_record= $in)
299        { # create it in a sub_record
300          $data->{$sub_record}->[-1]->{$as}= $value; 
301        }
302      else
303        { # create at top-level
304          $data->{$as}= $value;
305        }
306    }
307  
308  1;

Conclusions

Overall the code was quite easy to write with most modules.

I was surprised by how much common code I could re-use from one example to the other, due to modules using standard API (DOM/SAX) and to a lot of modules being written by the same author (as with XML::XPath / XML::LibXML) (not to mention the author of the article blatantly cheating and stealing method names, if not code, from other modules to add them to XML::Twig ;--).

Modules that simply slurp the XML into a Perl data structure were the easiest to use, sometimes at the cost of making it harder to output a modified document. XML::Smart looked good here, but it is a very recent module and will need to be tested some more before I can recommend it. Tree-based modules were a little harder to use (why can't we call methods on undef! that would save us quite a few test), but allowed for more control over modified XML output. Streaming is not really adapted to this problem, and requires a the bigger layer on top of the basic modules, so I would not really advise using it here.

Benchmark

All modules handled the load pretty well, except for XML::XPath which gives seemingly random Out of memory! errors.

Tested on 206 invoices

ModuleVersionTiming (benchmark)Factor
XML::Simple2.087.43100
XML::Smart1.3.19.71130
XML::Twig3.108.74117
XML::Twig (alt)10.96147
XML::DOM1.428.97120
XML::EasyOBJ1.128.35112
XML::LibXML1.553.7250
XML::LibXML (alt)4.5360
XML::XPath1.1324.86334
XML::PYX0.0721.41288
XML::PYX (alt)22.53303
XML::SAX::Base1.0411.66156

benchmark environment: perl 5.008 on linux - Thu Aug 21 17:11:39 2003

Running the examples

Requirements

You will need to install the following modules: DBI, DBD::SQLite, Memoize, plus of course the various XML modules required by each example.

Files:

A tar file with everything you need

Finvoice files

See Finvoice: Technical files to download the complete package.

The data base definition and creation file

A common package

The various examples

Templates for writing examples