Extracting one node with LibXML - perl

This may be very novice of me, but I am a novice at Perl LibXML (and XPath for that matter). I have this XML doc:
<Tims
xsi:schemaLocation="http://my.location.com/namespace http://my.location.com/xsd/Tims.xsd"
xmlns="http://my.location.com/namespace"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<Error>Too many entities for operation. Acceptable limit is 5,000 and 8,609 were passed in.</Error>
<Timestamp>2012-07-27T12:06:24-04:00</Timestamp>
<ExecutionTime>41.718</ExecutionTime>
</Tims>
All I want to do is get the value of <Error>. That's all. I've tried plenty of approaches, most recently this one. I've read the docs through and through. This is what I currently have in my code:
#!/usr/bin/perl -w
my $xmlString = <<XML;
<?xml version="1.0" encoding="ISO-8859-1"?>
<Tims
xsi:schemaLocation="http://my.location.com/namespace http://my.location.com/xsd/Tims.xsd"
xmlns="http://my.location.com/namespace"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<Error>Too many entities for operation. Acceptable limit is 5,000 and 8,609 were passed in.</Error>
<Timestamp>2012-07-27T12:06:24-04:00</Timestamp>
<ExecutionTime>41.718</ExecutionTime>
</Tims>
XML
use XML::LibXML;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($xmlString);
my $root = $doc->documentElement();
my $xpc = XML::LibXML::XPathContext->new($root);
$xpc->registerNs("x", "http://my.location.com/namespace");
foreach my $node ($xpc->findnodes('x:Tims/x:Error')) {
print $node->toString();
}
Any advice, links, anything is appreciated. Thanks.

Just add a / at the beginning of the XPath (i.e. into findnodes).

Your code isn't working because you use the document element <Tims> as the context node when you create the XPath context $xpc. The <Error> element is an immediate child of this, so all you need to write is
$xpc->findnodes('x:Error')
or an alternative is to use an absolute XPath which specifies the path from the document root
$xpc->findnodes('/x:Tims/x:Error')
That way it doesn't matter what the context node of $xpc is.
But the proper way is to forget about fetching the element node altogether and use the document root as the context node. You can also use findvalue instead of findnodes to get the text of the error message without the enclosing tags:
my $parser = XML::LibXML->new;
my $doc = $parser->parse_string($xmlString);
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('x', 'http://my.location.com/namespace');
my $error= $xpc->findvalue('x:Tims/x:Error');
print $error, "\n";
output
Too many entities for operation. Acceptable limit is 5,000 and 8,609 were passed in.

Related

XML::Twig parsing same name tag in same path

I am trying to help out a client who was unhappy with an EMR (Electronic Medical Records) system and wanted to switch but the company said they couldn't extract patient demographic data from the database (we asked if they can get us name, address, dob in a csv file of some sort, very basic stuff) - yet they claim they couldn't do that. (crazy considering they are using a sql database).
Anyway - the way they handed over the patients were in xml files and there are about 40'000+ of them. But they contain a lot more than the demographics.
After doing some research and having done extensive Perl programming 15 years ago (I admit it got rusty over the years) - I thought this should be a good task to get done in Perl - and I came across the XML::Twig module which seems to be able to do the trick.
Unfortunately the xml code that is of interest looks like this:
<==snip==>
<patient extension="Patient ID Number"> // <-- (Patient ID is a 5-digit number)
<name>
<family>Patient Family name</family>
<given>Patient First/Given name</given>
<given>Patient Middle Initial</given>
</name>
<birthTime value="YEARMMDD"/>
more fields for address etc. follow in the xml file.
<==snip==>
Here is what I coded:
my $twig=XML::Twig->new( twig_handlers => {
'patient/name/family' => \&get_family_name,
'patient/name/given' => \&get_given_name
});
$twig->parsefile('test.xml');
my @fields;
sub get_family_name {my($twig,$data)=@_;$fields[0]=$data->text;$twig->purge;}
sub get_given_name {my($twig,$data)=@_;$fields[1]=$data->text;$twig->purge;}
I have no problems reading out all the information that has unique tags (family, city, zip code, etc.) but XML::Twig only returns the middle initial for the "given" tag.
How can I address the first occurrence of "given" and assign it to $fields[1] and the second occurrence of "given" to $fields[2] for instance - or chuck the middle initial.
Also how do I extract the "Patient ID" or the "birthTime" value with XML::Twig - I couldn't find a reference to that.
I tried using $data->findvalue('birthTime') but that came back empty.
I looked at: Perl, XML::Twig, how to reading field with the same tag which was very helpful but since the duplicate tags are in the same path it is different and I can't seem to find an answer. Does XML::Twig only return the last value found when finding a match while parsing a file? Is there a way to extract all occurrences of a value?
Thank you for your help in advance!
It is very easy to assume from the documentation that you're supposed to use callbacks for everything. But it's just as valid to parse the whole document and interrogate it in its entirety, especially if the data size is small.
It's unclear from your question whether each patient has a separate XML file to themselves, and you don't show what encloses the patient elements, but I suggest that you use a compromise approach and write a handler for just the patient elements which extracts all of the information required.
I've chosen to build a hash of information %patient out of each patient element and push it onto an array @patients that contains all the data in the file. If you have only one patient per file then this will need to be changed.
I've resolved the problem with the name/given elements by fetching all of them and joining them into a single string with intervening spaces. I hope that's suitable.
This is completely untested as I have only a tablet to hand at present, so beware. It does stand a chance of compiling, but I would be surprised if it has no bugs.
use strict;
use warnings 'all';
use XML::Twig;
my @patients;
my $twig = XML::Twig->new(
twig_handlers => { patient => \&get_patient }
);
$twig->parsefile('test.xml');
sub get_patient {
my ($twig, $pat) = @_;
my %patient;
# The patient ID is stored in the "extension" attribute
$patient{id} = $pat->att('extension');
my $name = $pat->first_child('name');
$patient{family} = $name->first_child_trimmed_text('family');
# Join every <given> element (first name, middle initial) with spaces
$patient{given} = join ' ', $name->children_trimmed_text('given');
$patient{dob} = $pat->first_child('birthTime')->att('value');
push @patients, \%patient;
}

Why is this xmlns attribute messing up my xpath query?

I'm parsing a simple jhove output using LibXML. However, I don't get the values I expect. Here's the code:
use feature "say";
use XML::LibXML;
my $PRSR = XML::LibXML->new();
my $xs=<DATA>;
say $xs;
my $t1 = $PRSR->load_xml(string => $xs);
say "1:" . $t1->findvalue('//date');
$xs=<DATA>;
say $xs;
$t1 = $PRSR->load_xml(string => $xs);
say "2:" . $t1->findvalue('//date');
__DATA__
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://hul.harvard.edu/ois/xml/ns/jhove" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove http://hul.harvard.edu/ois/xml/xsd/jhove/1.3/jhove.xsd" name="Jhove" release="1.0 (beta 3)" date="2005-02-04"><date>2006-10-06T09:11:34+02:00</date></jhove>
<jhove><date>2006-10-06T09:11:34+02:00</date></jhove>
As you can see, the line "1:" is returning an empty string, while "2:" is returning the expected date. What is in the jhove-root-element that keeps the xpath query from working properly? I even tried in XML-Spy and there it works, even with the full header.
Edit: When I remove the xmlns-attribute from the root element, the xpath query works. But how is that possible?
The XML::LibXML::Node documentation specifically mentions this issue and how to deal with it...
NOTE ON NAMESPACES AND XPATH:
A common mistake about XPath is to assume that node tests consisting of an element name with no prefix match elements in the default namespace. This assumption is wrong - by XPath specification, such node tests can only match elements that are in no (i.e. null) namespace.
So, for example, one cannot match the root element of an XHTML document with $node->find('/html') since '/html' would only match if the root element <html> had no namespace, but all XHTML elements belong to the namespace http://www.w3.org/1999/xhtml. (Note that xmlns="..." namespace declarations can also be specified in a DTD, which makes the situation even worse, since the XML document looks as if there was no default namespace).
There are several possible ways to deal with namespaces in XPath:
The recommended way is to use the XML::LibXML::XPathContext module to define an explicit context for XPath evaluation, in which a document independent prefix-to-namespace mapping can be defined. For example:
my $xpc = XML::LibXML::XPathContext->new;
$xpc->registerNs('x', 'http://www.w3.org/1999/xhtml');
$xpc->find('/x:html',$node);
Another possibility is to use prefixes declared in the queried document (if known). If the document declares a prefix for the namespace in question (and the context node is in the scope of the declaration), XML::LibXML allows you to use the prefix in the XPath expression, e.g.:
$node->find('/x:html');
I found another solution. Simply using this
say "1:" . $t1->findvalue('//*[local-name()="date"]');
will also find the value and save the hassle of declaring namespaces in an XPathContext. But apart from that, tobyink's answer is the correct one.
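For the jhove document in the question, the XPathContext approach from the quoted documentation might look like the sketch below. The prefix j is an arbitrary choice made here for illustration; only the namespace URI has to match the xmlns value in the document. The XML is shortened to the parts that matter.

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened version of the namespaced jhove document from the question
my $xml = '<jhove xmlns="http://hul.harvard.edu/ois/xml/ns/jhove" date="2005-02-04">'
        . '<date>2006-10-06T09:11:34+02:00</date></jhove>';
my $doc = XML::LibXML->load_xml(string => $xml);

# Bind a prefix of our own choosing ("j" is an assumption) to the default namespace URI
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('j', 'http://hul.harvard.edu/ois/xml/ns/jhove');

# An unprefixed name only matches elements in no namespace, so this is empty
my $unprefixed = $doc->findvalue('//date');
# The prefixed name matches the element in the jhove namespace
my $prefixed = $xpc->findvalue('//j:date');

print "unprefixed: '$unprefixed'\n";
print "prefixed:   '$prefixed'\n";
```

The first query should come back empty and the second should return the date, which is the same behaviour as the "1:" line in the question.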

Perl LibXML: isSameNode method

I'm trying to compare two nodes using 'isSameNode'. However, one of the nodes is created via 'parse_balanced_chunk'. The http://www.w3.org/TR/DOM-Level-3-Core/core.html#Node3-isSameNode document says "This method provides a way to determine whether two Node references returned by the implementation reference the same object."
So I'm wondering whether it is not working as I would expect because they are indeed coming from two different sources (one from the parsed txt file, the other from parse_balanced_chunk)?
Here is the 'test_in.xml' & code:
<?xml version="1.0"?>
<TT>
<A>ZAB</A>
<B>ZBW</B>
<C>
<D>
<E>ZSE</E>
<F>ZLC</F>
</D>
</C>
<C>
<K>
<H>dog</H>
<E>one123</E>
<M>bird</M>
</K>
</C>
</TT>
use warnings;
use strict;
use XML::LibXML;
my $parser = XML::LibXML->new({keep_blanks=>(0)});
my $dom = $parser->load_xml(location => 'test_in.xml') or die;
# list assignment: ($na1) takes the first node returned by findnodes
my ($na1) = $dom->findnodes('//D');
my ($na2) = $dom->findnodes('//D');
my $X1 = $na1->isSameNode($na2); ##MATCHES
my ($frg) = $parser->parse_balanced_chunk ("<D><E>ZSE</E><F>ZLC</F></D>");
my $X2 = $na1->isSameNode($frg); ##WHY NO MATCH?
my ($na3) = $frg->findnodes('//D');
my $X3 = $na1->isSameNode($na3); ##WHY NO MATCH?
print "SAME?: $X1\n";
print "SAME?: $X2\n";
print "SAME?: $X3\n";
And the output:
SAME?: 1
SAME?: 0
SAME?: 0
So the first 'isSameNode' test obviously MATCHES (same exact findnodes & xpath expression).
But neither the 2nd nor the 3rd 'isSameNode' test works using the node from 'parse_balanced_chunk'. Is it something simple I'm overlooking with the syntax, or is it just that I can't compare two nodes this way? If not, what is the method for comparing two nodes? What I'm ultimately trying to determine is whether a block of xml code (i.e. from a previous parse_balanced_chunk) already exists in the xml file.
Because they're not the same node. Like the name says, it checks if two nodes are the same node. You seem to think it checks if two nodes are equivalent, but that's not what it does.
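If the goal is equivalence rather than identity, one possible approach (not something the answer above proposes) is to compare canonical serializations of the two subtrees with toStringC14N. The sketch below assumes the candidate block is parsed as its own small document rather than with parse_balanced_chunk, so that each node being serialized belongs to a document, and it uses the no_blanks parser option so indentation does not affect the comparison. The inline XML is a cut-down stand-in for test_in.xml.

```perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down stand-in for test_in.xml
my $doc = XML::LibXML->load_xml(
    string    => '<TT><C><D><E>ZSE</E><F>ZLC</F></D></C><C><K><E>one123</E></K></C></TT>',
    no_blanks => 1,
);

# Parse the block we are looking for as a separate document
my $candidate = XML::LibXML->load_xml(
    string    => '<D><E>ZSE</E><F>ZLC</F></D>',
    no_blanks => 1,
)->documentElement;

# Two subtrees are treated as equivalent when their canonical XML is identical
my $wanted  = $candidate->toStringC14N;
my @matches = grep { $_->toStringC14N eq $wanted } $doc->findnodes('//D');

print "equivalent <D> nodes found: ", scalar(@matches), "\n";
```

Canonical XML normalises things like attribute order, so it is a stricter-than-string but still textual notion of equality; whether that is the right definition of "already exists" depends on the data.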

How to convert HTML file into a hash in Perl?

Is there any simple way to convert an HTML file into a Perl hash? For example, a working Perl module or something?
I searched on cpan.org but didn't find anything that can do what I want. I want to do something like this:
use Example::Module;
my $hashref = Example::Module->new('/path/to/mydoc.html');
After this I want to refer to second div element something like this:
my $second_div = $hashref->{'body'}->{'div'}[1];
# or like this:
my $second_div = $hashref->{'body'}->{'div'}->findByClass('.myclassname');
# or like this:
my $second_div = $hashref->{'body'}->{'div'}->findById('#myid');
Is there any working solution for this?
HTML::TreeBuilder::XPath gives you a lot more power than a simple hash would.
From the synopsis:
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file( "mypage.html");
my $nb=$tree->findvalue('/html/body//p[@class="section_title"]/span[@class="nb"]');
my $id=$tree->findvalue('/html/body//p[@class="section_title"]/@id');
my $p= $tree->findnodes('//p[@id="toto"]')->[0];
my $link_texts= $p->findvalue( './a'); # the texts of all a elements in $p
$tree->delete; # to avoid memory leaks, if you parse many HTML documents
More on XPath.
Mojo::DOM (docs found here) builds a simple DOM, that can be accessed in a CSS-selector style:
# Find
say $dom->at('#b')->text;
say $dom->find('p')->pluck('text');
say $dom->find('[id]')->pluck(attr => 'id');
In case you're using XHTML you could also use XML::Simple, which produces a data structure similar to the one you describe.

Why doesn't Perl's XML::LibXML module (specifically XPathContext) evaluate positions?

I have an XML representation of a document that has the form:
<response>
<paragraph>
<sentence id="1">Hey</sentence>
<sentence id="2">Hello</sentence>
</paragraph>
</response>
I'm trying to use XML::LibXML to parse a document and get the position of the sentences.
my $root_node = XML::LibXML->load_xml( ... )->documentElement;
foreach my $sentence_node ( $root_node->findnodes('//sentence')->get_nodelist ){
print $sentence_node->find( 'position()' );
}
The error I get is "XPath error : Invalid context position error". I've read up on the docs and found this interesting tidbit
evaluating XPath function position() in the initial context raises an XPath error
My problem is that I have no idea what to do with this information. What is the 'initial context'? How do I make the engine automatically track the context position?
Re: @Dan
Appreciate the answer. I tried your example and it worked. In my code, I was assuming context to be the node represented by my perl variable. So, $sentence->find( 'position()' ) I wanted to be './position()'. Despite seeing a working example, I still can't do
foreach my $sentence ...
my $id = $sentence->getAttribute('id');
print $root_node->findvalue( '//sentence[@id=' . "$id]/position()");
I can, however, do
$root_node->findvalue( '//sentence[@id=' . "$id]/text()");
Can position() only be used to limit a query like you have?
position() does work in LibXML. For example see
my $root_node = $doc->documentElement;
foreach my $sentence_node ( $root_node->findnodes('//sentence[position()=2]')->get_nodelist ){
print $sentence_node->textContent;
}
This will print Hello with your sample data.
But the way you're using it here, there's no context. For each sentence_node, you want its position relative to what?
If you're looking for specific nodes by position, use a selector like I have above, that's simplest.
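If what is needed is each sentence's position among its siblings, one way to get it without position() is to count the preceding sibling sentence elements from each node. This is a sketch of that idea rather than part of the answer above, using the sample document from the question.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(string => <<'XML');
<response>
  <paragraph>
    <sentence id="1">Hey</sentence>
    <sentence id="2">Hello</sentence>
  </paragraph>
</response>
XML

my @positions;
for my $sentence ($doc->findnodes('//sentence')) {
    # The node itself is the context here, so count the siblings before it
    my $pos = $sentence->findvalue('count(preceding-sibling::sentence) + 1');
    push @positions, $pos;
    printf "%s is sentence %d\n", $sentence->textContent, $pos;
}
```

This gives the position within the enclosing paragraph. For a position across the whole document, a plain loop counter over the findnodes result would do the same job.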