Perl LibXML: isSameNode method - perl

I'm trying to compare two nodes using 'isSameNode'. However, one of the nodes is created via 'parse_balanced_chunk'. The http://www.w3.org/TR/DOM-Level-3-Core/core.html#Node3-isSameNode document says "This method provides a way to determine whether two Node references returned by the implementation reference the same object."
So I'm wondering: is it not working as I expect because the nodes come from two different sources (one from the parsed file, the other from parse_balanced_chunk)?
Here is the 'test_in.xml' & code:
<?xml version="1.0"?>
<TT>
<A>ZAB</A>
<B>ZBW</B>
<C>
<D>
<E>ZSE</E>
<F>ZLC</F>
</D>
</C>
<C>
<K>
<H>dog</H>
<E>one123</E>
<M>bird</M>
</K>
</C>
</TT>
use warnings;
use strict;
use XML::LibXML;
my $parser = XML::LibXML->new({keep_blanks=>(0)});
my $dom = $parser->load_xml(location => 'test_in.xml') or die;
# list assignment: ($na1) receives the first node returned by findnodes
my ($na1) = $dom->findnodes('//D');
my ($na2) = $dom->findnodes('//D');
my $X1 = $na1->isSameNode($na2); ##MATCHES
my ($frg) = $parser->parse_balanced_chunk ("<D><E>ZSE</E><F>ZLC</F></D>");
my $X2 = $na1->isSameNode($frg); ##WHY NO MATCH?
my ($na3) = $frg->findnodes('//D');
my $X3 = $na1->isSameNode($na3); ##WHY NO MATCH?
print "SAME?: $X1\n";
print "SAME?: $X2\n";
print "SAME?: $X3\n";
And the output:
SAME?: 1
SAME?: 0
SAME?: 0
So the first 'isSameNode' test obviously MATCHES (same exact findnodes call and XPath expression).
But neither the 2nd nor the 3rd 'isSameNode' test matches when using the node from 'parse_balanced_chunk'. Is it something simple I'm overlooking in the syntax, or is it just that I can't compare two nodes this way? If so, what is the method for comparing two nodes? What I'm ultimately trying to determine is whether a block of XML (i.e. from a previous parse_balanced_chunk) already exists in the XML file.

Because they're not the same node. Like the name says, it checks if two nodes are the same node. You seem to think it checks if two nodes are equivalent, but that's not what it does.
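To illustrate the difference between identity and equivalence, here is a minimal sketch that compares the serialized form of the two nodes instead of calling isSameNode. It is only one possible approach, not a method the answer above proposes: the variable names and the compact inline document are my own assumptions, and a plain string comparison is sensitive to whitespace, attribute order and namespace declarations, so it only suits simple fragments like this one.

```perl
use strict;
use warnings;
use XML::LibXML;

# Compact stand-in for test_in.xml (assumption: no whitespace between tags)
my $doc = XML::LibXML->load_xml(
    string => '<TT><A>ZAB</A><C><D><E>ZSE</E><F>ZLC</F></D></C></TT>',
);
my ($in_doc) = $doc->findnodes('//D');

# parse_balanced_chunk returns a document fragment; its first child is <D>
my $frag       = XML::LibXML->new->parse_balanced_chunk('<D><E>ZSE</E><F>ZLC</F></D>');
my $from_chunk = $frag->firstChild;

# Identity: are both references the very same node object?
my $same_node = $in_doc->isSameNode($from_chunk) ? 1 : 0;

# Equivalence (rough): do both nodes serialize to the same string?
my $equivalent = $in_doc->toString eq $from_chunk->toString ? 1 : 0;

print "same node:  $same_node\n";
print "equivalent: $equivalent\n";
```

To check whether the block already exists anywhere in the file, the same comparison could be run against every node returned by findnodes('//D').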

Related

XML::Twig parsing same name tag in same path

I am trying to help out a client who was unhappy with an EMR (Electronic Medical Records) system and wanted to switch but the company said they couldn't extract patient demographic data from the database (we asked if they can get us name, address, dob in a csv file of some sort, very basic stuff) - yet they claim they couldn't do that. (crazy considering they are using a sql database).
Anyway - the way they handed over the patients were in xml files and there are about 40'000+ of them. But they contain a lot more than the demographics.
After doing some research and having done extensive Perl programming 15 years ago (I admit it got rusty over the years) - I thought this should be a good task to get done in Perl - and I came across the XML::Twig module which seems to be able to do the trick.
Unfortunately the xml code that is of interest looks like this:
<==snip==>
<patient extension="Patient ID Number"> // <--Patient ID is 5 digit number)
<name>
<family>Patient Family name</family>
<given>Patient First/Given name</given>
<given>Patient Middle Initial</given>
</name>
<birthTime value="YEARMMDD"/>
more fields for address etc. follow in the xml file.
<==snip==>
Here is what I coded:
my $twig=XML::Twig->new( twig_handlers => {
'patient/name/family' => \&get_family_name,
'patient/name/given' => \&get_given_name
});
$twig->parsefile('test.xml');
my @fields;
sub get_family_name { my ($twig, $data) = @_; $fields[0] = $data->text; $twig->purge; }
sub get_given_name { my ($twig, $data) = @_; $fields[1] = $data->text; $twig->purge; }
I have no problems reading out all the information that has unique tags (family, city, zip code, etc.) but XML::Twig only returns the middle initial for the given tag.
How can I address the first occurrence of "given" and assign it to $fields[1] and the second occurrence of "given" to $fields[2] for instance - or chuck the middle initial.
Also how do I extract the "Patient ID" or the "birthTime" value with XML::Twig - I couldn't find a reference to that.
I tried using $data->findvalue('birthTime') but that came back empty.
I looked at: Perl, XML::Twig, how to reading field with the same tag which was very helpful but since the duplicate tags are in the same path it is different and I can't seem to find an answer. Does XML::Twig only return the last value found when finding a match while parsing a file? Is there a way to extract all occurrences of a value?
Thank you for your help in advance!
It is very easy to assume from the documentation that you're supposed to use callbacks for everything. But it's just as valid to parse the whole document and interrogate it in its entirety, especially if the data size is small
It's unclear from your question whether each patient has a separate XML file to themselves, and you don't show what encloses the patient elements, but I suggest that you use a compromise approach and write a handler for just the patient elements which extracts all of the information required
I've chosen to build a hash of information %patient out of each patient element and push it onto an array @patients that contains all the data in the file. If you have only one patient per file then this will need to be changed
I've resolved the problem with the name/given elements by fetching all of them and joining them into a single string with intervening spaces. I hope that's suitable
This is completely untested as I have only a tablet to hand at present, so beware. It does stand a chance of compiling, but I would be surprised if it has no bugs
use strict;
use warnings 'all';
use XML::Twig;
my @patients;
my $twig = XML::Twig->new(
    twig_handlers => { patient => \&get_patient }
);
$twig->parsefile('test.xml');
# Called once per <patient> element; collects its fields into a hash
sub get_patient {
    my ($twig, $pat) = @_;
    my %patient;
    $patient{id} = $pat->att('extension');
    my $name = $pat->first_child('name');
    $patient{family} = $name->first_child_trimmed_text('family');
    # Join first name and middle initial into one string
    $patient{given} = join ' ', $name->children_trimmed_text('given');
    $patient{dob} = $pat->first_child('birthTime')->att('value');
    push @patients, \%patient;
}
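If the first name and the middle initial need to end up in separate fields rather than one joined string, a variation along these lines might work. It is a sketch under assumptions: the sample record, the enclosing records element and the field names first and middle are made up for illustration, and it parses an inline string rather than the real files.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample record shaped like the snippet in the question
my $xml = <<'XML';
<records>
  <patient extension="12345">
    <name>
      <family>Doe</family>
      <given>Jane</given>
      <given>Q</given>
    </name>
    <birthTime value="19700131"/>
  </patient>
</records>
XML

my @patients;
my $twig = XML::Twig->new(
    twig_handlers => {
        patient => sub {
            my ($t, $pat) = @_;
            my $name = $pat->first_child('name');
            # children('given') returns every <given> element in document order
            my @given = map { $_->trimmed_text } $name->children('given');
            push @patients, {
                id     => $pat->att('extension'),
                family => $name->first_child('family')->trimmed_text,
                first  => $given[0],
                middle => $given[1],
                dob    => $pat->first_child('birthTime')->att('value'),
            };
            $t->purge;    # release memory held by the handled element
        },
    },
);
$twig->parse($xml);

# One comma-separated line per patient (no CSV quoting, for brevity)
print join(',', map { $_ // '' } @{$_}{qw(id family first middle dob)}), "\n"
    for @patients;
```

Attributes such as extension and value are read with att, which is why findvalue('birthTime') came back empty: the element has no text content.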

Why is this xmlns attribute messing up my xpath query?

I'm parsing a simple jhove output using LibXML. However, I don't get the values I expect. Here's the code:
use feature "say";
use XML::LibXML;
my $PRSR = XML::LibXML->new();
my $xs=<DATA>;
say $xs;
my $t1 = $PRSR->load_xml(string => $xs);
say "1:" . $t1->findvalue('//date');
$xs=<DATA>;
say $xs;
$t1 = $PRSR->load_xml(string => $xs);
say "2:" . $t1->findvalue('//date');
__DATA__
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://hul.harvard.edu/ois/xml/ns/jhove" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove http://hul.harvard.edu/ois/xml/xsd/jhove/1.3/jhove.xsd" name="Jhove" release="1.0 (beta 3)" date="2005-02-04"><date>2006-10-06T09:11:34+02:00</date></jhove>
<jhove><date>2006-10-06T09:11:34+02:00</date></jhove>
As you can see, the line "1:" is returning an empty string, while "2:" is returning the expected date. What is in the jhove-root-element that keeps the xpath query from working properly? I even tried in XML-Spy and there it works, even with the full header.
Edit: When I remove the xmlns-attribute from the root element, the xpath query works. But how is that possible?
The XML::LibXML::Node documentation specifically mentions this issue and how to deal with it...
NOTE ON NAMESPACES AND XPATH:
A common mistake about XPath is to assume that node tests consisting of an element name with no prefix match elements in the default namespace. This assumption is wrong - by XPath specification, such node tests can only match elements that are in no (i.e. null) namespace.
So, for example, one cannot match the root element of an XHTML document with $node->find('/html') since '/html' would only match if the root element <html> had no namespace, but all XHTML elements belong to the namespace http://www.w3.org/1999/xhtml. (Note that xmlns="..." namespace declarations can also be specified in a DTD, which makes the situation even worse, since the XML document looks as if there was no default namespace).
There are several possible ways to deal with namespaces in XPath:
The recommended way is to use the XML::LibXML::XPathContext module to define an explicit context for XPath evaluation, in which a document independent prefix-to-namespace mapping can be defined. For example:
my $xpc = XML::LibXML::XPathContext->new;
$xpc->registerNs('x', 'http://www.w3.org/1999/xhtml');
$xpc->find('/x:html',$node);
Another possibility is to use prefixes declared in the queried document (if known). If the document declares a prefix for the namespace in question (and the context node is in the scope of the declaration), XML::LibXML allows you to use the prefix in the XPath expression, e.g.:
$node->find('/x:html');
I found another solution. Simply using this
say "1:" . $t1->findvalue('//*[local-name()="date"]');
will also find the value and saves the hassle of declaring namespaces in an XPathContext. But apart from that, tobyink's answer is the correct one.
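For comparison, the XPathContext approach from the quoted documentation could look roughly like this when applied to the jhove data. This is a sketch: the prefix j is an arbitrary choice of mine, and the sample element is shortened to the attributes that matter here.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Shortened jhove element that keeps the default namespace declaration
my $xml = '<jhove xmlns="http://hul.harvard.edu/ois/xml/ns/jhove" date="2005-02-04">'
        . '<date>2006-10-06T09:11:34+02:00</date></jhove>';
my $doc = XML::LibXML->load_xml(string => $xml);

# An unprefixed name only matches elements in no namespace, so this is empty
my $plain = $doc->findvalue('//date');

# Bind a prefix of our own choosing to the document's default namespace
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(j => 'http://hul.harvard.edu/ois/xml/ns/jhove');
my $date = $xpc->findvalue('//j:date');

print "plain: '$plain'\n";
print "with prefix: '$date'\n";
```

The prefix used in the query does not have to match anything in the document; only the namespace URI has to be identical.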

Extracting one node with LibXML

This may be very novice of me, but I am a novice at Perl LibXML (and XPath for that matter). I have this XML doc:
<Tims
xsi:schemaLocation="http://my.location.com/namespace http://my.location.com/xsd/Tims.xsd"
xmlns="http://my.location.com/namespace"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<Error>Too many entities for operation. Acceptable limit is 5,000 and 8,609 were passed in.</Error>
<Timestamp>2012-07-27T12:06:24-04:00</Timestamp>
<ExecutionTime>41.718</ExecutionTime>
</Tims>
All I want to do is get the value of <Error>. That's all. I've tried plenty of approaches, most recently this one. I've read the docs through and through. This is what I currently have in my code:
#!/usr/bin/perl -w
my $xmlString = <<XML;
<?xml version="1.0" encoding="ISO-8859-1"?>
<Tims
xsi:schemaLocation="http://my.location.com/namespace http://my.location.com/xsd/Tims.xsd"
xmlns="http://my.location.com/namespace"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<Error>Too many entities for operation. Acceptable limit is 5,000 and 8,609 were passed in.</Error>
<Timestamp>2012-07-27T12:06:24-04:00</Timestamp>
<ExecutionTime>41.718</ExecutionTime>
</Tims>
XML
use XML::LibXML;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($xmlString);
my $root = $doc->documentElement();
my $xpc = XML::LibXML::XPathContext->new($root);
$xpc->registerNs("x", "http://my.location.com/namespace");
foreach my $node ($xpc->findnodes('x:Tims/x:Error')) {
print $node->toString();
}
Any advice, links, anything is appreciated. Thanks.
Just add a / at the beginning of the XPath (i.e. into findnodes).
Your code isn't working because you use the document element <Tims> as the context node when you create the XPath context $xpc. The <Error> element is an immediate child of this, so all you need to write is
$xpc->findnodes('x:Error')
or an alternative is to use an absolute XPath which specifies the path from the document root
$xpc->findnodes('/x:Tims/x:Error')
That way it doesn't matter what the context node of $xpc is.
But the proper way is to forget about fetching the element node altogether and use the document root as the context node. You can also use findvalue instead of findnodes to get the text of the error message without the enclosing tags:
my $parser = XML::LibXML->new;
my $doc = $parser->parse_string($xmlString);
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('x', 'http://my.location.com/namespace');
my $error= $xpc->findvalue('x:Tims/x:Error');
print $error, "\n";
output
Too many entities for operation. Acceptable limit is 5,000 and 8,609 were passed in.

Using hash as a reference is deprecated

I searched SO before asking this question, I am completely new to this and have no idea how to handle these errors. By this I mean Perl language.
When I put this
%name->{@id[$#id]} = $temp;
I get the error Using a hash as a reference is deprecated
I tried
$name{@id[$#id]} = $temp
but couldn't get any results back.
Any suggestions?
The correct way to access an element of hash %name is $name{'key'}. The syntax %name->{'key'} was valid in Perl v5.6 but has since been deprecated.
Similarly, to access the last element of array @id you should write $id[$#id] or, more simply, $id[-1].
Your second variation should work fine, and your inability to retrieve the value has an unrelated reason.
Write
$name{$id[-1]} = 'test';
and
print $name{$id[-1]};
will display test correctly
%name->{...}
has always been buggy. It doesn't do what it should do. As such, it now warns when you try to use it. The proper way to index a hash is
$name{...}
as you already believe.
Now, you say
$name{@id[$#id]}
doesn't work, but if so, it's because of an error somewhere else in the code. That code most definitely works
>perl -wE"@id = qw( a b c ); %name = ( a=>3, b=>4, c=>5 ); say $name{@id[$#id]};"
Scalar value @id[$#id] better written as $id[$#id] at -e line 1.
5
As the warning says, though, the proper way to index an array isn't
@id[...]
It's actually
$id[...]
Finally, the easiest way to get the last element of an array is to use index -1. This means your code should be
$name{ $id[-1] }
The popular answer is to just not dereference, but that's not correct. In other words %$hash_ref->{$key} and %$hash_ref{$key} are not interchangeable. The former is required to access a hash reference nested as an element in another hash reference.
For many moons it has been commonplace to nest hash references. In fact there are several modules that parse data and store it in this kind of data structure. Instantly deprecating the behavior without module updates was not a good thing. At times my data is trapped in a nested hash and the only way to get it is to do something like:
$new_hash_ref = $target_hash_ref->{$key1};
$new_hash_ref2 = $target_hash_ref->{$key2};
$new_hash_ref3 = $target_hash_ref->{$key3};
because I can't
foreach my $i (keys(%$target_hash_ref)) {
    foreach (%$target_hash_ref->{$i}) {
#do stuff with $_
}
}
anymore.
Yes the above is a little strange, but creating new variables just to avoid accessing a data structure in a certain way is worse. Am I missing something?
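For what it's worth, a nested hash reference can be walked without temporary variables by dereferencing the inner element with the %{ ... } block form. The sketch below uses made-up sample data and my own variable names; it is not taken from the post above.

```perl
use strict;
use warnings;

# Hypothetical nested structure: a hash reference whose values are hash references
my $target_hash_ref = {
    fruit => { apple => 1, pear => 2 },
    veg   => { leek  => 3 },
};

my @seen;
for my $outer (sort keys %$target_hash_ref) {
    # %{ EXPR } dereferences whatever hash reference EXPR evaluates to
    for my $inner (sort keys %{ $target_hash_ref->{$outer} }) {
        push @seen, "$outer.$inner=$target_hash_ref->{$outer}{$inner}";
    }
}

print "$_\n" for @seen;
```

The arrow between adjacent subscripts is optional, so $target_hash_ref->{$outer}{$inner} and $target_hash_ref->{$outer}->{$inner} refer to the same element.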
If you want one item from an array or hash use $. For a list of items use @ and % respectively. Your use of @ as a reference returned a list instead of an item which perl may have interpreted as a hash.
This code demonstrates indexing a hash with the last element of an array.
#!/usr/bin/perl -w
my %these = ( 'first'=>101,
'second'=>102,
);
my @those = qw( first second );
print $these{$those[$#those]};
prints '102'

Why doesn't Perl's XML::LibXML module (specifically XPathContext) evaluate positions?

I have an XML representation of a document that has the form:
<response>
<paragraph>
<sentence id="1">Hey</sentence>
<sentence id="2">Hello</sentence>
</paragraph>
</response>
I'm trying to use XML::LibXML to parse a document and get the position of the sentences.
my $root_node = XML::LibXML->load_xml( ... )->documentElement;
foreach my $sentence_node ( $root_node->findnodes('//sentence')->get_nodelist ){
print $sentence_node->find( 'position()' );
}
The error I get is "XPath error : Invalid context position error". I've read up on the docs and found this interesting tidbit
evaluating XPath function position() in the initial context raises an XPath error
My problem is that I have no idea what to do with this information. What is the 'initial context'? How do I make the engine automatically track the context position?
Re: @Dan
Appreciate the answer. I tried your example and it worked. In my code, I was assuming context to be the node represented by my perl variable. So, $sentence->find( 'position()' ) I wanted to be './position()'. Despite seeing a working example, I still can't do
foreach my $sentence ...
my $id = $sentence->getAttribute('id');
print $root_node->findvalue( '//sentence[@id=' . "$id]/position()");
I can, however, do
$root_node->findvalue( '//sentence[@id=' . "$id]/text()");
Can position() only be used to limit a query like you have?
position() does work in LibXML. For example see
my $root_node = $doc->documentElement;
foreach my $sentence_node ( $root_node->findnodes('//sentence[position()=2]')->get_nodelist ){
print $sentence_node->textContent;
}
This will print Hello with your sample data.
But the way you're using it here, there's no context. For each sentence_node, you want its position relative to what?
If you're looking for specific nodes by position, use a selector like I have above, that's simplest.
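If the goal is the other direction, i.e. finding out where a given sentence sits among its siblings, one common XPath idiom is to count the preceding siblings of the same name. This is a sketch rather than something from the answer above; it assumes "position" means the position within the enclosing paragraph, and the variable names are mine.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(string =>
    '<response><paragraph>'
  . '<sentence id="1">Hey</sentence>'
  . '<sentence id="2">Hello</sentence>'
  . '</paragraph></response>');

my @positions;
for my $sentence_node ($doc->findnodes('//sentence')) {
    # Number of earlier <sentence> siblings, plus one, is the 1-based position
    my $pos = $sentence_node->findvalue('count(preceding-sibling::sentence) + 1');
    push @positions, $pos;
    print "$pos: ", $sentence_node->textContent, "\n";
}
```

Here the context node is the sentence itself, so the expression is well defined, whereas a bare position() has no node list to be relative to.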