Perl Hash using LibXML - perl

I have XML data as follows.
<type>
<data1>something1</data1>
<data2>something2</data2>
</type>
<type>
<data1>something1</data1>
<data2>something2</data2>
</type>
<type>
<data1>something1</data1>
</type>
As one can see, child node data2 is sometimes not present.
I have used this as a guideline to create the following code.
my %hash;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($file_name);
my @nodes = $doc->findnodes("/type");
foreach my $node (@nodes)
{
my $key = $node->getChildrenByTagName('data1');
my $value = $node->getChildrenByTagName('data2');
$hash{$key} = $value;
}
Later, I use this hash to generate more data depending on whether the child node data2 is present.
I use the ne operator, assuming that the data in %hash are key-value pairs of strings and that, when data2 is not present, Perl inserts a space as the value in the hash (I have printed this hash and found that only a space is printed as the value).
However, I end up with the following error.
Operation "ne": no method found,
left argument in overloaded package XML::LibXML::NodeList,
right argument has no overloaded magic at filename.pl line 74.
How do I solve this? What is the best data structure to store this data when we see that sometimes a node will not be there ?

The first thing to realize is that $value is an XML::LibXML::NodeList object. It only looks like a string when you print it because it has stringification overloaded. You can check with ref $value.
With my $value = $node->getChildrenByTagName('data2');, $value will always be a NodeList object. It might be an empty NodeList, but you'll always get a NodeList object.
Your version of XML::LibXML is out of date. Your version of XML::LibXML::NodeList has no string comparison overloading and, by default, Perl will not "fallback" to use stringification for other string operators like ne. I reported this bug back in 2010 and it was fixed in 2011 in version 1.77.
Upgrade XML::LibXML and the problem will go away.
As a workaround, you can force stringification by quoting the NodeList object.
if( "$nodelist" ne "foo" ) { ... }
But really, update that module. There's been a lot of work done on it.
Perl inserts space as a value in the hash
This is a NodeList object stringifying. I get an empty string from an empty NodeList. You might be getting a space because of an old bug.
You can also check $value->size to see if the NodeList is empty.
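As a rough sketch of the points above, one way to sidestep NodeList objects is to store plain strings in the hash and use undef for a missing data2. The wrapping <root> element, the sample values, and the variable names are assumptions made for illustration; they are not from the original question.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumption: the <type> elements are wrapped in a single <root> element.
my $xml = <<'XML';
<root>
<type><data1>key1</data1><data2>value1</data2></type>
<type><data1>key2</data1></type>
</root>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

my %hash;
for my $node ($doc->findnodes('/root/type')) {
    # findvalue returns the text as a plain string, not a NodeList.
    my $key = $node->findvalue('data1');

    # In list context getChildrenByTagName returns the matching nodes,
    # so $data2 is the first <data2> element or undef if there is none.
    my ($data2) = $node->getChildrenByTagName('data2');

    $hash{$key} = defined $data2 ? $data2->textContent : undef;
}

for my $key (sort keys %hash) {
    if (defined $hash{$key}) {
        print "$key has data2: $hash{$key}\n";
    }
    else {
        print "$key has no data2\n";
    }
}
```

With plain strings and undef in the hash, later code can test for presence with defined and compare values with ne without depending on NodeList overloading.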

Related

Perl LibXML: isSameNode method

I'm trying to compare two nodes using 'isSameNode'. However, one of the nodes is created via 'parse_balanced_chunk'. The http://www.w3.org/TR/DOM-Level-3-Core/core.html#Node3-isSameNode document says: "This method provides a way to determine whether two Node references returned by the implementation reference the same object."
So I'm wondering: is it not working as I would expect because they are indeed coming from two different sources (one from the parsed text file, the other from parse_balanced_chunk)?
Here is the 'test_in.xml' & code:
<?xml version="1.0"?>
<TT>
<A>ZAB</A>
<B>ZBW</B>
<C>
<D>
<E>ZSE</E>
<F>ZLC</F>
</D>
</C>
<C>
<K>
<H>dog</H>
<E>one123</E>
<M>bird</M>
</K>
</C>
</TT>
use warnings;
use strict;
use XML::LibXML;
my $parser = XML::LibXML->new({keep_blanks=>(0)});
my $dom = $parser->load_xml(location => 'test_in.xml') or die;
# list-context assignment: ($na1) receives the first matching node
my ($na1) = $dom->findnodes('//D');
my ($na2) = $dom->findnodes('//D');
my $X1 = $na1->isSameNode($na2); ##MATCHES
my ($frg) = $parser->parse_balanced_chunk ("<D><E>ZSE</E><F>ZLC</F></D>");
my $X2 = $na1->isSameNode($frg); ##WHY NO MATCH?
my ($na3) = $frg->findnodes('//D');
my $X3 = $na1->isSameNode($na3); ##WHY NO MATCH?
print "SAME?: $X1\n";
print "SAME?: $X2\n";
print "SAME?: $X3\n";
And the output:
SAME?: 1
SAME?: 0
SAME?: 0
So the first 'isSameNode' test obviously MATCHES (same exact findnodes & xpath expression).
But neither the 2nd nor the 3rd 'isSameNode' test matches when using the node from 'parse_balanced_chunk'. Is it something simple I'm overlooking with the syntax, or is it just that I can't compare two nodes this way? If not, what is the method for comparing two nodes? What I'm ultimately trying to determine is whether a block of XML (i.e. from a previous parse_balanced_chunk) already exists in the XML file.
Because they're not the same node. Like the name says, it checks if two nodes are the same node. You seem to think it checks if two nodes are equivalent, but that's not what it does.
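If the goal is equivalence rather than identity, one possible approach (not something the answer above prescribes) is to compare the serialized form of the two nodes. The sketch below assumes a compact document with no whitespace between elements. Comparing strings this way is sensitive to whitespace, attribute order, and namespace declarations, so treat it as a starting point only.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;

# Assumption: a cut-down, whitespace-free version of test_in.xml.
my $dom = $parser->load_xml(
    string => '<TT><C><D><E>ZSE</E><F>ZLC</F></D></C></TT>'
);
my ($in_doc) = $dom->findnodes('//D');

# parse_balanced_chunk returns a document fragment; its first child is <D>.
my $fragment   = $parser->parse_balanced_chunk('<D><E>ZSE</E><F>ZLC</F></D>');
my $from_chunk = $fragment->firstChild;

# Identity: are both references the very same node object?
my $identical = $in_doc->isSameNode($from_chunk) ? 1 : 0;

# Equivalence (rough): do both nodes serialize to the same text?
my $equivalent = ($in_doc->toString() eq $from_chunk->toString()) ? 1 : 0;

print "same node: $identical\n";
print "same serialization: $equivalent\n";
```

To check whether a chunk already exists anywhere in the file, the same comparison could be applied to each candidate node returned by findnodes.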

Using hash as a reference is deprecated

I searched SO before asking this question, I am completely new to this and have no idea how to handle these errors. By this I mean Perl language.
When I put this
%name->{@id[$#id]} = $temp;
I get the error Using a hash as a reference is deprecated
I tried
$name{@id[$#id]} = $temp
but couldn't get any results back.
Any suggestions?
The correct way to access an element of hash %name is $name{'key'}. The syntax %name->{'key'} was valid in Perl v5.6 but has since been deprecated.
Similarly, to access the last element of array @id you should write $id[$#id] or, more simply, $id[-1].
Your second variation should work fine, and your inability to retrieve the value has an unrelated reason.
Write
$name{$id[-1]} = 'test';
and
print $name{$id[-1]};
will display test correctly.
%name->{...}
has always been buggy. It doesn't do what it should do. As such, it now warns when you try to use it. The proper way to index a hash is
$name{...}
as you already believe.
Now, you say
$name{@id[$#id]}
doesn't work, but if so, it's because of an error somewhere else in the code. That code most definitely works
>perl -wE"@id = qw( a b c ); %name = ( a=>3, b=>4, c=>5 ); say $name{@id[$#id]};"
Scalar value @id[$#id] better written as $id[$#id] at -e line 1.
5
As the warning says, though, the proper way to index an array isn't
@id[...]
It's actually
$id[...]
Finally, the easiest way to get the last element of an array is to use index -1. That means your code should be
$name{ $id[-1] }
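Putting the pieces above together, here is a minimal, self-contained sketch. The sample contents of @id and %name and the value of $temp are made up for illustration.

```perl
use strict;
use warnings;

my @id   = qw(a b c);
my %name = (a => 3, b => 4);
my $temp = 5;

# Store $temp under the key given by the last element of @id ('c').
$name{ $id[-1] } = $temp;

# $id[$#id] and $id[-1] refer to the same element, so this reads it back.
my $stored = $name{ $id[$#id] };
print "$stored\n";    # prints 5
```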
The popular answer is to just not dereference, but that's not correct. In other words %$hash_ref->{$key} and %$hash_ref{$key} are not interchangeable. The former is required to access a hash reference nested as an element in another hash reference.
For many moons it has been commonplace to nest hash references. In fact, there are several modules that parse data and store it in this kind of data structure. Instantly deprecating the behavior without module updates was not a good thing. At times my data is trapped in a nested hash and the only way to get it is to do something like.
$new_hash_ref = $target_hash_ref->{$key1}
$new_hash_ref2 = $target_hash_ref->{$key2}
$new_hash_ref3 = $target_hash_ref->{$key3}
because I can't
foreach my $i(keys(%$target_hash_ref)) {
foreach(%$target_hash_ref->{$i}) {
#do stuff with $_
}
}
anymore.
Yes the above is a little strange, but creating new variables just to avoid accessing a data structure in a certain way is worse. Am I missing something?
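For what it is worth, a nested hash reference can be walked without intermediate variables by dereferencing the inner reference with a block, %{ ... }. The sketch below uses invented sample data to show that form.

```perl
use strict;
use warnings;

# Hypothetical nested data: a hash reference whose values are hash references.
my $target_hash_ref = {
    fruit => { apple => 1, pear => 2 },
    veg   => { leek  => 3 },
};

my @seen;
foreach my $outer (sort keys %$target_hash_ref) {
    # %{ ... } dereferences the inner hash reference in place.
    foreach my $inner (sort keys %{ $target_hash_ref->{$outer} }) {
        # Adjacent subscripts have an implied arrow between them.
        push @seen, "$outer.$inner=$target_hash_ref->{$outer}{$inner}";
    }
}

print "$_\n" for @seen;
```

This prints fruit.apple=1, fruit.pear=2 and veg.leek=3, one per line.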
If you want one item from an array or hash use $. For a list of items use @ and % respectively. Your use of @ as a reference returned a list instead of an item, which perl may have interpreted as a hash.
This code demonstrates indexing a hash with the last element of an array.
#!/usr/bin/perl -w
my %these = ( 'first'=>101,
'second'=>102,
);
my @those = qw( first second );
print $these{$those[$#those]};
prints '102'

Meaning of NEXT in Linked List creation in perl

So I am trying to learn Linked Lists using Perl. I am reading Mastering Algorithms with Perl by Jon Orwant. In the book he explains how to create a linked list.
I understand most of it, but I just simply fail to understand the command/index/key NEXT in the second last line of the code snippet.
$list=undef;
$tail=\$list;
foreach (1..5){
my $node = [undef, $_ * $_];
$$tail = $node;
$tail = \${$node->[NEXT]}; # The NEXT on this line?
}
What is he trying to do there?
Is $node a scalar, which stores the address of the unnamed array? Also even if we are dereferencing $node, should we not refer to the individual elements by an index number, such as (0,1). If we do use NEXT as a key, is $node a reference to a hash?
I am very confused.
Something in plain English will be highly appreciated.
NEXT is a constant, declared earlier in the script. It contains an integer value representing the index of the current node's member element that refers to the next node.
Under this scheme, each node is a small anonymous array. One element of this anonymous array contains the payload, and the other contains a reference pointing to the next node.
If you look at some of the earlier examples in that chapter you will see the following declarations:
use constant NEXT => 0;
use constant VAL => 1;
So $node->[NEXT] is synonymous to $node->[0], which contains a reference to the next node in the linked list chain, while $node->[VAL] is synonymous with $node->[1]; the value (or payload) stored in the current node.
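A small sketch of how such constants act as array indices. The two hand-built nodes are only an illustration, not code from the book.

```perl
use strict;
use warnings;
use constant NEXT => 0;
use constant VAL  => 1;

# Two nodes built by hand: each is [ reference to next node, payload ].
my $second = [undef,   4];
my $first  = [$second, 1];

print $first->[VAL], "\n";          # payload of the first node: 1
print $first->[NEXT][VAL], "\n";    # payload of the node after it: 4

# NEXT is just the number 0, so both subscripts name the same slot.
my $same_slot = ($first->[NEXT] == $first->[0]) ? 1 : 0;
print "same slot: $same_slot\n";
```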
I'll comment on the code snippet you provided:
foreach (1..5){
my $node = [undef, $_ * $_]; # Create a new node as an anon array.
# Set the previous node's "next node reference" to point to this new node.
$$tail = $node;
# Remember a reference to the new node's "next node reference" element.
# So that it can be updated when another new element is added on next iteration.
$tail = \${$node->[NEXT]}; # The NEXT on this line?
}
Excellent book, by the way. I've got several algorithms books, and that one continues to be among my favorites after all these years.
Update: I do agree that the book isn't a model of current idiomatic Perl, or current "best practices" Perl, but do feel it is a nice resource for gaining an understanding of the application of classic algorithms with Perl. I still refer back to it from time to time.
NEXT is a constant, declared on an earlier page, that contains a number. It's being used instead of just the regular number to access the array ref $node so the reader knows that slot is where the next element in the linked list is stored.
It's a technique to use array references to store things other than lists. The technique was intended to save memory and CPU time compared to using a hash reference. In reality it doesn't make much performance difference and it's awkward to work with. The book is quite a bit out of date in its ideas about how to write Perl code. Use a hash reference instead.
my $list;
my $tail = \$list;
foreach my $num (1..5) {
my $node = { data => $num };
$$tail = $node;
$tail = \$node->{next};
}
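To see that the hash-reference version really builds a chain, here is a sketch that constructs the list and then walks it. Storing squares as the payload (as in the question) and the @values array are choices made for this illustration.

```perl
use strict;
use warnings;

my $list;
my $tail = \$list;    # reference to the slot where the next node goes

foreach my $num (1 .. 5) {
    my $node = { data => $num * $num };
    $$tail = $node;              # fill the current slot with the new node
    $tail  = \$node->{next};     # the new node's 'next' is now the open slot
}

# Walk the list from the head until 'next' is undefined.
my @values;
for (my $node = $list; $node; $node = $node->{next}) {
    push @values, $node->{data};
}

print "@values\n";    # prints 1 4 9 16 25
```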

Why do all of these methods of accessing an array work?

It seems to me that some of these should fail, but they all output what they are supposed to:
$, = "\n";
%test = (
"one" => ["one_0", "one_1"],
"two" => ["two_0", "two_1"]
);
print @{$test{"one"}}[0],
@{$test{"one"}}->[0],
$test{"two"}->[0],
$test{"one"}[1];
Why is this?
Your first example is different than the others. It is an array slice -- but in disguise since you are only asking for one item.
@{$test{"one"}}[0];
@{$test{"one"}}[1, 0]; # Another slice example.
Your other examples are alternative ways of de-referencing items within a multi-level data structure. However, the use of an array for de-referencing is deprecated (see perldiag).
@{$test{"one"}}->[0]; # Deprecated. Turn on 'use warnings'.
$test{"two"}->[0]; # Valid.
$test{"one"}[1]; # Valid but easier to type.
Regarding the last example, two subscripts sitting next to each other have an implied -> between them. See, for example, the discussion of the Arrow Rule in perlreftut.
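The valid forms can be compared side by side in a short sketch. It leaves out the deprecated array-as-reference form, and the variable names are invented for the example.

```perl
use strict;
use warnings;

my %test = (
    one => ['one_0', 'one_1'],
    two => ['two_0', 'two_1'],
);

my @slice   = @{ $test{one} }[1, 0];    # array slice: a list of elements
my $element = ${ $test{one} }[0];       # one element, explicit dereference
my $arrow   = $test{two}->[0];          # one element, arrow syntax
my $implied = $test{one}[1];            # arrow implied between subscripts

print join(',', @slice), "\n";          # prints one_1,one_0
print "$element $arrow $implied\n";     # prints one_0 two_0 one_1
```

As far as I know, the form marked deprecated above is a fatal error from Perl 5.22 onward rather than just a warning, so it is best avoided even in code that does not enable warnings.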

Why doesn't Perl's XML::LibXML module (specifically XPathContext) evaluate positions?

I have an XML representation of a document that has the form:
<response>
<paragraph>
<sentence id="1">Hey</sentence>
<sentence id="2">Hello</sentence>
</paragraph>
</response>
I'm trying to use XML::LibXML to parse a document and get the position of the sentences.
my $root_node = XML::LibXML->load_xml( ... )->documentElement;
foreach my $sentence_node ( $root_node->findnodes('//sentence')->get_nodelist ){
print $sentence_node->find( 'position()' );
}
The error I get is "XPath error : Invalid context position error". I've read up on the docs and found this interesting tidbit
evaluating XPath function position() in the initial context raises an XPath error
My problem is that I have no idea what to do with this information. What is the 'initial context'? How do I make the engine automatically track the context position?
Re: @Dan
Appreciate the answer. I tried your example and it worked. In my code, I was assuming context to be the node represented by my perl variable. So, $sentence->find( 'position()' ) I wanted to be './position()'. Despite seeing a working example, I still can't do
foreach my $sentence ...
my $id = $sentence->getAttribute('id');
print $root_node->findvalue( '//sentence[@id=' . "$id]/position()");
I can, however, do
$root_node->findvalue( '//sentence[@id=' . "$id]/text()");
Can position() only be used to limit a query like you have?
position() does work in LibXML. For example see
my $root_node = $doc->documentElement;
foreach my $sentence_node ( $root_node->findnodes('//sentence[position()=2]')->get_nodelist ){
print $sentence_node->textContent;
}
This will print Hello with your sample data.
But the way you're using it here, there's no context. For each sentence_node, you want its position relative to what?
If you're looking for specific nodes by position, use a selector like I have above, that's simplest.
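If what you want is each sentence's position among its sibling sentences, one possible workaround (an assumption about the intent, not something confirmed in the question) is to count the preceding siblings from each node. The %position_of hash is a name invented for this sketch.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(string => <<'XML');
<response>
<paragraph>
<sentence id="1">Hey</sentence>
<sentence id="2">Hello</sentence>
</paragraph>
</response>
XML

my $root_node = $doc->documentElement;

my %position_of;
foreach my $sentence_node ($root_node->findnodes('//sentence')) {
    # Number of earlier <sentence> siblings, plus one, gives a 1-based
    # position within the parent <paragraph>.
    my $position = $sentence_node->findvalue(
        'count(preceding-sibling::sentence) + 1'
    );
    $position_of{ $sentence_node->textContent } = $position;
    print $sentence_node->textContent, " is sentence number $position\n";
}
```

Here the context for the expression is the sentence node itself, and the position is computed explicitly instead of relying on position(), which needs a node set to be meaningful.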