Get NodeSet Size in XML::XPath - perl

I have an XML document and I want to get the size of a node set in it.
XML
<a>
<b>
<c>data</c>
<c>data</c>
<c>data</c>
</b>
</a>
I want to get the count of c elements in the b tag.
my $obj = XML::XPath->new(xml => $xml);
print size(($obj->find('/a/b'));
I am not able to get the count of c in this XML

size is a method, not a function. Also, your XPath expression matches the b node, not its children.
The following works:
my $cs = $obj->find('/a/b/c');
print $cs->size, "\n";
Or, shorter, without the intermediate variable:
print $obj->find('/a/b/c')->size, "\n";

Related

Scrape HTML files with Perl, returning content only, in order

Using HTML::TreeBuilder -- or Mojo::DOM -- I'd like to scrape the content but keep it in order, so that I can put the text values into an array (and then replace the text values with a variable for templating purposes)
But this in TreeBuilder
my $map_r = $tree->tagname_map();
my @contents = map { $_->content_list } $tree->find_by_tag_name(keys %$map_r);
foreach my $c (@contents) {
say $c;
}
doesn't return the order -- of course hashes aren't ordered. So, how to visit the tree from root down and keep the sequence of values returned? Recursively walk the tree? Essentially, I'd like to use the method 'as_text' except for each element. (Followed this nice idea but I need it for all elements)
This is better (using Mojo::DOM):
$dom->parse($html)->find('*')->each(
sub {
my $text = shift->text;
$text =~ s/\s+/ /gi;
push @text, $text;
}
);
However, any further comments are welcome.

Perl libXML find node by attribute value

I have a very large XML document that I am iterating through. The XML uses mostly attributes rather than node values. I may need to find numerous nodes in the file to piece together one grouping of information. They are tied together via different ref tag values. Currently, each time I need to locate one of the nodes to extract data from it, I loop through the entire XML and match on the attribute to find the correct node. Is there a more efficient way to select a node with a given attribute value instead of constantly looping and comparing? My current code is so slow it is almost useless.
Currently I am doing something like this numerous times in the same file for numerous different nodes and attribute combinations.
my $searchID = "1234";
foreach my $nodes ($xc->findnodes('/plm:PLMXML/plm:ExternalFile')) {
my $ID = $nodes->findvalue('@id');
my $File = $nodes->findvalue('@locationRef');
if ( $searchID eq $ID ) {
print "The File Name = $File\n";
}
}
In the above example I am looping and using an "if" to compare for an ID match. I was hoping I could do something like this below to just match the node by attribute instead... and would it be any more efficient than looping?
my $searchID = "1234";
$nodes = ($xc->findnodes('/plm:PLMXML/plm:ExternalFile[@id=$searchID]'));
my $File = $nodes->findvalue('@locationRef');
print "The File Name = $File\n";
Do one pass to extract the information you need into a more convenient format or to build an index.
my %nodes_by_id;
for my $node ($xc->findnodes('//*[@id]')) {
$nodes_by_id{ $node->getAttribute('id') } = $node;
}
Then your loops become
my $node = $nodes_by_id{'1234'};
(And stop using findvalue instead of getAttribute.)
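The one-pass index can be tried end to end with a small script. This is only a sketch: the sample document and the namespace URI http://example.com/plm are made up for illustration and are not taken from the question.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical sample document; the namespace URI is a placeholder.
my $xml = <<'XML';
<PLMXML xmlns="http://example.com/plm">
  <ExternalFile id="1234" locationRef="a.pdf"/>
  <ExternalFile id="5678" locationRef="b.pdf"/>
</PLMXML>
XML

my $doc = XML::LibXML->load_xml(string => $xml);
my $xc  = XML::LibXML::XPathContext->new($doc);
$xc->registerNs(plm => 'http://example.com/plm');

# One pass over the document: index every element that has an id attribute.
my %nodes_by_id;
for my $node ($xc->findnodes('//*[@id]')) {
    $nodes_by_id{ $node->getAttribute('id') } = $node;
}

# Each later lookup is a hash access rather than another scan of the document.
for my $id (qw(1234 5678 9999)) {
    my $node = $nodes_by_id{$id};
    my $file = $node ? $node->getAttribute('locationRef') : '(not found)';
    print "$id => $file\n";
}
```

If the same id value can appear on different element types, the hash would need a composite key (for example element name plus id); the sketch assumes ids are unique across the document.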
If you will be doing this for lots of IDs, then ikegami's answer is worth reading.
I was hoping I could do something like this below to just match the node by attribute instead
...
$nodes = ($xc->findnodes('/plm:PLMXML/plm:ExternalFile[@id=$searchID]'));
Sort of.
For a given ID, yes, you can do
$nodes = $xc->findnodes("/plm:PLMXML/plm:ExternalFile[\@id=$searchID]");
... provided that $searchID is known to be numeric. Note that Perl interpolates variables inside double quotes, so you must escape the @ in @id (it is part of the literal XPath string, not a Perl array), whereas $searchID is left unescaped because you want its value to become part of the XPath string.
Note also that because findnodes is called in scalar context here, $nodes will hold an XML::LibXML::NodeList object, not the actual node and not an array reference; to get an array reference, use square brackets instead of round ones, as in the next example.
Alternatively, if your search id may not be numeric but you know for sure that it is safe to be put in an XPath string (e.g. doesn't have any quotes), you can do the following:
$nodes = [ $xc->findnodes('/plm:PLMXML/plm:ExternalFile[@id="' . $searchID . '"]') ];
print $nodes->[0]->getAttribute('locationRef'); # if you're 100% sure it exists
Notice here that the resulting string will enclose the value in quotation marks.
Finally, it is possible to skip straight to:
print $xc->findvalue('/plm:PLMXML/plm:ExternalFile[@id="' . $searchID . '"]/@locationRef');
... providing you know that there is only one node with that id.
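To see exactly which XPath string each quoting style produces, the two approaches can be compared without any XML module. The helper names xpath_numeric and xpath_quoted are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers; the names are illustrative only.

# Double-quoted string: \@ keeps "@id" literal, while $id is interpolated.
sub xpath_numeric {
    my ($id) = @_;
    return "/plm:PLMXML/plm:ExternalFile[\@id=$id]";
}

# Single-quoted pieces joined by concatenation: nothing is interpolated,
# and the value ends up wrapped in double quotes inside the XPath.
sub xpath_quoted {
    my ($id) = @_;
    return '/plm:PLMXML/plm:ExternalFile[@id="' . $id . '"]';
}

print xpath_numeric(1234), "\n";    # /plm:PLMXML/plm:ExternalFile[@id=1234]
print xpath_quoted('A-12'), "\n";   # /plm:PLMXML/plm:ExternalFile[@id="A-12"]
```

Neither helper validates its input, so the earlier caveat still applies: the quoted form is only safe when the id contains no double quotes.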
I think you just need to do some study on XPath expressions. For example, you could do something like this:
my $search_id = "1234";
my $query = "/plm:PLMXML/plm:ExternalFile[\@id = '$search_id']";
foreach my $node ($xc->findnodes($query)) {
# ...
}
In the XPath expression you can also combine multiple attribute checks, e.g.:
[@id = '$search_id' and contains(@pathname, '.pdf')]
One XPath Tutorial of many
If you have a DTD for your document that declares the id attribute as DTD ID, and you make sure the DTD is read when parsing the document, you can access the elements with a certain id efficiently via $doc->getElementById($id).

Adding html code from subfunction via strings

I have 2 functions that generate simple HTML output
sub one_data {}
sub generate_page {}
The generate_page is the 'meat and potatoes' which generates all of the content, however the one_data{} function generates a small amount of html (divs, etc)
I am trying to add it to a section of code that generate_page does, something like this, ie:
$npage .= sprintf "<div class=sidepage>%s</div>", &one_data();
That doesn't seem to accomplish what I want, even though one_data returns a simple string (in theory it should work, per perldoc sprintf).
I've also tried this, ie:
my $data = &one_data();
$npage .= sprintf "<div class=whatever>%s</div>", $data;
But the "%s" placeholder is always replaced by the number 1.
One_data /does/ work, as I've moved it into a simple perl script and it displays the required html output.
Your one_data sub should use an explicit return statement:
use warnings;
use strict;
my $npage = sprintf "<div class=sidepage>%s</div>", one_data();
print "$npage\n";
sub one_data {
return 'foooo';
}
__END__
<div class=sidepage>foooo</div>
If your sub uses print instead of return, the value returned by the sub will be 1 (assuming the print was successful). See also perldoc perlsub.
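The difference is easy to reproduce side by side. In this sketch, one_data_printing and one_data_returning are hypothetical stand-ins for the asker's one_data, which was not shown.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the asker's one_data sub.
sub one_data_printing  { print "<p>side</p>\n" }   # last statement is print
sub one_data_returning { return "<p>side</p>" }    # explicit return value

my $printed  = one_data_printing();    # prints the markup right away
my $returned = one_data_returning();   # prints nothing, hands back the string

printf "<div class=sidepage>%s</div>\n", $printed;    # <div class=sidepage>1</div>
printf "<div class=sidepage>%s</div>\n", $returned;   # <div class=sidepage><p>side</p></div>
```

The first sub emits its markup before the div is assembled and leaves only print's true value (1) behind, which matches the symptom described in the question.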
Fixed this by adding an artificial sleep into the function, as it was returning too early/timing out.

perl, libxml, xpath : how to get an element through an attribute in this example .xml file

I would like your help in the following :
given the .xml file :
<network>
<netelement>
<node pwd="KOR-ASBG" func="describe_SBG_TGC">
<collection category="IMT" dir="Stream_statistics"></collection>
</node>
</netelement>
<netelement>
<node pwd="ADR-ASBG" func="describe_SBG_TGC">
<collection category="IMT" dir="Stream_statistics"></collection>
<collection category="IMT" dir="Proxy_registrar_statistics_ACCESS"></collection>
</node>
</netelement></network>
What I would like to do is to get the element with the attribute "KOR-ASBG", for example,
but using only XPath.
I have written the following Perl code :
#!/usr/bin/perl -w
use strict ;
use warnings ;
use XML::LibXML ;
use Data::Dump qw(dump) ;
my $dump = "/some_path/_NETELEMENT_.xml" ;
my $parser = new XML::LibXML ; my $doc ;
eval{ $doc = $parser->parse_file($dump) ; } ;
if( !$doc ) { print "failed to parse $dump" ; next ; }
my $root = $doc->getDocumentElement ;
my $_demo = $root->find('/network/netelement/node[@pwd="KOR-ASBG"]') ;
print dump($_demo)."\n" ;
But what gets displayed is:
bless([bless(do{\(my $o = 155172440)}, "XML::LibXML::Element")], "XML::LibXML::NodeList")
So the question would be, how can I get the XML Element that contains the "pwd" attribute (that equals "KOR-ASBG"), using XPath ?
Thank you :)
PS. I have also tried :
my @_demo = $root->findnodes('/network/netelement/node[@pwd="KOR-ASBG"]') ;
print dump(@_demo)."\n" ;
and what gets displayed is:
bless(do{\(my $o = 179552448)}, "XML::LibXML::Element")
There could technically be more than one element that matches, which is why a result set is being returned instead of single node. You could use
my ($ele) = $root->findnodes('/network/netelement/node[@pwd="KOR-ASBG"]');
That will place the first match into $ele.
Your dumper object is not lying to you; you are getting a node list. To access it you may either iterate through it or just access the first node:
print $_demo->get_node(0)->toString()
Of course, all DOM methods are available to you once you get the actual node:
print $_demo->get_node(0)->getAttribute('func');
What you are seeing is what is sometimes called an "opaque object" in Perl. It's not a hash: the blessed scalar only holds a number that identifies data stored elsewhere (for XML::LibXML, in the underlying libxml2 structures), so dumping it shows nothing useful. The only way to get at its information is to call its accessor methods.
The way to figure out how to deal with these is note the second argument to the bless and look up this:
http://search.cpan.org/perldoc?<name-of-package>
Or in your case: http://search.cpan.org/perldoc?XML::LibXML::NodeList and
http://search.cpan.org/perldoc?XML::LibXML::Element
Now, I don't recommend this in all cases, but if you notice, the NodeList object is a blessed array reference. So you could just access the first or last node directly, like so:
my $nodes = $root->find('/network/netelement/node[@pwd="KOR-ASBG"]');
my $first_node = $nodes->[0];
my $last_node = $nodes->[-1];
Of course it often makes sense to make a list implementation behave like an array, either through blessed array or overloaded operators or ties. So, in this case, I don't think it's too big a violation of encapsulation.

XML parsing using perl

I tried to research a simple question but couldn't find an answer. I am trying to get data from the web in XML and parse it using Perl. I know how to loop over repeating elements, but I am stuck when an element is not repeating (I know this might be silly). If the elements repeat, I put them in an array and get the data. But when there is only a single element, it throws an error saying 'Not an array reference'. I want my code to handle both cases (single and multiple elements). The code I am using is as follows:
use LWP::Simple;
use XML::Simple;
use Data::Dumper;
open (FH, ">:utf8","xmlparsed1.txt");
my $db1 = "pubmed";
my $query = "13054692";
my $q = 16354118; #for multiple MeSH terms
my $xml = new XML::Simple;
$urlxml = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=$db1&id=$query&retmode=xml&rettype=abstract";
$dataxml = get($urlxml);
$data = $xml->XMLin("$dataxml");
#print FH Dumper($data);
foreach $e (@{$data->{PubmedArticle}->{MedlineCitation}->{MeshHeadingList}->{MeshHeading}})
{
print FH $e->{DescriptorName}{content}, ' $$ ';
}
Also, can I do something such that the separator $$ will not get printed after the last element?
I also tried the following code:
$mesh = $data->{PubmedArticle}->{MedlineCitation}->{MeshHeadingList}->{MeshHeading};
while (my ($key, $value) = each(%$mesh)){
print FH "$value";
}
But, this prints all the childnodes and I just want the content node.
Perl's XML::Simple will take a single item and return it as a scalar, and if the value repeats it sends it back as an array reference. So, to make your code work, you just have to force MeshHeading to always return an array reference:
$data = $xml->XMLin("$dataxml", ForceArray => [qw( MeshHeading )]);
I think you missed the part of "perldoc XML::Simple" that talks about the ForceArray option:
check out ForceArray because you'll almost certainly want to turn it on
Then you will always get an array, even if the array contains only one element.
As others have pointed out, the ForceArray option will solve this particular problem. However you'll undoubtedly strike another problem soon after due to XML::Simple's assumptions not matching yours. As the author of XML::Simple, I strongly recommend you read Stepping up from XML::Simple to XML::LibXML - if nothing else it will teach you more about XML::Simple.
Since $data->{PubmedArticle}-> ... ->{MeshHeading} can be either a single item or an array reference depending on how many <MeshHeading> tags are present in the document, you need to examine the value's type with ref and conditionally dereference it. Since I am unaware of any terse Perl idioms for doing this, your best bet is to write a function:
sub toArray {
my $meshes = shift;
if (!defined $meshes) { return () }
elsif (ref $meshes eq 'ARRAY') { return @$meshes }
else { return ($meshes) }
}
and then use it like so:
foreach my $e (toArray($data->{PubmedArticle}->{MedlineCitation}->{MeshHeadingList}->{MeshHeading})) { ... }
To prevent ' $$ ' from being printed after the last element, instead of looping over the list, concatenate all the elements together with join:
print FH join ' $$ ', map { $_->{DescriptorName}{content} }
toArray($data->{PubmedArticle}->{MedlineCitation}->{MeshHeadingList}->{MeshHeading});
This is a place where XML::Simple is being...simple. It deduces whether there's an array or not by whether something occurs more than once. Read the doc and look for the ForceArray option to address this.
To only include the ' $$ ' between elements, replace your loop with
print FH join ' $$ ', map $_->{DescriptorName}{content}, @{$data->{PubmedArticle}->{MedlineCitation}->{MeshHeadingList}->{MeshHeading}};
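The two ideas in this thread, normalising "one item or an array reference" into a list and joining with a separator, can be checked without fetching anything from the network. This is a minimal sketch: the sample data is hand-built to mimic the two shapes XML::Simple can produce, and to_list is a hypothetical helper name.

```perl
use strict;
use warnings;

# Return a list whether the argument is undef, a single item, or an array ref.
sub to_list {
    my ($value) = @_;
    return ()      unless defined $value;
    return @$value if ref $value eq 'ARRAY';
    return ($value);
}

# Hand-built sample data (hypothetical), mimicking the two possible shapes.
my $single = { DescriptorName => { content => 'Humans' } };
my $many   = [
    { DescriptorName => { content => 'Humans' } },
    { DescriptorName => { content => 'Mice' } },
];

# join puts the separator only between elements, never after the last one.
for my $mesh ($single, $many) {
    print join(' $$ ', map { $_->{DescriptorName}{content} } to_list($mesh)), "\n";
}
```

With ForceArray set for MeshHeading, the helper becomes unnecessary, since the value is then always an array reference.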