Perl XML lib get full xpath - perl

Looking to return the full xpath from a general xpath that may grab multiple results.
The search string would be something general like this:
/myXmlPath/#myValue
The contained xml nodes might look something like this:
<myXmlPath someAttribute="false" myValue="">
<myXmlPath someAttribute="true" myValue="">
Perl code something like this:
use XML::LibXML;
use XML::XPath::XMLParser;
my $filepath = "c:\\temp\\myfile.xml";
my $parser = XML::LibXML->new();
$parser->keep_blanks(0);
my $doc = $parser->parse_file($filepath);
#myWarn = ('/myXmlPath/#myValue');
foreach(#myWarn) {
my $nodeset = $doc->findnodes($_);
foreach my $node ($nodeset->get_nodelist) {
my $value = $node->to_literal;
print $_,"\n";
print $value," - value \n";
print $node," - node \n";
}
}
I'd like to be able to evaluate the returned full path values from the xml. This code works fine when I'm using it to lookup general things in an xpath, but would be more ideal if I could get at other data from the nodeset result.

Like ikegami said, I'm not sure exactly what you're after so I've kind of produced a shotgun approach for everything I could interpret your question.
use strict;
use warnings;
use XML::LibXML;
use v5.14;
my $doc = XML::LibXML->load_xml(IO => *DATA);
say "Get the full path to the node";
foreach my $node ($doc->findnodes('//myXmlPath/#myValue')) {
say "\t".$node->nodePath();
}
say "Get the parent node of the attribute by searching";
foreach my $node ($doc->findnodes('//myXmlPath[./#myValue="banana"]')) {
say "\t".$node->nodePath();
my ($someAttribute, $myValue) = map { $node->findvalue("./$_") } qw (#someAttribute #myValue);
say "\t\tsomeAttribute: $someAttribute";
say "\t\tmyValue: $myValue";
}
say "Get the parent node programatically";
foreach my $attribute ($doc->findnodes('//myXmlPath/#myValue')) {
my $element = $attribute->parentNode;
say "\t".$element->nodePath();
}
__DATA__
<document>
<a>
<b>
<myXmlPath someAttribute="false" myValue="apple" />
</b>
<myXmlPath someAttribute="false" myValue="banana" />
</a>
</document>
Which would produce:
Get the full path to the node
/document/a/b/myXmlPath/#myValue
/document/a/myXmlPath/#myValue
Get the parent node of the attribute by searching
/document/a/myXmlPath
someAttribute: false
myValue: banana
Get the parent node programatically
/document/a/b/myXmlPath
/document/a/myXmlPath

Related

Find Favicons in HTML using Perl

I'm trying to look for favicons (and variants) for a given URL using Perl (I'd like to avoid using an external service such as Google's favicon finder). There's a CPAN module, WWW::Favicon, but it hasn't been updated in over a decade -- a decade in which now important variants such as "apple-touch-icon" have come to replace the venerable "ico" file.
I thought I found the solution in WWW::Mechanize, since it can list all of the links in a given URL, including <link> header tags. However, I cannot seem to find a clean way to use the "find_link" method to search for the rel attribute.
For example, I tried using 'rel' as the search term, hoping maybe it was in there despite not being mentioned in the documentation, but it doesn't work. This code returns an error about an invalid "link-finding parameter."
my $results = $mech->find_link( 'rel' => "apple-touch-icon" );
use Data::Dumper;
say STDERR Dumper $results;
I also tried using other link-finding parameters, but none of them seem to be suited to searching out a rel attribute.
The only way I could figure out how to do it is by iterating through all links and looking for a rel attribute like this:
my $results = $mech->find_all_links( );
foreach my $result (#{ $results }) {
my $attrs = $result->attrs();
#'tag' => "apple-touch-icon"
foreach my $attr (sort keys %{ $attrs }) {
if ($attrs->{'rel'} =~ /^apple-touch-icon.*$/) {
say STDERR "I found it:" . $result->url();
}
# Add tests for other types of icons here.
# E.g. "mask-icon" and "shortcut icon."
}
}
That works, but it seems messy. Is there a better way?
Here's how I'd do it with Mojo::DOM. Once you fetch an HTML page, use dom to do all the parsing. From that, use a CSS selector to find the interesting nodes:
link[rel*=icon i][href]
This CSS selector looks for link tags that have the rel and href tags at the same time. Additionally, I require that the value in rel contain (*=) "icon", case insensitively (the i). If you want to assume that all nodes will have the href, just leave off [href].
Once I have the list of links, I extract just the value in href and turn that list into an array reference (although I could do the rest with Mojo::Collection methods):
use v5.10;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new->max_redirects(3);
my $results = $ua->get( shift )
->result
->dom
->find( 'link[rel*=icon i][href]' )
->map( attr => 'href' )
->to_array
;
say join "\n", #$results;
That works pretty well so far:
$ perl mojo.pl https://www.perl.org
https://cdn.perl.org/perlweb/favicon.ico
$ perl mojo.pl https://www.microsoft.com
https://c.s-microsoft.com/favicon.ico?v2
$ perl mojo.pl https://leanpub.com/mojo_web_clients
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-57x57-b83f183ad6b00aa74d8e692126c7017e.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-60x60-6dc1c10b7145a2f1156af5b798565268.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-72x72-5037b667b6f7a8d5ba8c4ffb4a62ec2d.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-76x76-57860ca8a817754d2861e8d0ef943b23.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-114x114-27f9c42684f2a77945643b35b28df6e3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-120x120-3819f03d1bad1584719af0212396a6fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-144x144-a79479b4595dc7ca2f3e6f5b962d16fd.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-152x152-aafe015ef1c22234133158a89b29daf5.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-16x16-c1207cd2f3a20fd50de0e585b4b307a3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-32x32-e9b1d6ef3d96ed8918c54316cdea011f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-96x96-842fcd3e7786576fc20d38bbf94837fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-128x128-e97066b91cc21b104c63bc7530ff819f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-196x196-b8cab44cf725c4fa0aafdbd237cdc4ed.png
Now, the problem comes if you find more interesting cases that you can't easily write a selector for. Suppose not all of the rel values have "icon" in them. You can get a little more fancy by specifying multiple selectors separated by commas so you don't have to use the experimental case insensitivity flag:
link[rel*=icon][href], link[rel*=ICON][href]
or different values in rel:
link[rel="shortcut icon"][href], link[rel="apple-touch-icon-precomposed"][href]
Line up as many of those as you like.
But, you could also filter your results without the selectors. Use Mojo::Collection's grep to pick out the nodes that you want:
my %Interesting = ...;
my $results = $ua->get( shift )
->result
->dom
->find( '...' )
->grep( sub { exists $Interesting{ $_->attr('rel') } } )
->map( attr => 'href' )
->to_array
;
I have a lot more examples of Mojo::DOM in Mojo Web Clients, and I think I'll go add this example now.
The problem is very easy to solve with:
assistance of any module allowing to load webpage
define $regex for all possible favicon variations
look for <link rel="$regex" href="icon_address" ...>
Note:
The script has default YouTube url embedded in the code
use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;
my $url = shift || 'https://www.youtube.com/';
my $icons = get_favicon($url);
say for #{$icons};
sub get_favicon {
my $url = shift;
my #lookup = (
'shortcut icon',
'apple-touch-icon',
'image_src',
'icon',
'alternative icon'
);
my $re = join('|',#lookup);
my $html = load_page($url);
my #icons = ($html =~ /<link rel="(?:$re)" href="(.*?)"/gmsi);
return \#icons;
}
sub load_page {
my $url = shift;
my $response = HTTP::Tiny->new->get($url);
my $html;
if ($response->{success}) {
$html = $response->{content};
} else {
say 'ERROR: Could not extract webpage';
say 'Status: ' . $response->{status};
say 'Reason: ' . $response->{reason};
exit;
}
return $html;
}
Run as script.pl
https://www.youtube.com/s/desktop/8259e7c9/img/favicon.ico
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_32.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_48.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_96.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_144.png
https://www.youtube.com/img/desktop/yt_1200.png
Run as script.pl "http://www.microsoft.com/"
https://c.s-microsoft.com/favicon.ico?v2
Run as script.pl "http://finance.yahoo.com/"
https://s.yimg.com/cv/apiv2/default/icons/favicon_y19_32x32_custom.svg

Print output using XML::LibXML

my $doc = $parser->parse_string( $res->content );
my $root = $doc->getDocumentElement;
my #objects = $root->getElementsByTagName('OBJECT');
foreach my $object ( #objects ){
my $name = $object->firstChild;
print "OBJECT = " . $name . "\n";}
OUTPUT is:
OBJECT = XML::LibXML::Text=SCALAR(0x262e170)
OBJECT = XML::LibXML::Text=SCALAR(0x2ee4b00)
OBJECT = XML::LibXML::Text=SCALAR(0x262e170)
OBJECT = XML::LibXML::Text=SCALAR(0x2ee4b00)
Can anyone please explain why print prints the $name attribute values like this? Why does it print normal when I use the function getAttribute with virtually he same code?
getAttribute returns an attribute, while firstChild returns a text node, element, processing instruction, or a comment.
What you see is a normal Perl way of printing an object: it prints its class and address. Your version of XML::LibXML seems to be a bit antique, recent versions overload the stringification and the code produces the actual text node.
#!/usr/bin/perl
use warnings;
use strict;
use XML::LibXML;
my $doc = 'XML::LibXML'->load_xml( string => << '__XML__');
<root>
<OBJECT name="o1">hello</OBJECT>
</root>
__XML__
my #objects = $doc->getElementsByTagName('OBJECT');
for my $object (#objects) {
print 'OBJECT = ', $object->firstChild, "\n";
}
Output:
OBJECT = hello
In the old versions, one needed to call the nodeValue or data method.
print 'OBJECT = ', $object->firstChild->data, "\n";

Which syntax is better XML::Simple or XML::Twig

I was running a Perl script and I encountered the following result, instead of the answer I expected.
input HASH(0x17268bb0)
input HASH(0x172b3300)
input HASH(0x172b32a0)
Can anyone say what this is and how to rectify it?
This is my XML file here
<Root>
<Top name="ri_32">
<Module name="ALU">
<input name="power_control_bus"/>
<bidirection name="address_bus"/>
</Module>
<Module name="Power_control">
<input name="cpu_control_bus"/>
<output name="power_control_bus"/>
<bidirection name="address_bus"/>
</Module>
<input name="address"/>
<input name="clock"/>
<input name="data_in"/>
<output name="data_out"/>
<bidirection name="control"/>
</Top>
</Root>
I'm writing a Perl script which can be converted into a specific requirement (.v, .sv file)
use strict;
use XML::Simple;
use Data::Dumper;
my $xml_root = XMLin( './simodule.xml' );
my $root_top = $xml_root->{Top};
my $mod = $root_top->{Module};
print "Top $root_top->{name}\n";
my $top_in = $root_top->{input};
foreach my $namein ( keys %$top_in ) {
print " input $top_in->{$namein}\n";
}
my $top_ou = $root_top->{output};
foreach my $nameou ( keys %$top_ou ) {
print " output $top_ou->{$nameou}\n";
}
my $top_bi = $root_top->{bidirection};
foreach my $namebi ( keys %$top_bi ) {
print " bidirection $top_bi->{$namebi}\n";
}
output:
Top risc_32
input HASH(0x172b3300)
input HASH(0x172b32a0)
input HASH(0x17268bb0)
output data_out
bidirection control
Expected output
input address
input clock
input data_in
output data_out
bidirection control
You've made your task more difficult for yourself by using one of the most deceitful modules on CPAN. XML::Simple isn't simple.
But it's docs also suggest not using it:
Why is XML::Simple "Discouraged"?
So - how about instead, XML::Twig:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
#$twig now contains our XML data structure.
my $twig = XML::Twig->new->parsefile('simodule.xml');
#fetch a value with an xpath expression - ./Top
#then extract the attribute 'name' from this node.
print "Top ", $twig->get_xpath( './Top', 0 )->att('name'), "\n";
#iterate all 'input' elements beneath "Top":
#note - single argument to "get_xpath" means all of them in a list.
foreach my $input ( $twig->get_xpath('./Top/input') ) {
#retrieve from each their name attribute (and print)
print "input ", $input->att('name'), "\n";
}
#locate the 'output' and 'bidirection' nodes within the tree, and fetch
#their name attribute.
print "output ", $twig -> get_xpath( './Top/output',0) -> att('name'),"\n";
print "bidirection ", $twig -> get_xpath( './Top/bidirection',0) -> att('name'),"\n";
We use XML::Twig which makes use of get_xpath to specify an XML path. We also use att to retrieve a named attribute. You could use iterators such as first_child and children if you prefer though:
#Top element is below the root - we create a reference to it $top
my $top = $twig->root->first_child('Top');
#From this reference, fetch the name attribute.
print "Top ", $top->att('name'), "\n";
#get children of Top matching 'input' and iterate
foreach my $input ( $top -> children('input') ) {
#print attribute called 'name'.
print "input ", $input->att('name'), "\n";
}
#Find a child below Top called 'output' and retrieve 'name' attribute.
print "output ", $top -> first_child('output') -> att('name'),"\n";
#as above.
print "bidirection ", $top -> first_child('bidirection') -> att('name'),"\n";
These are doing the same thing - personally I like xpath as a way of navigating XML but that's a matter of taste. (It lets you do all sorts of things like specify a path with embedded attributes, that kind of thing - moot point in this example though).
Given your input XML, both produce:
Top ri_32
input address
input clock
input data_in
output data_out
bidirection control
it skips the nested hashes for ALU and Power_Control because your original code appears to.
You still haven't been at all clear about exactly what output you want. But I use XML::LibXML for most of my XML processing requirements and I'd write something like this:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use XML::LibXML;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file('simodule.xml');
foreach my $type (qw[input output bidirection]) {
foreach ($doc->findnodes("/Root/Top/$type")) {
say $_->nodeName, ' ', $_->getAttribute('name');
}
}
The output is correct.
As we don't know what exactly you need,
I modify your code to following so maybe you can figure out why you see the HASH and how to de-reference it by yourself, it's pretty simple:
use strict;
use XML::Simple;
use Data::Dumper;
local $/;
my $xml_root = XMLin(<DATA>);
print Dumper $xml_root;
my $root_top=$xml_root->{Top};
my $mod=$root_top->{Module};
print "Top $root_top->{name}\n";
my $top_in=$root_top->{input};
foreach my $namein (keys %$top_in)
{
print " input" , Dumper $top_in->{$namein};
}
my $top_ou=$root_top->{output};
foreach my $nameou (keys %$top_ou)
{
print " output $top_ou->{$nameou}\n";
}
my $top_bi=$root_top->{bidirection};
foreach my $namebi (keys %$top_bi)
{
print " bidirection $top_bi->{$namebi}\n";
}
__DATA__
<Root>
<Top name="ri_32">
<Module name="ALU">
<input name="power_control_bus"/>
<bidirection name="address_bus"/>
</Module>
<Module name="Power_control">
<input name="cpu_control_bus"/>
<output name="power_control_bus"/>
<bidirection name="address_bus"/>
</Module>
<input name="address">X</input>
<input name="clock"/>
<input name="data_in"/>
<output name="data_out"/>
<bidirection name="control"/>
</Top>
</Root>

Doing XPath using Perl

I am coding with Perl on a Window 7 machine. I am able to extract data from the XML using the XPath code below
use strict;
use warning;
use XML::LibXML;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($newfile);
my $query = "/tradenet/message/header/unique_ref_no/date/text( )";
my($node) = $doc->findnodes($query);
$node->setData("$file_seq_number");
However, when i use the same code on a different XML, the xpath from the second document looks as below:
/TradenetResponse/OutboundMessage/out:OutwardPermit/out:Declaration/out:Header/cac:UniqueReferenceNumber/cbc:SequenceNumeric
Together with the Perl code, this is what the extraction code looks like:
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($newfile);
my $query = "/TradenetResponse/OutboundMessage/out:OutwardPermit/out:Declaration/out:Header/cac:UniqueReferenceNumber/cbc:SequenceNumeric/text( )";
my($node) = $doc->findnodes($query);
$node->setData("$file_seq_number");
Using the second code, I am unable to retrieve the data from the second XML. I receive this error "Can't call method "setData"on an undefined value at Perl.pl line 5".
Does the ":" character in the second XPATH address affecting the code?
You have to define what out, cac, and cbc mean in order for the XPath query to find the appropriate nodes:
my $doc = $parser->parse_file($newfile);
my $xpath_context = XML::LibXML::XPathContext->new($doc->documentElement());
# These URIs need to be the same as the ones in the source document
$xpath_context->registerNs('out', 'http://example.com/out.xsd');
$xpath_context->registerNs('cac', 'http://example.com/cac.xsd');
$xpath_context->registerNs('cbc', 'http://example.com/cbc.xsd');
my $query = "/TradenetResponse/OutboundMessage/out:OutwardPermit/out:Declaration/out:Header/cac:UniqueReferenceNumber/cbc:SequenceNumeric/text( )";
my ($node) = $xpath_context->findnodes($query);
As promised, here is a working example. First, the test input file:
<?xml version="1.0"?>
<!-- input.xml -->
<TradenetResponse xmlns:a="http://example.com/out.xsd"
xmlns:b="http://example.com/cac.xsd"
xmlns:c="http://example.com/cbc.xsd">
<OutboundMessage>
<a:OutwardPermit>
<a:Declaration>
<a:Header>
<b:UniqueReferenceNumber>
<c:SequenceNumeric>1234</c:SequenceNumeric>
</b:UniqueReferenceNumber>
</a:Header>
</a:Declaration>
</a:OutwardPermit>
</OutboundMessage>
</TradenetResponse>
And here is the working Perl script:
#!/usr/bin/perl
# parse.pl
use strict;
use warnings;
use XML::LibXML;
my $parser = XML::LibXML->new();
my $newfile = "input.xml";
my $doc = $parser->parse_file($newfile);
my $xpath_context = XML::LibXML::XPathContext->new($doc->documentElement());
# These URIs need to be the same as the ones in the source document
$xpath_context->registerNs('out', 'http://example.com/out.xsd');
$xpath_context->registerNs('cac', 'http://example.com/cac.xsd');
$xpath_context->registerNs('cbc', 'http://example.com/cbc.xsd');
# Query wrapped for clarity
my $query = "/TradenetResponse/OutboundMessage/out:OutwardPermit" .
"/out:Declaration/out:Header/cac:UniqueReferenceNumber" .
"/cbc:SequenceNumeric/text()";
my ($node) = $xpath_context->findnodes($query);
print "Value: " . $node->getData() . "\n";
The output for me is:
sean#localhost:~xmltest$ ./parse.pl
Value: 1234

How can I access attributes and elements from XML::LibXML in Perl?

I am having trouble understanding / using name spaces with XML::LibXML package in Perl. I can access an element successfully but not an attribute. I have the following code which accesses an XML file (http://pastebin.com/f3fb9d1d0).
my $tree = $parser->parse_file($file); # parses the file contents into the new libXML object.
my $xpc = XML::LibXML::XPathContext->new($tree);
$xpc->registerNs(microplateML => 'http://moleculardevices.com/microplateML');
I then try and access an element called common-name and an attribute called name.
foreach my $camelid ($xpc->findnodes('//microplateML:species')) {
my $latin_name = $camelid->findvalue('#name');
my $common_name = $camelid->findvalue('common-name');
print "$latin_name, $common_name" ;
}
But only the latin-name (#name) is printing out, the common-name is not. What am I doing wrong and how can I get the common-name to print out as well?
What does the #name do in this case? I presume it is an array, and that attributes should be put into an array as there can be more than one, but elements (like common-name) should not be because there should just be one?
I've been following the examples here: http://www.xml.com/pub/a/2001/11/14/xml-libxml.html
and here: http://perl-xml.sourceforge.net/faq/#namespaces_xpath, and trying to get their example camel script working with my namespace, hence the weird namespace.
Make sure you XML file is valid then use $node->getAttribute("someAttribute") to access attributes.
#name is a attribute name. You'd use it in findnodes() to specify elements with a given attribute set. Eg. a path like:
//camelids/species[#name="Camelus bactrianus"]/
Here is a simple/contrived example:
#!/usr/bin/perl -w
use XML::LibXML;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file('/Users/castle/Desktop/animal.xml');
my $xc = XML::LibXML::XPathContext->new( $doc->documentElement() );
$xc->registerNs('ns', 'http://moleculardevices.com/microplateML');
my #n = $xc->findnodes('//ns:species');
foreach $nod (#n) {
print "A: ".$nod->getAttribute("name")."\n";
my #c = $xc->findnodes("./ns:common-name", $nod);
foreach $cod (#c) {
print "B: ".$cod->nodeName;
print " = ";
print $cod->getFirstChild()->getData()."\n";
}
}
Output is:
perl ./xmltest.pl
A: Camelus bactrianus
B: common-name = Bactrian Camel