LibXML - Inserting a Comment - perl

I'm using XML::LibXML, I'd like to add a comment such that the the comment is outside the tag. Is it even possible to put it outside the tag? I've tried appendChild, insertBefore | After, no difference ...
<JJ>junk</JJ> <!--My comment Here!-->
# Code excerpt from within a foreach loop:
my $elem = $dom->createElement("JJ");
my $txt_node = $dom->createTextNode("junk");
my $cmt = $dom->createComment("My comment Here!");
$elem->appendChild($txt_node);
$b->appendChild($elem);
$b->appendChild($frag);
$elem->appendChild($cmt);
# but it puts the comment between the tags ...
<JJ>junk<!--My comment Here!--></JJ>

Don't append the comment node to $elem but to the parent node. For example, the following script
use XML::LibXML;
my $doc = XML::LibXML::Document->new;
my $root = $doc->createElement("doc");
$doc->setDocumentElement($root);
$root->appendChild($doc->createElement("JJ"));
$root->appendChild($doc->createComment("comment"));
print $doc->toString(1);
prints
<?xml version="1.0"?>
<doc>
<JJ/>
<!--comment-->
</doc>

Related

How to add Child and Grandchild XML tags by XML::LibXML?

I'm trying to make a simple XML file and add Child and Grandchild XML tags by XML::LibXML.
I made country and location tags as below, but I want to add child and grandchild tags under country and location tags such as <dev>value</dev>. How do I add Child and Grandchild XML tags by XML::LibXML?
use strict;
use warnings;
use XML::LibXML;
my $doc = XML::LibXML::Document->new('1.0', 'utf-8');
#my $record = $doc->documentElement;
my $root = $doc->createElement('my-root-element');
$root->setAttribute('some-attr'=> 'some-value');
my $country = $doc->createElement('country');
$country-> appendText('Jamaica');
$root->appendChild($country);
my $dev = $doc->createElement('dev');
$dev-> appendText('value');
$country->appendChild($dev);
my $location = $doc->createElement('location');
$location-> appendText('21.241.21.2');
$root->appendChild($location);
$doc->setDocumentElement($root);
print $doc->toString(1);
The result I got is below:
<?xml version="1.0" encoding="utf-8"?>
<my-root-element some-attr="some-value">
<country>Jamaica<dev>value</dev></country>
<location>21.241.21.2</location>
</my-root-element>
Actually, I expected the output as below
<?xml version="1.0" encoding="utf-8"?>
<my-root-element some-attr="some-value">
<country>
<dev>
<Name>Jameica</Name>
<field>value</field>
<range>value1</range>
</dev>
</country>
<country>
<dev>
<Name>USA</Name>
<field>value</field>
<range>value1</range>
</dev>
</country>
</my-root-element>
Since <country> only has elements, not text, you should not use $country->appendText('Jamaica');. Similarly, you should not use appendText for <dev>.
You also need to create 3 child elements under <dev>: Name, field and range. For those, you need to call $dev->appendChild, etc.
This code sets you on a path to get the type of output you desire:
use strict;
use warnings;
use XML::LibXML;
my $doc = XML::LibXML::Document->new('1.0', 'utf-8');
my $root = $doc->createElement('my-root-element');
$root->setAttribute('some-attr'=> 'some-value');
my $country = $doc->createElement('country');
$root->appendChild($country);
my $dev = $doc->createElement('dev');
$country->appendChild($dev);
my $name = $doc->createElement('Name');
$name->appendText('Jamaica');
$dev->appendChild($name);
my $field = $doc->createElement('field');
$field->appendText('value');
$dev->appendChild($field);
my $range = $doc->createElement('range');
$range->appendText('value1');
$dev->appendChild($range);
$doc->setDocumentElement($root);
print $doc->toString(1);
Prints:
<?xml version="1.0" encoding="utf-8"?>
<my-root-element some-attr="some-value">
<country>
<dev>
<Name>Jamaica</Name>
<field>value</field>
<range>value1</range>
</dev>
</country>
</my-root-element>

unable to parse xml file using registered namespace

I am using XML::LibXML to parse a XML file. There seems to some problem in using registered namespace while accessing the node elements. I am planning to covert this xml data into CSV file. I am trying to access each and every element here. To start with I tried out extracting attribute values of <country> and <state> tags. Below is the code I have come with . But I am getting error saying XPath error : Undefined namespace prefix.
use strict;
use warnings;
use Data::Dumper;
use XML::LibXML;
my $XML=<<EOF;
<DataSet xmlns="http://www.w3schools.com" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3schools.com note.xsd">
<exec>
<survey_region ver="1.1" type="x789" date="20160312"/>
<survey_loc ver="1.1" type="x789" date="20160312"/>
<note>Population survey</note>
</exec>
<country name="ABC" type="MALE">
<state name="ABC_state1" result="PASS">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
Some random text
contained here
]]></comment>
</state>
</country>
<country name="XYZ" type="MALE">
<state name="XYZ_state2" result="FAIL">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
any random text data
]]></comment>
</state>
</country>
</DataSet>
EOF
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($XML);
my $xc = XML::LibXML::XPathContext->new($doc);
$xc->registerNs('x','http://www.w3schools.com');
foreach my $camelid ($xc->findnodes('//x:DataSet')) {
my $country_name = $camelid->findvalue('./x:country/#name');
my $country_type = $camelid->findvalue('./x:country/#type');
my $state_name = $camelid->findvalue('./x:state/#name');
my $state_result = $camelid->findvalue('./x:state/#result');
print "state_name ($state_name)\n";
print "state_result ($state_result)\n";
print "country_name ($country_name)\n";
print "country_type ($country_type)\n";
}
Update
if I remove the name space from XML and change my XPath slightly it seems to work. Can someone help me understand the difference.
foreach my $camelid ($xc->findnodes('//DataSet')) {
my $country_name = $camelid->findvalue('./country/#name');
my $country_type = $camelid->findvalue('./country/#type');
my $state_name = $camelid->findvalue('./country/state/#name');
my $state_result = $camelid->findvalue('./country/state/#result');
print "state_name ($state_name)\n";
print "state_result ($state_result)\n";
print "country_name ($country_name)\n";
print "country_type ($country_type)\n";
}
This would be my approach
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
my $XML=<<EOF;
<DataSet xmlns="http://www.w3schools.com" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3schools.com note.xsd">
<exec>
<survey_region ver="1.1" type="x789" date="20160312"/>
<survey_loc ver="1.1" type="x789" date="20160312"/>
<note>Population survey</note>
</exec>
<country name="ABC" type="MALE">
<state name="ABC_state1" result="PASS">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
Some random text
contained here
]]></comment>
</state>
</country>
<country name="XYZ" type="MALE">
<state name="XYZ_state2" result="FAIL">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
any random text data
]]></comment>
</state>
</country>
</DataSet>
EOF
my $parser = XML::LibXML->new();
my $tree = $parser->parse_string($XML);
my $root = $tree->getDocumentElement;
my #country = $root->getElementsByTagName('country');
foreach my $citem(#country){
my $country_name = $citem->getAttribute('name');
my $country_type = $citem->getAttribute('type');
print "Country Name -- $country_name\nCountry Type -- $country_type\n";
my #state = $citem->getElementsByTagName('state');
foreach my $sitem(#state){
my #info = $sitem->getElementsByTagName('info');
my $state_name = $sitem->getAttribute('name');
my $state_result = $sitem->getAttribute('result');
print "State Name -- $state_name\nState Result -- $state_result\n";
foreach my $i (#info){
my $text = $i->getElementsByTagName('type');
print "Info --- $text\n";
}
}
print "\n";
}
Of course you can manipulate the data anyway you'd like. If you are parsing from a file change parse_string to parse_file.
For the individual elements in the xml use the getElementsByTagName to get the elements within the tags. This should be enough to get you going
There seem to be two small mistakes here.
1. call findvalue for the XPathContext document with the context node as parameter.
2. name is a attribute in country no a node.
Therefor try :
my $country_name = $xc->findvalue('./x:country/#name', $camelid );
Update to the updated question if I remove the name space from XML and change my XPath slightly it seems to work. Can someone help me understand the difference.
To understand what happens here have a look to NOTE ON NAMESPACES AND XPATH
In your case $camelid->findvalue('./x:state/#name'); calls findvalue is called for an node.
But: The recommended way is to use the XML::LibXML::XPathContext module to define an explicit context for XPath evaluation, in which a document independent prefix-to-namespace mapping can be defined. Which I did above.
Conclusion:
Calling find on a node will only work: if the root element had no namespace
(or if you use the same prefix as in the xml doucment if ther is any)

PHP DOMDocument - match and remove URLs

I'm trying to extract links from html page using DOM:
$html = file_get_contents('links.html');
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$a = $DOM->getElementsByTagName('a');
foreach($a as $link){
//echo out the href attribute of the <A> tag.
echo $link->getAttribute('href').'<br/>';
}
Output:
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
I would like to remove all results matching dontwantthisdomain.com, dontwantthisdomain2.com and dontwantthisdomain3.com so the output will looks like that:
http://domain1.com/page-X-on-domain-com.html
http://domain.com/page-XZ-on-domain-com.html
http://domain3.com/page-XYZ-on-domain3-com.html
Any ideas? :)
I think you should use regular expression.Google it and have fun

Extracting links inside <div>'s with HTML::TokeParser & URI

I'm an old-newbie in Perl, and Im trying to create a subroutine in perl using HTML::TokeParser and URI.
I need to extract ALL valid links enclosed within on div called "zone-extract"
This is my code:
#More perl above here... use strict and other subs
use HTML::TokeParser;
use URI;
sub extract_links_from_response {
my $response = $_[0];
my $base = URI->new( $response->base )->canonical;
# "canonical" returns it in the one "official" tidy form
my $stream = HTML::TokeParser->new( $response->content_ref );
my $page_url = URI->new( $response->request->uri );
print "Extracting links from: $page_url\n";
my($tag, $link_url);
while ( my $div = $stream->get_tag('div') ) {
my $id = $div->get_attr('id');
next unless defined($id) and $id eq 'zone-extract';
while( $tag = $stream->get_tag('a') ) {
next unless defined($link_url = $tag->[1]{'href'});
next if $link_url =~ m/\s/; # If it's got whitespace, it's a bad URL.
next unless length $link_url; # sanity check!
$link_url = URI->new_abs($link_url, $base)->canonical;
next unless $link_url->scheme eq 'http'; # sanity
$link_url->fragment(undef); # chop off any "#foo" part
print $link_url unless $link_url->eq($page_url); # Don't note links to itself!
}
}
return;
}
As you can see, I have 2 loops, first using get_tag 'div' and then look for id = 'zone-extract'. The second loop looks inside this div and retrieve all links (or that was my intention)...
The inner loop works, it extracts all links correctly working standalone, but I think there is some issues inside the first loop, looking for my desired div 'zone-extract'... Im using this post as a reference: How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?
But all I have by the moment is this error:
Can't call method "get_attr" on unblessed reference
Some ideas? Help!
My HTML (Note URL_TO_EXTRACT_1 & 2):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="...">
<h2 class="genres"><img alt="extracting" class="png"></h2>
<li><a title="Extr 2" href="**URL_TO_EXTRACT_1**">2</a></li>
<li><a title="Con 1" class="sel" href="**URL_TO_EXTRACT_2**">1</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>
I find that TokeParser is a very crude tool requiring too much code, its fault is that only supports the procedural style of programming.
A better alternatives which require less code due to declarative programming is Web::Query:
use Web::Query 'wq';
my $results = wq($response)->find('div#zone-extract a')->map(sub {
my (undef, $elem_a) = #_;
my $link_url = $elem_a->attr('href');
return unless $link_url && $link_url !~ m/\s/ && …
# Further checks like in the question go here.
return [$link_url => $elem_a->text];
});
Code is untested because there is no example HTML in the question.

Perl OpenOffice::OODoc - accessing header/footer elements

How do you get elements in a header/footer of a odt doc?
for example I have:
use OpenOffice::OODoc;
my $doc = odfDocument(file => 'whatever.odt');
my $t=0;
while (my $table = $doc->getTable($t))
{
print "Table $t exists\n";
$t++;
}
When I check the tables they are all from the body. I can't seem to find elements for anything in the header or footer?
I found sample code here which led me to the answer:
#! /usr/local/bin/perl
use OpenOffice::OODoc;
my $file='asdf.odt';
# odfContainer is a representation of the zipped odf file
# and all of its parts.
my $container = odfContainer("$file");
# We're going to look at the 'style' part of the container,
# because that's where the header is located.
my $style = odfDocument
(
container => $container,
part => 'styles'
);
# masterPageHeader takes the style name as its argument.
# This is not at all clear from the documentation.
my $masterPageHeader = $style->masterPageHeader('Standard');
my $headerText = $style->getText( $masterPageHeader );
print "$headerText\n"
The master page style defines the look and feel of the document -- think CSS. Apparently 'Standard' is the default name for the master page style of a document created by OpenOffice... that was the toughest nut to crack... once I found the example code, that fell out in my lap.