how to get block of xml code through perl - perl

From morning i was scratching my head to resolve the below requirment. I know how to parse an xml but not able to find out the sollution to get the exact block along with tags.
sample code:
<employee name="sample1">
<interest name="cricket">
<function action= "bowling">
<rating> average </rating>
</function>
</interest>
<interest name="football">
<function action="defender">
<rating> good </rating>
</function>
</interest>
</employee>
I just want to extract the below content from above xml file and write it into another text file.
<interest name="cricket">
<function action= "bowling">
<rating> average </rating>
</function>
</interest>
Thanks for your help

Using XML::Twig:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
XML::Twig->new( twig_handlers => { 'interest[#name="cricket"]' => sub { $_->print } },
)
->parsefile( 'interest.xml');
A little explanation: the twig_handler is called when an element satisfying the trigger condition, in this case interest[#name="cricket"], is satisfied. At this point the associated sub is called. In the sub $_ is set to be the current element, which is then print'ed. For more complex subs, 2 arguments are passed, the twig itself (the document) and the current element.
VoilĂ .
Also with XML::Twig comes a tool called xml_grep, which makes it easy to extract what you want:
xml_grep --nowrap 'interest[#name="cricket"]' interest.xml
the --nowrap option prevents the default behaviour which wraps the results in a containing element.

Related

How to parse <rss> tag with XML::LibXML to find xmlns defintions

It seems that there is no consistent way that podcasts define their rss feeds.
Ran into one that is using different schema defs for the RSS.
What's the best way to scan for xmlnamespace in an RSS url, using XML::LibXML
E.g.
One feed might be
<rss
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">
Another might be
<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"version="2.0"
xmlns:atom="http://www.w3.org/2005/Atom">
I want to include in my script an assessment of all the namespaces being used so that when parsing the rss, the appropriate field names can be tracked.
Not sure what that will look like yet, as I'm not sure this module has the capability to do the <rss> tag attribute atomization that I want.
I'm not sure I understand exactly what kind of output you're looking for, but XML::LibXML is indeed able to list the namespaces:
use warnings;
use strict;
use XML::LibXML;
my $dom = XML::LibXML->load_xml(string => <<'EOT');
<rss
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">
</rss>
EOT
for my $ns ($dom->documentElement->getNamespaces) {
print $ns->getLocalName(), " / ", $ns->getData(), "\n";
}
Output:
content / http://purl.org/rss/1.0/modules/content/
wfw / http://wellformedweb.org/CommentAPI/
dc / http://purl.org/dc/elements/1.1/
atom / http://www.w3.org/2005/Atom
sy / http://purl.org/rss/1.0/modules/syndication/
slash / http://purl.org/rss/1.0/modules/slash/
I know that OP has already accepted an answer. But for completeness sake it should be mentioned that the recommended way to make searches on the DOM resilient is to use XML::LibXML::XPathContext:
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
my #examples = (
<<EOT
<rss xmlns:atom="http://www.w3.org/2005/Atom">
<atom:test>One Ring to rule them all,</atom:test>
</rss>
EOT
,
<<EOT
<rss xmlns:a="http://www.w3.org/2005/Atom">
<a:test>One Ring to find them,</a:test>
</rss>
EOT
,
<<EOT
<rss xmlns="http://www.w3.org/2005/Atom">
<test>The end...</test>
</rss>
EOT
,
);
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs('atom', 'http://www.w3.org/2005/Atom');
for my $example (#examples) {
my $dom = XML::LibXML->load_xml(string => $example)
or die "XML: $!\n";
for my $node ($xpc->findnodes("//atom:test", $dom)) {
printf("%-10s: %s\n", $node->nodeName, $node->textContent);
}
}
exit 0;
i.e. you assign a local namespace prefix for those namespaces you are interested in.
Output:
$ perl dummy.pl
atom:test : One Ring to rule them all,
a:test : One Ring to find them,
test : The end...

unable to parse xml file using registered namespace

I am using XML::LibXML to parse a XML file. There seems to some problem in using registered namespace while accessing the node elements. I am planning to covert this xml data into CSV file. I am trying to access each and every element here. To start with I tried out extracting attribute values of <country> and <state> tags. Below is the code I have come with . But I am getting error saying XPath error : Undefined namespace prefix.
use strict;
use warnings;
use Data::Dumper;
use XML::LibXML;
my $XML=<<EOF;
<DataSet xmlns="http://www.w3schools.com" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3schools.com note.xsd">
<exec>
<survey_region ver="1.1" type="x789" date="20160312"/>
<survey_loc ver="1.1" type="x789" date="20160312"/>
<note>Population survey</note>
</exec>
<country name="ABC" type="MALE">
<state name="ABC_state1" result="PASS">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
Some random text
contained here
]]></comment>
</state>
</country>
<country name="XYZ" type="MALE">
<state name="XYZ_state2" result="FAIL">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
any random text data
]]></comment>
</state>
</country>
</DataSet>
EOF
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($XML);
my $xc = XML::LibXML::XPathContext->new($doc);
$xc->registerNs('x','http://www.w3schools.com');
foreach my $camelid ($xc->findnodes('//x:DataSet')) {
my $country_name = $camelid->findvalue('./x:country/#name');
my $country_type = $camelid->findvalue('./x:country/#type');
my $state_name = $camelid->findvalue('./x:state/#name');
my $state_result = $camelid->findvalue('./x:state/#result');
print "state_name ($state_name)\n";
print "state_result ($state_result)\n";
print "country_name ($country_name)\n";
print "country_type ($country_type)\n";
}
Update
if I remove the name space from XML and change my XPath slightly it seems to work. Can someone help me understand the difference.
foreach my $camelid ($xc->findnodes('//DataSet')) {
my $country_name = $camelid->findvalue('./country/#name');
my $country_type = $camelid->findvalue('./country/#type');
my $state_name = $camelid->findvalue('./country/state/#name');
my $state_result = $camelid->findvalue('./country/state/#result');
print "state_name ($state_name)\n";
print "state_result ($state_result)\n";
print "country_name ($country_name)\n";
print "country_type ($country_type)\n";
}
This would be my approach
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
my $XML=<<EOF;
<DataSet xmlns="http://www.w3schools.com" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3schools.com note.xsd">
<exec>
<survey_region ver="1.1" type="x789" date="20160312"/>
<survey_loc ver="1.1" type="x789" date="20160312"/>
<note>Population survey</note>
</exec>
<country name="ABC" type="MALE">
<state name="ABC_state1" result="PASS">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
Some random text
contained here
]]></comment>
</state>
</country>
<country name="XYZ" type="MALE">
<state name="XYZ_state2" result="FAIL">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
any random text data
]]></comment>
</state>
</country>
</DataSet>
EOF
my $parser = XML::LibXML->new();
my $tree = $parser->parse_string($XML);
my $root = $tree->getDocumentElement;
my #country = $root->getElementsByTagName('country');
foreach my $citem(#country){
my $country_name = $citem->getAttribute('name');
my $country_type = $citem->getAttribute('type');
print "Country Name -- $country_name\nCountry Type -- $country_type\n";
my #state = $citem->getElementsByTagName('state');
foreach my $sitem(#state){
my #info = $sitem->getElementsByTagName('info');
my $state_name = $sitem->getAttribute('name');
my $state_result = $sitem->getAttribute('result');
print "State Name -- $state_name\nState Result -- $state_result\n";
foreach my $i (#info){
my $text = $i->getElementsByTagName('type');
print "Info --- $text\n";
}
}
print "\n";
}
Of course you can manipulate the data anyway you'd like. If you are parsing from a file change parse_string to parse_file.
For the individual elements in the xml use the getElementsByTagName to get the elements within the tags. This should be enough to get you going
There seem to be two small mistakes here.
1. call findvalue for the XPathContext document with the context node as parameter.
2. name is a attribute in country no a node.
Therefor try :
my $country_name = $xc->findvalue('./x:country/#name', $camelid );
Update to the updated question if I remove the name space from XML and change my XPath slightly it seems to work. Can someone help me understand the difference.
To understand what happens here have a look to NOTE ON NAMESPACES AND XPATH
In your case $camelid->findvalue('./x:state/#name'); calls findvalue is called for an node.
But: The recommended way is to use the XML::LibXML::XPathContext module to define an explicit context for XPath evaluation, in which a document independent prefix-to-namespace mapping can be defined. Which I did above.
Conclusion:
Calling find on a node will only work: if the root element had no namespace
(or if you use the same prefix as in the xml doucment if ther is any)

Perl XML::Twig - preserving quotes in and around attributes

I'm selectively fixing some elements and attributes. Unfortunately, our input files contain both single- and double-quoted attribute values. Also, some attribute values contain quotes (within a value).
Using XML::Twig, I cannot see out how to preserve whatever quotes exist around attribute values.
Here's sample code:
use strict;
use XML::Twig;
my $file=qq(<file>
<label1 attr='This "works"!' />
<label2 attr="This 'works'!" />
</file>
);
my $fixes=0; # count fixes
my $twig = XML::Twig->new( twig_handlers => {
'[#attr]' => sub {fix_att(#_,\$fixes);} },
# ...
keep_atts_order => 1,
keep_spaces => 1,
keep_encoding => 1, );
#$twig->set_quote('single');
$twig->parse($file);
print $twig->sprint();
sub fix_att {
my ($t,$elt,$fixes) =#_;
# ...
}
The above code returns invalid XML for label1:
<label1 attr="This "works"!" />
If I add:
$twig->set_quote('single');
Then we would see invalid XML for label2:
<label2 attr='This 'works'!' />
Is there an option to preserve existing quotes? Or is there a better approach for selectively fixing twigs?
Is there any specific reason for you to use keep_encoding? Without it the quote is properly encoded.
keep_encoding is used to preserve the original encoding of the file, but there are other ways to do this. It was used mostly in the pre-5.8 era, when encodings didn't work as smoothly as they do now.

How to build a hashref with arrays in perl?

I am having trouble building what i think is a hashref (href) in perl with XML::Simple.
I am new to this so not sure how to go about it and i cant find much to build this href with arrays. All the examples i have found are for normal href.
The code bellow outputs the right xml bit, but i am really struggling on how to add more to this href
Thanks
Dario
use XML::Simple;
$test = {
book => [
{
'name' => ['createdDate'],
'value' => [20141205]
},
{
'name' => ['deletionDate'],
'value' => [20111205]
},
]
};
$test ->{book=> [{'name'=> ['name'],'value'=>['Lord of the rings']}]};
print XMLout($test,RootName=>'library');
To add a new hash to the arrary-ref 'books', you need to cast the array-ref to an array and then push on to it. #{ $test->{book} } casts the array-ref into an array.
push #{ $test->{book} }, { name => ['name'], value => ['The Hobbit'] };
XML::Simple is a pain because you're never sure whether you need an array or a hash, and it is hard to distinguish between elements and attributes.
I suggest you make a move to XML::API. This program demonstrates some how it would be used to create the same XML data as your own program that uses XML::Simple.
It has an advantage because it builds a data structure in memory that properly represents the XML. Data can be added linearly, like this, or you can store bookmarks within the structure and go back and add information to nodes created previously.
This code adds the two book elements in different ways. The first is the standard way, where the element is opened, the name and value elements are added, and the book element is closed again. The second shows the _ast (abstract syntax tree) method that allows you to pass data in nested arrays similar to those in XML::Simple for conciseness. This structure requires you to prefix attribute names with a hyphen - to distinguish them from element names.
use strict;
use warnings;
use XML::API;
my $xml = XML::API->new;
$xml->library_open;
$xml->book_open;
$xml->name('createdDate');
$xml->value('20141205');
$xml->book_close;
$xml->_ast(book => [
name => 'deletionDate',
value => '20111205',
]);
$xml->library_close;
print $xml;
output
<?xml version="1.0" encoding="UTF-8" ?>
<library>
<book>
<name>createdDate</name>
<value>20141205</value>
</book>
<book>
<name>deletionDate</name>
<value>20111205</value>
</book>
</library>

Escape special character at text

I am reading a xml file, and I add some additional text, but I can't get exact text because some special characters automatically converted.
I try this:
<book>
<book-meta>
<book-id pub-id-type="doi">1545</book-id>
<book-title>Regenerating <?tex?> the Curriculum</book-title>
</book-meta>
</book>
Script:
use strict;
use XML::Twig;
open(my $out, '>', 'Output.xml') or die "can't Create stroy file $!\n";
my $story_file = XML::Twig->new(
twig_handlers => {
'book-id' => sub { $_->set_text('<?sample?>') },
keep_atts_order => 1,
},
pretty_print => 'indented',
);
$story_file->parsefile('sample.xml');
$story_file->print($out);
Output:
<book>
<book-meta>
<book-id pub-id-type="doi"><?sample?></book-id>
<book-title>Regenerating <?tex?> the Curriculum</book-title>
</book-meta>
</book>
I would like output as:
<book>
<book-meta>
<book-id pub-id-type="doi"><?sample?></book-id>
<book-title>Regenerating <?tex?> the Curriculum</book-title>
</book-meta>
</book>
How can I escape this type of character in XML twig. I tried the set_asis option, but I can't get it to work.
XML::Twig is correctly inserting the string <?sample?> for you as you are asking for a PCDATA node to be added and < must be replaced with < in such a node. However what you want is a processing instruction node.
The easiest way to insert such a node using XML::Twig is using the set_inner_xml method, which will parse an XML tree fragment from a string and insert it as the contents of the current node.
If you replace
$_->set_text('<?sample?>')
with
$_->set_inner_xml('<?sample?>')
then your code should do what you want. The output I get is
<book>
<book-meta>
<book-id pub-id-type="doi"><?sample?></book-id>
<book-title>Regenerating <?tex?> the Curriculum</book-title>
</book-meta>
</book>
<? ..... ?> is not (part of) text but a processing instruction. When you add it you your XML with set_text however it is processed as text, hence the <.
I'm not familiar with XML::Twig myself, but I think you should check for the possibility to add a processing instruction instead of text.