Escape special characters in text - Perl

I am reading an XML file and adding some additional text, but I can't get the exact text because some special characters are automatically converted.
I tried this:
<book>
<book-meta>
<book-id pub-id-type="doi">1545</book-id>
<book-title>Regenerating <?tex?> the Curriculum</book-title>
</book-meta>
</book>
Script:
use strict;
use warnings;
use XML::Twig;
open(my $out, '>', 'Output.xml') or die "can't create output file: $!\n";
my $story_file = XML::Twig->new(
    twig_handlers => {
        'book-id' => sub { $_->set_text('<?sample?>') },
    },
    # keep_atts_order is a parser option, not a handler
    keep_atts_order => 1,
    pretty_print => 'indented',
);
$story_file->parsefile('sample.xml');
$story_file->print($out);
Output:
<book>
<book-meta>
<book-id pub-id-type="doi">&lt;?sample?></book-id>
<book-title>Regenerating <?tex?> the Curriculum</book-title>
</book-meta>
</book>
I would like output as:
<book>
<book-meta>
<book-id pub-id-type="doi"><?sample?></book-id>
<book-title>Regenerating <?tex?> the Curriculum</book-title>
</book-meta>
</book>
How can I escape this type of character in XML::Twig? I tried the set_asis option, but I can't get it to work.

XML::Twig is correctly inserting the string <?sample?> for you: you are asking for a PCDATA node to be added, and < must be replaced with &lt; in such a node. What you want, however, is a processing instruction node.
The easiest way to insert such a node using XML::Twig is using the set_inner_xml method, which will parse an XML tree fragment from a string and insert it as the contents of the current node.
If you replace
$_->set_text('<?sample?>')
with
$_->set_inner_xml('<?sample?>')
then your code should do what you want. The output I get is
<book>
<book-meta>
<book-id pub-id-type="doi"><?sample?></book-id>
<book-title>Regenerating <?tex?> the Curriculum</book-title>
</book-meta>
</book>

<? ..... ?> is not (part of) text but a processing instruction. When you add it to your XML with set_text, however, it is processed as text, hence the &lt;.
I'm not familiar with XML::Twig myself, but I think you should check whether it lets you add a processing instruction instead of text.
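For comparison, here is a minimal sketch of adding a real processing instruction node rather than text. It swaps in XML::LibXML instead of XML::Twig, and the helper name replace_with_pi and its parameters are made up for illustration; it assumes the element name can be used directly in a simple XPath expression.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: replace the content of every element called
# $element_name with a processing instruction whose target is $target.
sub replace_with_pi {
    my ($xml, $element_name, $target, $data) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    for my $node ($doc->findnodes("//$element_name")) {
        # Drop the old text, then append a PI node (not a text node)
        $node->removeChildNodes;
        my $pi = $doc->createProcessingInstruction($target, defined $data ? $data : '');
        $node->appendChild($pi);
    }
    return $doc->documentElement->toString;
}

print replace_with_pi(
    '<book><book-id pub-id-type="doi">1545</book-id></book>',
    'book-id', 'sample'
), "\n";
```

Because the node is created as a processing instruction, the serializer writes its angle brackets literally instead of escaping them.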

Related

How to parse <rss> tag with XML::LibXML to find xmlns definitions

It seems that there is no consistent way that podcasts define their rss feeds.
Ran into one that is using different schema defs for the RSS.
What's the best way to scan for XML namespaces in an RSS URL, using XML::LibXML?
E.g.
One feed might be
<rss
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">
Another might be
<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0"
xmlns:atom="http://www.w3.org/2005/Atom">
I want to include in my script an assessment of all the namespaces being used so that when parsing the rss, the appropriate field names can be tracked.
Not sure what that will look like yet, as I'm not sure this module has the capability to do the <rss> tag attribute atomization that I want.
I'm not sure I understand exactly what kind of output you're looking for, but XML::LibXML is indeed able to list the namespaces:
use warnings;
use strict;
use XML::LibXML;
my $dom = XML::LibXML->load_xml(string => <<'EOT');
<rss
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">
</rss>
EOT
for my $ns ($dom->documentElement->getNamespaces) {
print $ns->getLocalName(), " / ", $ns->getData(), "\n";
}
Output:
content / http://purl.org/rss/1.0/modules/content/
wfw / http://wellformedweb.org/CommentAPI/
dc / http://purl.org/dc/elements/1.1/
atom / http://www.w3.org/2005/Atom
sy / http://purl.org/rss/1.0/modules/syndication/
slash / http://purl.org/rss/1.0/modules/slash/
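Building on the loop above, one possible way to make the "assessment" usable later in a script is to collect the declarations into a hash keyed by namespace URI. This is only a sketch: the helper name namespace_map is made up, and it assumes that getNamespaces reports only the declarations present on the <rss> element itself.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: map each namespace URI declared on the root
# element to the prefix the feed uses for it.
sub namespace_map {
    my ($xml) = @_;
    my $dom = XML::LibXML->load_xml(string => $xml);
    my %prefix_for;
    for my $ns ($dom->documentElement->getNamespaces) {
        my $prefix = $ns->getLocalName;
        # Assumption: a default namespace has no prefix, so store ''
        $prefix = '' unless defined $prefix;
        $prefix_for{ $ns->getData } = $prefix;
    }
    return \%prefix_for;
}

my $map = namespace_map(<<'EOT');
<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0"
     xmlns:atom="http://www.w3.org/2005/Atom">
</rss>
EOT

# Check whether the feed uses a namespace we care about
print "feed uses iTunes tags\n"
    if exists $map->{'http://www.itunes.com/dtds/podcast-1.0.dtd'};
```

Keying on the URI rather than the prefix means the check still works when a feed picks a different prefix for the same namespace.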
I know that OP has already accepted an answer. But for completeness' sake it should be mentioned that the recommended way to make searches on the DOM resilient is to use XML::LibXML::XPathContext:
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
my @examples = (
<<EOT
<rss xmlns:atom="http://www.w3.org/2005/Atom">
<atom:test>One Ring to rule them all,</atom:test>
</rss>
EOT
,
<<EOT
<rss xmlns:a="http://www.w3.org/2005/Atom">
<a:test>One Ring to find them,</a:test>
</rss>
EOT
,
<<EOT
<rss xmlns="http://www.w3.org/2005/Atom">
<test>The end...</test>
</rss>
EOT
,
);
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs('atom', 'http://www.w3.org/2005/Atom');
for my $example (@examples) {
my $dom = XML::LibXML->load_xml(string => $example)
or die "XML: $!\n";
for my $node ($xpc->findnodes("//atom:test", $dom)) {
printf("%-10s: %s\n", $node->nodeName, $node->textContent);
}
}
exit 0;
i.e. you assign a local namespace prefix for those namespaces you are interested in.
Output:
$ perl dummy.pl
atom:test : One Ring to rule them all,
a:test : One Ring to find them,
test : The end...

unable to parse xml file using registered namespace

I am using XML::LibXML to parse an XML file. There seems to be some problem with using a registered namespace while accessing the node elements. I am planning to convert this XML data into a CSV file, and I am trying to access each and every element. To start with, I tried extracting the attribute values of the <country> and <state> tags. Below is the code I have come up with, but I am getting an error saying XPath error : Undefined namespace prefix.
use strict;
use warnings;
use Data::Dumper;
use XML::LibXML;
my $XML=<<EOF;
<DataSet xmlns="http://www.w3schools.com" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3schools.com note.xsd">
<exec>
<survey_region ver="1.1" type="x789" date="20160312"/>
<survey_loc ver="1.1" type="x789" date="20160312"/>
<note>Population survey</note>
</exec>
<country name="ABC" type="MALE">
<state name="ABC_state1" result="PASS">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
Some random text
contained here
]]></comment>
</state>
</country>
<country name="XYZ" type="MALE">
<state name="XYZ_state2" result="FAIL">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
any random text data
]]></comment>
</state>
</country>
</DataSet>
EOF
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($XML);
my $xc = XML::LibXML::XPathContext->new($doc);
$xc->registerNs('x','http://www.w3schools.com');
foreach my $camelid ($xc->findnodes('//x:DataSet')) {
    my $country_name = $camelid->findvalue('./x:country/@name');
    my $country_type = $camelid->findvalue('./x:country/@type');
    my $state_name = $camelid->findvalue('./x:state/@name');
    my $state_result = $camelid->findvalue('./x:state/@result');
print "state_name ($state_name)\n";
print "state_result ($state_result)\n";
print "country_name ($country_name)\n";
print "country_type ($country_type)\n";
}
Update
If I remove the namespace from the XML and change my XPath slightly, it seems to work. Can someone help me understand the difference?
foreach my $camelid ($xc->findnodes('//DataSet')) {
    my $country_name = $camelid->findvalue('./country/@name');
    my $country_type = $camelid->findvalue('./country/@type');
    my $state_name = $camelid->findvalue('./country/state/@name');
    my $state_result = $camelid->findvalue('./country/state/@result');
print "state_name ($state_name)\n";
print "state_result ($state_result)\n";
print "country_name ($country_name)\n";
print "country_type ($country_type)\n";
}
This would be my approach
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
my $XML=<<EOF;
<DataSet xmlns="http://www.w3schools.com" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3schools.com note.xsd">
<exec>
<survey_region ver="1.1" type="x789" date="20160312"/>
<survey_loc ver="1.1" type="x789" date="20160312"/>
<note>Population survey</note>
</exec>
<country name="ABC" type="MALE">
<state name="ABC_state1" result="PASS">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
Some random text
contained here
]]></comment>
</state>
</country>
<country name="XYZ" type="MALE">
<state name="XYZ_state2" result="FAIL">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
any random text data
]]></comment>
</state>
</country>
</DataSet>
EOF
my $parser = XML::LibXML->new();
my $tree = $parser->parse_string($XML);
my $root = $tree->getDocumentElement;
my @country = $root->getElementsByTagName('country');
foreach my $citem (@country) {
    my $country_name = $citem->getAttribute('name');
    my $country_type = $citem->getAttribute('type');
    print "Country Name -- $country_name\nCountry Type -- $country_type\n";
    my @state = $citem->getElementsByTagName('state');
    foreach my $sitem (@state) {
        my @info = $sitem->getElementsByTagName('info');
        my $state_name = $sitem->getAttribute('name');
        my $state_result = $sitem->getAttribute('result');
        print "State Name -- $state_name\nState Result -- $state_result\n";
        foreach my $i (@info) {
            my $text = $i->getElementsByTagName('type');
            print "Info --- $text\n";
        }
    }
    print "\n";
}
Of course you can manipulate the data any way you'd like. If you are parsing from a file, change parse_string to parse_file.
For the individual elements in the XML, use getElementsByTagName to get the elements within the tags. This should be enough to get you going.
There seem to be two small mistakes here.
1. Call findvalue on the XPathContext object, with the context node as a parameter.
2. name is an attribute of country, not a node.
Therefore try:
my $country_name = $xc->findvalue('./x:country/@name', $camelid );
Update, in response to the updated question ("if I remove the namespace from the XML and change my XPath slightly it seems to work. Can someone help me understand the difference?"):
To understand what happens here, have a look at NOTE ON NAMESPACES AND XPATH in the XML::LibXML documentation.
In your case, $camelid->findvalue('./x:state/@name'); calls findvalue on a node.
But: the recommended way is to use the XML::LibXML::XPathContext module to define an explicit context for XPath evaluation, in which a document-independent prefix-to-namespace mapping can be defined, which is what I did above.
Conclusion:
Calling find on a node will only work if the root element has no namespace
(or if you use the same prefix as in the XML document, if there is one).
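Putting the two points together, a sketch of the whole loop could look like the following. The helper name country_state_rows and the comma-joined row format are assumptions for illustration (the question only says the data should end up in a CSV file); every XPath query goes through the XPathContext object so that the registered x prefix is known.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: one "country,type,state,result" row per <state>.
sub country_state_rows {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my $xc  = XML::LibXML::XPathContext->new($doc);
    $xc->registerNs('x', 'http://www.w3schools.com');

    my @rows;
    for my $country ($xc->findnodes('/x:DataSet/x:country')) {
        # Query relative to $country, but still through $xc
        for my $state ($xc->findnodes('./x:state', $country)) {
            push @rows, join ',',
                $country->getAttribute('name'),
                $country->getAttribute('type'),
                $state->getAttribute('name'),
                $state->getAttribute('result');
        }
    }
    return @rows;
}

print "$_\n" for country_state_rows(<<'EOF');
<DataSet xmlns="http://www.w3schools.com">
  <country name="ABC" type="MALE">
    <state name="ABC_state1" result="PASS"/>
  </country>
  <country name="XYZ" type="MALE">
    <state name="XYZ_state2" result="FAIL"/>
  </country>
</DataSet>
EOF
# prints:
# ABC,MALE,ABC_state1,PASS
# XYZ,MALE,XYZ_state2,FAIL
```

The attributes themselves carry no namespace, so getAttribute works without a prefix; only the element steps in the XPath need x:.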

how to get block of xml code through perl

Since this morning I have been scratching my head trying to resolve the requirement below. I know how to parse XML, but I can't find a solution for getting the exact block along with its tags.
sample code:
<employee name="sample1">
<interest name="cricket">
<function action= "bowling">
<rating> average </rating>
</function>
</interest>
<interest name="football">
<function action="defender">
<rating> good </rating>
</function>
</interest>
</employee>
I just want to extract the below content from above xml file and write it into another text file.
<interest name="cricket">
<function action= "bowling">
<rating> average </rating>
</function>
</interest>
Thanks for your help
Using XML::Twig:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
XML::Twig->new( twig_handlers => { 'interest[@name="cricket"]' => sub { $_->print } } )
         ->parsefile('interest.xml');
A little explanation: the twig_handler is called when an element satisfying the trigger condition, in this case interest[@name="cricket"], has been completely parsed. At this point the associated sub is called. In the sub, $_ is set to the current element, which is then printed. For more complex subs, two arguments are passed: the twig itself (the document) and the current element.
Voilà.
Also with XML::Twig comes a tool called xml_grep, which makes it easy to extract what you want:
xml_grep --nowrap 'interest[@name="cricket"]' interest.xml
The --nowrap option prevents the default behaviour, which wraps the results in a containing element.
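As a sketch of the two-argument handler form mentioned above, the variant below collects the matching block as a string instead of printing it, so that it can be written to another file as the question asks. The helper name extract_block is made up for illustration, and it parses a string rather than interest.xml to stay self-contained.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: return the serialized elements that match a
# twig trigger condition, tags included.
sub extract_block {
    my ($xml, $condition) = @_;
    my $found = '';
    XML::Twig->new(
        twig_handlers => {
            # Two-argument form: the twig and the current element
            $condition => sub {
                my ($twig, $elt) = @_;
                $found .= $elt->sprint;
            },
        },
    )->parse($xml);
    return $found;
}

my $block = extract_block(<<'EOT', 'interest[@name="cricket"]');
<employee name="sample1">
<interest name="cricket">
<function action="bowling">
<rating> average </rating>
</function>
</interest>
<interest name="football">
<function action="defender">
<rating> good </rating>
</function>
</interest>
</employee>
EOT

# Writing the result to another file would then be an ordinary print
# to a filehandle opened for writing.
print "$block\n";
```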

Process quoted string within XML

Perl version: perl, v5.10.1 (*) built for x86_64-linux-thread-multi
I am a relative newbie to Perl. I have tried looking at the various XML processing utilities for Perl: XML::Simple, XML::Parser, XML::LibXML, XML::DOM, XML::Twig, XML::XPath etc.
I am trying to process some XML that has quotes in the value portion. I am specifically looking to extract the title from the below XML, however, I've been stumbling over this for a bit now and would appreciate some help if possible.
$VAR1 = {
'issue' => {
'priority' => {
'fid' => '11',
'content' => '3 - Best Effort'
},
'transNum' => {
'fid' => '2',
'content' => '170'
},
'dueDate' => {
'fid' => '17',
'content' => '1327944695'
},
'status' => {
'fid' => '18',
'content' => 'Open - Unassigned'
},
'createdBy' => {
'fid' => '15',
'content' => '32'
},
'title' => {
'fid' => '20',
'content' => 'Testing on spider - issue with "quotation marks"'
},
'description' => {
'fid' => '22',
'content' => 'Noticed issue with title having quotes in title'
},
'issueNum' => {
'fid' => '1',
'content' => '33'
}
}
};
Using XML::LibXML and the following code (note: the above is a print of the contents of the $issueXML variable):
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($issueXML);
print $doc->toString;
This prints out:
<?xml version="1.0" encoding="utf-8"?>
<issues>
<issue>
<issueNum fid="1">33</issueNum>
<transNum fid="2">170</transNum>
<createdBy fid="15">32</createdBy>
<status fid="18">Open - Unassigned</status>
<title fid="20">Testing on spider - issue with "quotation marks"</title>
<priority fid="11">3 - Best Effort</priority>
<description fid="22">Noticed issue with submission of Documentation issue #40 on accurev with quotes in title. </description>
<dueDate fid="17">1327944695</dueDate>
</issue>
</issues>
I am looking to specifically extract value for the title tag.
When I was processing using XML::Parser, I kept ending up with just the final quote mark. I would like to maintain the same format of the string to display:
Testing on spider - issue with "quotation marks"
I am a bit overwhelmed at the moment with the various XML processing functions. I have tried for awhile now to figure this out, and I am seriously spinning my wheels.
TIA, Appreciate any help,
Regards,
Scott
Another go with XML::LibXML. You should have no problems with quotation marks inside text nodes.
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use Data::Dumper;
my $xml = XML::LibXML->load_xml(string => q{<?xml version="1.0" encoding="utf-8"?>
<issues>
<issue>
<issueNum fid="1">33</issueNum>
<transNum fid="2">170</transNum>
<createdBy fid="15">32</createdBy>
<status fid="18">Open - Unassigned</status>
<title fid="20">Testing on spider - issue with "quotation marks"</title>
<priority fid="11">3 - Best Effort</priority>
<description fid="22">Noticed issue with submission of Documentation issue #40 on accurev with quotes in title. </description>
<dueDate fid="17">1327944695</dueDate>
</issue>
</issues>
});
my $title = $xml->find('/issues/issue/title');
print $title->get_node(0)->textContent;
I am not sure what problem you run into with the quotation marks. They're just a character like any other, except in attribute values where you may have to use an entity if the quote is already used as the value delimiter. Are you sure the "problem" is not just with the way Data::Dumper displays the data structure generated by XML::Simple?
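To illustrate the point about quotes, here is a small sketch using XML::LibXML. The helper name text_and_attribute and the note attribute are made up for the example; the quotes in the text node are written literally, while the ones inside the double-quoted attribute value are written as &quot;.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return the root element's text and one attribute.
sub text_and_attribute {
    my ($xml, $attr) = @_;
    my $doc  = XML::LibXML->load_xml(string => $xml);
    my $root = $doc->documentElement;
    return ($root->textContent, $root->getAttribute($attr));
}

my ($text, $note) = text_and_attribute(
    q{<title note="a &quot;quoted&quot; word">issue with "quotation marks"</title>},
    'note',
);
print "$text\n";   # issue with "quotation marks"
print "$note\n";   # a "quoted" word
```

In both cases the parser hands back plain double quotes, so nothing special is needed when reading the values.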
In any case stay away from XML::Parser, which is too low-level, use XML::LibXML or XML::Twig. XML::Simple seems to generate a lot of questions, especially from people not familiar with Perl, so I am not sure it's the right tool to use.
Here is a solution with XML::Twig, but there are many other ways to do this, depending on exactly what you want to do with the titles.
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $issueXML=q{<?xml version="1.0" encoding="utf-8"?>
<issues>
<issue>
<issueNum fid="1">33</issueNum>
<transNum fid="2">170</transNum>
<createdBy fid="15">32</createdBy>
<status fid="18">Open - Unassigned</status>
<title fid="20">Testing on spider - issue with "quotation marks"</title>
<priority fid="11">3 - Best Effort</priority>
<description fid="22">Noticed issue with submission of Documentation issue #40 on accurev with quotes in title. </description>
<dueDate fid="17">1327944695</dueDate>
</issue>
</issues>
};
my $t= XML::Twig->new( twig_handlers => { title => sub { print $_->text, "\n"; } })
->parse( $issueXML);
I usually use XML::XSH2 for XML manipulation. Your problem simplifies to:
open FILE.xml ;
for //title echo (.) ;
Your best way of pulling bits out of XML is with an XPath query.
In this case you are looking for the element 'title', inside an element 'issue', inside an element 'issues'.
So your XPath query is simply '//issues/issue/title'.
In two lines of code, you can use XML::LibXML::XPathContext to perform the XPath query for you, which will return the element's content which you are looking for.
This code snippet will demonstrate a simple way of doing an XPath query. The important bit of it is the two lines following the comment "Relevant bit here".
For more information, see the documentation for XML::LibXML::XPathContext
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
my $xml = XML::LibXML->load_xml(string => q{<?xml version="1.0" encoding="utf-8"?>
<issues>
<issue>
<issueNum fid="1">33</issueNum>
<transNum fid="2">170</transNum>
<createdBy fid="15">32</createdBy>
<status fid="18">Open - Unassigned</status>
<title fid="20">Testing on spider - issue with "quotation marks"</title>
<priority fid="11">3 - Best Effort</priority>
<description fid="22">Noticed issue with submission of Documentation issue #40 on accurev with quotes in title. </description>
<dueDate fid="17">1327944695</dueDate>
</issue>
</issues>
});
# Relevant bit here
my $xc = XML::LibXML::XPathContext->new($xml);
my $title = $xc->find('//issues/issue/title');
print "$title\n";
# prints:
# Testing on spider - issue with "quotation marks"

perl script to replace the xml values

I have this XML file:
<?xml version="1.0" encoding="ISO-8859-1"?>
<BroadsoftDocument protocol = "OCI" xmlns="C" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<sessionId xmlns="">169.254.52.85,16602326,1324821125562</sessionId>
<command xsi:type="UserAddRequest14sp9" xmlns="">
<serviceProviderId>AtyafBahrain</serviceProviderId>
<groupId>LoadTest</groupId>
<userId>user_0002@atyaf.me</userId>
<lastName>0002</lastName>
<firstName>user</firstName>
<callingLineIdLastName>0002</callingLineIdLastName>
<callingLineIdFirstName>user</callingLineIdFirstName>
<password>123456</password>
<language>English</language>
<timeZone>Asia/Bahrain</timeZone>
<address/>
</command>
</BroadsoftDocument>
and I need to replace the values of some fields (userId, firstName, password) and save the output under the same file name.
Using the code below changes the syntax of the XML fields (the XML format gets disturbed):
XMLout( $xml, KeepRoot => 1, NoAttr => 1, OutputFile => $xml_file, );
Can you please advise how to edit the XML file without changing its syntax?
You can check out the XML::Simple parser for Perl; see its page on CPAN. I have used it for parsing XML files, but I think it should allow modification as well.
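# $filename_1, $str_1 and $str_2 are assumed to be set earlier in the script
# open XML file for reading and writing (input the XML file name)
open(my $fh, '+<', $filename_1) or die "can't open $filename_1: $!\n";
my @file = <$fh>;
seek $fh, 0, 0;
foreach my $line (@file)
{
    # Find string_1 and replace it by string_2 ($str_1 is used as a regex)
    $line =~ s/$str_1/$str_2/g;
    # write to file
    print {$fh} $line;
}
# drop leftover bytes if the new content is shorter than the old
truncate $fh, tell($fh);
close $fh;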
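As an alternative to plain text substitution, the values can be changed through a parser so that only the chosen elements are touched. The sketch below uses XML::LibXML rather than XML::Simple; the helper name set_element_text is made up, and the sample document is a cut-down version of the one in the question, without its namespace declarations.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: set the text of every element whose name is a
# key of %new_values, leaving the rest of the document as parsed.
sub set_element_text {
    my ($xml, %new_values) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    for my $name (sort keys %new_values) {
        for my $node ($doc->findnodes("//$name")) {
            $node->removeChildNodes;
            $node->appendText($new_values{$name});
        }
    }
    return $doc->toString;
}

print set_element_text(
    '<command><userId>user_0002@atyaf.me</userId><firstName>user</firstName>'
        . '<password>123456</password></command>',
    userId   => 'user_0003@atyaf.me',
    password => 'secret',
);
```

When working on a file, loading with load_xml(location => ...) and saving with the document's toFile method would be the natural counterparts. Elements and attributes are kept as they are, although details such as whitespace inside tags may still be normalized on output.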