Perl with XML::LibXML DOM (Globally Find and Replace XML)

I am new to DOM and XML::LibXML.
This is my sample MathML (XML) file. The input file is in.xml and the output file should be out.xml. I would like to find every <mi>bcde</mi>, replace it with <mtext>pqsd</mtext> globally, and store the result in out.xml. How can I achieve this?
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mfrac>
<mi>a</mi>
<mrow>
<mi>bcde</mi>
</mrow>
</mfrac>
<msqrt>
<mi>s</mi>
<mi>e</mi>
<mi>f</mi>
</msqrt>
</math>
#!/usr/bin/perl
use strict;
use warnings 'all';
use XML::LibXML;
my $mediaIdFrom = "MEDIAID_TEST";
my $VodItemIdFrom = "VODITEM_ID_TEST";
my $mediaId="";
my $vodItemId="";
my $filename = 'sample1.xml';
my $out_filename = "sample2.xml";
my $dom = XML::LibXML -> load_xml(location => $filename);
foreach $mediaId ($dom->findnodes('/ScheduleProvider/Episode/Media/@id')) {
    $mediaId->setValue("xx " . $mediaIdFrom . " yy");
}
foreach $vodItemId ($dom->findnodes('/ScheduleProvider/VoidItem/@id')) {
    $vodItemId->setValue($VodItemIdFrom);
}
#### for storing the output separate XML file
$dom->toFile($out_filename);

Your XML has a namespace but your XPath queries don't; see the note under findnodes in the XML::LibXML::Node documentation. This code should work:
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;
my $dom = XML::LibXML->load_xml(string => <<'END_OF_XML');
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mfrac>
<mi>a</mi>
<mrow>
<mi>bcde</mi>
</mrow>
</mfrac>
<msqrt>
<mi>s</mi>
<mi>e</mi>
<mi>f</mi>
</msqrt>
</math>
END_OF_XML
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs('math', 'http://www.w3.org/1998/Math/MathML');
foreach my $node ($xpc->findnodes('/math:math/math:mfrac/math:mrow/math:mi', $dom)) {
my $newNode = XML::LibXML::Element->new('mtext');
$newNode->appendText('pqsd');
$node->replaceNode($newNode);
}
print $dom->toString();
Output:
$ perl dummy.pl
<?xml version="1.0"?>
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mfrac>
<mi>a</mi>
<mrow>
<mtext>pqsd</mtext>
</mrow>
</mfrac>
<msqrt>
<mi>s</mi>
<mi>e</mi>
<mi>f</mi>
</msqrt>
</math>
EDIT Maybe I have misunderstood your question and you want to replace all occurrences of <mi>bcde</mi>? Then the foreach would change to
foreach my $node ($xpc->findnodes('//math:mi[text()="bcde"]', $dom)) {
EDIT 2 to find multiple <mi>xyz</mi> and replace them you could use text=replacement command line parameters, i.e.
foreach my $argv (@ARGV) {
    next
        unless my($find, $replace) = ($argv =~ /^([^=]+)=(.*)$/);
    foreach my $node ($xpc->findnodes(qq{//math:mi[text()="${find}"]}, $dom)) {
        my $newNode = XML::LibXML::Element->new('mtext');
        $newNode->appendText($replace);
        $node->replaceNode($newNode);
    }
}
and your replacement example would be
$ perl dummy.pl bcde=pqsd
EDIT 3 replace all <mi>xxx</mi> where xxx has more than one character with mtext:
foreach my $node ($xpc->findnodes('//math:mi', $dom)) {
my $text = $node->textContent();
# strip surrounding white space from text
$text =~ s/^\s+//;
$text =~ s/\s+$//;
# if text has more than one character then replace "mi" with "mtext"
if (length($text) > 1) {
my $newNode = XML::LibXML::Element->new('mtext');
$newNode->appendText($text);
$node->replaceNode($newNode);
}
}
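The snippets above print the modified document, while the question asked for the result to be written to out.xml. The following is a minimal sketch of how the pieces might fit together, not a definitive implementation. The helper name replace_mi is hypothetical. It compares the text in Perl rather than inside the XPath string (which sidesteps quoting problems when the search text contains quotes), and it creates the new element with createElementNS so that mtext ends up in the MathML namespace, whereas XML::LibXML::Element->new('mtext') creates an element without a namespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $MATHML_NS = 'http://www.w3.org/1998/Math/MathML';

# Hypothetical helper: replace every <mi> whose text equals $find with
# <mtext>$replace</mtext>; returns the number of replaced nodes.
sub replace_mi {
    my ($dom, $find, $replace) = @_;
    my $xpc = XML::LibXML::XPathContext->new($dom);
    $xpc->registerNs('math', $MATHML_NS);

    my $count = 0;
    foreach my $node ($xpc->findnodes('//math:mi')) {
        next unless $node->textContent eq $find;
        # Create the replacement in the MathML namespace
        my $new_node = $dom->createElementNS($MATHML_NS, 'mtext');
        $new_node->appendText($replace);
        $node->replaceNode($new_node);
        $count++;
    }
    return $count;
}

# Demo on an inline document (assumed sample data)
my $dom = XML::LibXML->load_xml(string =>
    qq{<math xmlns="$MATHML_NS"><mrow><mi>bcde</mi><mi>a</mi></mrow></math>});
my $replaced = replace_mi($dom, 'bcde', 'pqsd');
print "replaced $replaced node(s)\n";
print $dom->toString();
```

To work on files as in the question, the document could be loaded with XML::LibXML->load_xml(location => 'in.xml') and saved with $dom->toFile('out.xml') after calling the helper.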

Related

perl script to iterate over xml nodes using XML::LibXML

I am trying to come up with a Perl script to iterate over some nodes and get values from an XML file.
My XML file looks like the one below and is saved as spec.xml
<?xml version="1.0" encoding="UTF-8"?>
<WO xmlns="http://www.example.com/yyyy" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" >
<WOSet>
<SR>
<FINISHTIME>2013-07-29T18:21:38-05:00</FINISHTIME>
<STARTTIME xsi:nil="true" />
<TYPE>SR</TYPE>
<DESCRIPTION>Create CUST</DESCRIPTION>
<EXTERNALSYSTEMID />
<REPORTEDBY>PCAUSR</REPORTEDBY>
<REPORTEDEMAIL />
<STATUS>RESOLVED</STATUS>
<SRID>1001</SRID>
<UID>1</UID>
<SPEC>
<AVALUE>IT</AVALUE>
<ATTRID>CUST_DEPT</ATTRID>
<NALUE xsi:nil="true" />
<TVALUE />
</SPEC>
<SPEC>
<AVALUE>001</AVALUE>
<ATTRID>DEPT_CODE</ATTRID>
<NVALUE xsi:nil="true" />
<TVALUE />
</SPEC>
</SR>
</WOSet>
</WO>
When I run the script below, I get neither output nor any error that would give me a clue about where to fix things.
I am not a Perl expert and would love the experts here to throw some light on this.
#!/usr/bin/perl
use XML::LibXML;
use strict;
use warnings;
my $file = 'spec.xml';
my $parser = XML::LibXML->new();
my $tree = $parser->parse_file($file);
my $root = $tree->getDocumentElement;
foreach my $atrid ( $tree->findnodes('WO/WOSet/SR/SPEC') ) {
    my $name = $atrid->findvalue('ATTRID');
    my $value = $atrid->findvalue('AVALUE');
    print $name;
    print " = ";
    print $value;
    print ";\n";
}
My expected output is
CUST_DEPT = IT
DEPT_CODE = 001
The XML doesn't contain any element named WO in the null namespace. You want to match the elements named WO in the http://www.example.com/yyyy namespace.
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML qw( );
use XML::LibXML::XPathContext qw( );
my $file = 'spec.xml';
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($file);
my $root = $doc->getDocumentElement;
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(y => 'http://www.example.com/yyyy');
for my $atrid ( $xpc->findnodes('y:WO/y:WOSet/y:SR/y:SPEC') ) {
my $name = $xpc->findvalue('y:ATTRID', $atrid);
my $value = $xpc->findvalue('y:AVALUE', $atrid);
print "$name = $value\n";
}
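If registering a prefix is not wanted, a common alternative is to match on local-name() so the XPath ignores namespaces altogether. This is only a sketch under that assumption: the helper name spec_pairs is hypothetical, the inline XML is a trimmed copy of the question's file, and matching by local name will also pick up same-named elements from any other namespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return [ATTRID, AVALUE] pairs for every SPEC element,
# whatever namespace the elements are in.
sub spec_pairs {
    my ($doc) = @_;
    my @pairs;
    for my $spec ($doc->findnodes('//*[local-name()="SPEC"]')) {
        my $name  = $spec->findvalue('*[local-name()="ATTRID"]');
        my $value = $spec->findvalue('*[local-name()="AVALUE"]');
        push @pairs, [$name, $value];
    }
    return @pairs;
}

my $doc = XML::LibXML->load_xml(string => <<'END_OF_XML');
<WO xmlns="http://www.example.com/yyyy">
  <WOSet><SR>
    <SPEC><AVALUE>IT</AVALUE><ATTRID>CUST_DEPT</ATTRID></SPEC>
    <SPEC><AVALUE>001</AVALUE><ATTRID>DEPT_CODE</ATTRID></SPEC>
  </SR></WOSet>
</WO>
END_OF_XML

print "$_->[0] = $_->[1]\n" for spec_pairs($doc);
```

The XPathContext approach in the answer is usually the more precise choice, since it only matches elements in the intended namespace.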

Perl word Stemming English text

I am trying to stem an English text. I have read a lot of forums but couldn't find a clear example. I am using the Porter stemmer via Text::English.
This is how far I got:
use Lingua::StopWords qw(getStopWords);
my $stopwords = getStopWords('en');
use Text::English;
@stopwords = grep { $stopwords->{$_} } (keys %$stopwords);
chdir("c:/Test Facility/input");
@files = <*>;
foreach $file (@files)
{
    open (input, $file);
    while (<input>)
    {
        open (output,">>c:/Test Facility/normalized/".$file);
        chomp;
        for my $w (@stopwords)
        {
            s/\b\Q$w\E\b//ig;
        }
        $_ =~s/<[^>]*>//g;
        $_ =~ s/[[:punct:]]//g;
        ##What should I write here to apply porter stemming using Text::English##
        print output "$_\n";
    }
}
close (input);
close (output);
Run the following code like this:
perl stemmer.pl /usr/lib/jvm/java-6-sun-1.6.0.26/jre/LICENSE
It produces output similar to:
operat system distributor licens java version sun microsystems inc sun willing to license java platform standard edition developer kit jdk
Note that strings with length 1 and numeric values are removed, besides stopwords.
#!/usr/bin/env perl
use common::sense;
use Encode;
use Lingua::Stem::Snowball;
use Lingua::StopWords qw(getStopWords);
use Scalar::Util qw(looks_like_number);
my $stemmer = Lingua::Stem::Snowball->new(
encoding => 'UTF-8',
lang => 'en',
);
# Build a lookup hash: each lowercased stopword maps to a true value
my %stopwords = map {
    (lc($_) => 1)
} keys %{getStopWords(en => 'UTF-8')};
local $, = ' ';
say map {
    sub {
        # split the line into words, drop short words, numbers and stopwords
        my @w =
            map {
                encode_utf8 $_
            } grep {
                length >= 2
                and not looks_like_number($_)
                and not exists $stopwords{lc($_)}
            } split
                /[\W_]+/x,
                shift;
        $stemmer->stem_in_place(\@w);
        map {
            lc decode_utf8 $_
        } @w
    }->($_);
} <>;

perl xml::sax parsing

Hi I have an xml as below:
<employees>
<employee>
<firstname>John</firstname>
<lastname>Doe</lastname>
<age>gg</age>
<department>Operations</department>
<amount Ccy="EUR">100</amount>
<joinDate> 12/12/2011 </joinDate>
</employee>
<employee>
<firstname>John1111</firstname>
<lastname>Doe1111</lastname>
<age>gg</age>
<department>Operations</department>
<amount Ccy="EUR">200</amount>
<joinDate> 12/13/2011 </joinDate>
</employee>
</employees>
I would like to parse this using XML::SAX, so I override the start_element, end_element and characters methods. There are two scenarios I am not sure how to handle: 1) if joinDate is null, take the current system date as joinDate; 2) sum up the amounts into a value called totalAmount. How can I do this using the XML::SAX module in Perl?
Please throw some light on which methods I need to override to do this; a small snippet of code would help.
Thanks,
Srikanth
It's pretty straightforward:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use XML::SAX::ParserFactory;
use XML::SAX::PurePerl;
use Date::Calc 'Today';
my $today = sprintf("%02d/%02d/%4d", (Today())[1,2,0]);
# Alternatively if you can't use Date::Calc:
# my @localtime = localtime;
# my $today = sprintf("%02d/%02d/%4d", $localtime[4]+1, $localtime[3], $localtime[5]+1900);
my (@joindates, $total_amount, $buffer);
my $factory = new XML::SAX::ParserFactory;
my $handler = new XML::SAX::PurePerl;
my $parser = $factory->parser(
    Handler => $handler,
    Methods => {
        characters => sub {
            $buffer = shift->{Data};
            $buffer =~ s/^\s+|\s+$//g;
        },
        end_element => sub {
            return unless shift->{LocalName} =~ /^(amount|joinDate)$/;
            push @joindates, $buffer || $today if $1 eq 'joinDate';
            $total_amount += $buffer if $1 eq 'amount';
        }
    }
);
$parser->parse_uri("sample.xml");
print "Total amount: $total_amount\n";
printf("Join dates:\n%s\n", join("\n", @joindates));
Output:
Total amount: 300
Join dates:
12/12/2011
12/13/2011
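Since the question asks which methods to override, the same idea can also be sketched as a handler class derived from XML::SAX::Base, which is the more conventional way to write a SAX handler. This is a sketch under assumptions, not the answerer's code: the package name EmployeeTotals and the hash keys buffer, total_amount and join_dates are hypothetical, the parser is instantiated directly as XML::SAX::PurePerl, and the inline XML is a shortened version of the question's data with one empty joinDate added to show the fallback.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package EmployeeTotals;    # hypothetical handler class
use base 'XML::SAX::Base';
use POSIX qw(strftime);

sub start_element {
    my ($self, $el) = @_;
    $self->{buffer} = '';  # start collecting text for this element
}

sub characters {
    my ($self, $chars) = @_;
    # Text may arrive in several chunks, so append rather than assign
    $self->{buffer} .= $chars->{Data};
}

sub end_element {
    my ($self, $el) = @_;
    my $text = defined $self->{buffer} ? $self->{buffer} : '';
    $text =~ s/^\s+|\s+$//g;
    if ($el->{LocalName} eq 'amount') {
        $self->{total_amount} += $text if length($text);
    }
    elsif ($el->{LocalName} eq 'joinDate') {
        # Fall back to today's date when joinDate is empty
        my $today = strftime('%m/%d/%Y', localtime);
        push @{ $self->{join_dates} }, length($text) ? $text : $today;
    }
    $self->{buffer} = '';
}

package main;
use XML::SAX::PurePerl;

my $handler = EmployeeTotals->new;
my $parser  = XML::SAX::PurePerl->new(Handler => $handler);
$parser->parse_string(<<'END_OF_XML');
<employees>
  <employee><amount Ccy="EUR">100</amount><joinDate> 12/12/2011 </joinDate></employee>
  <employee><amount Ccy="EUR">200</amount><joinDate/></employee>
</employees>
END_OF_XML

print "Total amount: $handler->{total_amount}\n";
print "Join dates:\n", join("\n", @{ $handler->{join_dates} }), "\n";
```

Keeping the state in the handler object avoids the file-level variables used above and lets the same class be reused for several parses, each with a fresh handler.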

How to add one more node information to xml file

I wrote a script that creates one XML file from multiple files. The script looks like this.
#!/usr/bin/perl
use warnings;
use strict;
use XML::LibXML;
use Carp;
use File::Find;
use File::Spec::Functions qw( canonpath );
use XML::LibXML::Reader;
use Digest::MD5 'md5';
if ( @ARGV == 0 ) {
    push @ARGV, "c:/main/work";
    warn "Using default path $ARGV[0]\n Usage: $0 path ...\n";
}
open( my $allxml, '>', "all_xml_contents.combined.xml" )
or die "can't open output xml file for writing: $!\n";
print $allxml '<?xml version="1.0" encoding="UTF-8"?>',
"\n<Shiporder xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">\n";
my %shipto_md5;
find(
    sub {
        return unless ( /(_stc\.xml)$/ and -f );
        extract_information();
        return;
    },
    @ARGV
);
print $allxml "</Shiporder>\n";
sub extract_information {
my $path = $_;
if ( my $reader = XML::LibXML::Reader->new( location => $path )) {
while ( $reader->nextElement( 'data' )) {
my $elem = $reader->readOuterXml();
my $md5 = md5( $elem );
print $allxml $reader->readOuterXml() unless ( $shipto_md5{$md5}++ );
}
}
return;
}
The script above extracts the data node information from all the XML files and stores it in a new XML file. But I have one more node that starts with "details"; I need to extract that information and add it to the file as well. I tried this:
$reader->nextElement( 'details' );
my $information = $reader->readOuterXml();
I added this inside the while loop, but how can I assign or print this data into the same file ($allxml)? Please help me with this problem.
After your suggestion I tried the following, but it gives an error:
#!/usr/bin/perl
use warnings;
use strict;
use XML::LibXML;
use Carp;
use File::Find;
use File::Spec::Functions qw( canonpath );
use XML::LibXML::Reader;
if ( @ARGV == 0 ) {
    push @ARGV, "V:/main/work";
    warn "Using default path $ARGV[0]\n Usage: $0 path ...\n";
}
my $libXML = new XML::LibXML;
my $outputDom = $libXML->parse_string('<?xml version="1.0" encoding="UTF-8"?
>','<Shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">');
my $shiporder = $outputDom->documentElement;
find(
    sub {
        return unless ( /(_stc\.xml)$/ and -f );
        extract_information();
        return;
    },
    @ARGV
);
sub extract_information {
    my $path = $_;
    if(my @inputDom = XML::LibXML->load_xml(location => $path)){
        $inputDom->findnodes('//data || //deatils');
        foreach (@$inputDom) {
            $shiporder->appendChild($_->parentNode->cloneNode(1));
        }
        $outputDom->toFile("allfiles.xml");
    }
}
But it gives an error like " '\n\n:1: Parser error:Start tag expected,'<' not found ". Can you help me with the script? I am very new to Perl.
You would do a lot better if you used what XML::LibXML and its related modules give you. It is a very large and comprehensive module and lets you do a lot in a few lines.
You can use the parser to start a new dom document using parse_string, storing the root node using documentElement. From there, use parse_file to load up each of your input files, then findnodes on the input files to extract the nodes you want to clone. Then append a clone of your input nodes to the output document, and finally use the toFile method to write out your output.
Something like:
my $libXML = XML::LibXML->new;
# parse_string takes one string holding a complete, well-formed document
my $outputDom = $libXML->parse_string(
    '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
  . '<Shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>'
);
my $shiporder = $outputDom->documentElement;
...
my $inputDom = $libXML->parse_file($some_file_name);
# "|" is the XPath union operator; use a more suitable xpath if needed
foreach my $node ($inputDom->findnodes('//data | //details')) {
    # clone $node->parentNode instead if you want the parent too
    $shiporder->appendChild($node->cloneNode(1));
}
...
$outputDom->toFile($some_output_file);
You will have to allow for namespaces and whatnot, but this gives one approach to start with.
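The outline above can be sketched as a small self-contained script. This is an illustration under assumptions rather than a drop-in replacement: the helper name combine_documents and the inline sample documents are hypothetical, exact duplicates are skipped by comparing serialized nodes (a simpler stand-in for the MD5 check in the question), and a details element nested inside a data element would be copied twice, once inside its parent and once on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: build a <Shiporder> document holding a copy of every
# <data> and <details> element found in the given input documents.
sub combine_documents {
    my (@input_docs) = @_;
    my $out_doc = XML::LibXML->load_xml(string =>
        '<Shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>');
    my $root = $out_doc->documentElement;
    my %seen;    # serialized nodes already copied, to skip exact duplicates
    for my $doc (@input_docs) {
        for my $node ($doc->findnodes('//data | //details')) {
            next if $seen{ $node->toString }++;
            # importNode makes a deep copy owned by the output document
            $root->appendChild($out_doc->importNode($node));
        }
    }
    return $out_doc;
}

# Demo with two inline documents (assumed sample data)
my @docs = map { XML::LibXML->load_xml(string => $_) } (
    '<order><data id="1">a</data><details>x</details></order>',
    '<order><data id="1">a</data><details>y</details></order>',
);
my $combined = combine_documents(@docs);
print $combined->toString(1);
```

With real files, each document could be loaded with XML::LibXML->load_xml(location => $path) inside the File::Find callback, and the result written once at the end with $combined->toFile('allfiles.xml').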

How to read value of a node in Perl libXML::Reader

My XML looks like this-
<info>
<name>NameGoesHere</name>
<age>99</age>
</info>
Using XML::LibXML::Reader, I would like to print the age.
I read the documentation from here, and I am looking for an example. I am able to use $reader->nextElement('info') and then print innerXML but not able to figure out how do I print only age and not the whole innerxml.
Try localName:
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML::Reader;
my $reader = XML::LibXML::Reader->new(IO => \*DATA)
or die "Cannot read from \\*DATA\n";
while ( $reader->read ) {
print $reader->readInnerXml if $reader->localName eq 'age';
}
$reader->finish;
__DATA__
<info>
<name>NameGoesHere</name>
<age>99</age>
</info>
Mostly from Regex: keep everything in <profession></profession> tags
use strict;
use warnings;
use feature qw( say );
use XML::LibXML::Reader qw(
XML_READER_TYPE_ELEMENT
XML_READER_TYPE_END_ELEMENT
XML_READER_TYPE_TEXT
);
my $reader = XML::LibXML::Reader->new(IO => \*DATA);
my $in_age = 0;
my $age;
while ($reader->read()) {
if ($reader->nodeType() == XML_READER_TYPE_ELEMENT && $reader->name() eq 'age') {
$age = '' if !$in_age;
++$in_age;
}
elsif ($reader->nodeType() == XML_READER_TYPE_END_ELEMENT && $reader->name() eq 'age') {
--$in_age;
say $age if !$in_age;
}
elsif ($in_age && $reader->nodeType() == XML_READER_TYPE_TEXT) {
$age .= $reader->value();
}
}
__DATA__
<info>
<name>NameGoesHere</name>
<age>99</age>
</info>
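A shorter variant, offered as a sketch, is to let the reader jump straight to each age element with nextElement and read its text from a copy of the current node. The helper name ages_from_xml is hypothetical, and the reader is created from a string here instead of the DATA handle so the example stands alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper: return the text of every <age> element in an XML string.
sub ages_from_xml {
    my ($xml) = @_;
    my $reader = XML::LibXML::Reader->new(string => $xml)
        or die "Cannot create reader\n";
    my @ages;
    # nextElement returns 1 while another matching element is found
    while ($reader->nextElement('age') == 1) {
        # copyCurrentNode(1) builds a small DOM subtree for just this element
        push @ages, $reader->copyCurrentNode(1)->textContent;
    }
    return @ages;
}

my $xml = <<'END_OF_XML';
<info>
<name>NameGoesHere</name>
<age>99</age>
</info>
END_OF_XML

print "$_\n" for ages_from_xml($xml);
```

Copying the node trades a little memory for simplicity; for very large age subtrees the explicit node-type loop above keeps memory use lower.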