Perl: Handling duplicate element names using XML::SAX

How do I handle duplicate element names with the Perl XML::SAX module? Here is my XML file:
<employees>
<employee>
<name>John</name>
<age>gg</age>
<department>Operations</department>
<amount Ccy="EUR">100</amount>
<company>
<name> abc </name>
</company>
</employee>
<employee>
<name>Larry</name>
<age>45</age>
<department>Accounts</department>
<amount Ccy="EUR">200</amount>
<company>
<name> xyz </name>
</company>
</employee>
</employees>
My question is how to access the element employees->employee->company->name (I should be able to print "abc" and "xyz"). I am asking because there is another 'name' element at employees->employee->name, which I want to skip. I would like to use XML::SAX only, as my environment supports only this module. Please help. Thanks a lot.

Use a stack to keep track of which nodes you are inside: push every time you enter a node, and pop every time you leave one:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use XML::SAX::ParserFactory;
use XML::SAX::PurePerl;
my (@nodes, $characters, @names);
my $factory = new XML::SAX::ParserFactory;
my $handler = new XML::SAX::PurePerl;
my $parser = $factory->parser(
    Handler => $handler,
    Methods => {
        start_element => sub {
            # entering an element: remember its name on the stack
            push @nodes, shift->{LocalName};
        },
        characters => sub {
            $characters = shift->{Data};
        },
        end_element => sub {
            # $nodes[-1] is the element being closed, $nodes[-2] is its parent
            if (shift->{LocalName} eq 'name' && $nodes[-2] eq 'company') {
                push @names, $characters;
            }
            pop @nodes;
        }
    }
);
$parser->parse_uri("sample2.xml");
print Dumper \@names;
Output:
$VAR1 = [
' abc ',
' xyz '
];
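$nodes[-2] is the second-to-last element in @nodes, i.e. the parent of the element being closed. When shift->{LocalName} equals 'name' it resolves to either 'employee' or 'company', so testing it for 'company' skips the employee-level name.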
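The same stack idea can also be written as a conventional handler class. The following is only a minimal sketch, assuming a handler that subclasses XML::SAX::Base; the package name CompanyNameHandler and the hash keys path, text and names are made up for illustration. It appends character data instead of overwriting it, since a SAX parser may deliver the text of one element in several characters events, and it compares the full element path rather than only the parent.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package CompanyNameHandler;    # hypothetical name, not part of XML::SAX
use base 'XML::SAX::Base';

sub start_document {
    my ($self) = @_;
    $self->{path}  = [];       # stack of currently open element names
    $self->{names} = [];       # collected company names
    $self->{text}  = '';       # text seen since the last tag
}

sub start_element {
    my ($self, $el) = @_;
    push @{ $self->{path} }, $el->{LocalName};
    $self->{text} = '';
}

sub characters {
    my ($self, $chars) = @_;
    $self->{text} .= $chars->{Data};    # text may arrive in several chunks
}

sub end_element {
    my ($self, $el) = @_;
    if (join('/', @{ $self->{path} }) eq 'employees/employee/company/name') {
        (my $name = $self->{text}) =~ s/^\s+|\s+$//g;    # trim whitespace
        push @{ $self->{names} }, $name;
    }
    pop @{ $self->{path} };
    $self->{text} = '';
}

package main;
use XML::SAX::ParserFactory;

my $xml = <<'XML';
<employees>
  <employee>
    <name>John</name>
    <company><name> abc </name></company>
  </employee>
  <employee>
    <name>Larry</name>
    <company><name> xyz </name></company>
  </employee>
</employees>
XML

my $handler = CompanyNameHandler->new;
XML::SAX::ParserFactory->parser(Handler => $handler)->parse_string($xml);
print "$_\n" for @{ $handler->{names} };
```

With the sample data this should print abc and xyz, without the employee-level names.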

Related

How to get the text contents of an XML child element based on an attribute of its parent

This is my XML data
<categories>
<category id="Id001" name="Abcd">
<project> ID_1234</project>
<project> ID_5678</project>
</category>
<category id="Id002" name="efgh">
<project> ID_6756</project>
<project> ID_4356</project>
</category>
</categories>
I need to get the text contents of each <project> element based on the name attribute of the containing <category> element.
I am using Perl with the XML::LibXML module.
For example, given category name Abcd I should get the list ID_1234, ID_5678.
Here is my code
my $parser = XML::LibXML->new;
$doc = $parser->parse_file( "/cctest/categories.xml" );
my @nodes = $doc->findnodes( '/categories/category' );
foreach my $cat ( @nodes ) {
    my @catn = $cat->findvalue('@name');
}
This gives me the category names in array @catn. But how can I get the text values of each project?
You haven't shown what you've tried so far, or what your desired output is so I've made a guess at what you're looking for.
With XML::Twig you could do something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig -> parse ( \*DATA );
foreach my $project ( $twig -> findnodes ( '//project' ) ) {
print join ",", (map { $project -> parent -> att($_) } qw ( id name )), $project -> text,"\n";
}
__DATA__
<categories>
<category id="Id001" name="Abcd">
<project> ID_1234</project>
<project> ID_5678</project>
</category>
<category id="Id002" name="efgh">
<project> ID_6756</project>
<project> ID_4356</project>
</category>
</categories>
Which produces:
Id001,Abcd, ID_1234,
Id001,Abcd, ID_5678,
Id002,efgh, ID_6756,
Id002,efgh, ID_4356,
It does this by using findnodes to locate any element 'project'.
Then extract the 'id' and 'name' attributes from the parent (the category), and print that - along with the text in this particular element.
xpath is a powerful tool for selecting data from XML, and with a more focussed question, we can give more specific answers.
So if you were seeking all the projects 'beneath' category "Abcd" you could:
foreach my $project ( $twig -> findnodes ( './category[@name="Abcd"]/project' ) ) {
print $project -> text,"\n";
}
This uses XML::LibXML, which is the library you're already using.
Your $cat variable contains an XML element object which you can process with the same findnodes() and findvalue() methods that you used on the top-level $doc object.
#!/usr/bin/perl
use strict;
use warnings;
# We use modern Perl here (specifically say())
use 5.010;
use XML::LibXML;
my $doc = XML::LibXML->new->parse_file('categories.xml');
foreach my $cat ($doc->findnodes('//category')) {
say $cat->findvalue('@name');
foreach my $proj ($cat->findnodes('project')) {
say $proj->findvalue('.');
}
}
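To answer the original request directly (all projects for one category name), the nested loop can be wrapped in a small function. This is a sketch only: the function name projects_for is invented for illustration, and the XML is inlined with load_xml so the example is self-contained. The name is compared in Perl rather than spliced into the XPath string, which avoids quoting problems if the name contains quote characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use XML::LibXML;

my $xml = <<'XML';
<categories>
  <category id="Id001" name="Abcd">
    <project> ID_1234</project>
    <project> ID_5678</project>
  </category>
  <category id="Id002" name="efgh">
    <project> ID_6756</project>
    <project> ID_4356</project>
  </category>
</categories>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Hypothetical helper: return the trimmed text of every <project>
# inside the <category> whose name attribute equals $wanted.
sub projects_for {
    my ($doc, $wanted) = @_;
    my @projects;
    for my $cat ($doc->findnodes('/categories/category')) {
        next unless ($cat->getAttribute('name') // '') eq $wanted;
        push @projects,
            map { $_->findvalue('normalize-space(.)') } $cat->findnodes('project');
    }
    return @projects;
}

say for projects_for($doc, 'Abcd');
```

For the sample data this should print ID_1234 and ID_5678, one per line.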
You can try with XML::Simple
use strict;
use warnings;
use XML::Simple;
use Data::Dumper;
my $XML_file = 'your XML file';
my $XML_data;
#Get data from your XML file
open(my $IN, '<:encoding(UTF-8)', $XML_file) or die "cannot open file $XML_file";
{
local $/;
$XML_data = <$IN>;
}
close($IN);
#Store XML data as hash reference
my $xmlSimple = XML::Simple->new(KeepRoot => 1);
my $hash_ref = $xmlSimple->XMLin($XML_data);
print Dumper $hash_ref;
The hash reference will be as below:
$VAR1 = {
'categories' => {
'category' => {
'efgh' => {
'id' => 'Id002',
'project' => [
' ID_6756',
' ID_4356'
]
},
'Abcd' => {
'id' => 'Id001',
'project' => [
' ID_1234',
' ID_5678'
]
}
}
}
};
Now to get data which you want:
foreach (@{$hash_ref->{'categories'}->{'category'}->{'Abcd'}->{'project'}}) {
print "$_\n";
}
The result is:
ID_1234
ID_5678

How to skip unwanted elements using XML::Twig?

Trying to learn XML::Twig and fetch some data from an XML document.
My XML contains 20k+ <ADN> elements. Each <ADN> element contains tens of child elements, one of which is <GID>. I want to process only those ADN elements where GID == 1. (See the example XML in the __DATA__ section.)
The docs say:
Handlers are triggered in fixed order, sorted by their type (xpath
expressions first, then regexps, then level), then by whether they
specify a full path (starting at the root element) or not, then by
number of steps in the expression , then number of predicates, then
number of tests in predicates. Handlers where the last step does not
specify a step (foo/bar/*) are triggered after other XPath handlers.
Finally _all_ handlers are triggered last.
Important: once a handler has been triggered if it returns 0 then no
other handler is called, except a _all_ handler which will be called
anyway.
My actual code:
use 5.014;
use warnings;
use XML::Twig;
use Data::Dumper;
my $cat = load_xml_catalog();
say Dumper $cat;
sub load_xml_catalog {
my $hr;
my $current;
my $twig= XML::Twig->new(
twig_roots => {
ADN => sub { # process the <ADN> elements
$_->purge; # and purge when finishes with one
},
},
twig_handlers => {
'ADN/GID' => sub {
return 1 if $_->trimmed_text == 1;
return 0; # skip the other handlers - if the GID != 1
},
'ADN/ID' => sub { #remember the ID as a "key" into the '$hr' for the "current" ADN
$current = $_->trimmed_text;
$hr->{$current}{$_->tag} = $_->trimmed_text;
},
#rules for the wanted data extracting & storing to $hr->{$current}
'ADN/Name' => sub {
$hr->{$current}{$_->tag} = $_->text;
},
},
);
$twig->parse(\*DATA);
return $hr;
}
__DATA__
<ArrayOfADN>
<ADN>
<GID>1</GID>
<ID>1</ID>
<Name>name 1</Name>
</ADN>
<ADN>
<GID>2</GID>
<ID>20</ID>
<Name>should be skipped because GID != 1</Name>
</ADN>
<ADN>
<GID>1</GID>
<ID>1000</ID>
<Name>other name 1000</Name>
</ADN>
</ArrayOfADN>
It outputs
$VAR1 = {
'1000' => {
'ID' => '1000',
'Name' => 'other name 1000'
},
'1' => {
'Name' => 'name 1',
'ID' => '1'
},
'20' => {
'Name' => 'should be skipped because GID != 1',
'ID' => '20'
}
};
So,
The handler for ADN/GID returns 0 when the GID != 1.
Why are the other handlers still called?
The expected (wanted) output is without the '20' => ... entry.
How do I skip the unwanted nodes correctly?
The "returns zero" thing is a bit of a red herring in this context. If you had multiple matches on your element, then one of them returning zero would inhibit the others.
That doesn't mean it won't still try and process subsequent nodes.
I think you're getting confused - you have handlers for separate subelements of your <ADN> elements - and they trigger separately. That's by design. There is a precedence order for xpath but only on duplicate matches. Yours are completely separate though, so they all 'fire' because they trigger on different elements.
However, you might find it useful to know - twig_handlers allows xpath expressions - so you can explicitly say:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->parse( \*DATA );
$twig -> set_pretty_print('indented_a');
foreach my $ADN ( $twig -> findnodes('//ADN/GID[string()="1"]/..') ) {
$ADN -> print;
}
This also works in the twig_handlers syntax. I would suggest that a handler is only really useful if you need to pre-process your XML or you are memory constrained. With 20,000 nodes, you may be (at which point purge is your friend).
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(
pretty_print => 'indented_a',
twig_handlers => {
'//ADN[string(GID)="1"]' => sub { $_->print }
}
);
$twig->parse( \*DATA );
__DATA__
<ArrayOfADN>
<ADN>
<GID>1</GID>
<ID>1</ID>
<Name>name 1</Name>
</ADN>
<ADN>
<GID>2</GID>
<ID>20</ID>
<Name>should be skipped because GID != 1</Name>
</ADN>
<ADN>
<GID>1</GID>
<ID>1000</ID>
<Name>other name 1000</Name>
</ADN>
</ArrayOfADN>
Although, I would probably just do it this way instead:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
sub process_ADN {
my ( $twig, $ADN ) = @_;
return unless $ADN -> first_child_text('GID') == 1;
print "ADN with name:", $ADN -> first_child_text('Name')," Found\n";
}
my $twig = XML::Twig->new(
pretty_print => 'indented_a',
twig_handlers => {
'ADN' => \&process_ADN
}
);
$twig->parse( \*DATA );
__DATA__
<ArrayOfADN>
<ADN>
<GID>1</GID>
<ID>1</ID>
<Name>name 1</Name>
</ADN>
<ADN>
<GID>2</GID>
<ID>20</ID>
<Name>should be skipped because GID != 1</Name>
</ADN>
<ADN>
<GID>1</GID>
<ID>1000</ID>
<Name>other name 1000</Name>
</ADN>
</ArrayOfADN>
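To get the hash the question was after (keyed by ID, containing only the GID == 1 records), the single ADN handler can fill it and purge as it goes. This is a sketch under assumptions: the variable name %catalog is made up, the XML is inlined as a string, and the GID is compared as a string so that a non-numeric GID does not raise a warning.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;

my $xml = <<'XML';
<ArrayOfADN>
  <ADN><GID>1</GID><ID>1</ID><Name>name 1</Name></ADN>
  <ADN><GID>2</GID><ID>20</ID><Name>should be skipped because GID != 1</Name></ADN>
  <ADN><GID>1</GID><ID>1000</ID><Name>other name 1000</Name></ADN>
</ArrayOfADN>
XML

my %catalog;    # hypothetical name: ID => { ID => ..., Name => ... }

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once per <ADN>, after all of its children have been parsed
        ADN => sub {
            my ( $t, $adn ) = @_;
            if ( $adn->first_child_trimmed_text('GID') eq '1' ) {
                my $id = $adn->first_child_trimmed_text('ID');
                $catalog{$id} = {
                    ID   => $id,
                    Name => $adn->first_child_text('Name'),
                };
            }
            $t->purge;    # free the elements parsed so far
        },
    },
);
$twig->parse($xml);

$Data::Dumper::Sortkeys = 1;
print Dumper \%catalog;
```

Because the decision is made once the whole ADN element is available, there is no need to coordinate separate GID, ID and Name handlers.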

Using XML::Simple for retrieving data in Perl

I've created the below XML file for retrieving data.
Input:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ExecutionLogs>
<category cname='Condition1' cid='T1'>
<log>value1</log>
<log>value2</log>
<log>value3</log>
</category>
<category cname='Condition2' cid='T2'>
<log>value4</log>
<log>value5</log>
<log>value6</log>
</category>
<category cname='Condition3' cid='T3'>
<log>value7</log>
</category>
</ExecutionLogs>
I want the output like below,
Condition1 -> value1,value2,value3
Condition2 -> value4,value5,value6
Condition3 -> value7
I have tried the code below,
use strict;
use XML::Simple;
my $filename = "input.xml";
$config = XML::Simple->new();
$config = XMLin($filename);
@values = @{$config->{'category'}{'log'}};
Please help me on this. Thanks in advance.
A way to do this using XML::Twig:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
XML::Twig->new( twig_handlers => {
category => sub { print $_->att( 'cname'), ': ',
join( ',', $_->children_text( 'log')), "\n";
},
},
)
->parsefile( 'my.xml');
The handler is called each time a category element has been parsed. $_ is the element itself.
What I would do:
use strict; use warnings;
use XML::Simple;
my $filename = 'input.xml';
my $config = XML::Simple->new();
$config = XMLin($filename, ForceArray => [ 'log' ]);
# we want an array here ^---------------------^
my @category = @{ $config->{'category'} };
# ^------------------------^
# de-reference of an ARRAY ref
foreach my $hash (@category) {
    print $hash->{cname}, ' -> ', join(",", @{ $hash->{log} }), "\n";
}
OUTPUT
Condition1 -> value1,value2,value3
Condition2 -> value4,value5,value6
Condition3 -> value7
NOTE
ForceArray => [ 'log' ] is there to ensure that every {category}->[N]->{log} has the same type (an ARRAY ref).
Without it, the last category ("Condition3") has only one <log>, so XML::Simple stores a plain string there and we would try to dereference a string as an ARRAY ref.
Check XML::Simple#ForceArray
and
perldoc perlreftut
perldoc -f ref
perldoc perlref

perl parsing using sax

I would like to write a xml parsing script in Perl that prints all the firstname values from the following xml file using XML::SAX module.
<employees>
<employee>
<firstname>John</firstname>
<lastname>Doe</lastname>
<age>gg</age>
<department>Operations</department>
<amount Ccy="EUR">100</amount>
</employee>
<employee>
<firstname>Larry</firstname>
<lastname>Page</lastname>
<age>45</age>
<department>Accounts</department>
<amount Ccy="EUR">200</amount>
</employee>
<employee>
<firstname>Harry</firstname>
<lastname>Potter</lastname>
<age>50</age>
<department>Human Resources</department>
<amount Ccy="EUR">300</amount>
</employee>
</employees>
Can anyone help me with sample script?
I am new to Perl.
Here's an example using XML::SAX. I've used XML::SAX::PurePerl.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use XML::SAX::ParserFactory;
use XML::SAX::PurePerl;
my $characters;
my @firstnames;
my $factory = new XML::SAX::ParserFactory;
#Let's see which handlers we have available
#print Dumper $factory;
my $handler = new XML::SAX::PurePerl;
my $parser = $factory->parser(
    Handler => $handler,
    Methods => {
        characters => sub {
            $characters = shift->{Data};
        },
        end_element => sub {
            push @firstnames, $characters if shift->{LocalName} eq 'firstname';
        }
    }
);
$parser->parse_uri("sample.xml");
print Dumper \@firstnames;
Output:
$VAR1 = [
'John',
'Larry',
'Harry'
];
I use $characters to hold character data, and push its contents onto @firstnames whenever I see a closing firstname tag.
Do you have any reason to stick with XML::SAX? If not, you could look at other XML parsers for Perl (XML::Twig, XML::LibXML, XML::LibXML::Reader, XML::Simple) and many more.
Here is sample code to retrieve the firstname values using XML::Twig.
use XML::Twig;
my $twig = XML::Twig->new ();
$twig->parsefile ('sample.xml');
my @firstname = map { $_->text } $twig->findnodes ('//firstname');

How to parse a multi-record XML file using XML::Simple in Perl

My data.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<catalog>
<cd country="UK">
<title>Hide your heart</title>
<artist>Bonnie Tyler</artist>
<price>10.0</price>
</cd>
<cd country="CHN">
<title>Greatest Hits</title>
<artist>Dolly Parton</artist>
<price>9.99</price>
</cd>
<cd country="USA">
<title>Hello</title>
<artist>Say Hello</artist>
<price>0001</price>
</cd>
</catalog>
my test.pl
#!/usr/bin/perl
# use module
use XML::Simple;
use Data::Dumper;
# create object
$xml = new XML::Simple;
# read XML file
$data = $xml->XMLin("data.xml");
# access XML data
print "$data->{cd}->{country}\n";
print "$data->{cd}->{artist}\n";
print "$data->{cd}->{price}\n";
print "$data->{cd}->{title}\n";
Output:
Not a HASH reference at D:\learning\perl\t1.pl line 16.
Comment: I googled and found an article (on handling a single XML record).
http://www.go4expert.com/forums/showthread.php?t=812
I tested the article's code, and it works well on my laptop.
Then I wrote my practice code above to try to access multiple records, but it failed. How can I fix it? Thank you.
Always use strict; and always use warnings;. Don't quote complex references like you're doing. You're right to use Dumper; it should have shown you that cd is an array ref, so you have to specify which cd.
#!/usr/bin/perl
use strict;
use warnings;
# use module
use XML::Simple;
use Data::Dumper;
# create object
my $xml = new XML::Simple;
# read XML file
my $data = $xml->XMLin("file.xml");
# access XML data
print $data->{cd}[0]{country};
print $data->{cd}[0]{artist};
print $data->{cd}[0]{price};
print $data->{cd}[0]{title};
If you do print Dumper($data), you will see that the data structure does not look like you think it does:
$VAR1 = {
'cd' => [
{
'country' => 'UK',
'artist' => 'Bonnie Tyler',
'price' => '10.0',
'title' => 'Hide your heart'
},
{
'country' => 'CHN',
'artist' => 'Dolly Parton',
'price' => '9.99',
'title' => 'Greatest Hits'
},
{
'country' => 'USA',
'artist' => 'Say Hello',
'price' => '0001',
'title' => 'Hello'
}
]
};
You need to access the data like so:
print "$data->{cd}->[0]->{country}\n";
print "$data->{cd}->[0]->{artist}\n";
print "$data->{cd}->[0]->{price}\n";
print "$data->{cd}->[0]->{title}\n";
In addition to what has been said by Evan, if you're unsure whether you have one element or many, ref() can tell you which it is, and you can handle it accordingly:
my $data = $xml->XMLin("file.xml");
if (ref($data->{cd}) eq 'ARRAY') {
    for my $cd (@{ $data->{cd} }) {
        print Dumper $cd;
    }
}
else {    # Chances are it's a single element
    print Dumper $data->{cd};
}
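The ref() check can be factored into a small helper so the calling code does not need two branches. This is an illustrative sketch: as_list is a made-up name, and the two hashes stand in for what XMLin might return for a multi-record and a single-record file. (Passing ForceArray => ['cd'] to XMLin is another way to get a consistent shape.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: always return a list, whether the value is an
# array ref (many records), a single record, or undef (no records).
sub as_list {
    my ($value) = @_;
    return ()      unless defined $value;
    return @$value if ref $value eq 'ARRAY';
    return ($value);
}

# Stand-ins for XMLin() results with several <cd> elements and with one
my $many = { cd => [ { title => 'Hide your heart' }, { title => 'Greatest Hits' } ] };
my $one  = { cd => { title => 'Hello' } };

for my $data ($many, $one) {
    print "$_->{title}\n" for as_list( $data->{cd} );
}
```

This prints the three titles in order, and the loop body is the same for both shapes of data.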