Perl: how to parse an XML file sequentially

I have an XML file which describes the data-structure that I can exchange on a UDP channel.
For example:
Here is my input XML file describing my data-structure.
<ds>
<uint32 name='a'/>
<uint32 name='b'/>
<string name='c'/>
<int16 name='d'/>
<uint32 name='e'/>
</ds>
Parsing this XML file with Perl's XML::Simple gives me the following hash:
$VAR1 = {
'uint32' => {
'e' => {},
'a' => {},
'b' => {}
},
'int16' => {
'name' => 'd'
},
'string' => {
'name' => 'c'
}
};
As you can see, after parsing I have no way to work out the position of field 'e' relative to the start of the data structure.
I would like to find the offset of each of these elements.
I searched for a Perl XML parser that lets me parse an XML file sequentially, something with 'getnexttag()'-style functionality, but could not find one.
What is the best way to do this programmatically? If not Perl, which other language is best suited to this work?

You'll need to use a streaming parser with the appropriate callbacks. On larger data sets this can also improve parsing speed and reduce memory consumption, if done correctly.
I recommend XML::SAX; an introduction to the module is available under the following link:
XML::SAX::Intro
Provide a callback for start_element; this way you can read each element, one at a time, in document order.
Could you write me an easy example?
Yes, and I already have! ;-)
The snippet below parses the data the OP provided and prints the name of each element, as well as its attribute keys and values.
It should be quite easy to follow, but if you have any questions feel free to add them as a comment and I'll update this post with more detailed information.
use warnings;
use strict;
use XML::SAX;
my $parser = XML::SAX::ParserFactory->parser(
Handler => ExampleHandler->new
);
$parser->parse_string (<<EOT
<ds>
<uint32 name='a'/>
<uint32 name='b'/>
<string name='c'/>
<int16 name='d'/>
<uint32 name='e'/>
</ds>
EOT
);
# # # # # # # # # # # # # # # # # # # # # # # #
package ExampleHandler;
use base ('XML::SAX::Base');
sub start_element {
my ($self, $el) = @_;
print "found element: ", $el->{Name}, "\n";
for my $attr (values %{$el->{Attributes}}) {
print " '", $attr->{Name}, "' = '", $attr->{Value}, "'\n";
}
print "\n";
}
output
found element: ds
found element: uint32
'name' = 'a'
found element: uint32
'name' = 'b'
found element: string
'name' = 'c'
found element: int16
'name' = 'd'
found element: uint32
'name' = 'e'
I'm not satisfied with XML::SAX, are there any other modules available?
Yes, there are plenty to choose from. Read the following list and choose the one that you find fitting for your specific problem:
perl-xml.sourceforge.net/faq - cpan modules
What is the difference between different methods of parsing?
I also recommend reading the following FAQ regarding XML parsing. It covers the pros and cons of using a tree parser (such as XML::Parser::Simple) or a streaming parser:
Perl-XML Frequently Asked Questions, Tree VS Stream

It most certainly is possible with Perl.
Here's an example with XML::LibXML :
use strict;
use warnings;
use feature 'say';
use XML::LibXML;
my $xml = XML::LibXML->load_xml( location => 'test.xml' );
my ( $dsNode ) = $xml->findnodes( '/ds' );
my @kids = $dsNode->nonBlankChildNodes; # The indices of this array will
# give the offset
my $first_kid = shift @kids; # Pull off the first kid
say $first_kid->toString; # "<uint32 name='a'/>"
my $second = $first_kid->nextNonBlankSibling();
my $third = $second->nextNonBlankSibling();
say $third->toString; # "<string name="c"/>"

Here is an example using XML::Twig
use XML::Twig;
XML::Twig->new( twig_handlers => { 'ds/*' => \&each_child } )
->parse( $your_xml_data );
sub each_child {
my ($twig, $child) = @_;
printf "tag %s : name = %s\n", $child->name, $child->{att}->{name};
}
This outputs:
tag uint32 : name = a
tag uint32 : name = b
tag string : name = c
tag int16 : name = d
tag uint32 : name = e
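Once the fields arrive in document order (as they do with any of the three parsers above), the offsets the question asks for are just a running sum of the field sizes. Here is a minimal sketch in plain Perl. The size table is an assumption that is not in the original question: in particular, the 16-byte fixed-width string is hypothetical, since the XML says nothing about string lengths. The @fields list stands in for whatever your handler collects.

```perl
use strict;
use warnings;

# Fields in document order, as an order-preserving parser would deliver
# them: [ type, name ]. Hard-coded here to keep the sketch self-contained.
my @fields = (
    [ uint32 => 'a' ],
    [ uint32 => 'b' ],
    [ string => 'c' ],
    [ int16  => 'd' ],
    [ uint32 => 'e' ],
);

# ASSUMPTION: sizes in bytes. The 16-byte string is purely hypothetical.
my %size_of = ( uint32 => 4, int16 => 2, string => 16 );

my %offset_of;
my $offset = 0;
for my $field (@fields) {
    my ($type, $name) = @$field;
    die "unknown type '$type'\n" unless exists $size_of{$type};
    $offset_of{$name} = $offset;      # field starts where the previous one ended
    $offset += $size_of{$type};
}

printf "%s starts at byte %d\n", $_->[1], $offset_of{ $_->[1] } for @fields;
```

With these assumed sizes, field 'e' lands at byte 26. In a real handler you would push [ type, name ] pairs onto @fields from start_element (or the twig handler) instead of hard-coding them.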

Related

Returning a hash of the Parsed document (using Twig in Perl) to be used for processing in other subs

I am failing terribly to return a Hash of the Parsed XML document using twig - in order to use it in OTHER subs for performing several validation checks. The goal is to do abstraction and create re-usable blocks of code.
XML Block:
<?xml version="1.0" encoding="utf-8"?>
<Accounts locale="en_US">
<Account>
<Id>abcd</Id>
<OwnerLastName>asd</OwnerLastName>
<OwnerFirstName>zxc</OwnerFirstName>
<Locked>false</Locked>
<Database>mail</Database>
<Customer>mail</Customer>
<CreationDate year="2011" month="8" month-name="fevrier" day-of-month="19" hour-of-day="15" minute="23" day-name="dimanche"/>
<LastLoginDate year="2015" month="04" month-name="avril" day-of-month="22" hour-of-day="11" minute="13" day-name="macredi"/>
<LoginsCount>10405</LoginsCount>
<Locale>nl</Locale>
<Country>NL</Country>
<SubscriptionType>free</SubscriptionType>
<ActiveSubscriptionType>free</ActiveSubscriptionType>
<SubscriptionExpiration year="1980" month="1" month-name="janvier" day-of-month="1" hour-of-day="0" minute="0" day-name="jeudi"/>
<SubscriptionMonthlyFee>0</SubscriptionMonthlyFee>
<PaymentMode>Undefined</PaymentMode>
<Provision>0</Provision>
<InternalMail>asdf@asdf.com</InternalMail>
<ExternalMail>fdsa@zxczxc.com</ExternalMail>
<GroupMemberships>
<Group>werkgroep X.Y.Z.</Group>
</GroupMemberships>
<SynchroCount>6</SynchroCount>
<LastSynchroDate year="2003" month="12" month-name="decembre" day-of-month="5" hour-of-day="12" minute="48" day-name="mardi"/>
<HasActiveSync>false</HasActiveSync>
<Company/>
</Account>
<Account>
<Id>mnbv</Id>
<OwnerLastName>cvbb</OwnerLastName>
<OwnerFirstName>bvcc</OwnerFirstName>
<Locked>true</Locked>
<Database>mail</Database>
<Customer>mail</Customer>
<CreationDate year="2012" month="10" month-name="octobre" day-of-month="10" hour-of-day="10" minute="18" day-name="jeudi"/>
<LastLoginDate/>
<LoginsCount>0</LoginsCount>
<Locale>fr</Locale>
<Country>BE</Country>
<SubscriptionType>free</SubscriptionType>
<ActiveSubscriptionType>free</ActiveSubscriptionType>
<SubscriptionExpiration year="1970" month="1" month-name="janvier" day-of-month="1" hour-of-day="1" minute="0" day-name="jeudi"/>
<SubscriptionMonthlyFee>0</SubscriptionMonthlyFee>
<PaymentMode>Undefined</PaymentMode>
<Provision>0</Provision>
<InternalMail/>
<ExternalMail>qweqwe@qwe.com</ExternalMail>
<GroupMemberships/>
<SynchroCount>0</SynchroCount>
<LastSynchroDate year="1970" month="1" month-name="janvier" day-of-month="1" hour-of-day="1" minute="0" day-name="jeudi"/>
<HasActiveSync>false</HasActiveSync>
<Company/>
</Account>
</Accounts>
Perl Block:
my $file = shift || (print "NOTE: \tYou didn't provide the name of the file to be checked.\n" and exit);
my $twig = XML::Twig -> new ( twig_roots => { 'Account' => \& parsing } ); #'twig_roots' mode builds only the required sub-trees from the document while ignoring everything outside that twig.
$twig -> parsefile ($file);
sub parsing {
my ( $twig, $accounts ) = @_;
my %hash = @_;
my $ref = \%hash; #because was getting an error of Odd number of hash elements
return $ref;
$twig -> purge;
It gives a hash reference, which I'm unable to dereference properly (even after countless attempts).
Again, I just need a single clean function (sub) that does the parsing and returns a hash of all elements ('Accounts' in this case), to be used in another function (valid_sub) for performing the validation checks.
I'm literally stuck at this point and will HIGHLY appreciate your HELP.

Such a hash is not created by Twig, you have to create it yourself.
Beware: Commands after return will never be reached.
#!/usr/bin/perl
use warnings;
use strict;
use XML::Twig;
use Data::Dumper;
my $twig = 'XML::Twig'->new(twig_roots => { Account => \&account });
$twig->parsefile(shift);
sub account {
my ($twig, $account) = @_;
my %hash;
for my $ch ($account->children) {
# Test the length rather than truthiness, so a text of "0" (e.g. LoginsCount) is kept.
if (length(my $text = $ch->text)) {
$hash{ $ch->name } = $text;
} else {
for my $attr (keys %{ $ch->atts }) {
$hash{ $ch->name }{$attr} = $ch->atts->{$attr};
}
}
}
print Dumper \%hash;
$twig -> purge;
validate(\%hash);
}
Handling of nested elements (e.g. GroupMemberships) left as an exercise to the reader.
And for validation:
sub validate {
my $account = shift;
if ('abcd' eq $account->{Id}) {
...
}
}

The problem with downconverting XML into hashes, is that XML is fundamentally a more complicated data structure. Each element has properties, children and content - and it's ordered - where hashes... don't.
So I would suggest that you not do what you're doing, and instead of passing a hash, use an XML::Twig::Elt and pass that into your validation.
Fortunately, this is exactly what XML::Twig passes to its handlers:
## this is fine:
sub parsing {
my ( $twig, $accounts ) = @_;
But this is nonsense. Think about what's in @_ at this point: two references to XML::Twig objects, which you've just assigned.
my %hash = @_;
And as a result, this doesn't make sense either:
my $ref = \%hash; #because was getting an error of Odd number of hash elements
And where are you returning it to? (this is being called when XML::Twig is parsing)
return $ref;
#this doesn't happen, you've already returned
$twig -> purge;
But bear in mind: you're returning it to the twig process that's doing the parsing, and that never hands the return value back to your code. So it's not going to do anything anyway.
I would suggest instead you 'save' the $accounts reference and use that for your validation - just pass it into your subroutines to validate.
Or better yet, configure up a set of twig_handlers that do this for you:
my %validate = ( 'Account/Locked' => sub { die if $_ -> trimmed_text eq "true" },
'Account/CreationDate' => \&parsing,
'Account/ExternalMail' => sub { die unless $_ -> text =~ m/\w+\@\w+\.\w+/ }
);
my $twig = XML::Twig -> new ( twig_roots => \%validate );
You can either die if you want to discard the whole lot, or use things like cut to remove an invalid entry from a document as you parse (and maybe paste it into a separate doc).
But if you really must turn your XML into a perl data structure - first read this for why it's a terrible idea:
Why is XML::Simple "Discouraged"?
And then, if you really want to carry on down that road, look at the simplify option of XML::Twig:
sub parsing {
my ( $twig, $accounts ) = @_;
my $horrible_hacky_hashref = $accounts->simplify(forcearray => 1, keyattr => [], forcecontent => 1 );
print Dumper \$horrible_hacky_hashref;
$twig -> purge;
#do something with it.
}
Edit:
To expand:
XML::Twig::Elt is a subset of XML::Twig - it's the 'building block' of an XML::Twig data structure - so in your example above, $accounts is one.
sub parsing {
my ( $twig, $accounts ) = @_;
print Dumper $accounts;
}
You will get a lot of data if you do this, because you're dumping the whole data structure - which is effectively a daisy chain of XML::Twig::Elt objects.
$VAR1 = \bless( {
'parent' => bless( {
'first_child' => ${$VAR1},
'flushed' => 1,
'att' => {
'locale' => 'en_US'
},
'gi' => 6,
....
'att' => {},
'last_child' => ${$VAR1}->{'first_child'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'}->{'next_sibling'},
'gi' => 7
}, 'XML::Twig::Elt' );
But it already encapsulates the information you need, as well as the structure you require - that's why XML::Twig uses it. It also goes a long way towards illustrating why you're going to lose data if you force it into a hash/array.

HTML::Parser handler sends undefined parameter to callback function?

How it's being declared:
my $HTML_GRABBER = HTML::Parser->new('api_version' => 2,
'handlers' => {
'start' => [\&start_tag,"tagname,text"],
'text' => [\&read_text,"tagname, text"],
'end' => [\&end_tag,"tagname"]
}
);
callback function:
sub read_text {
print Dumper(@_);
die "\n";
my ($tag,$stuff) = @_;
if(($DO_NOTHING==0)&&($tag eq $current_tag))
{
push @{$data_queue}, $stuff;
}
}
result:
$VAR1 = undef;
$VAR2 = '
';
So it apparently passes an undefined value for the tag and a string containing only a newline for the text. This is reading from a saved HTML file on my hard drive. I don't know why.
I had something like this in mind:
#DOC structure:
#(
# "title"=> {"text"=>#("text")}
# "div" => [
# {
# "p"=> [
# {
# "class" => string
# "id" => string
# "style" => string
# "data"=>["first line", "second line"]
# }
# ],
# "class" => string
# "id" => string
# "style" => string
# }
# ]
#)

You've told it to.
You specified which parameters should be passed to the text handler:
'text' => [\&read_text,"tagname, text"],
Well, there is no tagname for a text token, and therefore it passes you undef as the first parameter.
What exactly are you trying to do? If you describe your actual goal, we might be able to suggest a better solution instead of just pointing out the flaws in your current implementation. Check out: What is an XY Problem?
Addendum about Mojo::DOM
There are modern modules like Mojo::DOM that are much better for navigating a document structure and finding specific data. Check out Mojocast Episode 5 for a helpful 8 minute introductory video.
You appear to be prematurely worried about efficiency of the parse. Initially, I'd advise you to just store the raw html in the database, and reparse it whenever you need to pull new information.
If you Benchmark and decide this is too slow, then you can use Storable to save a serialized copy of the parsed $dom object. However, this should definitely be in addition to the saved html.
use strict;
use warnings;
use Mojo::DOM;
use Storable qw(freeze thaw);
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
# Serializing to memory - Can then put it into a DB if you want
my $serialized = freeze $dom;
my $newdom = thaw($serialized);
# Load Title from Serialized dom (at() returns the first matching element)
print $newdom->at('title')->text;
__DATA__
<html>
<head><title>My Title</title></head>
<body>
<h1>My Header one</h1>
<p>My Paragraph One</p>
<p>My Paragraph Two</p>
</body>
</html>
Outputs:
My Title

Iterate directories in Perl, getting introspectable objects as result

I'm about to start a script that may have some file lookups and manipulation, so I thought I'd look into some packages that would assist me; mostly, I'd like the results of the iteration (or search) to be returned as objects, which would have (base)name, path, file size, uid, modification time, etc as some sort of properties.
The thing is, I don't do this all that often, and tend to forget APIs; when that happens, I'd rather let the code run on an example directory, and dump all of the properties in an object, so I can remind myself what is available where (obviously, I'd like to "dump", in order to avoid having to code custom printouts). However, I'm aware of the following:
list out all methods of object - perlmonks.org
"Out of the box Perl doesn't do object introspection. Class wrappers like Moose provide introspection as part of their implementation, but Perl's built in object support is much more primitive than that."
Anyways, I looked into:
"Files and Directories Handling in Perl - Perl Beginners' Site" http://perl-begin.org/topics/files-and-directories/
... and started looking into the libraries referred there (also related link: rjbs's rubric: the speed of Perl file finders).
So, for one, File::Find::Object seems to work for me; this snippet:
use Data::Dumper;
@targetDirsToScan = ("./");
use File::Find::Object;
my $tree = File::Find::Object->new({}, @targetDirsToScan);
while (my $robh = $tree->next_obj()) {
#print $robh ."\n"; # prints File::Find::Object::Result=HASH(0xa146a58)}
print Dumper($robh) ."\n";
}
... prints this:
# $VAR1 = bless( {
# 'stat_ret' => [
# 2054,
# 429937,
# 16877,
# 5,
# 1000,
# 1000,
# 0,
# '4096',
# 1405194147,
# 1405194139,
# 1405194139,
# 4096,
# 8
# ],
# 'base' => '.',
# 'is_link' => '',
# 'is_dir' => 1,
# 'path' => '.',
# 'dir_components' => [],
# 'is_file' => ''
# }, 'File::Find::Object::Result' );
# $VAR1 = bless( {
# 'base' => '.',
# 'is_link' => '',
# 'is_dir' => '',
# 'path' => './test.blg',
# 'is_file' => 1,
# 'stat_ret' => [
# 2054,
# 423870,
# 33188,
# 1,
# 1000,
# 1000,
# 0,
# '358',
# 1404972637,
# 1394828707,
# 1394828707,
# 4096,
# 8
# ],
# 'basename' => 'test.blg',
# 'dir_components' => []
... which is mostly what I wanted, except the stat results are an array, and I'd have to know its layout (($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,$atime,$mtime,$ctime,$blksize,$blocks) stat - perldoc.perl.org) to make sense of the printout.
Then I looked into IO::All, which I like because of utf-8 handling (but also, say, socket functionality, which would be useful to me for an unrelated task in the same script); and I was thinking I'd use this package instead. The problem is, I have a very hard time discovering what the available fields in the object returned are; e.g. with this code:
use Data::Dumper;
@targetDirsToScan = ("./");
use IO::All -utf8;
$io = io(@targetDirsToScan);
@contents = $io->all(0);
for my $contentry ( @contents ) {
#print Dumper($contentry) ."\n";
# $VAR1 = bless( \*Symbol::GEN298, 'IO::All::File' );
# $VAR1 = bless( \*Symbol::GEN307, 'IO::All::Dir' ); ...
#print $contentry->uid . " -/- " . $contentry->mtime . "\n";
# https://stackoverflow.com/q/24717210/printing-ret-of-ioall-w-datadumper
print Dumper \%{*$contentry}; # doesn't list uid
}
... I get a printout like this:
# $VAR1 = {
# '_utf8' => 1,
# 'constructor' => sub { "DUMMY" },
# 'is_open' => 0,
# 'io_handle' => undef,
# 'name' => './test.blg',
# '_encoding' => 'utf8',
# 'package' => 'IO::All'
# };
# $VAR1 = {
# '_utf8' => 1,
# 'constructor' => sub { "DUMMY" },
# 'mode' => undef,
# 'name' => './testdir',
# 'package' => 'IO::All',
# 'is_absolute' => 0,
# 'io_handle' => undef,
# 'is_open' => 0,
# '_assert' => 0,
# '_encoding' => 'utf8'
... which clearly doesn't show attributes like mtime, etc. - even if they exist (which you can see if you uncomment the respective print line).
I've also tried Data::Printer's (How can I perform introspection in Perl?) p() function - it prints exactly the same fields as Dumper. I also tried to use print Dumper \%{ref ($contentry) . "::"}; (list out all methods of object - perlmonks.org), and this prints stuff like:
'O_SEQUENTIAL' => *IO::All::File::O_SEQUENTIAL,
'mtime' => *IO::All::File::mtime,
'DESTROY' => *IO::All::File::DESTROY,
...
'deep' => *IO::All::Dir::deep,
'uid' => *IO::All::Dir::uid,
'name' => *IO::All::Dir::name,
...
... but only if you use the print $contentry->uid ... line beforehand; else they are not listed! I guess that relates to this:
introspection - How do I list available methods on a given object or package in Perl? #911294
In general, you can't do this with a dynamic language like Perl. The package might define some methods that you can find, but it can also make up methods on the fly that don't have definitions until you use them. Additionally, even calling a method (that works) might not define it. That's the sort of things that make dynamic languages nice. :)
Still, that prints the name and type of the field - I'd want the name and value of the field instead.
So, I guess my main question is - how can I dump an IO::All result, so that all fields (including stat ones) are printed out with their names and values (as is mostly the case with File::Find::Object)?
(I noticed the IO::All results can be of type, say, IO::All::File, but its docs defer to "See IO::All", which doesn't discuss IO::All::File explicitly much at all. I thought, if I could "cast" \%{*$contentry} to a IO::All::File, maybe then mtime etc fields will be printed - but is such a "cast" possible at all?)
If that is problematic, are there other packages, that would allow introspective printout of directory iteration results - but with named fields for individual stat properties?

Perl does introspection in the fact that an object will tell you what type of object it is.
if ( $object->isa("Foo::Bar") ) {
say "Object is of a class of Foo::Bar, or is a subclass of Foo::Bar.";
}
if ( ref $object eq "Foo::Bar" ) {
say "Object is of the class Foo::Bar.";
}
else {
say "Object isn't a Foo::Bar object, but may be a subclass of Foo::Bar";
}
You can also see if an object can do something:
if ( $object->can("quack") ) {
say "Object looks like a duck!";
}
What Perl can't do directly is give you a list of all the methods that a particular object can do.
You might be able to munge something together. Perl classes are packages, whose namespaces live in the symbol table, and methods are implemented as ordinary Perl subroutines. It may be possible to go through the package namespace and find all the subroutines.
However, I can see several issues. First, private methods (the ones you're not supposed to use) and non-method subroutines would also be included. There's no way to know which is which. Also, parent methods won't be listed.
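For what it's worth, walking a package's symbol table looks roughly like the sketch below. The Duck class and the subs_in_package helper are made up for illustration only. It shows exactly the limitations described above: the underscore-prefixed "private" sub is listed alongside the public ones, and nothing inherited from a parent class would appear.

```perl
use strict;
use warnings;

# Hypothetical class, used only for this illustration.
package Duck {
    sub new     { return bless {}, shift }
    sub quack   { return 'Quack!' }
    sub _waddle { return }              # "private" by convention only
}

# Return the names of all subroutines defined directly in $class.
# Inherited methods and methods generated on the fly are NOT found.
sub subs_in_package {
    my ($class) = @_;
    no strict 'refs';                   # symbolic access to the package stash
    my @names = grep { defined &{"${class}::$_"} } keys %{"${class}::"};
    return sort @names;
}

my @found = subs_in_package('Duck');
print "$_\n" for @found;
```

Calling subs_in_package(ref $object) works the same way for a blessed object, but a class that builds its methods lazily can still come back with an incomplete list.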
Many languages can generate such a list of methods for their objects (I believe both Python and Ruby can), but these usually give you a list without an explanation what these do. For example, File::Find::Object::Result (which is returned by the next_obj method of File::Find::Object) has a base method. What does it do? Maybe it's like basename and gives me the name of the file. Nope, it's like dirname and gives me the name of the directory.
Again, some languages could give a list of those methods for an object along with a description. However, those descriptions depend on the programmer to maintain them and make sure they're correct. There's no guarantee of that.
Perl doesn't have introspection, but all Perl modules stored in CPAN must be documented via POD embedded documentation, and this is printable from the command line:
$ perldoc File::Find::Object
This is the documentation you see in CPAN pages, in http://Perldoc.perl.org and in ActiveState's Perl documentation.
It's not bad. It's not true introspection, but the documentation is usually pretty good. After all, if the documentation stunk, I probably wouldn't have installed that module in the first place. I use perldoc all the time. I can barely remember my kids' names, let alone the way to use Perl classes that I haven't used in a few months, but I find that using perldoc works pretty well.
What you should not do is use Data::Dumper to dump out objects and try to figure out what they contain and which methods they might have. Some clever programmers are using Inside-Out Objects to thwart peeping toms.
So no, Perl doesn't list methods of a particular class like some languages can, but perldoc comes pretty close to doing what you need. I haven't used File::Find::Object in a long while, but going over the perldoc, I could probably write such a program without much difficulty.

As I answered to your previous question, it is not a good idea to go relying on the guts of objects in Perl. Instead just call methods.
If IO::All doesn't offer a method that gives you the information that you need, you might be able to write your own method for it that assembles that information using just the documented methods provided by IO::All...
use IO::All;
# Define a new method for IO::All::Base to use, but
# define it in a lexical variable!
#
my $dump_info = sub {
use Data::Dumper ();
my $self = shift;
local $Data::Dumper::Terse = 1;
local $Data::Dumper::Sortkeys = 1;
return Data::Dumper::Dumper {
name => $self->name,
mtime => $self->mtime,
mode => $self->mode,
ctime => $self->ctime,
};
};
$io = io('/tmp');
for my $file ( $io->all(0) ) {
print $file->$dump_info();
}
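If the real goal is just named stat fields that can be dumped, no object internals are needed: Perl's built-in stat() returns its 13 values in a documented order, so a hash slice can label them. The following is a minimal sketch using only core Perl. The named_stat helper is a made-up name, and '.' is just an example path.

```perl
use strict;
use warnings;
use Data::Dumper;

# Field names in the order documented for Perl's built-in stat().
my @STAT_FIELDS = qw(
    dev ino mode nlink uid gid rdev size
    atime mtime ctime blksize blocks
);

# Return a hash ref of field name => value, or nothing if stat() fails.
sub named_stat {
    my ($path) = @_;
    my @values = stat($path);
    return unless @values;              # stat() returns an empty list on failure
    my %st;
    @st{@STAT_FIELDS} = @values;        # hash slice: pair each name with its value
    return \%st;
}

local $Data::Dumper::Sortkeys = 1;      # stable, readable output
my $st = named_stat('.') or die "cannot stat '.': $!\n";
print Dumper($st);
```

The same helper can be called on $file->name inside the IO::All loop above. The core File::stat module is another option if you prefer accessor methods such as ->mtime over a plain hash.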

Ok, this is more or less an exercise (and a reminder for me); below is some code, where I've tried to define a class (File::Find::Object::StatObj) with accessor fields for all of the stat fields. Then, I have the hack for IO::All::File from Replacing a class in Perl ("overriding"/"extending" a class with same name)?, where a mtimef field is added which corresponds to mtime, just as a reminder.
Then, just to see what sort of interface I could have between the two libraries, I have IO::All doing the iterating; and the current file path is passed to File::Find::Object, from which we obtain a File::Find::Object::Result - which has been "hacked" to also show the File::Find::Object::StatObj; but that one is only generated after a call to the hacked Result's full_components (that might as well have been a separate function). Notice that in this case, you won't get full_components/dir_components of File::Find::Object::Result -- because apparently it is not File::Find::Object doing the traversal here, but IO::All. Anyways, the result is something like this:
# $VAR1 = {
# '_utf8' => 1,
# 'mtimef' => 1403956165,
# 'constructor' => sub { "DUMMY" },
# 'is_open' => 0,
# 'io_handle' => undef,
# 'name' => 'img/test.png',
# '_encoding' => 'utf8',
# 'package' => 'IO::All'
# };
# img/test.png
# > - $VAR1 = bless( {
# 'base' => 'img/test.png',
# 'is_link' => '',
# 'is_dir' => '',
# 'path' => 'img/test.png',
# 'is_file' => 1,
# 'stat_ret' => [
# 2054,
# 426287,
# 33188,
# 1,
# 1000,
# 1000,
# 0,
# '37242',
# 1405023944,
# 1403956165,
# 1403956165,
# 4096,
# 80
# ],
# 'basename' => undef,
# 'stat_obj' => bless( {
# 'blksize' => 4096,
# 'ctime' => 1403956165,
# 'rdev' => 0,
# 'blocks' => 80,
# 'uid' => 1000,
# 'dev' => 2054,
# 'mtime' => 1403956165,
# 'mode' => 33188,
# 'size' => '37242',
# 'nlink' => 1,
# 'atime' => 1405023944,
# 'ino' => 426287,
# 'gid' => 1000
# }, 'File::Find::Object::StatObj' ),
# 'dir_components' => []
# }, 'File::Find::Object::Result' );
I'm not sure how correct this would be, but what I like about this is that I could forget where the fields are; then I could rerun the dumper, and see that I could get, say, the size via (*::Result)->stat_obj->size - and that seems to work (here I'd need just to read these, not to set them).
Anyways, here is the code:
use Data::Dumper;
my @targetDirsToScan = ("./");
use IO::All -utf8 ; # Turn on utf8 for all io
# try to "replace" the IO::All::File class
{ # https://stackoverflow.com/a/24726797/277826
package IO::All::File;
use IO::All::File; # -base; # just do not use `-base` here?!
# hacks work if directly in /usr/local/share/perl/5.10.1/IO/All/File.pm
# NB: field is a sub in /usr/local/share/perl/5.10.1/IO/All/Base.pm
field mtimef => undef; # hack
sub file {
my $self = shift;
bless $self, __PACKAGE__;
$self->name(shift) if @_;
$self->mtimef($self->mtime); # hack
#print("!! *haxx0rz'd* file() reporting in\n");
return $self->_init;
}
1;
}
use File::Find::Object;
# based on /usr/local/share/perl/5.10.1/File/Find/Object/Result.pm;
# but inst. from /usr/local/share/perl/5.10.1/File/Find/Object.pm
{
package File::Find::Object::StatObj;
use integer;
use Tie::IxHash;
#use Data::Dumper;
sub ordered_hash { # https://stackoverflow.com/a/3001400/277826
#my (@ar) = @_; #print("# ". join(",",@ar) . "\n");
tie my %hash => 'Tie::IxHash';
%hash = @_; #print Dumper(\%hash);
\%hash
}
my $fields = ordered_hash(
# from http://perldoc.perl.org/functions/stat.html
(map { $_ => $_ } (qw(
dev ino mode nlink uid gid rdev size
atime mtime ctime blksize blocks
)))
); #print Dumper(\%{$fields});
use Class::XSAccessor
#accessors => %{$fields}, # cannot - is seemingly late
# ordered_hash gets accepted, but doesn't matter in final dump;
#accessors => { (map { $_ => $_ } (qw(
accessors => ordered_hash( (map { $_ => $_ } (qw(
dev ino mode nlink uid gid rdev size
atime mtime ctime blksize blocks
))) ),
#))) },
;
use Fcntl qw(:mode);
sub new
{
#my $self = shift;
my $class = shift;
my @stat_arr = @_; # the rest
my $ic = 0;
my $self = {};
bless $self, $class;
for my $k (keys %{$fields}) {
$fld = $fields->{$k};
#print "$ic '$k' '$fld' ".join(", ",$stat_arr[$ic])." ; ";
$self->$fld($stat_arr[$ic]);
$ic++;
}
#print "\n";
return $self;
}
1;
}
# try to "replace" the File::Find::Object::Result
{
package File::Find::Object::Result;
use File::Find::Object::Result;
#use File::Find::Object::StatObj; # no, has no file!
use Class::XSAccessor replace => 1,
accessors => {
(map { $_ => $_ } (qw(
base
basename
is_dir
is_file
is_link
path
dir_components
stat_ret
stat_obj
)))
}
;
#use Fcntl qw(:mode);
#sub new # never gets called
sub full_components
{
my $self = shift; #print("NEWCOMP\n");
my $sobj = File::Find::Object::StatObj->new(@{$self->stat_ret()});
$self->stat_obj($sobj); # add stat_obj and its fields
return
[
@{$self->dir_components()},
($self->is_dir() ? () : $self->basename()),
];
}
1;
}
# main script start
my $io = io($targetDirsToScan[0]);
my @contents = $io->all(0); # Get all contents of dir
for my $contentry ( @contents ) {
print Dumper \%{*$contentry};
print $contentry->name . "\n"; # img/test.png
# get a File::Find::Object::Result - must instantiate
# a File::Find::Object; just item_obj() will return undef
# right after instantiation, so must give it "next";
# no instantition occurs for $tro, though!
#my $tffor = File::Find::Object->new({}, ($contentry->name))->next_obj();
my $tffo = File::Find::Object->new({}, ("./".$contentry->name));
my $tffos = $tffo->next(); # just a string!
$tffo->_calc_current_item_obj(); # unfortunately, this will not calculate dir_components ...
my $tffor = $tffo->item_obj();
# ->full_components doesn't call new, either!
# must call full_compoments, to generate the fields
# (assign to unused variable triggers it fine)
# however, $arrref_fullcomp will be empty, because
# File::Find::Object seemingly calcs dir_components only
# if it is traversing a tree...
$arrref_fullcomp = $tffor->full_components;
#print("# ".$tffor->stat_obj->size."\n"); # seems to work
print "> ". join(", ", @$arrref_fullcomp) ." - ". Dumper($tffor);
}

Parsing and iterating through an XML structure where each entry has multiple attributes

I'm trying to deal with an XML list of items (in this case, images) and iterate over each one. I don't really understand Perl or hashes, but I found a few explanations and examples (many here) and wrote something that seemed to work. The XML is a list of elements, each of which contains a unique 'id' attribute.
I'm using XMLin from XML::Simple to parse the XML.
When the list contains multiple elements, it iterates through by 'id'. But it seems that when there is only one, it gets confused, and treats each attribute of the element as its own value, which results in a run-time error.
Can't use string ("0") as a HASH ref while "strict refs" in use
I'm guessing that the problem is that the hash doesn't know that 'id' is the unique key, at least when there's only one entry. So I added code to dump the keys. I also added a line to print what $image is in the foreach loop. In the case that breaks, the line print "In loop; image ID=$image\n"; displays In loop; image ID=Serial. Since Serial is an attribute at the same level as id, I'm guessing this is the problem (not properly using id as the key).
Here's my code:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;
my $album_data_file = $ARGV[0];
my $album_file_list = $ARGV[1];
my $do_dump_data = $ARGV[2];
my $album_data = XMLin ( $album_data_file );
my $LIST_FILE;
if ( defined $album_file_list && "$album_file_list" ne "" )
{
if ( open ( $LIST_FILE, ">", "$album_file_list" ) )
{
print "Opened file $album_file_list as $LIST_FILE\n";
}
}
if ( defined $do_dump_data && $do_dump_data eq "true" )
{
use Data::Dumper;
print "data:\n\n";
print Dumper ( $album_data );
print "\n\n\n\n";
print "keys:\n\n";
print Dumper ( keys %{$album_data->{Images}->{Image}} );
print "\n\n\n\n";
}
foreach my $image ( keys %{$album_data->{Images}->{Image}} )
{
print "In loop; image ID=$image\n";
my $ref = $album_data->{Images}->{Image}->{$image};
#
# Write to files list: file name, ID, key, size, MD5
#
print $LIST_FILE ( "$ref->{FileName}\t$image\t$ref->{Key}"
. "\t$ref->{Size}\t$ref->{MD5Sum}\n" );
}
close ( $LIST_FILE );
Here's a sample XML file that breaks it:
<?xml version="1.0" encoding="utf-8"?>
<rsp stat="ok">
<method>images.get</method>
<Images>
<Image id="123" Key="xyz" Type="Album" Caption="Room 5083" FileName="MVI_2838.AVI" Format="MP4" Height="480" Keywords="China; Suite" LastUpdated="2014-04-19 11:49:45" Position="1" Serial="0" Size="116033" Width="640" Date="2014-04-19 11:46:24" Hidden="0" MD5Sum="6151e20053eeda87c688f8becae0d402" Watermark="0">
<Album id="345" Key="zzy" />
</Image>
</Images>
</rsp>
Here's the result of dumping the full $album_data:
$VAR1 = {
'method' => 'images.get',
'Images' => {
'Image' => {
'Serial' => '0',
'Format' => 'MP4',
'Keywords' => 'China; Suite',
'Type' => 'Album',
'Size' => '116033',
'MD5Sum' => '6151e20053eeda87c688f8becae0d402',
'id' => '123',
'Key' => 'xyz',
'LastUpdated' => '2014-04-19 11:49:45',
'Album' => {
'id' => '345',
'Key' => 'zzy'
},
'Position' => '1',
'Height' => '480',
'Date' => '2014-04-19 11:46:24',
'Caption' => 'Room 5083',
'FileName' => 'MVI_2838.AVI',
'Hidden' => '0',
'Width' => '640',
'Watermark' => '0',
}
},
'stat' => 'ok'
};
Here's the result of dumping the keys %{$album_data->{Images}->{Image}} construct:
$VAR1 = 'Serial';
$VAR2 = 'Format';
$VAR3 = 'Keywords';
$VAR5 = 'Type';
$VAR6 = 'Size';
$VAR7 = 'MD5Sum';
$VAR9 = 'id';
$VAR10 = 'Key';
$VAR11 = 'LastUpdated';
$VAR12 = 'Album';
$VAR14 = 'Position';
$VAR15 = 'Height';
$VAR16 = 'Date';
$VAR17 = 'Caption';
$VAR19 = 'FileName';
$VAR20 = 'Hidden';
$VAR23 = 'Width';
$VAR24 = 'Watermark';
$VAR27 = 'Duration';

According to the "Status of this Module" section of the XML::Simple documentation:
The use of this module in new code is discouraged. Other modules are available which provide more straightforward and consistent interfaces. In particular, XML::LibXML is highly recommended.
The major problems with this module are the large number of options and the arbitrary ways in which these options interact - often with unexpected results.
Patches with bug fixes and documentation fixes are welcome, but new features are unlikely to be added.
XML::Simple is a useful module for quickly parsing XML if you're familiar with Perl's complex data structures. However, once the XML gets too complex, the module outlives its usefulness, because the way it parses certain structures changes arbitrarily depending on a lot of configuration options.
I still use XML::Simple on rare occasions, but I'd advise you to look at either XML::Twig or the aforementioned XML::LibXML to avoid issues like this.
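As a rough illustration of the XML::LibXML route, here is a minimal sketch that walks the Image elements of a trimmed-down copy of the question's sample document. The XPath expression and the choice of attributes are assumptions based on that sample, not a drop-in replacement for the original script.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed-down copy of the sample document from the question.
my $xml = <<'XML';
<rsp stat="ok">
  <method>images.get</method>
  <Images>
    <Image id="123" Key="xyz" FileName="MVI_2838.AVI" Serial="0" Size="116033" MD5Sum="6151e20053eeda87c688f8becae0d402">
      <Album id="345" Key="zzy" />
    </Image>
  </Images>
</rsp>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# findnodes returns one node per <Image>, whether there is one or many,
# so the loop behaves the same in both cases.
my @rows;
for my $image ( $doc->findnodes('/rsp/Images/Image') ) {
    push @rows, join "\t",
        map { $image->getAttribute($_) } qw(FileName id Key Size MD5Sum);
}

print "$_\n" for @rows;
```

Because each Image is an element node rather than a hash that changes shape with the input, there is no single-element special case to handle.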

I appreciate Miller's caution that XML::Simple is discouraged and his warning that it is difficult to use due to so many options which interact in ways that are hard to define and manage. While looking into the replacement modules he suggested, I stumbled on some information that I should have been aware of prior to using XML::Simple in the first place. In particular, the fact that my script worked when multiple images are in the XML but fails when there is only one points out that if one is using XML::Simple, it is often critical to set the ForceArray option to the element(s) that are always supposed to be in an array, even if a particular XML file only happens to contain one. Otherwise, the element will sometimes be an array and sometimes a scalar, causing the exact run-time error I was seeing.
So, in my case, setting ForceArray => [ 'Image' ] makes the code work (by forcing all <Image> elements into an array, even if there is only one), with less immediate effort than trying to figure out how to use a different XML parsing module (although I have no doubt that making the effort to do so will save time in the future).
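A minimal sketch of that fix, using a trimmed-down copy of the sample document. Passing KeyAttr explicitly is my own addition to make the folding on 'id' visible; by default XML::Simple already folds on name, key and id.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Trimmed-down copy of the sample document with a single <Image>.
my $xml = <<'XML';
<rsp stat="ok">
  <method>images.get</method>
  <Images>
    <Image id="123" Key="xyz" FileName="MVI_2838.AVI" Serial="0" Size="116033">
      <Album id="345" Key="zzy" />
    </Image>
  </Images>
</rsp>
XML

# ForceArray turns <Image> into a list even when there is only one,
# so the list is then folded into a hash keyed by the 'id' attribute.
my $album_data = XMLin( $xml, ForceArray => ['Image'], KeyAttr => ['id'] );

for my $id ( sort keys %{ $album_data->{Images}{Image} } ) {
    my $image = $album_data->{Images}{Image}{$id};
    print "$image->{FileName}\t$id\t$image->{Key}\t$image->{Size}\n";
}
```

With this option the loop sees the image id as the key for one image or for many, instead of the attribute names of the lone element.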

How do I convert Data::Dumper output back into a Perl data structure?

I was wondering if you could shed some light on the code I've been working on for a couple of days.
I've been trying to convert a Perl hash to XML and back using XMLout() and XMLin(), and it works when the hash is written directly in the script, like this:
#!/usr/bin/perl -w
use strict;
# use module
use IO::File;
use XML::Simple;
use XML::Dumper;
use Data::Dumper;
my $dump = new XML::Dumper;
my ( $data, $VAR1 );
Topology:$VAR1 = {
'device' => {
'FOC1047Z2SZ' => {
'ChassisID' => '2009-09',
'Error' => undef,
'Group' => {
'ID' => 'A1',
'Type' => 'Base'
},
'Model' => 'CATALYST',
'Name' => 'CISCO-SW1',
'Neighbor' => {},
'ProbedIP' => 'TEST',
'isDerived' => 0
}
},
'issues' => [
'TEST'
]
};
# create object
my $xml = new XML::Simple (NoAttr=>1,
RootName=>'data',
SuppressEmpty => 'true');
# convert Perl array ref into XML document
$data = $xml->XMLout($VAR1);
#reads an XML file
my $X_out = $xml->XMLin($data);
# access XML data
print Dumper($data);
print "STATUS: $X_out->{issues}\n";
print "CHASSIS ID: $X_out->{device}{ChassisID}\n";
print "GROUP ID: $X_out->{device}{Group}{ID}\n";
print "DEVICE NAME: $X_out->{device}{Name}\n";
print "DEVICE NAME: $X_out->{device}{name}\n";
print "ERROR: $X_out->{device}{error}\n";
I can access all the element in the XML with no problem.
But when I store the dumped hash in a separate file and read it from there, a problem arises: I can no longer access the XML elements. I guess I'm not turning the file's contents back into a hash correctly with the following code.
#!/usr/bin/perl -w
use strict;
#!/usr/bin/perl
# use module
use IO::File;
use XML::Simple;
use XML::Dumper;
use Data::Dumper;
my $dump = new XML::Dumper;
my ( $data, $VAR1, $line_Holder );
#this is the file that contains the parsed hash
my $saveOut = "C:/parsed_hash.txt";
my $result_Holder = IO::File->new($saveOut, 'r');
while ($line_Holder = $result_Holder->getline){
print $line_Holder;
}
# create object
my $xml = new XML::Simple (NoAttr=>1, RootName=>'data', SuppressEmpty => 'true');
# convert Perl array ref into XML document
$data = $xml->XMLout($line_Holder);
#reads an XML file
my $X_out = $xml->XMLin($data);
# access XML data
print Dumper($data);
print "STATUS: $X_out->{issues}\n";
print "CHASSIS ID: $X_out->{device}{ChassisID}\n";
print "GROUP ID: $X_out->{device}{Group}{ID}\n";
print "DEVICE NAME: $X_out->{device}{Name}\n";
print "DEVICE NAME: $X_out->{device}{name}\n";
print "ERROR: $X_out->{device}{error}\n";
Do you have any idea how I could access the $VAR1 inside the text file?
Regards,
newbee_me

$data = $xml->XMLout($line_Holder);
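After the while loop, $line_Holder does not contain the file. The loop only ends once getline returns undef, so $line_Holder is undef at that point, and even inside the loop it holds just one line at a time. What XMLout needs is the Perl hashref that would result from evaling the file's contents. Try something like this: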
my $ref = do $saveOut;
The do function loads and evals a file for you. You may want to do it in separate steps, like:
use File::Slurp "read_file";
my $fileContents = read_file( $saveOut );
my $ref = eval( $fileContents );
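Here is a self-contained sketch of the do approach with error checks. The temporary file and the sample data are stand-ins for C:/parsed_hash.txt and the real topology hash.

```perl
use strict;
use warnings;
use Data::Dumper;
use File::Temp qw(tempfile);

# Hypothetical sample data standing in for the asker's topology hash.
my $original = {
    device => { Name => 'CISCO-SW1', Group => { ID => 'A1' } },
    issues => ['TEST'],
};

# Write the Dumper output ("$VAR1 = {...};") to a temporary file.
my ( $fh, $path ) = tempfile( UNLINK => 1 );
print {$fh} Dumper($original);
close $fh or die "Cannot close $path: $!";

# do FILE compiles and runs the file, returning the value of its last
# statement, which here is the hashref assigned to $VAR1.
my $loaded = do $path;
die "Could not parse $path: $@" if $@;
die "Could not read $path: $!" unless defined $loaded;

print "DEVICE NAME: $loaded->{device}{Name}\n";
print "GROUP ID: $loaded->{device}{Group}{ID}\n";
```

Keep in mind that do and eval execute whatever code the file contains, so this is only appropriate for files you trust.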

You might want to look at the Data::Dump module as a replacement for Data::Dumper; its output is a plain Perl expression with no "$VAR1 =" prefix, so it can be passed straight back to eval.

Basically to load Dumper data you eval() it:
use strict;
use Data::Dumper;
my $x = {"a" => "b", "c"=>[1,2,3],};
my $q = Dumper($x);
$q =~ s{\A\$VAR\d+\s*=\s*}{};
my $w = eval $q;
print $w->{"a"}, "\n";
The regexp (s{\A\$VAR\d+\s*=\s*}{}) is used to remove "$VAR1 =" from the beginning of the string.
On the other hand, if you need a way to store a complex data structure and load it again, it's much better to use the Storable module and its store() and retrieve() functions.
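A minimal sketch of the Storable route, assuming a temporary file as the storage location and a small sample structure:

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

my $data = { a => 'b', c => [ 1, 2, 3 ] };

# Reserve a temporary file name; store() writes to the path itself.
my ( $fh, $path ) = tempfile( UNLINK => 1 );
close $fh or die "Cannot close $path: $!";

# store() serializes the structure to disk in Storable's binary format.
store( $data, $path ) or die "Cannot store to $path";

# retrieve() reads it back as a new, independent data structure.
my $copy = retrieve($path);

print $copy->{a}, "\n";
print "@{ $copy->{c} }\n";
```

Unlike the Dumper-and-eval approach, nothing is executed as Perl code when the data is loaded, although the file is binary rather than human-readable.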

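This has worked for me for hashes of hashes. It may not work as well for structures that contain references to other structures, but it works well enough for simple ones such as arrays, hashes, or hashes of hashes.
# Use a lexical filehandle; DATA is a special built-in handle in Perl.
open(my $out, ">", $file) or die "Cannot write $file: $!";
print {$out} Dumper(\%g_write_hash);
close($out) or die "Cannot close $file: $!";
# do FILE evaluates the dump and returns the hashref assigned to $VAR1.
my %g_read_hash = %{ do $file };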

Consider using the Data::Dump module as a replacement for Data::Dumper.

You can configure the variable name used in Data::Dumper's output with $Data::Dumper::Varname.
Example
use Data::Dumper;
$Data::Dumper::Varname = "foo";
my $string = Dumper($object);
eval($string);
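...will create the variable $foo1 (Varname sets only the prefix; Data::Dumper appends a number), which should contain the same data as $object. Under use strict, declare it first with our $foo1.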
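A small sketch of the Varname approach under strict, with sample data of my own choosing:

```perl
use strict;
use warnings;
use Data::Dumper;

# Package variable that the eval'd assignment will populate.
our $foo1;

my $object = { a => 'b', c => [ 1, 2, 3 ] };

# Varname is the prefix; Dumper numbers the variables, giving "$foo1 = ...".
local $Data::Dumper::Varname = 'foo';
my $string = Dumper($object);

eval $string;
die "eval failed: $@" if $@;

print $foo1->{a}, "\n";
```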
If your data structure is complicated and you get strange results, you may want to consider Storable's freeze() and thaw() functions.
