memoizing an XML parsing function breaks it - perl

I'm new to Perl, and I need to improve the performance of an application someone else wrote.
Profiling showed that the program is spending a lot of time in the XML::Simple library.
Based on how the application's use has changed over time, we suspect that it is re-parsing the same XML data several times over.
Memoizing the XML parsing function seemed like a straightforward fix. The files it gets the XML data from are assumed not to change while the program runs, so let's just cache the results for each file.
That function, the library's entry point, is XMLin.
My single change to the software was adding
use Memoize;
memoize('XMLin');
Trying to run it returns the error:
Not a HASH reference at C:\QuEST\Scripts\RangeAnalyzer/ParseETP.pl line 269.
Line 269 is:
@constantElements = @{$xml->{declarations}->{Package}->{declarations}->{Constant}};
... and $xml is defined a few lines up as:
my $xml = XMLin($Filename, KeyAttr => {ConstValue => '', Operator => '', VariableRef => '', Variable => '', StateMachine => '', State => '', IfBlock => '', WhenBlock => '', SizeParameter => ''}, ForceArray => ['Variable', 'ConstValue', 'DataArrayOp', 'Constant']);
Undoing the change fixes the error.
Why did memoizing the function break its return value? How do I fix it?
I noticed XML::Simple is deprecated, and replacing it, preferably with something faster, is on the list of things to try.
Nevertheless, this error broke my mental model of how memoization was supposed to work.
I'm using Perl 5.10.0.

I'm afraid that there isn't enough information in your question to fully answer what's going wrong. (Not until there's an MWE at least). However, I would like to point out two things which you may need to consider.
In order to memoize a function, Memoize uses a normalizer to check if the arguments are the same. As per the docs, this by default will just stringify the arguments. This means that the hashref gets turned into its string representation, which is its location in memory. That address can differ between invocations of the function, so Memoize cannot reliably identify that you've passed the same args.
You may want to supply your own normalizing function to address the particular argument style that XML::Simple requires.
In addition, as per the caveats section in the docs, if your function returns a reference, then the same reference gets returned on every cache hit. This means that if at some point you modify the structure (I can't tell from the information given whether that happens), then that modified structure will be returned later.
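Both points can be seen in a minimal sketch. It does not use XML::Simple: parse_file is a hypothetical stand-in for an expensive parser, and keying the cache on the filename alone is an assumption that only holds if the options never vary for a given file.

```perl
use strict;
use warnings;
use Memoize;

my $calls = 0;

# Hypothetical stand-in for an expensive parser such as XMLin.
sub parse_file {
    my ($filename, %options) = @_;
    $calls++;
    return { file => $filename, items => [ 1, 2, 3 ] };
}

# Assumption: options never vary for a given file, so the
# filename alone is used as the cache key.
memoize('parse_file', NORMALIZER => sub { my ($filename) = @_; return $filename });

my $first  = parse_file('a.xml', KeyAttr => { Variable => '' });
my $second = parse_file('a.xml', KeyAttr => { Variable => '' });

# Both calls return the very same reference, so a change made
# through one is visible through the other.
push @{ $first->{items} }, 4;
my $shared_count = scalar @{ $second->{items} };
```

Here parse_file runs only once, and $second sees the element pushed through $first. If callers modify the parsed structure, each of them would need its own copy (for example via Storable's dclone) rather than the cached reference.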

Related

Pluggable/dynamic data processing/munging/transforming perl module?

Cross-posted from perlmonks:
I have to clean up some gross, ancient code at $work,
and before I try to make a new module I'd love to use an existing one if anyone knows of something appropriate.
At runtime I am parsing a file to determine what processing I need to do on a set of data.
If I were to write a module I would try to do it more generically (non-DBI-specific), but my exact use case is this:
I read a SQL file to determine the query to run against the database.
I parse comments at the top and determine that
column A needs to have a s/// applied,
column B needs to be transformed to look like a date of given format,
column C gets a sort of tr///.
Additionally things can be chained so that column D might s///, then say if it isn't 1 or 2, set it to 3.
So when fetching from the db the program applies the various (possibly stacked) transformations before returning the data.
Currently the code is a disgustingly large and difficult series of if clauses
processing hideously difficult to read or maintain arrays of instructions.
So what I'm imagining is perhaps an object that will parse those lines
(and additionally expose a functional interface),
stack up the list of processors to apply,
then be able to execute it on a passed piece of data.
Optionally there could be a name/category option,
so that one object could be used dynamically to stack processors only for the given name/category/column.
A traditionally contrived example:
$obj = $module->new();
$obj->parse("-- greeting:gsub: /hi/hello"); # don't say "hi"
$obj->parse("-- numbers:gsub: /\D//"); # digits only
$obj->parse("-- numbers:exchange: 1,2,3 one,two,three"); # then spell out the numbers
$obj->parse("-- when:date: %Y-%m-%d 08:00:00"); # format like a date, force to 8am
$obj->stack(action => 'gsub', name => 'when', format => '/1995/1996/'); # my company does not recognize the year 1995.
$cleaned = $obj->apply({greeting => "good morning", numbers => "t2", when => "2010116"});
Each processor (gsub, date, exchange) would be a separate subroutine.
Plugins could be defined to add more by name.
$obj->define("chew", \&CookieMonster::chew);
$obj->parse("column:chew: 3x"); # chew the column 3 times
So the obvious first question is, does anybody know of a module out there that I could use?
About the only thing I was able to find so far is Hash::Transform,
but since I would be determining which processing to do dynamically at runtime
I would always end up using the "complex" option and I'd still have to build the parser/stacker.
Is anybody aware of any similar modules or even a mildly related module that I might want to utilize/wrap?
If there's nothing generic out there for public consumption (surely mine is not the only one in the darkpan),
does anybody have any advice for things to keep in mind or interface suggestions or even other possible uses
besides munging the return of data from DBI, Text::CSV, etc?
If I end up writing a new module, does anybody have namespace suggestions?
I think something under Data:: is probably appropriate...
the word "pluggable" keeps coming to mind because my use case reminds me of PAM,
but I really don't have any good ideas...
Data::Processor::Pluggable ?
Data::Munging::Configurable ?
I::Chew::Data ?
First I'd try to place as much of the formatting as possible in the SQL queries.
Things like date format etc. definitely should be handled in SQL.
Off the top of my head, a module I know that could be used for your purpose is Data::FormValidator. Although it is mainly aimed at validating CGI parameters, it has the functionality you need: you can define filters and constraints and chain them in various ways. That doesn't mean there are no other modules for your purpose; I just don't know of any.
Or you can do what you already hinted at. You could define some sort of command classes and chain them on the various data inputs. I'd do something along these lines:
package MyDataProcessor;
use Moose;
has 'Transformations' => (
traits => ['Array'],
is => 'rw',
isa => 'ArrayRef[MyTransformer]',
handles => {
add_transformer => 'push',
}
);
has 'input' => (is => 'rw', isa => 'Str');
sub apply_transforms { }
package MyRegexTransformer;
use Moose;
extends 'MyTransformer';
has 'Regex' => (is => 'rw', isa => 'Str');
has 'Replacement' => (is => 'rw', isa => 'Str');
sub transform { }
# some other transformers
#
# somewhere else
#
#
my $processor = MyDataProcessor->new(input => 'Hello transform me');
my $tr = MyRegexTransformer->new(Regex => 'Hello', Replacement => 'Hi');
$processor->add_transformer($tr);
#...
$processor->apply_transforms;
I'm not aware of any data transform CPAN modules, so I've had to roll my own for work. It was significantly more complicated than this, but operated under a similar principle; it was basically a poor man's implementation of Informatica-style ETL sans the fancy GUI... the configuration was Perl hashes (Perl instead of XML since it allowed me to implement certain complex rules as subroutine references).
As for the namespace, I'd go for Data::Transform::*
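For comparison, the same stacking idea can be sketched without Moose, using a dispatch table of named processors. This is only an illustration: the stack and apply functions, the %processors table and the argument conventions are all made up here and do not correspond to any CPAN module.

```perl
use strict;
use warnings;

# Named processors: each receives the current value plus its configured arguments.
my %processors = (
    gsub => sub {
        my ($value, $pattern, $replacement) = @_;
        $value =~ s/$pattern/$replacement/g;
        return $value;
    },
    exchange => sub {
        my ($value, %map) = @_;
        return exists $map{$value} ? $map{$value} : $value;
    },
);

# Per-column list of [code, args...] steps, applied in the order they were stacked.
my %chains;

sub stack {
    my ($column, $action, @args) = @_;
    die "Unknown processor '$action'" unless $processors{$action};
    push @{ $chains{$column} }, [ $processors{$action}, @args ];
}

sub apply {
    my ($row) = @_;
    my %out = %$row;
    for my $column (keys %chains) {
        next unless exists $out{$column};
        for my $step (@{ $chains{$column} }) {
            my ($code, @args) = @$step;
            $out{$column} = $code->($out{$column}, @args);
        }
    }
    return \%out;
}

stack(greeting => 'gsub', qr/hi/, 'hello');
stack(numbers  => 'gsub', qr/\D/, '');
stack(numbers  => 'exchange', 1 => 'one', 2 => 'two', 3 => 'three');

my $cleaned = apply({ greeting => 'hi there', numbers => 't2' });
```

New processors are added by putting another code reference into %processors, which covers the "plugins by name" part of the question; parsing the SQL comments into stack() calls would be a separate step.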
Thanks to everyone for their thoughts.
The short version:
After trying to adapt a few existing modules I ended up abstracting my own: Sub::Chain.
It needs some work, but is doing what I need so far.
The long version:
(an excerpt from the POD)
=head1 RATIONALE
This module started out as Data::Transform::Named,
a named wrapper (like Sub::Chain::Named) around
Data::Transform (and specifically Data::Transform::Map).
As the module was nearly finished I realized I was using very little
of Data::Transform (and its documentation suggested that
I probably wouldn't want to use the only part that I was using).
I also found that the output was not always what I expected.
I decided that it seemed reasonable according to the likely purpose
of Data::Transform, and this module simply needed to be different.
So I attempted to think more abstractly
and realized that the essence of the module was not tied to
data transformation, but merely the succession of simple subroutine calls.
I then found and considered Sub::Pipeline
but needed to be able to use the same
named subroutine with different arguments in a single chain,
so it seemed easier to me to stick with the code I had written
and just rename it and abstract it a bit further.
I also looked into Rule::Engine which was beginning development
at the time I was searching.
However, like Data::Transform, it seemed more complex than what I needed.
When I saw that Rule::Engine was using [the very excellent] Moose
I decided to pass since I was doing work on a number of very old machines
with old distros and old perls and constrained resources.
Again, it just seemed to be much more than what I was looking for.
=cut
As for the "parse" method in my original idea/example,
I haven't found that to be necessary, and am currently using syntax like
$chain->append($sub, \@arguments, \%options)

Not Able to Set Class' Attributes in Role

First off, I'm not really sure how much information is necessary to include because I'm having a really hard time tracing the origin of this problem.
I have a Moose role with a subroutine that (along with a few other things) tries to set the attributes for a class like this:
$genre = Movie::Genre->new({
genreName => 'Drama',
genreID => '1'
});
The problem is, it doesn't. The dump of $genre immediately after, indicates that it's still empty:
$genre: bless( {}, 'Movie::Genre' )
Stranger still, when I execute THE EXACT SAME LINE in my test file, it works as expected with this dump:
$genre: bless( {
genreID => '1',
genreName => 'Drama'
}, 'Movie::Genre' )
I'm struggling to find what makes these two lines of code different, causing one to work and one to fail.
Any ideas as to what conditions would cause the first example to fail and allow the second to succeed? I'd be happy to provide more context if necessary. Thanks!
That line simply passes those parameters to the Movie::Genre constructor. It's up to that constructor to decide what to do with them.
It sounds like that call (in the role) is getting executed before the Movie::Genre class has acquired attributes named genreName and genreID. By default, Moose constructors ignore any parameters they don't recognize, so this doesn't generate a warning.
Your test file must be making that call after the attributes have been added to Movie::Genre.
We'd have to see more of the code to figure out exactly why this is happening.

Simulating aspects of static-typing in a duck-typed language

In my current job I'm building a suite of Perl scripts that depend heavily on objects. (using Perl's bless() on a Hash to get as close to OO as possible)
Now, for lack of a better way of putting this, most programmers at my company aren't very smart. Worse, they don't like reading documentation and seem to have a problem understanding other people's code. Cowboy coding is the game here. Whenever they encounter a problem and try to fix it, they come up with a horrendous solution that actually solves nothing and usually makes it worse.
This results in me, frankly, not trusting them with code written in a duck-typed language. As an example, I see too many problems with them not getting an explicit error for misusing objects. For instance, if type A has member foo, and they do something like instance->goo, they aren't going to see the problem immediately. It will return a null/undefined value, and they will probably waste an hour finding the cause, then end up changing something else because they didn't properly identify the original problem.
So I'm brainstorming for a way to keep my scripting language (its rapid development is an advantage) but give an explicit error message when an object isn't used properly. I realize that since there isn't a compile stage or static typing, the error will have to be at run time. I'm fine with this, so long as the user gets a very explicit notice saying "this object doesn't have X"
As part of my solution, I don't want it to be required that they check if a method/variable exists before trying to use it.
Even though my work is in Perl, I think this can be language agnostic.
If you have any shot of adding modules to use, try Moose. It provides pretty much all the features you'd want in a modern programming environment, and more. It does type checking, excellent inheritance, has introspection capabilities, and with MooseX::Declare, one of the nicest interfaces for Perl classes out there. Take a look:
use MooseX::Declare;
class BankAccount {
has 'balance' => ( isa => 'Num', is => 'rw', default => 0 );
method deposit (Num $amount) {
$self->balance( $self->balance + $amount );
}
method withdraw (Num $amount) {
my $current_balance = $self->balance();
( $current_balance >= $amount )
|| confess "Account overdrawn";
$self->balance( $current_balance - $amount );
}
}
class CheckingAccount extends BankAccount {
has 'overdraft_account' => ( isa => 'BankAccount', is => 'rw' );
before withdraw (Num $amount) {
my $overdraft_amount = $amount - $self->balance();
if ( $self->overdraft_account && $overdraft_amount > 0 ) {
$self->overdraft_account->withdraw($overdraft_amount);
$self->deposit($overdraft_amount);
}
}
}
I think it's pretty cool, myself. :) It's a layer over Perl's object system, so it works with stuff you already have (basically.)
With Moose, you can create subtypes really easily, so you can make sure your input is valid. Lazy programmers agree: with so little that has to be done to make subtypes work in Moose, it's easier to do them than not! (from Cookbook 4)
subtype 'USState'
=> as Str
=> where {
( exists $STATES->{code2state}{ uc($_) }
|| exists $STATES->{state2code}{ uc($_) } );
};
And Tada, the USState is now a type you can use! No fuss, no muss, and just a small amount of code. It'll throw an error if it's not right, and all the consumers of your class have to do is pass a scalar with that string in it. If it's fine (which it should be...right? :) ) They use it like normal, and your class is protected from garbage. How nice is that!
Moose has tons of awesome stuff like this.
Trust me. Check it out. :)
In Perl,
make it required that use strict and use warnings are on in 100% of the code
You can try to make almost-private member variables by creating closures. A very good example is the "Private Member Variables, Sort of" section in http://www.usenix.org/publications/login/1998-10/perl.html . They are not 100% private, but it is fairly un-obvious how to access them unless you really know what you're doing (which requires your co-workers to read your code and do research to find out how).
If you don't want to use closures, the following approach works somewhat well:
Make all of your object member variables (aka object hash keys in Perl) wrapped in accessors. There are ways to do this efficiently from coding standards POV. One of the least safe is Class::Accessor::Fast. I'm sure Moose has better ways but I'm not that familiar with Moose.
Make sure to "hide" actual member variables in private-convention names, e.g. $object->{'__private__var1'} would be the member variable, and $object->var1() would be a getter/setter accessor.
NOTE: For the last, Class::Accessor::Fast is bad since its member variables share names with accessors. But you can have very easy builders that work just like Class::Accessor::Fast and create key values such as $obj->{'__private__foo'} for "foo".
This won't prevent them shooting themselves in the foot, but WILL make it a lot harder to do so.
In your case, if they use $obj->goo or $obj->goo(), they WOULD get a runtime error, at least in Perl.
They could of course go out of their way to do $obj->{'__private__goo'}, but if they do the gonzo cowboy crap due to sheer laziness, the latter is a lot more work than doing the correct $obj->foo().
You can also have a scan of code-base which detects $object->{"_ type strings, though from your description that might not work as a deterrent that much.
You can use Class::InsideOut or Object::InsideOut which give you true data privacy. Rather than storing data in a blessed hash reference, a blessed scalar reference is used as a key to lexical data hashes. Long story short, if your co-workers try $obj->{member} they'll get a run time error. There's nothing in $obj for them to grab at and no easy way to get at the data except through accessors.
Here is a discussion of the inside-out technique and various implementations.
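The accessor approach described above can be sketched in plain Perl. The Account class, its field names and the "__private__" key prefix are illustrative choices, not something from the original code base.

```perl
use strict;
use warnings;

package Account;

# Generate one getter/setter per field; the data lives under a
# private-convention key so it does not share a name with the accessor.
for my $field (qw(owner balance)) {
    no strict 'refs';
    *{"Account::$field"} = sub {
        my $self = shift;
        $self->{"__private__$field"} = shift if @_;
        return $self->{"__private__$field"};
    };
}

sub new {
    my ($class, %args) = @_;
    my $self = bless {}, $class;
    # Going through the accessors means an unknown field name dies here too.
    $self->$_($args{$_}) for keys %args;
    return $self;
}

package main;

my $account = Account->new(owner => 'alice', balance => 10);

my $ok  = eval { $account->balance };
# Calling a method that was never defined is a runtime error, not a silent undef.
my $err = eval { $account->goo; 1 } ? '' : $@;
```

The error in $err names both the missing method and the package, which is the kind of explicit "this object doesn't have X" message the question asks for. It only helps as long as people go through the accessors instead of reaching into the hash.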

Perl - Calling subclass constructor from superclass (OO)

This may turn out to be an embarrassingly stupid question, but better than potentially creating embarrassingly stupid code. :-) This is an OO design question, really.
Let's say I have an object class 'Foos' that represents a set of dynamic configuration elements, which are obtained by querying a command on disk, 'mycrazyfoos -getconfig'. Let's say that there are two categories of behavior that I want 'Foos' objects to have:
Existing ones: query the ones that exist in the command output I just mentioned (/usr/bin/mycrazyfoos -getconfig), and make modifications to them by shelling out commands.
Create new ones that don't exist; new 'crazyfoos', using a complex set of /usr/bin/mycrazyfoos commands and parameters. Here I'm not really just querying, but actually running a bunch of system() commands. Effecting changes.
Here's my class structure:
Foos.pm
package Foos, which has a new({name => 'myfooname'}) constructor that takes a 'crazyfoo NAME' and then queries the existence of that NAME to see if it already exists (by shelling out and running the mycrazyfoos command above). If that crazyfoo already exists, return a Foos::Existing object. Any changes to this object require shelling out, running commands and getting confirmation that everything ran okay.
If this is the way to go, then the new() constructor needs to have a test to see which subclass constructor to use (if that even makes sense in this context). Here are the subclasses:
Foos/Existing.pm
As mentioned above, this is for when a Foos object already exists.
Foos/Pending.pm
This is an object that will be created if, in the above, the 'crazyfoo NAME' doesn't actually exist. In this case, the new() constructor above will be checked for additional parameters, and it will go ahead and, when called using ->create() shell out using system() and create a new object... possibly returning an 'Existing' one...
OR
As I type this out, I am realizing that perhaps it's better to have a single:
(an alternative arrangement)
Foos class, that has a
->new() that takes just a name
->create() that takes additional creation parameters
->delete(), ->change() and other methods that affect ones that exist; that will have to just be checked dynamically.
So here we are, two main directions to go with this. I'm curious which would be the more intelligent way to go.
In general it's a mistake (design-wise, not syntax-wise) for the new method to return anything but a new object. If you want to sometimes return an existing object, call that method something else, e.g. new_from_cache().
I also find it odd that you're splitting up this functionality (constructing a new object, and returning an existing one) not just into separate namespaces, but also different objects. So in general, you're closer with your second approach, but you can still have the main constructor (new) handle a variety of arguments:
package Foos;
use strict;
use warnings;
sub new
{
my ($class, %args) = @_;
if ($args{name})
{
# handle the name => value option
}
if ($args{some_other_option})
{
# ...
}
my $this = {
# fill in any fields you need...
};
return bless $this, $class;
}
sub new_from_cache
{
my ($class, %args) = @_;
# check if the object already exists...
# if not, create a new object
return $class->new(%args);
}
Note: I don't want to complicate things while you're still learning, but you may also want to look at Moose, which takes care of a lot of the gory details of construction for you, and the definition of attributes and their accessors.
It is generally speaking a bad idea for a superclass to know about its subclasses, a principle which extends to construction.[1] If you need to decide at runtime what kind of object to create (and you do), create a fourth class to have just that job. This is one kind of "factory".
Having said that in answer to your nominal question, your problem as described does not seem to call for subclassing. In particular, you apparently are going to be treating the different classes of Foos differently depending on which concrete class they belong to. All you're really asking for is a unified way to instantiate two separate classes of objects.
So how's this suggestion[3]: Make Foos::Existing and Foos::Pending two separate and unrelated classes and provide (in Foos) a method that returns the appropriate one. Don't call it new; you're not making a new Foos.
If you want to unify the interfaces so that clients don't have to know which kind they're talking about, then we can talk subclassing (or better yet, delegation to a lazily-created and -updated Foos::Handle).
[1]: Explaining why this is true is a subject hefty enough for a book[2], but the short answer is that it creates a dependency cycle between the subclass (which depends on its superclass by definition) and the superclass (which is being made to depend on its subclass by a poor design decision).
[2]: Lakos, John. (1996). Large-scale C++ Software Design. Addison-Wesley.
[3]: Not a recommendation, since I can't get a good enough handle on your requirements to be sure I'm not shooting fish in a dark ocean.
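A minimal sketch of that arrangement, with the decision moved into a separate factory class. Foos::Factory, for_name and the %known lookup are hypothetical; %known stands in for querying the mycrazyfoos command.

```perl
use strict;
use warnings;

package Foos::Existing;
sub new  { my ($class, %args) = @_; return bless { name => $args{name} }, $class }
sub name { $_[0]{name} }

package Foos::Pending;
sub new { my ($class, %args) = @_; return bless { name => $args{name} }, $class }

# A real create() would shell out; here it only hands back an Existing object.
sub create {
    my ($self) = @_;
    return Foos::Existing->new(name => $self->{name});
}

package Foos::Factory;

# Stand-in for the names reported by the external command.
my %known = map { $_ => 1 } qw(alpha beta);

# The only place that knows about both concrete classes.
sub for_name {
    my ($class, $name) = @_;
    return $known{$name}
        ? Foos::Existing->new(name => $name)
        : Foos::Pending->new(name => $name);
}

package main;

my $existing = Foos::Factory->for_name('alpha');
my $pending  = Foos::Factory->for_name('gamma');
my $created  = $pending->create;
```

Neither concrete class knows about the other's construction rules, and nothing called new ever returns an object of a different class.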
It is also a factory pattern (bad in Perl) if the object's constructor will return an instance blessed into more than one package.
I would create something like this. If the name exists then is_created is set to 1, otherwise it is set to 0. I would merge ::Pending and ::Existing together, and if the object isn't created, just put that into the default for _object; the check happens lazily. Also, Foo->delete() and Foo->change() will defer to the instance in _object.
package Foo;
use Moose;
has 'name' => ( is => 'ro', isa => 'Str', required => 1 );
has 'is_created' => (
is => 'ro'
, isa => 'Bool'
, init_arg => undef
, default => sub {
stuff_if_exists ? 1 : 0
}
);
has '_object' => (
isa => 'Object'
, is => 'ro'
, lazy => 1
, init_arg => undef
, default => sub {
my $self = shift;
$self->is_created
? Foo->new
: Bar->new
}
, handles => [qw/delete change/]
);
Interesting answers! I am digesting it as I try out different things in code.
Well, I have another variation of the same question -- the same question, mind you, just a different problem to the same class:subclass creation issue!
This time:
This code is an interface to a command line that has a number of different complex options. I told you about /usr/bin/mycrazyfoos before, right? Well, what if I told you that that binary changes based on versions, and sometimes it completely changes its underlying options. And that this class we're writing, it has to be able to account for all of these things. The goal (or perhaps idea) is to do: (perhaps called FROM the Foos class we were discussing above):
Foos::Commandline, which has as subclasses different versions of the underlying '/usr/bin/mycrazyfoos' command.
Example:
my $fcommandobj = new Foos::Commandline;
my @raw_output_list = $fcommandobj->getlist();
my $result_dance = $fcommandobj->dance();
where 'getlist' and 'dance' are version-dependent. I thought about doing this:
package Foos::Commandline;
new (
#Figure out some clever way to decide what version user has
# (automagically)
# And call appropriate subclass? Wait, you all are telling me this is bad OO:
# if v1.0.1 (new Foos::Commandline::v1.0.1.....
# else if v1.2 (new Foos::Commandline::v1.2....
#etc
}
then
package Foos::Commandline::v1.0.1;
sub getlist ( eval... system ("/usr/bin/mycrazyfoos", "-getlistbaby"
# etc etc
and (different .pm files, in subdir of Foos/Commandline)
package Foos::Commandline::v1.2;
sub getlist ( eval... system ("/usr/bin/mycrazyfoos", "-getlistohyeahrightheh"
#etc
Make sense? I expressed in code what I'd like to do, but it just doesn't feel right, particularly in light of what was discussed in the above responses. What DOES feel right is that there should be a generic interface / superclass to Commandline... and that different versions should be able to override it. Right? Would appreciate a suggestion or two on that. Gracias.
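One way this could look, as a rough sketch only: a base class holds the shared interface, each version overrides just the part that differs, and a separate factory picks the class. Package names cannot contain dots, so the version classes are spelled V1_0_1 and V1_2 here; those names, getlist_args, getlist_command and for_version are all made up for illustration, and the command is only built, not run.

```perl
use strict;
use warnings;

package Foos::Commandline;
sub new { my ($class) = @_; return bless {}, $class }

# Subclasses override this to supply their version-specific flags.
sub getlist_args { die ref(shift) . " must implement getlist_args" }

# Shared behaviour: build the command line. A real version would pass it to system().
sub getlist_command {
    my ($self) = @_;
    return ('/usr/bin/mycrazyfoos', $self->getlist_args);
}

package Foos::Commandline::V1_0_1;
our @ISA = ('Foos::Commandline');
sub getlist_args { return ('-getlistbaby') }

package Foos::Commandline::V1_2;
our @ISA = ('Foos::Commandline');
sub getlist_args { return ('-getlistohyeahrightheh') }

package Foos::Commandline::Factory;

# Maps a detected version string to the class that handles it.
my %class_for = (
    '1.0.1' => 'Foos::Commandline::V1_0_1',
    '1.2'   => 'Foos::Commandline::V1_2',
);

sub for_version {
    my ($class, $version) = @_;
    my $impl = $class_for{$version} or die "Unsupported version '$version'";
    return $impl->new;
}

package main;

my @command = Foos::Commandline::Factory->for_version('1.2')->getlist_command;
```

How the installed version gets detected is left out; the factory only needs the resulting version string.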

What is the difference between new Some::Class and Some::Class->new() in Perl?

Many years ago I remember a fellow programmer counselling this:
new Some::Class; # bad! (but why?)
Some::Class->new(); # good!
Sadly now I cannot remember the/his reason why. :( Both forms will work correctly even if the constructor does not actually exist in the Some::Class module but instead is inherited from a parent somewhere.
Neither of these forms is the same as Some::Class::new(), which will not pass the name of the class as the first parameter to the constructor -- so this form is always incorrect.
Even if the two forms are equivalent, I find Some::Class->new() to be much more clear, as it follows the standard convention for calling a method on a module, and in perl, the 'new' method is not special - a constructor could be called anything, and new() could do anything (although of course we generally expect it to be a constructor).
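The difference between a method call and a fully qualified function call can be checked with a small sketch. The whoami method and Some::Subclass are hypothetical names used only to show what arrives as the first argument.

```perl
use strict;
use warnings;

package Some::Class;

# Report what was received as the first argument.
sub whoami {
    my ($class) = @_;
    return defined $class ? $class : '(nothing passed)';
}

package Some::Subclass;
our @ISA = ('Some::Class');   # inherits whoami

package main;

# Method call: resolved through inheritance, class name passed as first argument.
my $as_method = Some::Subclass->whoami;

# Plain function call: no inheritance lookup and no class name passed.
my $as_function = Some::Class::whoami();
```

With the arrow form the inherited method receives 'Some::Subclass'; with the function form it receives nothing at all, which is why a constructor called that way cannot know which class to bless into.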
Using new Some::Class is called "indirect" method invocation, and it's bad because it introduces some ambiguity into the syntax.
One reason it can fail is if you have an array or hash of objects. You might expect
dosomethingwith $hashref->{obj}
to be equal to
$hashref->{obj}->dosomethingwith();
but it actually parses as:
$hashref->dosomethingwith->{obj}
which probably isn't what you wanted.
Another problem is if there happens to be a function in your package with the same name as a method you're trying to call. For example, what if some module that you use'd exported a function called dosomethingwith? In that case, dosomethingwith $object is ambiguous, and can result in puzzling bugs.
Using the -> syntax exclusively eliminates these problems, because the method and what you want the method to operate upon are always clear to the compiler.
See Indirect Object Syntax in the perlobj documentation for an explanation of its pitfalls. freido's answer covers one of them (although I tend to avoid that with explicit parens around my function calls).
Larry once joked that it was there to make the C++ programmers feel happy about new, and although people will tell you not to ever use it, you're probably doing it all the time. Consider this:
print FH "Some message";
Have you ever wondered why there was no comma after the filehandle? And there's no comma after the class name in the indirect object notation? That's what's going on here. You could rewrite that as a method call on the filehandle:
FH->print( "Some message" );
You may have experienced some weirdness in print if you do it wrong. Putting a comma after the explicit file handle turns it into an argument:
print FH, "some message"; # GLOB(0xDEADBEEF)some message
Sadly, we have this goofiness in Perl. Not everything that got into the syntax was the best idea, but that's what happens when you pull from so many sources for inspiration. Some of the ideas have to be the bad ones.
The indirect object syntax is frowned upon, for good reasons, but that's got nothing to do with constructors. You're almost never going to have a new() function in the calling package. Rather, you should use Package->new() for two other (better?) reasons:
As you said, all other class methods take the form Package->method(), so consistency is a Good Thing
If you're supplying arguments to the constructor, or you're taking the result of the constructor and immediately calling methods on it (if e.g. you don't care about keeping the object around), it's simpler to say e.g.
$foo = Foo->new(type => 'bar', style => 'baz');
Bar->new->do_stuff;
than
$foo = new Foo(type => 'bar', style => 'baz');
(new Bar)->do_stuff;
Another problem is that new Some::Class happens at run time. If there is an error and your testing never branches to this statement, you never know it until it happens in production. It is better to use Some::Class->new unless you are doing dynamic programming.