Which KiokuDB backend is suitable for my serialization needs?

I use KiokuDB to store a couple of Moose objects and a couple of simple data structures (hashes and arrays).
I do not need any fancy searches, transactions, etc., simply the ability to fetch (look up) an object. Also, as soon as I'm done creating the DB, it can be set read-only. No changes will ever be made to it.
The main (only?) reason I use KiokuDB is to keep the object graph.
The largest object, which dominates the total size of the DB, is a Moose object that has a relatively large array in it (let's call this object large_obj). Previously, I stored large_obj (alone) using Storable + PerlIO::gzip or even JSON + PerlIO::gzip. That worked fine and I was very happy with the results (gzip compressed the store file to about 5% of its original size).
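For reference, that earlier approach looked roughly like this (a sketch; the file name and variable are made up):
use strict;
use warnings;
use Storable qw(nstore_fd fd_retrieve);
use PerlIO::gzip;

# Store: write the frozen object through a gzip layer
open my $out, '>:gzip', 'large_obj.storable.gz' or die $!;
nstore_fd($large_obj, $out);
close $out;

# Fetch it back
open my $in, '<:gzip', 'large_obj.storable.gz' or die $!;
my $restored = fd_retrieve($in);
close $in;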
There is another, smaller Moose object, which is basically an array of some 20-30k small Moose objects.
Now, after migrating to KiokuDB, I first used the simple Hash backend, then dumped it to a file (using Cmd) with PerlIO::gzip again. This worked very well in cases where large_obj was relatively small, but once it got larger I just got out-of-memory errors. I guess the Hash backend is not suitable for large objects.
I then tried the recommended Berkeley backend, although it seems like overkill (as mentioned, I don't really need all the fancy DB capabilities). It is much slower than the original Storable + PerlIO::gzip solution, it takes far more space, and it also runs out of memory for larger objects! (I use Ubuntu with 3 GB of RAM.)
I also tried the Files backend, but it fails with:
Too many open files at /usr/local/perls/perl-5.12.2/lib/site_perl/5.12.2/Directory/Transactional.pm line 130.
(in cleanup) Too many open files at /usr/local/perls/perl-5.12.2/lib/site_perl/5.12.2/Directory/Transactional.pm line 130.
Do you have any suggestions on how I can store my objects in a way that is both space-efficient and maintains the object graph?

Implement your own backend using Data::Serializer:
package KiokuDB::Backend::Serialize::Data::Serializer;
use Moose::Role;
use Data::Serializer;
use namespace::clean -except => 'meta';

with qw(
    KiokuDB::Backend::Serialize
    KiokuDB::Backend::Role::UnicodeSafe
    KiokuDB::Backend::Role::BinarySafe
);

has '_serializer' => (
    is       => 'ro',
    isa      => 'Data::Serializer',
    required => 1,
    lazy     => 1,
    default  => sub {
        Data::Serializer->new(
            serializer => 'FreezeThaw', # Storable, FreezeThaw, Data::Denter, Config::General, YAML, PHP::Serialization, XML::Dumper, or Data::Dumper
            digester   => 'MD5',        # See http://search.cpan.org/~gaas/Digest-1.16/Digest.pm#Digest_speed
            compress   => 1,
            compressor => 'Compress::Zlib', # Compress::Zlib or Compress::PPMd
        );
    },
);

sub serialize {
    my ( $self, $entry ) = @_;
    return $self->_serializer->serialize($entry);
}

sub deserialize {
    my ( $self, $blob ) = @_;
    return $self->_serializer->deserialize($blob);
}

sub serialize_to_stream {
    my ( $self, $fh, $entry ) = @_;
    $self->_serializer->store( $entry, $fh );
}

sub deserialize_from_stream {
    my ( $self, $fh ) = @_;
    $self->_serializer->retrieve($fh);
}

__PACKAGE__
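Before wiring this into KiokuDB, the Data::Serializer settings can be sanity-checked on their own. A minimal standalone sketch (the serializer choice here is just an example; Compress::Zlib must be installed for compress => 1 to work):
use strict;
use warnings;
use Data::Serializer;

my $ser = Data::Serializer->new(
    serializer => 'Storable',
    compress   => 1,
);

# Round-trip a structure through the serializer
my $blob = $ser->serialize({ foo => [ 1 .. 10 ] });
my $data = $ser->deserialize($blob);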

Related

Merging two yml files does not handle duplicates?

I am trying to merge two yml files using the Hash::Merge Perl module, and to dump the result to a yml file using Dump from the YAML module.
use strict;
use warnings;
use Hash::Merge qw( merge );
Hash::Merge::set_behavior('RETAINMENT_PRECEDENT');
use File::Slurp qw(write_file);
use YAML;
my $yaml1 = $ARGV[0];
my $yaml2 = $ARGV[1];
my $yaml_output = $ARGV[2];
my $clkgrps = &YAML::LoadFile($yaml1);
my $clkgrps1 = &YAML::LoadFile($yaml2);
my $clockgroups = merge($clkgrps1, $clkgrps);
my $out_yaml = Dump $clockgroups;
write_file($yaml_output, { binmode => ':raw' }, $out_yaml);
After merging the yml files I see duplicate entries, i.e. content that is the same in the two input files appears twice. While merging, the module treats them as different entries. Is there a built-in way to handle duplicates?
The data structures obtained from the YAML files generally contain keys whose values are arrayrefs of hashrefs. In your test case that's the arrayref for the key test.
Then a tool like Hash::Merge can only add the hashrefs to the arrayref belonging to the same key; it is not meant to compare array elements, as there are no general criteria for that. So you need to do this yourself in order to prune duplicates, or apply any specific rules of your choice to the data.
One way to handle this is to serialize (stringify) the complex data structures in each arrayref that may contain duplicates, so that they can be used as keys of a hash. That is a standard way to handle duplicates (O(1) lookup, albeit possibly with a large constant).
There are a number of ways to serialize data in Perl. I'd recommend JSON::XS, as a very fast tool with output that can be used by any language and tool. (But please research others, of course; one may suit your precise needs better.)
A simple complete example, using your test cases:
use strict;
use warnings;
use feature 'say';
use Data::Dump qw(dd pp);
use YAML;
use JSON::XS;
use Hash::Merge qw( merge );
#Hash::Merge::set_behavior('RETAINMENT_PRECEDENT');  # irrelevant here

die "Usage: $0 in-file1 in-file2 output-file\n" if @ARGV != 3;
my ($yaml1, $yaml2, $yaml_out) = @ARGV;

my $hr1 = YAML::LoadFile($yaml1);
my $hr2 = YAML::LoadFile($yaml2);

my $merged = merge($hr2, $hr1);
#say "merged: ", pp $merged;

for my $key (keys %$merged) {
    # The same keys get overwritten
    my %uniq = map { encode_json $_ => 1 } @{$merged->{$key}};

    # Overwrite the arrayref with the one without dupes
    $merged->{$key} = [ map { decode_json $_ } keys %uniq ];
}

dd $merged;

# Save the final structure...
More complex data structures require a more judicious traversal; consider using a tool for that.
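For nested structures, one way is a recursive walk that dedupes every arrayref it finds. A plain-Perl sketch (canonical JSON so identical structures always stringify identically, and allow_nonref so plain scalars in arrays don't make the encoder die):
use JSON::XS;

my $json = JSON::XS->new->canonical->allow_nonref;

sub dedup_arrays {
    my ($data) = @_;
    if (ref $data eq 'HASH') {
        dedup_arrays($_) for values %$data;
    }
    elsif (ref $data eq 'ARRAY') {
        # Keep the first occurrence of each serialized form
        my %seen;
        @$data = grep { !$seen{ $json->encode($_) }++ } @$data;
        dedup_arrays($_) for @$data;
    }
    return $data;
}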
With files as shown in the question this prints
{
  test => [
    { directory => "LIB_DIR", name => "ObsSel.ktc", project => "TOT" },
    { directory => "MODEL_DIR", name => "pipe.v", project => "TOT" },
    {
      directory => "PCIE_LIB_DIR",
      name => "pciechip.ktc",
      project => "PCIE_MODE",
    },
    { directory => "NAME_DIR", name => "fame.v", project => "SINGH" },
    { directory => "TREE_PROJECT", name => "Syn.yml", project => "TOT" },
  ],
}
(I use Data::Dump to show complex data, for its simplicity and default compact output.)
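To actually save the final structure (the step left as a comment in the code above), the YAML module's DumpFile can be used; a sketch:
use YAML qw(DumpFile);

# Write the deduplicated structure to the output file
DumpFile($yaml_out, $merged);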
If there are issues with serializing and comparing entire structures consider using a digest (checksum, hashing) of some sort.
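For instance, the keys can be MD5 digests of a canonical JSON encoding rather than the full serialized strings. A sketch, reusing $merged and $key from the example above:
use Digest::MD5 qw(md5_hex);
use JSON::XS;

# Canonical ordering so identical structures always encode identically
my $json = JSON::XS->new->canonical;

# Key by digest and keep the original reference as the value,
# so no decode_json round-trip is needed afterwards
my %uniq = map { md5_hex( $json->encode($_) ) => $_ } @{ $merged->{$key} };
$merged->{$key} = [ values %uniq ];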
Another option altogether would be to compare data structures as they are in order to resolve duplicates, by hand. For comparison of complex data structures I like to use Test::More, which works very nicely for mere comparisons outside of any testing. But there are dedicated tools as well of course, like Data::Compare.
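With Data::Compare, for example, such a by-hand check is a one-liner; a sketch with made-up variables:
use Data::Compare;

# Compare() returns true if the two structures are deeply identical
if ( Compare($item1, $item2) ) {
    # duplicate found; keep only one
}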
Finally, instead of manually processing the result of a naive merge, like above, one can code the desired behavior using Hash::Merge::add_behavior_spec and then have the module do it all. For specific examples of how to use this feature see for instance this post, this post, and this post.
Note that in this case you still write all the code to do the job, like above, but the module does take some of the mechanics off your hands.

Copy on Write for References

Perl currently supports Copy on Write (CoW) for scalar variables; however, it doesn't appear to have anything for hashrefs and arrayrefs.
Perl does, however, have subroutines that modify variable internals, like weaken, so I'm guessing a solution might exist.
I have a situation where I have a large structure I'm returning from a package which keeps an internal state of this large structure. I want to ensure that if either the returned reference or the internal reference (which are both currently the same reference) is modified, I end up with a copy-on-write situation: the data the references point to is copied, the modification is applied to the copy, and the reference used to modify the data is updated to point to the new data.
package SomePackage;
use Moose;

has some_large_internal_variable_ref => (
    'is'  => 'rw',
    'isa' => 'HashRef',
);

sub some_operation {
    my ($self) = @_;
    $self->some_large_internal_variable_ref({
        # create some large result that is different every time
    });
}

sub get_result {
    my ($self) = @_;
    return $self->some_large_internal_variable_ref;
}

1;
use strict;
use warnings;
use SomePackage;
use Test::More;

# Situation 1, where the internally stored reference is modified.
# This will pass!
my $package = SomePackage->new();
$package->some_operation();
my $result1 = $package->get_result();
$package->some_operation();
my $result2 = $package->get_result();
isnt($result1, $result2, "These two references should no longer be the same");

# Situation 2, where the externally stored reference is modified.
# This will fail.
$package = SomePackage->new();
$package->some_operation();
$result1 = $package->get_result();
$result1->{foo} = "bar";
$result2 = $package->get_result();
isnt($result1, $result2, "These two references should no longer be the same");

done_testing;
I'm trying to avoid cloning the value that get_result returns, as that would double memory usage.
I'm hoping there is some weaken-like function I can call on the variable to indicate that, should a modification be made, it should behave with copy-on-write semantics.

Why can't I run this Perl code?

While following this tutorial:
https://www.codeproject.com/Articles/3152/Perl-Object-Oriented-Programming
I am failing to see where the module Address.pm is. Did I miss something, does the article have an error, or do I have a misunderstanding when one of the modules says 'use Address;'?
mac1:moduleTEST1 user1$ ./Employee.pl
Can't locate object method "new" via package "Address" (perhaps you forgot to load "Address"?) at ./Employee.pl line 16.
mac1:moduleTEST1 user1$
The tutorial is outdated and rather useless. Specifically, it is much worse than the documentation which comes with Perl. Use perldoc perltoc to get a table of contents, and read everything at least once.
See perldoc perlootut and perldoc perlobj.
package Address;
use strict;
use warnings;

sub new {
    my $class = shift;
    my $args  = shift;
    my %self  = map +($_ => $args->{$_}), qw( street city state zip );
    bless \%self => $class;
}

sub street {
    my $self = shift;
    if ( @_ ) {
        $self->{street} = $_[0];
        return;
    }
    return $self->{street};
}

# Ditto for the rest of the accessors. There are ways to cut down
# the boilerplate once you learn the basics.
#
# ...

__PACKAGE__
__END__
You use this module like this:
my $address = Address->new({
    street => '123 E. Any St',
    city   => 'Any Town',
    state  => 'AY',
    zip    => '98765',
});
Of course, there are a lot of things missing from this little demo. For example, the accessor, as written, allows you to change the state of the object. Immutable objects are easier to reason about, so you might want to disallow that by changing it to:
sub street { $_[0]->{street} }
It also allows you to assign any value you want to fields like state and zip. So you may want to validate those values in the constructor, ensure that only values for the fields of the class are passed, that all the values passed are defined, etc.
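For example, here is a sketch of the constructor with such checks added (the error wording is just illustrative):
sub new {
    my ($class, $args) = @_;
    my @fields = qw( street city state zip );

    # Reject unknown fields...
    my %known = map +($_ => 1), @fields;
    for my $k (keys %$args) {
        die "unknown field '$k'\n" unless $known{$k};
    }

    # ...and require every known field to be defined
    for my $f (@fields) {
        die "missing or undefined field '$f'\n" unless defined $args->{$f};
    }

    my %self = map +($_ => $args->{$_}), @fields;
    bless \%self => $class;
}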
At the end of that process, you may decide it doesn't make sense to keep writing boilerplate, and use Moo or Moose to avail yourself of a richer set of features.
Even then, it helps to know what's happening under the hood.

Moose traits for multidimensional data structures

Breaking out the handling of an internal variable from calls on the variable into calls on the object is easy using the Attribute::Native::Trait handlers. However, how do you deal with multiple data structures? I can't think of any way to handle something like the below without making the stash an arrayref of My::Stash::Attribute objects, which in turn contain an arrayref of My::Stash::Subattribute objects, which contain an arrayref of My::Stash::Instance objects. That means a lot of munging and coercing of the data each level down the stack as I sort things out.
Yes, I can store the items as a flat array and then grep it on every read, but in a situation with frequent reads, where most calls are reads, grepping a large array on every read is a lot of processing compared to just indexing the items internally in the way they are needed.
Is there a MooseX extension that can handle this sort of thing via handlers creating methods, instead of just treating the read accessor as the hashref it is and modifying it in place? Or am I just best off forgetting about doing things like this via method call and just doing it as-is?
use strict;
use warnings;
use 5.010;

package My::Stash;
use Moose;

has '_stash' => (is => 'ro', isa => 'HashRef', default => sub { {} });

sub add_item {
    my $self = shift;
    my ($item) = @_;
    push @{$self->_stash->{$item->{property}}{$item->{sub}}}, $item;
}

sub get_items {
    my $self = shift;
    my ($property, $subproperty) = @_;
    return @{$self->_stash->{$property}{$subproperty}};
}

package main;
use Data::Printer;

my $stash = My::Stash->new();
for my $property (qw/foo bar baz/) {
    for my $subproperty (qw/fazz fuzz/) {
        for my $instance (1 .. 2) {
            $stash->add_item({ property => $property, sub => $subproperty, instance => $instance });
        }
    }
}
p($_) for $stash->get_items(qw/baz fuzz/);
These are very esoteric:
sub add_item {
    my $self = shift;
    my ($item) = @_;
    push @{$self->_stash->{$item->{property}}{$item->{sub}}}, $item;
}
So add_item takes a hashref $item and pushes it onto an array in the stash, indexed by its own keys property and sub.
sub get_items {
    my $self = shift;
    my ($property, $subproperty) = @_;
    return @{$self->_stash->{$property}{$subproperty}};
}
Conversely, get_items takes two arguments, a $property and a $subproperty, and retrieves the appropriate elements of an array in a hash-of-hashes.
So here are the concerns in making this a MooseX extension:
There is no way in a non-magic hash to insist that only hashes are values; this would be required for predictable behavior of the trait. As in your example, what would you expect if _stash->{$property} resolved to a scalar?
add_item has its depth hardcoded to property and sub.
Returning arrays is bad; it requires all of the elements to be pushed onto the stack (return refs instead).
Now firstly, I don't see why a regular Moose Hash trait couldn't accept array refs for both the setter and getter.
->set( [qw/ key1 key2/], 'foo' )
->get( [qw/ key1 key2/] )
This would certainly make your job easier, if your destination wasn't an array:
sub add_item {
    my ( $self, $hash ) = @_;
    $self->set( [ $hash->{property}, $hash->{subproperty} ], $hash );
}

# get_items works as is, just pass in an ArrayRef
# i.e., ->get([$property, $subproperty])
When it comes to having the destination be an array rather than a hash slot, I assume you'd just have to build that into a totally different helper in the trait, say push_to_array_or_create([$property, $subproperty], $value). I'd still just retrieve it with the fictional get helper specified above. auto_deref-type functionality is a pretty bad idea.
In short, ask a core developer what they would think about extending set and get in this context to accept ArrayRefs as keys and act appropriately. I can't imagine there is a useful default for ArrayRef keys (I don't think regular stringification would be too useful).
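For contrast, the single-level case that the existing native Hash trait already handles looks like this; it is only the nested, multi-key form of set and get that is missing (a minimal sketch):
package My::FlatStash;
use Moose;

has '_stash' => (
    traits  => ['Hash'],
    is      => 'ro',
    isa     => 'HashRef',
    default => sub { {} },
    handles => {
        set_item => 'set',    # $obj->set_item($key => $value)
        get_item => 'get',    # $obj->get_item($key)
        has_item => 'exists',
    },
);

1;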

Catalyst: add methods to a DBI model

How would I add methods to my DBI model? I have a Catalyst::Model::DBI based model and I'd like to call something like $c->model('DBI')->my_method(), but $c->model('DBI') doesn't return a ref to my object; rather, I get back a DBI::db. I can get the dbh and operate on that, but I have a bunch of utility methods that I'd prefer to add here.
I haven’t seen your code so I can’t know for sure what you’re doing, but if you’re using Catalyst::Model::DBI you’re doing something wrong. The raw model does return the object, e.g. MyApp::Model::DBI=HASH(0xdf7ba0).
It sounds like you might be trying to load DBI with the Adaptor stuff. Subclassing DBI is harder than you might think, so I’d definitely shy away from that.
Minimal reproduction:
# Create a new test model with SQLite.
script/*create.pl model DBI DBI "dbi:SQLite::memory:"

# A test controller to go with it.
script/*create.pl controller DBI

# Change the index method to show your raw model:
sub index :Path :Args(0) {
    my ( $self, $c ) = @_;
    $c->response->body( $c->model("DBI") );
}
Now you could try adding something to your model:
# lib/MyApp/Model/DBI.pm
sub add {
    my $self = shift;
    my @add = @_;
    @add == 2 or die "2 is a terrible error message: 2";
    return $self->dbh->selectrow_array("SELECT ? + ?", {}, @add);
}
And this to your controller:
# lib/MyApp/Controller/DBI.pm
sub add :Local :Args(0) {
    my ( $self, $c ) = @_;
    $c->response->body( $c->model("DBI")->add( 2, 2 ) );
}
Then visit localhost:3000/dbi/add. Continue to extend your model however you like.
Now that the question is answered: you really, really, really should take the learning hit right now and get familiar with DBIx::Class or one of the other first-class ORMs in Perl. Bare-bones DBI is fine, but you’re going to find 100 problems over time that DBIC has solved, and it comes with a deep test suite, a long history, dozens of extensions, and a helpful community.
I am not using a direct DBI model myself, so I am not sure this works for you. I use the DBIC::Schema model:
script/myapp_create.pl model DB DBIC::Schema MyApp::Schema \
create=static dbi:mysql:mydb dbusername dbpass
This creates a DB model in the Model dir that is merely a wrapper around the underlying DBIx::Class::Schema schema, which is saved in lib/MyApp/Schema.pm and lib/MyApp/Schema/Result/.
If I add a foo() subroutine to lib/MyApp/Model/DB.pm I can simply reference it like
$c->model('DB')->foo()
I thought the DBI model also created a wrapper model: $c->model('DBI')->dbh should return the raw DBI handle, and $c->model('DBI') the Catalyst model wrapper.
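A sketch of what such a wrapper model with an added utility method might look like (the Artist resultset name is made up):
# lib/MyApp/Model/DB.pm
package MyApp::Model::DB;
use strict;
use warnings;
use parent 'Catalyst::Model::DBIC::Schema';

__PACKAGE__->config( schema_class => 'MyApp::Schema' );

# Callable as $c->model('DB')->foo()
sub foo {
    my $self = shift;
    # The underlying DBIx::Class schema is available via $self->schema
    return $self->schema->resultset('Artist')->count;
}

1;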
The code below is an example of how I build my models; I've written a tutorial on this at http://brainbuz.org/techinfo. I use DBIx::Simple as a convenience; you can easily skip it for raw DBI and reference $self->dbh directly in your model.
# Parent model
package BoPeep::Model::BoPeep;
use strict;
use warnings;
use DBIx::Simple;
use parent 'Catalyst::Model::DBI';

__PACKAGE__->config(
    dsn      => BoPeep->config->{dsn},
    user     => BoPeep->config->{user},
    password => BoPeep->config->{password},
);

use Moose;  # use Moose immediately before calling
            # on Moose to extend the object

has db => (
    is         => 'ro',
    isa        => 'DBIx::Simple',
    lazy_build => 1,
    # If we don't want to handle all of DBIx::Simple's methods,
    # specify those that we want:
    # handles => [qw/query flat/],
);

sub _build_db {
    my $self = shift;
    return DBIx::Simple->connect($self->dbh);
}

# Child model
package BoPeep::Model::BoPeep::Flock;
use Moose;
use BoPeep;
use namespace::autoclean;
extends 'BoPeep::Model::BoPeep';

sub List {
    my $self = shift;
    my $db   = $self->db;
    my @sheep = $db->query('SELECT * FROM flock')->flat;
    return @sheep;
}

__PACKAGE__->meta->make_immutable( inline_constructor => 0 );
1;