Merging two YAML files does not handle duplicates? - perl

I am trying to merge 2 YAML files using the Hash::Merge Perl module, and then trying to dump the result to a YAML file using Dump from the YAML module.
use strict;
use warnings;
use Hash::Merge qw( merge );
Hash::Merge::set_behavior('RETAINMENT_PRECEDENT');
use File::Slurp qw(write_file);
use YAML;
my $yaml1 = $ARGV[0];
my $yaml2 = $ARGV[1];
my $yaml_output = $ARGV[2];
my $clkgrps = &YAML::LoadFile($yaml1);
my $clkgrps1 = &YAML::LoadFile($yaml2);
my $clockgroups = merge($clkgrps1, $clkgrps);
my $out_yaml = Dump $clockgroups;
write_file($yaml_output, { binmode => ':raw' }, $out_yaml);
After merging the YAML files I see duplicate entries, i.e. content that is identical in the two input files is treated as two different entries during the merge. Is there a built-in way to handle duplicates?

The data structures obtained from YAML files generally contain keys whose values are arrayrefs of hashrefs. In your test case that's the arrayref for the key test.
A tool like Hash::Merge can then only add the hashrefs to the arrayref belonging to the same key; it is not meant to compare array elements, as there are no general criteria for that. So you need to prune duplicates yourself, or apply whatever specific rules suit your data.
One way to handle this is to serialize (stringify) the complex data structures in each arrayref that may contain duplicates, so that you can build a hash with those strings as keys; that is a standard way to handle duplicates (with O(1) lookups, albeit possibly with a large constant).
There are a number of ways to serialize data in Perl. I'd recommend JSON::XS, a very fast tool whose output can be used by any language and tool. (But please research others as well, of course; they may suit your precise needs better.)
A simple complete example, using your test cases:
use strict;
use warnings;
use feature 'say';
use Data::Dump qw(dd pp);
use YAML;
use JSON::XS;
use Hash::Merge qw( merge );
#Hash::Merge::set_behavior('RETAINMENT_PRECEDENT'); # irrelevant here

die "Usage: $0 in-file1 in-file2 output-file\n" if @ARGV != 3;
my ($yaml1, $yaml2, $yaml_out) = @ARGV;

my $hr1 = YAML::LoadFile($yaml1);
my $hr2 = YAML::LoadFile($yaml2);

my $merged = merge($hr2, $hr1);
#say "merged: ", pp $merged;

for my $key (keys %$merged) {
    # Identical elements serialize identically, so their keys overwrite
    my %uniq = map { encode_json $_ => 1 } @{ $merged->{$key} };

    # Overwrite the arrayref with one without dupes
    $merged->{$key} = [ map { decode_json $_ } keys %uniq ];
}

dd $merged;
# Save the final structure...
More complex data structures require a more judicious traversal; consider using a tool for that.
With files as shown in the question this prints
{
  test => [
    { directory => "LIB_DIR", name => "ObsSel.ktc", project => "TOT" },
    { directory => "MODEL_DIR", name => "pipe.v", project => "TOT" },
    {
      directory => "PCIE_LIB_DIR",
      name => "pciechip.ktc",
      project => "PCIE_MODE",
    },
    { directory => "NAME_DIR", name => "fame.v", project => "SINGH" },
    { directory => "TREE_PROJECT", name => "Syn.yml", project => "TOT" },
  ],
}
(I use Data::Dump to show complex data, for its simplicity and default compact output.)
If there are issues with serializing and comparing entire structures, consider using a digest (checksum, hashing) of some sort.
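For instance, here is a minimal sketch of that, reusing $merged from the example above (the choice of JSON::XS as the canonical form and MD5 as the digest are just assumptions):
use Digest::MD5 qw(md5_hex);
use JSON::XS;

# ->canonical gives a stable key order so equivalent hashrefs serialize
# identically; ->utf8 yields bytes that md5_hex can digest safely
my $json = JSON::XS->new->canonical->utf8;

for my $key (keys %$merged) {
    my (%seen, @unique);
    for my $elem ( @{ $merged->{$key} } ) {
        push @unique, $elem
            unless $seen{ md5_hex( $json->encode($elem) ) }++;
    }
    $merged->{$key} = \@unique;
}
This variant also keeps the original elements as they are, so no decoding step is needed afterwards.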
Another option altogether is to compare the data structures as they are, resolving duplicates by hand. For comparison of complex data structures I like to use Test::More, which works very nicely for mere comparisons outside of any testing. But there are dedicated tools as well, of course, like Data::Compare.
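For example, a rough sketch of pruning duplicates by direct comparison with Data::Compare (O(n²) pairwise comparisons, but no serialization step; again reusing $merged from above):
use Data::Compare;   # exports Compare(), a deep-equality check

for my $key (keys %$merged) {
    my @unique;
    for my $elem ( @{ $merged->{$key} } ) {
        # keep an element only if it differs from everything kept so far
        push @unique, $elem unless grep { Compare( $elem, $_ ) } @unique;
    }
    $merged->{$key} = \@unique;
}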
Finally, instead of manually processing the result of a naive merge, as above, one can code the desired behavior using Hash::Merge::add_behavior_spec and then have the module do it all. For specific examples of how to use this feature see for instance this post
and this post and this post.
Note that in this case you still write all the code that does the job, as above, but the module takes some of the mechanics off your hands. A rough sketch follows.
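An untested sketch of that route, assuming the $hr1/$hr2 hashrefs from the example above. A behavior spec needs coderefs for all nine type pairings; most entries below just mirror left-precedent defaults, and only the ARRAY-vs-ARRAY entry, which concatenates and then prunes duplicates via canonical JSON, is the point here:
use Hash::Merge qw( merge );
use JSON::XS;

my $json = JSON::XS->new->canonical->utf8;

Hash::Merge::add_behavior_spec(
    {
        SCALAR => {
            SCALAR => sub { $_[0] },
            ARRAY  => sub { $_[0] },
            HASH   => sub { $_[0] },
        },
        ARRAY => {
            SCALAR => sub { [ @{ $_[0] }, $_[1] ] },
            ARRAY  => sub {    # concatenate, then prune duplicates
                my %seen;
                [ grep { !$seen{ $json->encode($_) }++ } @{ $_[0] }, @{ $_[1] } ];
            },
            HASH   => sub { $_[0] },
        },
        HASH => {
            SCALAR => sub { $_[0] },
            ARRAY  => sub { $_[0] },
            HASH   => sub { Hash::Merge::_merge_hashes( $_[0], $_[1] ) },
        },
    },
    'MERGE_PRUNE_DUPES',
);
Hash::Merge::set_behavior('MERGE_PRUNE_DUPES');

my $merged = merge($hr2, $hr1);   # arrayrefs now get deduped during the merge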

Related

Cleanest way to parse arguments with Getopt::Long

I use Getopt::Long to parse command-line arguments. I would like to add a new option "multi" which should get a string that looks as follows: key1=abc,key2=123,key3=xwz.
I don't know how many custom keys the user wants to give, but they can give at most 5 keys. Also, I would like to put them in a hash with their keys.
I'm looking for a good and clean way to implement it.
For starters, I thought of using --multi {key1=abc,key2=123,key3=xwz}, but for some reason it gets only
the first key key1=abc. I also tried --multi {key1=abc},{key2=123},{key3=xwz}, but it feels kind of messy. I want to give the user the possibility to pass values containing - as well, like key1=./some_script.pl --help. Part of the code:
my %arg;
GetOptions(
    "multi=s" => \$arg{"multi"},
);
Then I would like to somehow put those keys in the hash so it will be easy to use them; I thought of using $arg{"multi"}{"key3"} to get the value of key3.
How should I approach this feature? What is the cleanest way to do so?
To summarize:
What is the best way to ask the user to give keys in a form similar to key1=abc,key2=123,key3=xwz, without using a file (passing options, not a file)? Meaning: how would you, as a user of the script, like to give those fields?
How do I validate that the user gave at most 5 keys?
How should I parse those keys, and what is the best way to insert them into the hash under the multi key?
Expected output: I would like to have a hash such that $arg{"multi"}{"key3"} returns xwz.
The following program reads the comma-separated sub-options from the --multi option on the command line.
#!perl
use strict;
use warnings;
use Data::Dumper;
use Getopt::Long 'GetOptionsFromArray';

my @args = ('--multi', '{key1=abc,key2=123,key3=xwz}', 'some', 'other');
my %arg;
GetOptionsFromArray(
    \@args,
    "multi=s" => \$arg{"multi"},
);

if( $arg{multi} and $arg{multi} =~ /^\{(.*)\}$/ ) {
    # split the braced payload up into a hash
    $arg{multi} = { split /[{},=]/, $1 };
}

print Dumper \%arg;
__END__
$VAR1 = {
          'multi' => {
                       'key2' => '123',
                       'key1' => 'abc',
                       'key3' => 'xwz'
                     }
        };
The program uses GetOptionsFromArray for easy testability. In the real program, you will likely use GetOptions(...), which is identical to GetOptionsFromArray(\@ARGV, ...).
One way is to assign options of the key=value format to a hash, which Getopt::Long supports. Even better, since this functionality merely needs a hash reference, it turns out you can have it assign to a hashref that is a value inside a deeper data structure. You can make direct use of that:
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use Data::Dump qw(dd);

my %args;
$args{multi} = {};

GetOptions( 'multi=s' => $args{multi} ) or die "Bad options: $!";

dd \%args;
With multiple invocations of that option the key-value pairs are added
script.pl --multi k1=v1 --multi k2=v2
and the above program prints
{ multi => { k1 => "v1", k2 => "v2" } }
I use Data::Dump to print complex data. Change to core Data::Dumper if that's a problem.
While Getopt::Long has a way to limit the number of values an option takes, that apparently applies only to array destinations, so you'd have to count keys to check, as in the sketch below.
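For example, a small check after parsing (assuming the %args from the program above and the five-key limit from the question):
my $num_keys = keys %{ $args{multi} };
die "Too many keys for --multi: got $num_keys, at most 5 allowed\n"
    if $num_keys > 5;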
Another way is to process the input string in a subroutine, where you can do practically anything you want. Adding that to the above script, to add yet another key with its hashref to %args:
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use Data::Dump qw(dd);

my %args;
$args{multi} = {};

GetOptions(
    'multi=s' => $args{multi},
    'other=s' => sub { $args{other} = { split /[=,]/, $_[1] } }
) or die "Bad options: $!";

dd \%args;
When called as
script.pl --multi k1=v1 --multi k2=v2 --other mk1=mv1,mk2=mv2
This prints
{
  other => { mk1 => "mv1", mk2 => "mv2" },
  multi => { k1 => "v1", k2 => "v2" },
}

Copy on Write for References

Perl currently supports Copy on Write (CoW) for scalar variables; however, it doesn't appear to have anything for hashrefs and arrayrefs.
Perl does, however, have subroutines that modify variable internals, like weaken, so I'm guessing that a solution might exist.
I have a situation where a package returns a large structure while also keeping it as internal state. I want to ensure that if either the returned reference or the internal reference (currently both are the same reference) is modified, I end up with Copy-on-Write behavior: the data the references point to is copied, the copy is modified, and the reference used for the modification is updated to point at the new data.
package SomePackage;
use Moose;

has some_large_internal_variable_ref => (
    'is'  => 'rw',
    'isa' => 'HashRef',
);

sub some_operation {
    my ($self) = @_;
    $self->some_large_internal_variable_ref({
        # create some large result that is different every time
    });
}

sub get_result {
    my ($self) = @_;
    return $self->some_large_internal_variable_ref;
}

1;
use strict;
use warnings;
use SomePackage;
use Test::More;

# Situation 1: the internally stored reference is modified.
# This will pass!
my $package = SomePackage->new();
$package->some_operation();
my $result1 = $package->get_result();
$package->some_operation();
my $result2 = $package->get_result();
isnt($result1, $result2, "These two references should no longer be the same");

# Situation 2: the externally stored reference is modified.
# This will fail.
$package = SomePackage->new();
$package->some_operation();
$result1 = $package->get_result();
$result1->{foo} = "bar";
$result2 = $package->get_result();
isnt($result1, $result2, "These two references should no longer be the same");

done_testing;
I'm trying to avoid a situation where I have to clone the values on the get_result return, as that would double memory usage.
I'm hoping there is some weaken-like subroutine I can call on the variable to indicate that, should a modification be made, it should behave with Copy-on-Write behaviour.

Replacing a class in Perl ("overriding"/"extending" a class with same name)?

I am trying to iterate directories in Perl, getting introspectable objects as the result, mostly so I can print fields like mtime when I'm using Dumper on the returns from IO::All.
I have discovered that it can be done if, in the module IO::All::File (for me, /usr/local/share/perl/5.10.1/IO/All/File.pm), I add the line field mtimef => undef; and then modify its sub file so it runs $self->mtimef($self->mtime); (note that this field cannot have the same name (mtime) as the corresponding method/property, as those are dynamically assigned in IO::All). So, in essence, I'm not interested in "overloading" in the sense of having the same name for multiple function signatures; I want to "replace" or "override" a class with its extended version (not sure what this is properly called), but under the same name, so that all other classes that may use it get the extended version from that point on.
The best approach for me now would be to somehow "replace" the IO::All::File class from my actual "runnable" Perl script, if at all possible by using the mechanisms for inheritance, so I can just add what is extra. To show what I mean, here is an example:
use warnings;
use strict;
use Data::Dumper;

my @targetDirsToScan = ("./");

use IO::All -utf8; # Turn on utf8 for all io

# try to "replace" the IO::All::File class
{ # recursive inheritance!
    package IO::All::File;
    use IO::All::File -base;
    # hacks work if directly in /usr/local/share/perl/5.10.1/IO/All/File.pm
    field mtimef => undef; # hack

    sub file {
        my $self = shift;
        bless $self, __PACKAGE__;
        $self->name(shift) if @_;
        $self->mtimef($self->mtime); # hack
        return $self->_init;
    }
    1;
}

# main script start
my $io = io(@targetDirsToScan);
my @contents = $io->all(0); # Get all contents of dir

for my $contentry ( @contents ) {
    print Dumper \%{*$contentry};
}
... which fails with "Recursive inheritance detected in package 'IO::All::Filesys' at /usr/local/share/perl/5.10.1/IO/All/Base.pm line 13."; if you comment out the "recursive inheritance" section, it all works.
I'm sort of clear on why this happens with this kind of syntax; however, is there a syntax, or a way, to "replace" a class with its extended version of the same name, similar to what I tried above? Obviously I want the same name so that I don't have to change anything else in IO::All (or any other files in the package). I would also preferably do this in the "runner" Perl script, so that I can have everything in a single script file and don't have to maintain multiple files; but if a separate .pm file is the only way possible, I'd like to know about it as well.
So, is there a technique I could use for something like this?
Well, I honestly have no idea what is going on, but I poked around with the code above, and it seems all that is required is to remove the -base from the use IO::All::File statement; the code otherwise works as I expect, that is, the package does get "overridden". Change this snippet in the code above:
# ...
{ # no more recursive inheritance!? IO::All::File gets overridden with this?!
    package IO::All::File;
    use IO::All::File; # -base; # just do not use `-base` here?!
    # hacks work if directly in /usr/local/share/perl/5.10.1/IO/All/File.pm
    field mtimef => undef; # hack

    sub file {
        my $self = shift;
        bless $self, __PACKAGE__;
        $self->name(shift) if @_;
        $self->mtimef($self->mtime); # hack
        print("!! *haxx0rz'd* file() reporting in\n");
        return $self->_init;
    }
    1;
}
# ...
I found this so unbelievable that I even added the print() there to make sure it is the "overridden" function that runs, and sure enough, it is; this is what I get in the output:
...
!! *haxx0rz'd* file() reporting in
$VAR1 = {
          '_utf8' => 1,
          'mtimef' => 1394828707,
          'constructor' => sub { "DUMMY" },
          'is_open' => 0,
          'io_handle' => undef,
          'name' => './test.blg',
          '_encoding' => 'utf8',
          'package' => 'IO::All'
        };
...
... and sure enough, the field is there too, as expected.
Well - I hope someone eventually puts a more qualified answer here; for the time being, I hope this is as good as a fix to my problems :) ...
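Another well-known technique for this kind of thing is to monkey-patch just the one method at runtime: save the original coderef, then install a wrapper in the IO::All::File symbol table. A rough, untested sketch (the mtimef key is hypothetical, and stashing it inside the object's glob relies on IO::All objects being blessed globrefs, as the Dumper output above suggests):
use strict;
use warnings;
use IO::All -utf8;
use IO::All::File;

{
    no warnings 'redefine';
    my $orig_file = \&IO::All::File::file;    # keep the original method
    *IO::All::File::file = sub {
        my $self = $orig_file->(@_);          # run the original constructor
        ${*$self}{mtimef} = $self->mtime;     # stash mtime inside the globref (hypothetical key)
        return $self;
    };
}
This avoids the package block, and thus the inheritance question, entirely.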

Which KiokuDB backend is suitable for my serialization needs?

I use KiokuDB to store a couple of Moose objects and a couple of simple array structures (hashes and arrays).
I do not need any fancy searches, transactions etc., simply the ability to fetch (look up) an object. Also, as soon as I'm done creating the DB, it can be set read-only. No changes will ever be made to it.
The main (only?) reason I use KiokuDB is to keep the object graph.
The largest object, which dominates the total size of the DB, is a Moose object that has a relatively large array in it (let's call this object large_obj). Previously, I stored large_obj (alone) using Storable + PerlIO::gzip or even JSON + PerlIO::gzip. It worked fine and I was very happy with the results (the use of gzip compressed the store file to about 5% of its original size).
There is another, smaller Moose object, which is basically an array of some 20-30k small Moose objects.
Now, after migrating to KiokuDB, I first used the simple Hash backend, then dumped it to a file (using Cmd) with PerlIO::gzip again. This worked very well in cases where large_obj was relatively small, but once it got larger, I just got out-of-memory errors. I guess the Hash backend is not suitable for large objects.
I then tried the recommended Berkeley backend, although it seems like overkill (as mentioned, I don't really need all the fancy DB capabilities). It is much slower than the original Storable + PerlIO::gzip solution, it takes far more space, and it also runs out of memory for larger objects! (I use Ubuntu with 3 GB of RAM.)
I also tried the Files backend, but it fails with:
Too many open files at /usr/local/perls/perl-5.12.2/lib/site_perl/5.12.2/Directory/Transactional.pm line 130.
(in cleanup) Too many open files at /usr/local/perls/perl-5.12.2/lib/site_perl/5.12.2/Directory/Transactional.pm line 130.
Do you have any suggestions on how can I store my objects in a way that is both space-efficient and maintains the object graph?
Implement your own backend using Data::Serializer:
package KiokuDB::Backend::Serialize::Data::Serializer;
use Moose::Role;    # this package is a role, so Moose::Role rather than Moose

use Data::Serializer;

use namespace::clean -except => 'meta';

with qw(
    KiokuDB::Backend::Serialize
    KiokuDB::Backend::Role::UnicodeSafe
    KiokuDB::Backend::Role::BinarySafe
);

has '_serializer' => (
    is       => 'ro',
    isa      => 'Data::Serializer',
    required => 1,
    lazy     => 1,
    default  => sub {
        Data::Serializer->new(
            serializer => 'FreezeThaw',     # Storable, FreezeThaw, Data::Denter, Config::General, YAML, PHP::Serialization, XML::Dumper, and Data::Dumper
            digester   => 'MD5',            # see http://search.cpan.org/~gaas/Digest-1.16/Digest.pm#Digest_speed
            compress   => 1,
            compressor => 'Compress::Zlib', # Compress::Zlib or Compress::PPMd
        );
    },
);

sub serialize {
    my ( $self, $entry ) = @_;
    return $self->_serializer->serialize($entry);
}

sub deserialize {
    my ( $self, $blob ) = @_;
    return $self->_serializer->deserialize($blob);
}

sub serialize_to_stream {
    my ( $self, $fh, $entry ) = @_;
    $self->_serializer->store( $entry, $fh );
}

sub deserialize_from_stream {
    my ( $self, $fh ) = @_;
    $self->_serializer->retrieve($fh);
}

__PACKAGE__

Separating configuration data and script logic in Perl scripts

I find the following anti-pattern repeated in my Perl scripts: the script contains some machine- or setup-specific settings, which I store inline as constants in the script, whereas the rest of the script is general in nature:
#!/usr/bin/perl
use strict;
use warnings;
# machine specific settings at the start of the script.
my $SETTING_1 = "foo";
my #SETTING_2 = ("123", "456");
my $SETTING_3 = "something";
# general part of script follows.
...
This pattern is somewhat okay when running on one machine, but as soon as I want to distribute the script to multiple machines the trouble starts, since I must take care not to overwrite the settings part with new updates to the general part.
The correct solution is obviously to have one general script file and have it read a configuration file which is specific to the environment that the script runs in.
My question is: What CPAN module would you recommend for solving this problem? Why?
For configuration files, I like to use YAML. Simple, cross-platform, human-readable, and no danger of your configuration accidentally morphing into an actual program.
My favorite is Config::Std. I like the way it handles multi-line and multi-part configuration values.
You have to be careful when a variable is potentially multi-valued: if a single value exists in the configuration file, it will be stored in a scalar; if multiple values exist, you will get an array reference.
I find it convenient to have two configuration files: One for values that describe the operating environment (where to find libraries etc) and another for user-modifiable behavior.
I also like to write a wrapper around it. For example (updated to include autogenerated read-only accessors):
#!/usr/bin/perl

package My::Config;

use strict; use warnings;

use Config::Std;
use FindBin qw($Bin);
use File::Spec::Functions qw( catfile );

sub new {
    my $class = shift;
    my ($config_file) = @_;
    $config_file = catfile($Bin, 'config.ini')
        unless defined $config_file;

    read_config $config_file => my %config;
    my $object = bless \%config => $class;

    $object->gen_accessors(
        single => {
            install => [ qw( root ) ],
        },
        multi => {
            template => [ qw( dir ) ],
        },
    );
    return $object;
}

sub gen_accessors {
    my $config = shift;
    my %args = @_;

    my $class = ref $config;

    {
        no strict 'refs';

        for my $section ( keys %{ $args{single} } ) {
            my @vars = @{ $args{single}->{$section} };
            for my $var ( @vars ) {
                *{ "${class}::${section}_${var}" } = sub {
                    $config->{$section}{$var};
                };
            }
        }

        for my $section ( keys %{ $args{multi} } ) {
            my @vars = @{ $args{multi}->{$section} };
            for my $var ( @vars ) {
                *{ "${class}::${section}_${var}" } = sub {
                    my $val = $config->{$section}{$var};
                    return [ $val ] unless 'ARRAY' eq ref $val;
                    return $val;
                };
            }
        }
    }

    return;
}

package main;

use strict; use warnings;

my $config = My::Config->new;

use Data::Dumper;
print Dumper($config->install_root, $config->template_dir);
C:\Temp> cat config.ini
[install]
root = c:\opt
[template]
dir = C:\opt\app\tmpl
dir = C:\opt\common\tmpl
Output:
C:\Temp> g.pl
$VAR1 = 'c:\\opt';
$VAR2 = [
          'C:\\opt\\app\\tmpl',
          'C:\\opt\\common\\tmpl'
        ];
The Config::Properties module is good for reading and writing key/value pair property files.
I prefer YAML and YAML::XS for configuration data. It's simple, readable, and has bindings for almost any programming language. Another popular choice is Config::General.
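For reference, the basic Config::General usage is just a couple of calls (the file name is hypothetical):
use Config::General;

# Parse an Apache-style config file into a plain hash
my $conf   = Config::General->new('myapp.conf');
my %config = $conf->getall;

print $config{setting_1}, "\n";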
The usual low-tech method is to simply do EXPR a configuration file that is itself written as Perl code. Have you looked into this?
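A minimal sketch of that approach, with a hypothetical settings.pl; whatever the file's last expression evaluates to is what do returns:
# settings.pl -- configuration written as plain Perl;
# the leading + forces Perl to parse this as a hashref, not a block
+{
    setting_1 => "foo",
    setting_2 => [ "123", "456" ],
    setting_3 => "something",
};
And in the main script:
use strict;
use warnings;

my $settings = do './settings.pl'
    or die "Can't load settings.pl: ", $@ || $!;
print $settings->{setting_1}, "\n";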
At the risk of being laughed out of class, one solution is to store the config in XML (or, for the more adventurous, JSON). It is human-consumable, interoperable outside of Perl, and doesn't have to live on the local PC (both XML and JSON can be requested from a "config URL"), and a bunch of standard modules exist on CPAN (XML::Simple is usually good enough for config XML files).
For simple configuration like this, especially for trivial things where I don't expect this data to change in the real world, I often simply use YAML. The simplicity cannot be beat:
First, write your Perl data structure containing your configuration.
use YAML;
my $SETTINGS = {
    '1' => "foo",
    '2' => ["123", "456"],
    '3' => "something",
};
Then, pass it to YAML::DumpFile():
YAML::DumpFile("~/.$appname.yaml", $SETTINGS);
Delete the data structure and replace it with:
my $SETTINGS = YAML::LoadFile("~/.$appname.yaml");
And then forget about it. Even if you don't know or want to learn YAML syntax, small changes to the config can be made by hand and more major ones can be done in Perl and then re-dumped to YAML.
Don't tie yourself to a format -- use Config::Any, or for a little more whizbang DWIM factor, Config::JFDI (which itself wraps Config::Any). With them you buy yourself the ability to support INI, YAML, XML, Apache-style config, and more.
Config::JFDI builds on this by trying to capture some of the magic of Catalyst's config loader: merging of instance-local config with app-wide config, environment variable support, and a limited macro facility (__path_to(foo/bar)__ comes in handy surprisingly often.)
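A short sketch of the Config::Any side (the stem name is hypothetical); load_stems tries each supported extension and picks the parser from it:
use Config::Any;

# Looks for myapp.yml, myapp.ini, myapp.json, myapp.xml, ...
my $loaded = Config::Any->load_stems({
    stems   => ['myapp'],
    use_ext => 1,
});

for my $entry (@$loaded) {
    my ($file, $config) = %$entry;
    print "loaded $file\n";
}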