Perl: safely make hash from list, checking for duplicates

In Perl if you have a list with an even number of elements you can straightforwardly convert it to a hash:
my @a = qw(each peach pear plum);
my %h = @a;
However, if there are duplicate keys then they will be silently accepted, with the last occurrence being the one used. I would like to make a hash checking that there are no duplicates:
my @a = qw(a x a y);
my %h = safe_hash_from_list(@a); # prints error: duplicate key 'a'
Clearly I could write that routine myself:
sub safe_hash_from_list {
    die 'even sized list needed' if @_ % 2;
    my %r;
    while (@_) {
        my $k = shift;
        my $v = shift;
        die "duplicate key '$k'" if exists $r{$k};
        $r{$k} = $v;
    }
    return %r;
}
This, however, is quite a bit slower than the simple assignment. Moreover I do not want to use my own private routine if there is a CPAN module that already does the same job.
Is there a suitable routine on CPAN for safely turning lists into hashes? Ideally one that is a bit faster than the pure-Perl implementation above (though probably never quite as fast as the simple assignment).
If I may be allowed a related follow-up question, I'm also wondering about a tied hash class which allows each key to be assigned only once and dies on reassignment. That would be a more general case of the above problem. Again, I can write such a tied hash myself but I do not want to reinvent the wheel and I would prefer an optimized implementation if one already exists.

A quick way to check that no keys were duplicated is to count the keys of the resulting hash and make sure the count equals half the number of items in the list:
my @a = ...;
my %h = @a;
if (keys %h == (@a / 2)) {
    print "Success!";
}
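Building on the count check, here is a minimal sketch that uses plain assignment as the fast path and only walks the list to name the offending key when the counts disagree. The name safe_hash_from_list comes from the question; the exact error wording and the two-stage structure are assumptions for illustration, not an existing CPAN routine.

```perl
use strict;
use warnings;

# Build a hash from a key/value list, dying if any key repeats.
sub safe_hash_from_list {
    die "even sized list needed\n" if @_ % 2;
    my %h = @_;                          # plain assignment: the fast path
    return %h if keys(%h) == @_ / 2;     # every key was distinct

    # Slow path, reached only on failure: find a key that repeats.
    my %seen;
    for my $i (grep { $_ % 2 == 0 } 0 .. $#_) {
        die "duplicate key '$_[$i]'\n" if $seen{ $_[$i] }++;
    }
}

my %ok = safe_hash_from_list(qw(each peach pear plum));
print join(',', map { "$_=$ok{$_}" } sort keys %ok), "\n";

my %bad = eval { safe_hash_from_list(qw(a x a y)) };
print "error: $@" if $@;
```

In the common case without duplicates, the only extra cost over a bare assignment is the key count comparison.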

Related

Why does combining Perl hashes in an each expression not work?

I recently encountered this and became most displeased:
while (my ($key, $value) = each (%hash1, %hash2)) {
}
Gave this error: Experimental each on scalar is now forbidden at ...
But this, which seems to be the same operation using a superfluous variable:
my %h = (%hash1, %hash2);
while (my ($key, $value) = each %h) {
}
Compiled and worked just fine.
What's the reason for this, and is my displeasure warranted?

There are a couple of issues that come up here. First, let's deal with your immediate problem.
my %h = (%hash1, %hash2);
while (my ($key, $value) = each(%hash1, %hash2)) { ... }
each is actually an ordinary function in Perl, not special syntax. So, as far as Perl is concerned, you're calling each with two arguments, not one. each expects a single hash or array (basically, something that begins with % or @), which is why each(%h) works. You can create a temporary hash and pass that using slightly more convoluted syntax:
while (my ($key, $value) = each(%{{%hash1, %hash2}})) { ... }
Here, we use the hashref constructor to make a new hash {%hash1, %hash2}. This is a scalar value that happens to point to a hash. Then we immediately dereference it with %{...}. Unfortunately, this causes another problem. If you try to run this code, it'll compile fine but then loop forever. To see why, we'll need to take a brief tangent.
each is a bit of an oddball in Perl. It's actually stateful and stores the so-called state of its call in the hash object. So
my %h = (a => 1, b => 2);
say each(%h);
say each(%h);
These two calls to each will return different values. One will return ("a", 1) and the other will return ("b", 2) (the order of the two returns is unspecified).
Now, your while condition is going to run anew every time it loops, so if we create a temporary hash at every loop iteration, and Perl is trying to store its each state in the hash every time, then you'll never reach the end of iteration since you'll never iterate more than once on any given hash before it's erased and replaced with a new one.
My recommendation is to just use the temporary. Even if you could do it with each and the merged hash, you'd be making a new merged hash at every loop iteration. Alternatively, you can use keys to simply get all of the keys as a single list. This will only happen once, since it's happening as the head of a for loop.
for my $key (keys %{{%hash1, %hash2}}) {
    # %hash2 wins on duplicate keys, matching the merged hash
    my $value = exists $hash2{$key} ? $hash2{$key} : $hash1{$key};
    ...
}

Syntax for each:
each HASH
each ARRAY
This means
each %NAME
each %BLOCK
each EXPR->%*
each @NAME
each @BLOCK
each EXPR->@*
What you have does not match any of those patterns.
For a while, there was an experimental feature that allowed one to use
each EXPR
as long as the expression returned a reference to a hash or array.
The experiment was a failure, so this is no longer allowed. But your code wouldn't work even in a version of Perl with this feature. Your expression (%hash1, %hash2 in scalar context) returns the size of %hash2 or a weird string (depending on the version of Perl), and neither of those is a reference to a hash or a reference to an array.
Now, you might be tempted to use
each %{ { %hash1, %hash2 } }
Unfortunately, that creates a new hash each time it's evaluated, so you will perpetually get the first element of this new hash.

Efficiently get hash entry only if it exists in Perl

I am quite often writing fragments of code like this:
if (exists $myHash->{$key}) {
$value = $myHash->{$key};
}
What I am trying to do is get the value from the hash if the hash has that key in it, and at the same time I want to avoid autovivifying the hash entry if it did not already exist.
However it strikes me that this is quite inefficient: I am doing a hash lookup to find out if a key exists, and then if it did exist I am doing another hash lookup of the same key to extract it.
It gets even more inefficient in a multilevel structure:
if (exists $myHash->{$key1}
&& exists $myHash->{$key1}{$key2}
&& exists $myHash->{$key1}{$key2}{$key3}) {
$value = $myHash->{$key1}{$key2}{$key3};
}
Here I am presumably doing 9 hash lookups instead of 3!
Is perl smart enough to optimize this kind of case? Or is there some other idiom to get the value of a hash without either autovivifying the entry or doing two successive lookups?
I am aware of the autovivification module, but if possible I am looking for a solution that does not require an XS module to be installed. Also I have not had a chance to try this module out and I am not completely sure what happens in the case of a multilevel hash - the pod says that this:
$h->{$key}
would return undef if the key did not exist - does that mean that this:
$h->{$key1}{$key2}
would die if $key1 did not exist, on the grounds that I am trying to de-reference undef? If so, to avoid that presumably you would still need to do multi-level tests for existence.

I wouldn't worry about optimization since hash lookups are fast. But for your first case, you can do:
if (my $v = $hash{$key}) {
print "have $key => $v\n";
}
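Similarly, for nested keys (note that both forms test for truth, so a stored 0, empty string, or undef is treated as absent):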
if ( ($v = $hash{key1}) && ($v = $v->{key2}) ) {
print "Got $v\n";
}

Autovivification doesn't happen for single-level access so you can safely write
my $value = $hash{$key};
For multi-level access intermediate entries will be autovivified. e.g.
my $value = $hash{a}{b};
will create a reference to an empty hash if $hash{a} doesn't already exist. (If it does exist and isn't a hash reference, perl will throw an error and die.) To avoid that, you need to check each level first. You can write a subroutine to check existence of arbitrarily nested keys.
sub safe_exists {
    my $x = shift;
    foreach my $k (@_) {
        no warnings 'uninitialized';
        return unless ref $x eq ref {};
        return unless exists $x->{$k};
        $x = $x->{$k};
    }
    return 1;
}
if (safe_exists(\%hash, qw(a b))) {...}
Depending on your algorithm (and why you're trying to avoid autovivification) locking your hash can be a useful alternative to no autovivification or multi-layer exists tests.
use feature 'say';
use Hash::Util;
my %hash = (a => { b => 1 });
Hash::Util::lock_hash_recurse(%hash);
say $hash{a}{b}; # 1
say $hash{a}{c}; # error!
I mostly use this as a way to detect programming errors when working with complex data structures. It's useful for detecting mis-typed key names or inadvertent modification of values.
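To fetch the value itself rather than just test for existence, the same walk can return what it finds. This is a minimal sketch; the helper name dig is hypothetical, and it still performs an exists test plus a fetch at each level, so it saves repeated traversals from the top rather than halving the lookups.

```perl
use strict;
use warnings;

# Return the value at a nested key path, or an empty list if any level is
# missing. Each level is tested with exists first, so nothing is autovivified.
sub dig {
    my ($node, @path) = @_;
    for my $k (@path) {
        return unless ref $node eq 'HASH' && exists $node->{$k};
        $node = $node->{$k};
    }
    return $node;
}

my %hash = (a => { b => { c => 0 } });

my ($found) = dig(\%hash, qw(a b c));
print "a.b.c = $found\n";

my ($missing) = dig(\%hash, qw(x y z));
print "x.y.z is missing\n" unless defined $missing;
print "no 'x' key was created\n" unless exists $hash{x};
```

Calling it in list context distinguishes a missing path (empty list) from a key that exists with an undef value (a one-element list).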

Adding multiple values to key in perl hash

I need to create multi-dimensional hash.
for example I have done:
$hash{gene} = $mrna;
if (exists ($exon)){
$hash{gene}{$mrna} = $exon;
}
if (exists ($cds)){
$hash{gene}{$mrna} = $cds;
}
where $gene, $mrna, $exon, $cds are unique ids.
But, my issue is that I want some properties of $gene and $mrna to be included in the hash.
for example:
$hash{$gene}{'start_loc'} = $start;
$hash{gene}{mrna}{'start_loc'} = $start;
etc. But, is that a feasible way of declaring a hash? If I call $hash{$gene} both $mrna and start_loc will be printed. What could be the solution?
How would I add multiple values for the same key $gene and $mrna being the keys in this case.
Any suggestions will be appreciated.

What you need to do is to read the Perl Reference Tutorial.
Simple answer to your question:
Perl hashes can only take a single value to a key. However, that single value can be a reference to a memory location of another hash.
my %hash1 = ( foo => "bar", fu => "bur" ); # First hash
my %hash2;
$hash2{some_key} = \%hash1; # Reference to %hash1
And there's nothing stopping that first hash from containing a reference to another hash. It's turtles all the way down!
So yes, you can have a structure as complex and convoluted as you like, with as many sub-hashes as you want. Or mix in some arrays too.
For various reasons, I prefer the -> syntax when using these complex structures. I find that for more complex structures, it makes the code easier to read. However, the main thing is that it makes you remember these are references and not actual multidimensional structures.
For example:
$hash{gene}->{mrna}->{start_loc} = $start; # Quotes are not needed around a key that is a valid identifier.
The best thing to do is to think of your hash as a structure. For example:
my $person_ref = {}; # $person_ref is a hash reference.
$person_ref->{NAME}->{FIRST} = "Bob";
$person_ref->{NAME}->{LAST} = "Rogers";
$person_ref->{PHONE}->{WORK}->[0] = "555-1234"; # An array ref. Might have more than one.
$person_ref->{PHONE}->{WORK}->[1] = "555-4444";
$person_ref->{PHONE}->{CELL}->[0] = "555-4321";
...
my @people;
push @people, $person_ref;
Now, I can load up my @people array with all my people, or maybe use a hash:
my %person;
$person{$bobs_ssn} = $person_ref; # Now, all of Bob's info is indexed by his SSN.
So, the first thing you need to do is to think of what your structure should look like. What are the fields in your structure? What are the sub-fields? Figure out what your structure should look like, and then setup your hash of hashes to look like that. Figure out exactly how it will be stored and keyed.
Remember, this hash contains references to your genes (or whatever), so you want to choose your keys wisely.
Read the tutorial. Then, try your hand at it. It's not all that complicated to understand. However, it can be a bear to maintain.
When you say use strict;, you give yourself some protection:
my $foo = "bar";
say $Foo; #This won't work!
This won't work because you didn't declare $Foo, you declared $foo. The use strict; pragma can catch variable names that are mistyped, but:
my %var;
$var{foo} = "bar";
say $var{Foo}; #Whoops!
This will not be caught (except maybe by a warning that $var{Foo} has not been initialized). The use strict; pragma can't detect mistakes in typing your keys.
The next step, after you've grown comfortable with references is to move onto object oriented Perl. There's a Tutorial for that too.
All Object Oriented Perl does is to take your hash references, and turns them into objects. Then, it creates subroutines that will help you keep track of manipulating objects. For example:
sub last_name {
    my $person = shift; # Don't worry about this for now.
    my $last_name = shift;
    if ( defined $last_name ) {
        $person->{NAME}->{LAST} = $last_name;
    }
    return $person->{NAME}->{LAST};
}
When I set my last name using this subroutine ...I mean method, I guarantee that the key will be $person->{NAME}->{LAST} and not $person->{LAST}->{NAME}, $person->{LAST}->{NMAE}, or $person->{last}->{name}.
The main problem isn't learning the mechanisms, but learning to apply them. So, think about exactly how you want to represent your items. Think about what fields you want, and how you're going to pull up that information.

You could try pushing each value onto a hash of arrays:
my (@gene, @mrna, @exon, @cds);
my %hash;
push @{ $hash{$gene[$_]} }, [$mrna[$_], $exon[$_], $cds[$_] ] for 0 .. $#gene;
This way gene is the key, with multiple values ($mrna, $exon, $cds) associated with it.
Iterate over keys/values as follows:
for my $key (sort keys %hash) {
print "Gene: $key\t";
    for my $value (@{ $hash{$key} } ) {
        my ($mrna, $exon, $cds) = @$value; # De-references the array
print "Values: [$mrna], [$exon], [$cds]\n";
}
}
The answer to a question I've asked previously might be of help (Can a hash key have multiple 'subvalues' in perl?).
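For the layout asked about in the question, where a gene carries its own properties as well as its mRNAs, one option is to keep the properties and the children under separate keys. The sketch below is only an illustration: the IDs, coordinates, and the key names mrna, exons, and cds are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical IDs and coordinates, for illustration only.
my ($gene, $mrna, $exon, $cds) = qw(gene1 mrna1 exon1 cds1);

my %hash;
$hash{$gene}{start_loc} = 100;                       # property of the gene
$hash{$gene}{mrna}{$mrna}{start_loc} = 120;          # property of one mRNA
push @{ $hash{$gene}{mrna}{$mrna}{exons} }, $exon;   # several exons allowed
push @{ $hash{$gene}{mrna}{$mrna}{cds} },   $cds;

for my $g (sort keys %hash) {
    print "gene $g starts at $hash{$g}{start_loc}\n";
    for my $m (sort keys %{ $hash{$g}{mrna} }) {
        my $rec = $hash{$g}{mrna}{$m};
        print "  mRNA $m starts at $rec->{start_loc}, ",
              "exons: @{ $rec->{exons} }, cds: @{ $rec->{cds} }\n";
    }
}
```

Because the mRNA records live under their own mrna key, iterating over them never picks up start_loc by accident.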

Confusion about proper usage of dereference in Perl

I noticed the other day that - while altering values in a hash - that when you dereference a hash in Perl, you actually are making a copy of that hash. To confirm I wrote this quick little script:
#! perl
use warnings;
use strict;
my %h = ();
my $hRef = \%h;
my %h2 = %{$hRef};
my $h2Ref = \%h2;
if($hRef eq $h2Ref) {
print "\n\tThey're the same $hRef $h2Ref";
}
else {
print "\n\tThey're NOT the same $hRef $h2Ref";
}
print "\n\n";
The output:
They're NOT the same HASH(0x10ff6848) HASH(0x10fede18)
This leads me to realize that there could be spots in some of my scripts where they aren't behaving as expected. Why is it even like this in the first place? If you're passing or returning a hash, it would be more natural to assume that dereferencing the hash would allow me to alter the values of the hash being dereferenced. Instead I'm just making copies all over the place without any real need/reason to beyond making syntax a little more obvious.
I realize the fact that I hadn't even noticed this until now shows it's probably not that big of a deal (in terms of the need to go fix all of my scripts, though it is important going forward). I think it's going to be pretty rare to see noticeable performance differences out of this, but that doesn't alter the fact that I'm still confused.
Is this by design in perl? Is there some explicit reason I don't know about for this; or is this just known and you - as the programmer - expected to know and write scripts accordingly?

The problem is that you are making a copy of the hash to work with in this line:
my %h2 = %{$hRef};
And that is understandable, since many posts here on SO use that idiom to make a local name for a hash, without explaining that it is actually making a copy.
In Perl, a hash is a plural value, just like an array. This means that in list context (such as you get when assigning to a hash) the aggregate is taken apart into a list of its contents. This list of pairs is then assembled into a new hash as shown.
What you want to do is work with the reference directly.
for (keys %$href) {...}
for (values %$href) {...}
my $x = $href->{some_key};
# or
my $x = $$href{some_key};
$$href{new_key} = 'new_value';
When working with a normal hash, you have the sigil, which is % when talking about the entire hash, $ when talking about a single element, and @ when talking about a slice. Each of these sigils is then followed by an identifier.
%hash # whole hash
$hash{key} # element
@hash{qw(a b)} # slice
To work with a reference named $href simply replace the string hash in the above code with $href. In other words, $href is the complete name of the identifier:
%$href # whole hash
$$href{key} # element
@$href{qw(a b)} # slice
Each of these could be written in a more verbose form as:
%{$href}
${$href}{key}
@{$href}{qw(a b)}
Which is again a substitution of the string '$href' for 'hash' as the name of the identifier.
%{hash}
${hash}{key}
@{hash}{qw(a b)}
You can also use a dereferencing arrow when working with an element:
$hash->{key} # exactly the same as $$hash{key}
But I prefer the doubled sigil syntax since it is similar to the whole aggregate and slice syntax, as well as the normal non-reference syntax.
So to sum up, any time you write something like this:
my @array = @$array_ref;
my %hash = %$hash_ref;
You will be making a copy of the first level of each aggregate. When using the dereferencing syntax directly, you will be working on the actual values, and not a copy.
If you want a REAL local name for a hash, but want to work on the same hash, you can use the local keyword to create an alias.
sub some_sub {
my $hash_ref = shift;
our %hash; # declare a lexical name for the global %{__PACKAGE__::hash}
local *hash = \%$hash_ref;
# install the hash ref into the glob
# the `\%` bit ensures we have a hash ref
# use %hash here, all changes will be made to $hash_ref
} # local unwinds here, restoring the global to its previous value if any
That is the pure Perl way of aliasing. If you want to use a my variable to hold the alias, you can use the module Data::Alias.
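To make the copy-versus-reference distinction concrete, here is a small sketch (the variable names are made up for the example). Assigning a dereferenced hash to a new hash copies the top level only, while going through the reference touches the original.

```perl
use strict;
use warnings;

my %orig = (count => 1, nested => { n => 1 });
my $ref  = \%orig;

my %copy = %$ref;          # list assignment: copies the top level
$copy{count} = 99;         # changes only the copy
$copy{nested}{n} = 99;     # inner hash is shared, so %orig sees this

$ref->{count}++;           # works on %orig itself

print "orig count:  $orig{count}\n";    # 2
print "copy count:  $copy{count}\n";    # 99
print "orig nested: $orig{nested}{n}\n"; # 99
```

The last line shows why the copy is called shallow: the nested hash reference was copied, not the hash it points to.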

You are confusing the actions of dereferencing, which does not inherently create a copy, and using a hash in list context and assigning that list, which does. $hashref->{'a'} is a dereference, but most certainly does affect the original hash. This is true for $#$arrayref or values(%$hashref) also.
Without the assignment, just the list context %$hashref is a mixed beast; the resulting list contains copies of the hash keys but aliases to the actual hash values. You can see this in action:
$ perl -wle'$x={"a".."f"}; for (%$x) { $_=chr(ord($_)+10) }; print %$x'
epcnal
vs.
$ perl -wle'$x={"a".."f"}; %y=%$x; for (%y) { $_=chr(ord($_)+10) }; print %$x; print %y'
efcdab
epcnal
but %$hashref isn't acting any differently than %hash here.

No, dereferencing does not create a copy of the referent. It's my that creates a new variable.
$ perl -E'
my %h1; my $h1 = \%h1;
my %h2; my $h2 = \%h2;
say $h1;
say $h2;
say $h1 == $h2 ?1:0;
'
HASH(0x83b62e0)
HASH(0x83b6340)
0
$ perl -E'
my %h;
my $h1 = \%h;
my $h2 = \%h;
say $h1;
say $h2;
say $h1 == $h2 ?1:0;
'
HASH(0x9eae2d8)
HASH(0x9eae2d8)
1
No, $#{$someArrayHashRef} does not create a new array.

If perl did what you suggest, then variables would get aliased very easily, which would be far more confusing. As it is, you can alias variables with globbing, but you need to do so explicitly.

What's the safest way to iterate through the keys of a Perl hash?

If I have a Perl hash with a bunch of (key, value) pairs, what is the preferred method of iterating through all the keys? I have heard that using each may in some way have unintended side effects. So, is that true, and is one of the two following methods best, or is there a better way?
# Method 1
while (my ($key, $value) = each(%hash)) {
# Something
}
# Method 2
foreach my $key (keys(%hash)) {
# Something
}

The rule of thumb is to use the function most suited to your needs.
If you just want the keys and do not plan to ever read any of the values, use keys():
foreach my $key (keys %hash) { ... }
If you just want the values, use values():
foreach my $val (values %hash) { ... }
If you need the keys and the values, use each():
keys %hash; # reset the internal iterator so a prior each() doesn't affect the loop
while(my($k, $v) = each %hash) { ... }
If you plan to change the keys of the hash in any way except for deleting the current key during the iteration, then you must not use each(). For example, this code to create a new set of uppercase keys with doubled values works fine using keys():
%h = (a => 1, b => 2);
foreach my $k (keys %h)
{
$h{uc $k} = $h{$k} * 2;
}
producing the expected resulting hash:
(a => 1, A => 2, b => 2, B => 4)
But using each() to do the same thing:
%h = (a => 1, b => 2);
keys %h;
while(my($k, $v) = each %h)
{
$h{uc $k} = $h{$k} * 2; # BAD IDEA!
}
produces incorrect results in hard-to-predict ways. For example:
(a => 1, A => 2, b => 2, B => 8)
This, however, is safe:
keys %h;
while(my($k, $v) = each %h)
{
if(...)
{
delete $h{$k}; # This is safe
}
}
All of this is described in the perl documentation:
% perldoc -f keys
% perldoc -f each

One thing you should be aware of when using each is that it has the side effect of adding "state" to your hash (the hash has to remember what the "next" key is). When using code like the snippets posted above, which iterate over the whole hash in one go, this is usually not a problem. However, you will run into hard-to-track-down problems (I speak from experience ;) when using each together with statements like last or return to exit from the while ... each loop before you have processed all keys.
In this case, the hash will remember which keys it has already returned, and when you use each on it the next time (maybe in a totally unrelated piece of code), it will continue at this position.
Example:
my %hash = ( foo => 1, bar => 2, baz => 3, quux => 4 );
# find key 'baz'
while ( my ($k, $v) = each %hash ) {
print "found key $k\n";
last if $k eq 'baz'; # found it!
}
# later ...
print "the hash contains:\n";
# iterate over all keys:
while ( my ($k, $v) = each %hash ) {
print "$k => $v\n";
}
This prints:
found key bar
found key baz
the hash contains:
quux => 4
foo => 1
What happened to keys "bar" and "baz"? They're still there, but the second each starts where the first one left off, and stops when it reaches the end of the hash, so we never see them in the second loop.

The place where each can cause you problems is that it's a true, non-scoped iterator. By way of example:
while ( my ($key,$val) = each %a_hash ) {
print "$key => $val\n";
last if $val; #exits loop when $val is true
}
# but "each" hasn't reset!!
while ( my ($key,$val) = each %a_hash ) {
# continues where the last loop left off
print "$key => $val\n";
}
If you need to be sure that each gets all the keys and values, you need to make sure you use keys or values first (as that resets the iterator). See the documentation for each.
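A minimal sketch of that reset, reusing the sample hash from the earlier answer: the first loop exits early and leaves the iterator partway through, and the keys call in void context puts it back at the start before the second loop runs.

```perl
use strict;
use warnings;

my %hash = (foo => 1, bar => 2, baz => 3, quux => 4);

# Leave a loop early, stranding the iterator partway through the hash.
while (my ($k, $v) = each %hash) {
    last;
}

keys %hash;    # void-context call: resets the iterator of %hash

my $seen = 0;
while (my ($k, $v) = each %hash) {
    $seen++;
}
print "second loop saw $seen of ", scalar(keys %hash), " keys\n";
```

Without the keys %hash; line, the second loop would see only the three keys the first loop had not yet returned.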

Using the each syntax will prevent the entire set of keys from being generated at once. This can be important if you're using a tie-ed hash to a database with millions of rows. You don't want to generate the entire list of keys all at once and exhaust your physical memory. In this case each serves as an iterator whereas keys actually generates the entire array before the loop starts.
So, the only place "each" is of real use is when the hash is very large (compared to the memory available). That is only likely to happen when the hash itself doesn't live in memory itself unless you're programming a handheld data collection device or something with small memory.
If memory is not an issue, usually the map or keys paradigm is the more prevalent and easier-to-read paradigm.

A few miscellaneous thoughts on this topic:
There is nothing unsafe about any of the hash iterators themselves. What is unsafe is modifying the keys of a hash while you're iterating over it. (It's perfectly safe to modify the values.) The only potential side-effect I can think of is that values returns aliases which means that modifying them will modify the contents of the hash. This is by design but may not be what you want in some circumstances.
John's accepted answer is good with one exception: the documentation is clear that it is not safe to add keys while iterating over a hash. It may work for some data sets but will fail for others depending on the hash order.
As already noted, it is safe to delete the last key returned by each. This is not true for keys as each is an iterator while keys returns a list.

I always use method 2 as well. The only benefit of using each is if you're just reading (rather than re-assigning) the value of the hash entry, you're not constantly de-referencing the hash.

I may get bitten by this one but I think that it's personal preference. I can't find any reference in the docs to each() being different than keys() or values() (other than the obvious "they return different things" answer). In fact the docs state that they use the same iterator, that they all return actual list values instead of copies of them, and that modifying the hash while iterating over it using any call is bad.
All that said, I almost always use keys() because to me it is usually more self documenting to access the key's value via the hash itself. I occasionally use values() when the value is a reference to a large structure and the key to the hash was already stored in the structure, at which point the key is redundant and I don't need it. I think I've used each() 2 times in 10 years of Perl programming and it was probably the wrong choice both times =)

I usually use keys and I can't think of the last time I used or read a use of each.
Don't forget about map, depending on what you're doing in the loop!
map { print "$_ => $hash{$_}\n" } keys %hash;

I would say:
Use whatever's easiest to read/understand for most people (so keys, usually, I'd argue)
Use whatever you decide consistently throughout the whole code base.
This gives 2 major advantages:
It's easier to spot "common" code so you can re-factor into functions/methods.
It's easier for future developers to maintain.
I don't think it's more expensive to use keys over each, so no need for two different constructs for the same thing in your code.
