Should Perl hashes always contain values?

I had an earlier question that received the following response from the noted Perl expert, Perl author and Perl trainer brian d foy:
[If] you're looking for a fixed sequence of characters at the end of each filename. You want to know if that fixed sequence is in a list of sequences that interest you. Store all the extensions in a hash and look in that hash:
my( $extension ) = $filename =~ m/\.([^.]+)$/;
if( exists $hash{$extension} ) { ... }
You don't need to build up a regular expression, and you don't need to go through several possible regex alternations to check every extension you have to examine.
Thanks for the advice brian.
What I now want to know is what is the best practice in a case like the above. Should one only define the keys, which is all I need to achieve what's described above, or should one always define a value as well?

It's usually preferable to set a defined value for every key. The idiomatic value (when you don't care about the value) is 1.
my %hash = map { $_ => 1 } @array;
Doing it this way makes the code that uses the hash slightly simpler, because you can use $hash{key} as a Boolean value. If the value can be undefined you need to use the more verbose exists $hash{key} instead.
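For instance, a minimal sketch of the extension lookup from the earlier question, using a made-up list of extensions:
use strict;
use warnings;

# Hypothetical list of extensions we care about; every key gets the idiomatic value 1.
my @extensions = qw(txt csv log);
my %is_interesting = map { $_ => 1 } @extensions;

my $filename = 'report.csv';
my ($extension) = $filename =~ m/\.([^.]+)$/;

# Because every value is true, a plain truth test is enough:
print "handle $filename\n" if $is_interesting{$extension};

# With undef (or otherwise false) values you would need the more verbose form:
print "handle $filename\n" if exists $is_interesting{$extension};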
That said, there are situations where a value of undef is desirable. For example: imagine that you're parsing C header files to extract preprocessor symbols. It would be logical to store these in a hash of name => value pairs.
#define FOO 1
#define BAR
In Perl, this would map to:
my %symbols = ( FOO => 1, BAR => undef);
In C a #define defines a symbol, not a value -- "defined" in C is mapped to "exists" in Perl.
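A rough sketch of that mapping, assuming a deliberately simplified #define parser (the regex below is illustrative, not a real preprocessor):
use strict;
use warnings;

my %symbols;
while (my $line = <DATA>) {
    # Capture the symbol name and an optional value; $2 is undef when there is no value.
    if ($line =~ /^\s*#define\s+(\w+)(?:\s+(\S+))?/) {
        $symbols{$1} = $2;
    }
}

# "defined" in C corresponds to "exists" in Perl:
print "FOO is defined\n" if exists $symbols{FOO};
print "BAR is defined\n" if exists $symbols{BAR};   # true, even though the value is undef

__DATA__
#define FOO 1
#define BAR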

You can't create a hash key without a value. The value can be undef, but it will be there. How else would you construct a hash? Or was your question regarding whether the value can be undef? In which case I would say that the value you store there (undef, 1, 0...) is entirely up to you. If a lot of folks are using it, then you probably want to store some true value, in case someone else uses if ($hash{$extension}) {...} instead of exists because they weren't paying attention.

undef is a value.
Of course, stuff like that is always dependent on what you are currently doing. But $foo{bar} is just a variable like $bar and I don't see any reason why either one should not be undef every now and then.
PS:
That's why exists exists.

As others have said, the idiomatic solution for a hashset (a hash that only contains keys, not values) is to use 1 as the value because this makes the testing for existence easy. However, there is something to be said for using undef as the value. It will force the users to test for existence with exists which is marginally faster. Of course, you could test for existence with exists even when the value is 1 and avoid the inevitable bugs from users who forget to use exists.
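A small sketch of that failure mode, with a made-up word list:
use strict;
use warnings;

my @stop_words = qw(the a an);
my %stop = map { $_ => undef } @stop_words;   # keys only, values are undef

my $word = 'the';
print "skip\n" if exists $stop{$word};   # correct: prints "skip"
print "skip\n" if $stop{$word};          # bug: never prints, because undef is false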

Using undef as a value in a hash is more memory efficient than storing 1.
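If you want to check that on your own perl, Devel::Size (assuming it is installed) gives a rough comparison; the exact numbers vary by perl version and build:
use strict;
use warnings;
use Devel::Size qw(total_size);

my @keys = map { "key$_" } 1 .. 10_000;

my %with_ones  = map { $_ => 1 }     @keys;
my %with_undef = map { $_ => undef } @keys;

printf "values of 1:     %d bytes\n", total_size(\%with_ones);
printf "values of undef: %d bytes\n", total_size(\%with_undef);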

Storing '1' in a Set-hash Considered Harmful
I know using Considered Harmful is considered harmful, but this is bad, almost as bad as unrestrained goto usage.
Ok, I've harped on this in a few comments, but I think I need a full response to demonstrate the issue.
Let's say we have a daemon process that provides back-end inventory control for a shop that sells widgets.
my @items = qw(
widget
thingy
whozit
whatsit
);
my @items_in_stock = qw(
widget
thingy
);
my %in_stock;
@in_stock{@items_in_stock} = (1) x @items_in_stock; # initialize all keys to 1
sub Process_Request {
my $request = shift;
if( $request eq REORDER ) {
Reorder_Items(\@items, \%in_stock);
}
else {
Error_Response( ILLEGAL_REQUEST );
}
}
sub Reorder_Items{
my $items = shift;
my $in_stock = shift;
# Order items we do not have in-stock.
for my $item ( @$items ) {
Reorder_Item( $item )
if not exists $in_stock->{$item};
}
}
The tool is great, it automatically keeps items in stock. Very nice. Now, the boss asks for automatically generated catalogs of in-stock items. So we modify Process_Request() and add catalog generation.
sub Process_Request {
my $request = shift;
if( $request eq REORDER ) {
Reorder_Items(\@items, \%in_stock);
}
elsif( $request eq CATALOG ) {
Build_Catalog(\@items, \%in_stock);
}
else {
Error_Response( ILLEGAL_REQUEST );
}
}
sub Build_Catalog {
my $items = shift;
my $in_stock = shift;
my $catalog_response = '';
foreach my $item ( @$items ) {
$catalog_response .= Catalog_Item($item)
if $in_stock->{$item};
}
return $catalog_response;
}
In testing, Build_Catalog() works fine. Hooray, we go live with the app.
Oops. For some reason nothing is being ordered, the company runs out of stock of everything.
The Build_Catalog() routine adds keys to %in_stock, so Reorder_Items() now sees everything as in stock and never makes an order.
Using Hash::Util's lock_hash can help prevent accidental hash modification. If we locked %in_stock before calling Build_Catalog() we would have gotten a fatal error and would never have gone live with the bug.
In summary, it is best to test existence of keys rather than truth of your set-hash values. If you are using existence as a signifier, don't set your values to '1' because that will mask bugs and make them harder to track. Using lock_hash can help catch these problems.
If you must check for the truth of the values, do so in every case.
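A minimal sketch of that guard, using the same %in_stock hash as above (the eval is only there to show the error without stopping the example):
use strict;
use warnings;
use Hash::Util qw(lock_hash);

my %in_stock;
@in_stock{ qw(widget thingy) } = (1) x 2;

lock_hash(%in_stock);   # freeze the key set (and values)

# On a locked hash, even reading a key that isn't there is fatal,
# so the Build_Catalog() bug would have surfaced in testing:
eval { my $x = $in_stock{whozit}; 1 } or print "caught: $@";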

Related

Efficiently get hash entry only if it exists in Perl

I am quite often writing fragments of code like this:
if (exists $myHash->{$key}) {
$value = $myHash->{$key};
}
What I am trying to do is get the value from the hash if the hash has that key in it, and at the same time I want to avoid autovivifying the hash entry if it did not already exist.
However it strikes me that this is quite inefficient: I am doing a hash lookup to find out if a key exists, and then if it did exist I am doing another hash lookup of the same key to extract it.
It gets even more inefficient in a multilevel structure:
if (exists $myHash->{$key1}
&& exists $myHash->{$key1}{$key2}
&& exists $myHash->{$key1}{$key2}{$key3}) {
$value = $myHash->{$key1}{$key2}{$key3};
}
Here I am presumably doing 9 hash lookups instead of 3!
Is perl smart enough to optimize this kind of case? Or is there some other idiom to get the value of a hash without either autovivifying the entry or doing two successive lookups?
I am aware of the autovivification module, but if possible I am looking for a solution that does not require an XS module to be installed. Also I have not had a chance to try this module out and I am not completely sure what happens in the case of a multilevel hash - the pod says that this:
$h->{$key}
would return undef if the key did not exist - does that mean that this:
$h->{$key1}{$key2}
would die if $key1 did not exist, on the grounds that I am trying to de-reference undef? If so, to avoid that presumably you would still need to do multi-level tests for existence.
I wouldn't worry about optimization since hash lookups are fast. But for your first case, you can do:
if (my $v = $hash{$key}) {
print "have $key => $v\n";
}
Similarly:
if ( ($v = $hash{key1}) && ($v = $v->{key2}) ) {
print "Got $v\n";
}
Autovivification doesn't happen for single-level access so you can safely write
my $value = $hash{$key};
For multi-level access intermediate entries will be autovivified. e.g.
my $value = $hash{a}{b};
will create a reference to an empty hash if $hash{a} doesn't already exist. (If it does exist and isn't a hash reference, perl will throw an error and die.) To avoid that, you need to check each level first. You can write a subroutine to check existence of arbitrarily nested keys.
sub safe_exists {
my $x = shift;
foreach my $k (@_) {
no warnings 'uninitialized';
return unless ref $x eq ref {};
return unless exists $x->{$k};
$x = $x->{$k};
}
return 1;
}
if (safe_exists(\%hash, qw(a b))) {...}
Depending on your algorithm (and why you're trying to avoid autovivification) locking your hash can be a useful alternative to no autovivification or multi-layer exists tests.
use Hash::Util;
my %hash = (a => { b => 1 });
Hash::Util::lock_hash_recurse(%hash);
say $hash{a}{b}; # 1
say $hash{a}{c}; # error!
I mostly use this as a way to detect programming errors when working with complex data structures. It's useful for detecting mis-typed key names or inadvertent modification of values.

Adding multiple values to key in perl hash

I need to create a multi-dimensional hash.
for example I have done:
$hash{gene} = $mrna;
if (exists ($exon)){
$hash{gene}{$mrna} = $exon;
}
if (exists ($cds)){
$hash{gene}{$mrna} = $cds;
}
where $gene, $mrna, $exon, $cds are unique ids.
But, my issue is that I want some properties of $gene and $mrna to be included in the hash.
for example:
$hash{$gene}{'start_loc'} = $start;
$hash{gene}{mrna}{'start_loc'} = $start;
etc. But, is that a feasible way of declaring a hash? If I call $hash{$gene} both $mrna and start_loc will be printed. What could be the solution?
How would I add multiple values for the same key $gene and $mrna being the keys in this case.
Any suggestions will be appreciated.
What you need to do is to read the Perl Reference Tutorial.
Simple answer to your question:
Perl hashes can only hold a single value per key. However, that single value can be a reference to another hash.
my %hash1 = ( foo => "bar", fu => "bur" ); # First hash
my %hash2;
$hash2{some_key} = \%hash1; # Reference to %hash1
And, there's nothing stopping that first hash from containing a reference to another hash. It's turtles all the way down!
So yes, you can have a complex and convoluted structure as you like with as many sub-hashes as you want. Or mix in some arrays too.
For various reasons, I prefer the -> syntax when using these complex structures. I find that for more complex structures, it makes it easier to read. However, the main thing is that it makes you remember these are references and not actual multidimensional structures.
For example:
$hash{gene}->{mrna}->{start_loc} = $start; # Quotes aren't needed around a key that is a simple identifier.
The best thing to do is to think of your hash as a structure. For example:
my $person_ref = {}; #Person is a hash reference.
$person_ref->{NAME}->{FIRST} = "Bob";
$person_ref->{NAME}->{LAST} = "Rogers";
$person_ref->{PHONE}->{WORK}->[0] = "555-1234"; # An array ref. Might have > 1
$person_ref->{PHONE}->{WORK}->[1] = "555-4444";
$person_ref->{PHONE}->{CELL}->[0] = "555-4321";
...
my @people;
push @people, $person_ref;
Now, I can load up my @people array with all my people, or maybe use a hash:
my %person;
$person{$bobs_ssn} = $person_ref; # Now, all of Bob's info is indexed by his SSN.
So, the first thing you need to do is to think of what your structure should look like. What are the fields in your structure? What are the sub-fields? Figure out what your structure should look like, and then setup your hash of hashes to look like that. Figure out exactly how it will be stored and keyed.
Remember, this hash contains references to your genes (or whatever), so you want to choose your keys wisely.
Read the tutorial. Then, try your hand at it. It's not all that complicated to understand. However, it can be a bear to maintain.
When you say use strict;, you give yourself some protection:
my $foo = "bar";
say $Foo; #This won't work!
This won't work because you didn't declare $Foo, you declared $foo. The use strict; pragma can catch variable names that are mistyped, but:
my %var;
$var{foo} = "bar";
say $var{Foo}; #Whoops!
This will not be caught (except perhaps as a warning that $var{Foo} is uninitialized). The use strict; pragma can't detect typos in your hash keys.
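If mistyped keys are a real concern, Hash::Util::lock_keys (a close relative of the lock_hash mentioned in an answer above) can turn them into fatal errors; a minimal sketch:
use strict;
use warnings;
use Hash::Util qw(lock_keys);

my %var = ( foo => "bar" );
lock_keys(%var);          # restrict the hash to its current keys

print $var{foo}, "\n";    # fine
print $var{Foo}, "\n";    # dies: Attempt to access disallowed key 'Foo'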
The next step, after you've grown comfortable with references is to move onto object oriented Perl. There's a Tutorial for that too.
All object-oriented Perl does is take your hash references and turn them into objects. Then, it creates subroutines that help you keep track of manipulating those objects. For example:
sub last_name {
my $person = shift; # Don't worry about this for now...
my $last_name = shift;
if ( defined $last_name ) {
$person->{NAME}->{LAST} = $last_name;
}
return $person->{NAME}->{LAST};
}
When I set my last name using this subroutine ...I mean method, I guarantee that the key will be $person->{NAME}->{LAST} and not $person->{LAST}->{NAME} or $person->{LAST}->{NMAE} or $person->{last}->{name}.
The main problem isn't learning the mechanisms, but learning to apply them. So, think about exactly how you want to represent your items. Think about what fields you want, and how you're going to pull up that information.
You could try pushing each value onto a hash of arrays:
my (@gene, @mrna, @exon, @cds);
my %hash;
push @{ $hash{$gene[$_]} }, [ $mrna[$_], $exon[$_], $cds[$_] ] for 0 .. $#gene;
This way gene is the key, with multiple values ($mrna, $exon, $cds) associated with it.
Iterate over keys/values as follows:
for my $key (sort keys %hash) {
print "Gene: $key\t";
for my $value (@{ $hash{$key} } ) {
my ($mrna, $exon, $cds) = @$value; # De-references the array
print "Values: [$mrna], [$exon], [$cds]\n";
}
}
The answer to a question I've asked previously might be of help (Can a hash key have multiple 'subvalues' in perl?).

Nicer way to test if hash entry exists before assigning it

I'm looking for a nicer way to first "test" if a hash key exists before using it. I'm currently writing an event log parser that decodes hex numbers into strings. As I cannot be sure that my decode table contains a given hex number, I first need to check if the key exists in the hash before assigning the value to a new variable. So what I'm doing a lot is:
if ($MEL[$i]{type} eq '5024') {
$MEL[$i]{decoded_inline} = $decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"}
if exists ($decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"})
}
What I do not like is that the expression $decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"} is twice in my code. Is there a nicer or shorter version of the line above?
I doubt this qualifies as "nice", but I think it is achieving the goal of not referring to the expression twice. I'm not sure it's worth this pain, mind you:
my $foo = $decode_hash{checkpoint};
my $bar = $MEL[$i]{raw}[128];
if ($MEL[$i]{type} eq '5024') {
$MEL[$i]{decoded_inline} = $foo->{$bar}
if exists ( $foo->{$bar} );
}
Yes there is an easier way. You know that you can only store references in an array or hash, right? Well, there's a neat side effect to that. You can take references to deep hash or array slots and then treat them like scalar references. The unfortunate side-effect is that it autovivifies the slot, but if you're always going to assign to that slot, and just want to do some checking first, it's not a bad way to keep from typing things over and over--as well as repeatedly indexing the structures as well.
my $ref = \$decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"};
unless ( defined( $$ref )) {
...
$$ref = {};
...
}
As long as an existing hash element can't have an undefined value, I would write this
if ($MEL[$i]{type} eq '5024') {
my $value = $decode_hash{checkpoint}{$MEL[$i]{raw}[128]};
$MEL[$i]{decoded_inline} = $value if defined $value;
}
(Note that you shouldn't have the double-quotes around the hash key.)

Is returning a whole array from a Perl subroutine inefficient?

I often have a subroutine in Perl that fills an array with some information. Since I'm also used to hacking in C++, I find myself often do it like this in Perl, using references:
my @array;
getInfo(\@array);
sub getInfo {
my ($arrayRef) = @_;
push @$arrayRef, "obama";
# ...
}
instead of the more straightforward version:
my @array = getInfo();
sub getInfo {
my @array;
push @array, "obama";
# ...
return @array;
}
The reason, of course, is that I don't want the array to be created locally in the subroutine and then copied on return.
Is that right? Or does Perl optimize that away anyway?
What about returning an array reference in the first place?
sub getInfo {
my $array_ref = [];
push @$array_ref, 'foo';
# ...
return $array_ref;
}
my $a_ref = getInfo();
# or if you want the array expanded
my @array = @{ getInfo() };
Edit according to dehmann's comment:
It's also possible to use a normal array in the function and return a reference to it.
sub getInfo {
my @array;
push @array, 'foo';
# ...
return \@array;
}
Passing references is more efficient, but the difference is not as big as in C++. The argument values themselves (that means: the values in the array) are always passed by reference anyway (returned values are copied though).
Question is: does it matter? Most of the time, it doesn't. If you're returning 5 elements, don't bother about it. If you're returning/passing 100'000 elements, use references. Only optimize it if it's a bottleneck.
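If you want to measure it on your own machine, a quick Benchmark sketch along these lines will do; the element count and labels are arbitrary:
use strict;
use warnings;
use Benchmark qw(cmpthese);

sub by_list { my @a = (1) x 100_000; return @a }
sub by_ref  { my @a = (1) x 100_000; return \@a }

cmpthese(-2, {
    list => sub { my @copy = by_list() },   # copies the whole list back
    ref  => sub { my $ref  = by_ref()  },   # copies only one reference
});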
If I look at your example and think about what you want to do I'm used to write it in this manner:
sub getInfo {
my @array;
push @array, 'obama';
# ...
return \@array;
}
This seems to me like the straightforward version when I need to return a large amount of data. There is no need to allocate the array outside the sub as you did in your first code snippet, because my does it for you. In any case, you should not do premature optimization, as Leon Timmermans suggests.
To answer the final rumination, no, Perl does not optimize this away. It can't, really, because returning an array and returning a scalar are fundamentally different.
If you're dealing with large amounts of data or if performance is a major concern, then your C habits will serve you well - pass and return references to data structures rather than the structures themselves so that they won't need to be copied. But, as Leon Timmermans pointed out, the vast majority of the time, you're dealing with smaller amounts of data and performance isn't that big a deal, so do it in whatever way seems most readable.
This is the way I would normally return an array.
sub getInfo {
my @array;
push @array, 'foo';
# ...
return @array if wantarray;
return \@array;
}
This way it will work the way you want, in scalar, or list contexts.
my $array = getInfo;
my @array = getInfo;
$array->[0] == $array[0];
# same length
@$array == @array;
I wouldn't try to optimize it unless you know it is a slow part of your code. Even then I would use benchmarks to see which subroutine is actually faster.
There's two considerations. The obvious one is how big is your array going to get? If it's less than a few dozen elements, then size is not a factor (unless you're micro-optimizing for some rapidly called function, but you'd have to do some memory profiling to prove that first).
That's the easy part. The oft overlooked second consideration is the interface. How is the returned array going to be used? This is important because whole array dereferencing is kinda awful in Perl. For example:
for my $info (@{ getInfo($some, $args) }) {
...
}
That's ugly. This is much better.
for my $info ( getInfo($some, $args) ) {
...
}
It also lends itself to mapping and grepping.
my @info = grep { ... } getInfo($some, $args);
But returning an array ref can be handy if you're going to pick out individual elements:
my $address = getInfo($some, $args)->[2];
That's simpler than:
my $address = (getInfo($some, $args))[2];
Or:
my @info = getInfo($some, $args);
my $address = $info[2];
But at that point, you should question whether @info is truly a list or a hash.
my $address = getInfo($some, $args)->{address};
What you should not do is have getInfo() return an array ref in scalar context and an array in list context. This muddles the traditional use of scalar context as array length which will surprise the user.
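To make the warning concrete, here is a sketch of the kind of surprise a context-sensitive return can cause (the getInfo() below is contrived):
use strict;
use warnings;

sub getInfo {
    my @info = qw(alpha beta gamma);
    return wantarray ? @info : \@info;   # context-sensitive return, as warned against
}

my @info  = getInfo();      # list context: three elements, as expected
my $count = getInfo();      # scalar context: a caller expecting a count gets a reference
print "count is $count\n";  # prints something like "count is ARRAY(0x...)", not 3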
Finally, I will plug my own module, Method::Signatures, because it offers a compromise for passing in array references without having to use the array ref syntax.
use Method::Signatures;
method foo(\@args) {
print "@args"; # @args is not a copy
push @args, 42; # this alters the caller array
}
my @nums = (1,2,3);
Class->foo(\@nums); # prints 1 2 3
print "@nums"; # prints 1 2 3 42
This is done through the magic of Data::Alias.
Three other potentially large performance improvements if you are reading an entire, largish file and slicing it into an array:
1. Turn off buffering by using sysread() instead of read() (the manual warns about mixing the two).
2. Pre-extend the array by assigning to its last element, which saves repeated memory allocations.
3. Use unpack() to swiftly split up data such as uint16_t graphics channel data (see the sketch below).
Passing an array ref to the function allows the main program to deal with a simple array while the write-once-and-forget worker function uses the more complicated "$#" and arrow ->[$II] access forms. Being quite C'ish, it is likely to be fast!
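A rough sketch of the pre-extend and unpack ideas; the file name and record layout are made up:
use strict;
use warnings;

# Hypothetical file of little-endian 16-bit unsigned samples.
my $file = 'channel.dat';
open my $fh, '<:raw', $file or die "open $file: $!";

# Slurp the whole file with a single unbuffered sysread.
my $size = -s $fh;
sysread($fh, my $buf, $size) == $size or die "short read on $file";

# unpack splits the whole buffer into uint16 values in one call ('v' = little-endian uint16).
my @samples = unpack 'v*', $buf;

# Pre-extending an array before filling it element by element avoids repeated reallocations.
my @out;
$#out = $#samples;                      # pre-extend to the final size
$out[$_] = $samples[$_] * 2 for 0 .. $#samples;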
I know nothing about Perl so this is a language-neutral answer.
It is, in a sense, inefficient to copy an array from a subroutine into the calling program. The inefficiency arises in the extra memory used and the time taken to copy the data from one place to another. On the other hand, for all but the largest arrays, you might not give a damn, and might prefer to copy arrays out for elegance, cussedness or any other reason.
The efficient solution is for the subroutine to pass the calling program the address of the array. As I say, I haven't a clue about Perl's default behaviour in this respect. But some languages provide the programmer the option to choose which approach.

What's the safest way to iterate through the keys of a Perl hash?

If I have a Perl hash with a bunch of (key, value) pairs, what is the preferred method of iterating through all the keys? I have heard that using each may in some way have unintended side effects. So, is that true, and is one of the two following methods best, or is there a better way?
# Method 1
while (my ($key, $value) = each(%hash)) {
# Something
}
# Method 2
foreach my $key (keys(%hash)) {
# Something
}
The rule of thumb is to use the function most suited to your needs.
If you just want the keys and do not plan to ever read any of the values, use keys():
foreach my $key (keys %hash) { ... }
If you just want the values, use values():
foreach my $val (values %hash) { ... }
If you need the keys and the values, use each():
keys %hash; # reset the internal iterator so a prior each() doesn't affect the loop
while(my($k, $v) = each %hash) { ... }
If you plan to change the keys of the hash in any way except for deleting the current key during the iteration, then you must not use each(). For example, this code to create a new set of uppercase keys with doubled values works fine using keys():
%h = (a => 1, b => 2);
foreach my $k (keys %h)
{
$h{uc $k} = $h{$k} * 2;
}
producing the expected resulting hash:
(a => 1, A => 2, b => 2, B => 4)
But using each() to do the same thing:
%h = (a => 1, b => 2);
keys %h;
while(my($k, $v) = each %h)
{
$h{uc $k} = $h{$k} * 2; # BAD IDEA!
}
produces incorrect results in hard-to-predict ways. For example:
(a => 1, A => 2, b => 2, B => 8)
This, however, is safe:
keys %h;
while(my($k, $v) = each %h)
{
if(...)
{
delete $h{$k}; # This is safe
}
}
All of this is described in the perl documentation:
% perldoc -f keys
% perldoc -f each
One thing you should be aware of when using each is that it has the side effect of adding "state" to your hash (the hash has to remember what the "next" key is). When using code like the snippets posted above, which iterate over the whole hash in one go, this is usually not a problem. However, you will run into hard-to-track-down problems (I speak from experience ;), when using each together with statements like last or return to exit from the while ... each loop before you have processed all keys.
In this case, the hash will remember which keys it has already returned, and when you use each on it the next time (maybe in a totally unrelated piece of code), it will continue at this position.
Example:
my %hash = ( foo => 1, bar => 2, baz => 3, quux => 4 );
# find key 'baz'
while ( my ($k, $v) = each %hash ) {
print "found key $k\n";
last if $k eq 'baz'; # found it!
}
# later ...
print "the hash contains:\n";
# iterate over all keys:
while ( my ($k, $v) = each %hash ) {
print "$k => $v\n";
}
This prints:
found key bar
found key baz
the hash contains:
quux => 4
foo => 1
What happened to keys "bar" and "baz"? They're still there, but the second each starts where the first one left off, and stops when it reaches the end of the hash, so we never see them in the second loop.
The place where each can cause you problems is that it's a true, non-scoped iterator. By way of example:
while ( my ($key,$val) = each %a_hash ) {
print "$key => $val\n";
last if $val; #exits loop when $val is true
}
# but "each" hasn't reset!!
while ( my ($key,$val) = each %a_hash ) {
# continues where the last loop left off
print "$key => $val\n";
}
If you need to be sure that each gets all the keys and values, you need to make sure you use keys or values first (as that resets the iterator). See the documentation for each.
Using the each syntax will prevent the entire set of keys from being generated at once. This can be important if you're using a tied hash to a database with millions of rows. You don't want to generate the entire list of keys all at once and exhaust your physical memory. In this case each serves as an iterator, whereas keys actually generates the entire array before the loop starts.
So, the only place "each" is of real use is when the hash is very large (compared to the memory available). That is only likely to happen when the hash doesn't live in memory itself, unless you're programming a handheld data collection device or something else with small memory.
If memory is not an issue, usually the map or keys paradigm is the more prevalent and easier-to-read paradigm.
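A hedged sketch of the tied-hash case; DB_File is just one possible tie backend, and the file name here is made up:
use strict;
use warnings;
use Fcntl;
use DB_File;

# Tie a hash to an on-disk Berkeley DB file; the data never has to fit in memory.
tie my %big, 'DB_File', 'huge.db', O_RDONLY, 0666, $DB_HASH
    or die "cannot tie huge.db: $!";

# each() fetches one key/value pair at a time from the backing store ...
while (my ($k, $v) = each %big) {
    # process one record at a time
}

# ... whereas keys() would pull the full key list into memory first:
# my @all_keys = keys %big;   # potentially huge

untie %big;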
A few miscellaneous thoughts on this topic:
There is nothing unsafe about any of the hash iterators themselves. What is unsafe is modifying the keys of a hash while you're iterating over it. (It's perfectly safe to modify the values.) The only potential side-effect I can think of is that values returns aliases which means that modifying them will modify the contents of the hash. This is by design but may not be what you want in some circumstances.
John's accepted answer is good with one exception: the documentation is clear that it is not safe to add keys while iterating over a hash. It may work for some data sets but will fail for others depending on the hash order.
As already noted, it is safe to delete the last key returned by each. This is not true for keys as each is an iterator while keys returns a list.
I always use method 2 as well. The only benefit of using each is that, if you're just reading (rather than re-assigning) the value of the hash entry, you're not constantly de-referencing the hash.
I may get bitten by this one, but I think it's personal preference. I can't find any reference in the docs to each() being different from keys() or values() (other than the obvious "they return different things" answer). In fact, the docs state they use the same iterator, that they all return actual list values instead of copies of them, and that modifying the hash while iterating over it using any of them is bad.
All that said, I almost always use keys() because to me it is usually more self documenting to access the key's value via the hash itself. I occasionally use values() when the value is a reference to a large structure and the key to the hash was already stored in the structure, at which point the key is redundant and I don't need it. I think I've used each() 2 times in 10 years of Perl programming and it was probably the wrong choice both times =)
I usually use keys and I can't think of the last time I used or read a use of each.
Don't forget about map, depending on what you're doing in the loop!
map { print "$_ => $hash{$_}\n" } keys %hash;
I would say:
Use whatever's easiest to read/understand for most people (so keys, usually, I'd argue).
Use whatever you decide consistently throughout the whole code base.
This gives two major advantages:
It's easier to spot "common" code so you can re-factor it into functions/methods.
It's easier for future developers to maintain.
I don't think it's more expensive to use keys over each, so no need for two different constructs for the same thing in your code.