Why does Perl think that a non-existent multi-level hash element is there? - perl

Sorry, this seems like such a basic question but I still don't understand. If I have a hash, for example:
my %md_hash = ();
$md_hash{'top'}{'primary'}{'secondary'} = 0;
How come this is true?
if ($md_hash{'top'}{'foobar'}{'secondary'} == 0) {
print "I'm true even though I'm not in that hash\n";
}
There is no "foobar" level in the hash so shouldn't that result in false?
TIA

This isn't a multidimensional hash specific question.
It works the same with
my %foo;
if ( $foo{'bar'} == 0 ) {
print "I'm true even though I'm not in that hash\n";
}
$foo{'bar'} is undef, which compares numerically equal to 0, albeit with a warning if you have warnings enabled, as you should.
There is an additional side effect in your case; when you say
my %md_hash = ();
$md_hash{'top'}{'primary'}{'secondary'} = 0;
if ( $md_hash{'top'}{'foobar'}{'secondary'} == 0 ) {
print "I'm true even though I'm not in that hash\n";
}
$md_hash{'top'} returns a hash reference, and the 'foobar' key is looked for in that hash. Because of the {'secondary'}, that 'foobar' element lookup is in hash-dereference context. This makes $md_hash{'top'}{'foobar'} "autovivify" a hash reference as the value of the 'foobar' key, leaving you with this structure:
my %md_hash = (
'top' => {
'primary' => {
'secondary' => 0,
},
'foobar' => {},
},
);
The autovivification pragma can be used to disable this behavior. People sometimes assert that exists() has some effect on autovivification, but this is not true.
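A minimal sketch of the side effect described above, reusing the hash from the question (the flag variables are only there for illustration). It checks which keys exist after the comparison has run:

```perl
use strict;
use warnings;

my %md_hash;
$md_hash{top}{primary}{secondary} = 0;

my $was_zero;
{
    # The undef-vs-0 comparison warns, so silence it just for this test.
    no warnings 'uninitialized';
    $was_zero = ( $md_hash{top}{foobar}{secondary} == 0 ) ? 1 : 0;
}

# 'foobar' was dereferenced as a hash on the way to 'secondary', so it now
# exists; 'secondary' was only read, so it was not created.
my $foobar_exists    = exists $md_hash{top}{foobar}            ? 1 : 0;
my $secondary_exists = exists $md_hash{top}{foobar}{secondary} ? 1 : 0;

print "comparison true:  $was_zero\n";
print "foobar exists:    $foobar_exists\n";
print "secondary exists: $secondary_exists\n";
```

The first two flags come out as 1 and the last as 0: the intermediate level was created, the leaf was not.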

You are testing an undefined value for numeric zero. Of course you get a true result! What were you expecting?
You should also get a warning under use warnings. Why didn’t you?
If you do not start a program with:
use v5.12; # or whatever it is you are using
use strict;
use warnings;
you really shouldn’t even bother. :)
EDIT
NB: I am only clarifying for correctness, because the comment lines are not good for that. I really could not possibly care one whingeing whit less about the reputation points. I just want people to understand.
Even under the CPAN autovivification module, nothing changes. Witness:
use v5.10;
use strict;
use warnings;
no autovivification;
my %md_hash = ();
$md_hash{top}{primary}{secondary} = 0;
if ($md_hash{top}{foobar}{secondary} == 0) {
say "yup, that was zero.";
}
When run, that says:
$ perl /tmp/demo
Use of uninitialized value in numeric eq (==) at /tmp/demo line 10.
yup, that was zero.
The test is the == operator. Its RHS is 0. Its LHS is undef irrespective of autovivification. Since undef is numerically 0, that means that both LHS and RHS contain 0, which the == correctly identifies as holding the same number.
Autovivification is not the issue here, as ysth correctly observes when he writes that “This isn’t a multidimensional hash specific question.” All that matters is what you pass to ==. And undef is numerically 0.
You can stop autoviv if you really, really want to — by using the CPAN pragma. But you will not ever manage to change what happens here by suppressing autoviv. That shows that it is not an autoviv matter at all, just an undef one.
Now, you will get “extra” keys when you do this, since the undefined lvalue will get filled in on the way through the dereference chain. These are necessarily all the same:
$md_hash{top}{foobar}{secondary} # implicit arrow for infix deref
$md_hash{top}->{foobar}->{secondary} # explicit arrow for infix deref
${ ${ $md_hash{top} }{foobar} }{secondary} # explicit prefix deref
Whenever you deref an lvaluable undef in Perl, that storage location always gets filled in with the proper sort of reference to a newly allocated anonymous referent of the proper type. In short, it autovivs.
And that you can stop either by suppressing or else by sidestepping the autoviv. However, denying the autoviv is not the same as sidestepping it, because you just change what sort of thing gets returned. The overall expression is still fully evaluated: there is no automatic short-circuiting just because you suppress autoviv. That’s because autoviv is not the problem (and if it were not there, you would be really annoyed: trust me).
If you want short-circuiting, you have to write that yourself. I never seem to need to myself. In Perl, that is. On the other hand, C programmers are quite accustomed to writing
if (p && p->whatever) { ... }
And so you can do that, too, if you want to. However, it is pretty rare in my own experience. You almost have to bend over wrongwards in Perl for that ever to make a difference, as it is quite easy to arrange for your code not to change how it acts if there are empty levels.
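For what it is worth, a hand-written short-circuit in Perl might look something like the sketch below. The helper name secondary_is_zero is made up for this example; it simply tests each level before descending, so nothing gets autovivified:

```perl
use strict;
use warnings;

my %md_hash = ( top => { primary => { secondary => 0 } } );

# Hypothetical helper: true only if every level exists and the leaf is 0.
sub secondary_is_zero {
    my ( $hash, $middle ) = @_;
    my $ok = exists $hash->{top}
          && exists $hash->{top}{$middle}
          && defined $hash->{top}{$middle}{secondary}
          && $hash->{top}{$middle}{secondary} == 0;
    return $ok ? 1 : 0;
}

my $primary_result = secondary_is_zero( \%md_hash, 'primary' );
my $foobar_result  = secondary_is_zero( \%md_hash, 'foobar' );
my $foobar_created = exists $md_hash{top}{foobar} ? 1 : 0;

print "primary: $primary_result, foobar: $foobar_result, ",
      "foobar created: $foobar_created\n";
```

Because each && stops at the first false test, the 'foobar' level is never dereferenced and so never springs into existence.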

Try a search on "Perl autovivification".
The hash values "spring into existence" when you first access them. In this case, the value is undef, which when interpreted as a number is zero.
To test for existence of a hash value without auto-vivifying it, use the exists operator:
if (exists $md_hash{'top'}{'foobar'}{'secondary'}
&& $md_hash{'top'}{'foobar'}{'secondary'} == 0) {
print "I exist and I am zero\n";
}
Note that this will still auto-vivify $md_hash{'top'} and
$md_hash{'top'}{'foobar'} (i.e. the sub-hashes).
[edit]
As tchrist points out in a comment, it is poor style to compare undef against anything. So a better way to write this code would be:
if (defined $md_hash{'top'}{'foobar'}{'secondary'}
&& $md_hash{'top'}{'foobar'}{'secondary'} == 0) {
print "I exist and I am zero\n";
}
(This still auto-vivifies the two intermediate sub-hashes, just as the exists version does; merely testing the lowest-level element with defined does not create it.)

Related

Why does combining Perl hashes in an each expression not work?

I recently encountered this and became most displeased:
while (my ($key, $value) = each (%hash1, %hash2)) {
}
Gave this error: Experimental each on scalar is now forbidden at ...
But this, which seems to be the same operation using a superfluous variable:
my %h = (%hash1, %hash2);
while (my ($key, $value) = each %h) {
}
Compiled and worked just fine.
What's the reason for this, and is my displeasure warranted?
There are a couple of issues that come up here. First, let's deal with your immediate problem.
my %h = (%hash1, %hash2);
while (my ($key, $value) = each(%hash1, %hash2)) { ... }
each is actually an ordinary function in Perl, not a special syntax or something. So, as far as Perl is concerned, you're calling each with two arguments, not one. each expects either an array or hash value (basically, something that begins with % or @), which is why each(%h) works. You can create a local hash and pass that using a slightly more convoluted syntax:
while (my ($key, $value) = each(%{{%hash1, %hash2}})) { ... }
Here, we use the hashref constructor to make a new hash {%hash1, %hash2}. This is a scalar value that happens to point to a hash. Then we immediately dereference it with %{...}. Unfortunately, this causes another problem. If you try to run this code, it'll compile fine but then loop forever. To see why this is, we'll need to take a brief tangent.
each is a bit of an oddball in Perl. It's actually stateful and stores the so-called state of its call in the hash object. So
my %h = (a => 1, b => 2);
say each(%h);
say each(%h);
These two calls to each will return different values. One will return ("a", 1) and the other will return ("b", 2) (the order of the two returns is unspecified).
Now, your while condition is going to run anew every time it loops, so if we create a temporary hash at every loop iteration, and Perl is trying to store its each state in the hash every time, then you'll never reach the end of iteration since you'll never iterate more than once on any given hash before it's erased and replaced with a new one.
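A small sketch of that per-hash iterator state (the variable names are only for illustration). Each call advances the iterator stored in %h, and once it is exhausted the next call starts over:

```perl
use strict;
use warnings;

my %h = ( a => 1, b => 2 );

my ($first_key)  = each %h;    # first pair
my ($second_key) = each %h;    # the other pair
my @exhausted    = each %h;    # empty list: the iterator has run out
my ($again_key)  = each %h;    # iteration restarts from the beginning

print "first=$first_key second=$second_key again=$again_key\n";
```

A freshly built temporary hash always starts with a fresh iterator, which is why the loop over %{{%hash1, %hash2}} keeps returning a first pair and never reaches the empty list.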
My recommendation is to just use the temporary. Even if you could do it with each and the merged hash, you'd be making a new merged hash at every loop iteration. Alternatively, you can use keys to simply get all of the keys as a single list. This will only happen once, since it's happening as the head of a for loop.
for my $key (keys %{{%hash1, %hash2}}) {
my $value = $hash1{$key} // $hash2{$key};
...
}
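For comparison, here is a minimal sketch of the "just use the temporary" approach with each, using made-up sample data. The merged hash is built once, so each keeps its iterator state across loop iterations:

```perl
use strict;
use warnings;

my %hash1 = ( a => 1, b => 2 );
my %hash2 = ( b => 20, c => 3 );

# Build the merged hash once; on duplicate keys the later hash wins.
my %merged = ( %hash1, %hash2 );

my @pairs;
while ( my ( $key, $value ) = each %merged ) {
    push @pairs, "$key=$value";
}

print join( ',', sort @pairs ), "\n";    # a=1,b=20,c=3
```

Note that here a duplicate key takes its value from %hash2, whereas the // lookup in the for loop above prefers %hash1.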
Syntax for each:
each HASH
each ARRAY
This means
each %NAME
each %BLOCK
each EXPR->%*
each @NAME
each @BLOCK
each EXPR->@*
What you have does not match any of those patterns.
For a while, there was an experimental feature that allowed one to use
each EXPR
as long as the expression returned a reference to a hash or array.
The experiment was a failure, so this is no longer allowed. But your code wouldn't work even in a version of Perl with this feature. Your expression (%hash1, %hash2 in scalar context) returns the size of %hash2 or a weird string (depending on the version of Perl), and neither of those is a reference to a hash or a reference to an array.
Now, you might be tempted to use
each %{ { %hash1, %hash2 } }
Unfortunately, that creates a new hash each time it's evaluated, so you will perpetually get the first element of this new hash.

Need help understanding portion of script (globs and references)

I was reviewing this question, esp the response from Mr Eric Strom, and had a question regarding a portion of the more "magical" element within. Please review the linked question for the context as I'm only trying to understand the inner portion of this block:
for (qw($SCALAR @ARRAY %HASH)) {
my ($sigil, $type) = /(.)(.+)/;
if (my $ref = *$glob{$type}) {
$vars{$sigil.$name} = /\$/ ? $$ref : $ref
}
}
So, it loops over three words, breaking each into two vars, $sigil and $type. The if {} block is what I am not understanding. I suspect the portion inside the ( .. ) is getting a symbolic reference to the content within $glob{$type}... there must be some "magic" (some esoteric element of the underlying mechanism that I don't yet understand) relied upon there to determine the type of the "pointed-to" data?
The next line is also partly baffling. Appears to me that we are assigning to the vars hash, but what is the rhs doing? We did not assign to $_ in the last operation ($ref was assigned), so what is being compared to in the /\$/ block? My guess is that, if we are dealing with a scalar (though I fail to discern how we are), we deref the $ref var and store it directly in the hash, otherwise, we store the reference.
So, just looking for a little tale of what is going on in these three lines. Many thanks!
You have hit upon one of the most arcane parts of the Perl language, and I can best explain by referring you to Symbol Tables and Typeglobs from brian d foy's excellent Mastering Perl. Note also that there are further references to the relevant sections of Perl's own documentation at the bottom of the page, the most relevant of which is Typeglobs and Filehandles in perldata.
Essentially, the way perl symbol tables work is that every package has a "stash" -- a "symbol table hash" -- whose name is the same as the package but with a pair of trailing colons. So the stash for the default package main is called %main::. If you run this simple program
perl -E"say for keys %main::"
you will see all the familiar built-in identifiers.
The values for the stash elements are references to typeglobs, which again are hashes but have keys that correspond to the different data types, SCALAR, ARRAY, HASH, CODE etc. and values that are references to the data item with that type and identifier.
Suppose you define a scalar variable $xx, or more fully, $main::xx
our $xx = 99;
Now the stash for the main package is %main::, and the typeglob for all data items with the identifier xx is referenced by $main::{xx} so, because the sigil for typeglobs is a star * in the same way that scalar identifiers have a dollar $, we can dereference this as *{$main::{xx}}. To get the reference to the scalar variable that has the identifier xx, this typeglob can be indexed with the SCALAR string, giving *{$main::{xx}}{SCALAR}. Once more, this is a reference to the variable we're after, so to collect its value it needs dereferencing once again, and if you write
say ${*{$main::{xx}}{SCALAR}};
then you will see 99.
That may look a little complex when written in a single statement, but it is fairly straightforward when split up. The code in your question has the variable $glob set to a reference to a typeglob, which corresponds to this with respect to $main::xx
my $type = 'SCALAR';
my $glob = $main::{xx};
my $ref = *$glob{$type};
Now if we say $ref we get SCALAR(0x1d12d94) or similar, which is a reference to $main::xx as before, and printing $$ref will show 99 as expected.
The subsequent assignment to %vars is straightforward Perl, and I don't think you should have any problem understanding that once you get the principle that a package's symbol table is a stash of typeglobs, or really just a hash of hashes.
The elements of the iteration are strings. Since we don't have a lexical variable at the top of the loop, the element variable is $_. And it retains that value throughout the loop. Only one of those strings has a literal dollar sign, so we're telling the difference between '$SCALAR' and the other cases.
So what it is doing is getting 3 slots out of a package-level typeglob (sometimes shortened, with a little ambiguity to "glob"). *g{SCALAR}, *g{ARRAY} and *g{HASH}. The glob stores a hash and an array as a reference, so we simply store the reference into the hash. But, the glob stores a scalar as a reference to a scalar, and so needs to be dereferenced, to be stored as just a scalar.
So if you had a glob *a and in your package you had:
our $a = 'boo';
our @a = ( 1, 2, 3 );
our %a = ( One => 1, Two => 2 );
The resulting hash would be:
{ '$a' => 'boo'
, '%a' => { One => 1, Two => 2 }
, '@a' => [ 1, 2, 3 ]
};
Meanwhile the glob can be thought to look like this:
a =>
{ SCALAR => \'boo'
, ARRAY => [ 1, 2, 3 ]
, HASH => { One => 1, Two => 2 }
, CODE => undef
, IO => undef
, GLOB => undef
};
So to specifically answer your question.
if (my $ref = *$glob{$type}) {
$vars{$sigil.$name} = /\$/ ? $$ref : $ref
}
If a slot is not used it is undef. Thus $ref is assigned either a reference or undef, which evaluates to true as a reference and false as undef. So if we have a reference, then store the value of that glob slot into the hash, taking the reference stored in the hash, if it is a "container type" but taking the value if it is a scalar. And it is stored with the key $sigil . $name in the %vars hash.
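To see the three lines in action, here is a self-contained sketch. It assumes two package variables, $xx and @xx, declared only for this example, and takes $glob straight from the main stash rather than from the code in the linked question:

```perl
use strict;
use warnings;

our $xx = 99;
our @xx = ( 1, 2, 3 );

my $name = 'xx';
my $glob = $main::{$name};    # stash entry (typeglob) for the identifier xx

my %vars;
for (qw($SCALAR @ARRAY %HASH)) {
    my ( $sigil, $type ) = /(.)(.+)/;    # '$' and 'SCALAR', '@' and 'ARRAY', ...
    if ( my $ref = *$glob{$type} ) {     # undef when the slot is unused
        # $_ is still the loop string, so /\$/ is true only for '$SCALAR'
        $vars{ $sigil . $name } = /\$/ ? $$ref : $ref;
    }
}

print join( ', ', sort keys %vars ), "\n";
```

The '$xx' entry ends up holding the plain value 99, while '@xx' holds a reference to the package array.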

How to test if a value exists in a hash?

Let's say I have this
#!/usr/bin/perl
%x = ('a' => 1, 'b' => 2, 'c' => 3);
and I would like to know if the value 2 is a hash value in %x.
How is that done?
Fundamentally, a hash is a data structure optimized for solving the converse question, knowing whether the key 2 is present. But it's hard to judge without knowing, so let's assume that won't change.
Possibilities presented here will depend on:
how often you need to do it
how dynamic the hash is
One-time op
grep $_==2, values %x (also spelled grep {$_==2} values %x) will return a list of as many 2s as are present in the hash, or, in scalar context, the number of matches. Evaluated as a boolean in a condition, it yields just what you want.
grep works on versions of Perl as old as I can remember.
use List::Util qw(first); first {$_==2} values %x returns only the first match, undef if none. That makes it faster, as it will short-circuit (stop examining elements) as soon as it succeeds. This isn't a problem for 2, but take care that the returned element doesn't necessarily evaluate to boolean true. Use defined in those cases.
List::Util is a part of the Perl core since 5.8.
use List::MoreUtils qw(any); any {$_==2} values %x returns exactly the information you requested as a boolean, and exhibits the short-circuiting behavior.
List::MoreUtils is available from CPAN.
2 ~~ [values %x] returns exactly the information you requested as a boolean, and exhibits the short-circuiting behavior.
Smart matching is available in Perl since 5.10.
Repeated op, static hash
Construct a hash that maps values to keys, and use that one as a natural hash to test key existence.
my %r = reverse %x;
if ( exists $r{2} ) { ... }
Repeated op, dynamic hash
Use a reverse lookup as above. You'll need to keep it up to date, which is left as an exercise to the reader/editor. (hint: value collisions are tricky)
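One possible way to do that exercise is sketched below. It assumes all writes to %x go through a made-up helper, set_pair, and it stores a set of keys per value so that colliding values are not lost:

```perl
use strict;
use warnings;

my %x = ( a => 1, b => 2, c => 3, d => 2 );

# value => { key => 1, ... } so that colliding values keep all their keys
my %keys_for;
$keys_for{ $x{$_} }{$_} = 1 for keys %x;

# Hypothetical helper: route every write to %x through here
sub set_pair {
    my ( $key, $value ) = @_;
    if ( exists $x{$key} ) {
        my $old = $x{$key};
        delete $keys_for{$old}{$key};
        delete $keys_for{$old} unless %{ $keys_for{$old} };
    }
    $x{$key} = $value;
    $keys_for{$value}{$key} = 1;
    return;
}

set_pair( b => 5 );
my $two_after_first = exists $keys_for{2} ? 1 : 0;    # 'd' still maps to 2
set_pair( d => 6 );
my $two_after_second = exists $keys_for{2} ? 1 : 0;   # no key maps to 2 now

print "after first: $two_after_first, after second: $two_after_second\n";
```

A plain reverse %x would have kept only one of 'b' and 'd' for the value 2, so changing that key would wrongly make 2 disappear from the index.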
Shorter answer using smart match (Perl version 5.10 or later):
print 2 ~~ [values %x];
my %reverse = reverse %x;
if( defined( $reverse{2} ) ) {
print "2 is a value in the hash!\n";
}
If you want to find out the keys for which the value is 2:
foreach my $key ( keys %x ) {
print "2 is the value for $key\n" if $x{$key} == 2;
}
Everyone's answer so far was not performance-driven. While the smart-match (~~) solution short circuits (e.g. stops searching when something is found), the grep ones do not.
Therefore, here's a solution which may have better performance for Perl before 5.10 that doesn't have smart match operator:
use List::MoreUtils qw(any);
if (any { $_ == 2 } values %x) {
print "Found!\n";
}
Please note that this is just a specific example of searching in a list (values %x in this case) and as such, if you care about performance, the standard performance analysis of searching in a list applies, as discussed in detail in this answer.
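If installing List::MoreUtils from CPAN is not an option, the core List::Util module also exports any in newer releases (version 1.33 and later, which ships with Perl 5.20 and up). A minimal sketch, with flag variables named only for illustration:

```perl
use strict;
use warnings;
use List::Util qw(any);    # any() needs List::Util 1.33+, bundled with Perl 5.20+

my %x = ( a => 1, b => 2, c => 3 );

# any() stops at the first element for which the block is true
my $has_two  = ( any { $_ == 2 } values %x ) ? 1 : 0;
my $has_nine = ( any { $_ == 9 } values %x ) ? 1 : 0;

print "2 present: $has_two, 9 present: $has_nine\n";
```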
grep and values
my %x = ('a' => 1, 'b' => 2, 'c' => 3);
if (grep { $_ == 2 } values %x ) {
print "2 is in hash\n";
}
else {
print "2 is not in hash\n";
}
See also: perldoc -q hash
Where $count would be the result:
my $count = grep { $_ == 2 } values %x;
This will not only show you if it's a value in the hash, but how many times it occurs as a value. Alternatively you can use a regular expression, though note that /2/ also matches values such as 12 or 20:
my $count = grep {/2/} values %x;

Should Perl hashes always contain values?

I had an earlier question that received the following response from the noted Perl expert, Perl author and Perl trainer brian d foy:
[If] you're looking for a fixed sequence of characters at the end of each filename. You want to know if that fixed sequence is in a list of sequences that interest you. Store all the extensions in a hash and look in that hash:
my( $extension ) = $filename =~ m/\.([^.]+)$/;
if( exists $hash{$extension} ) { ... }
You don't need to build up a regular expression, and you don't need to go through several possible regex alternations to check every extension you have to examine.
Thanks for the advice brian.
What I now want to know is what is the best practice in a case like the above. Should one only define the keys, which is all I need to achieve what's described above, or should one always define a value as well?
It's usually preferable to set a defined value for every key. The idiomatic value (when you don't care about the value) is 1.
my %hash = map { $_ => 1 } @array;
Doing it this way makes the code that uses the hash slightly simpler because you can use $hash{key} as a Boolean value. If the value can be undefined you need to use the more verbose exists $hash{key} instead.
That said, there are situations where a value of undef is desirable. For example: imagine that you're parsing C header files to extract preprocessor symbols. It would be logical to store these in a hash of name => value pairs.
#define FOO 1
#define BAR
In Perl, this would map to:
my %symbols = ( FOO => 1, BAR => undef);
In C a #define defines a symbol, not a value -- "defined" in C is mapped to "exists" in Perl.
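A quick sketch of that distinction, using the %symbols hash from above plus an undeclared name BAZ added for contrast (the %report hash is only there to collect results):

```perl
use strict;
use warnings;

my %symbols = ( FOO => 1, BAR => undef );

my %report;
for my $name (qw(FOO BAR BAZ)) {
    my $exists  = exists $symbols{$name}  ? 'yes' : 'no';
    my $defined = defined $symbols{$name} ? 'yes' : 'no';
    $report{$name} = "$exists/$defined";
    print "$name: exists=$exists defined=$defined\n";
}
```

BAR exists but is not defined, which is exactly the "#define BAR" case; BAZ neither exists nor is defined.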
You can't create a hash key without a value. The value can be undef but it will be there. How else would you construct a hash? Or was your question regarding whether the value can be undef? In which case I would say that the value you store there (undef, 1, 0...) is entirely up to you. If a lot of folks are using it then you probably want to store some true value though, in case someone else uses if ($hash{$extension}) {...} instead of exists because they weren't paying attention.
undef is a value.
Of course, stuff like that is always dependent on what you are currently doing. But $foo{bar} is just a variable like $bar and I don't see any reason why either one should not be undef every now and then.
PS:
That's why exists exists.
As others have said, the idiomatic solution for a hashset (a hash that only contains keys, not values) is to use 1 as the value because this makes the testing for existence easy. However, there is something to be said for using undef as the value. It will force the users to test for existence with exists which is marginally faster. Of course, you could test for existence with exists even when the value is 1 and avoid the inevitable bugs from users who forget to use exists.
Using undef as a value in hash is more memory efficient than storing 1.
Storing '1' in a Set-hash Considered Harmful
I know using Considered Harmful is considered harmful, but this is bad, almost as bad as unrestrained goto usage.
Ok, I've harped on this in a few comments, but I think I need a full response to demonstrate the issue.
Let's say we have a daemon process that provides back-end inventory control for a shop that sells widgets.
my @items = qw(
widget
thingy
whozit
whatsit
);
my @items_in_stock = qw(
widget
thingy
);
my %in_stock;
@in_stock{@items_in_stock} = (1) x @items_in_stock; # initialize all keys to 1
sub Process_Request {
my $request = shift;
if( $request eq REORDER ) {
Reorder_Items(\@items, \%in_stock);
}
else {
Error_Response( ILLEGAL_REQUEST );
}
}
sub Reorder_Items{
my $items = shift;
my $in_stock = shift;
# Order items we do not have in-stock.
for my $item ( @$items ) {
Reorder_Item( $item )
if not exists $in_stock->{$item};
}
}
The tool is great, it automatically keeps items in stock. Very nice. Now, the boss asks for automatically generated catalogs of in-stock items. So we modify Process_Request() and add catalog generation.
sub Process_Request {
my $request = shift;
if( $request eq REORDER ) {
Reorder_Items(\@items, \%in_stock);
}
elsif( $request eq CATALOG ) {
Build_Catalog(\@items, \%in_stock);
}
else {
Error_Response( ILLEGAL_REQUEST );
}
}
sub Build_Catalog {
my $items = shift;
my $in_stock = shift;
my $catalog_response = '';
foreach my $item ( @$items ) {
$catalog_response .= Catalog_Item($item)
if $in_stock->{$item};
}
return $catalog_response;
}
In testing, Build_Catalog() works fine. Hooray, we go live with the app.
Oops. For some reason nothing is being ordered, the company runs out of stock of everything.
The Build_Catalog() routine adds keys to %in_stock, so Reorder_Items() now sees everything as in stock and never makes an order.
Using Hash::Util's lock_hash can help prevent accidental hash modification. If we locked %in_stock before calling Build_Catalog() we would have gotten a fatal error and would never have gone live with the bug.
In summary, it is best to test existence of keys rather than truth of your set-hash values. If you are using existence as a signifier, don't set your values to '1' because that will mask bugs and make them harder to track. Using lock_hash can help catch these problems.
If you must check for the truth of the values, do so in every case.
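A minimal sketch of what lock_hash buys you, using a cut-down %in_stock rather than the daemon code above. Fetching a key that is not in the locked hash becomes a fatal error, which is trapped with eval here so the message can be shown:

```perl
use strict;
use warnings;
use Hash::Util qw(lock_hash);

my %in_stock = map { $_ => 1 } qw(widget thingy);
lock_hash(%in_stock);

# Fetching a key that is not in the locked hash dies instead of passing quietly.
my $fetched = eval { my $value = $in_stock{whozit}; 1 } ? 1 : 0;
my $error   = $@;

print "fetch succeeded: $fetched\n";
print "error: $error";
```

The error text mentions a disallowed key in a restricted hash, so a truth test on a missing item fails loudly during testing.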

What does `$hash{$key} |= {}` do in Perl?

I was wrestling with some Perl that uses hash references.
In the end it turned out that my problem was the line:
$myhash{$key} |= {};
That is, "assign $myhash{$key} a reference to an empty hash, unless it already has a value".
Dereferencing this and trying to use it as a hash reference, however, resulted in interpreter errors about using a string as a hash reference.
Changing it to:
if( ! exists $myhash{$key}) {
$myhash{$key} = {};
}
... made things work.
So I don't have a problem. But I'm curious about what was going on.
Can anyone explain?
The reason you're seeing an error about using a string as a hash reference is because you're using the wrong operator. |= means "bitwise-or-assign." In other words,
$foo |= $bar;
is the same as
$foo = $foo | $bar;
What's happening in your example is that your new anonymous hash reference is getting stringified, then bitwise-ORed with the value of $myhash{$key}. To confuse matters further, if $myhash{$key} is undefined at the time, the value is the simple stringification of the hash reference, which looks like HASH(0x80fc284). So if you do a cursory inspection of the structure, it may look like a hash reference, but it's not. Here's some useful output via Data::Dumper:
perl -MData::Dumper -le '$hash{foo} |= { }; print Dumper \%hash'
$VAR1 = {
'foo' => 'HASH(0x80fc284)'
};
And here's what you get when you use the correct operator:
perl -MData::Dumper -le '$hash{foo} ||= { }; print Dumper \%hash'
$VAR1 = {
'foo' => {}
};
Perl has shorthand assignment operators. The ||= operator is often used to set default values for variables due to Perl's feature of having logical operators return the last value evaluated. The problem is that you used |= which is a bitwise or instead of ||= which is a logical or.
As of Perl 5.10 it's better to use //= instead. // is the logical defined-or operator and doesn't fail in the corner case where the current value is defined but false.
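The three operators can be compared side by side. This is only a sketch with made-up hash names, and it assumes the default (non-"bitwise" feature) behaviour of |, where two non-numeric operands are combined as strings:

```perl
use strict;
use warnings;

my %h = ( count => 0 );

my %with_or = %h;
$with_or{count} ||= {};     # 0 is false, so it is replaced by a hash reference

my %with_dor = %h;
$with_dor{count} //= {};    # 0 is defined, so it is kept

my %with_bitor;
$with_bitor{new} |= {};     # bitwise or: stores the stringified reference

my $or_kind    = ref( $with_or{count} )    || 'plain';
my $dor_kind   = ref( $with_dor{count} )   || 'plain';
my $bitor_kind = ref( $with_bitor{new} )   || 'plain';

print "||= gives $or_kind, //= gives $dor_kind, |= gives $bitor_kind\n";
```

Only ||= ends up with a real hash reference here; //= leaves the existing 0 alone, and |= leaves a plain string that merely looks like HASH(0x...).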
I think your problem was using "|=" (bitwise-or assignment) instead of "||=" (assign if false).
Note that your new code is not exactly equivalent. The difference is that "$myhash{$key} ||= {}" will replace existing-but-false values with a hash reference, but the new one won't. In practice, this is probably not relevant.
Try this:
my %myhash;
$myhash{$key} ||= {};
You can't declare a hash element in a my clause, as far as I know. You declare the hash first, then add the element in.
Edit: I see you've taken out the my. How about trying ||= instead of |=? The former is idiomatic for "lazy" initialisation.