Confusion about proper usage of dereference in Perl - perl

I noticed the other day that - while altering values in a hash - that when you dereference a hash in Perl, you actually are making a copy of that hash. To confirm I wrote this quick little script:
#! perl
use warnings;
use strict;
my %h = ();
my $hRef = \%h;
my %h2 = %{$hRef};
my $h2Ref = \%h2;
if($hRef eq $h2Ref) {
print "\n\tThey're the same $hRef $h2Ref";
}
else {
print "\n\tThey're NOT the same $hRef $h2Ref";
}
print "\n\n";
The output:
They're NOT the same HASH(0x10ff6848) HASH(0x10fede18)
This leads me to realize that there could be spots in some of my scripts where they aren't behaving as expected. Why is it even like this in the first place? If you're passing or returning a hash, it would be more natural to assume that dereferencing the hash would allow me to alter the values of the hash being dereferenced. Instead I'm just making copies all over the place without any real need/reason to beyond making syntax a little more obvious.
I realize the fact that I hadn't even noticed this until now shows its probably not that big of a deal (in terms of the need to go fix in all of my scripts - but important going forward). I think its going to be pretty rare to see noticeable performance differences out of this, but that doesn't alter the fact that I'm still confused.
Is this by design in perl? Is there some explicit reason I don't know about for this; or is this just known and you - as the programmer - expected to know and write scripts accordingly?

The problem is that you are making a copy of the hash to work with in this line:
my %h2 = %{$hRef};
And that is understandable, since many posts here on SO use that idiom to make a local name for a hash, without explaining that it is actually making a copy.
In Perl, a hash is a plural value, just like an array. This means that in list context (such as you get when assigning to a hash) the aggregate is taken apart into a list of its contents. This list of pairs is then assembled into a new hash as shown.
What you want to do is work with the reference directly.
for (keys %$hRef) {...}
for (values %$href) {...}
my $x = $href->{some_key};
# or
my $x = $$href{some_key};
$$href{new_key} = 'new_value';
When working with a normal hash, you have the sigil which is either a % when talking about the entire hash, a $ when talking about a single element, and # when talking about a slice. Each of these sigils is then followed by an identifier.
%hash # whole hash
$hash{key} # element
#hash{qw(a b)} # slice
To work with a reference named $href simply replace the string hash in the above code with $href. In other words, $href is the complete name of the identifier:
%$href # whole hash
$$href{key} # element
#$href{qw(a b)} # slice
Each of these could be written in a more verbose form as:
%{$href}
${$href}{key}
#{$href}{qw(a b)}
Which is again a substitution of the string '$href' for 'hash' as the name of the identifier.
%{hash}
${hash}{key}
#{hash}{qw(a b)}
You can also use a dereferencing arrow when working with an element:
$hash->{key} # exactly the same as $$hash{key}
But I prefer the doubled sigil syntax since it is similar to the whole aggregate and slice syntax, as well as the normal non-reference syntax.
So to sum up, any time you write something like this:
my #array = #$array_ref;
my %hash = %$hash_ref;
You will be making a copy of the first level of each aggregate. When using the dereferencing syntax directly, you will be working on the actual values, and not a copy.
If you want a REAL local name for a hash, but want to work on the same hash, you can use the local keyword to create an alias.
sub some_sub {
my $hash_ref = shift;
our %hash; # declare a lexical name for the global %{__PACKAGE__::hash}
local *hash = \%$hash_ref;
# install the hash ref into the glob
# the `\%` bit ensures we have a hash ref
# use %hash here, all changes will be made to $hash_ref
} # local unwinds here, restoring the global to its previous value if any
That is the pure Perl way of aliasing. If you want to use a my variable to hold the alias, you can use the module Data::Alias

You are confusing the actions of dereferencing, which does not inherently create a copy, and using a hash in list context and assigning that list, which does. $hashref->{'a'} is a dereference, but most certainly does affect the original hash. This is true for $#$arrayref or values(%$hashref) also.
Without the assignment, just the list context %$hashref is a mixed beast; the resulting list contains copies of the hash keys but aliases to the actual hash values. You can see this in action:
$ perl -wle'$x={"a".."f"}; for (%$x) { $_=chr(ord($_)+10) }; print %$x'
epcnal
vs.
$ perl -wle'$x={"a".."f"}; %y=%$x; for (%y) { $_=chr(ord($_)+10) }; print %$x; print %y'
efcdab
epcnal
but %$hashref isn't acting any differently than %hash here.

No, dereferencing does not create a copy of the referent. It's my that creates a new variable.
$ perl -E'
my %h1; my $h1 = \%h1;
my %h2; my $h2 = \%h2;
say $h1;
say $h2;
say $h1 == $h2 ?1:0;
'
HASH(0x83b62e0)
HASH(0x83b6340)
0
$ perl -E'
my %h;
my $h1 = \%h;
my $h2 = \%h;
say $h1;
say $h2;
say $h1 == $h2 ?1:0;
'
HASH(0x9eae2d8)
HASH(0x9eae2d8)
1
No, $#{$someArrayHashRef} does not create a new array.

If perl did what you suggest, then variables would get aliased very easily, which would be far more confusing. As it is, you can alias variables with globbing, but you need to do so explicitly.

Related

Printing hash value in Perl

When I print a variable, I am getting a HASH(0xd1007d0) value. I need to print the values of all the keys and values. However, I am unable to as the control does not enter the loop.
foreach my $var(keys %{$HashVariable}){
print"In the loop \n";
print"$var and $HashVariable{$var}\n";
}
But the control is not even entering the loop. I am new to perl.
I can't answer completely, because it depends entirely on what's in $HashVariable.
The easiest way to tell what's in there is:
use Data::Dumper;
print Dumper $HashVariable;
Assuming this is a hash reference - which it would be, if print $HashVariable gives HASH(0xdeadbeef) as an output.
So this should work:
#!/usr/bin/env perl
use strict;
use warnings;
my $HashVariable = { somekey => 'somevalue' };
foreach my $key ( keys %$HashVariable ) {
print $key, " => ", $HashVariable->{$key},"\n";
}
The only mistake you're making is that $HashVariable{$key} won't work - you need to dereference, because as it stands it refers to %HashVariable not $HashVariable which are two completely different things.
Otherwise - if it's not entering the loop - it may mean that keys %$HashVariable isn't returning anything. Which is why that Dumper test would be useful - is there any chance you're either not populating it correctly, or you're writing to %HashVariable instead.
E.g.:
my %HashVariable;
$HashVariable{'test'} = "foo";
There's an obvious problem here, but it wouldn't cause the behaviour that you are seeing.
You think that you have a hash reference in $HashVariable and that sounds correct given the HASH(0xd1007d0) output that you see when you print it.
But setting up a hash reference and running your code, gives slightly strange results:
my $HashVariable = {
foo => 1,
bar => 2,
baz => 3,
};
foreach my $var(keys %{$HashVariable}){
print"In the loop \n";
print"$var and $HashVariable{$var}\n";
}
The output I get is:
In the loop
baz and
In the loop
bar and
In the loop
foo and
Notice that the values aren't being printed out. That's because of the problem I mentioned above. Adding use strict to the program (which you should always do) tells us what the problem is.
Global symbol "%HashVariable" requires explicit package name (did you forget to declare "my %HashVariable"?) at hash line 14.
Execution of hash aborted due to compilation errors.
You are using $HashVariable{$var} to look up a key in your hash. That would be correct if you had a hash called %HashVariable, but you don't - you have a hash reference called $HashVariable (note the $ instead of %). To look up a key from a hash reference, you need to use a dereferencing arrow - $HashVariable->{$var}.
Fixing that, your program works as expected.
use strict;
use warnings;
my $HashVariable = {
foo => 1,
bar => 2,
baz => 3,
};
foreach my $var(keys %{$HashVariable}){
print"In the loop \n";
print"$var and $HashVariable->{$var}\n";
}
And I see:
In the loop
bar and 2
In the loop
foo and 1
In the loop
baz and 3
The only way that you could get the results you describe (the HASH(0xd1007d0) output but no iterations of the loop) is if you have a hash reference but the hash has no keys.
So (as I said in a comment) we need to see how your hash reference is created.

Does iterating over a hash reference require implicitly copying it in perl?

Lets say I have a large hash and I want to iterate over the contents of it contents. The standard idiom would be something like this:
while(($key, $value) = each(%{$hash_ref})){
///do something
}
However, if I understand my perl correctly this is actually doing two things. First the
%{$hash_ref}
is translating the ref into list context. Thus returning something like
(key1, value1, key2, value2, key3, value3 etc)
which will be stored in my stacks memory. Then the each method will run, eating the first two values in memory (key1 & value1) and returning them to my while loop to process.
If my understanding of this is right that means that I have effectively copied my entire hash into my stacks memory only to iterate over the new copy, which could be expensive for a large hash, due to the expense of iterating over the array twice, but also due to potential cache hits if both hashes can't be held in memory at once. It seems pretty inefficient. I'm wondering if this is what really happens, or if I'm either misunderstanding the actual behavior or the compiler optimizes away the inefficiency for me?
Follow up questions, assuming I am correct about the standard behavior.
Is there a syntax to avoid copying of the hash by iterating over it values in the original hash? If not for a hash is there one for the simpler array?
Does this mean that in the above example I could get inconsistent values between the copy of my hash and my actual hash if I modify the hash_ref content within my loop; resulting in $value having a different value then $hash_ref->($key)?
No, the syntax you quote does not create a copy.
This expression:
%{$hash_ref}
is exactly equivalent to:
%$hash_ref
and assuming the $hash_ref scalar variable does indeed contain a reference to a hash, then adding the % on the front is simply 'dereferencing' the reference - i.e. it resolves to a value that represents the underlying hash (the thing that $hash_ref was pointing to).
If you look at the documentation for the each function, you'll see that it expects a hash as an argument. Putting the % on the front is how you provide a hash when what you have is a hashref.
If you wrote your own subroutine and passed a hash to it like this:
my_sub(%$hash_ref);
then on some level you could say that the hash had been 'copied', since inside the subroutine the special #_ array would contain a list of all the key/value pairs from the hash. However even in that case, the elements of #_ are actually aliases for the keys and values. You'd only actually get a copy if you did something like: my #args = #_.
Perl's builtin each function is declared with the prototype '+' which effectively coerces a hash (or array) argument into a reference to the underlying data structure.
As an aside, starting with version 5.14, the each function can also take a reference to a hash. So instead of:
($key, $value) = each(%{$hash_ref})
You can simply say:
($key, $value) = each($hash_ref)
No copy is created by each (though you do copy the returned values into $key and $value through assignment). The hash itself is passed to each.
each is a little special. It supports the following syntaxes:
each HASH
each ARRAY
As you can see, it doesn't accept an arbitrary expression. (That would be each EXPR or each LIST). The reason for that is to allow each(%foo) to pass the hash %foo itself to each rather than evaluating it in list context. each can do that because it's an operator, and operators can have their own parsing rules. However, you can do something similar with the \% prototype.
use Data::Dumper;
sub f { print(Dumper(#_)); }
sub g(\%) { print(Dumper(#_)); } # Similar to each
my %h = (a=>1, b=>2);
f(%h); # Evaluates %h in list context.
print("\n");
g(%h); # Passes a reference to %h.
Output:
$VAR1 = 'a'; # 4 args, the keys and values of the hash
$VAR2 = 1;
$VAR3 = 'b';
$VAR4 = 2;
$VAR1 = { # 1 arg, a reference to the hash
'a' => 1,
'b' => 2
};
%{$h_ref} is the same as %h, so all of the above applies to %{$h_ref} too.
Note that the hash isn't copied even if it is flattened. The keys are "copied", but the values are returned directly.
use Data::Dumper;
my %h = (abc=>"def", ghi=>"jkl");
print(Dumper(\%h));
$_ = uc($_) for %h;
print(Dumper(\%h));
Output:
$VAR1 = {
'abc' => 'def',
'ghi' => 'jkl'
};
$VAR1 = {
'abc' => 'DEF',
'ghi' => 'JKL'
};
You can read more about this here.

Confused by local variables and methods in Perl

I've got the following Perl code:
my $wantedips;
# loop through the interfaces
foreach (#$interfaces) {
# local variable called $vlan
my $vlan = $_->{vlan};
# local variable called $cidr
my $cidr = $_->{ip} ."/".$nnm->bits();
# I dont understand this next bit.
# As a rubyist, it looks like a method called $cidr is being called on $wantedips
# But $cidr is already defined as a local variable.
# Why the spooky syntax? Why is $cidr passed as a method to $wantedips?
# what does ->{} do in PERL? Is it some kind of hash syntax?
$wantedips->{$cidr} = $vlan;
# break if condition true
next if ($ips->{$cidr} == $vlan);
# etc
}
The part I don't get is in my comments. Why is $cidr passed to $wantedips, when both are clearly defined as local variables? I'm a rubyist and this is really confusing. I can only guess that $xyz->{$abc}="hello" creates a hash of some sort like so:
$xyz => {
$abc => "hello"
}
I'm new to Perl as you can probably tell.
I don't understand why you are comfortable with
my $vlan = $_->{vlan}
but then
$wantedips->{$cidr} = $vlan
gives you trouble? Both use the same syntax to access hash elements using a hash reference.
The indirection operator -> is used to apply keys, indices, or parameters to a reference value, so you access elements of a hash by its reference with
$href->{vlan}
elements of an array by its reference with
$aref->[42]
and call a code reference with
$cref->(1, 2, 3)
As a convenience, and to make code cleaner, you can remove the indirection operator from the sequences ]->[ and }->{ (and any mixture of brackets and braces). So if you have a nested data structure you can write
my $name = $system->{$ip_address}{name}[2]
instead of
my $name = $system->{$ip_address}->{name}->[2]
#I dont understand this next bit.
$wantedips->{$cidr} = $vlan;
$wantedips is a scalar, specifically it is a hashref (a reference to a hash).
The arrow gets something from inside the reference.
{"keyname"} is how to access a particular key in a hash.
->{"keyname"} is how you access a particular key in a hash ref
$cidr is also a scalar, in this case it is a string.
->{$cidr} accesses a key from a hash ref when the key name is stored in a string.
So to put it all together:
$wantedips->{$cidr} = $vlan; means "Assign the value of $vlan to the key described by the string stored in $cidr on the hash referenced by $wantedips.
I can only guess that $xyz->{$abc}="hello" creates a hash of some sort
like.
Let's break this down to a step by step example that strips out the loops and other bits not directly associated with the code in question.
# Create a hash
my %hash;
# Make it a hashref
my $xyz = \%hash;
# (Those two steps could be done as: my $xyz = {})
# Create a string
my $abc = "Hello";
# Use them together
$xyz->{$abc} = "world";
# Look at the result:
use Data::Dump;
Data::Dump::ddx($xyz);
# Result: { Hello => "world" }

Adding multiple values to key in perl hash

I need to create multi-dimensional hash.
for example I have done:
$hash{gene} = $mrna;
if (exists ($exon)){
$hash{gene}{$mrna} = $exon;
}
if (exists ($cds)){
$hash{gene}{$mrna} = $cds;
}
where $gene, $mrna, $exon, $cds are unique ids.
But, my issue is that I want some properties of $gene and $mrna to be included in the hash.
for example:
$hash{$gene}{'start_loc'} = $start;
$hash{gene}{mrna}{'start_loc'} = $start;
etc. But, is that a feasible way of declaring a hash? If I call $hash{$gene} both $mrna and start_loc will be printed. What could be the solution?
How would I add multiple values for the same key $gene and $mrna being the keys in this case.
Any suggestions will be appreciated.
What you need to do is to read the Perl Reference Tutorial.
Simple answer to your question:
Perl hashes can only take a single value to a key. However, that single value can be a reference to a memory location of another hash.
my %hash1 = ( foo => "bar", fu => "bur" }; #First hash
my %hash2;
my $hash{some_key} = \%hash1; #Reference to %hash1
And, there's nothing stopping that first hash from containing a reference to another hash. It's turtles all the way down!.
So yes, you can have a complex and convoluted structure as you like with as many sub-hashes as you want. Or mix in some arrays too.
For various reasons, I prefer the -> syntax when using these complex structures. I find that for more complex structures, it makes it easier to read. However, the main this is it makes you remember these are references and not actual multidimensional structures.
For example:
$hash{gene}->{mrna}->{start_loc} = $start; #Quote not needed in string if key name qualifies as a valid variable name.
The best thing to do is to think of your hash as a structure. For example:
my $person_ref = {}; #Person is a hash reference.
my $person->{NAME}->{FIRST} = "Bob";
my $person->{NAME}->{LAST} = "Rogers";
my $person->{PHONE}->{WORK}->[0] = "555-1234"; An Array Ref. Might have > 1
my $person->{PHONE}->{WORK}->[1] = "555-4444";
my $person->{PHONE}->{CELL}->[0] = "555-4321";
...
my #people;
push #people, $person_ref;
Now, I can load up my #people array with all my people, or maybe use a hash:
my %person;
$person{$bobs_ssn} = $person; #Now, all of Bob's info is index by his SSN.
So, the first thing you need to do is to think of what your structure should look like. What are the fields in your structure? What are the sub-fields? Figure out what your structure should look like, and then setup your hash of hashes to look like that. Figure out exactly how it will be stored and keyed.
Remember, this hash contains references to your genes (or whatever), so you want to choose your keys wisely.
Read the tutorial. Then, try your hand at it. It's not all that complicated to understand. However, it can be a bear to maintain.
When you say use strict;, you give yourself some protection:
my $foo = "bar";
say $Foo; #This won't work!
This won't work because you didn't declare $Foo, you declared $foo. The use stict; can catch variable names that are mistyped, but:
my %var;
$var{foo} = "bar";
say $var{Foo}; #Whoops!
This will not be caught (except maybe that $var{Foo} has not been initialized. The use strict; pragma can't detect mistakes in typing in your keys.
The next step, after you've grown comfortable with references is to move onto object oriented Perl. There's a Tutorial for that too.
All Object Oriented Perl does is to take your hash references, and turns them into objects. Then, it creates subroutines that will help you keep track of manipulating objects. For example:
sub last_name {
my $person = shift; #Don't worry about this for now..
my $last_name = shift;
if ( exists $last_name ) {
my $person->{NAME}->{LAST} = $last_name;
}
return $person->{NAME}->{LAST};
}
When I set my last name using this subroutine ...I mean method, I guarantee that the key will be $person->{NAME}->{LAST} and not $person->{LAST}->{NAME} or $person->{LAST}->{NMAE}. or $person->{last}->{name}.
The main problem isn't learning the mechanisms, but learning to apply them. So, think about exactly how you want to represent your items. This about what fields you want, and how you're going to pull up that information.
You could try pushing each value onto a hash of arrays:
my (#gene, #mrna, #exon, #cds);
my %hash;
push #{ $hash{$gene[$_]} }, [$mrna[$_], $exon[$_], $cds[$_] ] for 0 .. $#gene;
This way gene is the key, with multiple values ($mrna, $exon, $cds) associated with it.
Iterate over keys/values as follows:
for my $key (sort keys %hash) {
print "Gene: $key\t";
for my $value (#{ $hash{$key} } ) {
my ($mrna, $exon, $cds) = #$value; # De-references the array
print "Values: [$mrna], [$exon], [$cds]\n";
}
}
The answer to a question I've asked previously might be of help (Can a hash key have multiple 'subvalues' in perl?).

Should I use $_[0] or copy the argument list in Perl?

If I pass a hash to a sub:
parse(\%data);
Should I use a variable to $_[0] first or is it okay to keep accessing $_[0] whenever I want to get an element from the hash? clarification:
sub parse
{ $var1 = $_[0]->{'elem1'};
$var2 = $_[0]->{'elem2'};
$var3 = $_[0]->{'elem3'};
$var4 = $_[0]->{'elem4'};
$var5 = $_[0]->{'elem5'};
}
# Versus
sub parse
{ my $hr = $_[0];
$var1 = $hr->{'elem1'};
$var2 = $hr->{'elem2'};
$var3 = $hr->{'elem3'};
$var4 = $hr->{'elem4'};
$var5 = $hr->{'elem5'};
}
Is the second version more correct since it doesn't have to keep accessing the argument array, or does Perl end up interpereting them the same way anyhow?
In this case there is no difference because you are passing reference to hash. But in case of passing scalar there will be difference:
sub rtrim {
## remove tailing spaces from first argument
$_[0] =~ s/\s+$//;
}
rtrim($str); ## value of the variable will be changed
sub rtrim_bugged {
my $str = $_[0]; ## this makes a copy of variable
$str =~ s/\s+$//;
}
rtrim($str); ## value of the variable will stay the same
If you're passing hash reference, then only copy of reference is created. But the hash itself will be the same. So if you care about code readability then I suggest you to create a variable for all your parameters. For example:
sub parse {
## you can easily add new parameters to this function
my ($hr) = #_;
my $var1 = $hr->{'elem1'};
my $var2 = $hr->{'elem2'};
my $var3 = $hr->{'elem3'};
my $var4 = $hr->{'elem4'};
my $var5 = $hr->{'elem5'};
}
Also more descriptive variable names will improve your code too.
For a general discussion of the efficiency of shift vs accessing #_ directly, see:
Is there a difference between Perl's shift versus assignment from #_ for subroutine parameters?
Is 'shift' evil for processing Perl subroutine parameters?
As for your specific code, I'd use shift, but simplify the data extraction with a hash slice:
sub parse
{
my $hr = shift;
my ($var1, $var2, $var3, $var4, $var5) = #{$hr}{qw(elem1 elem2 elem3 elem4 elem5)};
}
I'll assume that this method does something else with these variables that makes it worthwhile to keep them in separate variables (perhaps the hash is read-only, and you need to make some modifications before inserting them into some other data?) -- otherwise why not just leave them in the hashref where they started?
You are micro-optimizing; try to avoid that. Go with whatever is most readable/maintainable. Usually this would be the one where you use a lexical variable, since its name indicates its purpose...but if you use a name like $data or $x this obviously doesn't apply.
In terms of the technical details, for most purposes you can estimate the time taken by counting the number of basic ops perl will use. For your $_[0], an element lookup in a non-lexical array variable takes multiple ops: one to get the glob, one to get the array part of the glob, one or more to get the index (just one for a constant), and one to look up the element. $hr, on the other hand is a single op. To cater to direct users of #_, there's an optimization that reduces the ops for $_[0] to a single combined op (when the index is between 0 and 255 inclusive), but it isn't used in your case because the hash-deref context requires an additional flag on the array element lookup (to support autovivification) and that flag isn't supported by the optimized op.
In summary, using a lexical is going to be both more readable and (if you using it more than once) imperceptibly faster.
My rule is that I try not to use $_[0] in subroutines that are longer than a couple of statements. After that everything gets a user-defined variable.
Why are you copying all of the hash values into variables? Just leave them in the hash where they belong. That's a much better optimization than the one you are thinking about.
Its the same although the second is more clear
Since they work, both are fine, the common practice is to shift off parameters.
sub parse { my $hr = shift; my $var1 = $hr->{'elem1'}; }