How can I use hashes as arguments to subroutines in Perl? - perl

I was asked to modify some existing code to add some additional functionality. I have searched on Google and cannot seem to find the answer. I have something to this effect...
%first_hash = gen_first_hash();
%second_hash = gen_second_hash();
do_stuff_with_hashes(%first_hash, %second_hash);
sub do_stuff_with_hashes
{
my %first_hash = shift;
my %second_hash = shift;
# do stuff with the hashes
}
I am getting the following errors:
Odd number of elements in hash assignment at ./gen.pl line 85.
Odd number of elements in hash assignment at ./gen.pl line 86.
Use of uninitialized value in concatenation (.) or string at ./gen.pl line 124.
Use of uninitialized value in concatenation (.) or string at ./gen.pl line 143.
Line 85 and 86 are the first two lines in the sub routine and 124 and 143 are where I am accessing the hashes. When I look up those errors it seems to suggest that my hashes are uninitialized. However, I can verify that the hashes have values. Why am I getting these errors?

The hashes are being collapsed into flat lists when you pass them into the function. So, when you shift off a value from the function's arguments, you're only getting one value. What you want to do is pass the hashes by reference.
do_stuff_with_hashes(\%first_hash, \%second_hash);
But then you have to work with the hashes as references.
my $first_hash = shift;
my $second_hash = shift;

A bit late but,
As have been stated, you must pass references, not hashes.
do_stuff_with_hashes(\%first_hash, \%second_hash);
But if you need/want to use your hashes as so, you may dereference them imediatly.
sub do_stuff_with_hashes {
my %first_hash = %{shift()};
my %second_hash = %{shift()};
};

Hash references are the way to go, as the others have pointed out.
Providing another way to do this just for kicks...because who needs temp variables?
do_stuff_with_hashes( { gen_first_hash() }, { gen_second_hash() } );
Here you are just creating hash references on the fly (via the curly brackets around the function calls) to use in your do_stuff_with_hashes function. This is nothing special, the other methods are just as valid and probably more clear. This might help down the road if you see this activity in your travels as someone new to Perl.

First off,
do_stuff_with_hashes(%first_hash, %second_hash);
"streams" the hashes into a list, equivalent to:
( 'key1_1', 'value1_1', ... , 'key1_n', 'value1_n', 'key2_1', 'value2_1', ... )
and then you select one and only one of those items. So,
my %first_hash = shift;
is like saying:
my %first_hash = 'key1_1';
# leaving ( 'value1', ... , 'key1_n', 'value1_n', 'key2_1', 'value2_1', ... )
You cannot have a hash like { 'key1' }, since 'key1' is mapping to nothing.

A solution without shift() is faster, because it do not copy the data in memory, and you can modify the hash in the subroutine. See my example:
sub do_stuff_with_hashes($$$) {
my ($str,$refHash1,$refHash2)=#_;
foreach (keys %{$refHash1}) { print $_.' ' }
$$refHash1{'new'}++;
}
my (%first_hash, %second_hash);
$first_hash{'first'}++;
do_stuff_with_hashes('any_parameter', \%first_hash, \%second_hash);
print "\n---\n", $first_hash{'new'};

Related

Need help understanding portion of script (globs and references)

I was reviewing this question, esp the response from Mr Eric Strom, and had a question regarding a portion of the more "magical" element within. Please review the linked question for the context as I'm only trying to understand the inner portion of this block:
for (qw($SCALAR #ARRAY %HASH)) {
my ($sigil, $type) = /(.)(.+)/;
if (my $ref = *$glob{$type}) {
$vars{$sigil.$name} = /\$/ ? $$ref : $ref
}
}
So, it loops over three words, breaking each into two vars, $sigil and $type. The if {} block is what I am not understanding. I suspect the portion inside the ( .. ) is getting a symbolic reference to the content within $glob{$type}... there must be some "magic" (some esoteric element of the underlying mechanism that I don't yet understand) relied upon there to determine the type of the "pointed-to" data?
The next line is also partly baffling. Appears to me that we are assigning to the vars hash, but what is the rhs doing? We did not assign to $_ in the last operation ($ref was assigned), so what is being compared to in the /\$/ block? My guess is that, if we are dealing with a scalar (though I fail to discern how we are), we deref the $ref var and store it directly in the hash, otherwise, we store the reference.
So, just looking for a little tale of what is going on in these three lines. Many thanks!
You have hit upon one of the most arcane parts of the Perl language, and I can best explain by referring you to Symbol Tables and Typeglobs from brian d foy's excellent Mastering Perl. Note also that there are further references to the relevant sections of Perl's own documentation at the bottom of the page, the most relevant of which is Typeglobs and Filehandles in perldata.
Essentially, the way perl symbol tables work is that every package has a "stash" -- a "symbol table hash" -- whose name is the same as the package but with a pair of trailing semicolons. So the stash for the default package main is called %main::. If you run this simple program
perl -E"say for keys %main::"
you will see all the familiar built-in identifiers.
The values for the stash elements are references to typeglobs, which again are hashes but have keys that correspond to the different data types, SCALAR, ARRAY, HASH, CODE etc. and values that are references to the data item with that type and identifier.
Suppose you define a scalar variable $xx, or more fully, $main:xx
our $xx = 99;
Now the stash for the main package is %main::, and the typeglob for all data items with the identifier xx is referenced by $main::{xx} so, because the sigil for typeglobs is a star * in the same way that scalar identifiers have a dollar $, we can dereference this as *{$main::{xx}}. To get the reference to the scalar variable that has the identifier xx, this typeglob can be indexed with the SCALAR string, giving *{$main::{xx}}{SCALAR}. Once more, this is a reference to the variable we're after, so to collect its value it needs dereferencing once again, and if you write
say ${*{$main::{xx}}{SCALAR}};
then you will see 99.
That may look a little complex when written in a single statement, but it is fairly stratighforward when split up. The code in your question has the variable $glob set to a reference to a typeglob, which corresponds to this with respect to $main::xx
my $type = 'SCALAR';
my $glob = $main::{xx};
my $ref = *$glob{$type};
now if we say $ref we get SCALAR(0x1d12d94) or similar, which is a reference to $main::xx as before, and printing $$ref will show 99 as expected.
The subsequent assignment to #vars is straightforward Perl, and I don't think you should have any problem understanding that once you get the principle that a packages symbol table is a stash of typglobs, or really just a hash of hashes.
The elements of the iteration are strings. Since we don't have a lexical variable at the top of the loop, the element variable is $_. And it retains that value throughout the loop. Only one of those strings has a literal dollar sign, so we're telling the difference between '$SCALAR' and the other cases.
So what it is doing is getting 3 slots out of a package-level typeglob (sometimes shortened, with a little ambiguity to "glob"). *g{SCALAR}, *g{ARRAY} and *g{HASH}. The glob stores a hash and an array as a reference, so we simply store the reference into the hash. But, the glob stores a scalar as a reference to a scalar, and so needs to be dereferenced, to be stored as just a scalar.
So if you had a glob *a and in your package you had:
our $a = 'boo';
our #a = ( 1, 2, 3 );
our %a = ( One => 1, Two => 2 );
The resulting hash would be:
{ '$a' => 'boo'
, '%a' => { One => 1, Two => 2 }
, '#a' => [ 1, 2, 3 ]
};
Meanwhile the glob can be thought to look like this:
a =>
{ SCALAR => \'boo'
, ARRAY => [ 1, 2, 3 ]
, HASH => { One => 1, Two => 2 }
, CODE => undef
, IO => undef
, GLOB => undef
};
So to specifically answer your question.
if (my $ref = *$glob{$type}) {
$vars{$sigil.$name} = /\$/ ? $$ref : $ref
}
If a slot is not used it is undef. Thus $ref is assigned either a reference or undef, which evaluates to true as a reference and false as undef. So if we have a reference, then store the value of that glob slot into the hash, taking the reference stored in the hash, if it is a "container type" but taking the value if it is a scalar. And it is stored with the key $sigil . $name in the %vars hash.

Adding multiple values to key in perl hash

I need to create multi-dimensional hash.
for example I have done:
$hash{gene} = $mrna;
if (exists ($exon)){
$hash{gene}{$mrna} = $exon;
}
if (exists ($cds)){
$hash{gene}{$mrna} = $cds;
}
where $gene, $mrna, $exon, $cds are unique ids.
But, my issue is that I want some properties of $gene and $mrna to be included in the hash.
for example:
$hash{$gene}{'start_loc'} = $start;
$hash{gene}{mrna}{'start_loc'} = $start;
etc. But, is that a feasible way of declaring a hash? If I call $hash{$gene} both $mrna and start_loc will be printed. What could be the solution?
How would I add multiple values for the same key $gene and $mrna being the keys in this case.
Any suggestions will be appreciated.
What you need to do is to read the Perl Reference Tutorial.
Simple answer to your question:
Perl hashes can only take a single value to a key. However, that single value can be a reference to a memory location of another hash.
my %hash1 = ( foo => "bar", fu => "bur" }; #First hash
my %hash2;
my $hash{some_key} = \%hash1; #Reference to %hash1
And, there's nothing stopping that first hash from containing a reference to another hash. It's turtles all the way down!.
So yes, you can have a complex and convoluted structure as you like with as many sub-hashes as you want. Or mix in some arrays too.
For various reasons, I prefer the -> syntax when using these complex structures. I find that for more complex structures, it makes it easier to read. However, the main this is it makes you remember these are references and not actual multidimensional structures.
For example:
$hash{gene}->{mrna}->{start_loc} = $start; #Quote not needed in string if key name qualifies as a valid variable name.
The best thing to do is to think of your hash as a structure. For example:
my $person_ref = {}; #Person is a hash reference.
my $person->{NAME}->{FIRST} = "Bob";
my $person->{NAME}->{LAST} = "Rogers";
my $person->{PHONE}->{WORK}->[0] = "555-1234"; An Array Ref. Might have > 1
my $person->{PHONE}->{WORK}->[1] = "555-4444";
my $person->{PHONE}->{CELL}->[0] = "555-4321";
...
my #people;
push #people, $person_ref;
Now, I can load up my #people array with all my people, or maybe use a hash:
my %person;
$person{$bobs_ssn} = $person; #Now, all of Bob's info is index by his SSN.
So, the first thing you need to do is to think of what your structure should look like. What are the fields in your structure? What are the sub-fields? Figure out what your structure should look like, and then setup your hash of hashes to look like that. Figure out exactly how it will be stored and keyed.
Remember, this hash contains references to your genes (or whatever), so you want to choose your keys wisely.
Read the tutorial. Then, try your hand at it. It's not all that complicated to understand. However, it can be a bear to maintain.
When you say use strict;, you give yourself some protection:
my $foo = "bar";
say $Foo; #This won't work!
This won't work because you didn't declare $Foo, you declared $foo. The use stict; can catch variable names that are mistyped, but:
my %var;
$var{foo} = "bar";
say $var{Foo}; #Whoops!
This will not be caught (except maybe that $var{Foo} has not been initialized. The use strict; pragma can't detect mistakes in typing in your keys.
The next step, after you've grown comfortable with references is to move onto object oriented Perl. There's a Tutorial for that too.
All Object Oriented Perl does is to take your hash references, and turns them into objects. Then, it creates subroutines that will help you keep track of manipulating objects. For example:
sub last_name {
my $person = shift; #Don't worry about this for now..
my $last_name = shift;
if ( exists $last_name ) {
my $person->{NAME}->{LAST} = $last_name;
}
return $person->{NAME}->{LAST};
}
When I set my last name using this subroutine ...I mean method, I guarantee that the key will be $person->{NAME}->{LAST} and not $person->{LAST}->{NAME} or $person->{LAST}->{NMAE}. or $person->{last}->{name}.
The main problem isn't learning the mechanisms, but learning to apply them. So, think about exactly how you want to represent your items. This about what fields you want, and how you're going to pull up that information.
You could try pushing each value onto a hash of arrays:
my (#gene, #mrna, #exon, #cds);
my %hash;
push #{ $hash{$gene[$_]} }, [$mrna[$_], $exon[$_], $cds[$_] ] for 0 .. $#gene;
This way gene is the key, with multiple values ($mrna, $exon, $cds) associated with it.
Iterate over keys/values as follows:
for my $key (sort keys %hash) {
print "Gene: $key\t";
for my $value (#{ $hash{$key} } ) {
my ($mrna, $exon, $cds) = #$value; # De-references the array
print "Values: [$mrna], [$exon], [$cds]\n";
}
}
The answer to a question I've asked previously might be of help (Can a hash key have multiple 'subvalues' in perl?).

Nicer way to test if hash entry exists before assigning it

I'm looking for a nicer way to first "test" if a hash key exists before using it. I'm currently writing a eventlog parser that decodes hex numbers into strings. As I cannot be sure that my decode table contains hex numbers I first need to check if the key exists in a hash before assigning the value to a new variable. So what I'm doing a lot is:
if ($MEL[$i]{type} eq '5024') {
$MEL[$i]{decoded_inline} = $decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"}
if exists ($decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"})
}
What I do not like is that the expression $decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"} is twice in my code. Is there a nicer or shorter version of the line above?
I doubt this qualifies as "nice", but I think it is achieving the goal of not referring to the expression twice. I'm not sure it's worth this pain, mind you:
my $foo = $decode_hash{checkpoint};
my $bar = $MEL[$i]{raw}[128];
if ($MEL[$i]{type} eq '5024') {
$MEL[$i]{decoded_inline} = $foo->{$bar}
if exists ( $foo->{$bar} );
}
Yes there is an easier way. You know that you can only store references in an array or hash, right? Well, there's a neat side effect to that. You can take references to deep hash or array slots and then treat them like scalar references. The unfortunate side-effect is that it autovivifies the slot, but if you're always going to assign to that slot, and just want to do some checking first, it's not a bad way to keep from typing things over and over--as well as repeatedly indexing the structures as well.
my $ref = \$decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"};
unless ( defined( $$ref )) {
...
$$ref = {};
...
}
As long as an existing hash element can't have an undefined value, I would write this
if ($MEL[$i]{type} eq '5024') {
my $value = $decode_hash{checkpoint}{$MEL[$i]{raw}[128]};
$MEL[$i]{decoded_inline} = $value if defined $value;
}
(Note that you shouldn't have the double-quotes around the hash key.)

Confusion about proper usage of dereference in Perl

I noticed the other day that - while altering values in a hash - that when you dereference a hash in Perl, you actually are making a copy of that hash. To confirm I wrote this quick little script:
#! perl
use warnings;
use strict;
my %h = ();
my $hRef = \%h;
my %h2 = %{$hRef};
my $h2Ref = \%h2;
if($hRef eq $h2Ref) {
print "\n\tThey're the same $hRef $h2Ref";
}
else {
print "\n\tThey're NOT the same $hRef $h2Ref";
}
print "\n\n";
The output:
They're NOT the same HASH(0x10ff6848) HASH(0x10fede18)
This leads me to realize that there could be spots in some of my scripts where they aren't behaving as expected. Why is it even like this in the first place? If you're passing or returning a hash, it would be more natural to assume that dereferencing the hash would allow me to alter the values of the hash being dereferenced. Instead I'm just making copies all over the place without any real need/reason to beyond making syntax a little more obvious.
I realize the fact that I hadn't even noticed this until now shows its probably not that big of a deal (in terms of the need to go fix in all of my scripts - but important going forward). I think its going to be pretty rare to see noticeable performance differences out of this, but that doesn't alter the fact that I'm still confused.
Is this by design in perl? Is there some explicit reason I don't know about for this; or is this just known and you - as the programmer - expected to know and write scripts accordingly?
The problem is that you are making a copy of the hash to work with in this line:
my %h2 = %{$hRef};
And that is understandable, since many posts here on SO use that idiom to make a local name for a hash, without explaining that it is actually making a copy.
In Perl, a hash is a plural value, just like an array. This means that in list context (such as you get when assigning to a hash) the aggregate is taken apart into a list of its contents. This list of pairs is then assembled into a new hash as shown.
What you want to do is work with the reference directly.
for (keys %$hRef) {...}
for (values %$href) {...}
my $x = $href->{some_key};
# or
my $x = $$href{some_key};
$$href{new_key} = 'new_value';
When working with a normal hash, you have the sigil which is either a % when talking about the entire hash, a $ when talking about a single element, and # when talking about a slice. Each of these sigils is then followed by an identifier.
%hash # whole hash
$hash{key} # element
#hash{qw(a b)} # slice
To work with a reference named $href simply replace the string hash in the above code with $href. In other words, $href is the complete name of the identifier:
%$href # whole hash
$$href{key} # element
#$href{qw(a b)} # slice
Each of these could be written in a more verbose form as:
%{$href}
${$href}{key}
#{$href}{qw(a b)}
Which is again a substitution of the string '$href' for 'hash' as the name of the identifier.
%{hash}
${hash}{key}
#{hash}{qw(a b)}
You can also use a dereferencing arrow when working with an element:
$hash->{key} # exactly the same as $$hash{key}
But I prefer the doubled sigil syntax since it is similar to the whole aggregate and slice syntax, as well as the normal non-reference syntax.
So to sum up, any time you write something like this:
my #array = #$array_ref;
my %hash = %$hash_ref;
You will be making a copy of the first level of each aggregate. When using the dereferencing syntax directly, you will be working on the actual values, and not a copy.
If you want a REAL local name for a hash, but want to work on the same hash, you can use the local keyword to create an alias.
sub some_sub {
my $hash_ref = shift;
our %hash; # declare a lexical name for the global %{__PACKAGE__::hash}
local *hash = \%$hash_ref;
# install the hash ref into the glob
# the `\%` bit ensures we have a hash ref
# use %hash here, all changes will be made to $hash_ref
} # local unwinds here, restoring the global to its previous value if any
That is the pure Perl way of aliasing. If you want to use a my variable to hold the alias, you can use the module Data::Alias
You are confusing the actions of dereferencing, which does not inherently create a copy, and using a hash in list context and assigning that list, which does. $hashref->{'a'} is a dereference, but most certainly does affect the original hash. This is true for $#$arrayref or values(%$hashref) also.
Without the assignment, just the list context %$hashref is a mixed beast; the resulting list contains copies of the hash keys but aliases to the actual hash values. You can see this in action:
$ perl -wle'$x={"a".."f"}; for (%$x) { $_=chr(ord($_)+10) }; print %$x'
epcnal
vs.
$ perl -wle'$x={"a".."f"}; %y=%$x; for (%y) { $_=chr(ord($_)+10) }; print %$x; print %y'
efcdab
epcnal
but %$hashref isn't acting any differently than %hash here.
No, dereferencing does not create a copy of the referent. It's my that creates a new variable.
$ perl -E'
my %h1; my $h1 = \%h1;
my %h2; my $h2 = \%h2;
say $h1;
say $h2;
say $h1 == $h2 ?1:0;
'
HASH(0x83b62e0)
HASH(0x83b6340)
0
$ perl -E'
my %h;
my $h1 = \%h;
my $h2 = \%h;
say $h1;
say $h2;
say $h1 == $h2 ?1:0;
'
HASH(0x9eae2d8)
HASH(0x9eae2d8)
1
No, $#{$someArrayHashRef} does not create a new array.
If perl did what you suggest, then variables would get aliased very easily, which would be far more confusing. As it is, you can alias variables with globbing, but you need to do so explicitly.

Should I use $_[0] or copy the argument list in Perl?

If I pass a hash to a sub:
parse(\%data);
Should I use a variable to $_[0] first or is it okay to keep accessing $_[0] whenever I want to get an element from the hash? clarification:
sub parse
{ $var1 = $_[0]->{'elem1'};
$var2 = $_[0]->{'elem2'};
$var3 = $_[0]->{'elem3'};
$var4 = $_[0]->{'elem4'};
$var5 = $_[0]->{'elem5'};
}
# Versus
sub parse
{ my $hr = $_[0];
$var1 = $hr->{'elem1'};
$var2 = $hr->{'elem2'};
$var3 = $hr->{'elem3'};
$var4 = $hr->{'elem4'};
$var5 = $hr->{'elem5'};
}
Is the second version more correct since it doesn't have to keep accessing the argument array, or does Perl end up interpereting them the same way anyhow?
In this case there is no difference because you are passing reference to hash. But in case of passing scalar there will be difference:
sub rtrim {
## remove tailing spaces from first argument
$_[0] =~ s/\s+$//;
}
rtrim($str); ## value of the variable will be changed
sub rtrim_bugged {
my $str = $_[0]; ## this makes a copy of variable
$str =~ s/\s+$//;
}
rtrim($str); ## value of the variable will stay the same
If you're passing hash reference, then only copy of reference is created. But the hash itself will be the same. So if you care about code readability then I suggest you to create a variable for all your parameters. For example:
sub parse {
## you can easily add new parameters to this function
my ($hr) = #_;
my $var1 = $hr->{'elem1'};
my $var2 = $hr->{'elem2'};
my $var3 = $hr->{'elem3'};
my $var4 = $hr->{'elem4'};
my $var5 = $hr->{'elem5'};
}
Also more descriptive variable names will improve your code too.
For a general discussion of the efficiency of shift vs accessing #_ directly, see:
Is there a difference between Perl's shift versus assignment from #_ for subroutine parameters?
Is 'shift' evil for processing Perl subroutine parameters?
As for your specific code, I'd use shift, but simplify the data extraction with a hash slice:
sub parse
{
my $hr = shift;
my ($var1, $var2, $var3, $var4, $var5) = #{$hr}{qw(elem1 elem2 elem3 elem4 elem5)};
}
I'll assume that this method does something else with these variables that makes it worthwhile to keep them in separate variables (perhaps the hash is read-only, and you need to make some modifications before inserting them into some other data?) -- otherwise why not just leave them in the hashref where they started?
You are micro-optimizing; try to avoid that. Go with whatever is most readable/maintainable. Usually this would be the one where you use a lexical variable, since its name indicates its purpose...but if you use a name like $data or $x this obviously doesn't apply.
In terms of the technical details, for most purposes you can estimate the time taken by counting the number of basic ops perl will use. For your $_[0], an element lookup in a non-lexical array variable takes multiple ops: one to get the glob, one to get the array part of the glob, one or more to get the index (just one for a constant), and one to look up the element. $hr, on the other hand is a single op. To cater to direct users of #_, there's an optimization that reduces the ops for $_[0] to a single combined op (when the index is between 0 and 255 inclusive), but it isn't used in your case because the hash-deref context requires an additional flag on the array element lookup (to support autovivification) and that flag isn't supported by the optimized op.
In summary, using a lexical is going to be both more readable and (if you using it more than once) imperceptibly faster.
My rule is that I try not to use $_[0] in subroutines that are longer than a couple of statements. After that everything gets a user-defined variable.
Why are you copying all of the hash values into variables? Just leave them in the hash where they belong. That's a much better optimization than the one you are thinking about.
Its the same although the second is more clear
Since they work, both are fine, the common practice is to shift off parameters.
sub parse { my $hr = shift; my $var1 = $hr->{'elem1'}; }