Nicer way to test if hash entry exists before assigning it - perl

I'm looking for a nicer way to first test whether a hash key exists before using it. I'm currently writing an event log parser that decodes hex numbers into strings. As I cannot be sure that my decode table contains every hex number, I first need to check that the key exists in the hash before assigning the value to a new variable. So what I'm doing a lot is:
if ($MEL[$i]{type} eq '5024') {
$MEL[$i]{decoded_inline} = $decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"}
if exists ($decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"})
}
What I do not like is that the expression $decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"} appears twice in my code. Is there a nicer or shorter version of the lines above?

I doubt this qualifies as "nice", but I think it is achieving the goal of not referring to the expression twice. I'm not sure it's worth this pain, mind you:
my $foo = $decode_hash{checkpoint};
my $bar = $MEL[$i]{raw}[128];
if ($MEL[$i]{type} eq '5024') {
$MEL[$i]{decoded_inline} = $foo->{$bar}
if exists ( $foo->{$bar} );
}

Yes, there is an easier way. You know that an array or hash can only store scalars, and that nested structures are held as references, right? Well, there's a neat side effect to that: you can take a reference to a deep hash or array slot and then treat it like any scalar reference. The unfortunate side effect is that taking the reference autovivifies the slot, but if you're always going to assign to that slot and just want to do some checking first, it's not a bad way to avoid typing the same expression over and over, and to avoid indexing the structures repeatedly.
my $ref = \$decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"};
unless ( defined( $$ref )) {
...
$$ref = {};
...
}
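The autovivification caveat can be seen in a small sketch. The %table hash and its keys are made up for illustration; the point is only that taking a reference to a missing slot creates the key, so this approach fits cases where you will assign to the slot anyway.

```perl
use strict;
use warnings;

# Hypothetical lookup table, for illustration only.
my %table = ( checkpoint => { '0a' => 'start' } );

# Taking a reference to a missing slot creates the key with an undef value.
my $ref     = \$table{checkpoint}{ff};
my $created = exists $table{checkpoint}{ff} ? 1 : 0;

# Assigning through the reference fills the slot without repeating the path.
$$ref = 'unknown' unless defined $$ref;

print "created=$created value=$table{checkpoint}{ff}\n";
```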

As long as an existing hash element can't have an undefined value, I would write this
if ($MEL[$i]{type} eq '5024') {
my $value = $decode_hash{checkpoint}{$MEL[$i]{raw}[128]};
$MEL[$i]{decoded_inline} = $value if defined $value;
}
(Note that you shouldn't have the double-quotes around the hash key.)

Related

Adding multiple values to key in perl hash

I need to create multi-dimensional hash.
for example I have done:
$hash{gene} = $mrna;
if (exists ($exon)){
$hash{gene}{$mrna} = $exon;
}
if (exists ($cds)){
$hash{gene}{$mrna} = $cds;
}
where $gene, $mrna, $exon, $cds are unique ids.
But, my issue is that I want some properties of $gene and $mrna to be included in the hash.
for example:
$hash{$gene}{'start_loc'} = $start;
$hash{gene}{mrna}{'start_loc'} = $start;
etc. But, is that a feasible way of declaring a hash? If I call $hash{$gene} both $mrna and start_loc will be printed. What could be the solution?
How would I add multiple values for the same key, with $gene and $mrna being the keys in this case?
Any suggestions will be appreciated.
What you need to do is to read the Perl Reference Tutorial.
Simple answer to your question:
Perl hashes can only take a single value to a key. However, that single value can be a reference to a memory location of another hash.
my %hash1 = ( foo => "bar", fu => "bur" ); #First hash
my %hash2;
$hash2{some_key} = \%hash1; #Reference to %hash1
And there's nothing stopping that first hash from containing a reference to another hash. It's turtles all the way down!
So yes, you can have a complex and convoluted structure as you like with as many sub-hashes as you want. Or mix in some arrays too.
For various reasons, I prefer the -> syntax when using these complex structures. I find that for more complex structures, it makes the code easier to read. However, the main thing is that it makes you remember these are references and not actual multidimensional structures.
For example:
$hash{gene}->{mrna}->{start_loc} = $start; #Quote not needed in string if key name qualifies as a valid variable name.
The best thing to do is to think of your hash as a structure. For example:
my $person = {}; #Person is a hash reference.
$person->{NAME}->{FIRST} = "Bob";
$person->{NAME}->{LAST} = "Rogers";
$person->{PHONE}->{WORK}->[0] = "555-1234"; #An array ref. Might have > 1
$person->{PHONE}->{WORK}->[1] = "555-4444";
$person->{PHONE}->{CELL}->[0] = "555-4321";
...
my @people;
push @people, $person;
Now, I can load up my @people array with all my people, or maybe use a hash:
my %person;
$person{$bobs_ssn} = $person; #Now, all of Bob's info is index by his SSN.
So, the first thing you need to do is to think of what your structure should look like. What are the fields in your structure? What are the sub-fields? Figure out what your structure should look like, and then set up your hash of hashes to look like that. Figure out exactly how it will be stored and keyed.
Remember, this hash contains references to your genes (or whatever), so you want to choose your keys wisely.
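As a minimal sketch of that design step applied to the gene question, properties and children can be kept under separate keys so they never collide. The IDs, coordinates, and the key names mrna, exons and cds are assumptions made for illustration, not something the question specifies.

```perl
use strict;
use warnings;

# Hypothetical IDs and coordinates, for illustration only.
my %hash;
my ( $gene, $mrna ) = ( 'gene1', 'mrna1' );

$hash{$gene}{start_loc}              = 100;    # property of the gene
$hash{$gene}{mrna}{$mrna}{start_loc} = 120;    # property of one mRNA

# Several values under one key: store them in an array reference.
push @{ $hash{$gene}{mrna}{$mrna}{exons} }, 'exon1', 'exon2';
push @{ $hash{$gene}{mrna}{$mrna}{cds} },   'cds1';

my @mrna_ids   = sort keys %{ $hash{$gene}{mrna} };
my $exon_count = scalar @{ $hash{$gene}{mrna}{$mrna}{exons} };
print "mRNAs: @mrna_ids, exons: $exon_count\n";
```

With this layout, keys %{ $hash{$gene}{mrna} } lists only mRNA IDs, while start_loc stays a plain property of the gene.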
Read the tutorial. Then, try your hand at it. It's not all that complicated to understand. However, it can be a bear to maintain.
When you say use strict;, you give yourself some protection:
my $foo = "bar";
say $Foo; #This won't work!
This won't work because you didn't declare $Foo, you declared $foo. The use strict; pragma can catch variable names that are mistyped, but:
my %var;
$var{foo} = "bar";
say $var{Foo}; #Whoops!
This will not be caught (except perhaps by a warning that $var{Foo} is uninitialized). The use strict; pragma can't detect mistakes in typing your keys.
The next step, after you've grown comfortable with references is to move onto object oriented Perl. There's a Tutorial for that too.
All Object Oriented Perl does is take your hash references and turn them into objects. Then, it creates subroutines that help you keep track of manipulating those objects. For example:
sub last_name {
my $person = shift; #Don't worry about this for now..
my $last_name = shift;
if ( defined $last_name ) { #exists only works on hash or array elements
$person->{NAME}->{LAST} = $last_name;
}
return $person->{NAME}->{LAST};
}
When I set my last name using this subroutine (I mean method), I guarantee that the key will be $person->{NAME}->{LAST} and not $person->{LAST}->{NAME}, $person->{LAST}->{NMAE}, or $person->{last}->{name}.
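To show how such a method fits into a class, here is a minimal sketch. The package name Person and the constructor new are assumptions for illustration; the accessor mirrors the last_name idea above.

```perl
use strict;
use warnings;

package Person;    # hypothetical class name

# Constructor: bless a hash reference so methods can be called on it.
sub new {
    my ($class) = @_;
    return bless { NAME => {} }, $class;
}

# Combined getter/setter: the key path is spelled out in exactly one place.
sub last_name {
    my ( $self, $last_name ) = @_;
    $self->{NAME}{LAST} = $last_name if defined $last_name;
    return $self->{NAME}{LAST};
}

package main;

my $bob = Person->new;
$bob->last_name('Rogers');
print $bob->last_name, "\n";
```

Callers never touch the keys directly, so a mistyped key can only occur inside the method itself.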
The main problem isn't learning the mechanisms, but learning to apply them. So, think about exactly how you want to represent your items. Think about what fields you want, and how you're going to pull up that information.
You could try pushing each value onto a hash of arrays:
my (@gene, @mrna, @exon, @cds);
my %hash;
push @{ $hash{$gene[$_]} }, [$mrna[$_], $exon[$_], $cds[$_] ] for 0 .. $#gene;
This way gene is the key, with multiple values ($mrna, $exon, $cds) associated with it.
Iterate over keys/values as follows:
for my $key (sort keys %hash) {
print "Gene: $key\t";
for my $value (@{ $hash{$key} } ) {
my ($mrna, $exon, $cds) = @$value; # De-references the array
print "Values: [$mrna], [$exon], [$cds]\n";
}
}
The answer to a question I've asked previously might be of help (Can a hash key have multiple 'subvalues' in perl?).

Perl hash slice from hash return from function

Purely academic question, and I don't see instructions banning them here (although there is no 'academic'-like tag I could find).
If I have an existing hash like the following, I can take a slice(?) of it as shown:
my %hash = (one=>1, two=>2, three=>3, four=>4);
my ($two, $four) = @hash{'two','four'};
Is there a way to do this if the hash is returned from an example function like this?
sub get_number_text
{
my %hash = (one=>1, two=>2, three=>3, four=>4);
return %hash;
}
One way that works is:
my ($two, $four) = @{ { get_number_text() } }{'two', 'four'};
As I understand it, the function returns a list of hash keys/values, the inner {} creates an anonymous hash/ref, and @{} uses the reference to "cast" it to a list aka a hash slice since Perl knows the ref is a hash. (I was a little surprised that the last bit worked, but more power to Perl, etc.)
But is that the clearest way to write that admittedly strange access in one expression?
In general, avoid returning a flattened hash (return %foo) from a subroutine; it makes it harder to work with without copying it into another hash. Better to return a hash reference (return \%foo).
But yes, that is the clearest way. Though often lists of hardcoded keys are given using qw:
my ($two, $four) = @{ { get_number_text() } }{ qw/two four/ };
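For comparison, here is a minimal sketch of the hash reference approach recommended above. The name get_number_text_ref is a hypothetical variant of the function in the question; once a reference is returned, the slice needs no anonymous hash.

```perl
use strict;
use warnings;

# Same data as in the question, but returned as a hash reference.
sub get_number_text_ref {    # hypothetical variant of get_number_text
    my %hash = ( one => 1, two => 2, three => 3, four => 4 );
    return \%hash;
}

# Slice straight through the returned reference.
my ( $two, $four ) = @{ get_number_text_ref() }{qw/two four/};
print "$two $four\n";
```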

Which data structure should I use for a hash without values?

I need to check if a scalar exists in a set of scalars. What is the best way of storing this set of scalars?
Walking through an array would yield linear check time. The check time for a hash would be constant, but it feels inefficient since I wouldn't be using the value part of the hash.
Use a hash, but don't use the values. There really isn't a better way.
The memory overhead for using a hash to test for set membership is minimal, and greatly outweighs the cost of repeated sequential searches through an array. There are many ways to make a set membership style hash:
my %set = map {$_ => 1} ...;
my %set; $set{$_}++ for ...;
my %set; @set{...} = (1) x num_of_items;
Each of these allows you to use the hash lookup directly in a conditional without any additional syntax.
If your hash is going to be huge, and you are worried about the memory usage, you can store undef as the value for each key. But in that case you will have to use exists $set{...} in your conditionals.
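A minimal sketch of the undef-valued variant follows. The item names are made up; assigning an empty list to a hash slice creates every key with an undef value, so membership has to be tested with exists.

```perl
use strict;
use warnings;

my @items = qw(apple pear plum);    # hypothetical set members

# Assigning an empty list to a hash slice creates each key with an undef value.
my %set;
@set{@items} = ();

my $has_pear = exists $set{pear} ? 1 : 0;
my $has_kiwi = exists $set{kiwi} ? 1 : 0;
print "pear=$has_pear kiwi=$has_kiwi\n";
```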
A hash should do fine. You could use undef for the value and use exists($h{$k}) or you could use 1 and use $h{$k}.
Judy::HS should be a bit more efficient, but there's no value-less version of that structure either.
You may find this section of the FAQ useful:
How can I tell whether a certain element is contained in a list or array?
Searching through an array could be done like this:
my @arr = ( $list, $of, $scalars );
push @arr, $any, $other, $ones;
It's expensive to look through, but not that expensive unless you have a massive list:
grep { $_ eq $what_youre_looking_for } @arr;
The hash method also works:
my %hash = ( $list => 1, $of => 1, $scalars => 1 );
$hash{$another} = 1;
if ( exists $hash{$what_youre_looking_for} ) {
...
}
You could implement a binary search and a list sorter, but those are the two most used methods.
HashTable is the best option.
Note: since you said it is a set, I assume there are no duplicate elements.

How can I use hashes as arguments to subroutines in Perl?

I was asked to modify some existing code to add some additional functionality. I have searched on Google and cannot seem to find the answer. I have something to this effect...
%first_hash = gen_first_hash();
%second_hash = gen_second_hash();
do_stuff_with_hashes(%first_hash, %second_hash);
sub do_stuff_with_hashes
{
my %first_hash = shift;
my %second_hash = shift;
# do stuff with the hashes
}
I am getting the following errors:
Odd number of elements in hash assignment at ./gen.pl line 85.
Odd number of elements in hash assignment at ./gen.pl line 86.
Use of uninitialized value in concatenation (.) or string at ./gen.pl line 124.
Use of uninitialized value in concatenation (.) or string at ./gen.pl line 143.
Line 85 and 86 are the first two lines in the sub routine and 124 and 143 are where I am accessing the hashes. When I look up those errors it seems to suggest that my hashes are uninitialized. However, I can verify that the hashes have values. Why am I getting these errors?
The hashes are being collapsed into flat lists when you pass them into the function. So, when you shift off a value from the function's arguments, you're only getting one value. What you want to do is pass the hashes by reference.
do_stuff_with_hashes(\%first_hash, \%second_hash);
But then you have to work with the hashes as references.
my $first_hash = shift;
my $second_hash = shift;
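Putting the two pieces together, a minimal sketch might look like this. The hash contents and the summing body are placeholders, since the question does not show what gen_first_hash(), gen_second_hash() or the subroutine actually do.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the hashes built by gen_first_hash() and gen_second_hash().
my %first_hash  = ( a => 1, b => 2 );
my %second_hash = ( c => 3 );

sub do_stuff_with_hashes {
    my ( $first, $second ) = @_;    # two hash references
    my $sum = 0;
    $sum += $first->{$_}  for keys %$first;     # access elements with ->
    $sum += $second->{$_} for keys %$second;
    return $sum;
}

my $total = do_stuff_with_hashes( \%first_hash, \%second_hash );
print "$total\n";
```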
A bit late, but:
As has been stated, you must pass references, not hashes.
do_stuff_with_hashes(\%first_hash, \%second_hash);
But if you need/want to use your hashes as plain hashes, you may dereference them immediately.
sub do_stuff_with_hashes {
my %first_hash = %{shift()};
my %second_hash = %{shift()};
};
Hash references are the way to go, as the others have pointed out.
Providing another way to do this just for kicks...because who needs temp variables?
do_stuff_with_hashes( { gen_first_hash() }, { gen_second_hash() } );
Here you are just creating hash references on the fly (via the curly brackets around the function calls) to use in your do_stuff_with_hashes function. This is nothing special, the other methods are just as valid and probably more clear. This might help down the road if you see this activity in your travels as someone new to Perl.
First off,
do_stuff_with_hashes(%first_hash, %second_hash);
"streams" the hashes into a list, equivalent to:
( 'key1_1', 'value1_1', ... , 'key1_n', 'value1_n', 'key2_1', 'value2_1', ... )
and then you select one and only one of those items. So,
my %first_hash = shift;
is like saying:
my %first_hash = 'key1_1';
# leaving ( 'value1_1', ... , 'key1_n', 'value1_n', 'key2_1', 'value2_1', ... )
You cannot have a hash like { 'key1' }, since 'key1' is mapping to nothing.
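The flattening can be checked by counting the arguments a subroutine receives. This is a small sketch with made-up hash contents and a hypothetical helper named count_args.

```perl
use strict;
use warnings;

my %first_hash  = ( a => 1, b => 2 );
my %second_hash = ( c => 3 );

# Returns how many separate arguments the subroutine actually received.
sub count_args { return scalar @_ }

my $flattened  = count_args( %first_hash, %second_hash );      # every key and value
my $referenced = count_args( \%first_hash, \%second_hash );    # two references
print "$flattened $referenced\n";
```

Passing the hashes directly delivers one long list with no boundary between them, while passing references delivers exactly two arguments.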
A solution that passes references and unpacks @_ in a single assignment is faster, because it does not copy the hash data in memory, and it lets you modify the caller's hash in the subroutine. See my example:
sub do_stuff_with_hashes($$$) {
my ($str,$refHash1,$refHash2)=@_;
foreach (keys %{$refHash1}) { print $_.' ' }
$$refHash1{'new'}++;
}
my (%first_hash, %second_hash);
$first_hash{'first'}++;
do_stuff_with_hashes('any_parameter', \%first_hash, \%second_hash);
print "\n---\n", $first_hash{'new'};

Should Perl hashes always contain values?

I had an earlier question that received the following response from the noted Perl expert, Perl author and Perl trainer brian d foy:
[If] you're looking for a fixed sequence of characters at the end of each filename. You want to know if that fixed sequence is in a list of sequences that interest you. Store all the extensions in a hash and look in that hash:
my( $extension ) = $filename =~ m/\.([^.]+)$/;
if( exists $hash{$extension} ) { ... }
You don't need to build up a regular expression, and you don't need to go through several possible regex alternations to check every extension you have to examine.
Thanks for the advice brian.
What I now want to know is what is the best practice in a case like the above. Should one only define the keys, which is all I need to achieve what's described above, or should one always define a value as well?
It's usually preferable to set a defined value for every key. The idiomatic value (when you don't care about the value) is 1.
my %hash = map { $_ => 1 } @array;
Doing it this way makes the code that uses the hash slightly simpler because you can use $hash{key} as a Boolean value. If the value can be undefined you need to use the more verbose exists $hash{key} instead.
That said, there are situations where a value of undef is desirable. For example: imagine that you're parsing C header files to extract preprocessor symbols. It would be logical to store these in a hash of name => value pairs.
#define FOO 1
#define BAR
In Perl, this would map to:
my %symbols = ( FOO => 1, BAR => undef);
In C a #define defines a symbol, not a value -- "defined" in C is mapped to "exists" in Perl.
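The difference between a key that exists with an undef value and a key that is absent can be seen in a short sketch. BAZ is a made-up name that is deliberately not in the hash.

```perl
use strict;
use warnings;

my %symbols = ( FOO => 1, BAR => undef );

# BAR is present but has no value; BAZ is not present at all.
for my $name (qw(FOO BAR BAZ)) {
    my $exists  = exists $symbols{$name}  ? 'yes' : 'no';
    my $defined = defined $symbols{$name} ? 'yes' : 'no';
    print "$name: exists=$exists defined=$defined\n";
}
```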
You can't create a hash key without a value. The value can be undef, but it will be there. How else would you construct a hash? Or was your question about whether the value can be undef? In that case I would say that the value you store there (undef, 1, 0...) is entirely up to you. If a lot of folks are using it, then you probably want to store some true value, in case someone else uses if ($hash{$extension}) {...} instead of exists because they weren't paying attention.
undef is a value.
Of course, stuff like that always depends on what you are currently doing. But $foo{bar} is just a variable like $bar, and I don't see any reason why either one should not be undef every now and then.
PS:
That's why exists exists.
As others have said, the idiomatic solution for a hashset (a hash that only contains keys, not values) is to use 1 as the value because this makes the testing for existence easy. However, there is something to be said for using undef as the value. It will force the users to test for existence with exists which is marginally faster. Of course, you could test for existence with exists even when the value is 1 and avoid the inevitable bugs from users who forget to use exists.
Using undef as a value in hash is more memory efficient than storing 1.
Storing '1' in a Set-hash Considered Harmful
I know using Considered Harmful is considered harmful, but this is bad, almost as bad as unrestrained goto usage.
Ok, I've harped on this in a few comments, but I think I need a full response to demonstrate the issue.
Let's say we have a daemon process that provides back-end inventory control for a shop that sells widgets.
my @items = qw(
widget
thingy
whozit
whatsit
);
my @items_in_stock = qw(
widget
thingy
);
my %in_stock;
@in_stock{@items_in_stock} = (1) x @items_in_stock; #initialize all keys to 1
sub Process_Request {
my $request = shift;
if( $request eq REORDER ) {
Reorder_Items(\@items, \%in_stock);
}
else {
Error_Response( ILLEGAL_REQUEST );
}
}
sub Reorder_Items{
my $items = shift;
my $in_stock = shift;
# Order items we do not have in-stock.
for my $item ( @$items ) {
Reorder_Item( $item )
if not exists $in_stock->{$item};
}
}
The tool is great, it automatically keeps items in stock. Very nice. Now, the boss asks for automatically generated catalogs of in-stock items. So we modify Process_Request() and add catalog generation.
sub Process_Request {
my $request = shift;
if( $request eq REORDER ) {
Reorder_Items(\@items, \%in_stock);
}
elsif( $request eq CATALOG ) {
Build_Catalog(\@items, \%in_stock);
}
else {
Error_Response( ILLEGAL_REQUEST );
}
}
sub Build_Catalog {
my $items = shift;
my $in_stock = shift;
my $catalog_response = '';
foreach my $item ( @$items ) {
$catalog_response .= Catalog_Item($item)
if $in_stock->{$item};
}
return $catalog_response;
}
In testing, Build_Catalog() works fine. Hooray, we go live with the app.
Oops. For some reason nothing is being ordered, the company runs out of stock of everything.
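The intended failure: if Build_Catalog() adds keys to %in_stock, Reorder_Items() then sees everything as in stock and never places an order. Note that a plain read such as $in_stock->{$item} does not create the key by itself; keys appear only when the element is used in a way that autovivifies it, such as taking a reference to it or dereferencing through it as a nested structure.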
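Whether a check adds keys depends on how the element is accessed. The following is a minimal sketch, reusing the %in_stock name with made-up contents (the count key is hypothetical): a plain read leaves the hash untouched, while reading through a missing key as if it held a nested hash creates that key.

```perl
use strict;
use warnings;

my %in_stock = ( widget => 1 );

# Plain truth test on a missing key: nothing is created.
my $seen       = $in_stock{thingy} ? 1 : 0;
my $after_read = exists $in_stock{thingy} ? 1 : 0;

# Reading through a missing key as a nested hash autovivifies the outer key.
my $deep       = $in_stock{whatsit}{count};
my $after_deep = exists $in_stock{whatsit} ? 1 : 0;

print "after_read=$after_read after_deep=$after_deep\n";
```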
Using Hash::Util's lock_hash can help prevent accidental hash modification. If we locked %in_stock before calling Build_Catalog() we would have gotten a fatal error and would never have gone live with the bug.
In summary, it is best to test existence of keys rather than truth of your set-hash values. If you are using existence as a signifier, don't set your values to '1' because that will mask bugs and make them harder to track. Using lock_hash can help catch these problems.
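As a minimal sketch of the lock_hash suggestion, using made-up stock data: once the hash is locked, exists still works on a missing key, but a truth test on that key is an access to a disallowed key and dies, which is what would expose a value-based check during testing.

```perl
use strict;
use warnings;
use Hash::Util qw(lock_hash);

my %in_stock = ( widget => 1, thingy => 1 );
lock_hash(%in_stock);

# exists is still allowed on a locked hash and simply reports absence.
my $exists_result = exists $in_stock{whozit} ? 1 : 0;

# Reading a key that is not in the locked hash raises a fatal error.
my $died = eval { my $value = $in_stock{whozit}; 1 } ? 0 : 1;

print "exists=$exists_result died=$died\n";
```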
If you must check for the truth of the values, do so in every case.