Declare and populate a hash table in one step in Perl - perl

Currently when I want to build a look-up table I use:
my $has_field = {};
map { $has_field->{$_} = 1 } @fields;
Is there a way I can do inline initialization in a single step? (i.e. populate it at the same time I'm declaring it?)

Just use your map to create a list then drop into a hash reference like:
my $has_field = { map { $_ => 1 } @fields };

Update: sorry, this doesn't do what you want exactly, as you still have to declare $has_field first.
You could use a hash slice:
@{$has_field}{@fields} = (1) x @fields;
The right-hand side uses the x operator to repeat the one-element list (1) once per element of @fields (in scalar context, @fields gives the number of elements in your array). Another option in the same vein:
@{$has_field}{@fields} = map {1} @fields;
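
As a minimal sketch of the two approaches above, assuming a made-up @fields list (the field names and the $by_slice variable are illustrative only), the map form declares and populates in one statement, while the slice form needs the reference declared first:

```perl
use strict;
use warnings;

my @fields = qw(name email phone);    # hypothetical field list

# One step: map builds the key/value list inside the anonymous hash.
my $has_field = { map { $_ => 1 } @fields };

# Hash-slice form: the reference has to exist before the slice assignment.
my $by_slice = {};
@{$by_slice}{@fields} = (1) x @fields;

print join(',', sort keys %$has_field), "\n";
print join(',', sort keys %$by_slice), "\n";
```

Both hash references end up with the same keys, each mapped to 1.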

Where I've tested it, smart match can be 2 to 5 times as fast as creating a lookup hash and testing for the value once. So unless you're going to reuse the hash a good number of times, it's best to do a smart match:
if ( $cand_field ~~ \@fields ) {
    do_with_field( $cand_field );
}
It's a good thing to remember that since 5.10, Perl has had a native way to ask "is this untested value any of these known values": smart match.


Comparison to an array of a value [duplicate]

I'm still feeling my way through Perl, so there's probably a simple way of doing this, but I can't find it. I want to compare a single value, say A or E, to an array that may or may not contain that value, e.g. A B C D, and then perform an action if they match. How should I set this up?
Thanks.
You filter each element of the array to see if it is the element you are looking for and then use the resulting array as a boolean value (not empty = true, empty = false):
@filtered_array = grep { $_ eq 'A' } @array;
if (@filtered_array) {
    print "found it!\n";
}
If you store the list in an array, then the only way is to examine each element individually in a loop, using grep, for, or any from List::MoreUtils. (grep is the worst of these, as it searches the entire array even if a match has been found early on.) This is fine if the array is small, but you will hit performance problems if the array is of significant size and you have to check it frequently.
You can speed things up by representing the same list in a hash, when a check for membership is just a single key lookup.
Alternatively, if the list is enormous, then it is best kept in a database, using SQLite.
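
A short sketch of the short-circuiting search mentioned above. It uses any from List::Util rather than List::MoreUtils, on the assumption that List::Util 1.33 or later is installed; the sample data comes from the question:

```perl
use strict;
use warnings;
use List::Util qw(any);    # assumes List::Util 1.33 or later

my @array = qw(A B C D);

# any stops at the first element for which the block returns true.
my $found_a = (any { $_ eq 'A' } @array) ? 1 : 0;
my $found_e = (any { $_ eq 'E' } @array) ? 1 : 0;

print "A: $found_a, E: $found_e\n";
```

Unlike grep, this does not keep scanning after a match, which matters mostly when the array is large and matches tend to appear early.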
Are you stuck on arrays?
Whenever you're talking about quickly looking up data in Perl, you should think in terms of hashes. A hash is a collection of data like an array, but it is keyed, and looking up the key is a very fast operation in Perl.
There's nothing that says the keys to your hash can't be your data, and it is very common in Perl to index an array with a hash in order to quickly search for values.
This turns your array @array into a hash called %array_index.
use strict;
use warnings;
use feature qw(say);
use autodie;
my @array = qw(Alpha Beta Delta Gamma Ohm);
my %array_index;
for my $entry ( @array ) {
    $array_index{$entry} = 1; # Doesn't matter. As long as it isn't blank or zero
}
Now, looking up whether or not your data is in your array is very quick. Just simply see if it's a key in your %array_index:
my $item = "Delta"; # Is this in my initial array?
if ( $array_index{$item} ) {
say "Yes! Item '$item' is in my array.";
}
else {
say "No. Item '$item' isn't in my array. David sad.";
}
This is so common, that you'll see a lot of programs that use the map command to index the array. Instead of that for loop, I could have done this:
my %array_index = ( map { $_ => 1 } @array );
or
my %array_index;
map { $array_index{$_} = 1 } @array;
You'll see both. The first one is a one-liner. The map command takes each entry in the array and puts it in $_. Then it returns the results as a list. Thus, the map will return a list with your data in the even positions (0, 2, 4, 6...) and a 1 in the odd positions (1, 3, 5...).
The second one is more literal and easier to understand (or about as easy to understand as a map command gets). Again, each item in your @array is being assigned to $_, and that is being used as the key in my %array_index hash.
Whether or not you want to use hashes depends upon the length of your array and how many items of input you'll be searching for. If you're simply checking whether a single item is in your array, I'd probably use List::Util or List::MoreUtils, or use a for loop to search each value of my array. If I am doing this for multiple values, I am better off with a hash.

Adding multiple values to key in perl hash

I need to create multi-dimensional hash.
for example I have done:
$hash{gene} = $mrna;
if (exists ($exon)){
$hash{gene}{$mrna} = $exon;
}
if (exists ($cds)){
$hash{gene}{$mrna} = $cds;
}
where $gene, $mrna, $exon, $cds are unique ids.
But, my issue is that I want some properties of $gene and $mrna to be included in the hash.
for example:
$hash{$gene}{'start_loc'} = $start;
$hash{gene}{mrna}{'start_loc'} = $start;
etc. But, is that a feasible way of declaring a hash? If I call $hash{$gene} both $mrna and start_loc will be printed. What could be the solution?
How would I add multiple values for the same key, with $gene and $mrna being the keys in this case?
Any suggestions will be appreciated.
What you need to do is to read the Perl Reference Tutorial.
Simple answer to your question:
Perl hashes can only take a single value to a key. However, that single value can be a reference to a memory location of another hash.
my %hash1 = ( foo => "bar", fu => "bur" ); #First hash
my %hash2;
$hash2{some_key} = \%hash1; #Reference to %hash1
And, there's nothing stopping that first hash from containing a reference to another hash. It's turtles all the way down!
So yes, you can have a complex and convoluted structure as you like with as many sub-hashes as you want. Or mix in some arrays too.
For various reasons, I prefer the -> syntax when using these complex structures. I find that for more complex structures, it makes the code easier to read. However, the main thing is that it makes you remember these are references and not actual multidimensional structures.
For example:
$hash{gene}->{mrna}->{start_loc} = $start; #Quote not needed in string if key name qualifies as a valid variable name.
The best thing to do is to think of your hash as a structure. For example:
my $person_ref = {}; #Person is a hash reference.
$person_ref->{NAME}->{FIRST} = "Bob";
$person_ref->{NAME}->{LAST} = "Rogers";
$person_ref->{PHONE}->{WORK}->[0] = "555-1234"; #An Array Ref. Might have > 1
$person_ref->{PHONE}->{WORK}->[1] = "555-4444";
$person_ref->{PHONE}->{CELL}->[0] = "555-4321";
...
my @people;
push @people, $person_ref;
Now, I can load up my @people array with all my people, or maybe use a hash:
my %person;
$person{$bobs_ssn} = $person_ref; #Now, all of Bob's info is indexed by his SSN.
So, the first thing you need to do is to think of what your structure should look like. What are the fields in your structure? What are the sub-fields? Figure out what your structure should look like, and then setup your hash of hashes to look like that. Figure out exactly how it will be stored and keyed.
Remember, this hash contains references to your genes (or whatever), so you want to choose your keys wisely.
Read the tutorial. Then, try your hand at it. It's not all that complicated to understand. However, it can be a bear to maintain.
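
As one possible layout for the gene data in the question (a sketch only; the ids, the coordinates, and the mrna, exons, and cds sub-keys are assumptions, not something the asker specified), properties and children can live under separate keys so they never collide:

```perl
use strict;
use warnings;

my %hash;
my ($gene, $mrna) = ('gene1', 'mrna1');    # hypothetical ids

# Properties of the gene sit next to a dedicated 'mrna' sub-hash.
$hash{$gene}{start_loc} = 100;
$hash{$gene}{mrna}{$mrna}{start_loc} = 120;

# Multiple values per key are stored in array references.
push @{ $hash{$gene}{mrna}{$mrna}{exons} }, 'exon1', 'exon2';
push @{ $hash{$gene}{mrna}{$mrna}{cds} },   'cds1';

for my $id (sort keys %{ $hash{$gene}{mrna} }) {
    my $rec = $hash{$gene}{mrna}{$id};
    print "$id starts at $rec->{start_loc}, exons: @{ $rec->{exons} }\n";
}
```

With this shape, iterating over the keys of $hash{$gene}{mrna} yields only mRNA ids, while start_loc stays a plain property of the gene.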
When you say use strict;, you give yourself some protection:
my $foo = "bar";
say $Foo; #This won't work!
This won't work because you didn't declare $Foo, you declared $foo. The use strict; pragma can catch variable names that are mistyped, but:
my %var;
$var{foo} = "bar";
say $var{Foo}; #Whoops!
This will not be caught (except maybe by a warning that $var{Foo} has not been initialized). The use strict; pragma can't detect typing mistakes in your keys.
The next step, after you've grown comfortable with references is to move onto object oriented Perl. There's a Tutorial for that too.
All Object Oriented Perl does is to take your hash references, and turns them into objects. Then, it creates subroutines that will help you keep track of manipulating objects. For example:
sub last_name {
    my $person = shift; #Don't worry about this for now..
    my $last_name = shift;
    if ( defined $last_name ) {
        $person->{NAME}->{LAST} = $last_name;
    }
    return $person->{NAME}->{LAST};
}
When I set my last name using this subroutine ...I mean method, I guarantee that the key will be $person->{NAME}->{LAST} and not $person->{LAST}->{NAME}, $person->{LAST}->{NMAE}, or $person->{last}->{name}.
The main problem isn't learning the mechanisms, but learning to apply them. So, think about exactly how you want to represent your items. Think about what fields you want, and how you're going to pull up that information.
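
To show how such an accessor might be used, here is a minimal sketch of a class built around the same hash layout. The Person package name, the new constructor, and its first/last arguments are all hypothetical, invented for this illustration:

```perl
use strict;
use warnings;

package Person;    # hypothetical class name

# Constructor: wraps the nested hash in an object.
sub new {
    my ($class, %args) = @_;
    my $self = { NAME => { FIRST => $args{first}, LAST => $args{last} } };
    return bless $self, $class;
}

# Combined getter/setter: the key path is written in exactly one place.
sub last_name {
    my ($self, $last_name) = @_;
    $self->{NAME}{LAST} = $last_name if defined $last_name;
    return $self->{NAME}{LAST};
}

package main;

my $bob = Person->new(first => 'Bob', last => 'Rogers');
$bob->last_name('Smith');
print $bob->last_name, "\n";
```

Callers never spell out the hash keys themselves, so a typo in a key name can only happen inside the method.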
You could try pushing each value onto a hash of arrays:
my (@gene, @mrna, @exon, @cds);
my %hash;
push @{ $hash{$gene[$_]} }, [$mrna[$_], $exon[$_], $cds[$_] ] for 0 .. $#gene;
This way gene is the key, with multiple values ($mrna, $exon, $cds) associated with it.
Iterate over keys/values as follows:
for my $key (sort keys %hash) {
    print "Gene: $key\t";
    for my $value (@{ $hash{$key} } ) {
        my ($mrna, $exon, $cds) = @$value; # De-references the array
        print "Values: [$mrna], [$exon], [$cds]\n";
    }
}
The answer to a question I've asked previously might be of help (Can a hash key have multiple 'subvalues' in perl?).

Nicer way to test if hash entry exists before assigning it

I'm looking for a nicer way to first "test" if a hash key exists before using it. I'm currently writing an event log parser that decodes hex numbers into strings. As I cannot be sure that my decode table contains every hex number, I first need to check if the key exists in a hash before assigning the value to a new variable. So what I'm doing a lot is:
if ($MEL[$i]{type} eq '5024') {
$MEL[$i]{decoded_inline} = $decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"}
if exists ($decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"})
}
What I do not like is that the expression $decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"} is twice in my code. Is there a nicer or shorter version of the line above?
I doubt this qualifies as "nice", but I think it is achieving the goal of not referring to the expression twice. I'm not sure it's worth this pain, mind you:
my $foo = $decode_hash{checkpoint};
my $bar = $MEL[$i]{raw}[128];
if ($MEL[$i]{type} eq '5024') {
$MEL[$i]{decoded_inline} = $foo->{$bar}
if exists ( $foo->{$bar} );
}
Yes, there is an easier way. You know that you can only store scalars (which includes references) in an array or hash, right? Well, there's a neat side effect to that. You can take references to deep hash or array slots and then treat them like scalar references. The unfortunate side effect is that it autovivifies the slot, but if you're always going to assign to that slot, and just want to do some checking first, it's not a bad way to keep from typing things over and over, as well as from repeatedly indexing the structures.
my $ref = \$decode_hash{checkpoint}{"$MEL[$i]{raw}[128]"};
unless ( defined( $$ref )) {
...
$$ref = {};
...
}
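
The autovivification side effect described above can be seen in a small sketch. The empty %decode_hash and the '0x1F' key are made up for illustration:

```perl
use strict;
use warnings;

my %decode_hash;    # hypothetical decode table, starts empty
my $before = exists $decode_hash{checkpoint} ? 1 : 0;

# Taking the reference creates the intermediate hash and the slot itself.
my $ref = \$decode_hash{checkpoint}{'0x1F'};
my $after = exists $decode_hash{checkpoint}{'0x1F'} ? 1 : 0;

# The long expression is now written once; everything else goes through $$ref.
$$ref = 'decoded text' unless defined $$ref;

print "before=$before after=$after value=$$ref\n";
```

The slot exists as soon as the reference is taken, even before anything is assigned to it, which is why this approach fits best when an assignment always follows.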
As long as an existing hash element can't have an undefined value, I would write this
if ($MEL[$i]{type} eq '5024') {
my $value = $decode_hash{checkpoint}{$MEL[$i]{raw}[128]};
$MEL[$i]{decoded_inline} = $value if defined $value;
}
(Note that you shouldn't have the double-quotes around the hash key.)

Which data structure should I use for a hash without values?

I need to check if a scalar exists in a set of scalars. What is the best way of storing this set of scalars?
Walking through an array would yield linear check time. The check time for a hash would be constant, but it feels inefficient since I wouldn't be using the value part of the hash.
Use a hash, but don't use the values. There really isn't a better way.
The memory overhead of using a hash to test for set membership is minimal, and is greatly outweighed by the cost of repeated sequential searches through an array. There are many ways to make a set membership style hash:
my %set = map {$_ => 1} ...;
my %set; $set{$_}++ for ...;
my %set; @set{...} = (1) x num_of_items;
Each of these allows you to use the hash lookup directly in a conditional without any additional syntax.
If your hash is going to be huge, and you are worried about the memory usage, you can store undef as the value for each key. But in that case you will have to use exists $set{...} in your conditionals.
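
A brief sketch of the undef-valued variant, assuming a made-up list of set members. Because every value is undef, membership has to be tested with exists:

```perl
use strict;
use warnings;

my @items = qw(apple pear plum);    # hypothetical set members

my %set;
@set{@items} = ();    # every key exists, every value is undef

# A plain truth test would fail for every key here, so use exists.
my @hits = grep { exists $set{$_} } qw(pear kiwi plum);

print "@hits\n";
```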
A hash should do fine. You could use undef for the value and use exists($h{$k}) or you could use 1 and use $h{$k}.
Judy::HS should be a bit more efficient, but there's no value-less version of that structure either.
You may find this section of the FAQ useful:
How can I tell whether a certain element is contained in a list or array?
Iterating through an array could be done:
my @arr = ( $list, $of, $scalars );
push @arr, $any, $other, $ones;
It's expensive to look through, but not that expensive unless you have a massive list:
grep { $_ eq $what_youre_looking_for } @arr;
The hash method also works:
my %hash = ( $list => 1, $of => 1, $scalars => 1 );
$hash{$another} = 1;
if ( exists $hash{$what_youre_looking_for} ) {
...
}
You could implement a binary search and a list sorter, but those are the two most used methods.
HashTable is the best option.
Note: since you said it is a set, I assume there are no duplicate elements.

Should Perl hashes always contain values?

I had an earlier question that received the following response from the noted Perl expert, Perl author and Perl trainer brian d foy:
[If] you're looking for a fixed sequence of characters at the end of each filename. You want to know if that fixed sequence is in a list of sequences that interest you. Store all the extensions in a hash and look in that hash:
my( $extension ) = $filename =~ m/\.([^.]+)$/;
if( exists $hash{$extension} ) { ... }
You don't need to build up a regular expression, and you don't need to go through several possible regex alternations to check every extension you have to examine.
Thanks for the advice brian.
What I now want to know is what is the best practice in a case like the above. Should one only define the keys, which is all I need to achieve what's described above, or should one always define a value as well?
It's usually preferable to set a defined value for every key. The idiomatic value (when you don't care about the value) is 1.
my %hash = map { $_ => 1 } @array;
Doing it this way makes the code that uses the hash slightly simpler because you can use $hash{key} as a Boolean value. If the value can be undefined, you need to use the more verbose exists $hash{key} instead.
That said, there are situations where a value of undef is desirable. For example: imagine that you're parsing C header files to extract preprocessor symbols. It would be logical to store these in a hash of name => value pairs.
#define FOO 1
#define BAR
In Perl, this would map to:
my %symbols = ( FOO => 1, BAR => undef);
In C a #define defines a symbol, not a value -- "defined" in C is mapped to "exists" in Perl.
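
A small sketch of how exists and defined tell these cases apart, using the %symbols hash from above. BAZ is an extra, made-up name that is not in the hash, and %report is a helper introduced only for this example:

```perl
use strict;
use warnings;

my %symbols = ( FOO => 1, BAR => undef );

my %report;
for my $name (qw(FOO BAR BAZ)) {
    $report{$name} = !exists $symbols{$name} ? 'not defined'
                   : defined $symbols{$name} ? "defined as $symbols{$name}"
                   :                           'defined with no value';
    print "$name: $report{$name}\n";
}
```

A plain truth test on $symbols{BAR} could not distinguish BAR from BAZ, which is the case where exists is required.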
You can't create a hash key without a value. The value can be undef, but it will be there. How else would you construct a hash? Or was your question about whether the value can be undef? In that case I would say that the value you store there (undef, 1, 0...) is entirely up to you. If a lot of folks are using it, then you probably want to store some true value, in case someone else uses if ($hash{$extension}) {...} instead of exists because they weren't paying attention.
undef is a value.
Of course, stuff like that always depends on what you are currently doing. But $foo{bar} is just a variable like $bar, and I don't see any reason why either one should not be undef every now and then.
PS:
That's why exists exists.
As others have said, the idiomatic solution for a hashset (a hash that only contains keys, not values) is to use 1 as the value because this makes the testing for existence easy. However, there is something to be said for using undef as the value. It will force the users to test for existence with exists which is marginally faster. Of course, you could test for existence with exists even when the value is 1 and avoid the inevitable bugs from users who forget to use exists.
Using undef as a value in hash is more memory efficient than storing 1.
Storing '1' in a Set-hash Considered Harmful
I know using Considered Harmful is considered harmful, but this is bad, almost as bad as unrestrained goto usage.
Ok, I've harped on this in a few comments, but I think I need a full response to demonstrate the issue.
Let's say we have a daemon process that provides back-end inventory control for a shop that sells widgets.
my @items = qw(
    widget
    thingy
    whozit
    whatsit
);
my @items_in_stock = qw(
    widget
    thingy
);
my %in_stock;
@in_stock{@items_in_stock} = (1) x @items_in_stock; #initialize all keys to 1
sub Process_Request {
    my $request = shift;
    if( $request eq REORDER ) {
        Reorder_Items(\@items, \%in_stock);
    }
    else {
        Error_Response( ILLEGAL_REQUEST );
    }
}
sub Reorder_Items{
    my $items = shift;
    my $in_stock = shift;
    # Order items we do not have in-stock.
    for my $item ( @$items ) {
        Reorder_Item( $item )
            if not exists $in_stock->{$item};
    }
}
The tool is great, it automatically keeps items in stock. Very nice. Now, the boss asks for automatically generated catalogs of in-stock items. So we modify Process_Request() and add catalog generation.
sub Process_Request {
    my $request = shift;
    if( $request eq REORDER ) {
        Reorder_Items(\@items, \%in_stock);
    }
    elsif( $request eq CATALOG ) {
        Build_Catalog(\@items, \%in_stock);
    }
    else {
        Error_Response( ILLEGAL_REQUEST );
    }
}
sub Build_Catalog {
    my $items = shift;
    my $in_stock = shift;
    my $catalog_response = '';
    foreach my $item ( @$items ) {
        $catalog_response .= Catalog_Item($item)
            if $in_stock->{$item};
    }
    return $catalog_response;
}
In testing, Build_Catalog() works fine. Hooray, we go live with the app.
Oops. For some reason nothing is being ordered, the company runs out of stock of everything.
The Build_Catalog() routine adds keys to %in_stock, so Reorder_Items() now sees everything as in stock and never makes an order.
Using Hash::Util's lock_hash can help prevent accidental hash modification. If we locked %in_stock before calling Build_Catalog() we would have gotten a fatal error and would never have gone live with the bug.
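
A minimal sketch of what locking looks like in practice, assuming a cut-down %in_stock with undef values. On a locked hash, exists still works for any key, while a plain value lookup of a key that is not in the hash is fatal:

```perl
use strict;
use warnings;
use Hash::Util qw(lock_hash);

my %in_stock;
@in_stock{qw(widget thingy)} = ();    # keys only; values stay undef
lock_hash(%in_stock);

# exists is still allowed for keys that are not in the locked hash.
my $has_whozit = exists $in_stock{whozit} ? 1 : 0;

# A plain value lookup of a missing key dies, so the mistake surfaces at once.
my $ok    = eval { my $value = $in_stock{whozit}; 1 };
my $error = $ok ? '' : $@;

print "exists check: $has_whozit\n";
print "value lookup died: $error" unless $ok;
```

The eval is there only so the sketch can report the error; in the daemon, the unguarded lookup would abort during testing instead of failing silently in production.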
In summary, it is best to test existence of keys rather than truth of your set-hash values. If you are using existence as a signifier, don't set your values to '1' because that will mask bugs and make them harder to track. Using lock_hash can help catch these problems.
If you must check for the truth of the values, do so in every case.