Hash keys and values in the right order - perl

I've seen many times the following piece of code to join a hash to another hash
%hash1 = ('one' => "uno");
%hash2 = ('two' => "dos", 'three' => "tres");
#hash1{keys %hash2} = values %hash2;
I thought that every time the "values" or "keys" function is called, their output order was random. If this is true, how does the statement above gets the keys and values in the right order on both sides?
In other words, why there is not a chance of getting 'two' => 'tres' in %hash1 after merging both hashes?
Is Perl smart enough to know that if "keys" and "values" are called on the same line, then keys and values must be given in the same order?

See perldoc -f keys
So long as a given hash is unmodified you may rely on keys, values and each to repeatedly return the same order as each other.

A hash is an array of linked lists. A hashing function converts the key into a number which is used as the index of the array element ("bucket") into which to store the value. More than one key can hash to the same index ("collision"), a situation handled by the linked lists.
The iterator used by keys, values and each returns the elements in a order consistent with their location in the hash. I imagine it iterates over the linked list in the first bucket, then over the linked list in the second bucket, etc. The points is that it doesn't randomize the order in which it iterates over the elements of the hash. That's why the docs guarantee the following:
So long as a given hash is unmodified you may rely on keys, values and each to repeatedly return the same order as each other.
What is "random"[1] is in which bucket number to which a key will hash. Each hash has a random secret number that perturbs the hashing function. This causes the order of the elements in a hash to be different for each hash and for each run of a program.[2]
Adding elements to a hash can cause the number of buckets to increase, and it can cause trigger the secret number to change (if one of the linked lists becomes abnormally long). Both of these will change the order of the elements in that hash.
$ perl -le'
my %h1 = map { $_ => 1 } "a".."j";
my %h2 = map { $_ => 1 } "a".."j";
print keys(%$_) for \%h1, \%h1, \%h2, \%h2;
'
hjfeadbigc
hjfeadbigc
bdgcifjhae
bdgcifjhae
$ perl -le'
my %h1 = map { $_ => 1 } "a".."j";
my %h2 = map { $_ => 1 } "a".."j";
print keys(%$_) for \%h1, \%h1, \%h2, \%h2;
'
dcahigjbfe
dcahigjbfe
gihacdefbj
gihacdefbj
It's not quite random. If you insert two elements in a hash, the second element has a greater than 50% chance of being returned after the first by the iterator.
In older versions of Perl, things weren't quite as random.

Related

Does iterating over a hash reference require implicitly copying it in perl?

Lets say I have a large hash and I want to iterate over the contents of it contents. The standard idiom would be something like this:
while(($key, $value) = each(%{$hash_ref})){
///do something
}
However, if I understand my perl correctly this is actually doing two things. First the
%{$hash_ref}
is translating the ref into list context. Thus returning something like
(key1, value1, key2, value2, key3, value3 etc)
which will be stored in my stacks memory. Then the each method will run, eating the first two values in memory (key1 & value1) and returning them to my while loop to process.
If my understanding of this is right that means that I have effectively copied my entire hash into my stacks memory only to iterate over the new copy, which could be expensive for a large hash, due to the expense of iterating over the array twice, but also due to potential cache hits if both hashes can't be held in memory at once. It seems pretty inefficient. I'm wondering if this is what really happens, or if I'm either misunderstanding the actual behavior or the compiler optimizes away the inefficiency for me?
Follow up questions, assuming I am correct about the standard behavior.
Is there a syntax to avoid copying of the hash by iterating over it values in the original hash? If not for a hash is there one for the simpler array?
Does this mean that in the above example I could get inconsistent values between the copy of my hash and my actual hash if I modify the hash_ref content within my loop; resulting in $value having a different value then $hash_ref->($key)?
No, the syntax you quote does not create a copy.
This expression:
%{$hash_ref}
is exactly equivalent to:
%$hash_ref
and assuming the $hash_ref scalar variable does indeed contain a reference to a hash, then adding the % on the front is simply 'dereferencing' the reference - i.e. it resolves to a value that represents the underlying hash (the thing that $hash_ref was pointing to).
If you look at the documentation for the each function, you'll see that it expects a hash as an argument. Putting the % on the front is how you provide a hash when what you have is a hashref.
If you wrote your own subroutine and passed a hash to it like this:
my_sub(%$hash_ref);
then on some level you could say that the hash had been 'copied', since inside the subroutine the special #_ array would contain a list of all the key/value pairs from the hash. However even in that case, the elements of #_ are actually aliases for the keys and values. You'd only actually get a copy if you did something like: my #args = #_.
Perl's builtin each function is declared with the prototype '+' which effectively coerces a hash (or array) argument into a reference to the underlying data structure.
As an aside, starting with version 5.14, the each function can also take a reference to a hash. So instead of:
($key, $value) = each(%{$hash_ref})
You can simply say:
($key, $value) = each($hash_ref)
No copy is created by each (though you do copy the returned values into $key and $value through assignment). The hash itself is passed to each.
each is a little special. It supports the following syntaxes:
each HASH
each ARRAY
As you can see, it doesn't accept an arbitrary expression. (That would be each EXPR or each LIST). The reason for that is to allow each(%foo) to pass the hash %foo itself to each rather than evaluating it in list context. each can do that because it's an operator, and operators can have their own parsing rules. However, you can do something similar with the \% prototype.
use Data::Dumper;
sub f { print(Dumper(#_)); }
sub g(\%) { print(Dumper(#_)); } # Similar to each
my %h = (a=>1, b=>2);
f(%h); # Evaluates %h in list context.
print("\n");
g(%h); # Passes a reference to %h.
Output:
$VAR1 = 'a'; # 4 args, the keys and values of the hash
$VAR2 = 1;
$VAR3 = 'b';
$VAR4 = 2;
$VAR1 = { # 1 arg, a reference to the hash
'a' => 1,
'b' => 2
};
%{$h_ref} is the same as %h, so all of the above applies to %{$h_ref} too.
Note that the hash isn't copied even if it is flattened. The keys are "copied", but the values are returned directly.
use Data::Dumper;
my %h = (abc=>"def", ghi=>"jkl");
print(Dumper(\%h));
$_ = uc($_) for %h;
print(Dumper(\%h));
Output:
$VAR1 = {
'abc' => 'def',
'ghi' => 'jkl'
};
$VAR1 = {
'abc' => 'DEF',
'ghi' => 'JKL'
};
You can read more about this here.

Comparison to an array of a value [duplicate]

This question already has answers here:
How can I verify that a value is present in an array (list) in Perl?
(8 answers)
Closed 9 years ago.
I'm still feeling my way though perl and so there's probably a simple way of doing this but I can find it. I want to compare a single value say A or E to an array that may or may not contain that value, eg A B C D and then perform an action if they match. How should I set this up?
Thanks.
You filter each element of the array to see if it is the element you are looking for and then use the resulting array as a boolean value (not empty = true, empty = false):
#filtered_array = grep { $_ eq 'A' } #array;
if (#filtered_array) {
print "found it!\n";
}
If you store the list in an array then the only way is to examine each element individually in a loop, using grep, or for or any from List::MoreUtils. (grep is the worst of these, as it searches the entire array, even if a match has been found early on.) This is fine if the array is small, but you will hit performance probelms if the array has a significant size and you have to check it frequently.
You can speed things up by representing the same list in a hash, when a check for membership is just a single key lookup.
Alternatively, if the list is enormous, then it is best kept in a database, using SQLite.
Are you stuck on arrays?
Whenever in Perl you're talk about quickly looking up data, you should think in terms of hashes. A hash is a collection of data like an array, but it is keyed, and looking up the key is a very fast operation in Perl.
There's nothing that says the keys to your hash can't be your data, and it is very common in Perl to index an array with a hash in order to quickly search for values.
This turns your array #array into a hash called %arrays_hash.
use strict;
use warnings;
use feature qw(say);
use autodie;
my #array = qw(Alpha Beta Delta Gamma Ohm);
my %array_index;
for my $entry ( #array ) {
$array_index{$entry} = 1; # Doesn't matter. As long as it isn't blank or zero
}
Now, looking up whether or not your data is in your array is very quick. Just simply see if it's a key in your %array_index:
my $item = "Delta"; # Is this in my initial array?
if ( $array_index{$item} ) {
say "Yes! Item '$item' is in my array.";
}
else {
say "No. Item '$item' isn't in my array. David sad.";
}
This is so common, that you'll see a lot of programs that use the map command to index the array. Instead of that for loop, I could have done this:
my %array_index = ( map { $_ => 1 } #array );
or
my %array_index;
map { $array_index{$_} = 1 } #array;
You'll see both. The first one is a one liner. The map command takes each entry in the array, and puts it in $_. Then, it returns the results into an array. Thus, the map will return an array with your data in the even positions (0, 2, 4 8...) and a 1 in the odd positions (1, 3, 5...).
The second one is more literal and easier to understand (or about as easy to understand in a map command). Again, each item in your #array is being assigned to $_, and that is being used as the key in my %array_index hash.
Whether or not you want to use hashes depend upon the length of your array, and how many items of input you'll be searching for. If you're simply searching whether a single item is in your array, I'd probably use List::Utils or List::MoreUtils, or use a for loop to search each value of my array. If I am doing this for multiple values, I am better off with a hash.

How to get all different values(or keys) in a hash in Perl

I have a little problem with some hashes.
if i have a hash containing, John,John,John,Bob,Bob,Paul - is there then a function that can return just:
John, Bob, Paul.
In other words i want to get all different values(or keys if value is not possible) - but only once :).
I hope u understand my question, thx :)
TIMTOWTDI:
my #unique = keys { reverse %hash };
Note the performance caveat with reverse though:
This operator is also handy for inverting a hash, although there are
some caveats. If a value is duplicated in the original hash, only one
of those can be represented as a key in the inverted hash. Also, this
has to unwind one hash and build a whole new one, which may take some
time on a large hash, such as from a DBM file.
%by_name = reverse %by_address; # Invert the hash
Something like this may help you:
use List::MoreUtils qw{ uniq };
my %hash = ( a => 'Paul', b => 'Paul', c => 'Peter' );
my #uniq_names = uniq values %hash;
print "#uniq_names\n";
Keys are uniq always.
De-duping is easily (and idiomatically) done by using a hash:
my #uniq = keys { map { $_ => 1 } values %hash };
A simple enough approach that does not require installing modules. Since hash keys must be unique, any list of strings become automatically de-duped when used as keys in the same hash.
Note the use of curly braces forming an anonymous hash { ... } around the map statement. That is required for keys.
Note also that values %hash can be any list of strings, such as one or more arrays, subroutine calls and whatnot.

In a large Perl hash, how do I extract a subset of particular keys?

I have a large hash and have a subset of keys that I want to extract the values for, without having to iterate over the hash looking for each key (as I think that will take too long).
I was wondering if I can use grep to grab a file with a subset of keys? For example, something along the lines of:
my #value = grep { defined $hash{$_}{subsetofkeysfile} } ...;
Use a hash slice:
my %hash = (foo => 1, bar => 2, fubb => 3);
my #subset_of_keys = qw(foo fubb);
my #subset_of_values = #hash{#subset_of_keys}; # (1, 3)
Two points of clarification. (1) You never need to iterate over a hash looking for particular keys or values. You simply feed the hash a key, and it returns the corresponding value -- or, in the case of hash slices, you supply a list of keys and get back a list of values. (2) There is a difference between defined and exists. If you're merely interested in whether a hash contains a specific key, exists is the correct test.
If the hash may not contain any keys from the subset, use a hash slice and grep:
my #values = grep defined, #hash{#keys};
You can omit the grep part if all keys are contained in the hash.
Have a look at perldata,
and try something like
foreach (#hash{qw[key1 key2]}) {
# do something
}

What's the safest way to iterate through the keys of a Perl hash?

If I have a Perl hash with a bunch of (key, value) pairs, what is the preferred method of iterating through all the keys? I have heard that using each may in some way have unintended side effects. So, is that true, and is one of the two following methods best, or is there a better way?
# Method 1
while (my ($key, $value) = each(%hash)) {
# Something
}
# Method 2
foreach my $key (keys(%hash)) {
# Something
}
The rule of thumb is to use the function most suited to your needs.
If you just want the keys and do not plan to ever read any of the values, use keys():
foreach my $key (keys %hash) { ... }
If you just want the values, use values():
foreach my $val (values %hash) { ... }
If you need the keys and the values, use each():
keys %hash; # reset the internal iterator so a prior each() doesn't affect the loop
while(my($k, $v) = each %hash) { ... }
If you plan to change the keys of the hash in any way except for deleting the current key during the iteration, then you must not use each(). For example, this code to create a new set of uppercase keys with doubled values works fine using keys():
%h = (a => 1, b => 2);
foreach my $k (keys %h)
{
$h{uc $k} = $h{$k} * 2;
}
producing the expected resulting hash:
(a => 1, A => 2, b => 2, B => 4)
But using each() to do the same thing:
%h = (a => 1, b => 2);
keys %h;
while(my($k, $v) = each %h)
{
$h{uc $k} = $h{$k} * 2; # BAD IDEA!
}
produces incorrect results in hard-to-predict ways. For example:
(a => 1, A => 2, b => 2, B => 8)
This, however, is safe:
keys %h;
while(my($k, $v) = each %h)
{
if(...)
{
delete $h{$k}; # This is safe
}
}
All of this is described in the perl documentation:
% perldoc -f keys
% perldoc -f each
One thing you should be aware of when using each is that it has
the side effect of adding "state" to your hash (the hash has to remember
what the "next" key is). When using code like the snippets posted above,
which iterate over the whole hash in one go, this is usually not a
problem. However, you will run into hard to track down problems (I speak from
experience ;), when using each together with statements like
last or return to exit from the while ... each loop before you
have processed all keys.
In this case, the hash will remember which keys it has already returned, and
when you use each on it the next time (maybe in a totaly unrelated piece of
code), it will continue at this position.
Example:
my %hash = ( foo => 1, bar => 2, baz => 3, quux => 4 );
# find key 'baz'
while ( my ($k, $v) = each %hash ) {
print "found key $k\n";
last if $k eq 'baz'; # found it!
}
# later ...
print "the hash contains:\n";
# iterate over all keys:
while ( my ($k, $v) = each %hash ) {
print "$k => $v\n";
}
This prints:
found key bar
found key baz
the hash contains:
quux => 4
foo => 1
What happened to keys "bar" and baz"? They're still there, but the
second each starts where the first one left off, and stops when it reaches the end of the hash, so we never see them in the second loop.
The place where each can cause you problems is that it's a true, non-scoped iterator. By way of example:
while ( my ($key,$val) = each %a_hash ) {
print "$key => $val\n";
last if $val; #exits loop when $val is true
}
# but "each" hasn't reset!!
while ( my ($key,$val) = each %a_hash ) {
# continues where the last loop left off
print "$key => $val\n";
}
If you need to be sure that each gets all the keys and values, you need to make sure you use keys or values first (as that resets the iterator). See the documentation for each.
Using the each syntax will prevent the entire set of keys from being generated at once. This can be important if you're using a tie-ed hash to a database with millions of rows. You don't want to generate the entire list of keys all at once and exhaust your physical memory. In this case each serves as an iterator whereas keys actually generates the entire array before the loop starts.
So, the only place "each" is of real use is when the hash is very large (compared to the memory available). That is only likely to happen when the hash itself doesn't live in memory itself unless you're programming a handheld data collection device or something with small memory.
If memory is not an issue, usually the map or keys paradigm is the more prevelant and easier to read paradigm.
A few miscellaneous thoughts on this topic:
There is nothing unsafe about any of the hash iterators themselves. What is unsafe is modifying the keys of a hash while you're iterating over it. (It's perfectly safe to modify the values.) The only potential side-effect I can think of is that values returns aliases which means that modifying them will modify the contents of the hash. This is by design but may not be what you want in some circumstances.
John's accepted answer is good with one exception: the documentation is clear that it is not safe to add keys while iterating over a hash. It may work for some data sets but will fail for others depending on the hash order.
As already noted, it is safe to delete the last key returned by each. This is not true for keys as each is an iterator while keys returns a list.
I always use method 2 as well. The only benefit of using each is if you're just reading (rather than re-assigning) the value of the hash entry, you're not constantly de-referencing the hash.
I may get bitten by this one but I think that it's personal preference. I can't find any reference in the docs to each() being different than keys() or values() (other than the obvious "they return different things" answer. In fact the docs state the use the same iterator and they all return actual list values instead of copies of them, and that modifying the hash while iterating over it using any call is bad.
All that said, I almost always use keys() because to me it is usually more self documenting to access the key's value via the hash itself. I occasionally use values() when the value is a reference to a large structure and the key to the hash was already stored in the structure, at which point the key is redundant and I don't need it. I think I've used each() 2 times in 10 years of Perl programming and it was probably the wrong choice both times =)
I usually use keys and I can't think of the last time I used or read a use of each.
Don't forget about map, depending on what you're doing in the loop!
map { print "$_ => $hash{$_}\n" } keys %hash;
I woudl say:
Use whatever's easiest to read/understand for most people (so keys, usually, I'd argue)
Use whatever you decide consistently throught the whole code base.
This give 2 major advantages:
It's easier to spot "common" code so you can re-factor into functions/methiods.
It's easier for future developers to maintain.
I don't think it's more expensive to use keys over each, so no need for two different constructs for the same thing in your code.