What is the meaning of this line keys(%S) = @C_fields;? - perl

I have one general question in Perl. What is the meaning of the line below?
keys(%S) = @C_fields;

keys(%S) = @C_fields; is identical to keys(%S) = scalar @C_fields;
and from perldoc -f keys
Used as an lvalue, keys allows you to increase the number of hash buckets allocated for the given hash. This can gain you a measure of efficiency if you know the hash is going to get big. (This is similar to pre-extending an array by assigning a larger number to $#array.) If you say
keys %hash = 200;
then %hash will have at least 200 buckets allocated for it--256 of them, in fact, since it rounds up to the next power of two.
So the hash %S will be given at least as many buckets as there are elements in the @C_fields array.
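A minimal sketch of what that pre-sizing looks like in practice (the contents of @C_fields here are made up for illustration):

use strict;
use warnings;

# Hypothetical field list standing in for @C_fields from the question.
my @C_fields = map { "field_$_" } 1 .. 500;

my %S;
# Pre-extend %S so it has at least scalar(@C_fields) buckets (rounded up
# to the next power of two), avoiding rehashes as the hash fills up.
keys(%S) = @C_fields;

# The hash is then typically populated with one entry per field.
$S{$_} = 1 for @C_fields;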

Related

How can I create a hash function in which different permutations of the digits of an integer form the same key?

For example, 20986 and 96208 should generate the same key (but not 09862 or 9862, since a leading zero means it is not even a 5-digit number, so we ignore those).
One option is to take the smallest (or largest) sorted permutation and use that sorted number as the hash key, but sorting is too costly for my case; I need to generate the key in O(1) time.
Another idea I have is to traverse the number, get the frequency of each digit, and then build a hash function out of those frequencies. Now, what is the best function to combine the frequencies, given that 0 <= sum(f[i]) <= number_of_digits?
To create an order-insensitive hash, simply hash each value (in your case the digits of the number) and then combine the results using a commutative function (e.g. addition, multiplication, or XOR). XOR is probably the most appropriate, as it retains a constant hash output size and is very fast.
Also, you will want to strip away any leading 0's before hashing the number.
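A minimal Perl sketch of that XOR approach (the per-digit hash table and the multiplier are arbitrary choices for illustration, not anything the answer prescribes):

use strict;
use warnings;

# Give each digit 0-9 a fixed pseudo-random hash value; offsetting by 1
# keeps the hash of digit 0 from being zero (which XOR would ignore).
my @digit_hash = map { ( ( $_ + 1 ) * 2654435761 ) % 2**32 } 0 .. 9;

sub permutation_key {
    my ($n) = @_;
    $n =~ s/^0+//;    # strip leading zeros before hashing, as noted above
    my $key = 0;
    $key ^= $digit_hash[$_] for split //, $n;    # XOR is commutative
    return $key;
}

print permutation_key(20986) == permutation_key(96208) ? "same\n" : "differ\n";
print permutation_key(20986) == permutation_key(9862)  ? "same\n" : "differ\n";

Note that because XOR cancels pairs, a digit appearing an even number of times drops out of the key entirely; using addition as the combining function avoids that at the cost of a slightly larger intermediate value.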

Is key mod TableSize a good hash function in this particular case

If a user is designing a hash table and knows that all the keys will be multiples of 4 between 0 and 10,000, evenly distributed, is the following hash function good?
hash(key) = key mod TableSize
where TableSize is some prime number.
My intuition is that this function is highly flawed because only 1/4 of the possible keys actually occur. But when I ran tests the hash values were about evenly distributed.
Am I missing something?
Good enough, provided that apart from being multiples of 4 the keys are effectively random. By the way, why not divide each key by 4 (>> 2) before putting it into the hash table?
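A quick way to sanity-check that claim is to tally the buckets directly. A short sketch (the prime 101 is an arbitrary choice):

use strict;
use warnings;

# Bucket the multiples of 4 in [0, 10000] into a prime-sized table and
# count how many keys land in each slot.
my $table_size = 101;    # an arbitrary prime, for illustration
my %count;
$count{ $_ % $table_size }++ for grep { $_ % 4 == 0 } 0 .. 10_000;

my @sizes = values %count;
my ( $min, $max ) = ( sort { $a <=> $b } @sizes )[ 0, -1 ];
printf "buckets used: %d of %d, smallest: %d, largest: %d\n",
    scalar(@sizes), $table_size, $min, $max;

# Because gcd(4, 101) = 1, the multiples of 4 cycle through every residue
# mod 101, so every bucket is used and the counts come out nearly even.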

How to evaluate a hash generating algorithm

What ways do you know to evaluate the efficiency of a hash function, besides generating a large set of values and looking at how they are distributed?
By efficiency I mean that the keys generated by your hash function distribute evenly. Is there a way to prove this without actually testing for actual values?
A hash function is only evenly distributed in the context of the data being hashed.
Consider two data sets:
Set 1
1, 3, 6, 2, 7, 9, 5, 8, 4
Set 2
65355, 96424664, 86463624, 133, 643564, 24232, 88677, 865747, 2224
A good hashing function for one set (e.g. mod 10 for set 1) gives no collisions and could be seen as the perfect hash for that data set.
However, apply it to the second set and there are collisions everywhere.
Hash = (x * 37) mod 256
is much better for the second set, but may not suit the first set quite so well... especially when partitioning the hash output into, e.g., a small number of buckets.
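As a rough illustration of the collision counts (it does not capture the small-bucket-count caveat), here is a short Perl check of both hash functions against both sets:

use strict;
use warnings;

# Tally how many keys share a bucket under each (hash, set) pairing.
my @set1 = ( 1, 3, 6, 2, 7, 9, 5, 8, 4 );
my @set2 = ( 65355, 96424664, 86463624, 133, 643564, 24232, 88677, 865747, 2224 );

my %hashes = (
    'mod 10'         => sub { $_[0] % 10 },
    '(x * 37) % 256' => sub { ( $_[0] * 37 ) % 256 },
);

for my $name ( sort keys %hashes ) {
    for my $set ( [ 'set 1', \@set1 ], [ 'set 2', \@set2 ] ) {
        my %bucket;
        $bucket{ $hashes{$name}->($_) }++ for @{ $set->[1] };
        my $collisions = 0;
        $collisions += $_ - 1 for values %bucket;
        printf "%-16s on %s: %d collision(s)\n", $name, $set->[0], $collisions;
    }
}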
What you can do is evaluate a hash against random data that you "expect" your function to have to handle... But that is making assumptions...
Premature optimisation is looking for the perfect hash function before you have enough real data to base your assessment on.
You should have enough real data well before the cost of rehashing becomes too prohibitive to change your hash function.
Update
Let's suppose we are looking for a hash function that generates an 8-bit hash of the input data. Let's further suppose that the hash function is supposed to take byte-streams of varying length.
If we assume that the bytes in the byte-streams are uniformly distributed, we can make some assessment of different hash functions.
int hash = 0;
for (byte b in datastream) hash = hash xor b;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context. If you don't see why this is, then you might have other problems.
int hash = 37;
for (byte b in datastream) hash = (31 * hash + b) mod 256;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context.
Now let's change the data set from variable-length strings of random numbers in the range 0 to 255 to variable-length strings comprising English sentences encoded as US-ASCII.
The XOR is then a poor hash because the input data never has the 8th bit set, and as a result it only generates hashes in the range 0-127; there is also a higher likelihood of some "hot" values because of the letter frequency in English words and the cancelling effect of the XOR.
The pair of primes remains reasonably good as a hash function because it uses the full output range, and the prime initial offset coupled with a different prime multiplier tends to spread the values out. But it is still weak against collisions due to how the English language is structured... something that only testing with real data can show.
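For the curious, here is a rough Perl rendering of the two hashes above, run over a handful of ASCII words (the word list is made up); it shows the XOR hash never leaving the 0-127 range while the multiplicative hash is free to land anywhere in 0-255:

use strict;
use warnings;

# 8-bit XOR-fold of the bytes in a string.
sub xor_hash {
    my $h = 0;
    $h ^= $_ for unpack 'C*', $_[0];
    return $h;
}

# 8-bit multiplicative hash: prime seed 37, prime multiplier 31.
sub mul_hash {
    my $h = 37;
    $h = ( 31 * $h + $_ ) % 256 for unpack 'C*', $_[0];
    return $h;
}

my @words = qw( the quick brown fox jumps over the lazy dog again );
for my $word (@words) {
    printf "%-6s xor=%3d  mul=%3d\n", $word, xor_hash($word), mul_hash($word);
}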

Sorting vs Linear search for finding min/max

Recently, I came across the following piece of code in Perl that returns the minimum numeric value among all passed arguments.
return 0 + ( sort { $a <=> $b } grep { $_ == $_ } @_ )[0];
I usually use simple linear search to find the min/max in a list, which for me seems to be simple and adequately optimal. Is the above code in any way better than simple linear search? Anything to do with perl in this case? Thanks!
O() doesn't say anything about how long an algorithm takes. For example, all else being equal, I'd always choose Algorithm 2 among the following two:
Algorithm 1: O(2*N + 1000 days) = O(N)
Algorithm 2: O(5*N + 100 ms) = O(N)
O() specifies how the time the algorithm takes scales as the size of the input increases. (Well, it can be used for any resource, not just time.) Since the earlier two answers only talk in terms of O(), they are useless.
If you want to know which algorithm is faster for an input of a given size, you'll need to benchmark them.
In this case, it looks like List::Util's min is always significantly better.
$ perl x.pl 10
           Rate  sort LUmin
sort  1438165/s    --  -72%
LUmin 5210584/s  262%    --
$ perl x.pl 100
           Rate  sort LUmin
sort   129073/s    --  -91%
LUmin 1485473/s 1051%    --
$ perl x.pl 1000
          Rate  sort LUmin
sort    6382/s    --  -97%
LUmin 199698/s 3029%    --
Code:
use strict;
use warnings;
use Benchmark qw( cmpthese );
use List::Util qw( min );
my %tests = (
    'sort'  => 'my $x = ( sort { $a <=> $b } @n )[0];',
    'LUmin' => 'my $x = min @n;',
);
$_ = 'use strict; use warnings; our @n; ' . $_
    for values %tests;
local our @n = map rand, 1..( $ARGV[0] // 10 );
cmpthese(-3, \%tests);
You are right. If you do not need sorted data for any other purpose, the simple linear search is fastest. To do its job, a sort would have to look at each datum at least once, anyway.
Only when the sorted data would be useful for other purposes -- or when I didn't care about run time, power usage, heat dissipation, etc. -- would I sort data to find the minimum and maximum values.
Now, @SimeonVisser is correct: the sort does take O(n*log(n)) time. That is not as much slower than O(n) as many programmers imagine. In practical cases of interest, the overhead of managing the sort's balanced binary tree (or other such structure) probably matters about as much as the log(n) factor does. So one needn't shrink in horror from the prospect of sorting! However, the linear search is still faster: you are quite right about this.
Moreover, @DavidO adds such an insightful comment that I will quote it here in his own words:
A linear search is also an easier algorithm to generalize. A linear search could easily (and relatively efficiently) be disk-based for large data sets, for example. Whereas doing a disk-based sort becomes relatively expensive, and even more complex if the field sizes aren't normalized.
Linear search is O(n) for obvious reasons. Sorting is O(n log n) (see sort in the Perl documentation). So yes, linear search is indeed faster in terms of complexity. This applies not only to Perl but to any programming language that implements these algorithms.
As with many problems, there are multiple ways to solve this one, and there are also multiple ways to obtain the min/max of a list. Conceptually I would say that linear search is better when you only want the min or max of a list, as the problem does not call for sorting.
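For completeness, a plain linear-scan minimum in Perl looks something like this (a sketch, not code taken from any of the answers above):

use strict;
use warnings;

# Single pass over the list, keeping the smallest value seen so far.
sub linear_min {
    my $min = shift;
    for (@_) {
        $min = $_ if $_ < $min;
    }
    return $min;
}

print linear_min( 5, 3, 8, 1, 9 ), "\n";    # prints 1

(In real code, List::Util's min does the same single pass in C, which is why it wins the benchmark above.)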

How to create empty buckets in extendible hashing

I know how to do extendible hashing on paper, but I don't know how it's possible for empty buckets to be created.
What would cause empty buckets to be created in extendible hashing? Can you show a simple example?
Assume the hash function h(x) is h(x) = x and each bucket can hold two things in it.
I will also use the least significant bits of the hash code as an index into the hash directory, as opposed to the most significant bits.
Basically, to get an empty bucket, we want to induce a doubling of the hash table by trying to place something into a bucket that has no space but we want that doubling to fail.
So, let's start inserting stuff.
First, insert 0. This should go in the first bucket, since h(0) = 0 and 0 % 2 = 0.
Then, insert 4. This should also go in the first bucket, since h(4) = 4 and 4 % 2 = 0.
Now, inserting 8 fails since the bucket can only hold two things, so the table must be doubled in size. Therefore, the global hash level increases from 1 to 2. Other changes include a new third bucket and the fourth hash index pointing to the second bucket.
Unfortunately, since the rehashing process takes h(x) % 4 and all of our numbers are (deliberately) multiples of 4, the first bucket remains too full and the third bucket is empty. This resolves itself with yet another doubling of the hash table.
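To make the walkthrough concrete, here is a small self-contained Perl sketch of extendible hashing under the same assumptions (h(x) = x, bucket capacity 2, least-significant-bit directory indexing); inserting 0, 4 and 8 leaves empty buckets behind exactly as described:

use strict;
use warnings;

my $CAP          = 2;    # bucket capacity
my $global_depth = 1;
my $b0  = { depth => 1, items => [] };
my $b1  = { depth => 1, items => [] };
my @dir = ( $b0, $b1 );  # directory of 2**$global_depth bucket refs

sub insert {
    my ($x) = @_;
    while (1) {
        my $idx    = $x % ( 2**$global_depth );    # least significant bits
        my $bucket = $dir[$idx];
        if ( @{ $bucket->{items} } < $CAP ) {
            push @{ $bucket->{items} }, $x;
            return;
        }

        # Full bucket: double the directory if its local depth already
        # equals the global depth.
        if ( $bucket->{depth} == $global_depth ) {
            @dir = ( @dir, @dir );
            $global_depth++;
        }

        # Split the full bucket: bump its local depth and move the items
        # whose next bit is 1 into a new sibling bucket.
        my $new_depth = $bucket->{depth} + 1;
        my $sibling   = { depth => $new_depth, items => [] };
        $bucket->{depth} = $new_depth;

        my @keep;
        for my $item ( @{ $bucket->{items} } ) {
            if ( ( $item >> ( $new_depth - 1 ) ) & 1 ) {
                push @{ $sibling->{items} }, $item;
            }
            else {
                push @keep, $item;
            }
        }
        $bucket->{items} = \@keep;

        # Re-point the directory entries whose index has that bit set.
        for my $i ( 0 .. $#dir ) {
            next unless $dir[$i] == $bucket;
            $dir[$i] = $sibling if ( $i >> ( $new_depth - 1 ) ) & 1;
        }
        # Retry the insert; it may trigger another doubling and split.
    }
}

insert($_) for 0, 4, 8;

# Dump the directory; several entries end up pointing at empty buckets.
for my $i ( 0 .. $#dir ) {
    printf "dir[%0*b] -> [%s]\n", $global_depth, $i,
        join( ", ", @{ $dir[$i]{items} } );
}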