Perl Multi hash vs Single hash

I want to read and process sets of input from a file and then print them out.
There are 3 keys which I need to use to store the data.
Assume the 3 keys are k1, k2, and k3.
Which of the following will give better performance?
$hash{k1}->{k2}->{k3} = $val;
or
$hash{"k1,k2,k3"} = $val;
For my previous question I got the answer that all perl hash keys are treated as strings.

Unless you're really dealing with large datasets, use whichever one produces cleaner code. I may be wrong but this reeks of premature optimization.
If it isn't, this may depend on the range of possible keys. If ordering is not an issue, arrange your data so that k1 is the smallest set of keys and k3 is the largest. I suspect you'll use less memory on hashes that way. Depending on your datasets it may even be prudent to presize your hashes (I think keys(%hash) = 100 does the trick).
As to which is faster, only profiling will tell. Try both and see for yourself.
Also, note that the arrows in $hash{k1}->{k2}->{k3} are unnecessary. You can write $hash{k1}{k2}{k3}. Dereference arrows aren't needed between subscripts, whether square or curly.
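For example, the following sketch (with made-up key names) shows that both spellings address the same element; the intermediate hashrefs spring into existence through autovivification either way.
use strict;
use warnings;
use Data::Dumper;

my %hash;
$hash{k1}->{k2}->{k3} = 'with arrows';   # autovivifies the inner hashrefs
$hash{k1}{k2}{k3}     = 'no arrows';     # same slot, arrows omitted

print Dumper(\%hash);                    # one nested structure, final value 'no arrows'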

Hash lookup speed is independent of the number of items in the hash, so the version which only does one hash lookup will perform the hash lookup portion of the operation faster than the version which does three hash lookups. But, on the other hand, the single-lookup version has to concatenate the three keys into a single string before they can be used as a combined key; if this string is anonymous (e.g., $hash{"$a,$b,$c"}), this will likely involve some fun stuff like memory allocation. Overall, I would expect the concatenation to be quick enough that the one-lookup version would be faster than the three-lookup version in most cases, but the only way to know which is faster in your case would be to write the same code in both styles and Benchmark the difference.
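As a concrete starting point, here is a hedged sketch of such a comparison using the core Benchmark module; the key names, values, and timing arguments are made up, and the results will vary by machine and Perl build.
use strict;
use warnings;
use Benchmark qw(cmpthese);

my ($k1, $k2, $k3) = qw(foo bar baz);
my (%nested, %joined);

cmpthese(-2, {
    nested => sub {
        $nested{$k1}{$k2}{$k3} = 42;      # three hash lookups
        my $v = $nested{$k1}{$k2}{$k3};
    },
    joined => sub {
        $joined{"$k1,$k2,$k3"} = 42;      # one lookup, plus string concatenation
        my $v = $joined{"$k1,$k2,$k3"};
    },
});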
However, like everyone else has already said, this is a premature and worthless micro-optimization. Unless you know that you have a performance problem (or you have historical performance data which shows that a problem is developing and will be upon you in the near future) and you have profiled your code to determine that hash lookups are the cause of your performance problem, you're wasting your time worrying about this. Hash lookups are fast. It's hardly a real benchmark, but:
$ time perl -e '$foo{bar} for 1 .. 1_000_000'
real 0m0.089s
user 0m0.088s
sys 0m0.000s
In this trivial (and, admittedly, highly flawed) example, I got a rate equivalent to roughly 11 million hash lookups per second. In the time you spent asking the question, your computer could have done hundreds of millions, if not billions, of hash lookups.
Write your hash lookups in whatever style is most readable and most maintainable in your application. If you try to optimize this to be as fast as possible, the wasted programmer time will be (many!) orders of magnitude larger than any processing time that you could ever hope to save with the optimizations.

If you have memory concerns, I would suggest using Devel::Size from CPAN early in development to compare the size of both alternatives.
Otherwise, use whichever one feels friendlier to you!
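For instance, a hedged sketch of that comparison using Devel::Size's total_size(); the three-letter keys and the dummy value are placeholders.
use strict;
use warnings;
use Devel::Size qw(total_size);

my (%nested, %joined);
for my $k1 ('a' .. 'j') {
    for my $k2 ('a' .. 'j') {
        for my $k3 ('a' .. 'j') {
            $nested{$k1}{$k2}{$k3} = 1;   # three-level structure
            $joined{"$k1,$k2,$k3"} = 1;   # flat structure with a combined key
        }
    }
}

printf "nested: %d bytes\n", total_size(\%nested);
printf "joined: %d bytes\n", total_size(\%joined);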

Related

Perfect hashing for OpenCL

I have a set (static, known in compile time) of about 2 million values, 20 bytes each. What I need is a fast O(1) way to check if a given value is in this set. It seems that perfect hash function with a bit array is ideal for this, but I can't find a simple way to create it. There are some utilities such as gperf, but they are too complicated. Also, in my case it's not necessary to have a close to 100% load factor, even 10% is enough, but with guarantee of no collisions. Another requirement for this function is simplicity, without many conditions: it will run on GPU.
What would you advise for this case?
See my answer here. The problem is a bit different, but the solution could be tailored to suit your needs. The original uses a 100% load factor, but that could be easily changed. It works by shuffling the array in-place at startup-time (this could be done at compile time, but that would imply compiling generated code).
WRT the hash function: if you don't know anything about the contents of the 20-byte objects, any reasonable hash function (FNV, Jenkins, or mine) will be good enough.
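To show how little code such a function needs, here is a hedged, host-side sketch of 32-bit FNV-1a in Perl (a GPU version would be written in OpenCL C instead); it assumes a Perl built with 64-bit integers so the multiplication does not lose precision.
use strict;
use warnings;

sub fnv1a_32 {
    my ($data) = @_;
    my $hash = 0x811C9DC5;                           # FNV-1a offset basis
    for my $byte (unpack 'C*', $data) {
        $hash ^= $byte;
        $hash = ($hash * 0x01000193) & 0xFFFFFFFF;   # multiply by the FNV prime, keep 32 bits
    }
    return $hash;
}

printf "%08x\n", fnv1a_32('example value');          # any byte string works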
After reading more information about perfect hashing, I've decided not to try implementing it, and instead used a cuckoo hashtable. It's much simpler and requires at most 2 accesses to the table (or any other number >1; the most commonly used values are 2..5) instead of 1 for perfect hashing.

When does it make sense to presize a hash?

From perldata:
You can preallocate space for a hash by assigning to the keys() function.
This rounds up the allocated buckets to the next power of two:
keys(%users) = 1000; # allocate 1024 buckets
Is there a rule of thumb for when presizing a hash will improve performance?
The rule of thumb is that the larger you know the hash will be, the more likely you'll get value out of pre-sizing it. Consider a hash with 10 slots: as you add keys one after another, the number of expansions will a) be few (if any), and b) small (since there is little data to copy).
But if you KNOW you're going to need at least 1M items, then there's no reason to repeatedly expand and copy the underlying, ever-growing data structures while the table fills up.
Will you NOTICE this expansion? Eh, maybe. Modern machines are pretty darn fast; it may not come up. But it's a grand opportunity for heap growth and a cascade of reallocations and all sorts of things. So, if you know you're going to use it, presizing is a "cheap" way to squeeze out a bit more performance.
I tried to benchmark the cost of expansion as the hash grows:
use Benchmark qw(cmpthese);

# few values
cmpthese(-4, {
    prealloc => sub {
        my %hash;
        keys(%hash) = 17576;
        $hash{$_} = $_ for 'aaa' .. 'zzz';
    },
    normal => sub {
        my %hash;
        $hash{$_} = $_ for 'aaa' .. 'zzz';
    },
});

# more values
cmpthese(-8, {
    prealloc => sub {
        my %hash;
        keys(%hash) = 456976;
        $hash{$_} = $_ for 'aaaa' .. 'zzzz';
    },
    normal => sub {
        my %hash;
        $hash{$_} = $_ for 'aaaa' .. 'zzzz';
    },
});
The results do not suggest a big optimization; however, reducing heap fragmentation, as mentioned by Will Hartung, might be a benefit. This was run with Perl 5.12 on a WinXP machine.
Rate normal prealloc
normal 48.3/s -- -2%
prealloc 49.4/s 2% --
(warning: too few iterations for a reliable count)
s/iter normal prealloc
normal 3.62 -- -1%
prealloc 3.57 1% --
Basically, presizing is the door to optimizing hash performance. Hash performance depends heavily both on the hashing algorithm used and on the data you are handling, so it is almost impossible to come up with rules of thumb. Anyway, something can be said.
You know that each data structure offers a given balance between space and time efficiency. Hash tables are especially good as to time efficiency, offering appealing constant (O(1)) time access.
This holds true unless there is a collision. When a collision happens, access time is linear in the size of the bucket corresponding to the colliding value. (Have a look at this for more details.) Collisions, apart from being "slower", are mostly a disruption of the access time guarantee, which is the single most important aspect that often leads to choosing a hash table in the first place.
Ideally, hash tables could aim at what is known as "perfect hashing" (which is actually feasible only when you can fine-tune the algorithm to the kind of data you will handle), but this is not so easy to attain in the general case (that's a euphemism, actually). Anyway, it is a matter of fact that bigger hash tables (together with a good hashing algorithm) can reduce the frequency of collisions, and thus improve performance, at the expense of memory. Smaller hash tables will see more collisions (hence lower performance and a weaker access time guarantee) but occupy less memory.
So, if you profile your program and see that hash table access is a bottleneck (for any reason), you have a chance to solve this by reserving more memory for the hash space (if you have memory to give).
In any case, I would not increase this value at random, but only after thorough profiling, since it is also true that the hashing algorithm Perl uses is compiled in (AFAIK), and this also has a big effect on hash performance (in other words, you could have a lot of collisions even if you make the hash space bigger).
As usual with performance related things, it could be useful or not, it depends on your concrete case.

How is Tie::IxHash implemented in Perl?

I've recently come upon a situation in Perl where the use of an order-preserving hash would make my code more readable and easier to use. After a bit of searching, I found out about the Tie::IxHash CPAN module, which does exactly what I want. Before I throw caution to the wind and just start using it though, I'd like to get a better idea of how it works and what kind of performance I can expect from it.
From what I know, ordered associative arrays are usually implemented as tries, which I've never actually used before, but I do know that their performance falls in line with my expectations (I expect to do a lot of reading and writing, and will need to always remember the order keys were originally inserted). My problem is I can't figure out whether this is how Tie::IxHash was made, what sort of performance I should expect from it, or whether there's some better/cleaner option for me (I'd really rather not keep a separate array and hash to accomplish what I need, as this produces ugly code and space inefficiency). I'm also just curious for curiosity's sake. If it wasn't implemented as a trie, how was it implemented? I know I can wade through the source code, but I'm hoping someone else has already done this, and I'm guessing that I'm not the only person who'll be interested in the answer.
So... Ideas? Suggestions? Advice?
A Tie::IxHash object is implemented in a direct fashion, using the regular Perl building blocks that one would expect. Specifically, such an object is a blessed array reference holding 4 elements.
[0] A hash reference to store the keys of the user's hash. This is used any time the module needs to check for the existence of a key.
[1] An array reference to store the keys of the user's hash, in order.
[2] A parallel array reference to store the values, also in order.
[3] An integer to keep track of the current position within the two parallel arrays. This is needed for iteration.
Regarding performance, a good benchmark is usually worth more than speculation. My guess is that the biggest performance hit will come with deletion, because the arrays holding the ordered keys and values will require adjustment.
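For a feel of what that layout looks like in code, here is a hedged, stripped-down sketch of an insertion-order-preserving tied hash using the same four slots (this is not the actual Tie::IxHash source, and the package name OrderedHash is made up):
package OrderedHash;

use strict;
use warnings;

sub TIEHASH {
    my ($class) = @_;
    # [0] key => index map, [1] ordered keys, [2] parallel values, [3] iterator position
    return bless [ {}, [], [], 0 ], $class;
}

sub STORE {
    my ($self, $key, $value) = @_;
    if (exists $self->[0]{$key}) {
        $self->[2][ $self->[0]{$key} ] = $value;      # existing key: overwrite in place
    }
    else {
        push @{ $self->[1] }, $key;                   # new key: append to both arrays
        push @{ $self->[2] }, $value;
        $self->[0]{$key} = $#{ $self->[1] };
    }
}

sub FETCH {
    my ($self, $key) = @_;
    return exists $self->[0]{$key} ? $self->[2][ $self->[0]{$key} ] : undef;
}

sub EXISTS { exists $_[0][0]{ $_[1] } }

sub DELETE {
    my ($self, $key) = @_;
    return undef unless exists $self->[0]{$key};
    my $idx = delete $self->[0]{$key};
    splice @{ $self->[1] }, $idx, 1;                  # the costly part: both parallel
    my ($val) = splice @{ $self->[2] }, $idx, 1;      # arrays must shift down
    $self->[0]{ $self->[1][$_] } = $_ for $idx .. $#{ $self->[1] };
    return $val;
}

sub FIRSTKEY { $_[0][3] = 0; $_[0][1][0] }
sub NEXTKEY  { $_[0][1][ ++$_[0][3] ] }
sub CLEAR    { @{ $_[0] } = ( {}, [], [], 0 ) }

package main;
tie my %h, 'OrderedHash';
$h{$_} = 1 for qw(c a b);
print join(' ', keys %h), "\n";    # prints "c a b" - insertion order preserved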
The source will tell you how this functionality is implemented, and measurement will tell you about its performance.

When to use $sth->fetchrow_hashref, $sth->fetchrow_arrayref and $sth->fetchrow_array?

I know that:
$sth->fetchrow_hashref returns a hashref of the fetched row from database,
$sth->fetchrow_arrayref returns an arrayref of the fetched row from database, and
$sth->fetchrow_array returns an array of the fetched row from database.
But I want to know the best practices for these: when should we use fetchrow_hashref, when fetchrow_arrayref, and when fetchrow_array?
When I wrote YAORM for $work, I benchmarked all of these in our environment (MySQL) and found that arrayref performed the same as array, while hashref was much slower. So I agree, it is best to use one of the array forms whenever possible; it helps if your application is structured so it knows which column names it is dealing with. Also, the fewer columns you fetch the better, so avoid SELECT * statements as much as you can; go directly for SELECT <just the field I want>.
But this only applies to enterprise applications. If you are doing something that is not time-critical, go for whichever form presents the data in a format you can most easily work with. Remember, until you start refining your application, efficiency is what is fastest for the programmer, not for the machine. It takes many millions of executions of your application to start saving more time than you spent writing the code.
DBI has to do more work to present the result as a hashref than it does as an arrayref or as an array. If the utmost in efficiency is an issue, you will more likely use the arrayref or array. Whether this is really measurable is perhaps more debatable.
There might be an even more marginal performance difference between the array and the arrayref.
If you will find it easier to refer to the columns by name, then use the hashref; if using numbers is OK, then either of the array notations is fine.
If the first thing you're going to do is return the value from the fetching function, or pass it onto some other function, then the references may be more sensible.
Overall, there isn't any strong reason to use one over the other. The gotcha highlighted by Ed Guiness can be decisive if you are not in charge of the SQL.
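For reference, here is a hedged sketch of the three styles side by side; the DSN, the users table, and its columns are all made up, and it assumes DBD::SQLite is available.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=example.db', '', '', { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT id, name FROM users');

$sth->execute;
while (my $row = $sth->fetchrow_hashref) {        # access columns by name
    print "$row->{id}: $row->{name}\n";
}

$sth->execute;
while (my $row = $sth->fetchrow_arrayref) {       # access columns by position
    print "$row->[0]: $row->[1]\n";
}

$sth->execute;
while (my ($id, $name) = $sth->fetchrow_array) {  # plain list, copied into variables
    print "$id: $name\n";
}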
You could do worse than read DBI recipes by gmax.
It notes, among other things:
The problem arises when your result set, by means of a JOIN, has one or more columns with the same name. In this case, an arrayref will report all the columns without even noticing that a problem was there, while a hashref will lose the additional columns.
In general, I use fetchrow_hashref (I get around the two-columns-with-the-same-name issue by using aliases in the SQL), but I fall back to fetch (AKA fetchrow_arrayref) if I need it to be faster. I believe that fetchrow_array is there for people who don't know how to work with references.
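As a hedged illustration of that alias workaround (the DSN, tables, and column names below are invented):
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=blog.db', '', '', { RaiseError => 1 });

my $sth = $dbh->prepare(q{
    SELECT a.id   AS article_id,
           c.id   AS comment_id,
           c.body AS comment_body
    FROM   articles a
    JOIN   comments c ON c.article_id = a.id
});
$sth->execute;

# Thanks to the aliases, both id columns survive in the hashref.
while (my $row = $sth->fetchrow_hashref) {
    print "$row->{article_id} / $row->{comment_id}: $row->{comment_body}\n";
}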
I don't use any of them since switching all of my DB code to use DBIx::Class.

Optimizing word count

(This is rather hypothetical in nature as of right now, so I don't have too many details to offer.)
I have a flat file of random (English) words, one on each line. I need to write an efficient program to count the number of occurrences of each word. The file is big (perhaps about 1GB), but I have plenty of RAM for everything. It's stored on permanent media, so read speeds are slow, and I need to read through it just once, linearly.
My two off-the-top-of-my-head ideas were to use a hash with words => no. of occurrences, or a trie with the no. of occurrences at the end node. I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
What approach would be best?
I think a trie with the count as the leaves could be faster.
Any decent hash table implementation will require reading the word fully, processing it using a hash function, and finally, a look-up in the table.
A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.
For example, if you've read the characters: "torto", a trie would know that the only possible word that starts this way is tortoise.
If you can perform this inline search on a word faster than the hashing algorithm can hash it, you should be able to be faster overall.
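A hedged sketch of that idea in Perl, using nested hashes as a toy trie with the count stored under a sentinel key at the end of each word (this is only an illustration of the structure, not a tuned implementation):
use strict;
use warnings;

my %trie;
while (my $word = <DATA>) {
    chomp $word;
    my $node = \%trie;
    $node = $node->{$_} //= {} for split //, $word;   # walk/extend one character at a time
    $node->{"\0count"}++;                             # sentinel leaf holds the count
}

# Look up a word's count by walking the same path.
sub count_of {
    my ($word) = @_;
    my $node = \%trie;
    for my $ch (split //, $word) {
        return 0 unless $node = $node->{$ch};
    }
    return $node->{"\0count"} // 0;
}

print count_of("tortoise"), "\n";   # prints 2 for the sample data below

__DATA__
tortoise
torch
tortoise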
However, this is total overkill. I rambled on because you said it was purely hypothetical, and I figured you'd like a hypothetical sort of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.
I'd use a Dictionary object where the key is the word converted to lower case and the value is the count. If the dictionary doesn't contain the word, add it with a value of 1. If it does contain the word, increment the value.
Given slow reading, it's probably not going to make any noticeable difference. The overall time will be completely dominated by the time to read the data anyway, so that's what you should work at optimizing. For the algorithm (mostly data structure, really) in memory, just use whatever happens to be most convenient in the language you find most comfortable.
A hash table is (if done right, and you said you had lots of RAM) O(1) to count a particular word, while a trie is going to be O(n) where n is the length of the word.
With a sufficiently large hash space, you'll get much better performance from a hash table than from a trie.
I think that a trie is overkill for your use case. A hash of word => # of occurrences is exactly what I would use. Even using a slow interpreted language like Perl, you can munge a 1GB file this way in just a few minutes. (I've done this before.)
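A minimal sketch of that hash-of-counts approach in Perl, assuming the words live one per line in a (hypothetical) words.txt:
use strict;
use warnings;

my %count;
open my $fh, '<', 'words.txt' or die "Cannot open words.txt: $!";
while (my $word = <$fh>) {
    chomp $word;
    next unless length $word;
    $count{ lc $word }++;          # fold case so "The" and "the" count together
}
close $fh;

printf "%s: %d\n", $_, $count{$_}
    for sort { $count{$b} <=> $count{$a} } keys %count;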
I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
How many times will this code be run? If you're just doing it once, I'd say optimize for your time rather than your CPU's time, and just do whatever's fastest to implement (within reason). If you have a standard library function that implements a key-value interface, just use that.
If you're doing it many times, then grab a subset (or several subsets) of the data file, and benchmark your options. Without knowing more about your data set, it'd be dubious to recommend one over another.
Use Python!
Add each word to a set data type as you go line by line, before asking whether it is in the dictionary. Once you see that a word is already in the set, add it to the dictionary with a value of 2, since you already saw it once before.
This takes some of the memory and computation away from querying the dictionary every single time, and handles uniquely occurring words better: at the end, just dump all the words from the set that are not in the dictionary, each with a count of 1 (that is, take the words that are in the set but not in the dictionary).
To a large extent, it depends on what you want you want to do with the data once you've captured it. See Why Use a Hash Table over a Trie (Prefix Tree)?
A simple Python script:
import collections

f = open('words.txt')
counts = collections.defaultdict(int)
for line in f:
    counts[line.strip()] += 1
print "\n".join("%s: %d" % (word, count) for (word, count) in counts.iteritems())