How is Tie::IxHash implemented in Perl?

I've recently come upon a situation in Perl where the use of an order-preserving hash would make my code more readable and easier to use. After a bit of searching, I found out about the Tie::IxHash CPAN module, which does exactly what I want. Before I throw caution to the wind and just start using it though, I'd like to get a better idea of how it works and what kind of performance I can expect from it.
From what I know, ordered associative arrays are usually implemented as tries, which I've never actually used before, but do know that their performance falls in line with my expectations (I expect to do a lot of reading and writing, and will need to always remember the order keys were originally inserted). My problem is I can't figure out if this is how Tie::IxHash was made, or what sort of performance I should expect from it, or whether there's some better/cleaner option for me (I'd really rather not keep a separate array and hash to accomplish what I need, as this produces ugly code and space inefficiency). I'm also just curious for curiosity's sake. If it wasn't implemented as a trie, how was it implemented? I know I can wade through the source code, but I'm hoping someone else has already done this, and I'm guessing that I'm not the only person who'll be interested in the answer.
So... Ideas? Suggestions? Advice?

A Tie::IxHash object is implemented in a direct fashion, using the regular Perl building blocks that one would expect. Specifically, such an object is a blessed array reference holding 4 elements (see the sketch after this list):
[0] A hash reference to store the keys of the user's hash. This is used any time the module needs to check for the existence of a key.
[1] An array reference to store the keys of the user's hash, in order.
[2] A parallel array reference to store the values, also in order.
[3] An integer to keep track of the current position within the two parallel arrays. This is needed for iteration.
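For the curious, here is a minimal sketch of peeking at that layout through tied(). It assumes Tie::IxHash is installed; the dumped structure is an implementation detail of the module, not a public API, so don't rely on it in real code.
use strict;
use warnings;
use Tie::IxHash;
use Data::Dumper;

tie my %h, 'Tie::IxHash';
$h{one}   = 1;
$h{two}   = 2;
$h{three} = 3;

print join(' ', keys %h), "\n";   # "one two three" - insertion order preserved

my $obj = tied %h;                # the underlying blessed array reference
print Dumper($obj);               # roughly: [ { key lookup }, [ keys ], [ values ], iterator ]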
Regarding performance, a good benchmark is usually worth more than speculation. My guess is that the biggest performance hit will come with deletion, because the arrays holding the ordered keys and values will require adjustment.
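A minimal sketch of the kind of benchmark that suggestion calls for, comparing bulk deletes against a plain hash. The element counts are arbitrary, and Benchmark's cmpthese is called with a negative count so each case runs for about two CPU seconds:
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Tie::IxHash;

cmpthese(-2, {
    plain => sub {
        my %h = map { $_ => 1 } 1 .. 1000;
        delete $h{$_} for 1 .. 1000;
    },
    ixhash => sub {
        tie my %h, 'Tie::IxHash';
        $h{$_} = 1 for 1 .. 1000;
        delete $h{$_} for 1 .. 1000;
    },
});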

The source will tell you how this functionality is implemented, and measurement will tell you its performance.

Related

Hash sensitivity to data changes

I've seen that many hash algorithms have a common feature: any change in the data produces a total change in the hash code. Although this is so, I would like to know if there is any known standard hash algorithm with a different behaviour, with small changes in the hash for small changes in the data, a kind of near-linear relation between the amount of hash change and the amount of data change.
One idea for doing this is to create a hash by concatenating various hashes calculated from parts of the data; it could use small partial hashes, or a bigger final hash. Either way, I would like to know if there is any algorithm with this behaviour.
I think you're looking for something like Simhash. It's actually meant for finding "near duplicates".
e.g. http://irl.cs.tamu.edu/people/sadhan/papers/cikm2011.pdf
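A rough sketch of the simhash idea in Perl, only to illustrate the property asked about (nearby inputs produce fingerprints that differ in few bits). The 32-bit fingerprint size, whitespace tokenisation and MD5-based per-token hash are arbitrary choices for the example, not part of any particular simhash specification:
use strict;
use warnings;
use Digest::MD5 qw(md5);

sub simhash32 {
    my @tokens = split /\s+/, shift;
    my @v = (0) x 32;                          # one counter per fingerprint bit
    for my $tok (@tokens) {
        my $h = unpack 'N', md5($tok);         # 32-bit hash of the token
        for my $bit (0 .. 31) {
            $v[$bit] += ($h >> $bit & 1) ? 1 : -1;
        }
    }
    my $fp = 0;
    $fp |= (1 << $_) for grep { $v[$_] > 0 } 0 .. 31;
    return $fp;
}

sub hamming { unpack '%32b*', pack 'N', $_[0] ^ $_[1] }   # count differing bits

my $a = simhash32("the quick brown fox jumps over the lazy dog");
my $b = simhash32("the quick brown fox jumped over the lazy dog");
printf "fingerprints differ in %d of 32 bits\n", hamming($a, $b);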

Perfect hashing for OpenCL

I have a set (static, known at compile time) of about 2 million values, 20 bytes each. What I need is a fast O(1) way to check if a given value is in this set. It seems that a perfect hash function with a bit array is ideal for this, but I can't find a simple way to create it. There are some utilities such as gperf, but they are too complicated. Also, in my case it's not necessary to have a close-to-100% load factor; even 10% is enough, but with a guarantee of no collisions. Another requirement for this function is simplicity, without many conditions: it will run on a GPU.
What would you advise for this case?
See my answer here. The problem is a bit different, but the solution could be tailored to suit your needs. The original uses a 100% load factor, but that could easily be changed. It works by shuffling the array in place at startup time (this could be done at compile time, but that would imply compiling generated code).
WRT the hash function: if you don't know anything about the contents of the 20-byte objects, any reasonable hash function (FNV, Jenkins, or mine) will be good enough.
After reading more about perfect hashing, I decided not to implement it and used a cuckoo hash table instead. It's much simpler and requires at most 2 accesses to the table (or any other number greater than 1; the most common choices are 2 to 5), instead of 1 for perfect hashing.
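To illustrate the "at most 2 table accesses" point, here is a rough, CPU-side Perl sketch of the two-slot lookup idea behind a cuckoo hash table. The table size, seeds and FNV-style mixing are made-up choices, the eviction loop that resolves insert collisions is omitted, and it assumes a 64-bit Perl; a GPU version would of course be written in OpenCL C instead.
use strict;
use warnings;

my $SLOTS = 8192;                        # sized for a low load factor
my @SEEDS = (0x9747b28c, 0x85ebca6b);    # two independent hash functions
my @table;                               # each slot holds a key or undef

sub bucket {
    my ($key, $seed) = @_;
    my $h = 2166136261 ^ $seed;          # seeded FNV-1a style mix
    for my $byte (unpack 'C*', $key) {
        $h = (($h ^ $byte) * 16777619) & 0xFFFFFFFF;
    }
    return $h % $SLOTS;
}

sub insert {
    my ($key) = @_;
    for my $seed (@SEEDS) {
        my $slot = bucket($key, $seed);
        if (!defined $table[$slot]) {
            $table[$slot] = $key;
            return 1;
        }
    }
    return 0;                            # real code would evict and re-insert here
}

sub lookup {
    my ($key) = @_;
    for my $seed (@SEEDS) {              # at most two table accesses
        my $slot = bucket($key, $seed);
        return 1 if defined $table[$slot] && $table[$slot] eq $key;
    }
    return 0;
}

insert($_) for qw(alpha beta gamma);
print lookup('beta')  ? "beta found\n"  : "beta missing\n";
print lookup('delta') ? "delta found\n" : "delta missing\n";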

Perl Multi hash vs Single hash

I want to read and process sets of input from a file and then print it out.
There are 3 keys which I need to use to store data.
Assume the 3 keys are k1, k2, k3
Which of the following will give better performance
$hash{k1}->{k2}->{k3} = $val;
or
$hash{"k1,k2,k3"} = $val;
For my previous question I got the answer that all perl hash keys are treated as strings.
Unless you're really dealing with large datasets, use whichever one produces cleaner code. I may be wrong but this reeks of premature optimization.
If it isn't, this may depend on the range of possible keys. If ordering is not an issue, arrange your data so that k1 is the smallest set of keys and k3 is the largest. I suspect you'll use less memory on hashes that way. Depending on your datasets it may even be prudent to presize your hashes (keys(%hash) = 100 does the trick; assigning to keys as an lvalue preallocates buckets).
As to which is faster, only profiling will tell. Try both and see for yourself.
Also, note that $hash{k1}->{k2}->{k3} is unnecessary. You can write $hash{k1}{k2}{k3}. Dereference arrows aren't needed between subscripts, whether square or curly.
Hash lookup speed is independent of the number of items in the hash, so the version which only does one hash lookup will perform the hash lookup portion of the operation faster than the version which does three hash lookups. But, on the other hand, the single-lookup version has to concatenate the three keys into a single string before they can be used as a combined key; if this string is anonymous (e.g., $hash{"$a,$b,$c"}), this will likely involve some fun stuff like memory allocation. Overall, I would expect the concatenation to be quick enough that the one-lookup version would be faster than the three-lookup version in most cases, but the only way to know which is faster in your case would be to write the same code in both styles and Benchmark the difference.
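A minimal sketch of that Benchmark comparison; the keys and values are made up, and the numbers will of course depend on your Perl build and your data:
use strict;
use warnings;
use Benchmark qw(cmpthese);

my (%nested, %flat);
$nested{foo}{bar}{baz} = 42;
$flat{'foo,bar,baz'}   = 42;

my ($k1, $k2, $k3) = qw(foo bar baz);

cmpthese(-2, {
    three_lookups => sub { my $v = $nested{$k1}{$k2}{$k3} },
    one_lookup    => sub { my $v = $flat{"$k1,$k2,$k3"} },
});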
However, like everyone else has already said, this is a premature and worthless micro-optimization. Unless you know that you have a performance problem (or you have historical performance data which shows that a problem is developing and will be upon you in the near future) and you have profiled your code to determine that hash lookups are the cause of your performance problem, you're wasting your time worrying about this. Hash lookups are fast. It's hardly a real benchmark, but:
$ time perl -e '$foo{bar} for 1 .. 1_000_000'
real 0m0.089s
user 0m0.088s
sys 0m0.000s
In this trivial (and, admittedly, highly flawed) example, I got a rate equivalent to roughly 11 million hash lookups per second. In the time you spent asking the question, your computer could have done hundreds of millions, if not billions, of hash lookups.
Write your hash lookups in whatever style is most readable and most maintainable in your application. If you try to optimize this to be as fast as possible, the wasted programmer time will be (many!) orders of magnitude larger than any processing time that you could ever hope to save with the optimizations.
If you have memory concerns, I would suggest using Devel::Size from CPAN in an early phase of development to get the size of both alternatives.
Otherwise, use whichever one seems friendlier to you!
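A minimal sketch of that Devel::Size comparison; the key ranges are invented, so substitute something shaped like your real data:
use strict;
use warnings;
use Devel::Size qw(total_size);

my (%nested, %flat);
for my $k1 (1 .. 100) {
    for my $k2 (1 .. 10) {
        for my $k3 (1 .. 10) {
            $nested{$k1}{$k2}{$k3} = 1;
            $flat{"$k1,$k2,$k3"}   = 1;
        }
    }
}

printf "nested: %d bytes\n", total_size(\%nested);
printf "flat:   %d bytes\n", total_size(\%flat);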

Cost of isEqualToString: vs. Numerical comparisons

I'm working on a project designing a Core Data system for searching and cataloguing images and documents. One of the objects in my data model is a 'key word' object. Every time I add a new key word, I first want to run through all of the existing keywords to make sure it doesn't already exist in the current context.
I've read in posts here, and in a lot of my reading, that string comparison is a far more expensive operation than some other comparison operations. Since I could easily end up having to check many thousands of words before a new addition, I'm wondering if it would be worth using some method that represents the key word strings numerically for the purpose of this check. Possibly breaking down each character in the string into a number formed from the UTF code for each character and then storing that in an ID property for each key word.
I was wondering if anyone else thought any benefit might come from this approach or if anyone else had any better ideas.
What you might find useful is a suitable hash function to convert your text strings into (probably) unique numbers. (You might still have to check for collisions.)
Comparing plain numbers in C code is much faster for several reasons. It avoids the Objective-C runtime dispatch overhead. It requires accessing less total memory. And the executable code for each comparison is usually just an instruction or three, rather than a loop with incrementers and several decision points.

When to use $sth->fetchrow_hashref, $sth->fetchrow_arrayref and $sth->fetchrow_array?

I know that:
$sth->fetchrow_hashref returns a hashref of the fetched row from database,
$sth->fetchrow_arrayref returns an arrayref of the fetched row from database, and
$sth->fetchrow_array returns an array of the fetched row from database.
But I want to know best practices about these. When should we use fetchrow_hashref and when should we use fetchrow_arrayref and when should we use fetchrow_array?
When I wrote YAORM for $work, I benchmarked all of these in our environment (MySQL) and found that arrayref performed the same as array, and hashref was much slower. So I agree that it is best to use the array* forms whenever possible, though it helps, as a bit of sugar, for your application to know which column names it is dealing with. Also, the fewer columns you fetch the better, so avoid SELECT * statements as much as you can - go directly for SELECT <just the fields I want>.
But this only applies to enterprise applications. If you are doing something that is not time-critical, go for whichever form presents the data in a format you can most easily work with. Remember, until you start refining your application, efficiency is what is fastest for the programmer, not for the machine. It takes many millions of executions of your application to start saving more time than you spent writing the code.
DBI has to do more work to present the result as a hashref than it does as an arrayref or as an array. If the utmost in efficiency is an issue, you will more likely use the arrayref or array. Whether this is really measurable is perhaps more debatable.
There might be an even more marginal performance difference between the array and the arrayref.
If you will find it easier to refer to the columns by name, then use the hashref; if using numbers is OK, then either of the array notations is fine.
If the first thing you're going to do is return the value from the fetching function, or pass it onto some other function, then the references may be more sensible.
Overall, there isn't any strong reason to use one over the other. The gotcha highlighted by Ed Guiness can be decisive if you are not in charge of the SQL.
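For reference, a small sketch showing the three fetch styles side by side. The in-memory SQLite database, table and columns are invented for the example and require DBD::SQLite; only the fetch calls themselves come from the question:
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE users (id INTEGER, name TEXT)');
$dbh->do(q{INSERT INTO users VALUES (1, 'alice')});
$dbh->do(q{INSERT INTO users VALUES (2, 'bob')});

my $sth = $dbh->prepare('SELECT id, name FROM users');

$sth->execute;
while (my $row = $sth->fetchrow_hashref) {        # columns by name
    print "$row->{id}: $row->{name}\n";
}

$sth->execute;
while (my $row = $sth->fetchrow_arrayref) {       # columns by position
    print "$row->[0]: $row->[1]\n";
}

$sth->execute;
while (my ($id, $name) = $sth->fetchrow_array) {  # plain list
    print "$id: $name\n";
}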
You could do worse than read DBI recipes by gmax.
It notes, among other things:
The problem arises when your result set, by mean of a JOIN, has one or more columns with the same name. In this case, an arrayref will report all the columns without even noticing that a problem was there, while a hashref will lose the additional columns.
In general, I use fetchrow_hashref (I get around the two-columns-with-the-same-name issue by using aliases in the SQL), but I fall back to fetch (a.k.a. fetchrow_arrayref) if I need it to be faster. I believe that fetchrow_array is there for people who don't know how to work with references.
I don't use any of them since switching all of my DB code to use DBIx::Class.