Perfect hashing for OpenCL

I have a set (static, known at compile time) of about 2 million values, 20 bytes each. What I need is a fast O(1) way to check whether a given value is in this set. A perfect hash function with a bit array seems ideal for this, but I can't find a simple way to create one. There are utilities such as gperf, but they are too complicated. Also, in my case it's not necessary to have a load factor close to 100%; even 10% is enough, as long as there is a guarantee of no collisions. Another requirement for this function is simplicity, with few branches: it will run on a GPU.
What would you advise for this case?

See my answer here. The problem is a bit different, but the solution could be tailored to suit your needs. The original uses a 100% load factor, but that could easily be changed. It works by shuffling the array in place at startup time (this could be done at compile time, but that would imply compiling generated code).
Regarding the hash function: if you don't know anything about the contents of the 20-byte objects, any reasonable hash function (FNV, Jenkins, or mine) will be good enough.

After reading more about perfect hashing, I decided not to try implementing it and used a cuckoo hash table instead. It's much simpler and requires at most 2 accesses to the table (or any other number > 1; the most common choices are 2 to 5) instead of the 1 access a perfect hash would need.
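A cuckoo lookup is simple enough to sketch directly. Below is a minimal illustrative version in Python (the class and hash-function choices are my own invention, not code from this answer); the `contains` path shows the at-most-two-accesses property that makes the scheme GPU-friendly:

```python
# Illustrative cuckoo hash set with two tables and two hash functions.
# All names and constants here are assumptions for the sketch.
class CuckooSet:
    def __init__(self, capacity):
        self.capacity = capacity
        self.t1 = [None] * capacity
        self.t2 = [None] * capacity

    def _h1(self, key):
        return hash(key) % self.capacity

    def _h2(self, key):
        # A second, independent-ish hash; mixing with a constant for the sketch.
        return hash((key, 0x9E3779B9)) % self.capacity

    def contains(self, key):
        # At most two table accesses, with no loops: this is the GPU-friendly part.
        return self.t1[self._h1(key)] == key or self.t2[self._h2(key)] == key

    def add(self, key):
        # Evict-and-reinsert loop; a real implementation would rebuild
        # with new hash functions (or grow) after too many displacements.
        for _ in range(32):
            i = self._h1(key)
            if self.t1[i] is None or self.t1[i] == key:
                self.t1[i] = key
                return True
            key, self.t1[i] = self.t1[i], key   # evict the occupant of t1
            j = self._h2(key)
            if self.t2[j] is None or self.t2[j] == key:
                self.t2[j] = key
                return True
            key, self.t2[j] = self.t2[j], key   # evict the occupant of t2
        raise RuntimeError("possible cycle; rebuild with new hash functions")
```

Insertion does the evicting at build time (on the host), so the lookup kernel never needs it; at a low load factor like the 10% mentioned in the question, insertions almost never cycle.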

Related

Do cats and scalaz create performance overhead on application?

I know it is a bit of a nonsense question, but due to my limited programming knowledge it came to mind.
Cats and Scalaz are used so that we can write Scala in a pure functional style, similar to Haskell. But to achieve this we need to add those libraries as dependencies to our projects, and to use them we need to wrap our code in their objects and functions. That means extra code and extra dependencies.
I don't know whether these create larger objects in memory, and this is making me wonder. So my question: will I face any performance issues, such as higher memory consumption, if I use Cats/Scalaz?
Or should I avoid them if my application needs performance?
Do cats and scalaz create performance overhead on application?
Absolutely.
The same way any line of code adds performance overhead.
So, if that is your concern, then don't write any code at all (well, actually the world might be simpler if we had never tried all this).
Now, snark aside. The proper question you should be asking is: "Is the overhead of library X harmful to my software?"; remember this applies to any library, and actually to any code you write, any algorithm you pick, etc.
And, in order to answer that question, we need some things before.
Define the SLAs that the software you are writing must hold. Without those, any performance question or observation you make is pointless. It doesn't matter if something is faster or slower if you don't know whether that is meaningful for you and your clients.
Once you have SLAs you need to perform stress tests to verify if your current version of the software satisfies those. Because, if your current code is performant enough, then you should worry about other things like maintainability, testing, adding more features, etc.
PS: Remember that those SLAs should not be raw numbers but be expressed in terms of percentiles, the same goes for the results of the tests.
When you find that you are failing your SLAs, you need to do proper benchmarking and debugging to identify the bottlenecks of your project. As you saw, caring about performance must be done on each line of code, but that is a lot of work that usually doesn't produce any relevant output. Thus, instead of evaluating the performance of everything, we find the bottlenecks first: those small pieces of code that have the biggest contribution to the overall performance of your software (remember the Pareto principle).
Remember that in this step we have to look at the whole system; the network matters too (and you will see that it is usually the biggest slowdown; thus, you would usually rather search for architectural solutions, like using fibers instead of threads, than try to optimize small functions. Also, sometimes the easier and cheaper solution is better infrastructure).
When you find the bottleneck, you need to formulate some alternatives, implement them, and not only benchmark them but do statistical hypothesis testing to validate whether the proposed changes are worth it. And, of course, validate whether they were enough to satisfy the SLAs.
Thus, as you can see, performance is an art and a lot of work. So, unless you are committed to doing all this then stop worrying about something you will not measure and optimize properly.
Rather, focus on increasing the maintainability of your code. This actually also helps performance, because when you find that you need to change something you would be grateful that the code is as clean as possible and that the whole architecture of the code allows for an easy change.
And, believe me when I say that using tools like cats, cats-effect, fs2, etc. will help with that. Also, they are actually pretty optimized at their core, so you should be fine for a lot of use cases.
Now, the big exception is if you know that the work you are doing will be very CPU- and memory-bound; then yeah, you can pretty much be sure all those abstractions will be harmful. In those cases, you may even want to stay away from the JVM and instead write fairly low-level code in a language like Rust, which will give you proper tools for that kind of problem and still be far safer than plain old C.

Ensuring a hash function is well-mixed with slicing

Forgive me if this question is silly, but I'm starting to learn about consistent hashing. After reading Tom White's blog post on it here and realizing that most default hash functions are NOT well mixed, I had a thought on ensuring that an arbitrary hash function is at least minimally well mixed.
My thought is best explained using an example like this:
Bucket 1: 11000110
Bucket 2: 11001110
Bucket 3: 11010110
Bucket 4: 11011110
Under a standard hash ring implementation for consistent caching across these buckets, you would get terrible performance, and nearly every entry would be lumped into Bucket 1. However, if we use bits 4 and 5 as the MSBs in each case, then these buckets are suddenly excellently mixed, and assigning a new object to a cache becomes trivial and only requires examining 2 bits.
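To make the idea concrete, here is a tiny Python sketch (the bit positions are specific to this 8-bit example): the four bucket ids differ only in two bit positions, so shifting and masking those bits yields a perfectly uniform 2-bit bucket index.

```python
# The four bucket ids from the example above.
buckets = [0b11000110, 0b11001110, 0b11010110, 0b11011110]

def bucket_index(h):
    # Bits 4 and 5 (counting from the most significant of 8) are the only
    # ones that vary; shift them down to get a value in 0..3.
    return (h >> 3) & 0b11
```

Applied to the four ids in order, this yields the indices 0, 1, 2, 3, i.e. a perfectly even split.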
In my mind this concept could easily be extended to building distributed networks across multiple nodes. In my particular case I would be using this to determine which cache to place a given piece of data into. The increased placement speed isn't a real concern, but ensuring that my caches are well mixed is, and I was considering just choosing a few bits that are optimally mixed for my given caches. Anything indexed later would be indexed on the basis of the same bits.
In my naive mind this is a much simpler solution than introducing virtual nodes or building a better hash function. That said, I can't find any mention of an approach like this, and I'm concerned that in my hashing ignorance I'm doing something wrong and might be introducing unintended consequences.
Is this approach safe? Should I use it? Has this approach been used before and are there any established algorithms for determining the minimum unique group of bits?

Merging huge sets (HashSet) in Scala

I have two huge (as in millions of entries) sets (HashSet) that have some (<10%) overlap between them. I need to merge them into one set (I don't care about maintaining the original sets).
Currently, I am adding all items of one set to the other with:
setOne ++= setTwo
This takes several minutes to complete (after several attempts at tweaking hashCode() on the members).
Any ideas how to speed things up?
You can get slightly better performance with the Parallel Collections API in Scala 2.9.0+:
setOne.par ++ setTwo
or
(setOne.par /: setTwo)(_ + _)
There are a few things you might want to try:
Use the sizeHint method to keep your sets at the expected size.
Call useSizeMap(true) on it to get better hash table resizing.
It seems to me that the latter option gives better results, though both show improvements on tests here.
Can you tell me a little more about the data inside the sets? The reason I ask is that for this kind of thing, you usually want something a bit specialized. Here are a few things that can be done:
If the data is (or can be) sorted, you can walk pointers to do a merge, similar to what's done using merge sort. This operation is pretty trivially parallelizable since you can partition one data set and then partition the second data set using binary search to find the correct boundary.
If the data is within a certain numeric range, you can instead use a bitset and just set bits whenever you encounter that number.
If one of the data sets is smaller than the other, you could put it in a hash set and loop over the other dataset quickly, checking for containment.
I have used the first strategy to create a gigantic set of about 8 million integers from about 40k smaller sets in about a second (on beefy hardware, in Scala).
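The first strategy is straightforward to sketch (shown in Python for brevity; a Scala version would be structurally identical). It is the merge step of merge sort, adapted to emit duplicates only once:

```python
# Merge two sorted, deduplicated sequences with two walking pointers.
def merge_sorted_sets(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i]); i += 1
        elif a[i] > b[j]:
            out.append(b[j]); j += 1
        else:                       # element present in both sets: emit once
            out.append(a[i]); i += 1; j += 1
    out.extend(a[i:])               # one side is exhausted; append the rest
    out.extend(b[j:])
    return out
```

Each comparison advances at least one pointer, so the merge is O(|a| + |b|), and the binary-search partitioning described above lets independent chunks be merged in parallel.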

Perl Multi hash vs Single hash

I want to read and process sets of input from a file and then print them out.
There are 3 keys which I need to use to store data.
Assume the 3 keys are k1, k2, k3
Which of the following will give better performance
$hash{k1}->{k2}->{k3} = $val;
or
$hash{"k1,k2,k3"} = $val;
For my previous question I got the answer that all perl hash keys are treated as strings.
Unless you're really dealing with large datasets, use whichever one produces cleaner code. I may be wrong, but this reeks of premature optimization.
If it isn't, this may depend on the range of possible keys. If ordering is not an issue, arrange your data so that k1 is the smallest set of keys and k3 is the largest. I suspect you'll use less memory on hashes that way. Depending on your datasets, it may even be prudent to presize your hashes (I believe keys(%hash) = 100 does the trick).
As to which is faster, only profiling will tell. Try both and see for yourself.
Also, note that the arrows in $hash{k1}->{k2}->{k3} are unnecessary. You can write $hash{k1}{k2}{k3}; the arrow isn't needed between adjacent subscripts, whether square or curly.
Hash lookup speed is independent of the number of items in the hash, so the version which only does one hash lookup will perform the hash lookup portion of the operation faster than the version which does three hash lookups. But, on the other hand, the single-lookup version has to concatenate the three keys into a single string before they can be used as a combined key; if this string is anonymous (e.g., $hash{"$a,$b,$c"}), this will likely involve some fun stuff like memory allocation. Overall, I would expect the concatenation to be quick enough that the one-lookup version would be faster than the three-lookup version in most cases, but the only way to know which is faster in your case would be to write the same code in both styles and Benchmark the difference.
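To make the two layouts concrete, here is the same pair of structures sketched with Python dictionaries (the trade-off is language-independent): the nested form costs three lookups, the flat form one lookup plus a key concatenation.

```python
# Nested mapping: three hash lookups to reach the value.
nested = {}
nested.setdefault("k1", {}).setdefault("k2", {})["k3"] = "val"

# Flat mapping: one lookup, but the combined key must be built first.
flat = {}
flat["k1,k2,k3"] = "val"

assert nested["k1"]["k2"]["k3"] == flat["k1,k2,k3"]
```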
However, like everyone else has already said, this is a premature and worthless micro-optimization. Unless you know that you have a performance problem (or you have historical performance data which shows that a problem is developing and will be upon you in the near future) and you have profiled your code to determine that hash lookups are the cause of your performance problem, you're wasting your time worrying about this. Hash lookups are fast. It's hardly a real benchmark, but:
$ time perl -e '$foo{bar} for 1 .. 1_000_000'
real 0m0.089s
user 0m0.088s
sys 0m0.000s
In this trivial (and, admittedly, highly flawed) example, I got a rate equivalent to roughly 11 million hash lookups per second. In the time you spent asking the question, your computer could have done hundreds of millions, if not billions, of hash lookups.
Write your hash lookups in whatever style is most readable and most maintainable in your application. If you try to optimize this to be as fast as possible, the wasted programmer time will be (many!) orders of magnitude larger than any processing time that you could ever hope to save with the optimizations.
If you have memory concerns, I would suggest using Devel::Size from CPAN in an early phase of development to get the size of both alternatives.
Otherwise, use whichever seems friendlier to you!

How is Tie::IxHash implemented in Perl?

I've recently come upon a situation in Perl where the use of an order-preserving hash would make my code more readable and easier to use. After a bit of searching, I found out about the Tie::IxHash CPAN module, which does exactly what I want. Before I throw caution to the wind and just start using it though, I'd like to get a better idea of how it works and what kind of performance I can expect from it.
From what I know, ordered associative arrays are usually implemented as tries, which I've never actually used, but I do know that their performance falls in line with my expectations (I expect to do a lot of reading and writing, and will always need to remember the order in which keys were originally inserted). My problem is that I can't figure out whether this is how Tie::IxHash was built, what sort of performance I should expect from it, or whether there's some better/cleaner option (I'd really rather not keep a separate array and hash to accomplish this, as that produces ugly code and is space-inefficient). I'm also just curious for curiosity's sake. If it wasn't implemented as a trie, how was it implemented? I know I could wade through the source code, but I'm hoping someone else has already done this, and I'm guessing I'm not the only one who'll be interested in the answer.
So... Ideas? Suggestions? Advice?
A Tie::IxHash object is implemented in a direct fashion, using the regular Perl building blocks that one would expect. Specifically, such an object is a blessed array reference holding 4 elements.
[0] A hash reference to store the keys of the user's hash. This is used any time the module needs to check for the existence of a key.
[1] An array reference to store the keys of the user's hash, in order.
[2] A parallel array reference to store the values, also in order.
[3] An integer to keep track of the current position within the two parallel arrays. This is needed for iteration.
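That layout is easy to illustrate with a small Python sketch of the same idea (not the module's actual code; the method names are invented):

```python
# A dict for key lookup plus parallel key/value lists preserving insertion order.
class IxHash:
    def __init__(self):
        self.index = {}     # key -> position in the parallel lists
        self.keys = []      # keys, in insertion order
        self.values = []    # values, parallel to self.keys

    def store(self, key, value):
        if key in self.index:
            self.values[self.index[key]] = value   # overwrite keeps position
        else:
            self.index[key] = len(self.keys)
            self.keys.append(key)
            self.values.append(value)

    def fetch(self, key):
        return self.values[self.index[key]]

    def delete(self, key):
        # O(n): every later entry shifts, and its stored index must be patched.
        pos = self.index.pop(key)
        self.keys.pop(pos)
        self.values.pop(pos)
        for k in self.keys[pos:]:
            self.index[k] -= 1
```

Note how `delete` has to shift every later entry and adjust its stored position, which is why deletion is the costly operation in this design.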
Regarding performance, a good benchmark is usually worth more than speculation. My guess is that the biggest performance hit will come with deletion, because the arrays holding the ordered keys and values will require adjustment.
The source will tell you how this functionality is implemented, and measurement will tell you its performance.