improve hashing using genetic programming/algorithm - hash

I'm writing a program which can significantly lessen the number of collisions that occur while using hash functions like 'key mod table_size'. For this I would like to use genetic programming or a genetic algorithm, but I don't know much about it. Even after reading many articles and examples, I still don't know what, in my case (i.e. the program described above), the fitness function would be, what the target would be (the target is usually the required result), what would serve as the population/individuals and parents, and so on.
Please help me identify the above, with a few code/pseudo-code snippets if possible, as this is my project.
It's not necessary to use genetic programming/algorithms; anything using evolutionary programming/algorithms will do.
Thanks.

My advice would be: don't do this that way. The literature on hash functions is vast and we more or less understand what makes a good hash function. We know enough mathematics not to look for them blindly.
If you need a hash function to use, there is plenty to choose from.
However, if this is your uni project and you cannot possibly change the subject or steer it in a more manageable direction, then, as you noticed, there will be complex issues in getting the fitness function and mutation operators right. As far as I can tell off the top of my head, there are no obvious candidates.
You may look up e.g. the 'strict avalanche criterion' and try to see if you can reason about it in terms of fitness and mutations.
Another question is how you want to represent your function. Just a boolean expression? Something built from word operations like AND, XOR, NOT and ROT?
Depending on your constraints (or rather, assumptions) the question of fitness and mutation will be different.
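If you do go down the avalanche route, here is a minimal sketch of how a candidate could be scored against it, assuming the candidate is a 32-bit integer-to-integer function (the multiplicative hash at the end is just an example candidate, not something from the question):

import random

def avalanche_score(hash_fn, bits=32, trials=1000):
    # Strict avalanche criterion: flipping one input bit should flip each
    # output bit with probability 1/2, so the ideal score here is 0.5.
    total = 0
    for _ in range(trials):
        x = random.getrandbits(bits)
        y = x ^ (1 << random.randrange(bits))
        total += bin(hash_fn(x) ^ hash_fn(y)).count("1")
    return total / (trials * bits)

# Example candidate: a Knuth-style multiplicative hash truncated to 32 bits.
print(avalanche_score(lambda x: (x * 2654435761) & 0xFFFFFFFF))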

Broadly, the fitness is clearly to minimize the number of collisions in your 'hash modulo table-size' model.
The obvious part is to take a suitably large and (very important) representative distribution of keys and chuck them through your 'candidate' function.
Then you might pass them through 'hash modulo table-size' for one or more values of table-size and evaluate some measure of 'niceness' of the arising distribution(s).
So what that boils down to is what table-sizes to try and what niceness measure to apply.
Niceness is context dependent.
You might measure 'fullest bucket' as a measure of 'worst case' insert/find time.
You might measure the sum of squares of the bucket sizes as a measure of 'average' insert/find time, based on a uniform distribution of look-ups amongst the keys.
Finally you would need to decide what table-size (or sizes) to test at.
Conventional wisdom often uses primes, because hash modulo a prime tends to be nicely volatile to all the bits in the hash, whereas something like hash modulo 2^n only involves the lower n bits.
To keep computation down you might consider the series of the next prime larger than each power of two: 5 (>2^2), 11 (>2^3), 17 (>2^4), etc., up to and including the first power of two greater than your 'sample' size.
There are other ways of considering fitness but without a practical application the question is (of course) ill-defined.
If the functions in your 'space' of potential hash functions don't all have the same execution time, you should also factor in 'cost'.
It's fairly easy to define very good hash functions but execution time can be a significant factor.
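Putting the bucket-size measure above into code, one possible fitness function might look like the sketch below (it ignores the execution-time cost just mentioned; the table sizes are simply the next primes above a few powers of two, and candidate_hash stands for whatever representation your GA evolves):

from collections import Counter

def fitness(candidate_hash, keys, table_sizes=(5, 11, 17, 37, 67, 131)):
    # Lower is better: the average (over table sizes) of the sum of squared
    # bucket sizes, normalised by the number of keys. A perfectly uniform
    # spread minimises this; a few overfull buckets inflate it quickly.
    score = 0.0
    for m in table_sizes:
        buckets = Counter(candidate_hash(k) % m for k in keys)
        score += sum(c * c for c in buckets.values()) / len(keys)
    return score / len(table_sizes)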

Related

How do you choose an optimal PlainModulus in SEAL?

I am currently learning how to use SEAL. In the parameters for the BFV scheme there is a helper function for choosing the PolyModulus and CoeffModulus, but nothing similar is provided for choosing the PlainModulus, other than that it should be either a prime or a power of 2. Is there any way to know which value is optimal?
In the given example the PlainModulus was set to parms.PlainModulus = new SmallModulus(256); Is there any special reason for choosing the value 256?
In BFV, the plain_modulus basically determines the size of your data type, just like in normal programming when you use 32-bit or 64-bit integers. When using BatchEncoder the data type applies to each slot in the plaintext vectors.
How you choose plain_modulus matters a lot: the noise budget consumption in multiplications is proportional to log(plain_modulus), so there are good reasons to keep it as small as possible. On the other hand, you'll need to ensure that you don't get into overflow situations during your computations, where your encrypted numbers exceed plain_modulus, unless you specifically only care about correctness of the results modulo plain_modulus.
In almost all real use-cases of BFV you will want to use BatchEncoder so as not to waste plaintext/ciphertext polynomial space, and batching requires plain_modulus to be a prime (specifically, one congruent to 1 modulo 2 * poly_modulus_degree). Therefore, you'll probably want it to be a prime, except in some toy examples.
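To make the overflow point concrete, here is a plain-Python illustration (no SEAL involved) of what happens when the numbers in a computation exceed plain_modulus:

plain_modulus = 256              # the value from the example above
a, b = 20, 15
print((a * b) % plain_modulus)   # prints 44, not 300: the result has wrapped around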

compare the way of resolving collisions in hash tables

How can I compare methods of collision resolution (i.e. linear probing, quadratic probing and double hashing) in hash tables? What data would best show the differences between them? Maybe someone has seen such comparisons.
There is no simple approach that's also universally meaningful.
That said, a good approach if you're tuning an actual app is to instrument (collect stats for) the hash table implementation you're using in the actual application of interest, with the real data it processes, and for whichever functions are of interest (insert, erase, find, etc.). When those functions are called, record whatever you want to know about the collisions that happen: depending on how thorough you want to be, that might include the number of collisions before the element was inserted or found, the number of CPU/memory cache lines touched during that probing, the elapsed CPU or wall-clock time, etc.
If you want a more general impression, instrument an implementation and throw large quantities of random data at it - but be aware that the real-world applicability of whatever conclusions you draw may only be as good as the random data is similar to the real-world data.
There are also other, more subtle implications to the choice of collision-handling mechanism: linear probing allows an implementation to clean up "tombstone" buckets where deleted elements exist, which takes time but speeds later performance, so the mix of deletions amongst other operations can affect the stats you collect.
At the other extreme, you could try a mathematical comparison of the properties of different collision handling - that's way beyond what I'm able or interested in covering here.
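As a rough sketch of the 'instrument and count' idea, here is a toy open-addressing table that records the average number of probes per insert for linear, quadratic and double hashing; the probe formulas, table size and load factor are illustrative, not tuned:

import random

def probes_per_insert(keys, table_size, scheme):
    table = [None] * table_size
    counts = []
    for k in keys:
        h2 = 1 + (k % (table_size - 1))          # step size for double hashing
        for i in range(table_size):
            if scheme == "linear":
                idx = (k + i) % table_size
            elif scheme == "quadratic":
                idx = (k + i * i) % table_size
            else:                                 # "double"
                idx = (k + i * h2) % table_size
            if table[idx] is None:
                table[idx] = k
                counts.append(i + 1)
                break
    return sum(counts) / len(counts)

keys = random.sample(range(10**6), 500)           # ~48% load in a 1031-slot table
for scheme in ("linear", "quadratic", "double"):
    print(scheme, probes_per_insert(keys, 1031, scheme))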

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that will enable an indexed look-up of a function to be determined by a high frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
case(2) : func1();
case(3) : func2();
case(8) : func3();
case(33-122) : func4();
...
case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time) looking for shifts and masks that can be applied to the input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible (which is why Bloom filters looked interesting at first, before we started thinking that their implementation required more computational complexity than other solutions).
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in 1-1 fashion. In your case, a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range whose size is a small multiple of N. GNU has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try. Just add a length byte so that integers are 5-byte strings and ranges are 9-byte strings. There are some formal references on the Wikipedia page. A literature search in ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position so that values with similar magnitude are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
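For reference, a sketch of the unweighted version of that search, using sorted lower bounds instead of generated if/else code (the entries mirror the pseudo-code switch from the question, single values being ranges of size 1, and -1 standing for 'uninteresting'):

import bisect

# (lower_bound, upper_bound, function_index)
ranges = [(2, 2, 0), (3, 3, 1), (8, 8, 2), (33, 122, 3), (10000, 10000, 39)]
lows = [lo for lo, _, _ in ranges]

def lookup(x):
    i = bisect.bisect_right(lows, x) - 1
    if i >= 0 and ranges[i][0] <= x <= ranges[i][1]:
        return ranges[i][2]
    return -1                     # not one of the interesting values/ranges

print(lookup(50), lookup(7))      # prints: 3 -1

An optimal binary search tree would additionally weight the comparisons by how often each value or range actually occurs in the stream.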
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'd end up with something much slower to evaluate than the optimal search would be.
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.
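A sketch of that two-stage idea, assuming the single values go into an ordinary hash map and only the true ranges are left for a sorted-bounds search (the values mirror the question's example):

import bisect

singles = {2: 0, 3: 1, 8: 2, 10000: 39}       # exact values -> function index
range_entries = [(33, 122, 3)]                # only the true ranges remain here
range_lows = [lo for lo, _, _ in range_entries]

def lookup(x):
    hit = singles.get(x)
    if hit is not None:
        return hit
    i = bisect.bisect_right(range_lows, x) - 1
    if i >= 0 and range_entries[i][0] <= x <= range_entries[i][1]:
        return range_entries[i][2]
    return -1                                 # uninteresting value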

Can a cryptographic hash algorithm be used as a PRNG?

Can MD5/SHA256/SHA512, etc., be used as a PRNG? E.g., given an integer seed, is the pseudo-code:
random_number = truncate_to_desired_range(
    sha512( seed.toString() + ',' + i.toString() )
)
…a decent PRNG? (i is an increasing integer, e.g., the outputs are:
convert(sha512("<seed>,0"))
convert(sha512("<seed>,1"))
convert(sha512("<seed>,2"))
convert(sha512("<seed>,3"))
…
"Decent", in the context of this question, refers only to the distribution of the output: is the output of cryptographic hash functions uniform, when used this way? (Though I suppose it would depend on the hash function, all cryptographic hashes should also have uniform output, right?)
Note: I will concede that this is going to be a slow PRNG, compared to say Mersenne-Twister, due to the use of a cryptographic hash. I'm not interested in speed, and I'm not interested in the result being secure — just that the distribution is correct.
In my particular use case, I'm looking for something similar to XKCD's geohashing, in that it is easily implemented by distributed parties, who will all arrive at the same answer. Mersenne-Twister could be substituted, but it is less available in many target languages. (Some languages lack it entirely, some lack access to its raw U32 output, etc. SHA512 is either built in, or easily available.)
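For reference, a minimal Python version of this construction (the modulo at the end is the simplest range reduction; it introduces a tiny bias unless the range size divides 2^512):

import hashlib

def hash_random(seed, i, n):
    # Pseudo-random integer in [0, n), derived deterministically from (seed, i).
    digest = hashlib.sha512(f"{seed},{i}".encode()).digest()
    return int.from_bytes(digest, "big") % n

print([hash_random("<seed>", i, 100) for i in range(5)])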
Assuming the cryptographic hash function meets its design goals, the output will be, for all practical purposes, uniformly distributed over its period, as every input to the hash function is unique by design.
One of the goals of a hash function is to approximate a random oracle, that is, for any two distinct inputs A and B, the outputs H(A) and H(B) should (for a true random oracle) be uncorrelated. Hash functions get pretty close to that, but of course weaknesses creep in with time and cryptanalysis.
That said, cryptographic primitives are essentially the best mathematical algorithms we have available when it comes to quality, therefore it is safe to say that if they cannot solve your problem, nothing will.
It can be made to work (with good-sized inputs, padding, etc., as mentioned in other answers/comments) and will provide reasonably good results, but it's going to be slow, so don't do it if you are doing simulations or anything else that requires heavy PRNG use.

Hash function combining - is there a significant decrease in collision risk?

Does anyone know if there's a real benefit regarding decreasing collision probability by combining hash functions? I especially need to know this regarding 32 bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting in the wonderful world of hashing, doing a lot of reading about it. Sorry if I asked a silly question, I haven't even acquired the proper "hash dialect" yet, probably my Google searches regarding this were also poorly formed.
Thanks.
Combining them in series like that doesn't make sense: you are hashing one 32-bit space into another 32-bit space.
In the case of a crc32 collision in the first step, the final result is still a collision; then you add on any potential collisions introduced in the adler32 step. So it cannot get any better, and can only be the same or worse.
To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:
adler32(data) << 32 | crc32(data)
Whether there is significant benefit in doing that, I'm not sure.
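In Python, that independent combination looks something like the sketch below (zlib provides both checksums):

import zlib

def combined64(data: bytes) -> int:
    # Concatenate the two independent 32-bit values into one 64-bit hash,
    # rather than feeding one hash's output into the other.
    return (zlib.adler32(data) << 32) | zlib.crc32(data)

print(hex(combined64(b"http://example.com/some/url")))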
Note that the original comment you referred to was storing the hashes independently:
Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Alder32 for each url, the odds of a simultaneous collision for both hashes for any given pair of urls is vastly reduced.
Of course that means storing twice as much information which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.