Hash functions multiplying key with constant - hash

How does a constant before the key in the formula:
h(k) = (const * key) % m,
affect the distribution of the hash values in the table?
Are there any rules on how to choose such a constant to minimize collisions and get an even distribution of the keys in the hash table?

The constant factor should be prime, and if I remember correctly it should be relatively prime w.r.t. the modulus. This is all discussed at great length in Knuth Volume III.
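To make the "relatively prime" condition concrete, here is a minimal sketch (the helper name slots_used and the table size 12 are illustrative assumptions, not from the question). It counts how many distinct slots h(k) = (const * key) % m can reach for consecutive integer keys. When gcd(const, m) = g, only the multiples of g are reachable, so only m / g slots are ever used.

```python
from math import gcd

def slots_used(const, m, keys):
    """Count distinct values of (const * key) % m over the given keys."""
    return len({(const * k) % m for k in keys})

m = 12  # hypothetical table size
for const in (3, 4, 5, 7):
    used = slots_used(const, m, range(1000))
    # Only multiples of gcd(const, m) are reachable, i.e. m // gcd slots.
    print(const, gcd(const, m), used, m // gcd(const, m))
```

Note that const = 3 is prime but shares a factor with m = 12, so it reaches only 4 of the 12 slots. Being coprime to the modulus is the property that matters here.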

Related

Convert 32 bit uniform distribution to uniform distribution on any int

Given a discrete uniform distribution D~U([0:2^N-1]), from which a sample yields a number in the inclusive integer range [0, 2^N-1] for an integer N, I need a function convert such that, for a sample d~D, convert(d, m) is uniformly distributed over the inclusive integer range [0, m], i.e. Dc~U([0:m]).
Thoughts:
If the distribution is continuous, this is easy. Just cut off the infinite representation of the number, and the uniformity is preserved.
I can't think of a way to do this for all numbers and keep uniformity.
I could re-roll for tie conditions, but am not able to formulate an algorithm.
What I eventually want is a murmur hash on a custom range (m), rather than on exact 32-bit numbers.
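One way to formulate the re-roll idea is rejection sampling. The sketch below is not a definitive answer: the names convert and uniform_0_to_m are illustrative, and it assumes m + 1 <= 2^N. It rejects the few samples at the top of the range that would make some outputs more likely than others, then reduces the rest modulo m + 1.

```python
import random

def convert(d, n_bits, m):
    """Map a sample d from [0, 2^n_bits - 1] to [0, m], or None to re-roll."""
    span = m + 1
    total = 1 << n_bits
    limit = total - (total % span)  # largest multiple of span that is <= 2^N
    if d >= limit:
        return None  # d falls in the uneven tail: reject it
    return d % span

def uniform_0_to_m(m, n_bits=32, source=random.getrandbits):
    """Draw samples until one is accepted; each output in [0, m] is equally likely."""
    while True:
        r = convert(source(n_bits), n_bits, m)
        if r is not None:
            return r
```

For a hash rather than a random source, the re-roll would have to be replaced by something deterministic, such as hashing again with a different seed. That is an assumption on top of the question, not something it specifies.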

MATLAB: Generate N pseudo-random numbers with a Poisson distribution having mean M and total T, where N, M, and T are user defined

I’d like to be able to generate in MATLAB a sequence of N pseudo-random numbers with a Poisson distribution having mean M. The sum of the N numbers should be T. N, M, and T are always positive or zero and would be user-specified parameters to any function.
Obviously, if T is small relative to N, it is likely that there will be problems achieving a total of T. In that case the function could just return the value T followed by N-1 zeros, or an error code. However, it is highly likely that in most cases T>>N.
I have been trying variations based on the method of generating random numbers with a given distribution provided at http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution and trying various normalizations at each step but have not been successful.
You could try to approximate what you want by using a multinomial distribution.
In Wikipedia's notation, k=N, n=T and p_i=M/T. The Poisson distribution has the distinctive property that its mean equals its variance, but if your parameters are such that p_i is small, then the mean n*p_i is quite close to the variance n*p_i*(1-p_i). The sum is automatically (by a property of the multinomial) equal to T.
Multinomial sampling in MATLAB is done using the mnrnd function.
UPDATE
Regarding the comment, let's consider N sampled values v_i and write their sum:
Sum(i=1...N) v_i = T
Let's compute the mean value of the left and right sides of this equation:
Sum(i=1...N) E(v_i) = E(T) = T
On the right side, the mean value of a constant is the constant itself. On the left side, each v_i has mean M, so we have
Sum(i=1...N) E(v_i) = Sum(i=1...N) M = N*M = T
Therefore, M=T/N and p_i=M/T=1/N.
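As a rough illustration of the multinomial approach in Python rather than MATLAB, the sketch below uses only the standard library. The function name fixed_sum_counts is hypothetical. Dropping T items independently into N equally likely bins gives a multinomial sample with n=T and equal probabilities 1/N, so the counts always sum to T and each count has mean T/N.

```python
import random

def fixed_sum_counts(n, total, rng=random):
    """Multinomial sample: `total` items thrown into `n` equally likely bins."""
    counts = [0] * n
    for _ in range(total):
        counts[rng.randrange(n)] += 1  # each bin is chosen with probability 1/n
    return counts

counts = fixed_sum_counts(10, 1000, random.Random(0))
print(sum(counts))  # always equals the requested total
```

The counts are only approximately Poisson. The approximation is better when 1/N is small, as the answer notes.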

Does a linear cryptographic hash function exist?

Does a linear cryptographic hash function exist?
By linear I mean a function 'f' such that:
f(a + b) = f(a) + f(b)
where + is mod n for some large constant n
Yes, the cryptographically strong SWIFFT algorithm (a variant was a contender for the SHA-3 standard) is linear, in the sense that h(a + b) = h(a) + h(b).
It is an interesting example of a hash that is both cryptographically strong and not pseudorandom. It is also another unexpected use of the much-lauded FFT algorithm.
http://en.wikipedia.org/wiki/SWIFFT

How to calculate hash of a set (unordered list) of values?

I want to calculate sha1 hash of a set (unordered list) of elements. I have already calculated sha1 hash of each element. I'm considering two solutions:
Sort elements by their hashes and calculate top hash of such list.
Treat element hashes as 160-bit integer values and XOR (bitwise operation) them together into one 160-bit hash.
Is the second solution weaker in terms of secure hash function properties (pre-image resistance, second pre-image resistance, collision resistance)?
Option 1 is what is done in ERS: that standard uses hash trees, where each node contains a hash value computed over the set of hash values from the child nodes; since order is not significant in the tree, the values are sorted lexicographically before hashing. This is good, and, as far as we know, safe.
Option 2 is very unsafe: if the hash function has 160-bit output, then I can easily generate 160 random inputs such that the corresponding hash values constitute a basis of the vector space GF(2)^160, at which point I can produce a matching set for any aggregate hash value. Attack cost is negligible.
Option 3, suggested by @paj28 (sort the values to hash, then hash them), is fine too, as long as you "concatenate" the sorted values with an unambiguous separator. For instance, if you hash the set of strings containing "bar" and "foo", you don't want to obtain the same hash value as with the set of strings containing "ba" and "rfoo". It is easier to get something safe when all values to hash have the same length.
Therefore, use option 1: hash each value in the set, then sort the hash values in lexicographic order, and hash the sorted list of values again.
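A minimal sketch of option 1 with Python's hashlib, assuming the elements are byte strings (the function name set_hash is illustrative). Because every SHA-1 digest is exactly 20 bytes long, the sorted digests can be concatenated without a separator.

```python
import hashlib

def set_hash(elements):
    """Hash each element, sort the digests, then hash the sorted list."""
    digests = sorted(hashlib.sha1(e).digest() for e in set(elements))
    # All digests are 20 bytes, so plain concatenation is unambiguous.
    return hashlib.sha1(b"".join(digests)).hexdigest()

# Order of the input does not matter.
print(set_hash([b"foo", b"bar"]) == set_hash([b"bar", b"foo"]))
```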
On the attack with option 2: this is linear algebra. Suppose that you have k vectors of n bits, such that none of them is equal to the XOR of some of the k-1 other vectors (they are said to be linearly independent). Then consider a new random vector v; the probability that this vector is equal to the XOR of some of the k vectors is equal to 2^(k-n), i.e. it is small as long as k < n. If the new vector v is indeed linearly independent of the k vectors you already have (which happens with probability 1-2^(k-n)), then add it to the set: you now have k+1 linearly independent vectors.
Recurse: you will soon obtain n vectors of n bits which are linearly independent of each other. But you cannot go further, because the probability that any new vector is linearly independent of the n previous ones has dropped to 0. The n vectors are said to be a basis for the vector space.
In this case, the vectors are obtained by simply hashing values (random values, or values with structure, it does not matter much, because the hash function acts as a randomizer).
For a given set of k vectors, determining whether a new vector v is linearly independent of the k vectors is easy with Gaussian elimination. The same algorithm lets you know, once you have a basis, which of your n basis vectors should be XORed together to yield any vector v'. In the setup of this question, this means that once I have produced n values m_i such that the h(m_i) constitute a basis, then for any target n-bit output t, I can use Gaussian elimination to work out which of my h(m_i) can be XORed together to yield exactly the value t. The corresponding m_i values are then a preimage set for t.
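The attack described above can be sketched in a few lines of Python. This is an illustration under assumptions, not a reference implementation: the message format b"msg-%d" and the function names are made up, and hash values are treated as 160-bit integers. Each basis entry stores a reduced vector together with a bitmask recording which messages' hashes XOR to it.

```python
import hashlib

N_BITS = 160  # SHA-1 output size

def h(msg: bytes) -> int:
    return int.from_bytes(hashlib.sha1(msg).digest(), "big")

def build_basis():
    """Collect messages whose hashes form a basis of GF(2)^160."""
    basis = {}     # top bit position -> (reduced vector, mask of message indices)
    messages = []
    counter = 0
    while len(basis) < N_BITS:
        msg = b"msg-%d" % counter
        counter += 1
        vec, combo = h(msg), 1 << len(messages)
        while vec and (vec.bit_length() - 1) in basis:
            bvec, bcombo = basis[vec.bit_length() - 1]
            vec ^= bvec        # Gaussian elimination step: clear the top bit
            combo ^= bcombo
        if vec:                # linearly independent: keep this message
            basis[vec.bit_length() - 1] = (vec, combo)
            messages.append(msg)
    return basis, messages

def preimage_set(target, basis, messages):
    """Return messages whose hashes XOR together to exactly `target`."""
    vec, combo = target, 0
    while vec:
        bvec, bcombo = basis[vec.bit_length() - 1]
        vec ^= bvec
        combo ^= bcombo
    return [m for i, m in enumerate(messages) if (combo >> i) & 1]

basis, messages = build_basis()
target = h(b"any value the attacker wants to match")
chosen = preimage_set(target, basis, messages)
```

XORing h(m) over the messages in chosen reproduces target, which is why the XOR aggregate offers no pre-image resistance.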
The other option (3) is to sort the elements first, then combine them into a single string using a separator that cannot appear as part of an element.
Of these possibilities, 2 would concern me the most. I can't think now how you could attack it in a practical way, but it seems the riskiest.
So 1 and 3 are basically fine. But I would recommend 3 because you are using the hash in the way it is intended.

Associative noncommutative hash function

Is there a hash function with the following properties?
is associative
is not commutative
easily implementable on 32 bit integers: int32 hash(int32, int32)
If I am correct, such a function would make it possible to:
calculate hash of concatenated string from hashes of substrings
calculate hash concurrently
calculate hash of list implemented on binary tree - including order, but excluding how tree is balanced
The best I have found so far is multiplication of 4x4 matrices of bits, but that's awkward to implement and reduces the space to 16 bits.
I am grateful for any help.
Polynomial rolling hash could help:
H(A1,...,An) = (H(A1,...,An-1) * Base + An) Mod P
It's easy to concatenate two results, or to subtract a prefix/suffix from a result, as long as the length is known.
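A minimal sketch of how concatenation could work, assuming the hash is carried as a pair (value, Base^length mod P) so the length information travels with it. BASE, P and the function names are illustrative choices, not part of the answer. With that pair, combining is associative and not commutative.

```python
BASE = 257
P = 1_000_000_007  # an assumed prime modulus

def poly_hash(data: bytes):
    """Return (hash, BASE**len(data) % P) for a byte string."""
    value, power = 0, 1
    for byte in data:
        value = (value * BASE + byte) % P
        power = (power * BASE) % P
    return value, power

def concat(left, right):
    """Hash of the concatenation, computed from the two partial results only."""
    (h1, p1), (h2, p2) = left, right
    # Shift the left hash by the length of the right part, then add.
    return (h1 * p2 + h2) % P, (p1 * p2) % P

print(concat(poly_hash(b"foo"), poly_hash(b"bar")) == poly_hash(b"foobar"))
```

Python integers do not overflow, so the intermediate product is safe here. A strict 32-bit implementation would need a smaller modulus or a wider intermediate type.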
Matrix multiplication is associative and non-commutative.
You could try representing your hashes as matrices, but this will result in a loss of information if they have a determinant of 0 (which is likely!).
So instead you should generate a triangular matrix with a diagonal of 1's to ensure that you have a determinant of 1 (this guarantees that composition does not lose information).
Furthermore, the composition of triangular matrices produces a new triangular matrix, so reading a hash back out of a composed matrix works the same way as generating the matrix.
Note: to use this method, the length of your hash must be a triangular number!
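As a small illustration of the idea, the sketch below uses 3x3 upper triangular matrices with 1's on the diagonal, so each hash consists of the three entries above the diagonal (3 is a triangular number). The entry width MOD and the function name combine are assumptions made for the example. The matrix [[1, a, c], [0, 1, b], [0, 0, 1]] is stored as the tuple (a, b, c).

```python
MOD = 1 << 10  # assumed entry width: 3 entries of 10 bits fit in an int32

def combine(x, y):
    """Product of two unit upper triangular 3x3 matrices stored as (a, b, c)."""
    a1, b1, c1 = x
    a2, b2, c2 = y
    # Only the top-right entry mixes the two operands non-commutatively.
    return ((a1 + a2) % MOD, (b1 + b2) % MOD, (c1 + c2 + a1 * b2) % MOD)

x, y, z = (1, 0, 0), (0, 1, 0), (5, 7, 11)
print(combine(x, y), combine(y, x))
print(combine(combine(x, y), z) == combine(x, combine(y, z)))
```

In this particular 3x3 case the a and b entries are plain sums and therefore order-independent. Only the c entry depends on order, so a larger matrix may be preferable in practice.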