Does Murmurhash have collisions on 32-bit inputs? - hash

Consider the standard Murmurhash, giving 32-bit output values.
Suppose that we apply it on 32-bit inputs -- are there collisions?
In other words, does Murmurmash basically encodes a permutation when applied to 32-bit inputs?
If collisions exist, can anyone give an example (scanning random inputs didn't yield any)?

I assume you mean MurmurHash3, 32 bit, and specially the 32-bit fmix method:
FORCE_INLINE uint32_t fmix32 ( uint32_t h )
{
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;
return h;
}
If not, then you need to better specify what you mean.
For the above, there are no collisions (two distinct inputs won't result in the same output). There is only one entry that returns the input value: 0.
As there are not "that many" 32-bit values, you can actually iterate over all of them to verify, in a couple of minutes. This will require some memory for a bit field, but that's it.
Btw, there is also a way to reverse the function (get the input from the output).

Related

Unexpected Xs in the result of addition of two logic signals

I have a code where two signals are added. Only top bits of the signals are X. Bottom bits are 0s.
But, the result has all Xs. I would expect the bottom bits to be 0s in the result.
logic [3:0] a; // 4'bxx00
logic [3:0] b; // 4'bxx00
logic [4:0] c;
assign c = a + b; // Results in c = 4'bxxxxx
I am trying to understand why the bottom 2 bits are x in the result.
This behavior is specified in IEEE Std 1800-2017, section 11.4.3 Arithmetic operators:
For the arithmetic operators, if any operand bit value is the unknown
value x or the high-impedance value z , then the entire result value
shall be x .
The expression a + b contains an arithmetic operator: +

Why is the following code correct for computing the hash of a string?

I am currently reading about the Rabin Karp algorithm and as part of that I need to understand string polynomial hashing. From what I understand, the hash of a string is given by the following formula:
hash = ( char_0_val * p^0 + char_1_val * p^1 + ... + char_n_val ^ p^n ) mod m
Where:
char_i_val: is the integer value of the character plus 1 given by string[i]-'a' + 1
p is a prime number larger than the character set
m is a large prime number
The website cp-algorithms has the following entry on the subject. They say that the code to write the above is as follows:
long long compute_hash(string const& s) {
const int p = 31;
const int m = 1e9 + 9;
long long hash_value = 0;
long long p_pow = 1;
for (char c : s) {
hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
p_pow = (p_pow * p) % m;
}
return hash_value;
}
I understand what the program is trying to do but I do not understand why it is correct.
My question
I am having trouble understanding why the above code is correct. It has been a long time since I have done any modular math. After searching online I see that we have the following formulas for modular addition and modular multiplication:
a+b (mod m) = (a%m + b%m)%m
a*b (mod m) = (a%m * b%m)%m
Based on the above shouldn't the code be as follows?
long long compute_hash(string const& s) {
const int p = 31;
const int m = 1e9 + 9;
long long hash_value = 0;
long long p_pow = 1;
for (char c : s) {
int char_value = (c - 'a' + 1);
hash_value = (hash_value%m + ((char_value%m * p_pow%m)%m)%m ) % m;
p_pow = (p_pow%m * p%m) % m;
}
return hash_value;
}
What am I missing? Ideally I am seeking a breakdown of the code and an explanation of why the first version is correct.
Mathematically, there is no reason to reduce intermediate results modulo m.
Operationally, there are a couple of very closely related reasons to do it:
To keep numbers small enough that they can be represented efficiently.
To keep numbers small enough that operations on them do not overflow.
So let's look at some quantities and see if they need to be reduced.
p was defined as some value less than m, so p % m == p.
p_pow and hash_value have already been reduced modulo m when they were computed, reducing them modulo m again would do nothing.
char_value is at most 26, which is already less than m.
char_value * p_pow is at most 26 * (m - 1). That can be, and often will be, more than m. So reducing it modulo m would do something. But it can still be delayed, because the next step is still "safe" (no overflow)
char_value * p_pow + hash_value is still at most 27 * (m - 1) which is still much less than 263-1 (the maximum value for a long long, see below why I assume that a long long is 64-bit), so there is no problem yet. It's fine to reduce modulo m after the addition.
As a bonus, the loop could actually do (263-1) / (27 * (m - 1)) iterations before it needs to reduce hash_value modulo m. That's over 341 million iterations! For most practical purposes you could therefore remove the first % m and return hash_value % m; instead.
I used 263-1 in this calculation because p_pow = (p_pow * p) % m requires long long to be a 64-bit type (or, hypothetically, an exotic size of 36 bits or higher). If it was a 32-bit type (which is technically allowed, but rare nowadays) then the multiplication could overflow, because p_pow can be approximately a billion and a 32-bit type cannot hold 31 billion.
BTW note that this hash function is specifically for strings that only contain lower-case letters and nothing else. Other characters could result in a negative value for char_value which is bad news because the remainder operator % in C++ works in a way such that for negative numbers it is not the "modulo operator" (misnomer, and the C++ specification does not call it that). A very similar function can be written that can take any string as input, and that would change the analysis above a little bit, but not qualitatively.

Hashing integer coordinates of different sizes

I'm trying to hash some 3D coordinates to a 16-bit integer.
The coordinates have the following constraints:
x [0, 16]
y [0,256]
z [0, 16]
Is it possible to get O(1) access, zero collisions, and still fit it in a 16-bit word?
My thought was to shift the coordinates such that x takes up the first 4 bits, y the next 8 and z the last 4. After some iterations I came up with the following which shifts and masks the bits such that they shouldn't overlap and cause collisions:
unsigned int hash(unsigned char x, unsigned char y, unsigned char z) {
return (x << 12) & 0xF000 |
(y << 8) & 0x0FF0 |
z & 0x000F;
}
However this does produce collisions somehow! I'm not sure why and would grateful if anyone could tell me.
In researching hashing I've found that z-order curves/morton encoding would be a good way to go but that assumes the range of the coordinates in each dimension is constant. Could an option be to morton encode x and z into 8 bits and somehow combine that with the y coordinate for a 16-bit word?
I tried instead mapping to a 32-bit integer with the following code.
return ((x) << 24) & 0xFF000000 |
((y) << 16) & 0x00FFFF00 |
z & 0x000000FF;
My unit tests passed and it seems to work however I fear that this may eat a lot more memory than a 16-bit hash.
I'll mark this as answered but the original question still stands if anyone can enlighten me.
It might be because you’ve written
x & 0x000F
when it should be
z & 0x000F
The second shift count is also wrong, so try:
unsigned int hash(unsigned char x, unsigned char y, unsigned char z) {
return (x << 12) & 0xF000 |
(y << 4) & 0x0FF0 |
z & 0x000F;
}

Reverse multiplication of 32-bit numbers

I have two large signed 32-bit numbers (java ints) being multiplied together such that they'll overflow. Actually, I have one of the numbers, and the result. Can I determine what the other operand was?
knownResult = unknownOperand * knownOperand;
Why? I have a string and a suffix being hashed with fnv1a. I know the resulting hash and the suffix, I want to see how easy it is to determine the hash of the original string.
This is the core of fnv1a:
hash ^= byte
hash *= PRIME
It depends. If the multiplier is even, at least one bit must inevitably be lost. So I hope that prime isn't 2.
If it's odd, then you can absolutely reverse it, just multiply by the modular multiplicative inverse of the multiplier to undo the multiplication.
There is an algorithm to calculate the modular multiplicative inverse modulo a power of two in Hacker's Delight.
For example, if the multiplier was 3, then you'd multiply by 0xaaaaaaab to undo (because 0xaaaaaaab * 3 = 1). For 0x01000193, the inverse is 0x359c449b.
You want to solve the equation y = prime * x for x, which you do by division in the finite ring modulo 232: x = y / prime.
Technically you do that by multiplying y with the multiplicative inverse of the prime modulo 232, which can be computed by the extended Euclidean algorithm.
Uh, division? Or am I not understanding the question?
It's not the fastest method, but something very easy to memorise is this:
unsigned inv(unsigned x) {
unsigned xx = x * x;
while (xx != 1) {
x *= xx;
xx *= xx;
}
return x;
}
It returns x**(2**n-1) (as in x*(x**2)*(x**4)*(x**8)*..., or x**(1+2+4+8+...)). As the loop exit condition implies, x**(2**n) is 1 when n is big enough, provided x is odd.
So, x**(2**n-1) equals x**(2**n)/x equals 1/x equals the thing you multiply x by to get the value 1 (mod 2**n). Which you then apply:
knownResult = unknownOperand * knownOperand
knownResult * inv(knownOperand) = unknownOperand * knownOperand * inv(knownOperand)
knownResult * inv(knownOperand) = unknownOperand * 1
or simply:
unknownOperand = knownResult * inv(knownOperand);
But there are faster ways, as given in other answers here. This one's just easy to remember.
Also, obligatory SO "use a library function" answer: BN_mod_inverse().

Increase Hex2dec or dec2hex output range in Matlab

I have a strange problem with hex2dec function in Matlab.
I realized in 16bytes data, it omits 2 LSB bytes.
hex2dec('123123123123123A');
dec2hex(ans)
Warning: At least one of the input numbers is larger than the largest integer-valued floating-point
number (2^52). Results may be unpredictable.
ans =
1231231231231200
I am using this in Simulink. Therefore I cannot process 16byte data. Simulink interpret this as a 14byte + '00'.
You need to use uint64 to store that value:
A='123123123123123A';
B=bitshift(uint64(hex2dec(A(1:end-8))),32)+uint64(hex2dec(A(end-7:end)))
which returns
B =
1310867527582290490
An alternative way in MATLAB using typecast:
>> A = '123123123123123A';
>> B = typecast(uint32(hex2dec([A(9:end);A(1:8)])), 'uint64')
B =
1310867527582290490
And the reverse in the opposite direction:
>> AA = dec2hex(typecast(B,'uint32'));
>> AA = [AA(2,:) AA(1,:)]
AA =
123123123123123A
The idea is to treat the 64-integer as two 32-bit integers.
That said, Simulink does not support int64 and uint64 types as others have already noted..