Hashing integer coordinates of different sizes


I'm trying to hash some 3D coordinates to a 16-bit integer.
The coordinates have the following constraints:
x [0, 16]
y [0,256]
z [0, 16]
Is it possible to get O(1) access, zero collisions, and still fit it in a 16-bit word?
My thought was to shift the coordinates such that x takes up the first 4 bits, y the next 8 and z the last 4. After some iterations I came up with the following which shifts and masks the bits such that they shouldn't overlap and cause collisions:
unsigned int hash(unsigned char x, unsigned char y, unsigned char z) {
return (x << 12) & 0xF000 |
(y << 8) & 0x0FF0 |
z & 0x000F;
}
However, this does produce collisions somehow! I'm not sure why and would be grateful if anyone could tell me.
In researching hashing I've found that z-order curves/morton encoding would be a good way to go but that assumes the range of the coordinates in each dimension is constant. Could an option be to morton encode x and z into 8 bits and somehow combine that with the y coordinate for a 16-bit word?

I tried instead mapping to a 32-bit integer with the following code.
return ((x) << 24) & 0xFF000000 |
((y) << 16) & 0x00FFFF00 |
z & 0x000000FF;
My unit tests passed and it seems to work; however, I fear that this may eat a lot more memory than a 16-bit hash.
I'll mark this as answered but the original question still stands if anyone can enlighten me.

It might be because you’ve written
x & 0x000F
when it should be
z & 0x000F
The second shift count is also wrong, so try:
unsigned int hash(unsigned char x, unsigned char y, unsigned char z) {
return (x << 12) & 0xF000 |
(y << 4) & 0x0FF0 |
z & 0x000F;
}
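As a sanity check on this layout, here is a minimal sketch of the packing together with its inverse. It assumes the ranges are half-open (x and z in [0, 16), y in [0, 256)), which is what a 4 + 8 + 4 bit layout can represent. The names pack_xyz, unpack_xyz and Xyz are hypothetical, not from the question.

```cpp
#include <cstdint>

// Hypothetical helper: packs x (4 bits), y (8 bits) and z (4 bits) into 16 bits.
// Assumes x < 16, y < 256 and z < 16; anything larger is masked off.
std::uint16_t pack_xyz(unsigned x, unsigned y, unsigned z) {
    return static_cast<std::uint16_t>(((x & 0xFu) << 12) |
                                      ((y & 0xFFu) << 4) |
                                      (z & 0xFu));
}

struct Xyz { unsigned x, y, z; };

// Recovers the three coordinates from a packed value.
Xyz unpack_xyz(std::uint16_t h) {
    return Xyz{ (h >> 12) & 0xFu, (h >> 4) & 0xFFu, h & 0xFu };
}
```

Since every packed value unpacks to the triple it came from, no two in-range triples can collide. If 16 and 256 are meant to be inclusive upper bounds, there are 17 * 257 * 17 = 74273 distinct triples, which is more than the 65536 values a 16-bit word can hold, so a collision-free 16-bit encoding is not possible in that case.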

Related

Why is the following code correct for computing the hash of a string?

I am currently reading about the Rabin Karp algorithm and as part of that I need to understand string polynomial hashing. From what I understand, the hash of a string is given by the following formula:
hash = ( char_0_val * p^0 + char_1_val * p^1 + ... + char_n_val * p^n ) mod m
Where:
char_i_val: is the integer value of the character plus 1 given by string[i]-'a' + 1
p is a prime number larger than the character set
m is a large prime number
The website cp-algorithms has the following entry on the subject. They say that the code to write the above is as follows:
long long compute_hash(string const& s) {
const int p = 31;
const int m = 1e9 + 9;
long long hash_value = 0;
long long p_pow = 1;
for (char c : s) {
hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
p_pow = (p_pow * p) % m;
}
return hash_value;
}
I understand what the program is trying to do but I do not understand why it is correct.
My question
I am having trouble understanding why the above code is correct. It has been a long time since I have done any modular math. After searching online I see that we have the following formulas for modular addition and modular multiplication:
a+b (mod m) = (a%m + b%m)%m
a*b (mod m) = (a%m * b%m)%m
Based on the above shouldn't the code be as follows?
long long compute_hash(string const& s) {
const int p = 31;
const int m = 1e9 + 9;
long long hash_value = 0;
long long p_pow = 1;
for (char c : s) {
int char_value = (c - 'a' + 1);
hash_value = (hash_value%m + ((char_value%m * p_pow%m)%m)%m ) % m;
p_pow = (p_pow%m * p%m) % m;
}
return hash_value;
}
What am I missing? Ideally I am seeking a breakdown of the code and an explanation of why the first version is correct.

Mathematically, there is no reason to reduce intermediate results modulo m.
Operationally, there are a couple of very closely related reasons to do it:
To keep numbers small enough that they can be represented efficiently.
To keep numbers small enough that operations on them do not overflow.
So let's look at some quantities and see if they need to be reduced.
p was defined as some value less than m, so p % m == p.
p_pow and hash_value have already been reduced modulo m when they were computed, reducing them modulo m again would do nothing.
char_value is at most 26, which is already less than m.
char_value * p_pow is at most 26 * (m - 1). That can be, and often will be, more than m. So reducing it modulo m would do something. But it can still be delayed, because the next step is still "safe" (no overflow)
char_value * p_pow + hash_value is still at most 27 * (m - 1) which is still much less than 2^63-1 (the maximum value for a long long, see below why I assume that a long long is 64-bit), so there is no problem yet. It's fine to reduce modulo m after the addition.
As a bonus, the loop could actually do (2^63-1) / (27 * (m - 1)) iterations before it needs to reduce hash_value modulo m. That's over 341 million iterations! For most practical purposes you could therefore remove the first % m and return hash_value % m; instead.
I used 2^63-1 in this calculation because p_pow = (p_pow * p) % m requires long long to be a 64-bit type (or, hypothetically, an exotic size of 36 bits or higher). If it was a 32-bit type (which is technically allowed, but rare nowadays) then the multiplication could overflow, because p_pow can be approximately a billion and a 32-bit type cannot hold 31 billion.
BTW note that this hash function is specifically for strings that only contain lower-case letters and nothing else. Other characters could result in a negative value for char_value which is bad news because the remainder operator % in C++ works in a way such that for negative numbers it is not the "modulo operator" (misnomer, and the C++ specification does not call it that). A very similar function can be written that can take any string as input, and that would change the analysis above a little bit, but not qualitatively.
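A minimal sketch of such a variant for arbitrary strings, under stated assumptions: the name compute_hash_bytes is hypothetical, and p = 257 is chosen here only because it is a prime larger than the 256 possible character values (1 to 256 after adding 1). Each character is converted through unsigned char first, so char_value can never be negative.

```cpp
#include <string>

// Sketch: polynomial hash over arbitrary bytes, not only 'a'..'z'.
// Assumed parameters: p = 257, m = 1e9 + 9.
long long compute_hash_bytes(const std::string& s) {
    const long long p = 257;
    const long long m = 1000000009;
    long long hash_value = 0;
    long long p_pow = 1;
    for (char c : s) {
        // Going through unsigned char gives 0..255; adding 1 keeps every
        // character value non-zero, as in the original formula.
        long long char_value = static_cast<unsigned char>(c) + 1;
        // At most 256 * (m - 1) + (m - 1), far below the 64-bit limit.
        hash_value = (hash_value + char_value * p_pow) % m;
        // At most 257 * (m - 1), also safe in 64 bits.
        p_pow = (p_pow * p) % m;
    }
    return hash_value;
}
```

The bounds change from 27 * (m - 1) to 257 * (m - 1), which is still many orders of magnitude below 2^63-1, so the same argument applies.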

Does Murmurhash have collisions on 32-bit inputs?

Consider the standard Murmurhash, giving 32-bit output values.
Suppose that we apply it on 32-bit inputs -- are there collisions?
In other words, does MurmurHash basically encode a permutation when applied to 32-bit inputs?
If collisions exist, can anyone give an example (scanning random inputs didn't yield any)?

I assume you mean MurmurHash3, 32 bit, and specifically the 32-bit fmix method:
FORCE_INLINE uint32_t fmix32 ( uint32_t h )
{
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;
return h;
}
If not, then you need to better specify what you mean.
For the above, there are no collisions (two distinct inputs won't result in the same output). There is only one entry that returns the input value: 0.
As there are not "that many" 32-bit values, you can actually iterate over all of them to verify, in a couple of minutes. This will require some memory for a bit field, but that's it.
Btw, there is also a way to reverse the function (get the input from the output).
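One way to see this is to write the reverse function out. Each step of fmix32 is invertible on its own: an xor-shift can be undone by repeating it with doubled shift amounts, and multiplying by an odd constant is undone by multiplying by its inverse modulo 2^32. In the sketch below, unfmix32 is a hypothetical name, and the two inverse constants are not part of the MurmurHash3 source; they are the multiplicative inverses of the two multipliers modulo 2^32.

```cpp
#include <cstdint>

// fmix32 as quoted above, with FORCE_INLINE dropped.
std::uint32_t fmix32(std::uint32_t h) {
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

// Hypothetical inverse: undoes the steps of fmix32 in reverse order.
std::uint32_t unfmix32(std::uint32_t h) {
    h ^= h >> 16;      // a 16-bit xor-shift on 32 bits is its own inverse
    h *= 0x7ed1b41du;  // inverse of 0xc2b2ae35 modulo 2^32
    h ^= h >> 13;      // undoing a 13-bit xor-shift takes two steps
    h ^= h >> 26;
    h *= 0xa5cb9243u;  // inverse of 0x85ebca6b modulo 2^32
    h ^= h >> 16;
    return h;
}
```

Because every output maps back to exactly one input, fmix32 is a permutation of the 32-bit values and cannot produce collisions on 32-bit inputs.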

How to use bitset function in MATLAB to modify multiple bits simultaneously

>> a = 255
a =
255
>> bitset(a,1,0)
ans =
254
here the first bit is set to 0 so we get 11111110 equivalent to 254
>> bitset(a,[1,2],0)
ans =
254 253
here the 1st bit and 2nd bit are being set to 0 separately. Hence we get
11111110 equivalent to 254
11111101 equivalent to 253
How do I get 11111100, equivalent to 252?

Apply bitset twice:
bitset(bitset(a, 1, 0), 2, 0)
The order of application should not matter.
Alternatively, you can use the fact that bitset is equivalent to applying the correct sequence of bitand, bitor and bitcmp operations.
Since you are interested in turning off multiple bits, you can do
bitand(bitset(a, 1, 0), bitset(a, 2, 0))

Here's a one-liner:
a = 255;
bits = [1,2];
bitand(a,bitcmp(sum(2.^(bits-1)),'uint32'))
Taken apart:
b = sum(2.^(bits-1))
computes the integer with the given bits set. Note that bits must not contain duplicate elements. Use unique to enforce this: bits = unique(bits).
c = bitcmp(b,'uint32')
computes the 32-bit complement of the above. ANDing with the complement resets the given bits.
bitand(a,c)
computes the binary AND of the input number and the integer with the given bits turned off.
Setting bits is easier:
a = 112;
bits = [1,2];
bitor(a,sum(2.^(bits-1)))

Maybe most explicit, easiest to understand, you can convert to a string representing binary and then do the operations there, then convert back.
a = 255
bin_a = flip(dec2bin(a)) % flip so that index 1 is the least significant bit
bin_a([1, 2]) = '0'
a = bin2dec(flip(bin_a))

Here is a little recursive function based on the answer from @Mad Physicist that zeroes the lowest n bits of x. Thanks for the original info. The recursion is probably dead obvious to most people but it might help somebody out.
function y = zero_nbits(x, n)
y = bitset(x, n, 0);
if n > 1
y = zero_nbits(y, n-1);
end
end

Converting characters to decimal while I need to append the values together

I'm receiving two characters from a serial port let's say '\x10' and 'Q'.
I need to convert them to decimal.
Each character represents a two-digit hex code, "10" and "51".
However, I need to append them and have "1051" and then convert this to decimal to give me 4177 which indicates my voltage value multiplied by 100.
So what is the question?
I know printf("%x", ...) gives me a hex value; however, it only displays the hex format and does not let me append two hex codes to get one.
Then what is the actual way to convert and append them together?

This was pretty easy but tricky. And I got confused by division in the received packet! It was just a simple hex to dec and dec to hex conversion in order to append the two chunks together!
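// Use unsigned char so that bytes of 0x80 and above do not become negative.
unsigned char ch = SerialBuffer[17];  // high byte
unsigned char ch2 = SerialBuffer[16]; // low byte
int x = ch / 16;  // high hex digit of the high byte
int y = ch % 16;  // low hex digit of the high byte
int z = ch2 / 16; // high hex digit of the low byte
int v = ch2 % 16; // low hex digit of the low byte
int n = v + z * 16 + y * 256 + x * 4096; // decimal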
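The same value can be reached more directly with a shift and an OR, since one byte is simply the upper half of the 16-bit result. A minimal sketch; combine_bytes is a hypothetical name, and which buffer position holds the high byte depends on the protocol:

```cpp
#include <cstdint>

// Hypothetical helper: joins two received bytes into one 16-bit value.
// The parameters are unsigned so that bytes >= 0x80 are not sign-extended.
std::uint16_t combine_bytes(unsigned char high, unsigned char low) {
    return static_cast<std::uint16_t>((high << 8) | low);
}
```

With the bytes from the question, combine_bytes('\x10', 'Q') yields 0x1051, which is 4177 in decimal.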

Reverse multiplication of 32-bit numbers

I have two large signed 32-bit numbers (java ints) being multiplied together such that they'll overflow. Actually, I have one of the numbers, and the result. Can I determine what the other operand was?
knownResult = unknownOperand * knownOperand;
Why? I have a string and a suffix being hashed with fnv1a. I know the resulting hash and the suffix, I want to see how easy it is to determine the hash of the original string.
This is the core of fnv1a:
hash ^= byte
hash *= PRIME

It depends. If the multiplier is even, at least one bit must inevitably be lost. So I hope that prime isn't 2.
If it's odd, then you can absolutely reverse it, just multiply by the modular multiplicative inverse of the multiplier to undo the multiplication.
There is an algorithm to calculate the modular multiplicative inverse modulo a power of two in Hacker's Delight.
For example, if the multiplier was 3, then you'd multiply by 0xaaaaaaab to undo (because 0xaaaaaaab * 3 = 1 modulo 2^32). For 0x01000193, the inverse is 0x359c449b.
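One known way to compute such an inverse is Newton's iteration, which doubles the number of correct low bits on each step. The sketch below is in C++ with explicit 32-bit unsigned types; the same arithmetic should carry over to Java ints, which also wrap on overflow. The names mul_inverse32 and undo_fnv1a_step are hypothetical.

```cpp
#include <cstdint>

// Multiplicative inverse of an odd number modulo 2^32 (Newton's iteration).
// The result is meaningless for even input, which has no inverse.
std::uint32_t mul_inverse32(std::uint32_t a) {
    std::uint32_t x = a;           // a * a == 1 (mod 8), so 3 low bits are correct
    for (int i = 0; i < 4; ++i) {  // 3 -> 6 -> 12 -> 24 -> 48 correct bits
        x *= 2u - a * x;
    }
    return x;
}

// Undoes one FNV-1a step (hash ^= byte; hash *= PRIME) for a known byte.
std::uint32_t undo_fnv1a_step(std::uint32_t hash, std::uint8_t byte) {
    const std::uint32_t prime = 0x01000193u;
    hash *= mul_inverse32(prime);  // reverse the multiplication
    hash ^= byte;                  // xor is its own inverse
    return hash;
}
```

Applying undo_fnv1a_step once per suffix byte, from the last byte to the first, walks the hash back to the value it had before the suffix was appended.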

You want to solve the equation y = prime * x for x, which you do by division in the finite ring modulo 2^32: x = y / prime.
Technically you do that by multiplying y with the multiplicative inverse of the prime modulo 2^32, which can be computed by the extended Euclidean algorithm.

Uh, division? Or am I not understanding the question?

It's not the fastest method, but something very easy to memorise is this:
unsigned inv(unsigned x) {
unsigned xx = x * x;
while (xx != 1) {
x *= xx;
xx *= xx;
}
return x;
}
It returns x**(2**n-1) (as in x*(x**2)*(x**4)*(x**8)*..., or x**(1+2+4+8+...)). As the loop exit condition implies, x**(2**n) is 1 when n is big enough, provided x is odd.
So, x**(2**n-1) equals x**(2**n)/x equals 1/x equals the thing you multiply x by to get the value 1 (mod 2**n). Which you then apply:
knownResult = unknownOperand * knownOperand
knownResult * inv(knownOperand) = unknownOperand * knownOperand * inv(knownOperand)
knownResult * inv(knownOperand) = unknownOperand * 1
or simply:
unknownOperand = knownResult * inv(knownOperand);
But there are faster ways, as given in other answers here. This one's just easy to remember.
Also, obligatory SO "use a library function" answer: BN_mod_inverse().