Help designing a hash function to detect duplicate records?


Let me explain my program thus far. It is a Rubik's Cube solver. I am given a scrambled cube (this is the initial state), which becomes the root node of a graph. I am using iterative-deepening depth-first search to "brute force" this scrambled cube into a recognizable state, which I can then solve using pattern recognition.
As you can imagine, this is a very large graph, so I would like to come up with some sort of hashing functionality to detect duplicate nodes in this graph (thus speeding up the traversal).
I am largely unfamiliar with hashing functions, but here is what I am thinking. Each node is essentially a different state of the Rubik's Cube, so if I come to a cube state (node) that has already been seen, I want to skip over it. I therefore need a hashing function that takes me from the state variable to a checksum, where the state variable is a 54-character string. The only allowed characters are y, r, g, o, b, w (which correspond to colors).
Any help designing this hash function would be greatly appreciated.

For the fastest duplicate detection and removal, avoid generating many of the repeated positions in the first place. This is easy to do and quicker than generating the repeats and then finding them. For example, if you have moves like F and B and you allow the subsequence FB, don't also allow BF, which gives the same result. If you've just done F three times (3F), don't follow it with another F. You can generate a small look-up table of allowed next moves, given the last three moves.
For the remaining duplicates you want a fast hash, because there are a lot of positions. To make the hash fast, as others have commented, you want the thing it hashes (the representation of the position) to be small. There are 12 edge cubies and 8 corner cubies. Representing each cubie's position and orientation takes only five bits per cubie, i.e. 100 bits (12.5 bytes) in total. For edges, it's four bits for position and one for flip. For corners, it's three bits for position and two for spin. You can ignore the last edge cubie, since its position and flip are fixed by the others. That leaves 19 cubies at five bits each, i.e. 95 bits, so you are already down to 12 bytes for the position.
You have about 70 real bits of information in a Rubik's Cube position, and 96 bits is close enough to 70 that hashing those bits further is actually counterproductive. In other words, treat this representation of the cube as your hash. That may sound a bit strange, but from your question I'm envisaging you experimenting at the same time with a less compact representation of the cube that is more amenable to your pattern matching. In that case the 12-byte value can be treated as if it were a hash, with the advantage that it's a hash that never has a collision. That makes the duplicate-testing and insertion code shorter, simpler and faster. It's going to be cheaper than the MD5 solutions suggested so far.
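As a rough illustration of the move-pruning idea, here is a minimal Python sketch. It is a simplified variant, not the three-move look-up table described above: it assumes each "move" is any turn of one face, so the rule only needs the previous face. The names `FACES`, `OPPOSITE` and `allowed_next_faces` are hypothetical.

```python
FACES = "UDLRFB"
OPPOSITE = {"U": "D", "D": "U", "L": "R", "R": "L", "F": "B", "B": "F"}

def allowed_next_faces(last_face):
    """Return the faces that may be turned after a turn of last_face."""
    if last_face is None:              # first move of the search: anything goes
        return list(FACES)
    allowed = []
    for face in FACES:
        if face == last_face:
            continue                   # two turns of one face collapse into one turn
        # Opposite faces commute, so keep only one order: F then B, never B then F.
        if face == OPPOSITE[last_face] and FACES.index(face) < FACES.index(last_face):
            continue
        allowed.append(face)
    return allowed
```

With this rule, `allowed_next_faces("F")` still contains `"B"`, while `allowed_next_faces("B")` does not contain `"F"`, so the sequence FB is generated but BF is not.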
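A minimal Python sketch of the "representation as hash" idea follows. It assumes the cube is available as a list of 12 edges, each a `(position, flip)` pair, and a list of 8 corners, each a `(position, spin)` pair; that layout and the name `pack_state` are illustrative assumptions, not part of the original program.

```python
def pack_state(edges, corners):
    """Pack a cube into one integer of at most 95 bits (5 bits per cubie).

    edges:   12 tuples (position 0..11, flip 0..1)
    corners:  8 tuples (position 0..7,  spin 0..2)
    """
    key = 0
    for pos, flip in edges[:11]:       # the 12th edge is implied by the other 11
        key = (key << 5) | (pos << 1) | flip
    for pos, spin in corners:
        key = (key << 5) | (pos << 2) | spin
    return key

# Duplicate detection: the packed value is itself the collision-free key.
seen = set()

def is_new(edges, corners):
    key = pack_state(edges, corners)
    if key in seen:
        return False
    seen.add(key)
    return True
```

Because two different cube states never pack to the same integer, a set of these keys gives exact duplicate detection with no separate hash function to design.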
There are many other tricks you could use to cut down the work in searching for repeated positions. Have a look at http://cube20.org/ for ideas.

You can always try a cryptographic hash function. Since your problem is not a question of security (there is no attacker purposely trying to find distinct states which hash to the same value), you can use a broken hash function. I recommend trying MD4, which is quite fast. Your 54-character string is quite appropriate for MD4 input (MD4 can process inputs up to 55 bytes as a single block).
A basic 2.4 GHz PC can hash about 12 million such strings per second, using a single core, with a simple unrolled C implementation (e.g. one which would look like the MD4Transform() function in the sample code included in RFC 1320). This may be enough for your needs.

1) Don't Use A Hash
You have 9*6 = 54 separate facelets (stickers) on a Rubik's Cube. Even wastefully using 1 byte per facelet, this is 432 bits, so hashing won't save you much space. A better packing of 3 bits per facelet comes to 162 bits (21 bytes). It sounds to me like what you need is a compact way to represent the cube.
OTOH, if you are looking to store a set of very many previously visited states, then I've found that using a Bloom filter instead of a true set gets me decent (but often non-optimal) results with much lower space utilization.
2) If you are married to the idea of a hash:
Just use MD5. It's slightly more compact than the proposed cube states, rather fast, and has good collision properties; it's not like you have a malicious adversary trying to cause Rubik's Cube hash collisions ;-).
EDIT: Using cryptographic hash functions, such as MD4/MD5, is usually simple once you have a library or function implementing the algorithm (e.g. OpenSSL, GnuTLS, and many stand-alone implementations exist). Usually the function is something like void md5(unsigned char *buf, size_t len, unsigned char *digest), where digest points to a pre-allocated 16-byte buffer and buf is the data to be hashed (your cube structure). Here is a small C sketch (the state string is only a placeholder):
    #include <stdio.h>
    #include <string.h>
    #include <openssl/md5.h>

    int main(void)
    {
        /* Placeholder 54-character cube state (a solved cube); substitute your own. */
        const char *state = "yyyyyyyyyrrrrrrrrrgggggggggooooooooobbbbbbbbbwwwwwwwww";
        unsigned char digest[MD5_DIGEST_LENGTH];   /* 16 bytes */

        MD5((const unsigned char *)state, strlen(state), digest);  /* OpenSSL one-shot MD5 */

        for (int i = 0; i < MD5_DIGEST_LENGTH; i++)   /* print the digest as hex */
            printf("%02x", digest[i]);
        printf("\n");
        return 0;
    }
And be sure to link with -lcrypto (OpenSSL's MD5 lives in libcrypto rather than libssl).
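The Bloom filter mentioned under point 1 can be sketched in a few lines of Python. This is a minimal illustration under assumed parameters, not a tuned implementation: the class name, the choice of `blake2b`, and the double-hashing scheme for deriving bit positions are all choices made for this example.

```python
import hashlib

class BloomFilter:
    """Approximate set: no false negatives, occasional false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from one 128-bit digest (double hashing).
        digest = hashlib.blake2b(item, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))
```

Usage would look like `visited = BloomFilter(1 << 24, 4)`, then `visited.add(state_bytes)` and `state_bytes in visited`. A false positive makes the search skip a state it has not actually seen, which is why results can be non-optimal.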

8 corner cubes:
You can assign each of these corners to 8 positions which each require 3 bits to determine which corner cube is at which position for a total of 24 bits.
You can further reduce this to just recording 7-of-8 positions as you can easily use a process of elimination to determine what the 8th corner is (for 21 bits).
However, this can be reduced further as the 8 corners can only be arranged in 8! = 40320 permutations and 40320 can be represented in 16 bits.
Each corner cube can be orientated correctly or be rotated 120° clockwise or anti-clockwise to be in three different positions (represented as 0, 1 and 2 respectively).
This requires 2 bits per corner to represent.
However, the sum of the orientations (modulo 3) is always 0; so, if you know 7-of-8 orientations then (assuming you have a solvable cube) you can calculate the orientation of the 8th corner (giving a total of 14 bits).
Or for a further reduction, seven ternary (base 3) digits can represent the orientation of the corners and this can be represented in 12 binary digits (bits).
So the corner cubes can be represented in 28 bits (16 + 12) if you are willing to decode the permutation, or in 33 bits (21 + 12) if you want to directly record the positions of 7-of-8 corners.
12 edge cubes:
Each edge position can be represented in 4 bits (for a total of 48 bits), which can be reduced to 44 bits by only recording the positions of 11-of-12 edges.
However, the 12! = 479001600 permutations of the edges can be stored in 29 bits.
Each edge can either be oriented correctly or flipped:
This requires 1 bit to represent.
However, edges are always flipped in pairs, so the parity of the flipped edges will always be zero (again meaning that you only need to record 11-of-12 orientations for the edges), giving a total of 11 bits required.
So the edge cubes can be represented in 40 bits (29 + 11) if you are willing to decode the permutation, or in 55 bits (44 + 11) if you want to record the positions and flips of 11-of-12 edges.
6 centre cubes:
You do not need to record any information about the centre cubes. They are fixed relative to the ball at the centre of the Rubik's cube, so (assuming you are not worried about the orientation of any logos on the cube) they are immobile.
Total:
Using permutations: 68 bits
Using positions: 88 bits
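To make the 68-bit "permutations" variant concrete, here is a hedged Python sketch. It assumes the cube is given as a corner permutation, corner twists (0, 1, 2), an edge permutation and edge flips (0, 1), all as plain lists; the function names and the Lehmer-code ranking are choices made for this example.

```python
from math import factorial

def permutation_rank(perm):
    """Lehmer-code rank of a permutation of 0..n-1, in the range [0, n!)."""
    n = len(perm)
    rank = 0
    for i, p in enumerate(perm):
        smaller_later = sum(1 for q in perm[i + 1:] if q < p)
        rank += smaller_later * factorial(n - 1 - i)
    return rank

def encode_cube(corner_perm, corner_twist, edge_perm, edge_flip):
    """Pack a cube state into a single integer below 2**68."""
    corner_rank = permutation_rank(corner_perm)   # < 8!  = 40320     -> 16 bits
    twist = 0
    for t in corner_twist[:7]:                    # 8th twist is implied
        twist = twist * 3 + t                     # < 3^7 = 2187      -> 12 bits
    edge_rank = permutation_rank(edge_perm)       # < 12! = 479001600 -> 29 bits
    flip = 0
    for f in edge_flip[:11]:                      # 12th flip is implied
        flip = (flip << 1) | f                    #                   -> 11 bits
    code = corner_rank
    code = (code << 12) | twist
    code = (code << 29) | edge_rank
    code = (code << 11) | flip
    return code
```

The ranking loop is quadratic in the number of pieces, which is negligible for 8 and 12 elements. The "positions" variant skips the ranking step at the cost of 20 extra bits.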

Just to establish the theoretical minimum representation - the state space of a valid Rubik's cube is about 4.3*10^19. Log2(4.3*10^19) will then determine how many bits you need to represent that full space, the ceiling of which is 66. So in theory, if you could number every valid state, any given state could be uniquely represented in 66 bits.
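The 66-bit figure can be checked with a short Python calculation. The factorisation used here (corner and edge permutations, corner twists, edge flips, and a final division by 2 for the shared permutation parity) is the standard count of reachable states; the variable names are just for this example.

```python
from math import factorial

# 8! corner permutations * 3^7 corner twists * 12! edge permutations * 2^11 edge flips,
# halved because corner and edge permutations must have the same parity.
CUBE_STATES = factorial(8) * 3 ** 7 * factorial(12) * 2 ** 11 // 2

# Smallest number of bits that can index every state 0 .. CUBE_STATES - 1.
bits_needed = (CUBE_STATES - 1).bit_length()

print(CUBE_STATES)   # about 4.3 * 10^19
print(bits_needed)
```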
While you may want to follow others' advice and find a more compact way of representing the cube, consider representing the state in terms of edge, corner, and face pieces. Due to the swapping laws of legal cube moves, you should be able to concatenate a sequence of 12 4-bit edge locations, 8 3-bit corner locations, and 6 3-bit face locations. This should result in a unique representation using 90 bits.
This representation may not be conducive to the way you are creating your tree, but it is unique, easily comparable, and should be possible to find given a state in your existing representation.

Related

Compression, using invalid data to represent chess board states

I am making a chess program and I have been trying to optimize the game board state to use the least amount of data storage possible. I've realized there is a set of data that makes sense to my compression algorithm (it can be decompressed) but is also invalid: for example, a rook that is eligible for a castling move but is not in its starting position, or a pawn that is not in the fourth/fifth row but is eligible for an en passant move. I am trying to figure out an effective way to have those invalid positions encode a string of 0s between 1 and 61 bits long, to represent empty board spaces.
So given the input either 10110 or 11110 (input might be irrelevant to this problem?), what is a good way to represent between 1 and 61 bits of 0s? Any bits of your choosing can follow, so long as they, plus the input, are shorter than the run of zeros they would be taking the place of.
This would be used optionally in place of the string of zeros, so for example if there was three 0's it would make more sense to encode it as 000 rather than 10110(your bits here). But if it was say, 16 zeros, then it would be potentially cheaper to encode it as 10110(your bits here). At compression time the decision would be made which to use, and by nature of it at decompression time both would be decompressed to mean the same thing.

32-1024 bit fixed point vector arithmetic with AVX-2

For a Mandelbrot generator I want to use fixed-point arithmetic going from 32 up to maybe 1024 bits as you zoom in.
Now, normally SSE or AVX is no help there due to the lack of add-with-carry, and doing normal integer arithmetic is faster. But in my case I have literally millions of pixels that all need to be computed, so I have a huge vector of values that all need to go through the same iterative formula over and over, up to a million times.
So I'm not looking at doing a fixed-point add/sub/mul on single values but at doing it on huge vectors. My hope is that for such vector operations AVX/AVX2 can still be utilized to improve the performance despite the lack of a native add-with-carry.
Does anyone know of a library for fixed-point arithmetic on vectors, or some example code showing how to emulate add-with-carry on AVX/AVX2?

FP extended precision gives more bits per clock cycle (because double FMA throughput is 2/clock vs. 32x32=>64-bit at 1 or 2/clock on Intel CPUs); consider using the same tricks that Prime95 uses with FMA for integer math. With care it's possible to use FPU hardware for bit-exact integer work.
For your actual question: since you want to do the same thing to multiple pixels in parallel, probably you want to do carries between corresponding elements in separate vectors, so one __m256i holds 64-bit chunks of 4 separate bigintegers, not 4 chunks of the same integer.
Register pressure is a problem for very wide integers with this strategy. Perhaps you can usefully branch on there being no carry propagation past the 4th or 6th vector of chunks, or something, by using vpmovmskb on the compare result to generate the carry-out after each add. An unsigned add a+b has a carry-out exactly when a+b < a (unsigned compare).
But AVX2 only has signed integer compares (for greater-than), not unsigned. And with carry-in, (a+b+c_in) == a is possible with b=carry_in=0 or with b=0xFFF... and carry_in=1 so generating carry-out is not simple.
To solve both those problems, consider using chunks with manual wrapping to 60-bit or 62-bit or something, so they're guaranteed to be signed-positive and so carry-out from addition appears in the high bits of the full 64-bit element. (Where you can vpsrlq ymm, 62 to extract it for addition into the vector of next higher chunks.)
Maybe even 63-bit chunks would work here so carry appears in the very top bit, and vmovmskpd can check if any element produced a carry. Otherwise vptest can do that with the right mask.
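The reduced-width chunk idea can be modelled in plain Python before attempting intrinsics. The sketch below is only a scalar model of one "lane": it assumes 62-bit limbs stored in 64-bit slots, least-significant limb first, and the helper names are hypothetical. The right shift plays the role that vpsrlq would play on a whole vector.

```python
LIMB_BITS = 62
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(value, num_limbs):
    """Split a non-negative integer into 62-bit limbs, least significant first."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(num_limbs)]

def from_limbs(limbs):
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

def add_limbs(a, b):
    """Add two limb lists; returns (sum limbs, carry out of the top limb)."""
    out = []
    carry = 0
    for x, y in zip(a, b):
        s = x + y + carry          # at most 2^63 - 1, so it stays signed-positive
        out.append(s & LIMB_MASK)  # keep the low 62 bits in this limb
        carry = s >> LIMB_BITS     # the carry shows up in the high bits
        # (on AVX2 this shift would be vpsrlq ymm, 62 across four lanes at once)
    return out, carry
```

In a vectorised version each list element would be a `__m256i` holding the same limb of four different pixels, so the loop body becomes a handful of vector instructions with no cross-lane traffic.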
This is a hand-wavy kind of brainstorm answer; I don't have any plans to expand it into a detailed answer. If anyone wants to write actual code based on this, please post your own answer so we can upvote that (if it turns out to be a useful idea at all).

Just for kicks, without claiming that this will be actually useful, you can extract the carry bit of an addition by just looking at the upper bits of the input and output values.
    unsigned result = a + b + last_carry;   // add a, b and (optionally) the last carry
    unsigned carry = (a & b)                // carry if both a AND b have the upper bit set
                   |                        // OR
                   ((a ^ b)                 // upper bits of a and b are different
                    & ~result);             // AND the upper bit of the result is not set
    carry >>= sizeof(unsigned)*8 - 1;       // shift the upper bit down to the lowest bit
With SSE2/AVX2 this could be implemented with two additions, 4 logic operations and one shift, and it works for arbitrary (supported) integer sizes (uint8, uint16, uint32, uint64). With AVX2 you'd need 7 uops to get four 64-bit additions with carry-in and carry-out.
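As a sanity check of the bit trick, the following Python sketch models a single 64-bit lane and compares the derived carry with the true carry. Python integers are unbounded, so the masking with `MASK64` stands in for the wrap-around of a 64-bit register; the function name is hypothetical.

```python
import random

MASK64 = (1 << 64) - 1

def add_with_carry(a, b, carry_in):
    """64-bit add returning (result, carry_out) using only add, logic ops and a shift."""
    result = (a + b + carry_in) & MASK64          # wraps like a 64-bit register
    carry = (a & b) | ((a ^ b) & ~result)         # top bit holds the carry-out
    return result, (carry & MASK64) >> 63

# Compare against the true carry for random and edge-case inputs.
rng = random.Random(1)
cases = [(MASK64, 0, 1), (MASK64, MASK64, 1), (0, 0, 1), (1 << 63, 1 << 63, 0)]
cases += [(rng.getrandbits(64), rng.getrandbits(64), rng.getrandbits(1)) for _ in range(1000)]
mismatches = sum(
    add_with_carry(a, b, c) != ((a + b + c) & MASK64, (a + b + c) >> 64)
    for a, b, c in cases
)
print(mismatches)
```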
Especially since a 64x64-->128 multiplication is not possible either (it would require four 32x32-->64 products and some additions, or three 32x32-->64 products and even more additions, as well as special-case handling), you will likely not be more efficient than with mul and adc (unless, maybe, register pressure is your bottleneck).
As Peter and Mystical suggested, working with smaller limbs (still stored in 64 bits) can be beneficial. On the one hand, with some trickery, you can use FMA for 52x52-->104 products. And also, you can actually add up to 2^k-1 numbers of 64-k bits before you need to carry the upper bits of the previous limbs.

How can I calculate the impact on collision probability when truncating a hash?

I'd like to reduce an MD5 digest from 32 characters down to, ideally closer to 16. I'll be using this as a database key to retrieve a set of (public) user-defined parameters. I'm expecting the number of unique "IDs" to eventually exceed 10,000. Collisions are undesirable but not the end of the world.
I'd like to understand the viability of a naive truncation of the MD5 digest to achieve a shorter key. But I'm having trouble digging up a formula that I can understand (given I have a limited Math background), let alone use to determine the impact on collision probability that truncating the hash would have.
The shorter the better, within reason. I feel there must be a simple formula, but I'd rather have a definitive answer than do my own guesswork cobbled together from bits and pieces I have read around the web.

You can calculate the chance of collisions with this formula:
chance of collision = 1 - e^(-n^2 / (2 * d))
Where n is the number of messages, d is the number of possibilities, and e is the constant e (2.718281828...).

@mypetition's answer is great.
I found a few other equations that are more or less accurate and/or simplified, along with a great explanation and a handy comparison of real-world probabilities:
1 − e^(−k(k−1) / (2N))
k(k−1) / (2N)
k^2 / (2N)
...where k is the number of IDs you'll be generating (the "messages") and N is the largest number that can be produced by the hash digest, or the largest number that your truncated hexadecimal number could represent (technically + 1, to account for 0).
A bit more about "N"
If your original hash is, for example, "38BF05A71DDFB28A504AFB083C29D037" (32 hex chars), and you truncate it down to, say, 12 hex chars (e.g. "38BF05A71DDF"), the largest number you could produce in hexadecimal is "0xFFFFFFFFFFFF" (281474976710655, which is 16^12 − 1, or 256^6 − 1 if you prefer to think in terms of bytes). But since "0" itself counts as one of the numbers you could theoretically produce, you add back that 1, which leaves you simply with 16^12.
So you can think of N as 16 ^ (numberOfHexDigits).
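To put numbers on this, here is a small Python sketch of the first formula, with N = 16^(number of hex digits kept). The function name is made up for this example, and `math.expm1` is used only so that very small probabilities are not rounded to zero.

```python
import math

def collision_probability(num_ids, hex_digits):
    """Approximate chance of at least one collision among num_ids truncated hashes."""
    n_values = 16 ** hex_digits
    # 1 - e^(-k(k-1) / (2N)), computed as -expm1(...) for accuracy near zero.
    return -math.expm1(-num_ids * (num_ids - 1) / (2 * n_values))

for digits in (8, 12, 16):
    print(digits, collision_probability(10_000, digits))
```

For 10,000 IDs this gives roughly a 1% chance of a collision with 8 hex digits, and a chance on the order of 10^-12 with 16 hex digits.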

Enhancing 8 bit images to 16 bit

My objective is to enhance 8-bit images to 16-bit ones. In other words, I want to increase the dynamic range of an 8-bit image. To do that, I can sequentially take multiple 8-bit images with a fixed scene and a fixed camera. To simplify the issue, let's assume they are grayscale images.
Intuitively, I think I can achieve the goal by either:
Multiplying two 8-bit images:
resImage = double(img1) .* double(img2)
Averaging a specified number of 8-bit images:
resImage = mean(images,3)
assuming images(:,:,i) contains the i-th 8-bit image.
After that, I can convert the resulting image to a 16-bit one:
resImage = uint16(resImage)
But before testing these methods, I wonder whether there is another way to do this (other than buying a 16-bit camera), or whether there is literature on this subject that would serve me better.
UPDATE: As the comments below show, I got great information on the drawbacks of the simple averaging above and on image stacks for the enhancement. So it may be a good topic to study after all. Thank you all for your great comments.

This question appears to relate to increasing the Dynamic Range of an image by integrating information from multiple 8 bit exposures into a 16 bit image. This is related to the practice of capturing and combining "image stacks" in astronomical imaging among other fields. An explanation of this practice and how it can both reduce image noise, and enhance dynamic range is available here:
http://keithwiley.com/astroPhotography/imageStacking.shtml
The idea is that successive captures of the same scene are subject to image noise, and this noise leads to stochastic variation of the captured pixel values. In the simplest case these variations can be leveraged by summing and dividing, i.e. averaging the stack, to improve its dynamic range, but the practicality depends very much on the noise characteristics of the camera.

You want to sum many images together, assuming there is no jitter and the camera is steady. Accumulate a large sum and then divide by some amount.
Note that to get a reasonable 16-bit image from an 8-bit source, you'd need to take hundreds of images. Jitter will distort edge information, and the camera has some inherent noise level, which might mean you are essentially 'grinding metal'. In a practical sense, you might get 2 or 3 more bits of data from image summing, but not 8 more. To get 3 more bits would require summing at least 64 images (which adds 6 bits to the sum) and then dividing by 8 (3 bits), as the lower bits are garbage.
The rule of thumb is that to gain b bits of data you need (2^b)^2 images: 3 bits (a factor of 8) means 64 images, 4 bits (a factor of 16) means 256 images, etc.
Here's a link that talks about sampling:
http://electronicdesign.com/analog/understand-tradeoffs-increasing-resolution-averaging
"In fact, it can be shown that the improvement is proportional to the square root of the number of samples in the average."
Note that SNR is a log scale so equating it to bits is reasonable.
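The effect of averaging can be simulated for a single pixel in a few lines of Python. This is a toy model under stated assumptions: the "true" brightness is a fractional value, each capture adds Gaussian noise with a standard deviation of one grey level and then rounds to an 8-bit integer. The function names, the noise level and the seed are choices made for this sketch, and a real camera's noise may behave differently.

```python
import random

def capture_8bit(true_level, noise_sigma, rng):
    """One noisy 8-bit reading of a pixel whose true brightness is true_level."""
    noisy = rng.gauss(true_level, noise_sigma)
    return min(255, max(0, round(noisy)))

def stack_mean(true_level, num_frames, noise_sigma=1.0, seed=0):
    """Average num_frames simulated captures of the same pixel."""
    rng = random.Random(seed)
    total = sum(capture_8bit(true_level, noise_sigma, rng) for _ in range(num_frames))
    return total / num_frames

for frames in (1, 64, 4096):
    print(frames, stack_mean(100.3, frames))
```

A single frame can only report a whole grey level, while the mean of many frames drifts towards the fractional true value, with the error shrinking roughly as the square root of the number of frames. Without noise of at least a fraction of a grey level the rounding would always give the same integer and averaging would gain nothing.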

Understanding cyclic polynomial hash collisions

I have code that uses a cyclic polynomial rolling hash (Buzhash) to compute hash values of n-grams of source code. If I use small hash values (7-8 bits) then there are some collisions, i.e. different n-grams map to the same hash value. If I increase the number of bits in the hash value to, say, 31, then there are 0 collisions: all n-grams map to different hash values.
I want to know why this is so. Do the collisions depend on the number of n-grams in the text, on the number of different characters that an n-gram can have, or on the size of an n-gram?
How does one choose the number of bits for the hash value when hashing n-grams (using rolling hashes)?

How length affects collisions
This is simply a question of counting.
"If I use small hash values (7-8 bits) then there are some collisions"
Well, let's analyse this. With 8 bits, there are 2^8 = 256 possible hash values for any given input. By the pigeonhole principle, once you have hashed 257 distinct inputs, at least one collision is guaranteed. In practice collisions show up far sooner, which is the birthday problem.
"If I increase the bits in the hash value to say 31, then there are 0 collisions - all n-grams map to different hash values."
Well, let's apply the same logic. With 31 bits, we have 2^31 = 2147483648 possible hash values. We can generalise this to:
Let N denote the number of bits we use.
Number of different hash values we can generate (X) = 2^N
(assuming repetition of values is allowed, which it is in this case!)
This is exponential growth, which is why with 8 bits you found a lot of collisions and with 31 bits you found none.
How does this affect collisions?
Well, with a very small number of values, and an equal chance of each of those values being assigned to an input, you have that:
Let A denote the number of different values already generated.
Chance that the next value collides: A / X
where X is the number of possible outputs the hashing algorithm can generate.
When X equals 256, you have a 1/256 chance of a collision once one value has been generated, then a 2/256 chance once two different values have been generated, and so on, until you have generated 255 different values and have a 255/256 chance. After 256 different values it becomes 256/256, or 1, which is a certainty. It usually won't reach this point, because a collision will likely occur much sooner. In fact, the birthday paradox tells us that we can start to expect a collision after about 2^(N/2) values have been generated. In our example, that's after about 16 hashes. We do know, however, that it has to happen by the 257th value at the latest. Which isn't good!
What this means, on a mathematical level, is that for a given number of inputs the chance of a collision is inversely proportional to the number of possible outputs, which is why we need to increase the size of our hash value to a reasonable length.
A note on hashing algorithms
Collisions are unavoidable in general. This is because there is an extremely large number of possible inputs (every possible sequence of character codes) and only a finite number of possible outputs (as demonstrated above).
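The effect of the hash width can be tried out with a small cyclic polynomial hash in Python. This is a minimal, non-rolling sketch: it recomputes each n-gram's hash from scratch, uses a random substitution table from a fixed seed, and all names are hypothetical. It is not the asker's code.

```python
import random

def rotl(x, r, bits):
    """Rotate the bits-wide value x left by r positions."""
    r %= bits
    mask = (1 << bits) - 1
    return ((x << r) | (x >> (bits - r))) & mask

def buzhash(ngram, table, bits):
    """Cyclic polynomial hash of a bytes n-gram: h = rotl(h, 1) XOR table[byte]."""
    h = 0
    for byte in ngram:
        h = rotl(h, 1, bits) ^ table[byte]
    return h

def count_collisions(ngrams, bits, seed=0):
    """Number of distinct n-grams that share a hash value with an earlier one."""
    rng = random.Random(seed)
    table = [rng.getrandbits(bits) for _ in range(256)]
    distinct_ngrams = set(ngrams)
    distinct_hashes = {buzhash(g, table, bits) for g in distinct_ngrams}
    return len(distinct_ngrams) - len(distinct_hashes)

# 1000 distinct 5-byte n-grams: at 8 bits at least 1000 - 256 of them must collide.
ngrams = [f"{i:05d}".encode() for i in range(1000)]
for bits in (8, 16, 31):
    print(bits, count_collisions(ngrams, bits))
```

The count depends on how many distinct n-grams there are relative to 2^bits, not on the n-gram length as such, although longer n-grams over a larger alphabet tend to produce more distinct values to hash.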

If you have hash values of 8 bits, the total possible number of values is 256. That means that if you hash 257 different n-grams there will for sure be at least one collision (and very likely you will get many more collisions, even with fewer than 257 n-grams), and this will happen regardless of the hashing algorithm or the data being hashed.
If you use 32 bits, the total possible number of values is around 4 billion, so the likelihood of a collision is much lower.
'How does one choose the number of bits': I guess it depends on the use of the hash. If it is used to store the n-grams in some kind of hashed data structure (a dictionary), then it should be related to the possible number of 'buckets' of the data structure, e.g. if the dictionary has fewer than 256 buckets then an 8-bit hash is OK.
See this for some background
