Is there a way to verify a common seed to a cumulative sequence of hashes with unknown repetitions between each value presented? - hash

I am writing a variant of the Cuckoo Cycle that uses an adjacency list for presenting solutions from two pairs of 8 bit coordinates, and I am not having any problems finding what I think should be an optimal solver for it, that uses two pairs of head/tail binary search trees to keep track of possible solution nodes, reject (branches) nodes and a binary tree that keeps a list of the candidate cycles as they are being assembled (as I understand it, binary search trees shorten the amount of processing for finding duplicates), but I need to refine the verifier function for solutions.
I see in Cuckoo that there is some process by which it modifies the edges with XOR functions and masks to identify a valid cycle, but I have two issues.
One is that each hash is generated from the previous hash, starting with the nonce, and proving that all offered node/edge pairs are valid derivatives from the nonce seems to me to require the verifier to repeat the hash function each time checking for a match until it gets a hit, which could be up to several thousand, in the worst case. Is there some property that can be used to shortcut this identification process, since unlike protection against DoS, we are providing the salt of the hash?
Second is that even if the presented cycle is perfectly valid, it is possible that one or more of the node/edge pairs in the cycle has a duplicate coordinate. The hashes are 32 bits long, and each coordinate is 8 bits. The answer to this probably has some relation to the previous question also, as having the seed for a hash function is a known security risk because of collisions. So obviously, as well as verifying the nodes are part of a cycle in the lowest possible values in the finite field, I need a way to be sure that a pair does not overlap with another possible, and branching pair.
I will be studying the verifier closer in the Cuckoo Cycle implementation to see if I can figure out how the algorithm ensures it is not approving a cycle that actually has a branch (and thus is invalid), but I thought I'd pop the question on this site in case someone knows better the ways of recognising hashes from a common seed, and if there is any way to recognise a 50% collision between a given coordinate and another one.
Note: After thinking about it for a while, I realised that I could solve the 'fake cycle' with one or more nodes having a branch by simply splitting the heads and tails into separate hashes, subsequent (odd then even), such as Murmur3 16 bit hashes.
And further thinking about it, I realised that Cuckoo Cycle is actually a special type of hash collision search that seeks only collisions that occur only once in the low order of the finite field. I am devising a new scheme called Hummingbird, which instead will not target the smallest numbers (which is also the same thing done by hashcash) but instead will target the most proximate hashes in a chain to the seed nonce. This will mean that attempts to insert branched nodes in the graph of the solution will be discovered in the verification. Which will probably take about 2-5 seconds depending on how deep. These solutions could be eliminated by specifying a maximum hash chain length as part of the consensus.

I just wanted to add that I answered my own question by realising that I am looking for, essentially, a hash collision, in my algorithm, and the simplest solution, with the least bit-twiddling was to make each coordinate a distinct hash in a hash chain (hash of nonce, then hash of hash, etc)
I didn't understand fully that Cuckoo Cycle is essentially a search for partial hash collisions, and when that dawned on me, I realised that the simple solution is to just make it into a search for hash collisions.
I have, from this realisation, moved very quickly forward to figuring out how my variation of Cuckoo can be much more simply implemented, as well as how to structure the B-tree based progressive search algorithm, the difficulty adjustment, and the rest.
I wasn't aware there was a stackexchange specialist site for math, or cryptography, or I would have posted it there instead. I studied FEC a few months ago and that opened the floodgates to a whole bunch of other ideas that led me to getting so worked up about Cuckoo Cycle. I believe I have generalised the Cuckoo Cycle into a generic, parameterisable graph theoretic proof of work and I will get back to finishing my implementation.
Thanks to everyone who submitted an answer, I will upvote as I deem correct, though I have zero or nearly zero rep, for what it's worth.

Related

Comparing hashes to test for collisions

I wish to compare hashes to check for collisions (Yes, I know it is time consuming, but never mind that). In checking for collisions, hashes need to be compared. Is the best method to have a single hash in a variable to compare against or to have a list of all hashes previously generated and compare the latest hash to each item in the list.
I would prefer the first option because it is much faster, but is there a recommended method? Are you less likely to find a collision by using the first method?
Is the best method to have a single hash in a variable to compare against or to have a list of all hashes previously generated and compare the latest hash to each item in the list.
Neither.
I would prefer the first option because it is much faster, but is there a recommended method?
I don't understand why you think the first method might work, but then you haven't fully explained your situation. Still, if you want to detect hash values that repeat, you do indeed need to keep track of already-seen hash values: to do that you don't want to search linearly though a list, and should use a set container to store seen hashes; a hash table - as suggested in a comment by gnasher729 a few hours back - would give O(1) performance e.g. in C++ in your hashes are 64 bit, std::unordered_set<uint64_t>), or a balance binary tree for O(logN) performance (e.g. C++ std::set<uint64_t>).
Are you less likely to find a collision by using the first method?
You're very likely to miss collisions.
All that said, you may want to reexamine your premise. The chance of a good (cryptographic quality) hash function producing collisions closely approaches the odds described by the "birthday paradox". As a rule of thumb, if you have 2^N distinct values to hash you're statistically unlikely to see collisions if your hashes are comfortably more than 2*N bits wide: if you allow enough "comfort", you're more likely to be hit on the noggin by a meteor than have your program see a collision. You mentioned MD5 so I'd expect 128 bits: unless you're storing order-of a quadrillion values or more (literally), it's pretty safe to ignore the potential for collisions.
Do note one important use of hash values where collisions happen more often for a different reason, and that's in hash tables, where even non-colliding hash values may collide at the same bucket index after they're "wrapped" - often a la h % N when N is the number of buckets. In general, it's impractical to ignore the potential for collisions in a hash table, and very unwise to try.

choosing a hash function

I was wondering: what are maximum number of bytes that can safely be hashed while maintaining the expected collision count of a hash function?
For md5, sha-*, maybe even crc32 or adler32.
Your question isn't clear. By "maximum number of bytes" you mean "maximum number of items"? The size of the files being hashed has no relation with the number of collisions (assuming that all files are different, of course).
And what do you mean by "maintaining the expected collision count"? Taken literally, the answer is "infinite", but after a certain number you will aways have collisions, as expected.
As for the answer to the question "How many items I can hash while maintaining the probability of a collision under x%?", take a look at the following table:
http://en.wikipedia.org/wiki/Birthday_problem#Probability_table
From the link:
For comparison, 10^-18 to 10^-15 is the uncorrectable bit error rate of a typical hard disk [2]. In theory, MD5, 128 bits, should stay within that range until about 820 billion documents, even if its possible outputs are many more.
This assumes a hash function that outputs a uniform distribution. You may assume that, given enough items to be hashed and cryptographic hash functions (like md5 and sha) or good hashes (like Murmur3, Jenkins, City, and Spooky Hash).
And also assumes no malevolent adversary actively fabricating collisions. Then you really need a secure cryptographic hash function, like SHA-2.
And be careful: CRC and Adler are checksums, designed to detect data corruption, NOT minimizing expected collisions. They have proprieties like "detect all bit zeroing of sizes < X or > Y for inputs up to Z kbytes", but not as good statistical proprieties.
EDIT: Don't forget this is all about probabilities. It is entirely possible to hash only two files smaller than 0.5kb and get the same SHA-512, though it is extremely unlikely (no collision has ever been found for SHA hashes till this date, for example).
You are basically looking at the Birthday paradox, only looking at really big numbers.
Given a normal 'distribution' of your data, I think you could go to about 5-10% of the amount of possibilities before running into issues, though nothing is guaranteed.
Just go with a long enough hash to not run into problems ;)

best way to resolve collisions in hashing strings

I got asked this question at an interview and said to use a second has function, but the interviewer kept probing me for other answers. Anyone have other solutions?
best way to resolve collisions in hashing strings
"with continuous inserts"
Assuming the inserts are of strings whose contents can't be predicted, then reasonable options are:
Use a displacement list, so you try a number of offsets from the
hashed-to bucket until you find a free bucket (modding by table
size). Displacement lists might look something like { 3, 5, 11,
19... } etc. - ideally you want to have the difference between
displacements not be the sum of a sequence of other displacements.
rehash using a different algorithm (but then you'd need yet another
algorithm if you happen to clash twice etc.)
root a container in the
buckets, such that colliding strings can be searched for. Typically
the number of buckets should be similar to or greater than the
number of elements, so elements per bucket will be fairly small and
a brute-force search through an array/vector is a reasonable
approach, but a linked list is also credible.
Comparing these, displacement lists tend to be fastest (because adding an offset is cheaper than calculating another hash or support separate heap & allocation, and in most cases the first one or two displacements (which can reasonably be by a small number of buckets) is enough to find an empty bucket so the locality of memory use is reasonable) though they're more collision prone than an alternative hashing algorithm (which should approach #elements/#buckets chance of further collisions). With both displacement lists and rehashing you have to provide enough retries that in practice you won't expect a complete failure, add some last-resort handling for failures, or accept that failures may happen.
Use a linked list as the hash bucket. So any collisions are handled gracefully.
Alternative approach: You might want to concider using a trie instead of a hash table for dictionaries of strings.
The up side of this approach is you get O(|S|) worst case complexity for seeking/inserting each string [where |S| is the length of that string]. Note that hash table allows you only average case of O(|S|), where the worst case is O(|S|*n) [where n is the size of the dictionary]. A trie also does not require rehashing when load balance is too high.
Assuming we are not using a perfect hash function (which you usually don't have) the hash tells you that:
if the hashes are different, the objects are distinct
if the hashes are the same, the objects are probably the same (if good hashing function is used), but may still be distinct.
So in a hashtable, the collision will be resolved with some additional checking if the objects are actually the same or not (this brings some performance penalty, but according to Amdahl's law, you still gained a lot, because collisions rarely happen for good hashing functions). In a dictionary you just need to resolve that rare collision cases and assure you get the right object out.
Using another non-perfect hash function will not resolve anything, it just reduces the chance of (another) collision.

Hash function combining - is there a significant decrease in collision risk?

Does anyone know if there's a real benefit regarding decreasing collision probability by combining hash functions? I especially need to know this regarding 32 bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting in the wonderful world of hashing, doing a lot of reading about it. Sorry if I asked a silly question, I haven't even acquired the proper "hash dialect" yet, probably my Google searches regarding this were also poorly formed.
Thanks.
This doesn't make sense combining them in series like that. You are hashing one 32-bit space to another 32-bit space.
In the case of a crc32 collision in the first step, the final result is still a collision. Then you add on any potential collisions in the adler32 step. So it can not get any better, and can only be the same or worse.
To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:
adler32(data) << 32 | crc32(data)
Whether there is significant benefit in doing that, I'm not sure.
Note that the original comment you referred to was storing the hashes independently:
Whichever algorithm you use there is
going to be some chance of false
positives. However, you can reduce
these chances by a considerable margin
by using two different hashing
algorithms. If you were to calculate
and store both the CRC32 and the
Alder32 for each url, the odds of a
simultaneous collision for both hashes
for any given pair of urls is vastly
reduced.
Of course that means storing twice as
much information which is a part of
your original problem. However, there
is a way of storing both sets of hash
data such that it requires minimal
memory (10kb or so) whilst giving
almost the same lookup performance (15
microsecs/lookup compared to 5
microsecs) as Perl's hashes.

Uniquely identifying URLs with one 64-bit number

This is basically a math problem, but very programing related: if I have 1 billion strings containing URLs, and I take the first 64 bits of the MD5 hash of each of them, what kind of collision frequency should I expect?
How does the answer change if I only have 100 million URLs?
It seems to me that collisions will be extremely rare, but these things tend to be confusing.
Would I be better off using something other than MD5? Mind you, I'm not looking for security, just a good fast hash function. Also, native support in MySQL is nice.
EDIT: not quite a duplicate
If the first 64 bits of the MD5 constituted a hash with ideal distribution, the birthday paradox would still mean you'd get collisions for every 2^32 URL's. In other words, the probability of a collision is the number of URL's divided by 4,294,967,296. See http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem for details.
I wouldn't feel comfortable just throwing away half the bits in MD5; it would be better to XOR the high and low 64-bit words to give them a chance to mix. Then again, MD5 is by no means fast or secure, so I wouldn't bother with it at all. If you want blinding speed with good distribution, but no pretence of security, you could try the 64-bit versions of MurmurHash. See http://en.wikipedia.org/wiki/MurmurHash for details and code.
You have tagged this as "birthday-paradox", I think you know the answer already.
P(Collision) = 1 - (2^64)!/((2^64)^n (1 - n)!)
where n is 1 billion in your case.
You will be a bit better using something other then MD5, because MD5 have pratical collusion problem.
From what I see, you need a hash function with the following requirements,
Hash arbitrary length strings to a 64-bit value
Be good -- Avoid collisions
Not necessarily one-way (security not required)
Preferably fast -- which is a necessary characteristic for a non-security application
This hash function survey may be useful for drilling down to the function most suitable for you.
I will suggest trying out multiple functions from here and characterizing them for your likely input set (pick a few billion URL that you think you will see).
You can actually generate another column like this test survey for your test URL list to characterize and select from the existing or any new hash functions (more rows in that table) that you might want to check. They have MSVC++ source code to start with (reference to ZIP link).
Changing the hash functions to suit your output width (64-bit) will give you a more accurate characterization for your application.
If you have 2^n hash possibilities, there's over a 50% chance of collision when you have 2^(n/2) items.
E.G. if your hash is 64 bits, you have 2^64 hash possibilities, you'd have a 50% chance of collision if you have 2^32 items in a collection.
Just by using a hash, there is always a chance of collisions. And you don't know beforehand wether collisions will happen once or twice, or even hundreds or thousands of times in your list of urls.
The probability is still just a probability. Its like throwing a dice 10 or 100 times, what are the chances of getting all sixes? The probability says it is low, but it still can happen. Maybe even many times in a row...
So while the birthday paradox shows you how to calculate the probabilities, you still need to decide if collisions are acceptable or not.
...and collisions are acceptable, and hashes are still the right way to go; find a 64 bit hashing algorithm instead of relying on "half-a-MD5" having a good distribution. (Though it probably has...)