Does SHA256 round trip?

If you take a 32-bit sequence and compute its CRC32, you get another 32-bit sequence as the result; CRC32 of that gives another, and so on. It is easy to show that if you keep doing this, you end up with a single loop through all 2^32 possible 32-bit sequences before starting over.
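For concreteness, the iteration I mean looks roughly like this (a Python sketch, using zlib.crc32 as one possible CRC32 implementation):

    import zlib

    def next_value(x: int) -> int:
        # CRC32 of the 4-byte big-endian encoding of x, giving another 32-bit value.
        return zlib.crc32(x.to_bytes(4, "big"))

    x = 0x12345678          # any starting 32-bit sequence
    for _ in range(10):
        x = next_value(x)   # keep applying CRC32 to its own output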
Simple question: does anyone know if the same holds true (or not) for SHA256, starting with a 256-bit sequence? Would a similar process cycle through a loop of all 2^256 possible 256-bit sequences before starting over? Or are there known (or likely) shorter loops within this hash?
Brian

SHA256 was not designed to have the property of a single 2^256 loop. However, as far as I know, nobody has proven that no such loop exists. No shorter loops are known either, because anybody who found one would also have found a collision, and by the nature of a cryptographic hash function that must be difficult.
So, since nobody has proven otherwise, yes, there is some probability that the 2^256 cycle exists. However, it's extremely unlikely and I'm willing to bet my left testicle on it. :-)
Let me also note that, IMO, designing a cryptographic hash function that has a single 2^256 loop would be extremely difficult even for the best crypto experts.
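To make the loop/collision connection concrete: on a drastically truncated version of the hash you can actually run a cycle search and pull a collision out of it. A rough Python sketch (truncating SHA-256 to 24 bits purely so it finishes quickly; with the full 256-bit output this search is hopeless):

    import hashlib

    def f(x: bytes) -> bytes:
        # SHA-256 truncated to 24 bits, so the search below takes milliseconds.
        return hashlib.sha256(x).digest()[:3]

    def find_collision(seed: bytes):
        # Floyd's cycle-finding: iterating from any seed eventually enters a loop.
        tortoise, hare = f(seed), f(f(seed))
        while tortoise != hare:
            tortoise = f(tortoise)
            hare = f(f(hare))
        # Walk to the loop entry point; its two distinct predecessors (one on the
        # tail, one inside the loop) hash to the same value -- a collision.
        # (Assumes the seed itself is not already on the loop, the typical case.)
        prev_a, prev_b = seed, hare
        a, b = f(prev_a), f(prev_b)
        while a != b:
            prev_a, prev_b = a, b
            a, b = f(a), f(b)
        return prev_a, prev_b

    x, y = find_collision(b"seed")
    assert x != y and f(x) == f(y)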

Related

Is there a way to verify a common seed to a cumulative sequence of hashes with unknown repetitions between each value presented?

I am writing a variant of Cuckoo Cycle that uses an adjacency list for presenting solutions built from two pairs of 8-bit coordinates. I am not having any problems finding what I think should be an optimal solver for it: it uses two pairs of head/tail binary search trees to keep track of possible solution nodes and rejected (branch) nodes, plus a binary tree that keeps a list of the candidate cycles as they are being assembled (as I understand it, binary search trees reduce the processing needed to find duplicates). But I need to refine the verifier function for solutions.
I see in Cuckoo that there is some process by which it modifies the edges with XOR functions and masks to identify a valid cycle, but I have two issues.
One is that each hash is generated from the previous hash, starting with the nonce, so proving that all offered node/edge pairs are valid derivatives of the nonce seems to require the verifier to repeat the hash function, checking for a match each time until it gets a hit, which could take up to several thousand iterations in the worst case. Is there some property that can be used to shortcut this identification process, given that, unlike in DoS protection, we are providing the salt of the hash?
Second is that even if the presented cycle is perfectly valid, it is possible that one or more of the node/edge pairs in the cycle has a duplicate coordinate. The hashes are 32 bits long, and each coordinate is 8 bits. The answer to this probably relates to the previous question as well, since having the seed of a hash function is a known security risk because of collisions. So, as well as verifying that the nodes form a cycle within the lowest possible values in the finite field, I need a way to be sure that a pair does not overlap with another possible, branching pair.
I will be studying the verifier in the Cuckoo Cycle implementation more closely to see if I can figure out how the algorithm ensures it is not approving a cycle that actually has a branch (and is thus invalid), but I thought I'd pop the question on this site in case someone knows better ways of recognising hashes derived from a common seed, and whether there is any way to recognise a 50% collision between one coordinate and another.
Note: After thinking about it for a while, I realised that I could solve the 'fake cycle' problem (one or more nodes having a branch) by simply splitting the heads and tails into separate, consecutive hashes (odd then even), such as 16-bit Murmur3 hashes.
And thinking about it further, I realised that Cuckoo Cycle is actually a special type of hash collision search that seeks only collisions occurring exactly once in the low order of the finite field. I am devising a new scheme called Hummingbird, which will not target the smallest numbers (which is also what hashcash does) but instead the hashes in a chain most proximate to the seed nonce. This means attempts to insert branched nodes into the solution graph will be discovered during verification, which will probably take about 2-5 seconds depending on depth. Such solutions could be eliminated by specifying a maximum hash chain length as part of the consensus.
I just wanted to add that I answered my own question by realising that what I am looking for in my algorithm is, essentially, a hash collision, and the simplest solution, with the least bit-twiddling, was to make each coordinate a distinct hash in a hash chain (hash of nonce, then hash of hash, etc.).
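For what it's worth, a bare-bones sketch of that hash-chain idea in Python (SHA-256 truncated to 32 bits is just a stand-in for whatever hash the scheme actually uses):

    import hashlib

    def hash_chain(nonce: bytes, length: int):
        # h1 = H(nonce), h2 = H(h1), ... each truncated to 32 bits.
        out, h = [], nonce
        for _ in range(length):
            h = hashlib.sha256(h).digest()[:4]
            out.append(h)
        return out

    def verify_members(nonce: bytes, claimed, max_len: int) -> bool:
        # The verifier has no shortcut: it rewalks the chain from the nonce and
        # ticks off each claimed value, up to an agreed maximum chain length
        # (the consensus limit mentioned above).
        remaining = set(claimed)
        h = nonce
        for _ in range(max_len):
            h = hashlib.sha256(h).digest()[:4]
            remaining.discard(h)
            if not remaining:
                return True
        return False

    chain = hash_chain(b"nonce", 1000)
    assert verify_members(b"nonce", [chain[3], chain[250]], 1000)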
I didn't understand fully that Cuckoo Cycle is essentially a search for partial hash collisions, and when that dawned on me, I realised that the simple solution is to just make it into a search for hash collisions.
I have, from this realisation, moved very quickly forward to figuring out how my variation of Cuckoo can be much more simply implemented, as well as how to structure the B-tree based progressive search algorithm, the difficulty adjustment, and the rest.
I wasn't aware there was a stackexchange specialist site for math, or cryptography, or I would have posted it there instead. I studied FEC a few months ago and that opened the floodgates to a whole bunch of other ideas that led me to getting so worked up about Cuckoo Cycle. I believe I have generalised the Cuckoo Cycle into a generic, parameterisable graph theoretic proof of work and I will get back to finishing my implementation.
Thanks to everyone who submitted an answer, I will upvote as I deem correct, though I have zero or nearly zero rep, for what it's worth.

hash algorithm to produce a nice key from a messy value

I'm looking for an algorithm which given a messy value (like a url with a querystring) produces a value which can be used as a key.
Ideally, it should have a fairly low collision rate, produce a value which is shorter than the input, be made of alphanumerics (a-z 1-9), and create the same output given the same input (though not necessarily reversible).
Anything come to mind?
Several excellent examples exist as industry standards, such as SHA-2 and MD5. There will be a library for running either of these in any language you are using.
MD5 or the SHA functions are fine (however, MD5 is less secure in terms of collision resistance; I'd suggest SHA1, or the best but longest option, SHA512).
They're implemented in OpenSSL for C, or CommonCrypto in Objective-C, etc.
You might also consider using CRC64 as a hash function, because it has a really small footprint and a rather good collision rate -- compared to its length, of course.
Quite a lot: MD5, SHA-1, SHA-256, ...
If, say, MD5's 32 characters are too long for you, take a substring and just use the first x characters.
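As a concrete sketch of that hash-then-truncate idea (Python's hashlib here, purely as an example; any decent hash library works the same way):

    import hashlib

    def key_for(value: str, length: int = 16) -> str:
        # Deterministic, alphanumeric (hex) key, shorter than a typical URL.
        # Truncating to `length` hex chars leaves a 4*length-bit hash, so the
        # collision risk grows as you shorten it.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]

    key_for("http://example.com/page?session=abc123&ref=xyz")   # 16 hex characters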
Of course there are many other solutions like CRC32 that you could use, but it's hard to say what will be good for your use case without knowing what your goal is...

Can you show me two actual, non-trivial strings that produce the same MD5 or SHA1 hash?

...and if not, why not?
So here's the question behind the question.
I understand that the likelihood of accidental collisions in MD5 and SHA1 is small (though less likely in SHA1 than in MD5). I also understand that deliberate collisions are theoretically possible. Is it practically possible? Could I go through some process to deliberately generate two messages with the same hash, in either of these algorithms? What process would I go through?
Collisions necessarily exist for a given hash function, in a mathematical sense: there are more possible inputs than possible outputs, so there must be two inputs which map to the same output. Now proving the existence of a collision, and actually finding one, are two different things. If I drop a diamond in the middle of the ocean, I positively know that there is now a diamond somewhere in the ocean -- but I am quite at a loss if I want to recover it.
For a "generic" hash function with an output of n bits, there are generic methods to find a collision, with average cost 2n/2 evaluations of the function (see this page). Depending on n, this can range from the easy to the totally unfeasible. MD5 has an output of 128 bits, and 264 is "quite high": you can do it, but it will require a few thousands of machines and months of computations.
Now there are known weaknesses in MD5, i.e. some internal structure which can be exploited to produce collisions much more easily. The best attack on MD5 known so far requires a bit less than 2^21 function invocations, so it is a matter of a few seconds (at most) on a basic PC. Omri points in his response to a great example of an MD5 collision, in which the colliding messages are actually executable files with widely different behaviors.
For SHA-1, the output has size 160 bits. This means that a generic collision attack has cost about 2^80, which is not attainable with existing technology (well, Mankind could do it, but certainly not discreetly: it should be doable with, say, the equivalent of one year of budget for the whole US Army). However, SHA-1, like MD5, has known weaknesses. Right now, these weaknesses are still theoretical, in that they lead to a collision attack with cost 2^61, which is too expensive for any single crypto research lab, and thus has not been fully conducted yet (there was an announced attack with cost 2^51 but it seems that it was a dud -- the analysis was flawed). So no actual collision to show (but researchers are pretty sure that the 2^61 attack is correct and would work, if someone found the budget).
With SHA-256, there is no known weakness, and the 256-bit output size implies a generic cost of 2^128, far away into the undoable with today's and tomorrow's technology.

Hash function combining - is there a significant decrease in collision risk?

Does anyone know if there's a real benefit regarding decreasing collision probability by combining hash functions? I especially need to know this regarding 32 bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting in the wonderful world of hashing, doing a lot of reading about it. Sorry if I asked a silly question, I haven't even acquired the proper "hash dialect" yet, probably my Google searches regarding this were also poorly formed.
Thanks.
Combining them in series like that doesn't make sense: you are just mapping one 32-bit space onto another 32-bit space.
In the case of a crc32 collision in the first step, the final result is still a collision; then you add on any collisions introduced by the adler32 step itself. So the result cannot get any better, and can only be the same or worse.
To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:
adler32(data) << 32 | crc32(data)
Whether there is significant benefit in doing that, I'm not sure.
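A sketch of that combination in Python (zlib provides both checksums; the assumption here is that data is already a byte string):

    import zlib

    def combined64(data: bytes) -> int:
        # Two independent 32-bit checksums packed into one 64-bit value.
        # Both are non-cryptographic, so this only helps against accidental
        # collisions, not deliberate ones.
        return (zlib.adler32(data) << 32) | zlib.crc32(data)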
Note that the original comment you referred to was storing the hashes independently:
Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Adler32 for each url, the odds of a simultaneous collision for both hashes for any given pair of urls is vastly reduced.
Of course that means storing twice as much information, which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.

Uniquely identifying URLs with one 64-bit number

This is basically a math problem, but very programming related: if I have 1 billion strings containing URLs, and I take the first 64 bits of the MD5 hash of each of them, what kind of collision frequency should I expect?
How does the answer change if I only have 100 million URLs?
It seems to me that collisions will be extremely rare, but these things tend to be confusing.
Would I be better off using something other than MD5? Mind you, I'm not looking for security, just a good fast hash function. Also, native support in MySQL is nice.
EDIT: not quite a duplicate
If the first 64 bits of the MD5 constituted a hash with ideal distribution, the birthday paradox would still mean collisions become likely once you approach 2^32 URLs. More precisely, for n URLs the probability of at least one collision is roughly n^2 / 2^65, which works out to about 2.7% for 1 billion URLs and about 0.03% for 100 million. See http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem for details.
I wouldn't feel comfortable just throwing away half the bits in MD5; it would be better to XOR the high and low 64-bit words to give them a chance to mix. Then again, MD5 is by no means fast or secure, so I wouldn't bother with it at all. If you want blinding speed with good distribution, but no pretence of security, you could try the 64-bit versions of MurmurHash. See http://en.wikipedia.org/wiki/MurmurHash for details and code.
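A quick sketch of that fold in Python, in case it helps (hashlib for MD5 is an assumption here; any MD5 implementation gives the same 16 bytes):

    import hashlib

    def md5_folded64(data: bytes) -> int:
        # XOR the high and low 64-bit halves of the MD5 digest rather than
        # simply discarding one of them.
        d = hashlib.md5(data).digest()
        return int.from_bytes(d[:8], "big") ^ int.from_bytes(d[8:], "big")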
You have tagged this as "birthday-paradox", so I think you know the answer already.
P(collision) = 1 - (2^64)! / ((2^64)^n * (2^64 - n)!)
where n is 1 billion in your case.
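For back-of-the-envelope numbers, the usual approximation of that expression is easy to evaluate directly (Python sketch):

    import math

    def collision_probability(n: int, bits: int = 64) -> float:
        # Birthday approximation: P ~= 1 - exp(-n*(n-1) / (2*N)) with N = 2^bits.
        return 1.0 - math.exp(-n * (n - 1) / (2.0 * 2 ** bits))

    collision_probability(10 ** 9)   # ~0.027  for 1 billion URLs
    collision_probability(10 ** 8)   # ~0.0003 for 100 million URLs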
You would be a bit better off using something other than MD5, because MD5 has practical collision problems.
From what I see, you need a hash function with the following requirements,
Hash arbitrary length strings to a 64-bit value
Be good -- Avoid collisions
Not necessarily one-way (security not required)
Preferably fast -- which is a necessary characteristic for a non-security application
This hash function survey may be useful for drilling down to the function most suitable for you.
I suggest trying out multiple functions from there and characterizing them for your likely input set (pick a few billion URLs that you think you will see).
You can actually generate another column like this test survey for your test URL list to characterize and select from the existing or any new hash functions (more rows in that table) that you might want to check. They have MSVC++ source code to start with (reference to ZIP link).
Changing the hash functions to suit your output width (64-bit) will give you a more accurate characterization for your application.
If you have 2^n hash possibilities, there's over a 50% chance of collision when you have 2^(n/2) items.
E.g., if your hash is 64 bits, you have 2^64 possible hash values, so you'd have a 50% chance of collision once you have 2^32 items in a collection.
Just by using a hash, there is always a chance of collisions. And you don't know beforehand whether collisions will happen once or twice, or even hundreds or thousands of times in your list of urls.
The probability is still just a probability. It's like throwing a die 10 or 100 times: what are the chances of getting all sixes? The probability says it is low, but it can still happen. Maybe even many times in a row...
So while the birthday paradox shows you how to calculate the probabilities, you still need to decide if collisions are acceptable or not.
...and if collisions are acceptable, hashes are still the right way to go; find a 64-bit hashing algorithm instead of relying on "half-a-MD5" having a good distribution. (Though it probably does...)