Is there an algorithm that will yield the same hash for two numbers, no matter what order they are in?
For example, hashing 3268 and 2642 should yield the same result as hashing 2642 and 3268.
Is this possible?
Of course, XOR does that.
3268^2642 == 2642^3268
There are a lot more (addition, multiplication, basically any commutative operation), but XOR is the one usually used in hashing anyway (because it's easy to "unhash").
Hash the two numbers separately (using an integer-to-integer hash of your choice), and then either add or xor the results.
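For instance (a minimal Python sketch; the mixing function is just one arbitrary choice of integer-to-integer hash, not something prescribed by the answer above):

def mix(n):
    # A simple multiply-and-xor-shift mixer; 0x9E3779B97F4A7C15 is the 64-bit
    # golden-ratio constant. Any decent integer-to-integer hash would do here.
    n = (n * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF
    return n ^ (n >> 31)

def unordered_hash(a, b):
    # XOR is commutative, so hashing each number and XOR-ing the results
    # gives the same value regardless of argument order.
    return mix(a) ^ mix(b)

assert unordered_hash(3268, 2642) == unordered_hash(2642, 3268)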
You could add or XOR the two numbers before hashing them.
I often see or hear of modulus being used as a last step of hashing or after hashing. e.g. h(input)%N where h is the hash function and % is the modulus operator. If I am designing a hash table, and want to map a large set of keys to a smaller space of indices for the hash table, doesn't the modulus operator achieve that? Furthermore, if I wanted to randomize the distribution across those locations within the hash table, is the remainder generated by modulus not sufficient? What does the hashing function h provide on top of the modulus operator?
I often see or hear of modulus being used as a last step of hashing or after hashing. e.g. h( input ) % N where h is the hash function and % is the modulus operator.
Indeed.
If I am designing a hash table, and want to map a large set of keys to a smaller space of indices for the hash table, doesn't the modulus operator achieve that?
That's precisely the purpose of the modulo operator: to restrict the range of array indexes, so yes.
But you cannot simply use the modulo operator by itself: it requires an integer value, and you cannot take the "modulo of a string over N" or the "modulo of an object-graph over N"[1].
Furthermore, if I wanted to randomize the distribution across those locations within the hash table, is the remainder generated by modulus not sufficient?
No, it is not sufficient. The modulo operator doesn't give you pseudorandom output, nor does it have any kind of avalanche effect, which means that similar input values will have similar output values. That results in clustering in your hashtable bins, which greatly increases the likelihood of hash collisions and forces you into slower techniques like linear probing, defeating the purpose of a hashtable because you lose O(1) lookup times.
What does the hashing function h provide on top of the modulus operator?
The domain of h can be anything, especially non-integer values.
[1] Technically speaking, this is possible if you use the value of the memory address of an object (i.e. an object pointer), but that doesn't work if you have hashtable keys that don't use object identity, such as a stack-allocated object or custom struct.
First, the hash function's primary purpose is to turn something that's not a number into a number. Even if you just use modulus after that to get a number in your range, getting the number is still the first step and is the responsibility of the hash function. If you're hashing integers and you just use the integers as their own hashes, it isn't that there's no hash function, it's that you've chosen the identity function as your hash function. If you don't write out the function, that means you inlined it.
Second, the hash function can provide a more unpredictable distribution to reduce the likelihood of unintentional collisions. The data people work with often contain patterns and if you're just using a simple identity function with modulus, the pattern in inputs may be such that the modulus is more likely to cause collisions. The hash function presents an opportunity to break this up so it becomes unlikely that modulus exposes patterns in the original data sequence.
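To illustrate that point, here is a small Python sketch (the bucket count, the patterned keys, and the use of BLAKE2 as the mixing hash are all arbitrary choices for the demonstration):

import hashlib

N = 8                               # number of hashtable buckets
keys = [16, 32, 48, 64, 80, 96]     # patterned input: all multiples of 16

# Identity "hash" + modulus: every key lands in bucket 0.
print([k % N for k in keys])        # [0, 0, 0, 0, 0, 0]

def h(key):
    # Mix the key first: hash its textual form and take 64 bits of the digest.
    return int.from_bytes(hashlib.blake2b(str(key).encode(), digest_size=8).digest(), "big")

# The same keys, mixed before the modulus: no longer piled into a single bucket.
print([h(k) % N for k in keys])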
Interestingly, I haven't found much information about any test or experiment on the collision chances of a single 512-bit hash like Whirlpool versus a concatenation of four 128-bit hashes like MD5, SHA-1, etc.
Four 128-bit hashes all appearing the same seems less probable than a single 512-bit hash colliding, at least when the data being hashed is fairly small, merely around 100 characters on average.
But that's just a guess with no basis, because I haven't performed any tests. What do you think about it?
Edit: it's like
512-bit hash vs. 128-bit hash . 128-bit hash . 128-bit hash . 128-bit hash (four 128-bit hashes concatenated)
Edit 2
I want to use the hash for this (index on url or hashing considering RAM), and the purpose is to minimize the possibility of collision, because I want to set the hash column as unique instead of the url column.
Edit 3
Please note that the purpose of this question is to find a way to minimize the possibility of collision. Having said that, why do I need to focus so much on minimizing collisions? That is where the description in Edit 2 comes in, which leads to a solution that uses less RAM. So the interest is both in minimizing collisions and in lower RAM usage, but the prime focus of this question is lowering the possibility of collision.
It sounds like you want to compare the collision behaviour of:
hash512(x)
with the collision behaviour of:
hash128_a(x) . hash128_b(x) . hash128_c(x) . hash128_d(x)
where "." denotes concatenation, and hash128_a, hash128_b, etc. are four different 128-bit hash algorithms.
The answer is: it depends entirely on the properties of the individual hashes involved.
Consider, for instance, that the 128-bit hash functions could be implemented as:
uint128_t hash128_a(T x) { return hash512(x)[ 0:127]; }
uint128_t hash128_b(T x) { return hash512(x)[128:255]; }
uint128_t hash128_c(T x) { return hash512(x)[256:383]; }
uint128_t hash128_d(T x) { return hash512(x)[384:511]; }
In which case, the collision behaviour would be identical.
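In runnable form, that pseudocode might look like this in Python (a sketch using hashlib; the input value is made up):

import hashlib

def hash512(x: bytes) -> bytes:
    return hashlib.sha512(x).digest()                       # 64 bytes = 512 bits

def hash128_a(x: bytes) -> bytes: return hash512(x)[0:16]   # bits 0..127
def hash128_b(x: bytes) -> bytes: return hash512(x)[16:32]  # bits 128..255
def hash128_c(x: bytes) -> bytes: return hash512(x)[32:48]  # bits 256..383
def hash128_d(x: bytes) -> bytes: return hash512(x)[48:64]  # bits 384..511

x = b"example input"
concatenated = hash128_a(x) + hash128_b(x) + hash128_c(x) + hash128_d(x)
assert concatenated == hash512(x)   # identical output, hence identical collision behaviour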
The classical article to read on that question is due to Hoch and Shamir. It builds on previous discoveries, especially by Joux. The bottom line is the following: if you take four hash functions with a 128-bit output, and the four hash functions use the Merkle-Damgård construction, then finding a collision for the whole 512-bit output is no more difficult than finding a collision for one of the hash functions. MD5, SHA-1... use the MD construction.
On the other hand, if some of your hash functions use a distinct structure, in particular with a wider running state, the concatenation could yield a stronger function. See the example from @Oli above: if all four functions are SHA-512 with some surgery on the output, then the concatenated hash function could be plain SHA-512.
The only sure thing about the concatenation of four hash functions is that the result will be no less collision-resistant than the strongest of the four. This has been used within SSL/TLS, which, up to version 1.1, internally uses both MD5 and SHA-1 concurrently in an attempt to resist breaks in either.
512 bits is 512 bits. The only difference lies in the imperfections of the hashes. The best overall hash would be a single 512-bit hash using the best algorithm available.
Edit to add clarification, because it's too long for a comment:
An ideal hash maps content uniformly onto x bits. If you have 4 (completely independent) x-bit hashes, that maps the file uniformly onto 4x bits; a 4x-bit hash still maps the same file uniformly onto 4x bits. 4x bits is 4x bits; as long as it's perfectly uniform, it doesn't matter whether it comes from one (4x) hash function or 4 (x). However, no hash can be completely ideal, so you want the most uniform obtainable distribution, and if you use 4 different functions, only 1 can be the closest to optimal so you have x optimal bits and 3x suboptimal, whereas a single algorithm can cover the entire 4x space with the most optimal distribution.
I suppose it is possible that enough of the larger algorithms could have subsets of bits that are more uniformly distributed than a single 512-bit hash, and that those could be combined to get more uniformity, but that seems like a great deal of extra research and implementation for little potential benefit.
If you are comparing the concatenation of four different 'ideal' 128-bit hashing algorithms with one ideal 512-bit hashing algorithm, then yes, both methods give you the same probability of a collision. Using MD5 would make it easier to crack a hash, though. If an attacker knew, for example, that you were doing MD5 + MD5 with salt + MD5 with another salt, then that would be much easier to crack via an MD5 collision attack. Look here for more information about hash functions that have known attacks.
Say a known SHA-1 hash was calculated by concatenating several chunks of data and that the order in which the chunks were concatenated is unknown. The straightforward way to find the order of the chunks that gives the known hash would be to calculate a SHA-1 hash for each possible ordering until the known hash is found.
Is it possible to speed this up by calculating an SHA1 hash separately for each chunk and then find the order of the chunks by only manipulating the hashes?
In short, No.
If you are using SHA-1, then due to the avalanche effect, any tiny change in the plaintext (in your case, the order of your chunks) alters the corresponding SHA-1 hash significantly.
Say you have four chunks: A, B, C and D.
The SHA-1 hash of A+B+C+D (concatenated) is supposed to be uncorrelated with the SHA-1 hashes of A, B, C and D computed separately.
Since they are unrelated, you cannot draw any relationship between a concatenation (A+B+C+D, B+C+A+D, etc.) and the individual chunks (A, B, C or D).
If you could identify any such relationship, the SHA-1 hashing algorithm would be in trouble.
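To see that concretely, here is a small Python sketch with made-up chunks (the chunk contents are placeholders):

import hashlib

chunks = [b"chunk-A", b"chunk-B", b"chunk-C", b"chunk-D"]

# Swapping just two chunks gives a completely different, unrelated digest.
print(hashlib.sha1(b"".join(chunks)).hexdigest())
print(hashlib.sha1(b"".join([chunks[1], chunks[0], chunks[2], chunks[3]])).hexdigest())

# The per-chunk hashes tell you nothing about either of the digests above.
for c in chunks:
    print(c.decode(), hashlib.sha1(c).hexdigest())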
Practical answer: no. If the hash function you use is any good, then it is supposed to look like a random oracle, whose output on a given input is totally unknown until that input is tried. So you cannot infer anything from the hashes you compute until you hit the exact input ordering that you are looking for. (Strictly speaking, there could exist a hash function which has the usual properties of a hash function, namely collision and preimage resistance, without being a random oracle, but departing from the RO model is still considered a hash function weakness. Still more strictly speaking, it is slightly improper to talk about a random oracle for a single, unkeyed function.)
Theoretical answer: it depends. Assuming, for simplicity, that you have N chunks of 512 bits, then you can arrange for the cost not to exceed N*2^160 elementary evaluations of SHA-1, which is lower than N! when N >= 42. The idea is that the running state of SHA-1, between two successive blocks, is limited to 160 bits. Of course, that cost is ridiculously infeasible anyway. More generally, your problem is about finding a preimage to SHA-1 with inputs in a custom set S (the N! sequences of your N chunks), so the cost has a lower bound of the size of S and the preimage resistance of SHA-1, whichever is lower. The size of S is N!, which grows very fast when N is increased. SHA-1 has no known weakness with regards to preimages, so its resistance is still assumed to be about 2^160 (since it has a 160-bit output).
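The N >= 42 threshold is easy to check (a quick Python comparison of N! against N*2^160):

from math import factorial

# Smallest N for which trying all N! orderings costs more than the
# N * 2^160 SHA-1 evaluations mentioned above.
for n in (40, 41, 42, 43):
    print(n, factorial(n) > n * 2**160)
# prints False for 40 and 41, True for 42 and 43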
Edit: this kind of question would be appropriate on the proposed "cryptography" stack exchange, when (if) it is instantiated. Please commit to help create it!
Depending on your hashing library, something like this may work: Say you have blocks A, B, C, and D. You can process the hash for block A, and then clone that state and calculate A+B, A+C, and A+D without having to recalculate A each time. And then you can clone each of those to calculate A+B+C and A+B+D from A+B, A+C+B and A+C+D from A+C, and so on.
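In Python's hashlib, for instance, that cloning might look like this (a sketch; the block contents are placeholders):

import hashlib

A, B, C = b"block-A", b"block-B", b"block-C"

state_a = hashlib.sha1()
state_a.update(A)              # the work for A is done exactly once

state_ab = state_a.copy()      # clone the state after A...
state_ab.update(B)             # ...and extend it to A+B

state_ac = state_a.copy()
state_ac.update(C)             # A+C without re-hashing A

state_abc = state_ab.copy()
state_abc.update(C)            # A+B+C without re-hashing A+B

assert state_abc.hexdigest() == hashlib.sha1(A + B + C).hexdigest()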
Nope. Calculating the complete SHA-1 hash requires that the chunks be fed in order. The calculation of the next block requires the output of the current one. If that weren't true, it would be much easier to manipulate documents so that you could reorder the chunks at will, which would greatly decrease the usefulness of the algorithm.
I've been tasked with implementing an XOR hash for a variable length binary string in Perl; the length can range from 18 up to well over 100. In my understanding of it, I XOR the binary string I have with a key. I've read two different applications of this online:
One of the options is if the length of my key is shorter than the string, I divide up the string into blocks that are the length of the key; these are then all folded together (so the length of the resulting hash would be the length of the key).
I've also read that you just XOR the key across each key-length block of the string (so the resulting hash would be the length of the string).
Is one of these more correct than the other? This is for hashing values in an index, so I'm inclined to think the first option (which could produce shorter hashes) would be better.
Finally, is there a good way to generate a sufficiently random key? And is there a good length to choose for the key based on the length of the strings to be hashed?
EDIT: By the way, I am very aware of how badly this hash performs. It's strictly for comparison purposes. :)
One other alternative, from here (search for XOR hashing).
Assuming the hash is supposed to be x bytes long, break the message into blocks of x bytes and XOR them together. This is effectively the same as using method 1 with a key of x zero bytes (or, alternatively, starting with a key of the first x bytes of the string and ignoring those first bytes of the string; all manner of fun ways to think about it).
(Also note what is said there about XOR hashing, namely that it is bad. Very bad. Roughly: it's better than the alternatives, but it is not sufficient for a lot of what hashing is used for.)
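For what it's worth, a minimal sketch of that block-folding, in Python rather than Perl (the block size, the zero-byte padding, and the sample input are all arbitrary choices):

def xor_fold(data: bytes, block_size: int) -> bytes:
    # Pad with zero bytes so the length is a multiple of the block size.
    if len(data) % block_size:
        data += b"\x00" * (block_size - len(data) % block_size)
    result = bytearray(block_size)
    # XOR each block into the running result; this is method 1 with an all-zero key.
    for i in range(0, len(data), block_size):
        result = bytearray(a ^ b for a, b in zip(result, data[i:i + block_size]))
    return bytes(result)

print(xor_fold(b"110010011010100101", 4).hex())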
EDIT: One other small thing: if method 1 uses the same key across all binary strings that are hashed, then it doesn't really matter what the key is. XOR'ing against a constant is akin to, say, ROT13. <sarcasm>Alternatively, if you use SHA1 to derive a key per string... that might make the XOR hash much better.</sarcasm>
key xor key == 0 //always
key xor (((key xor msg1) xor msg2) xor msg3)
== (msg1 xor msg2 xor msg3)
Generally you want your hash values to all be a consistent length. The second method you describe sounds like encryption, where you want to recover your data; the first is a one-way hash.
XOR is not a really good way to hash:
Option 1 is sort of a hash, since you really can't get the original data back, with or without a key. I suggest using SHA-2 (224/256/384/512), MD5, RIPEMD-160 or Whirlpool if you can.
Option 2 is an XOR cipher with a repeating key; it is definitely not a hash.
As for generating random numbers, you can find programs that generate irrational numbers in hex (like pi: 3.243F6A8885A308D313198A2E03707344A4093822299...).
The first technique can be used to create a quick and dirty hash of the string.
The second technique can be used to create a quick, dirty and terribly insecure symmetric encryption of the string.
If you want a hash, use the first method (or even better, pick an existing hash function off-the-shelf.)
The randomness of the key isn't going to be your biggest issue - the whole technique is insecure.
The longer the key, the more distinct hash values you will get, and the less likely you are to have a collision. It doesn't take long before collisions are very rare for moderately sized data sets.
If you want to perform a 'hash' that only uses XOR, I'd simply split the string up into blocks of some predetermined size X. Don't forget to somehow compensate for when the input string is smaller than X.
This is more of a cryptography theory question, but is it possible that the result of a hash algorithm will ever be the same value as the source? For example, say I have a string:
baf34551fecb48acc3da868eb85e1b6dac9de356
If I get the SHA1 hash on it, the result is:
4d2f72adbafddfe49a726990a1bcb8d34d3da162
In theory, is there ever a case where these two values would match? I'm not asking about SHA1 specifically here - it's just my example. I'm just wondering if hashing algorithms are built in such a way as to prevent this.
Well, it would depend on the hashing algorithm - but I'd be surprised to see anything explicitly prevent this. After all, it really shouldn't matter.
I suspect it's very unlikely to happen, of course (for cryptographic hashes)... but even if it does, that shouldn't cause a problem.
For non-crypto hashes (used in hash tables etc) it would be perfectly reasonable to return the source value in some cases. For example, in Java, Integer.hashCode() just returns the embedded value.
Sure, the Python hashing algorithm for integers returns the value of the integer. So hash(1) == 1.
Given a good hashing algorithm, one that returns seemingly random output, I believe there should be on average one input that gives itself as the output. Say the hash can give N possible outputs; then only those N values, taken as inputs, could possibly hash to themselves. For each of those, the odds of the output matching the input are 1/N, so the expected number of fixed points is N * (1/N), or 1.
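That expectation is easy to check with a quick simulation (a Python sketch over a toy random mapping; N and the trial count are arbitrary):

import random

N = 1000          # size of the output space
trials = 2000     # number of random "hash functions" to sample

total_fixed_points = 0
for _ in range(trials):
    # A toy hash: map each of the N possible inputs to a uniformly random output.
    h = [random.randrange(N) for _ in range(N)]
    total_fixed_points += sum(1 for x in range(N) if h[x] == x)

print(total_fixed_points / trials)   # hovers around 1, matching the N * (1/N) argument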
A hash function might be defined to avoid 'fixed points' where hash(x) == x, but your hash-quine differs a little in that you're taking the string representation in hex of the hash rather than the raw binary. It would, I think, be infeasible to design a hash that could frustrate that, and it's mathematically less interesting since it depends on the arbitrary mapping of 0-F to ASCII character codes.
See Is there an MD5 Fixed Point where md5(x) == x? for a discussion about fixed points in MD5. The probability calculation would be equally true for hex hash-quines and any other hash function with 128 bits of output.