I saw that being deterministic is one of the properties of hashing. So can we predict the data on which a hashing algorithm was performed by checking the hash output value?
E.g., if my friend and I use the same hashing algorithm on our individual data, and at some point we both get ABCD as the hash output value, then we can each tell which data the other is hashing, right?
I have an algorithm to one-hot encode minHashed genomes and I am seeking opinions on whether I have constructed it correctly based on the nature of minHashing. There's some disagreement between myself and a collaborator and we are trying to find the correct approach.
I have used MASH (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) to minHash a database of raw genetic sequence reads (fastq files) for 1,000 samples. In summary, for one sample this produces a sketch of 2000 hash functions, where each hash function encodes a 21-kmer sequence of alleles (alphabet {ATCG}).
I one-hot encode these sketches by comparing the hash functions in each new sketch to the hash functions in the database of previously processed samples. If a hash in the new sketch is already in the database, the sample gets a 1 in that column; if the hash is not in the database, we add a column for that hash with a 1 for the current sample and a 0 for all previous samples. I believe this produces an accurate one-hot encoding.
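The encoding described above can be sketched roughly as follows. This is only an illustration, assuming each sketch is available as a plain collection of integer hash values; the function name `one_hot_encode` and the toy values are hypothetical and not part of MASH.

```python
def one_hot_encode(sketches):
    """Build a presence/absence matrix from a list of sketches.

    sketches: list of iterables of hash values, one iterable per sample
    (an assumed input format, not MASH's own output format).
    Returns (column_of, matrix), where column_of maps hash -> column index.
    """
    column_of = {}  # hash value -> column index, grows as new hashes appear
    rows = []       # per sample: the set of column indices that hold a 1

    for sketch in sketches:
        present = set()
        for h in sketch:
            if h not in column_of:
                # Unseen hash: append a new column to the database.
                column_of[h] = len(column_of)
            present.add(column_of[h])
        rows.append(present)

    width = len(column_of)
    # Earlier samples implicitly get 0 in columns added after them.
    matrix = [[1 if c in present else 0 for c in range(width)]
              for present in rows]
    return column_of, matrix


columns, matrix = one_hot_encode([[5, 9, 2], [9, 7, 5]])
```

In this sketch the position of a hash inside its own sketch never enters the comparison; only membership does.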
My collaborator believes the order of the hash functions in the sketches matters. If this is true, then comparison to the database of previous hashes is only valid if the hash function in the new sample has the same index in the 2,000-length vector as the previous hash function it is being compared to.
My understanding of minHashing is that assuming no hash collisions, each hash function should represent a unique k-mer. Sorting the sketch in ascending order of hashes is largely for randomization and thus it is not important to compare hashes at the same index, but rather to see if any of the hashes contained in one sketch are present in the others.
This feels quite niche and difficult to explain in writing so please let me know if any clarification is needed. Thanks!
I was just wondering: if we are given an input string x and we hash it with a function f to get f(x), can we repeat this process indefinitely, i.e. f(f(x)) and so on? I ask because most hash functions generate a fixed-length output that is not the same as the input.
So by this premise, would we be able to carry this out indefinitely? One possible issue I can think is that it has to be fixed length and usually hashes are shorter than the input?
Please correct me if I am wrong. Would love an explanation!
Yes you absolutely can hash the prior hash output.
When we do this with cryptographic keys it’s called ratcheting.
The output size of the hashing algo will determine how many outputs you can rehash before you get a collision.
Thus for a 256-bit hash function we expect to see a collision with roughly 50% probability after about 2^128 hashing calls (the birthday bound).
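As a rough illustration of rehashing and of why the output size limits the chain, here is a minimal sketch. The helper names are hypothetical, and truncating SHA-256 to 16 bits is purely an assumption to make a repeat observable; with the full 256 bits the loop would not finish in practice.

```python
import hashlib


def hash_chain(seed: bytes, rounds: int) -> bytes:
    """Apply SHA-256 to its own output `rounds` times."""
    digest = seed
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest


def steps_until_repeat(seed: bytes, bits: int = 16) -> int:
    """Count hash calls until a truncated digest repeats an earlier one.

    The digest is cut down to `bits` bits (a multiple of 8) so that the
    repeat shows up after a small number of steps.
    """
    nbytes = bits // 8
    seen = set()
    value = hashlib.sha256(seed).digest()[:nbytes]
    steps = 1
    while value not in seen:
        seen.add(value)
        value = hashlib.sha256(value).digest()[:nbytes]
        steps += 1
    return steps


print(hash_chain(b"x", 3).hex())
print(steps_until_repeat(b"x", bits=16))
```

With a 16-bit output there are only 2^16 possible values, so the chain must revisit one after at most 2^16 + 1 calls, and typically does so far sooner.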
Can I reverse a SHA-256 hash, i.e. go from the 2nd hash back to the 1st hash?
ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb
da3811154d59c4267077ddd8bb768fa9b06399c486e1fc00485116b57c9872f5
The 2nd hash is generated by sha256(1), so is it possible to reverse it to the 1st hash?
In short, as of 2019, NO.
Cryptographic hash functions are, in short, one-way functions that are deterministic but random-looking. Deterministic means the same input always has the same output, and random in the sense that the output is unpredictable.
In cryptography, we consider the security of hash functions in terms of three properties:
Preimage-Resistance: for essentially all pre-specified outputs, it is computationally infeasible to find any input which hashes to that output, i.e., to find any preimage x' such that h(x') = y when given any y for which a corresponding input is not known.
2nd-preimage resistance, weak-collision: it is computationally infeasible to find any second input which has the same output as any specified input, i.e., given x, to find a 2nd-preimage x' != x such that h(x) = h(x').
Collision resistance: it is computationally infeasible to find any two distinct inputs x, x' which hash to the same output, i.e., such that h(x) = h(x').
What you are looking for is a preimage. There are cryptographic hash functions like MD4 and SHA-1 for which collisions have been found, but no practical preimage or 2nd-preimage attacks on them are known.
For SHA-256 there are no known preimage, 2nd-preimage, or collision attacks yet. It is considered a secure hash function.
You may find some rainbow tables for SHA-256 that may include your hash values but probably not since the space is too big to cover.
Hashing is meant to be a one way process. If a hashing algorithm were easily reversible, then it would be insecure. To answer your question, no, it's not possible to "unhash" 2 and obtain 1. In order to "crack" the second hash, you would have to brute force it by computing the sha256 of other strings and comparing the result with 2. If they match, then you (probably) have the original string.
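The brute-force approach described above can be sketched as follows. This is a toy example, assuming the original input is a short string over a known small alphabet; the function name and parameters are hypothetical, and the search grows exponentially with `max_len`.

```python
import hashlib
from itertools import product


def brute_force_sha256(target_hex, alphabet, max_len):
    """Try every string over `alphabet` up to `max_len` characters.

    Returns the first candidate whose SHA-256 hex digest equals
    `target_hex`, or None if no candidate in the search space matches.
    """
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            digest = hashlib.sha256(candidate.encode("utf-8")).hexdigest()
            if digest == target_hex:
                return candidate
    return None


# Toy target built from a known short input so the search can succeed.
target = hashlib.sha256(b"cab").hexdigest()
print(brute_force_sha256(target, "abc", 3))
```

For an unknown input of realistic length the same loop would have to cover far too many candidates to finish.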
Sha256 is a hash function, as defined in wikipedia https://en.wikipedia.org/wiki/Cryptographic_hash_function :
The ideal cryptographic hash function has five main properties:
it is deterministic so the same message always results in the same hash
it is quick to compute the hash value for any given message
it is infeasible to generate a message from its hash value except by trying all possible messages
a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value
it is infeasible to find two different messages with the same hash value
By definition a hash function is useful as long as you cannot reverse to the input.
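Two of the listed properties, determinism and the large change caused by a small edit, are easy to observe directly. The following is a minimal sketch with hypothetical helper names; the two sample messages are arbitrary.

```python
import hashlib


def sha256_int(message: bytes) -> int:
    """Return the SHA-256 digest of `message` as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big")


def differing_bits(a: bytes, b: bytes) -> int:
    """Count how many of the 256 output bits differ between two digests."""
    return bin(sha256_int(a) ^ sha256_int(b)).count("1")


# Same message, same hash: the function is deterministic.
print(sha256_int(b"hello") == sha256_int(b"hello"))

# Messages differing in one character: the digests differ in many bits.
print(differing_bits(b"hello", b"hellp"))
```

For a well-behaved 256-bit hash, the second number is expected to land somewhere around half of the 256 bits.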
I have two UUIDs. I want to hash them perfectly to produce a single unique value, but with a constraint that f(m,n) and f(n,m) must generate the same hash.
UUIDs are 128-bit values
the hash function should have no collisions - all possible input pairings must generate unique hash values
f(m,n) and f(n,m) must generate the same hash - that is, ordering is not important
I'm working in Go, so the resulting value must fit in a 256-bit int
the hash does not need to be reversible
Can anyone help?
Concatenate them with the smaller one first.
To build on user2357112's brilliant solution and boil down the comment chain, let's consider your requirements one by one (and out of order):
No collisions
Technically, that's not a hash function. A hash function is about mapping heterogeneous, arbitrary length data inputs into fixed-width, homogenous outputs. The only way to accomplish that if the input is longer than the output is through some data loss. For most applications, this is tolerable because the hash function is only used as a fast lookup key and the code falls back onto the slower, complete comparison of the data. That's why many guides and languages insist that if you implement one, you must implement the other.
Fortunately, you say:
Two UUID inputs m and n
UUIDs are 128 bits each
Output of f(m,n) must be 256 bits or less
Combined, your two inputs are exactly 256 bits, which means you do not have to lose any data. If you needed a smaller output, then you would be out of luck. As it is, you can concatenate the two numbers together and generate a perfect, unique representation.
f(m,n) and f(n,m) must generate the same hash
To accomplish this final requirement, make a decision on the concatenation order by some intrinsic value of the two UUIDs. The suggested smaller-first works just great. However...
The hash does not need to be reversible
If you specifically need irreversible hashing, that's a different question entirely. You could still use the less-than comparison to ensure order independence when feeding the pair to a cryptographic hash function, but you would be hard pressed to find something that guaranteed no collisions, even with fixed-width inputs and a 256-bit output width.
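The smaller-first concatenation can be sketched as below. The question is about Go, but the idea is language independent; this version uses Python's standard `uuid` module, and the name `pair_key` is hypothetical.

```python
import uuid


def pair_key(m: uuid.UUID, n: uuid.UUID) -> int:
    """Combine two UUIDs into one 256-bit integer, ignoring their order.

    The numerically smaller UUID goes in the high 128 bits and the larger
    one in the low 128 bits, so pair_key(m, n) == pair_key(n, m).
    """
    lo, hi = sorted((m.int, n.int))
    return (lo << 128) | hi


a = uuid.uuid4()
b = uuid.uuid4()
print(pair_key(a, b) == pair_key(b, a))
```

Because both 128-bit values are kept in full, distinct unordered pairs always give distinct keys; the price is that the result is trivially reversible.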
I know that jenkinshash produces a 32-bit integer (at most 2^32 distinct values) for a given value. The documentation at this link:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/JenkinsHash.html
says
Returns:
a 32-bit value. Every bit of the key affects every bit of the return value. Two keys differing by one or two bits will have totally different hash values.
jenkinshash can return at most 2^32 different results for given values.
What if I have more than 2^32 values?
Will it return same result for two different values?
Thanks
Like most hash functions, yes, it may return duplicate hash values for different input data. The guarantee, according to the documentation you linked to, is that values that differ by one or two bits have different hash values. As soon as they differ by 3 bits or more you have no uniqueness guarantee.
The input data to the hash function may be of a larger size (have more unique input values) than the output of the hash. This trivially makes it so that duplicates must exist in the output data. Consider a hashing function that outputs an integer in the range 1-10 but takes an input in the range 1-100: it is obvious that multiple values must hash to the same value because you cannot enumerate the values 1-100 using only ten different integers. This is called the pigeonhole principle.
Any good hashing function will, however, try to distribute the output values evenly. In the 1-10 example you can expect a good hashing function to give a 2 approximately the same amount of times as a 6.
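The 1-100 into 1-10 example can be tried out with a small sketch. Note the assumption: SHA-256 reduced modulo 10 stands in for the hash here, purely for illustration; it is not JenkinsHash, and the name `bucket` is hypothetical.

```python
import hashlib
from collections import Counter


def bucket(value: int, buckets: int = 10) -> int:
    """Map an integer to a bucket in 1..buckets via a stand-in hash."""
    digest = hashlib.sha256(str(value).encode("ascii")).digest()
    return int.from_bytes(digest, "big") % buckets + 1


# Hash the inputs 1..100 into 10 buckets and count how many land in each.
counts = Counter(bucket(v) for v in range(1, 101))
for b in sorted(counts):
    print(b, counts[b])
```

With 100 inputs and only 10 possible outputs, at least one bucket must hold 10 or more inputs, whatever hash is used; a good hash keeps the counts reasonably close to each other.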
Hashing functions that guarantee uniqueness are called perfect hash functions. Their output set always has at least the same cardinality as the input set. A perfect hashing function for the input integers 1-100 must have at least 100 different output values.
Note that according to Wikipedia the Jenkins hash functions are not cryptographic. This means that you should avoid them for password security and the like, but you can use the hash for somewhat even work distribution and checksums.