Having 128 bytes of data, for example:
00000001c570c4764aadb3f09895619f549000b8b51a789e7f58ea750000709700000000103ca064f8c76c390683f8203043e91466a7fcc40e6ebc428fbcc2d89b574a864db8345b1b00b5ac00000000000000800000000000000000000000000000000000000000000000000000000000000000000000000000000080020000
And wanting to perform a SHA-256 hash on it, one would have to separate it into two 64-byte blocks and hash them individually before hashing the results together. If one were to change some bits in the second half of the data often, one could simplify the calculations and hash the first half of the data only once. How would one do that in Google Go? I tried calling
import "crypto/sha256"

func SingleSHA(b []byte) []byte {
	h := sha256.New()
	h.Write(b)
	// Sum appends the digest to its argument; passing nil returns just the digest.
	return h.Sum(nil)
}
But instead of the proper answer
e772fc6964e7b06d8f855a6166353e48b2562de4ad037abc889294cea8ed1070
I got
12E84A43CBC7689AE9916A30E1AA0F3CA12146CBF886B60103AEC21A5CFAA268
When discussing the matter on Bitcoin forum, someone mentioned that there could be some problems with getting that midstate hash.
How do I calculate a midstate SHA-256 hash in Google Go?
Bitcoin-related byte operations are a bit tricky, as they tend to switch endianness on a whim. First off, we take the initial []byte array representing
00000001c570c4764aadb3f09895619f549000b8b51a789e7f58ea750000709700000000103ca064f8c76c390683f8203043e91466a7fcc40e6ebc428fbcc2d89b574a864db8345b1b00b5ac00000000000000800000000000000000000000000000000000000000000000000000000000000000000000000000000080020000
Then, we separate out the first half of the array, obtaining:
00000001c570c4764aadb3f09895619f549000b8b51a789e7f58ea750000709700000000103ca064f8c76c390683f8203043e91466a7fcc40e6ebc428fbcc2d8
After that, we need to swap some bytes around. We reverse the order of the bytes within every 4-byte slice, obtaining:
0100000076C470C5F0B3AD4A9F619598B80090549E781AB575EA587F977000000000000064A03C10396CC7F820F8830614E94330C4FCA76642BC6E0ED8C2BC8F
And that is the array we will be using for calculating the midstate. Now, we need to alter the file hash.go, adding to type Hash interface:
Midstate() []byte
And change the file sha256.go, adding this function:
func (d *digest) Midstate() []byte {
	var answer []byte
	for i := 0; i < len(d.h); i++ {
		answer = append(answer, Uint322Hex(d.h[i])...)
	}
	return answer
}
Where Uint322Hex converts a uint32 variable into a []byte variable. Having all that, we can call:
var h BitSHA.Hash = BitSHA.New()
h.Write(Str2Hex("0100000076C470C5F0B3AD4A9F619598B80090549E781AB575EA587F977000000000000064A03C10396CC7F820F8830614E94330C4FCA76642BC6E0ED8C2BC8F"))
log.Printf("%X", h.Midstate())
Where Str2Hex turns a string into []byte. The result is:
69FC72E76DB0E764615A858F483E3566E42D56B2BC7A03ADCE9492887010EDA8
Remembering the proper answer:
e772fc6964e7b06d8f855a6166353e48b2562de4ad037abc889294cea8ed1070
We can compare them:
69FC72E7 6DB0E764 615A858F 483E3566 E42D56B2 BC7A03AD CE949288 7010EDA8
e772fc69 64e7b06d 8f855a61 66353e48 b2562de4 ad037abc 889294ce a8ed1070
So we can see that we just need to reverse the byte order within each 4-byte slice, and we will have the proper "midstate" used by Bitcoin pools and miners (at least until it is deprecated and no longer needed).
The Go code you have is the right way to compute sha256 of a stream of bytes.
Most likely the answer is that what you want to do is not sha256. Specifically:
one would have to separate it into two 64 bits of data and hash them individually before hashing the results together. If one was to often change some bits in the second half of the data, one could simplify the calculations and hash the first half of the data only once.
is not a valid way to calculate sha256 (read http://doc.golang.org/src/pkg/crypto/sha256/sha256.go to e.g. see that sha256 does its work on blocks of data, which must be padded etc.).
The algorithm you described calculates something, but not sha256.
Since you know the expected value you presumably have some reference implementation of your algorithm in another language so just do a line-by-line port to Go.
Finally, it's a dubious optimization in any case. 128 bits is 16 bytes. Hashing cost is usually proportional to the size of data. At 16 bytes, the cost is so small that the additional work of trying to be clever by splitting data in 8 byte parts will likely cost more than what you saved.
In sha256.go, at the start of function Sum() the implementation is making a copy of the SHA256 state. The underlying datatype of SHA256 (struct digest) is private to the sha256 package.
I would suggest to make your own private copy of the sha256.go file (it is a small file). Then add a Copy() function to save the current state of the digest:
func (d *digest) Copy() hash.Hash {
d_copy := *d
return &d_copy
}
Then simply call the Copy() function to save a midstate SHA256 hash.
I ran two Go benchmarks on your 128 bytes of data, using an Intel i5 2.70 GHz CPU. First, 1,000 times, I wrote all 128 bytes to the SHA256 hash and read the sum, which took a total of about 9,285,000 nanoseconds. Second, I wrote the first 64 bytes to the SHA256 hash once and then, 1,000 times, I wrote the second 64 bytes to a copy of the SHA256 hash and read the sum, which took a total of about 6,492,371 nanoseconds. The second benchmark, which assumed the first 64 bytes are invariant, ran in 30% less time than the first benchmark.
Using the first method, you could calculate about 9,305,331,179 SHA256 128-byte sums per day before buying a faster CPU. Using the second method, you could calculate 13,307,927,103 SHA256 128-byte sums per day, assuming the first 64 bytes are invariant 1,000 times in a row, before buying a faster CPU. How many SHA256 128-byte sums per day do you need to calculate? For how many SHA256 128-byte sums per day are the first 64 bytes invariant?
What benchmarks did you run and what were the results?
Let's say I have strings that need not be reversible and let's say I use SHA224 to hash it.
The hash of hello world is 2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b and its length is 56 bytes.
What if I convert every two chars to its numerical representation and make a single byte out of them?
In Python I'd do something like this:
shalist = list("2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b")
for first_byte,next_byte in zip(shalist[0::2],shalist[1::2]):
    chr(ord(first_byte)+ord(next_byte))
The result will be \x98ek\x9d\x95\x96\x96\xc7\xcb\x9ckhf\x9a\xc7\xc9\xc8\x97\x97\x99\x97\xc9gd\x96im\x94. 28 bytes. Effectively halved the input.
Now, is there a higher hash collision risk by doing so?
The simple answer is pretty obvious: yes, it increases the chance of collision by as many powers of 2 as there are bits missing. For 56 bytes halved to 28 bytes, the chance of collision increases by a factor of 2^(28*8). That still leaves the chance of collision at 1:2^(28*8).
Your use of that truncation can be still perfectly legit, depending what it is. Git for example shows only the first few bytes from a commit hash and for most practical purposes the short one works fine.
A "perfect" hash should retain a proportional amount of "effective" bits if you truncate it. For example 32 bits of SHA256 result should have the same "strength" as a 32-bit CRC, although there may be some special properties of CRC that make it more suitable for some purposes while the truncated SHA may be better for others.
If you're doing any kind of security with this, it will be difficult to prove your system; you're probably better off using a shorter but complete hash.
Let's shrink the size to make sense of it and use a 2-byte hash instead of 56. The original hash will have 65536 possible values, so if you hash more than that many strings you will surely get a collision. Halve that to 1 byte and you will get a collision after at most 256 strings hashed, regardless of whether you take the first or the second byte. So your chance of collision is 256 times greater (2^(1 byte * 8 bits)) and is 1:256.
Long hashes are used to make it truly impractical to brute-force them, even after long years of cryptanalysis. When MD5 was introduced in 1991 it was considered secure enough to use for certificate signing, in 2008 it was considered "broken" and not suitable for security-related use. Various cryptanalysis techniques can be developed to reduce the "effective" strength of hash and encryption algorithms, so the more spare bits there are (in an otherwise strong algorithm) the more effective bits should remain to keep the hash secure for all practical purposes.
If I have some data I hash with SHA256 like this :- hash=SHA256(data)
And then copy only the first 8 bytes of the hash instead of the whole 32 bytes, how easy is it to find a hash collision with different data? Is it 2^64 or 2^32 ?
If I need to reduce a hash of some data to a smaller size (n bits) is there any way to ensure the search space 2^n ?
I think you're actually interested in three things.
The first thing you need to understand is the entropy distribution of the hash. If the output of a hash function is n bits long, then the maximum entropy is n bits. Note that I say maximum; you are never guaranteed to have n bits of entropy. Similarly, if you truncate the hash output to n/4 bits, you are not guaranteed to have n/4 bits of entropy in the result. SHA-256 is fairly uniformly distributed, which means in part that you are unlikely to have more entropy in the high bits than the low bits (or vice versa).
However, information on this is sparse because the hash function is intended to be used with its whole hash output. If you only need an 8-byte hash output, then you might not even need a cryptographic hash function and could consider other algorithms. (The point is, if you need a cryptographic hash function, then you need as many bits as it can give you, as shortening the output weakens the security of the function.)
The second is the search space: it is not dependent on the hash function at all. Searching for an input that creates a given output on a hash function is more commonly known as a Brute-Force attack. The number of inputs that will have to be searched does not depend on the hash function itself; how could it? Every hash function output is the same: every SHA-256 output is 256 bits. If you just need a collision, you could find one specific input that generated each possible output of 256 bits. Unfortunately, this would take up a minimum storage space of 256 * 2^256 ≈ 3 * 10^79 bits for just the hash values themselves (i.e. not counting the inputs needed to generate them), which vastly eclipses the entire hard drive capacity of the entire world.
Therefore, the search space depends on the complexity and length of the input to the hash function. If your data is 8-character long ASCII strings, then you're pretty well guaranteed to never have a collision, BUT the search space for those hash values is only 2^(7*8) ≈ 7.2 * 10^16, which could be searched by your computer in a few minutes, probably. After all, you don't need to find a collision if you can find the original input itself. This is why salts are important in cryptography.
Third, you're interested in knowing the collision resistance. As GregS' linked article points out, the collision resistance of a space is much more limited than the input search space due to the pigeonhole principle.
Every hash function with more inputs than outputs will necessarily have collisions. Consider a hash function such as SHA-256 that produces 256 bits of output from an arbitrarily large input. Since it must generate one of 2^256 outputs for each member of a much larger set of inputs, the pigeonhole principle guarantees that some inputs will hash to the same output. Collision resistance doesn't mean that no collisions exist; simply that they are hard to find.
The "birthday paradox" places an upper bound on collision resistance: if a hash function produces N bits of output, an attacker who computes "only" 2^(N/2) (or sqrt(2^N)) hash operations on random input is likely to find two matching outputs. If there is an easier method than this brute force attack, it is typically considered a flaw in the hash function.
So consider what happens when you examine and store only the first 8 bytes (one fourth) of your output. Your collision resistance has dropped from 2^(256/2) = 2^128 to 2^(64/2) = 2^32. How much smaller is 2^32 than 2^128? It's a whole lot smaller, as it turns out: a factor of 2^96, which makes it roughly 1.3 * 10^-27 percent of the size.
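To make the birthday bound concrete, here is a small Go experiment that keeps only the first 16 bits of SHA-256 and searches for two inputs that collide. It is a toy sketch under assumed names (truncated16, findCollision, and the "msg-N" inputs are all made up for illustration); with 16 bits the pigeonhole principle guarantees a hit within 65537 inputs, and the birthday bound suggests it typically takes a few hundred.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// truncated16 returns the first 16 bits of SHA-256(msg).
func truncated16(msg string) uint16 {
	sum := sha256.Sum256([]byte(msg))
	return binary.BigEndian.Uint16(sum[:2])
}

// findCollision hashes "msg-0", "msg-1", ... and stops at the first input
// whose truncated hash was already produced by an earlier input.
func findCollision() (first, second string, attempts int) {
	seen := make(map[uint16]string)
	for i := 0; ; i++ {
		msg := fmt.Sprintf("msg-%d", i)
		t := truncated16(msg)
		if prev, ok := seen[t]; ok {
			return prev, msg, i + 1
		}
		seen[t] = msg
	}
}

func main() {
	a, b, n := findCollision()
	fmt.Printf("%q and %q share %04x after %d hashes\n", a, b, truncated16(a), n)
}
```

Scaling the same loop to 64 bits would need on the order of 2^32 stored values, which is why truncating to 8 bytes is within reach of an attacker while 2^128 is not.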
So it's time for me to index my database file format, and after looking at various methods I decided that a hash table would be my best option. Since I only familiarized myself with the inner workings of a hash table today, here's my understanding of it, so please correct me if I'm wrong:
A hash table has a constant size that is equivalent of the maximum value storable in its hash function output size * key value pair size * bucket size + overflow bucket size. So for example, if the hash function makes 16 bit hashes and the bucket size is 4 and the values are 32bit then it would be 2^16 * 4 * 6 = 1572864 or 1.5MB plus overflow.
That in essence would make the hash table a sort of compressed lookup table. If the hash function changes, the whole table has to be reevaluated. Otherwise it just adds stuff to empty slots. Also the hash table can contain the maximum of units that its hash size could address (so for a 16-bit hash it's 65536) but to perform well without many collisions it would have to be much less.
OK, and here are the things I'm trying to index: (up to) 100 million pairs with 64-bit integer keys and a 96-bit value. The keys are object IDs (that mostly come in short sequences but can be all over the place) and the values are the object location + length. Reads/writes are equally important and very frequent.
The other options I looked into were various trees, but the reason I didn't like them is that it seems to me that I would have to do a lot of sparse reads/writes to look up the data or to restructure the tree each time I go in.
So here are my questions:
It seems to me that I need a hash with a weird number of bits in it, I'm thinking up to ~38 since it would be just about the maximum I can store on a single disk and should be comfy enough for the 100 million. Is the weird bit amount unheard of? I'm thinking I'll probably bottleneck on disk activity way before CPU.
Are there any articles out there on how to design a good hash function for my particular case? Googling gave me an overview of the common methods but I'm looking for explanations behind them.
Any other general tips/pitfalls I should know of?
A hash table has a constant size
...not necessarily - a hash table can support resizing, but that tends to be done in fairly dramatic and invasive chunks where you can reason about the hash table as if it were constant size both before and after.
...that is equivalent of the maximum value storable in its hash function output size * key value pair size * bucket size + overflow bucket size. So for example, if the hash function makes 16 bit hashes and the bucket size is 4 and the values are 32bit then it would be 2^16 * 4 * 6 = 1572864 or 1.5MB plus overflow.
Not at all. A better way to calculate size is to say there are N values of a certain size, and you want to maintain a capacity:size ratio somewhere between say 3:1 and 5:4: the table memory usage is: N * sizeof(Value) * ratio.
The number of bits in the hash value is only relevant in that it indicates the maximum number of distinct buckets you can hash to: if you try to have a bigger table then you'll get more collisions than you would with a hash function generating wider-bit hash values. If you have more bits from your hash function than you need it is not a problem, you e.g. take the modulus with the current table size to find your bucket: hashed_to_bucket = hash_value % num_buckets.
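A rough Go sketch of both points, the size estimate and the modulus step. The function names are hypothetical, the ratio is passed as a fraction to stay in integer arithmetic, and the 20-byte entry size assumes the 64-bit key is stored next to the 96-bit value, which the answer's formula does not spell out.

```go
package main

import "fmt"

// tableBytes estimates table memory as n * entrySize * (num/den), where
// num:den is the chosen capacity:size ratio (for example 5:4 or 3:1).
func tableBytes(n, entrySize, num, den uint64) uint64 {
	return n * entrySize * num / den
}

// bucketFor reduces a hash value of any width to a bucket index.
func bucketFor(hash, numBuckets uint64) uint64 {
	return hash % numBuckets
}

func main() {
	const n = 100000000      // up to 100 million pairs, from the question
	const entrySize = 8 + 12 // ASSUMPTION: 64-bit key stored beside the 96-bit value

	fmt.Println(tableBytes(n, entrySize, 5, 4)) // 2500000000 bytes at 5:4
	fmt.Println(tableBytes(n, entrySize, 3, 1)) // 6000000000 bytes at 3:1

	// A 64-bit hash still maps cleanly onto a much smaller table.
	fmt.Println(bucketFor(0xDEADBEEFCAFEF00D, 125000003))
}
```

So at these ratios the table lands somewhere between about 2.5 GB and 6 GB, independent of how many bits the hash function produces.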
That in essence would make the hash table a sort of compressed lookup table.
That's a good way to look at a hash table.
If the hash function changes, the whole table has to be reevaluated. Otherwise it just adds stuff to empty slots.
Definitely reevaluated/regenerated. Otherwise adding to empty slots is but one of the undesirable consequences.
Also the hash table can contain the maximum of units that its hash size could address (so for a 16-bit hash it's 65536) but to perform well without many collisions it would have to be much less.
As above, that (e.g. 65536) is not a hard maximum, but "to perform well without collisions" going over that should be avoided. To perform well it does not have to be much less: anything right up to 65536 is perfectly fine if it's a good quality 16-bit hash function.
OK, and here are the things I'm trying to index: (up to) 100 million pairs with 64-bit integer keys and a 96-bit value. The keys are object IDs (that mostly come in short sequences but can be all over the place) and the values are the object location + length. Reads/writes are equally important and very frequent.
The other options I looked into were various trees, but the reason I didn't like them is that it seems to me that I would have to do a lot of sparse reads/writes to look up the data or to restructure the tree each time I go in.
Could be... a lot depends on your access patterns. For example, if you happen to try to access the keys following the "short sequences" then a data organisation model that tends to put them nearby in memory/disk helps. Some types of tree structures do that nicely, and you can sometimes hack your hash function to do it too (but need to balance that up against collision proneness).
It seems to me that I need a hash with a weird number of bits in it, I'm thinking up to ~38 since it would be just about the maximum I can store on a single disk and should be comfy enough for the 100 million. Is the weird bit amount unheard of? I'm thinking I'll probably bottleneck on disk activity way before CPU.
Not so... you have 64 bit integer keys - a 64 bit or larger hash would be desirable. That said, a 32 bit hash may well be fine too - that generates 4 billion distinct values which is greater than your 100 million keys.
Are there any articles out there on how to design a good hash function for my particular case? Googling gave me an overview of the common methods but I'm looking for explanations behind them.
Not that I'm aware of.
Any other general tips/pitfalls I should know of?
For tips... I'd say start simple (e.g. with the hash function returning the key unchanged and using modulus with a hash table capacity that's a prime number, OR using any common hash if you're picking up a hash table implementation that uses e.g. power-of-2 numbers of buckets) and measure your collision rates: that tells you how much effort it's worth putting into improving your hashing.
One very simple way to get "ideal, randomised" hashing in your case is to have 8 tables of 256 32-bit integers - initialised with hardcoded random numbers (you can google for random number download websites). Given any 64-bit key, just slice it into 8 bytes then use each byte as a key in the successive tables, XORing the 32-bit values you look up. A single bit of difference in any of the 64 input bits will then impact all 32 bits in the hash value with equal probability.
uint32_t table[8][256] = { ...add some random numbers... };
uint32_t h(uint64_t n)
{
    uint32_t result = 0;
    unsigned char* p = (unsigned char*)&n;
    for (int i = 0; i < 8; ++i)
        result ^= table[i][*p++];
    return result;
}
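The same table-lookup idea could look roughly like this in Go. Two things here are assumptions and not part of the C version above: the tables are filled from a fixed-seed math/rand generator instead of hardcoded random numbers, and the key is sliced lowest byte first (which matches what the pointer walk does on a little-endian machine). tabHash is a hypothetical name.

```go
package main

import (
	"fmt"
	"math/rand"
)

// table holds 8 tables of 256 random 32-bit words, one table per key byte.
var table [8][256]uint32

func init() {
	// ASSUMPTION: a fixed seed stands in for hardcoded random numbers so
	// that hash values stay stable between runs of this sketch.
	rng := rand.New(rand.NewSource(1))
	for i := range table {
		for j := range table[i] {
			table[i][j] = rng.Uint32()
		}
	}
}

// tabHash slices the key into 8 bytes (lowest byte first) and XORs the
// table entry selected by each byte.
func tabHash(n uint64) uint32 {
	var result uint32
	for i := 0; i < 8; i++ {
		b := byte(n >> (8 * uint(i)))
		result ^= table[i][b]
	}
	return result
}

func main() {
	// Neighbouring keys differ in one byte, so their hashes differ by the
	// XOR of two unrelated random table entries.
	for _, key := range []uint64{1000, 1001, 1002} {
		fmt.Printf("%d -> %08x\n", key, tabHash(key))
	}
}
```

For an on-disk index the table contents must never change once data has been written, which is why they should be fixed constants in real use.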
I'm implementing some program which uses id's with variable length. These id's identify a message and are sent to a broker which will perform some operation (not relevant to the question). However, the maximum length for this id in the broker is 24 bytes. I was thinking about hashing the id (prior to sending to the broker) with SHA and removing some bytes until it gets 24 bytes only.
However, I want to have an idea of how much will this increase the collisions. So this is what I got until now:
I found out that for a "perfect" hash we have the formula p^2 / 2^(n+1) to describe the probability of collisions, where p is the number of messages and n is the size of the hash in bits. Here is where my problem starts. I'm assuming that after removing some bytes from the final hash the function still remains "perfect" and I can still use the same formula. So assuming this I get:
5160^2 / 2^(192+1) = 2.12x10^-51
Where 5160 is the peak number of messages and 192 is the number of bits in 24 bytes.
My questions:
Is my assumption correct? Does the hash stay "perfect" by removing some bytes.
If so and since the probability is really small, which bytes should I remove? Most or less significant? Does it really matter at all?
PS: Any other suggestion to achieve the same result is welcomed. Thanks.
However, the maximum length for this id in the broker is 24 bytes. I was thinking about hashing the id (prior to sending to the broker) with SHA and removing some bytes until it gets 24 bytes only.
SHA-1 outputs only 20 bytes (160 bits), so you'd need to pad it. At least if all bytes are valid, and you're not restricted to hex or Base64. I recommend using truncated SHA-2 instead.
Is my assumption correct? Does the hash stay "perfect" by removing some bytes.
Pretty much. Truncating hashes should conserve all their important properties, obviously at the reduced security level corresponding to the smaller output size.
If so and since the probability is really small, which bytes should I remove? Most or less significant? Does it really matter at all?
That should not matter at all. NIST defined a truncated SHA-2 variant, called SHA-224, which takes the first 28 bytes of SHA-256 using a different initial state for the hash calculation.
My recommendation is to use SHA-256, keeping the first 24 bytes. This requires around 2^96 hash-function calls to find one collision. Which is currently infeasible, even for extremely powerful attackers, and essentially impossible for accidental collisions.
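A minimal Go sketch of that recommendation, together with the estimate from the question. brokerID and collisionEstimate are hypothetical names, and the sample id string is made up.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math"
)

// brokerID returns the first 24 bytes of SHA-256(id).
func brokerID(id string) [24]byte {
	sum := sha256.Sum256([]byte(id))
	var out [24]byte
	copy(out[:], sum[:24])
	return out
}

// collisionEstimate evaluates p^2 / 2^(n+1) for p messages and an n-bit
// hash, using Ldexp to scale by the power of two.
func collisionEstimate(p float64, nBits int) float64 {
	return math.Ldexp(p*p, -(nBits + 1))
}

func main() {
	id := brokerID("some-variable-length-message-id")
	fmt.Printf("%x (%d bytes)\n", id, len(id))

	// 5160 messages and 192 bits, as in the question: about 2.12e-51.
	fmt.Printf("%.3g\n", collisionEstimate(5160, 192))
}
```

If the broker only accepts printable characters, the 24 raw bytes would not fit as hex or Base64, so the truncation length would have to shrink accordingly.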
Here is a little conundrum for you: If you use a hash algorithm like CRC-64, how many bytes of a string would it be necessary to read to calculate a good hash? Let's say all your strings are at least 2 KB long; then it seems a waste of resources to use the whole string to calculate the hash, but just how many characters do you think is enough? Would just 8 ASCII characters be enough, since that equals 64 bits? Won't using more than 8 ASCII characters just be pointless? I want to know your thoughts on this.
Update:
With a 'good hash' I mean the point where the likelihood of hash collisions can not get any less by using even more bytes to calculate it.
If you use CRC-64 over 8 bytes or less then there is no point in using CRC-64: just use the 8 bytes "as is". A CRC does not have any added value unless the input is longer than the intended output.
As a general rule, if your hash function has an output of n bits then collisions begin to appear once you have accumulated about 2^(n/2) strings. In shorter words, if you use 64 bits, then it is very unlikely that you encounter a collision in the first 2 billion strings. If you get a 160-bit or more output, then collisions are virtually infeasible (you will encounter far fewer collisions than hardware failures such as the CPU catching fire). This assumes that the hash function is "perfect". If your hash function begins by selecting a few data bytes, then, necessarily, the bytes that you do not select cannot have any influence on the hash output, so you'd better use the "good" bytes -- which utterly depends on the kind of strings that you are hashing. There is no general rule here.
My advice would be to first try using a generic hash function over the whole string; I usually recommend MD4. MD4 is a cryptographic hash function, which has been utterly broken, but for a problem with no security involved, it is still very good at mixing data elements (cryptographically speaking, a CRC is so much more broken than MD4). MD4 has been reported to actually be faster than CRC-32 on some platforms, so you could give it a shot. On a basic PC (my 2.4 GHz Core2), an MD4 implementation works at about 700 MBytes/s, so we are talking about 350,000 hashed 2 kB strings per second, which is not bad.
What are the chances that the first 8 letters of two different strings are the same? Depending on what these strings are, it could be very high, in which case you'll definitely get hash collisions.
Hash the whole thing. A few kilobytes is nothing. Unless you actually have a need to save nanoseconds in your program, not hashing the full strings would be premature optimization.