Is it possible to encode a series of bytes this way?

Given:
You have a large series of bytes -- call it O.
You have a pair of bytes (2 bytes) -- call it E.
Is it possible?
Can you somehow encode the O series with the E pair to produce a new series S that is the same size (length) as O, such that given S, and S alone, you can derive the original series O and pair E?

No.
I assume we are talking about completely random data here.
A series of |O| bytes can hold only a fixed amount of information, and S has exactly the same number of bytes. If every combination of bytes may be a valid dataset, it is not possible to store more information in the same number of bytes.
At least, that holds for completely random data (like hashes or encrypted data).
As soon as you know anything about the data, it's a different story. Non-random data might take up more space than necessary, so there could be room for the extra information.
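To make the counting argument explicit: if O is n bytes long, there are 256^n possible values of O and 2^16 possible values of E, so there are 256^n * 2^16 distinct (O, E) inputs. S is also n bytes long, so it can take only 256^n distinct values. By the pigeonhole principle, at least two different (O, E) inputs must map to the same S, so S alone cannot always determine both O and E.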

Related

Compression or encrypt data

I have two bytes and I want to compress them into a single byte using a key (the key length can be up to 64 bits).
Further, I want to be able to retrieve the two bytes using the compressed byte and the same key.
Does anyone have an idea how to do that?
Thanks.
There are 2^16 = 65,536 ways to choose a pair of 8-bit bytes.
However, the result of your procedure is only one 8-bit byte, which can take only 2^8 = 256 different values.
You could use this one byte as input to some decompression procedure, but because there are only 256 different inputs, the procedure cannot produce more than 256 different results. You can therefore retrieve no more than 256 of the 65,536 possible pairs; the other pairs are not accessible because you have run out of names for them, so to say.
This makes the procedure impractical if more than 256 different input byte pairs can occur.
Compression would only be practical if restrictions on your input data exist. E.g. if only the pairs p1 = (42, 37) and p2 = (127, 255) can occur as possible input, you could compress them as 01 and 02.
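As a rough illustration of that restricted-input idea, here is a minimal Go sketch with a hypothetical whitelist of allowed pairs; note that it is the shared table, not a secret key, that makes the mapping invertible:

package main

import "fmt"

// Hypothetical whitelist: only these byte pairs ever occur as input.
var codeToPair = [][2]byte{{42, 37}, {127, 255}}

// compress maps a whitelisted pair to its one-byte code.
func compress(p [2]byte) (byte, bool) {
	for code, q := range codeToPair {
		if q == p {
			return byte(code), true
		}
	}
	return 0, false // pair is not in the whitelist
}

// decompress maps a one-byte code back to the original pair.
func decompress(code byte) [2]byte {
	return codeToPair[code]
}

func main() {
	c, ok := compress([2]byte{127, 255})
	fmt.Println(c, ok)         // 1 true
	fmt.Println(decompress(c)) // [127 255]
}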

Reducing size of hash

If I have some data that I hash with SHA256 like this: hash = SHA256(data)
And then copy only the first 8 bytes of the hash instead of the whole 32 bytes, how easy is it to find a hash collision with different data? Is it 2^64 or 2^32?
If I need to reduce a hash of some data to a smaller size (n bits), is there any way to ensure the search space is 2^n?
I think you're actually interested in three things.
The first thing you need to understand is the entropy distribution of the hash. If the output of a hash function is n bits long, then the maximum entropy is n bits. Note that I say maximum; you are never guaranteed to have n bits of entropy. Similarly, if you truncate the hash output to n/4 bits, you are not guaranteed to have n/4 bits of entropy in the result. SHA-256 is fairly uniformly distributed, which means in part that you are unlikely to have more entropy in the high bits than in the low bits (or vice versa).
However, information on this is sparse because the hash function is intended to be used with its whole hash output. If you only need an 8-byte hash output, then you might not even need a cryptographic hash function and could consider other algorithms. (The point is, if you need a cryptographic hash function, then you need as many bits as it can give you, as shortening the output weakens the security of the function.)
The second is the search space: it is not dependent on the hash function at all. Searching for an input that produces a given output of a hash function is more commonly known as a brute-force attack. The number of inputs that have to be searched does not depend on the hash function itself; how could it? Every hash function's output space is fixed: every SHA-256 output is 256 bits. If you just need a collision, you could find one specific input that generates each possible 256-bit output. Unfortunately, storing just the hash values themselves (i.e. not counting the inputs needed to generate them) would take a minimum of 256 * 2^256 ≈ 3 × 10^79 bits, which vastly eclipses the total hard-drive capacity of the world.
Therefore, the search space depends on the complexity and length of the input to the hash function. If your data consists of 8-character ASCII strings, then you're pretty well guaranteed never to have a collision, BUT the search space for those hash values is only 2^(7*8) = 2^56 ≈ 7.2 × 10^16, which a determined attacker could search surprisingly quickly. After all, you don't need to find a collision if you can find the original input itself. This is why salts are important in cryptography.
Third, you're interested in knowing the collision resistance. As GregS' linked article points out, the collision resistance of a space is much more limited than the input search space due to the pigeonhole principle.
Every hash function with more inputs than outputs will necessarily have collisions. Consider a hash function such as SHA-256 that produces 256 bits of output from an arbitrarily large input. Since it must generate one of 2^256 outputs for each member of a much larger set of inputs, the pigeonhole principle guarantees that some inputs will hash to the same output. Collision resistance doesn't mean that no collisions exist; simply that they are hard to find.
The "birthday paradox" places an upper bound on collision resistance: if a hash function produces N bits of output, an attacker who computes "only" 2^(N/2) (or sqrt(2^N)) hash operations on random input is likely to find two matching outputs. If there is an easier method than this brute-force attack, it is typically considered a flaw in the hash function.
So consider what happens when you examine and store only the first 8 bytes (one fourth) of your output. Your collision resistance has dropped from 2^(256/2) = 2^128 to 2^(64/2) = 2^32. How much smaller is 2^32 than 2^128? It's a whole lot smaller: the ratio is 2^-96, roughly 10^-29 of the original size.
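To put 2^32 in perspective: by the birthday bound above, roughly 2^32 ≈ 4.3 × 10^9 hash evaluations are expected to produce a collision in the truncated 64-bit output. At the order of 10^7 to 10^9 SHA-256 evaluations per second that commodity CPUs and GPUs reach, that is seconds to minutes of work.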

SHA collision probability when removing bytes

I'm implementing a program which uses ids of variable length. These ids identify a message and are sent to a broker which will perform some operation (not relevant to the question). However, the maximum length for this id in the broker is 24 bytes. I was thinking about hashing the id (prior to sending it to the broker) with SHA and removing some bytes until only 24 bytes remain.
However, I want to have an idea of how much will this increase the collisions. So this is what I got until now:
I found out that for a "perfect" hash there is the formula p^2 / 2^(n+1) describing the probability of collisions, where p is the number of messages and n is the size of the hash in bits. Here is where my problem starts. I'm assuming that after removing some bytes from the final hash the function still remains "perfect" and I can still use the same formula. Assuming this, I get:
5160^2 / 2^(192+1) ≈ 2.12 × 10^-51
Where 5160 is the peak number of messages and 192 is the number of bits in 24 bytes.
My questions:
Is my assumption correct? Does the hash stay "perfect" after removing some bytes?
If so, and since the probability is really small, which bytes should I remove? The most or the least significant? Does it really matter at all?
PS: Any other suggestion to achieve the same result is welcomed. Thanks.
However, the maximum length for this id in the broker is 24 bytes. I was thinking about hashing the id (prior to sending it to the broker) with SHA and removing some bytes until only 24 bytes remain.
SHA-1 outputs only 20 bytes (160 bits), so you'd need to pad it, at least if all byte values are allowed and you're not restricted to hex or Base64. I recommend using truncated SHA-2 instead.
Is my assumption correct? Does the hash stay "perfect" after removing some bytes?
Pretty much. Truncating a hash preserves all of its important properties, obviously at the reduced security level corresponding to the smaller output size.
If so, and since the probability is really small, which bytes should I remove? The most or the least significant? Does it really matter at all?
That should not matter at all. NIST defined a truncated SHA-2 variant called SHA-224, which takes the first 28 bytes of a SHA-256 computation that uses a different initial state.
My recommendation is to use SHA-256, keeping the first 24 bytes. This requires around 2^96 hash-function calls to find one collision, which is currently infeasible even for extremely powerful attackers, and essentially impossible for accidental collisions.
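For illustration, a minimal Go sketch of that recommendation (the helper name is mine; 24 bytes is the broker limit from the question):

package main

import (
	"crypto/sha256"
	"fmt"
)

// truncatedID returns the first 24 bytes of the SHA-256 hash of id,
// giving a 192-bit identifier with roughly 2^96 collision resistance.
func truncatedID(id []byte) []byte {
	sum := sha256.Sum256(id)
	return sum[:24]
}

func main() {
	fmt.Printf("%x\n", truncatedID([]byte("example-message-id")))
}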

Go, midstate SHA-256 hash

Having 128 bytes of data, for example:
00000001c570c4764aadb3f09895619f549000b8b51a789e7f58ea750000709700000000103ca064f8c76c390683f8203043e91466a7fcc40e6ebc428fbcc2d89b574a864db8345b1b00b5ac00000000000000800000000000000000000000000000000000000000000000000000000000000000000000000000000080020000
And wanting to perform a SHA-256 hash on it, one would have to separate it into two 64-byte pieces of data and hash them individually before hashing the results together. If one were to often change some bits in the second half of the data, one could simplify the calculations and hash the first half of the data only once. How would one do that in Google Go? I tried calling
// SingleSHA returns the full SHA-256 digest of b.
func SingleSHA(b []byte) []byte {
	h := sha256.New()
	h.Write(b)
	return h.Sum(nil)
}
But instead of the proper answer
e772fc6964e7b06d8f855a6166353e48b2562de4ad037abc889294cea8ed1070
I got
12E84A43CBC7689AE9916A30E1AA0F3CA12146CBF886B60103AEC21A5CFAA268
When discussing the matter on Bitcoin forum, someone mentioned that there could be some problems with getting that midstate hash.
How do I calculate a midstate SHA-256 hash in Google Go?
Bitcoin-related byte operations are a bit tricky, as they tend to switch endianness at a whim. First off, we take the initial []byte array representing
00000001c570c4764aadb3f09895619f549000b8b51a789e7f58ea750000709700000000103ca064f8c76c390683f8203043e91466a7fcc40e6ebc428fbcc2d89b574a864db8345b1b00b5ac00000000000000800000000000000000000000000000000000000000000000000000000000000000000000000000000080020000
Then, we separate out the first half of the array, obtaining:
00000001c570c4764aadb3f09895619f549000b8b51a789e7f58ea750000709700000000103ca064f8c76c390683f8203043e91466a7fcc40e6ebc428fbcc2d8
After that, we need to swap some bytes around. We reverse the order of the bytes in every 4-byte slice, thus obtaining:
0100000076C470C5F0B3AD4A9F619598B80090549E781AB575EA587F977000000000000064A03C10396CC7F820F8830614E94330C4FCA76642BC6E0ED8C2BC8F
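For illustration, here is a minimal Go sketch of that 4-byte-wise reversal (the helper name is mine, not from the original code):

// swapEvery4 reverses the byte order within each consecutive
// 4-byte group, converting between the two representations above.
func swapEvery4(b []byte) []byte {
	out := make([]byte, len(b))
	for i := 0; i+4 <= len(b); i += 4 {
		out[i], out[i+1], out[i+2], out[i+3] = b[i+3], b[i+2], b[i+1], b[i]
	}
	return out
}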
That swapped array is the one we will use for calculating the midstate. Now, we need to alter the file hash.go, adding this to the type Hash interface:
Midstate() []byte
And change the file sha256.go, adding this function:
// Midstate returns the current internal state d.h of the
// SHA-256 digest as a byte slice.
func (d *digest) Midstate() []byte {
	var answer []byte
	for i := 0; i < len(d.h); i++ {
		answer = append(answer, Uint322Hex(d.h[i])...)
	}
	return answer
}
Where Uint322Hex converts a uint32 variable into a []byte variable. Having all that, we can call:
var h BitSHA.Hash = BitSHA.New()
h.Write(Str2Hex("0100000076C470C5F0B3AD4A9F619598B80090549E781AB575EA587F977000000000000064A03C10396CC7F820F8830614E94330C4FCA76642BC6E0ED8C2BC8F"))
log.Printf("%X", h.Midstate())
Where Str2Hex turns a string into []byte. The result is:
69FC72E76DB0E764615A858F483E3566E42D56B2BC7A03ADCE9492887010EDA8
Remembering the proper answer:
e772fc6964e7b06d8f855a6166353e48b2562de4ad037abc889294cea8ed1070
We can compare them:
69FC72E7 6DB0E764 615A858F 483E3566 E42D56B2 BC7A03AD CE949288 7010EDA8
e772fc69 64e7b06d 8f855a61 66353e48 b2562de4 ad037abc 889294ce a8ed1070
So we can see that we just need to swap the bytes around in each 4-byte slice and we will have the proper "midstate" used by Bitcoin pools and miners (until it is deprecated and no longer needed).
The Go code you have is the right way to compute the SHA-256 of a stream of bytes.
Most likely the answer is that what you want to do is not sha256. Specifically:
one would have to separate it into two 64-byte pieces of data and hash them individually before hashing the results together. If one were to often change some bits in the second half of the data, one could simplify the calculations and hash the first half of the data only once.
is not a valid way to calculate sha256 (read http://doc.golang.org/src/pkg/crypto/sha256/sha256.go to see, for example, that sha256 does its work on blocks of data, which must be padded, etc.).
The algorithm you described calculates something, but not sha256.
Since you know the expected value you presumably have some reference implementation of your algorithm in another language so just do a line-by-line port to Go.
Finally, it's a dubious optimization in any case. Hashing cost is usually proportional to the size of the data, and at 128 bytes (two 64-byte SHA-256 blocks) the cost is so small that the additional work of trying to be clever about caching the state of the first half will likely cost more than what you save.
In sha256.go, at the start of the function Sum(), the implementation makes a copy of the SHA-256 state. The underlying data type of SHA-256 (struct digest) is private to the sha256 package.
I would suggest making your own private copy of the sha256.go file (it is a small file). Then add a Copy() function to save the current state of the digest:
// Copy returns a snapshot of the current digest state. Writing more
// data to the copy does not affect the original, so the copy can be
// used to resume hashing from the saved midstate.
func (d *digest) Copy() hash.Hash {
	d_copy := *d
	return &d_copy
}
Then simply call the Copy() function to save a midstate SHA256 hash.
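As a side note, a way to achieve the same thing without patching the package: in newer versions of Go, the standard library's hash implementations support encoding.BinaryMarshaler and encoding.BinaryUnmarshaler, so the midstate can be saved and restored from outside. A minimal sketch, assuming a Go version with that support:

package main

import (
	"crypto/sha256"
	"encoding"
	"fmt"
)

func main() {
	first := make([]byte, 64)  // invariant first 64-byte block
	second := make([]byte, 64) // frequently changing second block

	// Hash the first block once and save the midstate.
	// Assumes the sha256 digest implements encoding.BinaryMarshaler.
	h := sha256.New()
	h.Write(first)
	state, err := h.(encoding.BinaryMarshaler).MarshalBinary()
	if err != nil {
		panic(err)
	}

	// For each new second half: restore the midstate and finish the hash.
	h2 := sha256.New()
	if err := h2.(encoding.BinaryUnmarshaler).UnmarshalBinary(state); err != nil {
		panic(err)
	}
	h2.Write(second)
	fmt.Printf("%x\n", h2.Sum(nil))
}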
I ran two Go benchmarks on your 128 bytes of data, using an Intel i5 2.70 GHz CPU. First, 1,000 times, I wrote all 128 bytes to the SHA256 hash and read the sum, which took a total of about 9,285,000 nanoseconds. Second, I wrote the first 64 bytes to the SHA256 hash once and then, 1,000 times, I wrote the second 64 bytes to a copy of the SHA256 hash and read the sum, which took a total of about 6,492,371 nanoseconds. The second benchmark, which assumed the first 64 bytes are invariant, ran in 30% less time than the first benchmark.
Using the first method, you could calculate about 9,305,331,179 SHA256 128-byte sums per day before buying a faster CPU. Using the second method, you could calculate 13,307,927,103 SHA256 128-byte sums per day, assuming the first 64 bytes are invariant 1,000 times in a row, before buying a faster CPU. How many SHA256 128-byte sums per day do you need to calculate? For how many SHA256 128-byte sums per day are the first 64 bytes invariant?
What benchmarks did you run and what were the results?

Faster way to find the correct order of chunks to get a known SHA1 hash?

Say a known SHA1 hash was calculated by concatenating several chunks of data, and that the order in which the chunks were concatenated is unknown. The straightforward way to find the order of the chunks that gives the known hash would be to calculate an SHA1 hash for each possible ordering until the known hash is found.
Is it possible to speed this up by calculating an SHA1 hash separately for each chunk and then find the order of the chunks by only manipulating the hashes?
In short, no.
If you are using SHA-1, then due to the avalanche effect, any tiny change in the plaintext (in your case, the order of your chunks) alters the corresponding SHA-1 hash significantly.
Say you have 4 chunks: A, B, C and D.
The SHA1 hash of A+B+C+D (concatenated) is supposed to be uncorrelated with the SHA1 hashes of A, B, C and D computed separately.
Since they are unrelated, you cannot draw any relationship between the concatenated chunks (A+B+C+D, B+C+A+D, etc.) and each individual chunk (A, B, C or D).
If you could identify any such relationship, the SHA1 hashing algorithm would be in trouble.
Practical answer: no. If the hash function you use is any good, then it is supposed to behave like a random oracle, whose output on a given input is totally unknown until that input is tried. So you cannot infer anything from the hashes you compute until you hit the exact input ordering that you are looking for. (Strictly speaking, there could exist a hash function that has the usual properties of a hash function, namely collision and preimage resistance, without being a random oracle, but departing from the RO model is still considered a hash function weakness. Also strictly speaking, it is slightly improper to talk about a random oracle for a single, unkeyed function.)
Theoretical answer: it depends. Assuming, for simplicity, that you have N chunks of 512 bits each, you can arrange for the cost not to exceed N·2^160 elementary evaluations of SHA-1, which is lower than N! when N >= 42. The idea is that the running state of SHA-1 between two successive blocks is limited to 160 bits. Of course, that cost is ridiculously infeasible anyway. More generally, your problem is about finding a preimage for SHA-1 with inputs in a custom set S (the N! sequences of your N chunks), so the cost has a lower bound of the size of S and the preimage resistance of SHA-1, whichever is lower. The size of S is N!, which grows very fast as N increases. SHA-1 has no known weakness with regard to preimages, so its resistance is still assumed to be about 2^160 (since it has a 160-bit output).
Edit: this kind of question would be appropriate on the proposed "cryptography" Stack Exchange, when (if) it is instantiated. Please commit to help create it!
Depending on your hashing library, something like this may work: say you have blocks A, B, C, and D. You can process the hash for block A, then clone that state and calculate A+B, A+C, and A+D without having to recalculate A each time. Then you can clone each of those to calculate A+B+C and A+B+D from A+B, A+C+B and A+C+D from A+C, and so on. A sketch of this idea follows.
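Here is a minimal sketch in Go, using the standard library's hash-state marshaling (the same mechanism sketched in the previous question) as the clone operation. It assumes a Go version whose hash implementations support encoding.BinaryMarshaler and encoding.BinaryUnmarshaler; the chunk contents are placeholders:

package main

import (
	"crypto/sha1"
	"encoding"
	"fmt"
	"hash"
)

// clone snapshots a hash's internal state so a shared prefix
// is only ever hashed once. Assumes the sha1 digest implements
// encoding.BinaryMarshaler and encoding.BinaryUnmarshaler.
func clone(h hash.Hash) hash.Hash {
	state, err := h.(encoding.BinaryMarshaler).MarshalBinary()
	if err != nil {
		panic(err)
	}
	c := sha1.New()
	if err := c.(encoding.BinaryUnmarshaler).UnmarshalBinary(state); err != nil {
		panic(err)
	}
	return c
}

func main() {
	a, b, c := []byte("A"), []byte("B"), []byte("C") // placeholder chunks

	hA := sha1.New()
	hA.Write(a) // hash chunk A once

	// Extend the saved prefix state with each candidate next chunk.
	hAB := clone(hA)
	hAB.Write(b)
	hAC := clone(hA)
	hAC.Write(c)

	fmt.Printf("A+B: %x\n", hAB.Sum(nil))
	fmt.Printf("A+C: %x\n", hAC.Sum(nil))
}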
Nope. Calculating the complete SHA1 hash requires that the chunks be processed in order. The calculation of each hash block requires the output of the previous one. If that weren't true, it would be much easier to manipulate documents so that you could reorder the chunks at will, which would greatly decrease the usefulness of the algorithm.