If I know most of the clear text, can I crack AES? - aes

To be specific, suppose that AES_256_CBC is used (the key is 32 bytes, the iv is 16 bytes) and that the text is about 500 bytes, but most of the text is known to me. Consider an extreme case, only 1 byte of it, I do not know.
And suppose the message head (the first 32 bytes) does not contain that byte.
Can I crack this? I mean whether I can find out what that unknown byte is.

Related

SHA-256 does bit append still required if input is already N x 512 bit?

If my input is less than a multiple of 512 bit , i need to append bit and length bits to my input so that it is a multiple of 512 bit .
https://infosecwriteups.com/breaking-down-sha-256-algorithm-2ce61d86f7a3
But what if my input is already a multiple of 512 bit ? is it still required to do the bit append ? for example , if my message is already 512 bit long , do i need to bit append it to become 1024 bit long ?
And what if my input is less than a multiple of 512 bit , but long enough to not allow append length bits ? for example , my input is 504 bit long.
It seems to explain it rather clearly:
The number of bits we add is calculated as such so that after addition of these bits the length of the message should be exactly 64 bits less than a multiple of 512.
If there were 512 bits, you have to pad it to 960 bits, then add 64 bits for length for a total of 1024. The same with 504, since 504 > 512-64. Otherwise you'd have a situation where the last 64 bits are sometimes the length bits and sometimes not, which doesn't seem right.
The only case where you wouldn't add any padding is if the data is already 512*n-64, e.g. if it were 448 bits. Then you add no padding but still add the length bits, and end up with 512.
According to RFC 62634, you should always pad data with atleast one bit set to 1 and append length (64-bit).
Even if input is 448 bit long you should pad it resulting in two 512-bit chunks:
[448 bits of input]['1']['0' x 511][64 bits representing input length]
So this is wrong:
The only case where you wouldn't add any padding is if the data is already 512*n-64, e.g. if it were 448 bits. Then you add no padding but still add the length bits, and end up with 512.
Note: If you accept input only as byte array, minimum padding is 1 byte set to 10000000 (binary).

Size of binary file after base64 encoding? Need explanation on the solution

So I'm studying for the upcoming exam, and there's this question: given a binary file with the size of 31 bytes what will its size be, after encoding it to base64?
The solution teacher gave us was (40 + 4) bytes as it needs to be a multiple of 4.
I'm not being able to come across this solution, and I have no idea how to solve this, so I was hoping somebody could help me figure this out.
Because base 64 encoding divide the input data in six bit block and one block use an ascii code.
If you have 31 byte in input you have 31*8/6 bit block to encode. As a rule of thumb every three byte in input you have 4 byte in output
If input data is not a multiple of six bit the base64 encoding fills the last block with 0 bit
In your example you have 42 block of six bit, with last filled with missing 0 bit.
Base 64 algorithm implementation filled the encoded data with '=' symbol in order to have of multiple of 4 as final result.

SHA collision probability when removing bytes

I'm implementing some program which uses id's with variable length. These id's identify a message and are sent to a broker which will perform some operation (not relevant to the question). However, the maximum length for this id in the broker is 24 bytes. I was thinking about hashing the id (prior to sending to the broker) with SHA and removing some bytes until it gets 24 bytes only.
However, I want to have an idea of how much will this increase the collisions. So this is what I got until now:
I found out that for a "perfect" hash we have the formula p^2 / 2^n+1 to describe the probability of collisions and where p is the number of messages and n is the size of the message in bits. Here is where my problem starts. I'm assuming that removing some bytes from the final hash the function still remains "perfect" and I can still use the same formula. So assuming this I get:
5160^2 / 2^192 + 1 = 2.12x10^-51
Where 5160 is the pick number of messages and 192 is basically the number of bits in 24 bytes.
My questions:
Is my assumption correct? Does the hash stay "perfect" by removing some bytes.
If so and since the probability is really small, which bytes should I remove? Most or less significant? Does it really matter at all?
PS: Any other suggestion to achieve the same result is welcomed. Thanks.
However, the maximum length for this id in the broker is 24 bytes. I was thinking about hashing the id (prior to sending to the broker) with SHA and removing some bytes until it gets 24 bytes only.
SHA-1 outputs only 20 bytes (160 bits), so you'd need to pad it. At least if all bytes are valid, and you're not restricted to hex or Base64. I recommend using truncated SHA-2 instead.
Is my assumption correct? Does the hash stay "perfect" by removing some bytes.
Pretty much. Truncating hashes should conserve all their important properties, obviously at the reduced security level corresponding to the smaller output size.
If so and since the probability is really small, which bytes should I remove? Most or less significant? Does it really matter at all?
That should not matter at all. NIST defined a truncated SHA-2 variant, called SHA-224, which takes the first 28 bytes of SHA-256 using a different initial state for the hash calculation.
My recommendation is to use SHA-256, keeping the first 24 bytes. This requires around 2^96 hash-function calls to find one collision. Which is currently infeasible, even for extremely powerful attackers, and essentially impossible for accidental collisions.

Is there any classic 3 byte fingerprint function?

I need a checksum/fingerprint function for short strings (say, 16 to 256 bytes) which fits in a 24 bits word. Is there any well known algorithm for that?
I propose to use a 24-bit CRC as an easy solution. CRCs are available in all lengths and always simple to compute. Wikipedia has a matching entry. The quality is far better than a modulo-reduced sum, because swapping characters will most likely produce a different CRC.
The next step (if it is a real threat to have a wrong string with the same checksum) would be a cryptographic MAC like CMAC. While this is too long out of the book, it can be reduced by taking the first 24 bits.
Simplest thing to do is a basic checksum - add up the bytes in the string, mod (2^24).
You have to watch out for character set issues when converting to bytes though, so everyone agrees on the same encoding of characters to bytes.

Would it be possible to have a UTF-8-like encoding limited to 3 bytes per character?

UTF-8 requires 4 bytes to represent characters outside the BMP. That's not bad; it's no worse than UTF-16 or UTF-32. But it's not optimal (in terms of storage space).
There are 13 bytes (C0-C1 and F5-FF) that are never used. And multi-byte sequences that are not used such as the ones corresponding to "overlong" encodings. If these had been available to encode characters, then more of them could have been represented by 2-byte or 3-byte sequences (of course, at the expense of making the implementation more complex).
Would it be possible to represent all 1,114,112 Unicode code points by a UTF-8-like encoding with at most 3 bytes per character? If not, what is the maximum number of characters such an encoding could represent?
By "UTF-8-like", I mean, at minimum:
The bytes 0x00-0x7F are reserved for ASCII characters.
Byte-oriented find / index functions work correctly. You can't find a false positive by starting in the middle of a character like you can in Shift-JIS.
Update -- My first attempt to answer the question
Suppose you have a UTF-8-style classification of leading/trailing bytes. Let:
A = the number of single-byte characters
B = the number of values used for leading bytes of 2-byte characters
C = the number of values used for leading bytes of 3-byte characters
T = 256 - (A + B + C) = the number of values used for trailing bytes
Then the number of characters that can be supported is N = A + BT + CT².
Given A = 128, the optimum is at B = 0 and C = 43. This allows 310,803 characters, or about 28% of the Unicode code space.
Is there a different approach that could encode more characters?
It would take a little over 20 bits to record all the Unicode code points (assuming your number is correct), leaving over 3 bits out of 24 for encoding which byte is which. That should be adequate.
I fail to see what you would gain by this, compared to what you would lose by not going with an established standard.
Edit: Reading the spec again, you want the values 0x00 through 0x7f reserved for the first 128 code points. That means you only have 21 bits in 3 bytes to encode the remaining 1,113,984 code points. 21 bits is barely enough, but it doesn't really give you enough extra to do the encoding unambiguously. Or at least I haven't figured out a way, so I'm changing my answer.
As to your motivations, there's certainly nothing wrong with being curious and engaging in a little thought exercise. But the point of a thought exercise is to do it yourself, not try to get the entire internet to do it for you! At least be up front about it when asking your question.
I did the math, and it's not possible (if wanting to stay strictly "UTF-8-like").
To start off, the four-byte range of UTF-8 covers U+010000 to U+10FFFF, which is a huge slice of the available characters. This is what we're trying to replace using only 3 bytes.
By special-casing each of the 13 unused prefix bytes you mention, you could gain 65,536 characters each, which brings us to a total of 13 * 0x10000, or 0xD0000.
This would bring the total 3-byte character range to U+010000 to U+0DFFFF, which is almost all, but not quite enough.
Sure it's possible. Proof:
224 = 16,777,216
So there is enough of a bit-space for 1,114,112 characters but the more crowded the bit-space the more bits are used per character. The whole point of UTF-8 is that it makes the assumption that the lower code points are far more likely in a character stream so the entire thing will be quite efficient even though some characters may use 4 bytes.
Assume 0-127 remains one byte. That leaves 8.4M spaces for 1.1M characters. You can then solve this is an equation. Choose an encoding scheme where the first byte determines how many bytes are used. So there are 128 values. Each of these will represent either 256 characters (2 bytes total) or 65,536 characters (3 bytes total). So:
256x + 65536(128-x) = 1114112 - 128
Solving this you need 111 values of the first byte as 2 byte characters and the remaining 17 as 3 byte. To check:
128 + 111 * 256 + 17 * 65536 = 1,114,256
To put it another way:
128 code points require 1 byte;
28,416 code points require 2 bytes; and
1,114,112 code points require 3 bytes.
Of course, this doesn't allow for the inevitable expansion of Unicode, which UTF-8 does. You can adjust this to the first byte meaning:
0-127 (128) = 1 byte;
128-191 (64) = 2 bytes;
192-255 (64) = 3 bytes.
This would be better because it's simple bitwise AND tests to determine length and gives an address space of 4,210,816 code points.