How to encode a byte sequence to avoid repetitions?

There is a problem I currently cannot resolve, and I need help.
I have transmitter and receiver devices. The transmitter needs to transmit a random byte sequence (with possible repetitions) of unknown length. The sequence can be transformed (encoded) before transmission if necessary.
The receiver device receives the sequence bytewise. It strictly requires that the incoming sequence contain no repeated bytes: every new byte must differ from the previously received one.
The question is how to encode the input byte sequence on the transmitter side so that the receiver's incoming byte sequence contains no repetitions.
All bytes of the incoming sequence should be uniquely decodable on the receiver side.
I've heard about scramblers. As I understand it, some of them can output a byte sequence without repetitions. But is there some simpler way?

Ignoring the first byte, the restriction that you can't have repeated bytes means that every byte represents a 1-in-255 choice, not a 1-in-256 one. That means you can send slightly less than 8 bits per byte (log2(255) ≈ 7.994 bits).
Hence, coding theory tells us that you need to transform your 256-symbol input stream into a 255-symbol stream. You then encode this 255-symbol stream by remembering the previous byte you sent out. If the symbol you want to send is lower than the previous byte you sent, you send it unmodified; otherwise you send it +1.
The decoding algorithm is the reverse: if you receive a byte that is higher than the previously received byte, subtract one.
As a simple example, consider sending 254 254 254. The first one can be sent straight away (first symbol), the second will be sent as 255 (+1) and the next one will be 254 again. Thus the receiver sees 254 255 254. The only byte that's higher than the preceding byte is 255, so subtract one from that to recover the initial sequence 254 254 254.
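A minimal Python sketch of this +1 transformation (the inputs here are already 255-ary symbols in the range 0..254; function names are mine):

```python
def encode_255(symbols):
    """Map 255-ary symbols (0..254, repeats allowed) to bytes with no two
    equal consecutive values, using the +1 trick described above."""
    sent, prev = [], None
    for s in symbols:
        b = s if (prev is None or s < prev) else s + 1   # b can never equal prev
        sent.append(b)
        prev = b
    return sent

def decode_255(received):
    """Reverse the transformation on the receiver side."""
    symbols, prev = [], None
    for b in received:
        symbols.append(b - 1 if (prev is not None and b > prev) else b)
        prev = b
    return symbols

assert encode_255([254, 254, 254]) == [254, 255, 254]
assert decode_255([254, 255, 254]) == [254, 254, 254]
```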
This coding is the most efficient possible; we are just left with the minor challenge of mapping a random byte stream (256 symbols) onto a 255-symbol stream. Remember, duplications are allowed in this 255-symbol code; that is exactly why we introduced it.
One easy but inefficient hack is to replace 254 with 254 0 and 255 with 254 1. The downside is that this effectively spends roughly 16 bits (two symbols) on each of those two input values. One difficult but perfectly space-efficient hack is to treat the whole input as a base-256 number and convert it to base-255.
What exactly you choose probably depends on your input.
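For illustration, a sketch of the easy-but-inefficient escape mapping mentioned above (254 becomes 254 0, 255 becomes 254 1); function names are mine:

```python
def bytes_to_255(data):
    """256-symbol bytes -> 255-ary symbols (0..254) via escaping."""
    out = []
    for b in data:
        if b < 254:
            out.append(b)
        else:
            out.extend((254, b - 254))   # 254 -> 254 0, 255 -> 254 1
    return out

def symbols_to_bytes(symbols):
    """Undo the escaping on the receiver side."""
    data, it = bytearray(), iter(symbols)
    for s in it:
        data.append(s if s < 254 else 254 + next(it))
    return bytes(data)

assert symbols_to_bytes(bytes_to_255(b"\x01\xfe\xff")) == b"\x01\xfe\xff"
```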

You can send the data as padded groups of 8 bytes: 7 bytes of data plus one special byte. Seven of its bits are set to 0 or 1 depending on whether the corresponding data byte had to be modified (XORed with 0xFF, for example) to make it different from the previous byte. The last bit is used to make this special byte itself different from the last data byte.
original data:
0x00 0x00 0x00 0x00 0x00 0x00 0x00
packet:
0x00 0xFF 0x00 0xFF 0x00 0xFF 0x00 0b01010100 <- the last bit would be flipped if the preceding data byte were the same as this special byte
Note: if you do not have a buffer for 8 bytes, you can send the special byte first, keep it in a register, shift it left/right, and process each following data byte according to the corresponding bit value.
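A Python sketch of this packet scheme as I read it (bit 7-i of the special byte marks whether data byte i was XORed with 0xFF; names are mine):

```python
def encode_packet(block7, prev=None):
    """Encode 7 data bytes as 8 wire bytes with no repeated consecutive bytes.
    `prev` is the last byte already sent on the wire (None at stream start)."""
    out, flags = bytearray(), 0
    for i, b in enumerate(block7):
        if b == prev:
            b ^= 0xFF                 # flip so it differs from the previous byte
            flags |= 1 << (7 - i)     # record the flip in the special (flag) byte
        out.append(b)
        prev = b
    if flags == prev:
        flags ^= 1                    # last bit keeps the flag byte distinct too
    out.append(flags)
    return bytes(out)

def decode_packet(packet8):
    """Undo the per-byte flips recorded in the trailing flag byte."""
    flags = packet8[7]
    return bytes(b ^ (0xFF if flags & (1 << (7 - i)) else 0)
                 for i, b in enumerate(packet8[:7]))

pkt = encode_packet(b"\x00" * 7)
assert pkt == bytes.fromhex("00ff00ff00ff0054")   # matches the example above
assert decode_packet(pkt) == b"\x00" * 7
```

For a continuous stream you would pass the last emitted byte (the flag byte) as `prev` when encoding the next block.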

A simple approach is to just use the top bit for uniqueness, and the lower 7 bits to carry data:
0xxxxxxx 1xyyyyyy 0yyzzzzz 1zzz....
This encodes every 7 bytes of input as 8 bytes on the connection. On the sender, you keep one bit of state to toggle between 0 and 1, plus a 0-6 counter for the variable bit shifts. On the receiver, you don't even need to decode the top bit, so you just need the 0-6 counter to reverse the bit shifts. On both sides you also need to keep part of one byte, so about 2 bytes of state in total. Still, not too bad, and certainly doable in an FPGA or in about a dozen ARM instructions.
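A Python sketch of that idea, assuming the simplest variant in which the top bit just alternates 0/1 between consecutive wire bytes (that alone guarantees consecutive bytes differ) and the lower 7 bits carry the input bit stream MSB-first; names are mine:

```python
def encode_7in8(data):
    """Pack 8-bit input into wire bytes: alternating top bit + 7 data bits."""
    out, acc, nbits, top = bytearray(), 0, 0, 0
    for b in data:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append((top << 7) | ((acc >> nbits) & 0x7F))
            acc &= (1 << nbits) - 1
            top ^= 1
    if nbits:                                   # flush, zero-padding the tail
        out.append((top << 7) | ((acc << (7 - nbits)) & 0x7F))
    return bytes(out)

def decode_7in8(wire, length):
    """Strip the top bits and regroup the 7-bit payloads into `length` bytes."""
    out, acc, nbits = bytearray(), 0, 0
    for b in wire:
        acc = (acc << 7) | (b & 0x7F)
        nbits += 7
        if nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    return bytes(out[:length])

msg = bytes(range(10))
assert decode_7in8(encode_7in8(msg), len(msg)) == msg
```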

What do you think about the idea of using some software byte scrambler?
I mean some simple algorithm that will transform the original byte stream into a pseudo-random sequence without repetitions.
Is it possible to reliably avoid repetitions this way?
I would just like to know about it as an additional possible solution...

Related

A protocol is telling me to encode the numeric value 150 as 0x01 0x50, and the value 35 as 0x00 0x35?

So I'm trying to implement the 'ECR' protocol that talks to a credit card terminal (Ingenico/Telium device in Costa Rica).
The documentation for the 'length' bytes states:
Length of field DATA (it does not include ETX nor LRC)
Example: if length of field Message Data
is 150 bytes; then, 0x01 0x50 is sent.
I would think that the value '150' should be sent as 0x00 0x96.
I've verified that that is not a typo. In a working example message which has 35 bytes of data, they really do send 0x00 0x35.
Am I missing something? Is this form of encoding the decimal representation of a value to its literal representation in hex a thing? Does it have a name? Why would anyone do this?
It has a name, and it was frequent in the past: it is Binary-Coded Decimal, or BCD for short; see https://en.wikipedia.org/wiki/Binary-coded_decimal.
In fact, Intel CPUs (all but the 64-bit versions) had special instructions to deal with it.
How it works: every decimal digit is encoded in 4 bits (a nibble), so a byte can hold two decimal digits, and you get a string of them to describe integer numbers. Note: to convert to a string (or back from a string) you split each byte into its nibbles, and then it is just an addition ('0' + nibble); the C language requires that the character codes of the digits be consecutive (and ordered).
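A minimal Python sketch of packed BCD (function names are mine), reproducing the 150 -> 0x01 0x50 example from the question:

```python
def to_bcd(n, nbytes=2):
    """Pack a non-negative integer as packed BCD, two decimal digits per byte."""
    out = bytearray(nbytes)
    for i in range(nbytes - 1, -1, -1):
        out[i] = (n % 10) | ((n // 10 % 10) << 4)   # low digit, then high digit
        n //= 100
    return bytes(out)

def from_bcd(b):
    """Unpack packed BCD back into an integer."""
    n = 0
    for byte in b:
        n = n * 100 + (byte >> 4) * 10 + (byte & 0x0F)
    return n

assert to_bcd(150) == b"\x01\x50" and to_bcd(35) == b"\x00\x35"
assert from_bcd(b"\x01\x50") == 150
```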
If you work a lot with decimals, it is convenient and fast: no need to transform to binary (which requires shifts and additions, or multiplications) and back (shifts or divisions). So in the past, when most CPUs didn't have floating-point co-processors, this was very convenient (especially if you only need to add or subtract numbers), and there is no need to handle precision errors (which banks don't like; wasn't there a Superman movie about a villain getting rich by exploiting rounding errors in a bank system? That shows the worries of the time).
It also has fewer problems with the number of bits: banks need accounts holding potentially billions with a precision of cents. BCD makes it easier to port a program to different platforms, with different endianness and different word sizes. Again, this mattered in the past, when 8-bit, 16-bit, 32-bit, 36-bit, etc. machines were common and there was no real standard architecture.
It is an obsolete system: newer CPUs have no problem converting decimal to binary and back, and we have enough bits to handle cents. Note that the financial sector still avoids floating point, using integers with a fixed point (usually 2 digits) instead. But protocols and some sectors tend not to change very often (for interoperability).

If I know most of the clear text, can I crack AES?

To be specific, suppose that AES_256_CBC is used (the key is 32 bytes, the IV is 16 bytes) and that the text is about 500 bytes, most of which is known to me. Consider an extreme case: only 1 byte of it is unknown to me.
And suppose the message head (the first 32 bytes) does not contain that byte.
Can I crack this? I mean, can I find out what that unknown byte is?

Are there any UTF-8 code units that have byte 60 or 62 (`<` and `>`) as not the first byte of their binary representation?

I need to debug an XML parser and I am wondering if I can construct "malicious" input that will cause it to not recognize opening and closing tags correctly.
Additionally, where can I find this sort of information in general? After this I will also want to be sure that the parser I am working with won't have trouble with other special characters such as &, =, ", etc.
UTF-8 makes it very easy to figure out what the role of a code unit (i.e. a byte) is:
If the highest bit is not set, i.e. the code unit is 0xxxxxxx, then this byte expresses an entire code point, whose value is xxxxxxx (i.e. 7 bits of information).
If the highest bit is set and the code unit is 10xxxxxx, then it is a continuation part of a multibyte sequence, carrying six bits of information.
Otherwise, the code unit is the initial byte of a multibyte sequence, as follows:
110xxxxx: Two bytes (one continuation byte), for 5 + 6 = 11 bits.
1110xxxx: Three bytes (two continuation bytes), for 4 + 6 + 6 = 16 bits.
11110xxx: Four bytes (three continuation bytes), for 3 + 6 + 6 + 6 = 21 bits.
As you can see, a value 60, which is 00111100, is a single-byte codepoint of value 60, and the same byte cannot occur as part of any multibyte sequence.
The scheme can actually be extended up to seven bytes, encoding up to 36 bits, but since Unicode only requires 21 bits, four bytes suffice. The standard mandates that a code point must be represented with the minimal number of code units.
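A small Python sketch of that classification, illustrating why bytes 60 ('<') and 62 ('>') can only ever appear as stand-alone ASCII code points:

```python
def utf8_byte_role(b):
    """Classify a single UTF-8 code unit (byte) by its high bits."""
    if b < 0x80:
        return "single-byte code point (ASCII)"
    if b & 0xC0 == 0x80:
        return "continuation byte (10xxxxxx)"
    if b & 0xE0 == 0xC0:
        return "leading byte of a 2-byte sequence"
    if b & 0xF0 == 0xE0:
        return "leading byte of a 3-byte sequence"
    if b & 0xF8 == 0xF0:
        return "leading byte of a 4-byte sequence"
    return "not valid anywhere in UTF-8"

print(utf8_byte_role(60))   # 60 = 00111100 -> single-byte code point (ASCII)
print(utf8_byte_role(62))   # 62 = 00111110 -> single-byte code point (ASCII)
```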
Update: As @Mark Tolonen rightly points out, you should check carefully whether each encoded code point is actually encoded with the minimal number of code units. If a browser inadvertently accepted such input, a user could sneak something past you that you would not spot in a byte-for-byte analysis. As a starting point you could look for bytes like 10111100, but you'd have to check the entire multibyte sequence of which it is a part (since it can of course occur legitimately as part of different code points). Ultimately, if you can't trust the browser, you don't really get around decoding everything and simply checking the resulting code point sequence for occurrences of U+003C etc., rather than looking at the byte stream at all.
In UTF-8, no. In other encodings, yes.
In UTF-8, by design, all bytes of a multibyte character will always have the highest bit set. Vice versa, a byte that doesn't have the highest bit set is always an ASCII character.
However, this is not true for other encodings, which are also valid for XML.
For more information about UTF-8, see e.g. Wikipedia.
A poorly designed UTF-8 decoder could interpret the bytes C0 BC and C0 BE as U+003C and U+003E. As @Kerrek SB stated in his answer:
The standard mandates that a code point must be represented with the minimal number of code units.
But a poor algorithm might still decode a malformed two-byte UTF-8 sequence that is not the minimal number of code units:
C0 BC = 11000000 10111100 -> payload bits 00000 111100 = 0x3C = 60 = '<'
So in your testing be sure to include malformed UTF-8 sequences and verify that they are rejected.
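For testing, a strict decoder such as Python's built-in one can serve as a reference: it rejects the overlong form above, whereas a naive decoder would return '<':

```python
# The overlong two-byte form C0 BC must be rejected by a conforming decoder.
try:
    b"\xC0\xBC".decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err)          # 0xC0 is never a legal UTF-8 start byte

# The only valid UTF-8 encoding of '<' (U+003C) is the single byte 0x3C.
assert b"\x3C".decode("utf-8") == "<"
```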

Would it be possible to have a UTF-8-like encoding limited to 3 bytes per character?

UTF-8 requires 4 bytes to represent characters outside the BMP. That's not bad; it's no worse than UTF-16 or UTF-32. But it's not optimal (in terms of storage space).
There are 13 byte values (C0-C1 and F5-FF) that are never used, and there are multi-byte sequences that are never used, such as the ones corresponding to "overlong" encodings. If these had been available to encode characters, then more characters could have been represented by 2-byte or 3-byte sequences (at the expense, of course, of making the implementation more complex).
Would it be possible to represent all 1,114,112 Unicode code points by a UTF-8-like encoding with at most 3 bytes per character? If not, what is the maximum number of characters such an encoding could represent?
By "UTF-8-like", I mean, at minimum:
The bytes 0x00-0x7F are reserved for ASCII characters.
Byte-oriented find / index functions work correctly. You can't find a false positive by starting in the middle of a character like you can in Shift-JIS.
Update -- My first attempt to answer the question
Suppose you have a UTF-8-style classification of leading/trailing bytes. Let:
A = the number of single-byte characters
B = the number of values used for leading bytes of 2-byte characters
C = the number of values used for leading bytes of 3-byte characters
T = 256 - (A + B + C) = the number of values used for trailing bytes
Then the number of characters that can be supported is N = A + BT + CT².
Given A = 128, the optimum is at B = 0 and C = 43. This allows 310,803 characters, or about 28% of the Unicode code space.
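A quick brute-force check of that formula (my own sketch, not part of the original question):

```python
# Maximize N = A + B*T + C*T^2 with A = 128 and T = 256 - (A + B + C) > 0.
best = max(
    (128 + b * t + c * t * t, b, c)
    for b in range(128)
    for c in range(128 - b)
    for t in (128 - b - c,)
)
print(best)   # (310803, 0, 43): B = 0, C = 43 supports 310,803 characters
```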
Is there a different approach that could encode more characters?
It would take a little over 20 bits to record all the Unicode code points (assuming your number is correct), leaving over 3 bits out of 24 for encoding which byte is which. That should be adequate.
I fail to see what you would gain by this, compared to what you would lose by not going with an established standard.
Edit: Reading the spec again, you want the values 0x00 through 0x7f reserved for the first 128 code points. That means you only have 21 bits in 3 bytes to encode the remaining 1,113,984 code points. 21 bits is barely enough, but it doesn't really give you enough extra to do the encoding unambiguously. Or at least I haven't figured out a way, so I'm changing my answer.
As to your motivations, there's certainly nothing wrong with being curious and engaging in a little thought exercise. But the point of a thought exercise is to do it yourself, not try to get the entire internet to do it for you! At least be up front about it when asking your question.
I did the math, and it's not possible (if wanting to stay strictly "UTF-8-like").
To start off, the four-byte range of UTF-8 covers U+010000 to U+10FFFF, which is a huge slice of the available characters. This is what we're trying to replace using only 3 bytes.
By special-casing each of the 13 unused prefix bytes you mention, you could gain 65,536 characters each, which brings us to a total of 13 * 0x10000, or 0xD0000.
This would bring the total 3-byte character range to U+010000 to U+0DFFFF, which is almost all, but not quite enough.
Sure it's possible. Proof:
2^24 = 16,777,216
So there is enough bit-space for 1,114,112 characters, but the more crowded the bit-space, the more bits are used per character. The whole point of UTF-8 is that it assumes the lower code points are far more likely in a character stream, so the entire thing ends up quite efficient even though some characters may use 4 bytes.
Assume 0-127 remains one byte. That leaves about 8.4M slots for 1.1M characters. You can then solve this as an equation. Choose an encoding scheme where the first byte determines how many bytes are used, so there are 128 remaining first-byte values, each of which represents either 256 characters (2 bytes total) or 65,536 characters (3 bytes total). So:
256x + 65536(128 - x) >= 1114112 - 128
Solving this gives x <= 111.4, so use 111 first-byte values for 2-byte characters and the remaining 17 for 3-byte characters. To check:
128 + 111 * 256 + 17 * 65536 = 1,142,656 >= 1,114,112
To put it another way:
128 code points require 1 byte;
28,416 code points require 2 bytes; and
1,114,112 code points require 3 bytes.
Of course, this doesn't allow for the inevitable expansion of Unicode, which UTF-8 does. You can adjust this to the first byte meaning:
0-127 (128) = 1 byte;
128-191 (64) = 2 bytes;
192-255 (64) = 3 bytes.
This would be better because it's simple bitwise AND tests to determine length and gives an address space of 4,210,816 code points.
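A quick arithmetic check of both splits (a sketch; the 111/17 split comes from the equation in the previous answer, the 64/64 split is the variant just described):

```python
for two_byte_leads, three_byte_leads in [(111, 17), (64, 64)]:
    capacity = 128 + two_byte_leads * 256 + three_byte_leads * 65536
    print(two_byte_leads, three_byte_leads, capacity)
# 111 17 -> 1,142,656 code points (covers all 1,114,112)
# 64  64 -> 4,210,816 code points
```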

How does Base 64 handle binary data with zeroes at the end

As I understand the spec, a Base64 encoder
a) takes the source binary and pads it out with zeroes to a multiple of 24 bits in length.
b) it then transcodes it, six bits at a time, to the target set of 64 characters (A..Z, a..z, 0..9, +, /). If it finds that the last two bytes (16 bits) have been zero-padded, the last two characters are transcoded as '=='. If it finds that the last byte (8 bits) has been zero-padded, the last character is transcoded as '='.
My question is, in step (b), how does it know that the last bytes are zeroes because they have been padded versus zeroes that are part of the valid binary source data?
Is it that the subsystem responsible for part (b) has to know what took place during part (a)?
The encoder (as opposed to the decoder) knows the length of the input data and can figure out whether to output nothing, '=' or '==' at the end. Your question assumes there is no connection between the two stages you mention, but that is not true in the implementations I've seen.
The implementation I had to write didn't do the first stage at all, since it had routines to extract 6-bit groups from the input stream one at a time, incrementing byteCount each time. Then at the end, the expression byteCount % 3 was used to decide which string to append to the output stream.
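A minimal Python sketch of an encoder built that way: it never pre-pads the input, and the len(data) % 3 remainder at the end decides how many '=' characters to append, so genuine zero bytes in the data are never mistaken for padding (names are mine):

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64encode(data):
    out = []
    # Complete 3-byte groups: 24 bits become four 6-bit indices.
    for i in range(0, len(data) - len(data) % 3, 3):
        n = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2]
        out += [B64[n >> 18 & 63], B64[n >> 12 & 63], B64[n >> 6 & 63], B64[n & 63]]
    # Leftover bytes: the zero bits added here are padding by construction,
    # because the encoder knows exactly how many real input bytes remain.
    rem = len(data) % 3
    if rem == 1:
        n = data[-1] << 16
        out += [B64[n >> 18 & 63], B64[n >> 12 & 63], "=", "="]
    elif rem == 2:
        n = (data[-2] << 16) | (data[-1] << 8)
        out += [B64[n >> 18 & 63], B64[n >> 12 & 63], B64[n >> 6 & 63], "="]
    return "".join(out)

# Two real zero bytes vs. three real zero bytes:
assert b64encode(b"\x00\x00") == "AAA="      # one '=' marks the padding
assert b64encode(b"\x00\x00\x00") == "AAAA"  # no padding: all zeroes are data
```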