How does Base 64 handle binary data with zeroes at the end - encoding

As I understand the spec, a Base64 encoder
a) takes the source binary, and pads it out with zeroes to be a multiple of 24 bytes long.
b) it then transcodes it, six bits at a time, to the target set of 64 characters (A..Z, a..z, 0..9, +, -). If it finds the last two bytes (16 bits) have been zero-padded, the last two characters are transcoded as '=='. If it finds that the last one byte (8 bits) have been zero-padded, the last character is transcoded as '='.
My question is, in step (b), how does it know that the last bytes are zeroes because they have been padded vs. they are zeroes because thay are part of the valid binary source data?
Is it that the subsystem that is responsible for part (b) has to know what took place during part (a)

The encoder (as opposed to the decoder) will know the length of the input data and be able to figure out whether to output nothing, "=" or "==" at the end. Your question assumes there is no connect between the two stages you mention but that is not true in the implementations I've seen.
The implementation I had to write didn't do the first stage at all since it had routines to extract 6-bit groups from the input stream one at a time, incrementing byteCount each time. Then at the end, the expression "byteCount%3" was used to decide which string to append to the output stream.

Related

Struggling with Base 64 encoding in T-SQL - and the padding [duplicate]

What is the purpose of padding in base64 encoding. The following is the extract from wikipedia:
"An additional pad character is allocated which may be used to force the encoded output into an integer multiple of 4 characters (or equivalently when the unencoded binary text is not a multiple of 3 bytes) ; these padding characters must then be discarded when decoding but still allow the calculation of the effective length of the unencoded text, when its input binary length would not be not a multiple of 3 bytes (the last non-pad character is normally encoded so that the last 6-bit block it represents will be zero-padded on its least significant bits, at most two pad characters may occur at the end of the encoded stream)."
I wrote a program which could base64 encode any string and decode any base64 encoded string. What problem does padding solves?
Your conclusion that padding is unnecessary is right. It's always possible to determine the length of the input unambiguously from the length of the encoded sequence.
However, padding is useful in situations where base64 encoded strings are concatenated in such a way that the lengths of the individual sequences are lost, as might happen, for example, in a very simple network protocol.
If unpadded strings are concatenated, it's impossible to recover the original data because information about the number of odd bytes at the end of each individual sequence is lost. However, if padded sequences are used, there's no ambiguity, and the sequence as a whole can be decoded correctly.
Edit: An Illustration
Suppose we have a program that base64-encodes words, concatenates them and sends them over a network. It encodes "I", "AM" and "TJM", sandwiches the results together without padding and transmits them.
I encodes to SQ (SQ== with padding)
AM encodes to QU0 (QU0= with padding)
TJM encodes to VEpN (VEpN with padding)
So the transmitted data is SQQU0VEpN. The receiver base64-decodes this as I\x04\x14\xd1Q) instead of the intended IAMTJM. The result is nonsense because the sender has destroyed information about where each word ends in the encoded sequence. If the sender had sent SQ==QU0=VEpN instead, the receiver could have decoded this as three separate base64 sequences which would concatenate to give IAMTJM.
Why Bother with Padding?
Why not just design the protocol to prefix each word with an integer length? Then the receiver could decode the stream correctly and there would be no need for padding.
That's a great idea, as long as we know the length of the data we're encoding before we start encoding it. But what if, instead of words, we were encoding chunks of video from a live camera? We might not know the length of each chunk in advance.
If the protocol used padding, there would be no need to transmit a length at all. The data could be encoded as it came in from the camera, each chunk terminated with padding, and the receiver would be able to decode the stream correctly.
Obviously that's a very contrived example, but perhaps it illustrates why padding might conceivably be helpful in some situations.
On a related note, here's an arbitrary base converter I created for you. Enjoy!
https://convert.zamicol.com/
What are Padding Characters?
Padding characters help satisfy length requirements and carry no other meaning.
Decimal Example of Padding:
Given the arbitrary requirement all strings be 8 characters in length, the number 640 can meet this requirement using preceding 0's as padding characters as they carry no meaning, "00000640".
Binary Encoding
The Byte Paradigm: For encoding, the byte is the de facto standard unit of measurement and any scheme must relate back to bytes.
Base256 fits exactly into the byte paradigm. One byte is equal to one character in base256.
Base16, hexadecimal or hex, uses 4 bits for each character. One byte can represent two base16 characters.
Base64 does not fit evenly into the byte paradigm (nor does base32), unlike base256 and base16. All base64 characters can be represented in 6 bits, 2 bits short of a full byte.
We can represent base64 encoding versus the byte paradigm as a fraction: 6 bits per character over 8 bits per byte. Reduced this fraction is 3 bytes over 4 characters.
This ratio, 3 bytes for every 4 base64 characters, is the rule we want to follow when encoding base64. Base64 encoding can only promise even measuring with 3 byte bundles, unlike base16 and base256 where every byte can stand on it's own.
So why is padding encouraged even though encoding could work just fine without the padding characters?
If the length of a stream is unknown or if it could be helpful to know exactly when a data stream ends, use padding. The padding characters communicate explicitly that those extra spots should be empty and rules out any ambiguity. Even if the length is unknown with padding you'll know where your data stream ends.
As a counter example, some standards like JOSE don't allow padding characters. In this case, if there is something missing, a cryptographic signature won't work or other non base64 characters will be missing (like the "."). Although assumptions about length aren't made, padding isn't needed because if there is something wrong it simply won't work.
And this is exactly what the base64 RFC says,
In some circumstances, the use of padding ("=") in base-encoded data
is not required or used. In the general case, when assumptions about
the size of transported data cannot be made, padding is required to
yield correct decoded data.
[...]
The padding step in base 64 [...] if improperly
implemented, lead to non-significant alterations of the encoded data.
For example, if the input is only one octet for a base 64 encoding,
then all six bits of the first symbol are used, but only the first
two bits of the next symbol are used. These pad bits MUST be set to
zero by conforming encoders, which is described in the descriptions
on padding below. If this property do not hold, there is no
canonical representation of base-encoded data, and multiple base-
encoded strings can be decoded to the same binary data. If this
property (and others discussed in this document) holds, a canonical
encoding is guaranteed.
Padding allows us to decode base64 encoding with the promise of no lost bits. Without padding there is no longer the explicit acknowledgement of measuring in three byte bundles. Without padding you may not be able to guarantee exact reproduction of original encoding without additional information usually from somewhere else in your stack, like TCP, checksums, or other methods.
Alternatively to bucket conversion schemes like base64 is radix conversion which has no arbitrary bucket sizes and for left-to-right readers is left padded. The "iterative divide by radix" conversion method is typically employed for radix conversions.
Examples
Here is the example form RFC 4648 (https://www.rfc-editor.org/rfc/rfc4648#section-8)
Each character inside the "BASE64" function uses one byte (base256). We then translate that to base64.
BASE64("") = "" (No bytes used. 0 % 3 = 0)
BASE64("f") = "Zg==" (One byte used. 1 % 3 = 1)
BASE64("fo") = "Zm8=" (Two bytes. 2 % 3 = 2)
BASE64("foo") = "Zm9v" (Three bytes. 3 % 3 = 0)
BASE64("foob") = "Zm9vYg==" (Four bytes. 4 % 3 = 1)
BASE64("fooba") = "Zm9vYmE=" (Five bytes. 5 % 3 = 2)
BASE64("foobar") = "Zm9vYmFy" (Six bytes. 6 % 3 = 0)
Here's an encoder that you can play around with: http://www.motobit.com/util/base64-decoder-encoder.asp
There is not much benefit to it in the modern day. So let's look at this as a question of what the original historical purpose may have been.
Base64 encoding makes its first appearance in RFC 1421 dated 1993. This RFC is actually focused on encrypting email, and base64 is described in one small section 4.3.2.4.
This RFC does not explain the purpose of the padding. The closest we have to a mention of the original purpose is this sentence:
A full encoding quantum is always completed at the end of a message.
It does not suggest concatenation (top answer here), nor ease of implementation as an explicit purpose for the padding. However, considering the entire description, it is not unreasonable to assume that this may have been intended to help the decoder read the input in 32-bit units ("quanta"). That is of no benefit today, however in 1993 unsafe C code would have very likely actually taken advantage of this property.
With padding, a base64 string always has a length that is a multiple of 4 (if it doesn't, the string has been corrupted for sure) and thus code can easily process that string in a loop that processes 4 characters at a time (always converting 4 input characters to three or less output bytes). So padding makes sanity checking easy (length % 4 != 0 ==> error as not possible with padding) and it makes processing simpler and more efficient.
I know what people will think: Even without padding, I can process all 4-byte chunks in a loop and then just add special handling for the last 1 to 3 bytes, if those exist. It's just a few lines of extra code and the speed difference will be too tiny to even measure. Probably true but you are thinking in terms of C (or higher languages) and a powerful CPU with plenty of RAM. What if you need to decode base64 in hardware, using a simple DSP, that has very limited processing power, no RAM storage and you have to write the code in very limited micro-assembly? What if you cannot use code at all and everything has to be done with just transistors stacked together (a hardwired hardware implementation)? With padding that's way simpler than without.
Padding fills the output length to a multiple of four bytes in a defined way.

How does huffman encoding know the length of each value code it's reading?

I'm trying to understand how Huffman encoding works. All of the summaries I've read explain how to generate the value codes, but not the full process of how to actually read them in. I'm wondering how the algorithm knows the bit-length of each value it's reading in.
For example, if you are representing the symbol string "ETQ A" with the code series "01 -- 110 -- 1101 -- 1 -- 10", how do you know where one symbol begins and another one ends? How do you know to read two bits at index 1, three bits at index 2, etc?
When decoding, there are two givens:
Huffman encoding tree
Input bytes
You then just consume the bits until reaching a leaf node, at which point you terminate for that single character decoding.
(source: kirk at people.cs.pitt.edu)
You then continue consuming bits until all characters are decoded
Taken from http://people.cs.pitt.edu/~kirk/cs1501/animations/Huffman.html
The decoding procedure is deceptively simple.
Starting with the first bit in the stream, one then uses successive bits from
the stream to determine whether
to go left or right in the decoding tree.
When we reach a leaf of the tree,
we've decoded a character, so we place
that character onto the (uncompressed)
output stream. The next bit in the
input stream is the first bit of the
next character.

Maximum size in bytes of single 16 bit character in UTF-7 representation

What would be maximum size in bytes of single UTF-16 character (2-byte character i.e. char type in .NET) saved in UTF-7 format?
This is what I've found on Wikipedia:
5 for an isolated case inside a run of single byte characters. For
runs 2 2⁄3 per character plus padding to make it a whole number of
bytes plus two to start and finish the run
http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings#Seven-bit_environments
For a single UTF-16 code unit, the only number you need to pay attention to in there is 5.
Essentially, in UTF-7, characters not within its "safe" alphabet
are converted to UTF-16 and then that is converted to modified Base64. With a single UTF-16 code unit, it is converted into 2 2/3 Base64 units then padded to a full 3. An escape character is added at the beginning and possibly the end to signify it as a UTF-7 sequence, resulting in a maximum of 5 bytes.

Are there any UTF-8 code units that have byte 60 or 62 (`<` and `>`) as not the first byte of their binary representation?

I need to debug a XML parser and I am wondering if I can construct "malicious" input that will cause it to not recognize opening and closing tags correctly.
Additionally, where can I find this sort of information in general? After this I will also want to be sure that the parser I am working with won't have trouble with other special characters such as &, = , ", etc.
UTF-8 makes it very easy to figure out what the role of a code unit (i.e. a byte) is:
If the highest bit is not set, i.e. the code unit is 0xxxxxxx, then this is byte expresses an entire code point, whose value is xxxxxxx (i.e. 7 bits of information).
If the highest bit is set and the code unit is 10xxxxxx, then it is a continuation part of a multibyte sequence, carrying six bits of information.
Otherwise, the code unit is the initial byte of a multibyte sequence, as follows:
110xxxxx: Two bytes (one continuation byte), for 5 + 6 = 11 bits.
1110xxxx: Three bytes (two continuation bytes), for 4 + 6 + 6 = 16 bits.
11110xxx: Four bytes (three continuation bytes), for 3 + 6 + 6 + 6 = 21 bits.
As you can see, a value 60, which is 00111100, is a single-byte codepoint of value 60, and the same byte cannot occur as part of any multibyte sequence.
The scheme can actually be extended up to seven bytes, encoding up to 36 bits, but since Unicode only requires 21 bits, four bytes suffice. The standard mandates that a code point must be represented with the minimal number of code units.
Update: As #Mark Tolonen rightly points out, you should check carefully whether each encoded code point is actually encoded with the minimal number of code units. If a browser would inadvertently accept such input, a user could sneak something past you that you would not spot in a byte-for-byte analysis. As a starting point you could look for bytes like 10111100, but you'd have to check the entire multibyte sequence of which it is a part (since it can of course occur legitimately as a part of different code points). Ultimately, if you can't trust the browser, you don't really get around decoding everything and just check­ing the resulting code point sequence for occurrences of U+3C etc., and don't even bother looking at the byte stream.
In UTF-8, no. In other encodings, yes.
In UTF-8, by design, all bytes of a multibyte character will always have the highest bit set. Vice versa, a byte that doesn't have the highest bit set is always an ASCII character.
However, this is not true for other encodings, which are also valid for XML.
For more information about UTF-8, check e.g wikipedia
A poorly-designed UTF-8 decoder could interpret the bytes C0 BC and C0 BE as U+003C and U+003E. As #KerrekSB stated in his answer:
The standard mandates that a code point must be represented with the minimal number of code units.
But a poor algorithm might still decode a malformed two-byte UTF-8 sequence that is not the minimal number of code units:
C0 BC = 11000000 10111100 = 00000111100 = 3Chex = 60dec = '<'
So in your testing be sure to include malformed UTF-8 sequences and verify that they are rejected.

Why does base64 encoding require padding if the input length is not divisible by 3?

What is the purpose of padding in base64 encoding. The following is the extract from wikipedia:
"An additional pad character is allocated which may be used to force the encoded output into an integer multiple of 4 characters (or equivalently when the unencoded binary text is not a multiple of 3 bytes) ; these padding characters must then be discarded when decoding but still allow the calculation of the effective length of the unencoded text, when its input binary length would not be not a multiple of 3 bytes (the last non-pad character is normally encoded so that the last 6-bit block it represents will be zero-padded on its least significant bits, at most two pad characters may occur at the end of the encoded stream)."
I wrote a program which could base64 encode any string and decode any base64 encoded string. What problem does padding solves?
Your conclusion that padding is unnecessary is right. It's always possible to determine the length of the input unambiguously from the length of the encoded sequence.
However, padding is useful in situations where base64 encoded strings are concatenated in such a way that the lengths of the individual sequences are lost, as might happen, for example, in a very simple network protocol.
If unpadded strings are concatenated, it's impossible to recover the original data because information about the number of odd bytes at the end of each individual sequence is lost. However, if padded sequences are used, there's no ambiguity, and the sequence as a whole can be decoded correctly.
Edit: An Illustration
Suppose we have a program that base64-encodes words, concatenates them and sends them over a network. It encodes "I", "AM" and "TJM", sandwiches the results together without padding and transmits them.
I encodes to SQ (SQ== with padding)
AM encodes to QU0 (QU0= with padding)
TJM encodes to VEpN (VEpN with padding)
So the transmitted data is SQQU0VEpN. The receiver base64-decodes this as I\x04\x14\xd1Q) instead of the intended IAMTJM. The result is nonsense because the sender has destroyed information about where each word ends in the encoded sequence. If the sender had sent SQ==QU0=VEpN instead, the receiver could have decoded this as three separate base64 sequences which would concatenate to give IAMTJM.
Why Bother with Padding?
Why not just design the protocol to prefix each word with an integer length? Then the receiver could decode the stream correctly and there would be no need for padding.
That's a great idea, as long as we know the length of the data we're encoding before we start encoding it. But what if, instead of words, we were encoding chunks of video from a live camera? We might not know the length of each chunk in advance.
If the protocol used padding, there would be no need to transmit a length at all. The data could be encoded as it came in from the camera, each chunk terminated with padding, and the receiver would be able to decode the stream correctly.
Obviously that's a very contrived example, but perhaps it illustrates why padding might conceivably be helpful in some situations.
On a related note, here's an arbitrary base converter I created for you. Enjoy!
https://convert.zamicol.com/
What are Padding Characters?
Padding characters help satisfy length requirements and carry no other meaning.
Decimal Example of Padding:
Given the arbitrary requirement all strings be 8 characters in length, the number 640 can meet this requirement using preceding 0's as padding characters as they carry no meaning, "00000640".
Binary Encoding
The Byte Paradigm: For encoding, the byte is the de facto standard unit of measurement and any scheme must relate back to bytes.
Base256 fits exactly into the byte paradigm. One byte is equal to one character in base256.
Base16, hexadecimal or hex, uses 4 bits for each character. One byte can represent two base16 characters.
Base64 does not fit evenly into the byte paradigm (nor does base32), unlike base256 and base16. All base64 characters can be represented in 6 bits, 2 bits short of a full byte.
We can represent base64 encoding versus the byte paradigm as a fraction: 6 bits per character over 8 bits per byte. Reduced this fraction is 3 bytes over 4 characters.
This ratio, 3 bytes for every 4 base64 characters, is the rule we want to follow when encoding base64. Base64 encoding can only promise even measuring with 3 byte bundles, unlike base16 and base256 where every byte can stand on it's own.
So why is padding encouraged even though encoding could work just fine without the padding characters?
If the length of a stream is unknown or if it could be helpful to know exactly when a data stream ends, use padding. The padding characters communicate explicitly that those extra spots should be empty and rules out any ambiguity. Even if the length is unknown with padding you'll know where your data stream ends.
As a counter example, some standards like JOSE don't allow padding characters. In this case, if there is something missing, a cryptographic signature won't work or other non base64 characters will be missing (like the "."). Although assumptions about length aren't made, padding isn't needed because if there is something wrong it simply won't work.
And this is exactly what the base64 RFC says,
In some circumstances, the use of padding ("=") in base-encoded data
is not required or used. In the general case, when assumptions about
the size of transported data cannot be made, padding is required to
yield correct decoded data.
[...]
The padding step in base 64 [...] if improperly
implemented, lead to non-significant alterations of the encoded data.
For example, if the input is only one octet for a base 64 encoding,
then all six bits of the first symbol are used, but only the first
two bits of the next symbol are used. These pad bits MUST be set to
zero by conforming encoders, which is described in the descriptions
on padding below. If this property do not hold, there is no
canonical representation of base-encoded data, and multiple base-
encoded strings can be decoded to the same binary data. If this
property (and others discussed in this document) holds, a canonical
encoding is guaranteed.
Padding allows us to decode base64 encoding with the promise of no lost bits. Without padding there is no longer the explicit acknowledgement of measuring in three byte bundles. Without padding you may not be able to guarantee exact reproduction of original encoding without additional information usually from somewhere else in your stack, like TCP, checksums, or other methods.
Alternatively to bucket conversion schemes like base64 is radix conversion which has no arbitrary bucket sizes and for left-to-right readers is left padded. The "iterative divide by radix" conversion method is typically employed for radix conversions.
Examples
Here is the example form RFC 4648 (https://www.rfc-editor.org/rfc/rfc4648#section-8)
Each character inside the "BASE64" function uses one byte (base256). We then translate that to base64.
BASE64("") = "" (No bytes used. 0 % 3 = 0)
BASE64("f") = "Zg==" (One byte used. 1 % 3 = 1)
BASE64("fo") = "Zm8=" (Two bytes. 2 % 3 = 2)
BASE64("foo") = "Zm9v" (Three bytes. 3 % 3 = 0)
BASE64("foob") = "Zm9vYg==" (Four bytes. 4 % 3 = 1)
BASE64("fooba") = "Zm9vYmE=" (Five bytes. 5 % 3 = 2)
BASE64("foobar") = "Zm9vYmFy" (Six bytes. 6 % 3 = 0)
Here's an encoder that you can play around with: http://www.motobit.com/util/base64-decoder-encoder.asp
There is not much benefit to it in the modern day. So let's look at this as a question of what the original historical purpose may have been.
Base64 encoding makes its first appearance in RFC 1421 dated 1993. This RFC is actually focused on encrypting email, and base64 is described in one small section 4.3.2.4.
This RFC does not explain the purpose of the padding. The closest we have to a mention of the original purpose is this sentence:
A full encoding quantum is always completed at the end of a message.
It does not suggest concatenation (top answer here), nor ease of implementation as an explicit purpose for the padding. However, considering the entire description, it is not unreasonable to assume that this may have been intended to help the decoder read the input in 32-bit units ("quanta"). That is of no benefit today, however in 1993 unsafe C code would have very likely actually taken advantage of this property.
With padding, a base64 string always has a length that is a multiple of 4 (if it doesn't, the string has been corrupted for sure) and thus code can easily process that string in a loop that processes 4 characters at a time (always converting 4 input characters to three or less output bytes). So padding makes sanity checking easy (length % 4 != 0 ==> error as not possible with padding) and it makes processing simpler and more efficient.
I know what people will think: Even without padding, I can process all 4-byte chunks in a loop and then just add special handling for the last 1 to 3 bytes, if those exist. It's just a few lines of extra code and the speed difference will be too tiny to even measure. Probably true but you are thinking in terms of C (or higher languages) and a powerful CPU with plenty of RAM. What if you need to decode base64 in hardware, using a simple DSP, that has very limited processing power, no RAM storage and you have to write the code in very limited micro-assembly? What if you cannot use code at all and everything has to be done with just transistors stacked together (a hardwired hardware implementation)? With padding that's way simpler than without.
Padding fills the output length to a multiple of four bytes in a defined way.