base64 encoding: input character - encoding

I'm trying to understand what the input requirements are for base64 encoding. Nicholas Zakas, who I have tremendous respect for has an article here where he quotes a specification that an error should be thrown if input contains any character with a code higher than 255 Zakas Article on base64
Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters. Since base64 encoding requires eight bits per input character, any character with a code higher than 255 cannot be accurately represented. The specification indicates that an error should be thrown in this case:
if (/([^\u0000-\u00ff])/.test(text)){
throw new Error("Can't base64 encode non-ASCII characters.");
}
He provides a link in another separate part of the article to the RFC 3548 but I don't see any input requirements other than:
Implementations MUST reject the encoding if it contains characters
outside the base alphabet when interpreting base encoded data, unless
the specification referring to this document explicitly states
otherwise.
Not sure what "base alphabet" means but perhaps this is what Zakas is referring to. But by saying they must reject the encoding it seems to imply that this is something that has already been encoded as opposed to the input (of course if the input is invalid it will also show up in the encoding so perhaps the point is moot).
A bit confused on what the standard is.

Fundamentally, it's a mistake to talk about "base64 encoding a string" where "string" is meant in terms of text.
Base64 encoding is applied to binary data (a sequence of bytes, or octets if you want to be even more picky), and the result is text. Every character in the output is printable ASCII text. The whole point of base64 is to provide a safe way of converting arbitrary binary data into a text format which can be reliably embedded in other text, transported etc. ASCII is compatible with almost all character sets, so you're very unlikely to be unable to encode ASCII text as part of something else.
When someone talks about "base64 encoding a string" they're really talking about encoding text as binary using some existing encoding (e.g. UTF-8), then applying a base64 encoding to the result. When decoding, you'd need to decode the base64 back to binary, and then decode that binary data with the original encoding, to get the original text.

For me the (first) linked article has a fundamental problem:
Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters
You don't base64 encode strings. You base64 encode byte sequences. And when you're dealing with any kind of encoding work, it's extremely important to keep in mind this difference.
Also, his check for 'ASCII' actually lets through everything from 80 to ff, which aren't ASCII - ASCII is only 00 to 7f.
Now, if you have a string which you have checked is pure ASCII, you can then safely treat it as a byte sequence of the ASCII values of the characters in it - but this is a separate earlier step, nothing strictly to do with the act of base64 encoding.
(I should say that I do like his repeated urging for the reader to note that base64 encoding is not in any shape or form encryption)

Related

Understanding encoding schemes

I cannot understand some key elements of encoding:
Is ASCII only a character or it also has its encoding scheme algorithm ?
Does other windows code pages such as Latin1 have their own encoding algorithm ?
Are UTF7, 8, 16, 32 the only encoding algorithms ?
Does the UTF alghoritms are used only with the UNICODE set ?
Given the ASCII text: Hello World, if I want to convert it into Latin1 or BIG5, which encoding algorithms are being used in this process ? More specifically, does Latin1/Big5 use their own encoding alghoritm or I have to use a UTF alghoritm ?
1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.
Refer to https://www.ascii.codes/ to see the full set and inspect the characters.
There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.
2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.
See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.
As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.
3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.
Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.
4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.
Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).
I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.
Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.
I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).
If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):
http://unicode.org
You probably won't need anything else.
... except maybe a decent codepoint lookup tool: https://www.unicode.codes/
You can roll your own code based on the unicode documentation, or use the official unicode library:
http://site.icu-project.org/home
Hope this helps.
In general, most encoding schemes like ASCII or Latin-1 are simply big tables mapping characters to specific byte sequences. There may or may not be some specific algorithm how the creators came up with those specific character⟷byte associations, but there's generally not much more to it than that.
One of the innovations of Unicode specifically is the indirection of assigning each character a unique number first and foremost, and worrying about how to encode that number into bytes secondarily. There are a number of encoding schemes for how to do this, from the UCS and GB 18030 encodings to the most commonly used UTF-8/UTF-16 encodings. Some are largely defunct by now like UCS-2. Each one has their pros and cons in terms of space tradeoffs, ease of processing and transportability (e.g. UTF-7 for safe transport over 7-bit system like email). Unless otherwise noted, they can all encode the full set of current Unicode characters.
To convert from one encoding to another, you pretty much need to map bytes from one table to another. Meaning, if you look at the EBCDIC table and the Windows 1250 table, the characters 0xC1 and 0x41 respectively both seem to represent the same character "A", so when converting between the two encodings, you'd map those bytes as equivalent. Yes, that means there needs to be one such mapping between each possible encoding pair.
Since that is obviously rather laborious, modern converters virtually always go through Unicode as a middleman. This way each encoding only needs to be mapped to the Unicode table, and the conversion can be done with encoding A → Unicode code point → encoding B. In the end you just want to identify which characters look the same/mean the same, and change the byte representation accordingly.
A character encoding is a mapping from a sequence of characters to a sequence of bytes (in the past there were also encodings to a sequence of bits - they are falling out of fashion). Usually this mapping is one-to-one but not necessarily onto. This means there may be byte sequences that don't correspond to a character sequence in this encoding.
The domain of the mapping defines which characters can be encoded.
Now to your questions:
ASCII is both, it defines 128 characters (some of them are control codes) and how they are mapped to the byte values 0 to 127.
Each encoding may define its own set of characters and how they are mapped to bytes
no, there are others as well ASCII, ISO-8859-1, ...
Unicode uses a two step mapping: first the characters are mapped to (relatively) small integers called "code points", then these integers are mapped to a byte sequence. The first part is the same for all UTF encodings, the second step differs. Unicode has the ambition to contain all characters. This means, most characters are in the "UNICODE set".
Every character in the world has been assigned a unicode value [ numbered from 0 to ...]. It is actually an unique value. Now, it depends on an individual that how he wants to use that unicode value. He can even use it directly or can use some known encoding schemes like utf8, utf16 etc. Encoding schemes map that unicode value into some specific bit sequence [ can vary from 1 byte to 4 bytes or may be 8 in future if we get to know about all the languages of universe/aliens/multiverse ] so that it can be uniquely identified in the encoding scheme.
For example ASCII is an encoding scheme which only encodes 128 characters out of all characters. It uses one byte for every character which is equivalent to utf8 representation. GSM7 is one other format which uses 7 bit per character to encode 128 characters from unicode character list.
Utf8:
It uses 1 byte for characters whose unicode value is till 127.
Beyond this it has its own way of representing the unicode values.
Uses 2 byte for Cyrillic then 3 bytes for Hindi characters.
Utf16:
It uses 2 byte for characters whose unicode value is till 127.
and it also uses 2 byte for Cyrillic, Hindi characters.
All the utf encoding schemes fixes initial bits in specific pattern [ eg: 110|restbits] and rest bits [eg: initialbits|11001] takes the unicode value to make a unique representation.
Wikipedia on utf8, utf16, unicode will make it clear.
I coded an utf translator which converts incoming utf8 text across all languages into its equivalent utf16 text.

what's the definition of "encoding-agnostic"?

In lua 5.3 reference manual, we can see:
Lua is also encoding-agnostic; it makes no assumptions about the contents of a string.
I can't understand what the sentence says.
The same byte value in a string may represent different characters depending on the character encoding used for that string. For example, the same value \177 may represent ▒ in Code page 437 encoding or ± in Windows 1252 encoding.
Lua makes no assumption as to what the encoding of a given string is and the ambiguity needs to be resolved at the script level; in other words, your script needs to know whether to deal with the byte sequence as Windows 1252, Code page 437, UTF-8, or something else encoded string.
Essentially, a Lua string is a counted sequence of bytes. If you use a Lua string for binary data, the concept of character encodings is not relevant and does not interfere with the binary data. It that way, string is encoding-agnostic.
There are functions in the standard string library that treat string values as text—an uncounted, sequence of characters. There is no text but encoded text. An encoding maps a member of a character set to a sequence of bytes. A string would have the bytes for zero or more such encoded characters. To understand a string as text, you must know the character set and encoding. To use the string functions, the encoding should be compatible with os.setlocale().

Reconstructing Windows-1252 characters from data incorrectly saved as UTF-8

I'm dealing with data that has been sampled using Java HtmlUnit. The webpage used Windows-1252 encoding but the response was retrieved as if the page was encoded as UTF-8 (ie when getContentAsString on the HtmlUnit WebResponse object was invoked, UTF-8 encoding was specified rather than deferring to the encoding specified in the server response). Is there any way to reverse this process to reconstruct the original Windows-1252 data from the incorrectly labelled UTF-8 character data?
Most other questions on this topic are concerned with identifying the type of file or converting from one stream type to another for characters correctly encoded in the first place. That is not the case here. I don't believe utilities such as iconv will work because they expect the streams to have been correctly persisted in their source encoding to begin with.
Probably not. If Windows-1252-encoded text gets mistaken for UTF-8, all non-ASCII codepoints would be damaged, because of the way UTF-8 deals with those codepoints. Only if you are very very lucky, and all non-ASCII codepoints come in pairs or triplets that, by pure chance, convert to real Unicode codepoints, you can reverse the process.
But you're pretty much out of luck.

Efficient way to ASCII encode UTF-8

I'm looking for a simple and efficient way to store UTF-8 strings in ASCII-7. With efficient I mean the following:
all ASCII alphanumeric chars in the input should stay the same ASCII alphanumeric chars in the output
the resulting string should be as short as possible
the operation needs to be reversable without any data loss
the resulting ASCII string should be case insensitive
there should be no restriction on the input length
the whole UTF-8 range should be allowed
My first idea was to use Punycode (IDNA) as it fits the first four requirements, but it fails at the last two.
Can anyone recommend an alternative encoding scheme? Even better if there's some code available to look at.
UTF-7, or, slightly less transparent but more widespread, quoted-printable.
all ASCII chars in the input should stay ASCII chars in the output
(Obviously not fully possible as you need at least one character to act as an escape.)
Since ASCII covers the full range of 7-bit values, an encoding scheme that preserves all ASCII characters, is 7-bits long, and encodes the full Unicode range is not possible.
Edited to add:
I think I understand your requirements now. You are looking for a way to encode UTF-8 strings in a seven-bit code, in which, if that encoded string were interpreted as ASCII text, then the case of the alphabetic characters may be arbitrarily modified, and yet the decoded string will be byte-for-byte identical to the original.
If that's the case, then your best bet would probably be just to encode the binary representation of the original as a string of hexadecimal digits. I know you are looking for a more compact representation, but that's a pretty tall order given the other constraints of the system, unless some custom encoding is devised.
Since the hexadecimal representation can encode any arbitrary binary values, it might be possible to shrink the string by compressing them before taking the hex values.
If you're talking about non-standard schemes - MECE
URL encoding or numeric character references are two possible options.
It depends on the distribution of characters in your strings.
Quoted-printable is good for mostly-ASCII strings because there's no overhead except with '=' and control characters. However, non-ASCII characters take an inefficient 6-12 bytes each, so if you have a lot of those, you'll want to consider UTF-7 or Base64 instead.
Punycode is used for IDNA, but you can use it outside the restrictions imposed by it
Per se, Punycode doesn't fail your last 2 requirements:
>>> import sys
>>> _ = ("\U0010FFFF"*10000).encode("punycode")
>>> all(chr(c).encode("punycode") for c in range(sys.maxunicode))
True
(for idna, python supplies another homonymous encoding)
obviously, if you don't nameprep the input, the encoded string isn't strictly case-insensitive anymore... but if you supply only lowercase (or if you don't care about the decoded case) you should be good to go

What is encoding & decoding in communication?

Can someone please redirect me to some good references about the encoding and decoding in communication and different encoding techniques(unicode, base64, utf7) etc.
Wikipedia is always a good start.
Then there's always Joel Spolsky's article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Note that the three things you name operate on different levels.
Unicode is a character set: a mapping between characters and numbers (code points).
UTF7 maps between code points and bytes.
base64 maps between bytes and bytes. (It mangles bytes so that they are represented by bytes in the ASCII range.)
The definitions of encoding and decoding are somewhat subjective.
Both are forms of transliteration, being the process of converting from one alphabet to another. ASCII to UTF8, ASCII to base64, etc are all examples of this.
What distinguishes the two is that "encoding" is often used when transliterating from a usable format to a transmission or intermediate format of some kind and decoding is the reverse. This is where the "subjective" bit comes in. ASCII to UTF8 can be viewed as encoding or decoding depending on the context.
Other formats like base64 are used almost universally for transmission only (eg binary data in email) and as such converting to them is almost universally called "encoding" and converting from as "decoding".
The important point to take away from all this is that something like ASCII or UTF8 is not magical in any way. All these formats are simply an agreed-upon encoding of information into a binary format. So ASCII 65 is 'A' for no other reason than that's the standard.
Unicode formats get more interesting because they make the distinction between the code point and the encoding. Unicode defines the code points for each character. The binary data is different for each encoding format. For example, see Unicode Character 'EURO-CURRENCY SIGN' (U+20A0) to see all the different binary values for one code point.
Regarding yours unicode, base64, utf7 (no one uses it, it might be utf8). They are not just "encoding & decoding" but encoding & decoding of text data.
Unicode is the way all real and possible characters are enumerated. It has nothing about encoding itself. UTFXX is set of encoding of unicode (converting code to actual bytes). most popular are UTF8 and UTF16. Very basically UTF8 is ASCII compatible (chars with codes < 128 are represented same way as ASCII), but other characters are represented by 2-3 bytes. UTF16 encode most of characters to 2 bytes.
Base64 has nothing about text data. It encodes generic binary data to text that consists of 64 printable ascii characters. It is used to transfer binary data, UTF8 and UTF16 via Email usually.